So you want to Version Control Configuration?

Here at DoltHub, we've had a lot of success with our "So you want..." series of blog posts helping people find Dolt when they are looking for it. Dolt is a lot of things. Dolt is a version controlled database, a Git database, Git for data, data version control, an immutable database, and a decentralized database.

Dolt can also version control configuration. What other tools can do this? What do you mean by configuration? When does configuration become data? Isn't this just configuration management? This blog answers all these questions and surveys a few helpful products in the space, including Dolt.

Version Control

Version control is a software system that allows multiple people to collaborate efficiently on a set of files. Version control is most often used to manage source code. Key features that allow for efficient collaboration include:

  • Checkpoints: Also called commits. Save the state of your files for comparison with later versions. Annotated with author and other metadata.
  • Log: Show the history of revisions.
  • Diff: Compare past revisions to show a human readable set of changes.
  • Branch: Create a new copy that can evolve independently from other copies.
  • Merge: Create a version from two other versions by performing a (usually) line-based merge.
  • Revert: Quickly roll back unwanted changes.

These features allow hundreds or even thousands of people to work independently on the same files and share their changes to those files efficiently.
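To make the diff feature concrete, here is a minimal sketch using Python's standard `difflib` module. The file name and the configuration values are made up for illustration. It is not how Git computes diffs internally, only a small example of a human-readable, line-based comparison of two revisions.

```python
import difflib

# Two hypothetical revisions of the same configuration file.
old_revision = """max_connections: 100
data_dir: /var/lib/app
dark_mode: false
""".splitlines(keepends=True)

new_revision = """max_connections: 250
data_dir: /var/lib/app
dark_mode: false
""".splitlines(keepends=True)


def diff_revisions(old, new):
    """Return a unified, line-based diff between two revisions."""
    return list(
        difflib.unified_diff(
            old, new, fromfile="config.yaml@v1", tofile="config.yaml@v2"
        )
    )


diff_lines = diff_revisions(old_revision, new_revision)
print("".join(diff_lines), end="")
```

Lines prefixed with `-` were removed and lines prefixed with `+` were added, which is the same convention you see when reviewing a change in Git.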

Additionally, the world's most popular version control system, Git, allows you to version control files in a decentralized fashion. No centralized server is required to collaborate. Git implements the concept of a remote. To interact with a remote you clone, fetch, push, and pull.

As you'll see later, version control is no longer limited to just files. Database tables and the data in them can now be version controlled.

Configuration

Configuration in the context of this blog is a set of "magic" numbers and strings that define how software operates. How does configuration become configuration? What are the stages software goes through where configuration becomes a mess? How does this relate to version control?

Evolution

Let's start at the beginning and show the evolution of configuration in a software project and weave in some version control commentary.

Variables

Configuration starts as a humble variable. You assign a number or string value to a variable to define your program's behavior. How many items are allowed in this list? What is the default value of this string? How many times do I perform this loop? In the beginning these values are defined as variables in your program.

Note that your configuration variables start off version controlled because they are in your source code. If you want to change a configuration variable, the change gets reviewed like any other source code change using version control diff functionality. If you want to figure out what a value was in the past, you can look at an older version of the code. If you and a colleague need different values as you develop, branches manage the two concurrent values. Generally, configuration is version controlled from inception.

Constants

Usually, during the first code review of your software, upon seeing a variable defining a configuration value, a colleague will note something like "Magic number. Make it a constant." Unlike variables, constants cannot be changed as the program runs. They are usually defined at the top of the file. Defining configuration as constants is best practice because of this extra constraint.

Again, these constants are version controlled because they live in your source code. Conveniently, by custom, they probably all live at the top of your file and thus, it's easy to manage that section of your code in version control.
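As a rough sketch of that code review, here is the same check written first with a magic number and then with a named constant. All names and values are hypothetical. Python does not enforce constants at runtime, so the example uses `typing.Final` as a hint for type checkers and a frozen dataclass where runtime protection is wanted.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Final


# Before review: a "magic number" buried in the logic.
def can_add_item_v1(items):
    return len(items) < 50


# After review: the value has a name and lives at the top of the file.
# Final is a hint for type checkers; Python does not enforce it at runtime.
MAX_LIST_ITEMS: Final = 50


def can_add_item(items):
    return len(items) < MAX_LIST_ITEMS


# A frozen dataclass rejects attribute assignment at runtime.
@dataclass(frozen=True)
class Limits:
    max_list_items: int = 50
    max_retries: int = 3


LIMITS = Limits()


def try_to_change_limit():
    """Return True if the frozen constant rejected the assignment."""
    try:
        LIMITS.max_retries = 10
    except FrozenInstanceError:
        return True
    return False
```

Either way, the value now has one definition near the top of the file, so a change to it shows up as a small, easy-to-review diff.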

Constants File

Eventually as your software gets larger, some of these constants must be shared amongst a number of files. At this point, it's common to move constants to their own file and have that file included as a dependency in other files in your project: constants.go, constants.java, etc.

You'll sense a trend here. Your constants file is still managed along with your source code in version control.

Configuration File

Once your software actually has users, some of these constants defining the behavior of your software need to be able to be set by the user. How many concurrent connections are allowed? Where on the filesystem does your program put the files? Dark mode or no? Your configuration file is born.

Moreover, if you are running a service like a website, all the values needed to deploy the service are usually stored as configuration. The software that runs your software needs configuration too.

Common formats for configuration files are YAML, JSON, and XML. These files define key-value pairs at their root, but you can also define a configuration hierarchy by grouping related values under keys. Configuration can get quite complicated, especially for large applications. For instance, for the DoltHub website, we currently have 48,125 lines of YAML configuration defined across hundreds of files.
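Here is a minimal sketch of how a program might read a hierarchical configuration file and overlay it on built-in defaults. It uses JSON because Python's standard library can parse it; the keys, values, and the `merge_config` helper are hypothetical and not taken from any particular application.

```python
import json

# Defaults that ship with the program (the former constants).
DEFAULTS = {
    "server": {"max_connections": 100, "data_dir": "/var/lib/app"},
    "ui": {"dark_mode": False},
}


def merge_config(defaults, overrides):
    """Recursively overlay user-supplied values on top of the defaults."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


# Contents of a hypothetical user-edited config.json.
user_file_text = '{"server": {"max_connections": 250}, "ui": {"dark_mode": true}}'

config = merge_config(DEFAULTS, json.loads(user_file_text))
print(config["server"]["max_connections"])
print(config["server"]["data_dir"])
```

The user only has to specify the values they want to change. Anything they leave out, like `data_dir` here, falls back to the default.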

It is common for configuration files to be stored in version control along with source code. We do it. This works up to a point and until recently, if you wanted to version control configuration, this was your only option.

Data

You start to run into a few problems as your configuration gets really large. First, GitHub, the most common Git hosting platform, does not allow single files above 100MB because Git starts to struggle with files this large. So, if you need more configuration than that, you need to start breaking it up into multiple files. Second, most configuration formats are unordered. The order in which you define configuration does not matter. However, file version control is line ordered. So, you must sort configuration in the same order every time you commit to version control or your version control system thinks the configuration is different when it may not be. Lastly, as configuration gets that large, you probably want to query it to read or change values, not open it up in a file editor.
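The ordering problem is easy to reproduce. In this sketch, two people save the same hypothetical configuration with keys in a different order. The parsed values are equal, but a line-based diff reports changes until both copies are serialized with sorted keys. The helper names are ours, for illustration only.

```python
import difflib
import json

# The same hypothetical configuration, written with keys in different orders.
saved_by_alice = '{"max_connections": 100, "dark_mode": false, "data_dir": "/var/lib/app"}'
saved_by_bob = '{"data_dir": "/var/lib/app", "max_connections": 100, "dark_mode": false}'


def pretty(text):
    """Re-serialize JSON one key per line, preserving the incoming key order."""
    return json.dumps(json.loads(text), indent=2).splitlines()


def canonical(text):
    """Re-serialize JSON one key per line with keys sorted, so order is stable."""
    return json.dumps(json.loads(text), indent=2, sort_keys=True).splitlines()


def changed_lines(old, new):
    """Return only the added or removed lines from a line-based diff."""
    return [
        line
        for line in difflib.unified_diff(old, new, lineterm="")
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


naive = changed_lines(pretty(saved_by_alice), pretty(saved_by_bob))
stable = changed_lines(canonical(saved_by_alice), canonical(saved_by_bob))

print(len(naive) > 0)  # the line-based diff sees changes
print(stable)          # after sorting keys, no changes remain
```

Sorting on every commit works, but it is a convention every contributor and tool has to follow, which is part of why large configuration starts to feel like data.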

At a certain point your configuration ceases to be configuration and becomes data. You want the values of all this stuff stored in a database, not a set of files. However, until recently, you could not version control databases. So, you stuck large files in S3 and lost fine-grained version control or you stuck the values in a MySQL or Postgres database and lost your ability to version control.

We now have version controlled databases for really large sets of configuration you want to query and version control. What a time to be alive! Keep reading to see your options at the end of this article.

When does configuration become data?

"But Tim, my configuration is fine in Git." Consider yourself lucky! Your configuration has not outgrown files in Git. Consider the following use cases, selected from a sample of Dolt's customers, where configuration has outgrown files.

Video Games

At present, treating configuration as data is most prevalent in game development. Modern games have gigabytes of magic numbers and strings that dramatically affect gameplay: drop and spawn rates, character and equipment strength, map configuration, dialog, etc. Game developers and designers are constantly tweaking this configuration, often in conflicting ways. Branches, merges, and diffs are very useful for video game configuration that has grown larger than files can handle.
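To illustrate what merging conflicting tweaks means at the level of individual values rather than lines, here is a small three-way merge over flat dictionaries. This is a sketch for illustration, not Dolt's merge algorithm, and the game settings and designer names are invented.

```python
def three_way_merge(base, ours, theirs):
    """Merge two edited copies of a flat config against their common ancestor.

    Returns (merged, conflicts). A key conflicts when both sides changed it
    to different values.
    """
    merged, conflicts = {}, []
    missing = object()  # sentinel for "key not present"
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b = base.get(key, missing)
        o = ours.get(key, missing)
        t = theirs.get(key, missing)
        if o == t:        # both sides agree (including both deleting the key)
            value = o
        elif o == b:      # only theirs changed
            value = t
        elif t == b:      # only ours changed
            value = o
        else:             # both changed, differently
            conflicts.append(key)
            value = o     # keep ours as a placeholder until resolved
        if value is not missing:
            merged[key] = value
    return merged, conflicts


base = {"goblin_spawn_rate": 0.10, "sword_damage": 12, "potion_drop_rate": 0.05}
designer_a = dict(base, goblin_spawn_rate=0.15)
designer_b = dict(base, sword_damage=14, shield_block=5)
designer_c = dict(base, sword_damage=10)

clean_merge, clean_conflicts = three_way_merge(base, designer_a, designer_b)
_, real_conflicts = three_way_merge(base, designer_b, designer_c)
```

Designers A and B touched different values, so their work merges cleanly. Designers B and C both changed `sword_damage`, so that key is reported as a conflict for a human to resolve.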

Machine Learning

Another domain where what traditionally could be called configuration has become data is machine learning. Labels, training data, and test data are all building blocks of modern machine learning models. First, version controlling this data makes models reproducible. Additionally, some machine learning practitioners have started using traditional branch, merge, and diff for faster model development and explainability.

Domain Specific Problems

Beyond video games and machine learning, certain domains produce configuration at scale. Here at DoltHub, we've seen power grid designs, cell biology, insurance plan rules, and materials for online learning all cross the threshold from configuration to data.

Isn't this just Configuration Management?

No. Configuration management is something different. Configuration management is usually about deploying software in a repeatable way. Let's look at a few open source configuration management tools to see the difference.

Configuration Management Products

Ansible

Tagline: Automation for everyone
Initial Release: February 2012
GitHub: https://github.com/ansible/ansible

Ansible can be used to execute the same command for a list of servers from the command line. You can also use it to automate tasks using "playbooks" written into a YAML file. It's extremely simple and can be used by technical and non-technical folks alike. You can and probably should version control your Ansible artifacts but Ansible is not for configuration version control.

Chef

Tagline: A powerful automation platform
Initial Release: January 2009
GitHub: https://github.com/chef/chef

Chef uses recipes written in Ruby to run jobs on your infrastructure. Recipes describe resources that should be in a particular state. Chef can run in client/server mode or in a standalone configuration named chef-solo. Again, you can and probably should version control your Chef recipes but Chef is not for configuration version control.

Puppet

Tagline: Infrastructure Automation at Enterprise Scale
Initial Release: 2005
GitHub: https://github.com/puppetlabs/puppet

Puppet has been around a long time. Puppet usually works in a client-server architecture in which an agent on each node communicates with the server to fetch configuration instructions. Puppet uses a custom declarative language or Ruby to describe the system configuration. It is organized in modules, and manifest files declare the desired state of each system. Puppet uses the pull model by default, with agents periodically checking in with the server, and push-style runs can be configured. Again, you can and probably should version control your Puppet modules and manifests but Puppet is not for configuration version control.

Configuration Version Control Products

That said, there are products available to version control configuration.

Git

Tagline: Fast, scalable, distributed revision control system
Initial Release: April 3, 2005
GitHub: https://github.com/git/git (mirror)

Git is the worldwide standard for file version control. It's likely most people reading this article are already versioning configuration files in Git. As mentioned earlier, as your configuration files get large or numerous, Git starts to show some limitations. You can only put files smaller than 100MB on GitHub. Your configuration needs to be sorted the same way every time or you get unwanted diffs. It can be difficult to find the configuration value or set of configuration values you care about. If you aren't seeing those limitations, just continue to use Git.

If you are seeing those limitations, read on. You need a version controlled database.

TerminusDB

Tagline: Making Data Collaboration Easy
Initial Release: October 2019
GitHub: https://github.com/terminusdb/terminusdb

TerminusDB has full data versioning capability but offers a graph database interface using a custom query language called Web Object Query Language (WOQL). WOQL is schema optional. TerminusDB also has the option to query JSON directly, similar to MongoDB, giving users a more document database style interface.

TerminusDB may be a natural fit for configuration version control if your configuration is already structured as JSON files.

Dolt

Tagline: Git for Data
Initial Release: August 2019
GitHub: https://github.com/dolthub/dolt

Dolt is like Git and MySQL had a baby. Dolt implements Git-style version control on database tables instead of files.

Dolt is a natural fit for configuration version control. We've made the case (humorously) before. Transform your configuration files into relational tables. Get all the power of SQL to query your large configuration sets and keep the Git version control you know and love.

Dolt supports a JSON column type that is often used to store the unstructured parts of configuration. Dolt scales to hundreds of gigabytes so you have plenty of room to grow.
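To give a feel for what configuration looks like as relational tables, here is a small sketch. It uses Python's built-in `sqlite3` module purely as a stand-in so the example runs anywhere; it is not Dolt and shows none of the version control features, only the "query your configuration with SQL" part. The table layout, column names, and values are assumptions for illustration, and the unstructured remainder is stored as JSON text in a plain column.

```python
import json
import sqlite3

# SQLite stands in for a SQL database here; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE config (
        service TEXT NOT NULL,
        name    TEXT NOT NULL,
        value   TEXT NOT NULL,
        extras  TEXT,
        PRIMARY KEY (service, name)
    )
    """
)

rows = [
    ("web", "max_connections", "250", None),
    ("web", "dark_mode", "false", None),
    # Unstructured settings ride along as JSON text in the extras column.
    ("worker", "max_connections", "50", json.dumps({"queues": ["email", "billing"]})),
]
conn.executemany("INSERT INTO config VALUES (?, ?, ?, ?)", rows)

# Query one setting across every service instead of searching through files.
max_connections = conn.execute(
    "SELECT service, value FROM config WHERE name = ? ORDER BY service",
    ("max_connections",),
).fetchall()

# Change a value with an UPDATE rather than a text editor.
conn.execute(
    "UPDATE config SET value = ? WHERE service = ? AND name = ?",
    ("500", "web", "max_connections"),
)
new_value = conn.execute(
    "SELECT value FROM config WHERE service = 'web' AND name = 'max_connections'"
).fetchone()[0]

# Read the JSON remainder back into a Python structure.
worker_extras = json.loads(
    conn.execute(
        "SELECT extras FROM config WHERE service = 'worker' AND name = 'max_connections'"
    ).fetchone()[0]
)
```

Because each setting is a row with a primary key, row order no longer matters and a change to one value is a change to one row, which is the property a version controlled database builds its diffs and merges on.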

We're a little biased but we think there is one choice for versioning large sets of configuration and that's Dolt. Curious to learn more? Come chat with us on our Discord.
