So you want Data Version Control?


There is one name in the data version control space I'm truly jealous of: DVC, short for "Data Version Control". Data version control is only a small part of what DVC does. DVC is a tool to version code and data in machine learning pipelines.

There are other tools that version control data, though a Google search for the term wouldn't make you think so. We now have the technology to do true version control (i.e. logs, diffs, and merges) on large scale data. This blog attempts to survey the space and give you a better picture of what's out there.

What do you mean by Data Version Control?

What do you mean by data? Do you mean data in files or data in tables? Do you mean unstructured data like images or text from web pages? Do you mean CSV tables or JSON blobs? Do you mean big data like time series log entries? Do you mean relational databases? If relational, do you care about schema or just data (or vice versa)? Do you mean data transformations, like exist in data pipelines? Do you have an application in mind? Data for machine learning (i.e. labeled data)? Data for visualizations and reports? Data for a software application?

What do you mean when you say version control? Which parts of version control do you care about? Do you care about rollback? Diffs? Lineage, i.e. who changed what and when? Branch/merge? Sharing changes with others? Do you mean a content-addressed version of the above with all the good distributed qualities that solution provides?

As you can see, “Data Version Control” quickly gets complicated. Let's walk through the offerings that could be “Data Version Control” and make sense of who is building what.

Who could be “Data Version Control”?

While researching this blog entry, we encountered a number of products with some relationship to “Data Version Control”. We narrowed the list to six products across four general categories. We apologize if we missed yours. Send me an email at tim@dolthub.com and I'll follow up. For each product selected, we'll provide an introduction to what the product does so you can judge for yourself whether it fits your definition of “Data Version Control”.


The products fell into four general categories:

  1. Version Control
  2. Data Pipeline Version Control
  3. Versioned Data Lakes
  4. Version Controlled Databases

Version Control

Git

Tagline: "Fast, scalable, distributed revision control system"
Initial Release: April 3, 2005
GitHub: https://github.com/git/git (mirror)

Git and GitHub contain a lot of data. Committing CSV, JSON, or even SQLite database files to Git for data versioning and distribution is very popular. GitHub advertises over 73M users, making it a very powerful network.

There are a couple of constraints. No file on GitHub can be bigger than 100MB. You can use git-lfs to get around this limitation, but then you lose fine-grained diffs. Diffs on CSV and JSON are of marginal utility in Git anyway: rows must be sorted the same way on every commit for diffs to be useful at all, and conflict resolution happens at the line level, not the cell level. There is no built-in concept of schema.
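To make the line-level diff constraint concrete, here is a minimal sketch using only Git (file names and values are illustrative, and a configured Git identity is assumed):

```shell
# Sketch: version a tiny CSV in Git, then change a single cell.
cd "$(mktemp -d)"
git init -q

printf 'id,name,city\n1,Ada,London\n2,Alan,Manchester\n' > people.csv
git add people.csv
git commit -qm "initial data"

# Change one cell: London -> Cambridge.
printf 'id,name,city\n1,Ada,Cambridge\n2,Alan,Manchester\n' > people.csv

# Git reports a whole-line change (-1,Ada,London / +1,Ada,Cambridge),
# even though only the city cell differs.
git diff people.csv
```

The same full-line granularity applies to merge conflicts, which is why two edits to different cells of the same row cannot be merged automatically.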

Despite these constraints, one could make a very convincing argument that Git and GitHub are “Data Version Control”. However, for data, Git is the wrong tool for the job, like using a hammer to fasten a screw.

Data Pipeline Version Control

Pachyderm

Tagline: "Reproducible Data Science at Scale!"
Initial Release: May 5, 2016
GitHub: https://github.com/pachyderm/pachyderm

Pachyderm is a data pipeline versioning tool. In the Pachyderm model, data is stored as a set of content-addressed files in a repository. Pipeline code is also stored as a content-addressed set of code files. When pipeline code is run on data, Pachyderm models this as a sort of merge commit, allowing for versioning concepts like branching and lineage across your data pipeline.

We find the Pachyderm model extremely intriguing from a versioning perspective. Modeling data plus code as a merge commit is very clever and produces some very useful insights.

This type of versioning is useful in many machine learning applications. Often, you are transforming images or large text files into different files using code, for instance, making every image in a set the same dimensions. You may want to reuse those modified files in many different pipelines and only do work if the set changes. If something goes awry, you want to be able to debug what changed to cause the issue. If you're running a large-scale data pipeline on files, as is common in machine learning, Pachyderm is for you.

DVC (Data Version Control)

Tagline: "Git for Data & Models"
Initial Release: May 4, 2017
GitHub: https://github.com/iterative/dvc

Similar to Pachyderm, DVC versions data pipelines. Unlike Pachyderm, DVC does not have its own execution engine. DVC is a wrapper around Git that allows for large files (like git-lfs) and versions code along with data. It also comes with some friendly pipeline hooks like visualizations and a reproduce command. Most of the documentation has a machine learning focus.
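A minimal sketch of the Git-wrapper workflow, assuming `dvc` is installed and Git is configured (the file name and contents are illustrative):

```shell
# Sketch: track a large data file with DVC alongside Git.
cd "$(mktemp -d)"
git init -q
dvc init

printf 'label,path\ncat,img001.png\n' > training_data.csv

# "dvc add" moves the file's content into DVC's cache and writes a
# small .dvc pointer file; Git versions the pointer, not the payload.
dvc add training_data.csv
git add training_data.csv.dvc .gitignore
git commit -qm "track training data with DVC"

# From here, pipeline stages declared in dvc.yaml can be re-run with
# "dvc repro", which only re-executes stages whose inputs changed.
```

Because only the small pointer file lands in Git history, GitHub's 100MB file limit no longer applies to the data itself.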

DVC seems lighter-weight and more community-driven than Pachyderm, which seems more enterprise-focused. If you are looking for data pipeline versioning without having to adopt an execution engine, check out DVC.

Versioned Data Lakes

LakeFS

Tagline: "Git-like capabilities for your object storage"
Initial Release: Aug 3, 2020
GitHub: https://github.com/treeverse/lakeFS

LakeFS defines a new category: the versioned data lake. "Data lake" is a relatively new term, usually referring to unstructured or semi-structured data stored in large cloud storage systems like S3 and GCS. Data lakes exist in contrast to data warehouses, which are structured and SQL-based.

LakeFS sits in front of your cloud storage and adds data versioning to the data in your lake. You get commits, branches, and rollback. Merge is supported but conflicts are detected at the file level and are "up to the user to resolve". User defined merge strategies are on the roadmap.
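The day-to-day shape of that workflow through the `lakectl` CLI looks roughly like the following sketch (repository, branch, and object names are illustrative, and a running, configured lakeFS server is assumed):

```shell
# Sketch: branch, change, commit, and merge data in a lakeFS repository.
# Assumes lakectl is configured against a running lakeFS server.

# Create an isolated branch from main.
lakectl branch create lakefs://example-repo/experiment \
    --source lakefs://example-repo/main

# Upload a new version of an object to the branch.
lakectl fs upload lakefs://example-repo/experiment/data/events.parquet \
    --source ./events.parquet

# Commit on the branch, then merge it back to main.
lakectl commit lakefs://example-repo/experiment -m "refresh events data"
lakectl merge lakefs://example-repo/experiment lakefs://example-repo/main
```

If both branches changed the same object, the merge reports a conflict at the file level, and resolving it is up to the user.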

A file in this case can be quite large, more like a dataset than a single row of data. Data is shared at the file level between commits, but any change to a file produces a whole new version of it, so storage can grow quite large across versions.

LakeFS is relatively new, launched in 2020, and the sponsoring company Treeverse is well funded. Expect more development from the company. We like what we see from a versioning perspective.

Version Controlled Databases

TerminusDB

Tagline: "Making Data Collaboration Easy"
Initial Release: October 2019
GitHub: https://github.com/terminusdb/terminusdb

TerminusDB has full schema and data versioning capability but offers a graph database interface using a custom query language called the Web Object Query Language (WOQL). WOQL is schema-optional. TerminusDB also has the option to query JSON directly, similar to MongoDB, giving users a more document-database-style interface.

The versioning operations are exposed via the TerminusDB Console or a command line interface. The versioning metaphors are similar to Git's: you branch, push, and pull. See their how-to documentation for more information.

TerminusDB is new but we like what we see. The company is very responsive, has an active Discord, and is well funded. If you think your database version control makes more sense in graph or document form, check them out.

Dolt

Tagline: "It's Git for Data"
Initial Release: August 2019
GitHub: https://github.com/dolthub/dolt

Dolt takes “Data Version Control” rather literally. Dolt implements the Git command line and associated operations on table rows instead of files. Data and schema are modified in the working set using SQL. When you want to permanently store a version of the working set, you make a commit. In SQL, Dolt implements Git read operations (i.e. diff, log) as system tables and write operations (i.e. commit, merge) as functions or stored procedures. Dolt produces cell-wise diffs and merges, making data debugging between versions tractable. That makes Dolt the only SQL database on the market with branch and merge. You can run Dolt offline, treating data and schema like source code, or online, like you would PostgreSQL or MySQL.
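A minimal sketch of that workflow with the `dolt` CLI (table and column names are illustrative; a configured Dolt identity is assumed):

```shell
# Sketch: version table schema and rows with Dolt's Git-style commands.
cd "$(mktemp -d)"
dolt init

# Modify schema and data in the working set using SQL...
dolt sql -q "CREATE TABLE people (id INT PRIMARY KEY, city VARCHAR(100));"
dolt sql -q "INSERT INTO people VALUES (1, 'London');"

# ...then permanently store a version with Git-style commands.
dolt add .
dolt commit -m "add people table"

# Change one cell and inspect the cell-wise diff against HEAD.
dolt sql -q "UPDATE people SET city = 'Cambridge' WHERE id = 1;"
dolt diff

# Git read operations are also exposed in SQL as system tables.
dolt sql -q "SELECT commit_hash, message FROM dolt_log;"
```

The same history is queryable from a running SQL server, so an application can read diffs and logs with ordinary SELECT statements.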

As you can see, "Data Version Control" takes many forms, not just the eponymous DVC. We're a little biased, but we think adding branch and merge to a SQL database is also data version control. If you're interested in discussing the space, come hang out on our Discord.

