So You Want Git for Data?

March 6, 2020

People have been asking for a Git and GitHub for data for a while. One Stack Exchange thread on the topic is almost seven years old and is the number three Google search result for "git for data" (for me).

What is “Git for data” in practice? Many products have come to market with some relation to the “Git for data” theme. Dolt and DoltHub are our answer. This post unpacks what the various products in the space offer and what each of them takes “Git for data” to mean.

What do you mean by “Git for data”?

In 2020, Git and GitHub cover many aspects of the software development and data engineering life cycle. When you say “Git for data”, what do you mean?

What do you mean by Git?

Do you mean data versioning? If so, which parts of version control do you care about? Do you care about rollback? Diffs? Lineage, i.e. who changed what and when? Branch/merge? Sharing changes with others? Do you mean a content-addressed version of the above with all the good distributed qualities that solution provides? Do you care about some of the more esoteric version control features of Git like a staging area or multiple remotes?

Or are you thinking more of GitHub? Do you want an online data catalog? If so, is what you really want a thriving open data community akin to the open source community? Do you want to be able to collaborate remotely and asynchronously on private data projects? Do you want pull requests, i.e. integrated human review of data changes? Do you want to be able to create issues referring to certain changes or parts of the data?

What do you mean by Data?

Do you mean data in files or data in tables? Do you mean unstructured data like images or text from web pages? Do you mean CSV tables or JSON blobs? Do you mean big data like time series log entries? Do you mean relational databases? If relational, do you care about schema or just data (or vice versa)? Do you mean data transformations, like those in data pipelines? Do you have an application in mind? Data for machine learning (i.e. labeled data)? Data for visualizations and reports? Data for a software application?

As you can see, “Git for data” quickly gets complicated. Let's work through the offerings that could be “Git for data” and make sense of who is building what.

Who could be “Git for data”?

While researching this post, we encountered a number of products with some relationship to “Git for data”. We narrowed the list to nine. We apologize if we missed you. Send me an email at tim@dolthub.com and I'll follow up. For each product selected, we'll introduce what the product does so you can judge for yourself whether it fits your definition of “Git for data”.

Logos of some products that might be "Git for data"

The products fell into three general categories:

Data catalogs
Data pipeline versioning
Versioned databases

Data catalogs

Kaggle

Tagline: "The Home of Data Science"
Initial Release: April 2010
GitHub: https://github.com/Kaggle (core tool not open source)

Kaggle started by hosting machine learning competitions. The contest runner posts an open dataset, sets the terms of the contest, and receives model submissions. The winner of the contest receives a cash prize.

Kaggle was purchased by Google in 2017 and continues to operate as a standalone entity. It has evolved into a social network of sorts for data scientists, continuing to run contests, but also hosting public datasets and modeling code in the form of notebooks. The interface is beautiful. There is a thriving, vibrant community. Kaggle boasts 19,000 public datasets and 200,000 public notebooks. Kaggle is the closest thing to "a thriving open data community akin to the open source community".

The datasets are distributed as CSV or JSON files. Datasets are versioned only in the sense that older versions are still available on Kaggle. For tooling beyond data and model discovery, you are on your own, in a good way.

data.world

Tagline: "The Cloud-Native Data Catalog"
Initial Release: March 6, 2018
GitHub: https://github.com/datadotworld (core tool not open source)

data.world is an online tool for building data projects. Data hosting is part of each project, and some of the data hosted on data.world is public. The focus, though, is at a higher level than the data itself. On data.world, you create a data project with data, documentation, queries, insights, etc. You can collaborate on those projects and share them with other people.

The interface is slick and modern. The public data catalog is a pretty robust source of government data. I created a test project and linked a couple of datasets to it. After that, I was kind of lost as to what to do next. My instinct was to download the data, but data.world seemed to want me to do more on the platform. The company is pretty new and well-funded, so I expect the tool to continue to evolve for the next few years.

Data is distributed in multiple formats. I could not locate a versioned dataset but there is an activity link on the dataset page. It looks like data versioning is not really a priority for data.world.

Quilt

Tagline: "Manage data like code"
Initial Release: September 2019 (V3)
GitHub: https://github.com/quiltdata/quilt

I was first introduced to Quilt in the V2 era, when I would have described it as Yum, Homebrew, or NPM for data: you could pull data packages locally from a central repository. The data was immutable and versioned, in that multiple versions of each package existed in the central repository, but there was no diff or branch/merge.

Last year, Quilt released an open data portal for AWS S3 and described the reasoning in this Hacker News post. This launch seemed to indicate a shift from data versioning to data discovery on top of S3. It looks as if they now offer the data packaging technology as an optional abstraction layer for users who want it.

I haven't used the new version of Quilt beyond searching for "Sehn" to see if I was anywhere in S3. Quilt says no. Check Quilt out if you are looking to add rigor and discovery to data you store in S3.

qri

Tagline: "You're invited to a data party!"
Initial Release: February 1, 2018
GitHub: https://github.com/qri-io/qri

Qri is a data catalog built on the distributed web. There is a command line tool as well as a desktop application to get access to and publish datasets. qri.cloud was recently released so you can browse datasets on the web. Qri structures data into a few components including the schema and data, but also metadata like a README. The transform component is particularly cool, allowing code that operates on the data to be versioned with the data itself. These all travel in one container that can be cloned and diffed.

Qri really seems to be chasing the “Git and GitHub for data” analogy pretty closely. I struggled with whether to put Qri in the data catalog or versioned database section of this blog. I settled on the data catalog section because fundamentally, Qri is not a database. It is a wrapper around a file-based data format. There is no query language and I don't think it handles multiple tables in the same dataset right now.

If you are a distributed web fan and have small datasets to share and version, check out Qri. The company is fairly new so expect more from them in the future.

Git

Tagline: "Fast, scalable, distributed revision control system"
Initial Release: April 3, 2005
GitHub: https://github.com/git/git (mirror)

Git and GitHub contain a lot of data. Committing CSV or JSON files, or even SQLite database files, to Git for data versioning and distribution is very popular. GitHub advertises over 30M users, making it a very powerful network.

There are a few constraints. No file on GitHub can be bigger than 100MB. You can use git-lfs to get around this limitation but then you lose fine-grained diffs. Diffs on CSV and JSON are of marginal utility in Git anyway. Git compares files line by line, so the data needs to be sorted the same way on every commit for the diff to be of any use at all. Conflict resolution also happens at the line level, not the cell level. There is no built-in concept of schema.
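To make the sorting point concrete, here is a minimal sketch using Python's standard difflib module as a stand-in for a line-based diff. It is not Git's diff implementation, and the tiny table and the helper names (changed_lines, sort_rows) are made up for illustration.

```python
import difflib


def changed_lines(old_csv: str, new_csv: str) -> list[str]:
    """Return the added and removed lines a line-based diff would report."""
    diff = list(
        difflib.unified_diff(
            old_csv.splitlines(), new_csv.splitlines(), lineterm="", n=0
        )
    )
    # diff[0:2] are the "---"/"+++" file headers; "@@" lines mark hunks.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


def sort_rows(csv_text: str) -> str:
    """Keep the header line first and sort the remaining rows as plain text."""
    header, *rows = csv_text.splitlines()
    return "\n".join([header, *sorted(rows)])


# Hypothetical data: one cell (Grace's city) changes between versions.
original = "id,name,city\n1,Ada,London\n2,Grace,New York\n3,Linus,Helsinki"
edited = "id,name,city\n1,Ada,London\n2,Grace,Arlington\n3,Linus,Helsinki"
# Same rows as `edited`, but exported in a different order.
reordered = "id,name,city\n3,Linus,Helsinki\n2,Grace,Arlington\n1,Ada,London"

# One edited cell shows up as one removed line and one added line.
print(changed_lines(original, edited))
# The same edit is buried among lines that only moved.
print(changed_lines(original, reordered))
# Sorting before diffing recovers the small diff.
print(changed_lines(original, sort_rows(reordered)))
```

Even in the best case the diff reports a whole changed line, so you still have to compare the two lines yourself to find which cell changed.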

Despite these constraints, one could make a very convincing argument that Git and GitHub are “Git and GitHub for data”. However, Git is not the right tool for the job. Using it for data is like using a hammer to drive a screw.

Data pipeline versioning

Pachyderm

Tagline: "Reproducible Data Science at Scale!"
Initial Release: May 5, 2016
GitHub: https://github.com/pachyderm/pachyderm

Pachyderm is a data pipeline versioning tool. In the Pachyderm model, data is stored as a set of content-addressed files in a repository. Pipeline code is also stored as a content-addressed set of code files. When pipeline code is run on data, Pachyderm models this as a sort of merge commit, allowing for versioning concepts like branching and lineage across your data pipeline.
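A rough way to picture this model is the toy sketch below. It is not Pachyderm's code or API. ContentStore, run_pipeline, and resize_all are hypothetical names, and SHA-256 is simply an assumed choice of hash. The point is that data and code are each addressed by a hash of their content, and a pipeline run records both addresses as parents of its output.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: a blob's address is the hash of its bytes."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        address = hashlib.sha256(data).hexdigest()
        self._blobs[address] = data
        return address

    def get(self, address: str) -> bytes:
        return self._blobs[address]


def run_pipeline(store, data_address, code_address, transform):
    """Apply `transform` to the stored data and record both inputs as parents."""
    output_address = store.put(transform(store.get(data_address)))
    return {"parents": (data_address, code_address), "output": output_address}


def resize_all(listing: bytes) -> bytes:
    # Stand-in for real image work: rewrite every recorded size to 256x256.
    lines = listing.decode().splitlines()
    return "".join(f"{line.split(':')[0]}:256x256\n" for line in lines).encode()


store = ContentStore()
data_v1 = store.put(b"cat.png:640x480\ndog.png:800x600\n")
code_v1 = store.put(b"resize every image to 256x256")  # pretend source code

first = run_pipeline(store, data_v1, code_v1, resize_all)
again = run_pipeline(store, data_v1, code_v1, resize_all)

data_v2 = store.put(b"cat.png:640x480\ndog.png:800x600\nowl.png:1024x768\n")
changed = run_pipeline(store, data_v2, code_v1, resize_all)

print(first == again)                        # same parents, same output address
print(first["output"] == changed["output"])  # new data, new output address
```

Because the parents fully identify a run in this sketch, a scheduler could skip work whenever it has already seen the same pair of addresses, and the parents give you lineage when you need to debug an output.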

We find the Pachyderm model extremely intriguing from a versioning perspective. Modeling data plus code as a merge commit is very clever and produces some very useful insights.

This type of versioning is useful in many machine learning applications. Often, you are transforming images or large text files into different files using code, for instance, making every image in a set the same dimensions. You may want to reuse those modified files in many different pipelines and only do work if the set changes. If something goes awry, you want to be able to debug what changed to cause the issue. If you're running a large-scale data pipeline on files, as is common in machine learning, Pachyderm is for you.

DVC (Data Version Control)

Tagline: "Git for Data & Models"
Initial Release: May 4, 2017
GitHub: https://github.com/iterative/dvc

Similar to Pachyderm, DVC versions data pipelines. Unlike Pachyderm, DVC does not have its own execution engine. DVC is a wrapper around Git that allows for large files (like git-lfs) and versioning code along with data. It also comes with some friendly pipeline hooks like visualizations and reproduce commands. Most of the documentation has a machine learning focus.

DVC seems lighter-weight and more community-driven than Pachyderm, which seems more enterprise-focused. If you are looking for data pipeline versioning without having to adopt an execution engine, check out DVC.

Versioned databases

Noms

Tagline: "The versioned, forkable, syncable database."
Initial Release: January 5, 2017
GitHub: https://github.com/attic-labs/noms

Noms is an open source database produced by Attic Labs. Attic Labs was sold to Salesforce in January 2018. The open source project has not seen many updates since. Aaron Boodman, the founder and CEO, recently announced a new project called Replicache that may or may not be Noms-based. Dolt is based on Noms and includes modified Noms code, so we have a lot of experience with it.

Noms implements a Merkle DAG that allows the database to be truly distributed, just like Git. The core storage engine is there. We tested and modified it extensively. The surrounding utilities, like a query language and a command-line interface, exist but are a little limited. If you want to use Noms, you are probably going to have to build some functionality on top. But, if you're looking for core technology to fork and sync data in a distributed way, check it out.
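For readers new to the idea, here is a minimal sketch of why hashing data into a Merkle structure makes syncing cheap. It builds a plain binary Merkle tree, the simplest kind of Merkle DAG, and is not Noms's actual data structure or chunking scheme. The function names and sample rows are made up for illustration.

```python
import hashlib


def leaf_hash(data: bytes) -> str:
    return hashlib.sha256(b"leaf:" + data).hexdigest()


def node_hash(child_hashes: list[str]) -> str:
    # A node's address depends only on the addresses of its children.
    return hashlib.sha256(("node:" + ",".join(child_hashes)).encode()).hexdigest()


def build(chunks: list[bytes]) -> tuple[str, set[str]]:
    """Return the root hash and the set of every hash in the tree."""
    if not chunks:
        raise ValueError("need at least one chunk")
    level = [leaf_hash(chunk) for chunk in chunks]
    seen = set(level)
    while len(level) > 1:
        # Pair up neighbours; a leftover odd node gets a single-child parent.
        level = [node_hash(level[i:i + 2]) for i in range(0, len(level), 2)]
        seen.update(level)
    return level[0], seen


v1 = [b"row-1", b"row-2", b"row-3", b"row-4"]
v2 = [b"row-1", b"row-2", b"row-3", b"row-4-edited"]

root1, hashes1 = build(v1)
root2, hashes2 = build(v2)

print(root1 != root2)          # any change anywhere changes the root
print(len(hashes2))            # 7 nodes in total
print(len(hashes2 - hashes1))  # only 3 of them are new to a peer holding v1
```

A peer that already has the first version only needs the edited leaf, its parent, and the new root. Everything else is shared by address, which is what lets forks and syncs move only what changed.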

Dolt

Tagline: "It's Git for data"
Initial Release: August 6, 2019
GitHub: https://github.com/dolthub/dolt

Dolt takes “Git for data” rather literally. Dolt implements the Git command line and associated operations on table rows instead of files. Data and schema are modified in the working set using SQL. When you want to permanently store a version of the working set, you make a commit. Dolt produces cell-wise diffs and merges, making data debugging between versions tractable. Effectively, the result is Git versioning on a SQL database. That makes Dolt the only SQL database on the market that has branches and merges. You can run Dolt offline, treating data and schema like source code. Or you can run Dolt online, like you would PostgreSQL or MySQL.
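To show what a cell-wise diff buys you over a line-wise one, here is a small sketch. It is not Dolt's implementation or output format. It assumes each table version is a dictionary keyed by primary key, and cell_diff and the sample rows are hypothetical.

```python
def cell_diff(old: dict, new: dict) -> list[tuple]:
    """Compare two table versions keyed by primary key.

    Returns (kind, key, column, old_value, new_value) tuples.
    """
    changes = []
    for key in sorted(old.keys() | new.keys()):
        before, after = old.get(key), new.get(key)
        if before is None:
            changes.append(("row added", key, None, None, after))
        elif after is None:
            changes.append(("row removed", key, None, before, None))
        else:
            for column in sorted(before.keys() | after.keys()):
                if before.get(column) != after.get(column):
                    changes.append(
                        ("cell modified", key, column,
                         before.get(column), after.get(column))
                    )
    return changes


old_version = {
    1: {"name": "Ada", "city": "London"},
    2: {"name": "Grace", "city": "New York"},
    3: {"name": "Linus", "city": "Helsinki"},
}
new_version = {
    4: {"name": "Margaret", "city": "Boston"},   # row added
    2: {"name": "Grace", "city": "Arlington"},   # one cell changed
    1: {"name": "Ada", "city": "London"},        # unchanged
}

for change in cell_diff(old_version, new_version):
    print(change)
```

Because rows are matched by primary key, the order in which they are stored or exported does not affect the result, and a change is pinned to a single column of a single row.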

DoltHub is GitHub for Dolt. You set up DoltHub as your remote and you push Dolt repositories to it. There's a convenient data discovery interface as well as pull requests for data review. Repositories are permissioned so you don't get random people writing to your repository. DoltHub hosts public data for free and private data for a nominal fee.

We are biased, but we think that if you want Git for data, only one product fits that label, and that's Dolt.
