The History of Data Exchange

IBM and General Electric invented the first databases in the early 1960s. By the early 1970s, enough data had accumulated that the need to transfer it between databases emerged. Enter the Comma Separated Values (CSV) file format, supported by the IBM Fortran compiler in 1972. Dump the contents of a table to a CSV, import it into another database. Sound familiar? That's because it's still the most common method of data distribution today.

Google Open Images, ImageNet, and the US Department of Agriculture's nutritional information database are all examples of data distributed in this format. We have large corporations, academia, and government all distributing data on the internet as CSVs. Almost 50 years after its invention, CSV remains the standard for data exchange.

The next innovation in data exchange happened in the early days of the internet. On the internet, we were not exchanging whole tables of information. We had lightweight connected applications that needed access to a single record, or a handful of records, to render to users. We needed a data format that could be transmitted via Application Programming Interfaces (APIs), the data exchange layer of the internet.

The ideal data exchange format for APIs would represent small collections of information, potentially with hierarchical structure. For example, we needed to be able to ship an object with a variable-length list of tags attached to it. We did not want to make two API calls, one for the base object and one for the list of tags, as it would be structured in a relational database.

Enter Extensible Markup Language (XML) in 1998 and, soon after, JavaScript Object Notation (JSON) in 2001. XML was the first data exchange format on the internet, inspired by its cousin, Hypertext Markup Language (HTML). XML quickly fell out of favor, mostly because of verbosity: the tags were often larger than the data payload. JSON was far less verbose, representing data as nested objects of key-value pairs and arrays. JSON became the dominant way for APIs to communicate and JavaScript became the dominant language for building web applications.
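To make the verbosity comparison concrete, here is a minimal sketch using a hypothetical tagged record (the field names are invented for illustration) and Python's standard json and xml.etree modules:

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record: an article with a variable-length list of tags,
# shipped in one API response instead of two relational queries.
record = {"id": 42, "title": "Data Exchange", "tags": ["csv", "json", "xml"]}

# JSON: keys, values, and minimal framing.
as_json = json.dumps(record)

# XML: every field wrapped in opening and closing tags.
root = ET.Element("article", id="42")
ET.SubElement(root, "title").text = "Data Exchange"
tags = ET.SubElement(root, "tags")
for t in record["tags"]:
    ET.SubElement(tags, "tag").text = t
as_xml = ET.tostring(root, encoding="unicode")

print(len(as_json), len(as_xml))  # the XML markup outweighs the payload
```

The nested tags array is exactly the hierarchical shape a flat CSV row cannot carry, which is why APIs reached for these formats in the first place.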

There are now thousands of public and private APIs to facilitate all manner of data exchange on the internet. APIs are the middleware of the internet.

CSVs and JSON over APIs: that is the world of data exchange we live in today. However, we're in the midst of a generational shift in the way software is written, Software 2.0 as Andrej Karpathy calls it:

It turns out that a large portion of real-world problems have the property that it is significantly easier to collect the data (or more generally, identify a desirable behavior) than to explicitly write the program. In these cases, the programmers will split into two teams: software 2.0 programmers manually curate, maintain, massage, clean and label datasets; each labeled example literally programs the final system because the dataset gets compiled into Software 2.0 code via the optimization. Meanwhile, the 1.0 programmers maintain the surrounding tools, analytics, visualizations, labeling interfaces, infrastructure, and the training code.

Andrej goes on to suggest:

Is there space for a Software 2.0 Github? In this case repositories are datasets and commits are made up of additions and edits of the labels.

Andrej is suggesting a Git-like medium of data exchange.

Why did Git become the dominant mode of source code exchange? What properties of Git make it desirable as an exchange format? As we did with data exchange formats, it's useful to understand the history of version control in software in order to understand the present and potentially design the future.

Version control for code was invented at Bell Labs by Marc Rochkind in 1972. The primary features of version control, diff, branch, and merge, allow for efficient collaboration between multiple authors of code. Before version control, a human was tasked with integrating everyone's work into a single compilable codebase. This job became a nightmare as the number of people changing code grew. Version control made it possible for hundreds and then thousands of people to collaborate on the same code.

My research between 1972 and 1986 is a little hazy. I assume software companies were using proprietary version control systems until Concurrent Versions System (CVS) was released in 1986. CVS was still in use when I entered software in the late 1990s, but most big companies used Perforce, released in 1995, for scalability. CVS was primarily used for open source projects where a free, open format for sharing code was needed. Subversion, released in 2000, and Mercurial, released in 2005, had brief runs as challengers to CVS and Perforce dominance. But the tide started to shift when Linus Torvalds moved his eponymous Linux open source project to his newly minted version control system, Git, in 2005.

Git had the benefit of being a truly distributed format for source code collaboration. All versions of the code were stored locally and you only needed to sync the changes with a remote server. This made diff, branch, and merge operations orders of magnitude faster than in other version control systems. The clever Merkle DAG structure allowed branches to be lightweight pointers to commits instead of the heavyweight objects of other version control systems. This means Git scales to a virtually unlimited number of branches, making it practical for thousands of people to collaborate on the same code.
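A toy sketch of the Merkle DAG idea, with invented helper names (real Git hashes cover structured commit, tree, and blob objects, not plain strings):

```python
import hashlib

def sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

# A commit's hash covers its content and its parent's hash, so a single
# hash pins down the entire history behind it -- a Merkle DAG.
def commit(content: str, parent: str) -> str:
    return sha1_hex(f"{content}|{parent}".encode())

c1 = commit("initial import", "")  # root commit, no parent
c2 = commit("fix typo", c1)        # child commit folds in c1's hash

# A branch is just a named pointer to a commit hash. Creating one
# copies ~40 bytes, which is why Git branches are effectively free.
branches = {"master": c2}
branches["experiment"] = branches["master"]
```

Because a branch is only a pointer, comparing or merging branches means walking hashes, not copying files, which is where the speed comes from.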

With the release of GitHub in 2008, collaboration on source code became internet-scale resulting in an explosion of open source software. This internet-scale collaboration happened because software engineers agreed on an open format for code collaboration, Git, and the world is better for it.

Not surprisingly, in the CSV-powered data exchange world we see today, there is almost no collaboration on data. In data exchange, we are in the pre-version-control era of software development. The open data community is small and the consumption pattern is publish, then consume. There is very little actual collaboration where multiple editors are changing the same data.

Using data exchange and source code exchange histories as inspiration, we created Dolt. Dolt wraps Git-style versioning around a SQL database. The new software world where data looks like code requires a new set of tools that treat data like code. We need to branch, merge, and diff data just like source code. The models we build with the data need to be reproducible from source. We need multiple, distributed editors of the same data, making the data better on different branches and merging when they want to collaborate.

Dolt is orders of magnitude more efficient as a format for distributing and using data than CSVs or APIs. You can see who changed the data and when it last changed, make a copy, and still get updates from master via a merge. A SQL query interface is available for exploration the moment you get a copy of the data. Git > CVS. Dolt > CSV. Symmetry.
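A sketch of what that workflow looks like at the command line, assuming the dolt CLI is installed (the repository and table names here are hypothetical):

```
# Clone a dataset the way you would clone a repo (name is hypothetical)
dolt clone dolthub/example-dataset
cd example-dataset

# Query it immediately with SQL
dolt sql -q "SELECT * FROM records LIMIT 5"

# Improve the data on a branch, then see exactly what changed
dolt checkout -b fix-labels
dolt sql -q "UPDATE records SET label = 'cat' WHERE id = 7"
dolt diff

# Commit the change, just like code
dolt add records
dolt commit -m "Correct label for record 7"
```

The shape of the session is deliberately familiar: clone, branch, diff, commit, with SQL standing in for the text editor.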

We need to collaborate on open data to make it better, just like with open source software. We need to collaborate at internet-scale. That is why we created DoltHub, a place on the internet to store, host, and collaborate on open data for free.

CSVs are the past and present of data exchange but Dolt and DoltHub are the future.