Tags and Data Releases in Dolt

Dolt is a SQL database with Git-like functionality. It allows you to branch, merge, diff and clone data sets, by combining the data structures and algorithms of a relational database with a distributed version control system. DoltHub is a place on the internet to share data sets created with Dolt. Downloading a dataset with data and schema versioned together is as easy as dolt clone dolthub/open-images.

What is a release?

We built Dolt because we believe the way that data is shared is broken. Data is distributed without structure or provenance, leaving consumers hoping they won't bring down their systems by relying on it. Distributed version control systems (DVCS), and specifically Git, have revolutionized the way we collaborate on software within and across organizations. We want to create that same revolution for collaborating on data.

We've spent some time thinking about the meaning and importance of data releases. A software release represents a known version of a product. It's an incremental step from the previous release that has been tested and can be depended upon. We think data releases should embody the same spirit: a collection of data with a specific schema and a known set of data points. Dolt's content-addressable storage format versions data and metadata together, so users with separate copies of a repository can have confidence they're working with exactly the same data. The precision of a data release is arguably more important than that of a software release. Many executables can produce the same output, but any change to the set of points in a dataset can have a cascading effect on subsequent analysis. As the saying goes: "garbage in, garbage out."
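
To make the idea of content addressing concrete, here is a minimal Python sketch. It is not Dolt's actual storage format; the JSON serialization, the SHA-256 hash, and the function name content_address are all assumptions chosen for illustration. The point is only that an address derived from schema and rows together is identical for identical data and changes when any value changes.

```python
import hashlib
import json


def content_address(schema, rows):
    """Return a hex digest that depends only on the schema and the row contents.

    Illustrative only: Dolt's real storage format is different.
    """
    # Sort the rows so that insertion order does not affect the address.
    canonical = json.dumps(
        {"schema": schema, "rows": sorted(rows)},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


schema = [["LabelName", "String"], ["ShortDescription", "String"]]
copy_a = [["/m/0119x1zy", "Bun"], ["/m/011_f4", "String instrument"]]
# Same rows as copy_a, stored in a different order.
copy_b = [["/m/011_f4", "String instrument"], ["/m/0119x1zy", "Bun"]]
# One description edited.
changed = [["/m/0119x1zy", "Bun (Food)"], ["/m/011_f4", "String instrument"]]

same = content_address(schema, copy_a) == content_address(schema, copy_b)
differs = content_address(schema, copy_a) != content_address(schema, changed)
print(same, differs)
```

Two people who compute the same address can be confident they hold the same schema and the same rows, which is the property a data release relies on.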

Data Releases

As an example, let's take a look at Google's Open Images data set. Open Images was first released in 2016 as a standard machine learning research data set. Like other standard data sets, it's important not only as a resource for developing new ML models, but also as a benchmark for comparing different methods. To compare algorithms reliably, it's important to use precisely the same data set. This can be an error-prone process, given that the current distribution page has the data set spread out over dozens of files.

Using the Dolt version of the data is... a little simpler.

% dolt clone dolthub/open-images
7,350,758 of 7,350,758 chunks complete. 0 chunks being downloaded currently.

% dolt sql -q "select count(*) from images as of 'v5'"
+----------+
| COUNT(*) |
+----------+
| 9178275  |
+----------+

In order to create data releases, we followed the Git model of tagging commits. To create a tag, simply run dolt tag v5 head -m "Open Images version 5". The tag command creates an immutable reference to a commit: it's like a branch that can't be moved. Tags can be used for any purpose, but as with Git, they're mostly reserved for marking releases. You can also list the existing tags in a repo along with their messages:

% dolt tag -v
v5	qlvnt89q84a9ktne5iidut4c7d7pes8h
Tagger: Andy Arthur <andy@liquidata.co>
Date:   Mon Sep 14 17:24:34 +0000 2020

	Open Images version 5
...

v1	sqielk4k5j5vspbvbr4qkedpqifrmhj4
Tagger: Andy Arthur <andy@liquidata.co>
Date:   Sat Sep 12 00:45:18 +0000 2020

	Open Images version 1
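
The difference between a tag and a branch can be pictured with a toy model. The sketch below is not how Dolt stores refs; the RefStore class, the branch name master, and the placeholder later commit are hypothetical. It only shows the behavior described above: a branch may be moved to a new commit, while a tag keeps pointing at the commit it was created on.

```python
class RefStore:
    """Toy model of named refs: branches can move, tags cannot."""

    def __init__(self):
        self.branches = {}
        self.tags = {}

    def set_branch(self, name, commit):
        # Branches are free to move to a new commit.
        self.branches[name] = commit

    def create_tag(self, name, commit, message=""):
        # A tag is written once; re-pointing it is refused.
        if name in self.tags:
            raise ValueError(f"tag {name!r} already exists")
        self.tags[name] = (commit, message)

    def resolve(self, name):
        # Assumption for this sketch: tags are looked up before branches.
        if name in self.tags:
            return self.tags[name][0]
        return self.branches[name]


refs = RefStore()
refs.set_branch("master", "qlvnt89q84a9ktne5iidut4c7d7pes8h")
refs.create_tag("v5", "qlvnt89q84a9ktne5iidut4c7d7pes8h", "Open Images version 5")

# The branch moves on (placeholder commit name); the tag does not.
refs.set_branch("master", "hypothetical-later-commit")
print(refs.resolve("v5"))
print(refs.resolve("master"))
```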

Tags work just like any other Dolt ref, meaning you can use them to diff data:

% dolt diff v3 v5 label_dictionary
  diff --dolt a/label_dictionary b/label_dictionary
  --- a/label_dictionary @ 34553n79pnr4f9l44vljkqltjtiabj6t
  +++ b/label_dictionary @ sn63dfskqfeo6frkanrii5i5mk9h0ac8
  +-----+-------------+-----------------------------------+
  |     | LabelName   | ShortDescription                  |
  +-----+-------------+-----------------------------------+
  |  <  | /m/0119x1zy | Bun                               |
  |  >  | /m/0119x1zy | Bun (Food)                        |
  |  <  | /m/011_f4   | String instrument                 |
  |  >  | /m/011_f4   | Chordophone                       |
 ...
  |  <  | /m/0y8r     | Armored car                       |
  |  >  | /m/0y8r     | Armored car (Military)            |
  |  <  | /m/0zrthkd  | Brine                             |
  |  >  | /m/0zrthkd  | Brine (Food)                      |
  +-----+-------------+-----------------------------------+
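
The shape of that output, an old row marked < followed by its new version marked >, can be reproduced with a small keyed comparison. This is a sketch of the general idea, not Dolt's diff algorithm; the function diff_rows and the + and - markers for added and removed rows are assumptions, and the last row in the sample data is hypothetical.

```python
def diff_rows(old, new):
    """Compare two versions of a table given as {primary_key: value} dicts.

    Returns a list of (marker, key, value) tuples:
    "<" / ">" for the old and new sides of a modified row,
    "-" for a removed row and "+" for an added row.
    """
    changes = []
    for key in sorted(old.keys() | new.keys()):
        if key not in old:
            changes.append(("+", key, new[key]))
        elif key not in new:
            changes.append(("-", key, old[key]))
        elif old[key] != new[key]:
            changes.append(("<", key, old[key]))
            changes.append((">", key, new[key]))
        # Rows that are identical in both versions are skipped.
    return changes


v3 = {
    "/m/0119x1zy": "Bun",
    "/m/011_f4": "String instrument",
    "/m/example": "Unchanged label",  # hypothetical row, same in both versions
}
v5 = {
    "/m/0119x1zy": "Bun (Food)",
    "/m/011_f4": "Chordophone",
    "/m/example": "Unchanged label",
}

for marker, key, value in diff_rows(v3, v5):
    print(marker, key, value)
```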

And you can explore each version of the data on DoltHub. You can even use DoltHub's SQL API to query the data at a specific release:

% python3
>>> import requests
>>> owner, repo = 'dolthub', 'open-images'
>>> ref = 'v5'
>>> res = requests.get('https://dolthub.com/api/v1alpha1/{}/{}/{}'.format(owner, repo, ref))
>>> res.json()
{
    'query_execution_status': 'Success',
    'query_execution_message': '',
    'repository_owner': 'dolthub',
    'repository_name': 'open-images',
    'commit_ref': 'v5',
    'sql_query': 'SHOW TABLES;',
    'schema': [
        {'columnName': 'Table', 'columnType': 'String', 'isPrimaryKey': False}
    ],
    'rows': [
         {'Table': 'bounding_boxes'},
         {'Table': 'images'},
         {'Table': 'label_dictionary'},
         {'Table': 'labels'},
         {'Table': 'relationships'}
     ]
}
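
The request above runs the API's default query, SHOW TABLES. To run a different query against a release, the SQL text has to be sent along with the request. The sketch below only builds the URL and makes no network call. It assumes the API accepts the SQL in a q query-string parameter, which the example above does not show; the helper name release_query_url is hypothetical, and the base URL is copied from the example above.

```python
from urllib.parse import urlencode

# Base URL copied from the requests.get example above.
API_ROOT = "https://dolthub.com/api/v1alpha1"


def release_query_url(owner, repo, ref, sql=None):
    """Build an API URL for a repository at a given ref (branch or tag).

    Assumption: the SQL text is passed in a `q` query-string parameter.
    """
    url = f"{API_ROOT}/{owner}/{repo}/{ref}"
    if sql is not None:
        # urlencode percent-escapes the query so it is safe inside a URL.
        url += "?" + urlencode({"q": sql})
    return url


url = release_query_url("dolthub", "open-images", "v5", "SELECT COUNT(*) FROM images")
print(url)
```

The resulting string can be passed to requests.get in place of the formatted URL used above.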

Conclusion

Relational database technology is a powerful tool for storing and analyzing data sets, but when it comes to moving data between machines, we're stuck in the Stone Age. The technology exists to collaborate on projects in a distributed manner. We built Dolt to make that technology available to you. Tags and data releases are another step in that journey. Your data should exist and be reliable where and when you need it. That's what Dolt is for.
