Part I: Dolt API Surface Area

9 min read

Background

When DoltHub was founded it was called Liquidata. The goal was to bring liquidity to the data market. The founders realized that the pipes were broken: sending around CSV, JSON, and other formats was broken. The requirement to translate to and from formats without formal type systems at both ends of the "pipe" was an unacceptable degree of friction: Data Transfer Before Dolt

Searching around for an example of a good user experience for data transfer, the one we liked most was Git remotes:

$ git clone path/to/remote

Or if you already have the repository you can just as easily grab the latest changes:

$ git pull path/to/remote

But "Git for data" did not solve the friction we had identified. For one, CSV and JSON files can distributed via Git remotes. Furthermore, that doesn't solve for a primary source of friction identified above, which is translating data to and from formats such as CSV and JSON, and internal data storage formats.

SQL is by far the most ubiquitous way to describe and query data. Even where SQL is ultimately not a suitable interface for the end use of data, it is often used to define data pipelines that feed data into more specialized storage layouts. We concluded that if data arrived in a SQL queryable format, we could deliver a significantly streamlined user experience for data distribution: Data Transfer After Dolt

We chose to combine concepts from Git and SQL to create a version controlled database motivated by our desire to deliver frictionless data transfer. We imagined users cloning datasets, and spinning up a SQL server, all with a single tool. We called it Dolt.

API Surface Area

This decision gave our database API two key requirements:

  • a SQL query interface
  • Git-like version control semantics

These two goals are somewhat in tension. A relational database server uses SQL as an interface, often to a remote process. By contrast, Git version control uses a CLI acting mostly on local files, often with complex multi-step workflows. We also imagined that a core use-case for Dolt would be as a collaboration tool for building databases, meaning porting Git concepts to SQL was not an option.

Before digging into how we attempting to strike this balance, it's worth briefly investigating the API surface area of some popular relational databases.

Other Databases

Most users interact with relational databases via issuing SQL statements to a remote server process. However, not all tasks for operating traditional relational databases can be described in SQL. For example, tab completing the prefix pg on my Mac shows the following list of Postgres tools:

$ pg
pg_archivecleanup  pg_controldata     pg_isready         pg_restore         pg_test_timing
pg_basebackup      pg_ctl             pg_receivewal      pg_rewind          pg_upgrade
pg_checksums       pg_dump            pg_recvlogical     pg_standby         pg_waldump
pg_config          pg_dumpall         pg_resetwal        pg_test_fsync      pgbench

And likewise for the prefix mysql:

mysql
mysql                      mysql_upgrade              mysqldumpslow              mysqlshow
mysql.server               mysqladmin                 mysqlimport                mysqlslap
mysql_client_test          mysqlbinlog                mysqlpump                  mysqltest
mysql_config               mysqlcheck                 mysqlrouter                mysqltest_safe_process
mysql_config_editor        mysqld                     mysqlrouter_keyring        mysqlxtest
mysql_secure_installation  mysqld_multi               mysqlrouter_passwd
mysql_ssl_rsa_setup        mysqld_safe                mysqlrouter_plugin_info
mysql_tzinfo_to_sql        mysqldump                  mysqlsh

By splitting API surface area across a SQL query interface and set of shell scripts for automating common administrative tasks, these SQL databases implicitly identify two concerns:

  • querying data
  • administering the instance

The first bucket can be thought of as "online concerns," in the sense that the querying data concerns the contents of the database, and thus assumes the database is online and available. The second bucket can be thought of us "offline concerns," in the sense that they address the existence and availability of the database process.

Since Dolt is a database, it has to address both of these concerns, as well as exposing version control semantics.

Dolt

A Dolt instance is not just a server process running in a data center, but a full version history that can synced with a remote. Dolt's offline interface must allow users to administer their local database, while also exposing the git-style versioning tools that make that collaboration and sharing possible.

Once users have obtained data from a remote, however, they want to have a familiar SQL experience. That means the offline Dolt API must make it easy to transition from "administration", obtaining data and standing up the server, to a familiar experience of executing SQL queries via a connector.

To achieve these goals, we decided the offline experience would be best served by a user experience closely modeled on the Git CLI, with additional commands related to managing a running server. We also chose to implement the MySQL dialect, which means that Dolt can be used as a drop-in replacement for any existing application backed by MySQL.

CLI

Let's dive into the CLI. Running dolt dumps out the index of available commands:

$ dolt
Valid commands for dolt are
                init - Create an empty Dolt data repository.
              status - Show the working tree status.
                 add - Add table changes to the list of staged table changes.
               reset - Remove table changes from the list of staged table changes.
              commit - Record changes to the repository.
                 sql - Run a SQL query against tables in repository.
          sql-server - Start a MySQL-compatible server.
                 log - Show commit logs.
                diff - Diff a table.
               blame - Show what revision and author last modified each row of a table.
               merge - Merge a branch.
              branch - Create, list, edit, delete branches.
                 tag - Create, list, delete tags.
            checkout - Checkout a branch or overwrite a table from HEAD.
              remote - Manage set of tracked repositories.
                push - Push to a dolt remote.
                pull - Fetch from a dolt remote data repository and merge.
               fetch - Update the database from a remote data repository.
               clone - Clone from a remote data repository.
               creds - Commands for managing credentials.
               login - Login to a dolt remote host.
             version - Displays the current Dolt cli version.
              config - Dolt configuration.
                  ls - List tables in the working set.
              schema - Commands for showing and importing table schemas.
               table - Commands for copying, renaming, deleting, and exporting tables.
           conflicts - Commands for viewing and resolving merge conflicts.
             migrate - Executes a repository migration to update to the latest format.
         read-tables - Fetch table(s) at a specific commit into a new dolt repo
                  gc - Cleans up unreferenced data from the repository.
       filter-branch - Edits the commit history using the provided query.

Let's use the dolt clone command modeled off of Git to clone a database:

$ dolt clone dolthub/ip-to-country
cloning https://doltremoteapi.dolthub.com/dolthub/ip-to-country
32,832 of 32,832 chunks complete. 0 chunks being downloaded currently.
$ cd ip-to-country

That was quite familiar from Git, even the printing out of chunks. Now to query our data, we'll start the server:

$ dolt checkout -b test
Switched to branch 'test'
$ dolt sql-server
Starting server with Config HP="localhost:3306"|U="root"|P=""|T="28800000"|R="false"|L="info"

A Dolt SQL Server instance is now running on port 3306, using the ip-to-country database we just clone from DoltHub.

SQL

We chose SQL as a query interface because it is widely adopted, and MySQL in particular because of a rich ecosystem of connectors. We also wanted users to have the familiar Git-like experience of managing their data, and seamlessly transition to executing queries in a familiar language.

We can now see the magic of "just working" as we use the off-the-shelf MySQL connector to connect to Dolt and have a familiar SQL query interface:

~|>>  mysql --host=127.0.0.1 --user=root
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.7.9-Vitess

Copyright (c) 2000, 2020, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> use ip_to_country;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed

mysql> select * from IPv4ToCountry limit 5;
+-----------+-----------+----------+--------------+--------------------+--------------------+-----------+
| IPFrom    | IpTo      | Registry | AssignedDate | CountryCode2Letter | CountryCode3Letter | Country   |
+-----------+-----------+----------+--------------+--------------------+--------------------+-----------+
| 763029248 | 763029503 | apnic    |   1438300800 | AU                 | AUS                | Australia |
| 763029504 | 763030527 | apnic    |   1436486400 | VN                 | VNM                | Viet Nam  |
| 763030528 | 763031551 | apnic    |   1436486400 | VN                 | VNM                | Viet Nam  |
| 763031552 | 763032575 | apnic    |   1436486400 | VN                 | VNM                | Viet Nam  |
| 763032576 | 763033599 | apnic    |   1436486400 | VN                 | VNM                | Viet Nam  |
+-----------+-----------+----------+--------------+--------------------+--------------------+-----------+
10 rows in set (0.02 sec)

Bringing it Together

We set out to build a database that provided version control features to solve for modern data infrastructure needs. This entailed combining two models. So far we showed a lightly augmented Git interface for the offline experience, and a generic SQL query interface for when Dolt is running a server process.

But to fully realize the value of our creation requires exposing version control in SQL. This where the real API challenge arises, as we try and figure out where to augment existing SQL functionality to work with an underlying commit graph, and where to implement functions that take care of concerns that don't fit in a declarative syntax.

A good example is AS OF syntax. This is a SQL standard feature that is there to expose "system versioning." MariaDB supports this when turned on, for timestamps. But since Dolt stores its version history in a commit graph, it understands not only timestamps, but also commit hashes, branch names, and refs such as HEAD~:

mysql>  SELECT * FROM IPv4ToCountry AS OF 'mh8o6k5bja6o3bmj8gflc4f9rdjclhcb' WHERE IPFrom = 521020416 AND IpTo = 521021439;
+-----------+-----------+----------+--------------+--------------------+--------------------+---------+
| IPFrom    | IpTo      | Registry | AssignedDate | CountryCode2Letter | CountryCode3Letter | Country |
+-----------+-----------+----------+--------------+--------------------+--------------------+---------+
| 521020416 | 521021439 | ripencc  |   1303084800 | RO                 | ROU                | Romania |
+-----------+-----------+----------+--------------+--------------------+--------------------+---------+
1 row in set (0.00 sec)

Other version control features are exposed via special system tables that can be incoporated into any SQL query. For example, to examine the commit log I can select from dolt_log:

mysql> select commit_hash, date, message from dolt_log limit 5;
+----------------------------------+-------------------------+----------------------------------------------------------+
| commit_hash                      | date                    | message                                                  |
+----------------------------------+-------------------------+----------------------------------------------------------+
| 2v3vrf8ta478a5e1h4aihaj73q7kh96a | 2020-12-13 01:30:27.421 | Update IP to Country for date 2020-12-13 01:29:50.478038 |
| mh8o6k5bja6o3bmj8gflc4f9rdjclhcb | 2020-12-12 01:31:21.972 | Update IP to Country for date 2020-12-12 01:30:45.344102 |
| bqtrbbrsgie7u4fsgph0t12j94ofefml | 2020-12-11 01:30:36.075 | Update IP to Country for date 2020-12-11 01:29:53.647746 |
| u8pnf1o0mus2d8mok0083t5lpj190j5f | 2020-12-10 01:30:29.088 | Update IP to Country for date 2020-12-10 01:29:52.085252 |
| kqjdel14k130as6jdjbrurtg9j8e93ie | 2020-12-09 20:02:43.082 | Update IP to Country for date 2020-12-09 20:02:07.846972 |
+----------------------------------+-------------------------+----------------------------------------------------------+
5 rows in set (0.02 sec)

For convenience, the entire set of CLI functionality is available in SQL with special functions. For example, you can check out a branch with a SELECT statement like so:

mysql> select dolt_checkout('some-branch')

This concludes our tour of Dolt's native APIs.

Higher Level APIs

Other relational database solutions do not generally provide higher level APIs outside of SQL. There exist many such APIs for SQL databases, but they are generally not part of the database itself. For example, SQL Alchemy is a library that provides Python native functions for manipulating various flavors of SQL database.

Dolt is a bit more complicated than existing SQL databases, merely because it has a slew of additional features that just make the possibility space larger. For example, there are combinations of common SQL queries and version control operations that users might regularly combine. We believed there existed a sufficient number of these to justify a higher level Python API designed to make it easier to integrate Dolt into existing data infrastructure.

In the next part of this post, tomorrow, we will do a deep dive on the latest iteration of Doltpy, our Python API for interacting with Dolt.

Conclusion

We reviewed how Dolt combines concepts from Git and MySQL, and where necessary innovates, to bring a relational database with novel features to market. If those features sound exciting to you, give Dolt a try, or get in touch with us. Our next post will show to seamlessly script data pipelines that incorporate Dolt using the latest version of Doltpy.

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt