Access to Everything Through SQL

9 min read

When we started developing Dolt our vision was to deliver git functionality for data. Where git versions files, Dolt versions tables. We implemented table based diff and conflict logic and shipped the initial version. As we started to use Dolt to manage our own data, we started to see that the Git model could only be stretched so far for data. We needed to borrow more metaphors from databases. We needed to be able to query across the metadata and data in Dolt to unlock some of version control's power for data. So now, we are working to make all of Dolt's features accessible via SQL. While much of the work is still in the planning stage, we have rolled out 3 system tables that give SQL access to the Dolt log, diffs, and history. This article will explore these tables, and the power they have to answer questions that couldn't be answered before.

dolt_log

The first system table added to Dolt was dolt_log. By querying the dolt_log table you get to see the exact same data available from the dolt log command, except you have the ability to limit, filter, order by, join and or do anything else supported by SQL.

If I wanted to see any commits I made to a dataset after a certain date, I could use the following query:

SELECT *
FROM dolt_log
WHERE committer = "bheni" and date > "2019-04-01"
ORDER BY "date";
+----------------------------------+-----------+--------------------+-----------------------------------+---------------+
| commit_hash                      | committer | email              | date                              | message       |
+----------------------------------+-----------+--------------------+-----------------------------------+---------------+
| qi331vjgoavqpi5am334cji1gmhlkdv5 | bheni     | brian@liquidata.co | 2019-06-07 00:22:24.856 +0000 UTC | update rating |
| 137qgvrsve1u458briekqar5f7iiqq2j | bheni     | brian@liquidata.co | 2019-04-04 22:43:00.197 +0000 UTC | change rating |
| rqpd7ga1nic3jmc54h44qa05i8124vsp | bheni     | brian@liquidata.co | 2019-04-04 21:07:36.536 +0000 UTC | fixes         |
+----------------------------------+-----------+--------------------+-----------------------------------+---------------+

dolt_diff_$TABLENAME

The Dolt diff tables provide SQL access to the same data you get with the dolt diff command with the additional querying power of SQL. This type of functionality doesn’t exist in any other database, and it has many valuable use cases. If you are a consumer of data and you want to be able to validate a new version of that data, being able to query the diffs to find out what’s changed from the production version has tremendous value. Similarly, if you have a production issue due to a data change, with Dolt you can rollback, and then query the diff table to debug what caused your issue.

Taking the dolthub/wikipedia-ngrams data repository from dolthub as our example, the following query will retrieve the bigrams whose total counts have changed the most between 2 versions.

SELECT *, ABS(to_total_count-from_total_count) AS delta
FROM dolt_diff_bigram_counts
WHERE from_commit = "ghqbc0vpjpsl4065rvbgmcgrddh1e69r" AND diff_type = "modified"
ORDER BY delta DESC
LIMIT 10;
+-------------+-------------+------------------+----------------+--------------------+------------------+----------------------------------+-----------+-----------+-------+
| from_bigram | to_bigram   | from_total_count | to_total_count | from_article_count | to_article_count | from_commit                      | to_commit | diff_type | delta |
+-------------+-------------+------------------+----------------+--------------------+------------------+----------------------------------+-----------+-----------+-------+
| of the      | of the      | 21566470         | 21616678       | 3591159            | 3599847          | ghqbc0vpjpsl4065rvbgmcgrddh1e69r | HEAD      | modified  | 50208 |
| _START_ The | _START_ The | 19008468         | 19052410       | 3410523            | 3418602          | ghqbc0vpjpsl4065rvbgmcgrddh1e69r | HEAD      | modified  | 43942 |
| in the      | in the      | 14345719         | 14379619       | 3677563            | 3686463          | ghqbc0vpjpsl4065rvbgmcgrddh1e69r | HEAD      | modified  | 33900 |
| _START_ In  | _START_ In  | 8212684          | 8234586        | 2267450            | 2273503          | ghqbc0vpjpsl4065rvbgmcgrddh1e69r | HEAD      | modified  | 21902 |
| to the      | to the      | 7275659          | 7291823        | 2263950            | 2269093          | ghqbc0vpjpsl4065rvbgmcgrddh1e69r | HEAD      | modified  | 16164 |
| _START_ He  | _START_ He  | 5722362          | 5737483        | 1499075            | 1503863          | ghqbc0vpjpsl4065rvbgmcgrddh1e69r | HEAD      | modified  | 15121 |
| at the      | at the      | 4273616          | 4287398        | 1839024            | 1845056          | ghqbc0vpjpsl4065rvbgmcgrddh1e69r | HEAD      | modified  | 13782 |
| for the     | for the     | 4427780          | 4438872        | 1781278            | 1785682          | ghqbc0vpjpsl4065rvbgmcgrddh1e69r | HEAD      | modified  | 11092 |
| and the     | and the     | 4871852          | 4882874        | 1878831            | 1883399          | ghqbc0vpjpsl4065rvbgmcgrddh1e69r | HEAD      | modified  | 11022 |
| is a        | is a        | 4632620          | 4643068        | 3238685            | 3246381          | ghqbc0vpjpsl4065rvbgmcgrddh1e69r | HEAD      | modified  | 10448 |
+-------------+-------------+------------------+----------------+--------------------+------------------+----------------------------------+-----------+-----------+-------+

Full documentation for dolt_diff_$TABLENAME can be found here

dolt_history_$TABLENAME

Unlike the previous system tables, the Dolt history tables don’t provide access to a feature already exposed by the command line, they expose brand new functionality. The Dolt history tables allow you to see how rows have changed over time.

Dolt MySQL correctness test results are published to the dolthub/dolt-sqllogictest-results repository on dolthub. In order to see how a specific test has changed over time I can query the history of a value in dolt_results by querying dolt_history_dolt_results.

SELECT result, error_message, commit_hash, commit_date 
FROM dolt_history_dolt_results 
WHERE test_file = "index/random/10/slt_good_0.test" and line_num = 14220;
+--------+---------------------------------------------------------------------------------+----------------------------------+-----------------------------------+
| result | error_message                                                                   | commit_hash                      | commit_date                       |
+--------+---------------------------------------------------------------------------------+----------------------------------+-----------------------------------+
| ok     | <NULL>                                                                          | dq70i7vjugf36v2jf4li5h6jmb7911pr | 2020-01-09 21:17:44.39 +0000 UTC  |
| not ok | Unexpected error unsupported syntax: "missed ON clause for JOIN statement"      | 5atladtm3v1b3jpr2ko9m9ge9p2i78d5 | 2019-12-17 01:52:57.719 +0000 UTC |
| not ok | Unexpected error unsupported syntax: "missed ON clause for JOIN statement"      | k8vkutunsgj37j2l3rfd4i1qtt0aug6e | 2019-12-13 21:22:05.154 +0000 UTC |
| not ok | Unexpected error unsupported syntax: "missed ON clause for JOIN statement"      | p168qfb6t1toe2sgnj5qsnaoqite8ard | 2019-12-13 19:20:34.211 +0000 UTC |
| not ok | Unexpected error unsupported syntax: "missed ON clause for JOIN statement"      | t6ulne1gsjf6mhvufmqagnaer6fvi6ig | 2019-11-27 21:38:55.009 +0000 UTC |
| not ok | Unexpected error unsupported syntax: "missed ON clause for JOIN statement"      | u20r477j4cupg2gqktrj6sth57v3gfgs | 2019-11-26 22:40:28.832 +0000 UTC |
| not ok | Unexpected error unsupported syntax: "missed ON clause for JOIN statement"      | 9s1uatmin8d3q08pi9hks4smg9bim6lm | 2019-11-26 01:33:02.269 +0000 UTC |
| not ok | Unexpected error unsupported syntax: "missed ON clause for JOIN statement"      | f31f9u6go77t4s95ses10okecg80ehhv | 2019-11-22 23:34:16.918 +0000 UTC |
| not ok | Unexpected error Code: INVALID_ARGUMENT syntax error at position 11 near 'all'  | j9cvi759a29b6upl52i8kl9qcptm0dhs | 2019-11-22 22:10:52.27 +0000 UTC  |
| not ok | Unexpected error Code: INVALID_ARGUMENT syntax error at position 11 near 'all'  | bkpqqisfq97qrdek7pvo2fsu79ijr2ut | 2019-11-20 19:10:23.325 +0000 UTC |
| not ok | Unexpected error Code: INVALID_ARGUMENT syntax error at position 11 near 'all'  | dfsf7sfcncvpfgm66koj8ji340lu20if | 2019-11-19 00:06:51.079 +0000 UTC |
| not ok | Unexpected error Code: INVALID_ARGUMENT syntax error at position 11 near 'all'  | m6g9g8ev5lbfigarsros8h3q846kf19f | 2019-11-13 17:53:07.653 +0000 UTC |
+--------+---------------------------------------------------------------------------------+----------------------------------+-----------------------------------+

In this case we can say that the error message changed on commit f31f9u6go77t4s95ses10okecg80ehhv and the test was fixed with commit dq70i7vjugf36v2jf4li5h6jmb7911pr.

If I had a test that I knew wasn’t working and I wanted to find out when it broke, and who broke it I could do the following:

SELECT result, commit_hash, committer, commit_date 
FROM dolt_history_dolt_results 
WHERE test_file = "evidence/slt_lang_createview.test" and line_num = 70;
+--------+----------------------------------+----------------+-----------------------------------+
| result | commit_hash                      | committer      | commit_date                       |
+--------+----------------------------------+----------------+-----------------------------------+
| not ok | dq70i7vjugf36v2jf4li5h6jmb7911pr | Zach Musgrave  | 2020-01-09 21:17:44.39 +0000 UTC  |
| not ok | 5atladtm3v1b3jpr2ko9m9ge9p2i78d5 | Daylon Wilkins | 2019-12-17 01:52:57.719 +0000 UTC |
| ok     | k8vkutunsgj37j2l3rfd4i1qtt0aug6e | Zach Musgrave  | 2019-12-13 21:22:05.154 +0000 UTC |
| ok     | p168qfb6t1toe2sgnj5qsnaoqite8ard | Zach Musgrave  | 2019-12-13 19:20:34.211 +0000 UTC |
| ok     | t6ulne1gsjf6mhvufmqagnaer6fvi6ig | Zach Musgrave  | 2019-11-27 21:38:55.009 +0000 UTC |
| ok     | u20r477j4cupg2gqktrj6sth57v3gfgs | Zach Musgrave  | 2019-11-26 22:40:28.832 +0000 UTC |
| ok     | 9s1uatmin8d3q08pi9hks4smg9bim6lm | Zach Musgrave  | 2019-11-26 01:33:02.269 +0000 UTC |
| ok     | f31f9u6go77t4s95ses10okecg80ehhv | Zach Musgrave  | 2019-11-22 23:34:16.918 +0000 UTC |
| ok     | j9cvi759a29b6upl52i8kl9qcptm0dhs | Zach Musgrave  | 2019-11-22 22:10:52.27 +0000 UTC  |
| ok     | bkpqqisfq97qrdek7pvo2fsu79ijr2ut | Zach Musgrave  | 2019-11-20 19:10:23.325 +0000 UTC |
| ok     | dfsf7sfcncvpfgm66koj8ji340lu20if | Zach Musgrave  | 2019-11-19 00:06:51.079 +0000 UTC |
| ok     | m6g9g8ev5lbfigarsros8h3q846kf19f | Zach Musgrave  | 2019-11-13 17:53:07.653 +0000 UTC |
+--------+----------------------------------+----------------+-----------------------------------+

Full documentation for dolt_history_$TABLENAME can be found here

This is Just the Beginning

There is so much more to be done to unlock the power of a version controlled database. Once you start using Dolt to manage your data, features become pretty obvious. Here's a sample of some of the things we're dreaming about.

Time sliced queries

SQL 2011 introduced the concept of time sliced queries. In the future, Dolt will allow you to query the values of a table at a specific timestamp or commit hash, or all the values that occur between two times or commits. The exact syntax is still being determined but it would look something like this:

/* Select a value at a specific commit */
SELECT *
FROM table 
for COMMIT_HASH AS OF 'd8jc6a1bedrkumscc77flgpc7rt9b7me';

/* Select all values for the year 2019 */
SELECT *
FROM table
FOR COMMIT_TIME BETWEEN '2019-01-01' AND '2019-12-31';

To do this in the current version of Dolt you would need to run dolt checkout providing a commit hash before performing your query. This provides a much easier way of accessing the data without checking out the data at a different point in time.

Transactions to add and commit

One of the major features currently missing in SQL for Dolt is the ability to commit changes. We have discussed the idea of using SQL transactions as a mechanism for adding tables that have changed, and committing them with the possibility of automatically merging in the case of conflicts. There is still a lot of design work that would need to go into this, but it is something we will tackle in the future.

Try Dolt out and let us know what you would like to see

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt