Dolt Diff Magic: Part 1 – The dolt_diff system table

SQL
11 min read

Here at DoltHub, we’re building Dolt DB – a new type of distributed, versioned database that gives you all the power and expressivity of a SQL relational database, combined with the ability to time travel through your data to understand exactly when, where, and how your data has changed.

In this blog post series, we're going to be exploring some of the "Dolt magic" that makes Dolt so unique and powerful. We'll learn about lots of ways to explore and leverage the powerful versioned history Dolt tracks for your data, using examples you can follow along with.
This first post explores the dolt_diff system table, which gives us an overview of what's changed in our database. The second post in this series goes deeper into diffs and explores ways to inspect how the data in your tables has changed over time. The third post in this series shows how to diff the full contents of a table at any two revisions, even across branches.

Before we jump in, let's talk a little bit more about why you should care about these features...

Dolt Magic

Dolt Magic

The magic of Dolt is the ability to look at your data and your database schema and be able to answer exactly where the data came from, when it changed, how it changed, who changed it, and hopefully a good commit message will also tell you why it changed! Deeply understanding your data at this level is extremely powerful and opens up a TON of interesting use cases.

In every Dolt database, the data's versioned history is always tracked and available for you whenever you need it. That could be for running ad-hoc queries to troubleshoot data quality issues, or maybe you want to analyze how your data has changed over time and generate reports, or perhaps you want to leverage that versioned history directly in your application to give your customers data versioning features. These are all great use cases for Dolt DB and we'll explore some of these more in this series.

Throughout this series, we'll be focusing on how to access this Dolt magic through a SQL interface, but don’t forget that the Dolt CLI also exposes much of the same information if you want to work with it from the command line.

Okay, let's jump in!

The dolt_diff System Table

Let’s start at the very beginning (traditionally, a very good place to start 🎶 ). The very first question you’re likely to want to ask is “what’s been changing in my database?” To answer this question, we’re going to use the dolt_diff system table. dolt_diff returns the commit history for the currently checked out branch and shows us what commits affected which tables, and whether or not they included data changes and schema changes.

To follow along and try out these queries on your own, just install Dolt and clone one of the public databases hosted on DoltHub. I've been using the Corona Virus Dataset and have included my results below. I've also truncated some of the results for large result sets.

To start, let's just look at the first few rows of the dolt_diff system table to get an idea of what kind of data is stored in the system table.

select * from dolt_diff limit 10;
+----------------------------------+--------------------+------------------------+------------------------+-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+---------------+
| commit_hash                      | table_name         | committer              | email                  | date                              | message                                                                                                                                                                                                    | data_change | schema_change |
+----------------------------------+--------------------+------------------------+------------------------+-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+---------------+
| s4eqichm8ebqvfhabmtcsg9tca9gagbb | dolt_query_catalog | Zach Musgrave          | zach@dolthub.com       | 2021-10-04 18:37:42.702 -0700 PDT | Fixed query                                                                                                                                                                                                | true        | false         |
| g2736rfc4qd5k20gapp19ahlohklgbqj | dolt_docs          | Zach Musgrave          | zach@dolthub.com       | 2021-10-04 18:28:37.799 -0700 PDT | Fixed an erroneous query (can't use an alias in a where clause, only a having clause)                                                                                                                      | true        | false         |
| g2736rfc4qd5k20gapp19ahlohklgbqj | dolt_query_catalog | Zach Musgrave          | zach@dolthub.com       | 2021-10-04 18:28:37.799 -0700 PDT | Fixed an erroneous query (can't use an alias in a where clause, only a having clause)                                                                                                                      | true        | false         |
| 2799ua7240mgp6c04vbu1sv8bv516da4 | cases              | LiquidataSystemAccount | liquidatasys@gmail.com | 2020-10-12 18:30:45.112 -0700 PDT | Automated import of cases and places tables downloaded from https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/ at Tue Oct 13 01:30:45 2020 GMT | true        | false         |
| g2ok3r4i358qu02avdpq1avihgb597ou | cases              | LiquidataSystemAccount | liquidatasys@gmail.com | 2020-10-12 06:30:41.257 -0700 PDT | Automated import of cases and places tables downloaded from https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/ at Mon Oct 12 13:30:41 2020 GMT | true        | false         |
| k43t8jp6celo1boam4ut4dr6fpmke7tn | places             | LiquidataSystemAccount | liquidatasys@gmail.com | 2020-10-11 18:30:37.678 -0700 PDT | Automated import of cases and places tables downloaded from https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/ at Mon Oct 12 01:30:37 2020 GMT | true        | false         |
| 41kepaq9hrobft924bbpl2r69l8jf29c | cases              | LiquidataSystemAccount | liquidatasys@gmail.com | 2020-10-11 06:30:40.901 -0700 PDT | Automated import of cases and places tables downloaded from https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/ at Sun Oct 11 13:30:40 2020 GMT | true        | false         |
| 41kepaq9hrobft924bbpl2r69l8jf29c | places             | LiquidataSystemAccount | liquidatasys@gmail.com | 2020-10-11 06:30:40.901 -0700 PDT | Automated import of cases and places tables downloaded from https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/ at Sun Oct 11 13:30:40 2020 GMT | true        | false         |
| 6l48m2uj7ncfvbtrrvopl3f8qo05ilo2 | places             | LiquidataSystemAccount | liquidatasys@gmail.com | 2020-10-10 18:30:39.849 -0700 PDT | Automated import of cases and places tables downloaded from https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/ at Sun Oct 11 01:30:39 2020 GMT | true        | false         |
| m27rucir3r9lqf9m908fbfkpljfksvnu | cases              | LiquidataSystemAccount | liquidatasys@gmail.com | 2020-10-10 06:30:39.894 -0700 PDT | Automated import of cases and places tables downloaded from https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/ at Sat Oct 10 13:30:39 2020 GMT | true        | false         |
+----------------------------------+--------------------+------------------------+------------------------+-----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+---------------+

This gives us a high-level view of what’s going on in our database. For every commit that changed either the data or schema of a table, we see it listed in our results along with some metadata about the commit. If a commit changed multiple tables, that commit shows up multiple times in the results – once for each table that was changed.

This high-level information about changes across our database already enables us to start answering some interesting questions about our data. To demonstrate that, here are a dozen sample questions that we can now answer using the information from the dolt_diff system table.

What tables has a specific user ever updated?

select distinct table_name from dolt_diff where committer like 'Zach%';
+--------------------+
| table_name         |
+--------------------+
| dolt_query_catalog |
| dolt_docs          |
| dolt_schemas       |
+--------------------+

In how many tables did each commit change data?

select commit_hash, count(table_name) as table_count 
  from dolt_diff where data_change=true 
 group by commit_hash
 order by table_count desc;
+----------------------------------+-------------+
| commit_hash                      | table_count |
+----------------------------------+-------------+
| g01h2tgj47o7t4dgroceht9evrq6b737 | 9           |
| ov5j7nde4f1ge6h6f3anhn72go3ihndm | 3           |
| sth8ve0hij95mp8e711r53quj4mq3v5f | 3           |
| qvthoo9lpvjnrpqasfuqpd8o0nm8u9k0 | 3           |
| tq91tnh2np48dj99v0k8mbokqv1nhmg9 | 3           |
| g2736rfc4qd5k20gapp19ahlohklgbqj | 2           |
| 41kepaq9hrobft924bbpl2r69l8jf29c | 2           |
| ...                              | ...         |
+----------------------------------+-------------+

Which commits included changes that modified data and schema?

select * from dolt_diff where data_change=true and schema_change=true;
+----------------------------------+------------------------------------+-----------+------------------+-----------------------------------+-------------------------------------------------------------------------------------------+-------------+---------------+
| commit_hash                      | table_name                         | committer | email            | date                              | message                                                                                   | data_change | schema_change |
+----------------------------------+------------------------------------+-----------+------------------+-----------------------------------+-------------------------------------------------------------------------------------------+-------------+---------------+
| tof6ef5eotkj8q1pk90m1qlu8vfqti24 | case_details                       | Tim Sehn  | tim@liquidata.co | 2020-03-18 14:47:34.561 -0700 PDT | Added case_details table with BIGINT case_id because South Korea changed their id system  | true        | true          |
| g01h2tgj47o7t4dgroceht9evrq6b737 | characteristics_sex                | Tim Sehn  | tim@liquidata.co | 2020-03-02 12:28:17.496 -0800 PST | Merging tip of master into my master with new tables on it                                | true        | true          |
| g01h2tgj47o7t4dgroceht9evrq6b737 | characteristics_wuhan_exposed      | Tim Sehn  | tim@liquidata.co | 2020-03-02 12:28:17.496 -0800 PST | Merging tip of master into my master with new tables on it                                | true        | true          |
| ...                              | ...                                | ...       | ...              | ...                               | ...                                                                                       | ...         | ...           |
+----------------------------------+------------------------------------+-----------+------------------+-----------------------------------+-------------------------------------------------------------------------------------------+-------------+---------------+

How many tables has each committer changed?

select committer, count(distinct table_name) table_count 
  from dolt_diff 
 group by committer
 order by table_count desc;
+------------------------+-------------+
| committer              | table_count |
+------------------------+-------------+
| Tim Sehn               | 14          |
| Zach Musgrave          | 3           |
| LiquidataSystemAccount | 3           |
| bheni                  | 3           |
| Daylon Wilkins         | 1           |
| katie                  | 1           |
+------------------------+-------------+

What schemas changed in the past month/year/etc?

select distinct table_name 
  from dolt_diff 
 where schema_change=true and date > (NOW() - INTERVAL 25 MONTH);
+------------------------------------+
| table_name                         |
+------------------------------------+
| case_details                       |
| characteristics_occupation         |
| characteristics_onset_date_range   |
| ...                                |
+------------------------------------+

What was the most recent schema change?

select table_name, commit_hash, date, substring(message, 1, 25) as message 
  from dolt_diff 
 where schema_change=true 
 order by date desc 
 limit 1;
+--------------+----------------------------------+-----------------------------------+---------------------------+
| table_name   | commit_hash                      | date                              | message                   |
+--------------+----------------------------------+-----------------------------------+---------------------------+
| case_details | tof6ef5eotkj8q1pk90m1qlu8vfqti24 | 2020-03-18 14:47:34.561 -0700 PDT | Added case_details table  |
+--------------+----------------------------------+-----------------------------------+---------------------------+

What schemas haven’t changed in the past year/month/etc?

select table_name 
  from information_schema.tables 
 where table_type='BASE TABLE' 
   and table_name not in 
       (select table_name from dolt_diff where schema_change=true and date > (NOW() - INTERVAL 25 MONTH));
+------------+
| table_name |
+------------+
| cases      |
| places     |
+------------+

Which commits are merge commits in our current branch's history? (Notice here we're joining on the dolt_commit_ancestors system table to find commits that have two parents in order to identify merge commits.)

select distinct dolt_diff.commit_hash, date, substring(message, 1, 25) as message  
  from dolt_diff, dolt_commit_ancestors
 where dolt_diff.commit_hash=dolt_commit_ancestors.commit_hash
   and parent_index=1;
+----------------------------------+-----------------------------------+---------------------------+
| commit_hash                      | date                              | message                   |
+----------------------------------+-----------------------------------+---------------------------+
| mi1qv0f0o5bgc2u95j4fjrhu03jjhimi | 2020-08-23 00:24:30.533 -0700 PDT | Merging latest master     |
| 4maisihmkgogrupbnk1jhgbojmp8bcn9 | 2020-08-05 09:20:05.113 -0700 PDT | Merging latest master     |
| hp47lrlepu7mbf2jnclk952q5snkrh3f | 2020-08-05 08:47:19.393 -0700 PDT | Merging latest master     |
| rhisqcu0rk5an65dmsg3rpr2t9qgsv97 | 2020-08-05 08:00:48.429 -0700 PDT | Merging latest master     |
| ...                              | ...                               | ...                       |
+----------------------------------+-----------------------------------+---------------------------+

What tables did a specific user change in their first commit in the current branch's history?

select table_name 
  from dolt_diff 
 where commit_hash in (select commit_hash from dolt_diff where committer like 'Tim%' order by date asc limit 1);
+------------+
| table_name |
+------------+
| cases      |
| places     |
+------------+

What's the most popular day of the week for commits?

select date_format(date, '%W') as day_of_week, count(distinct commit_hash) as commit_count 
  from dolt_diff 
group by day_of_week 
order by commit_count desc;
+-------------+--------------+
| day_of_week | commit_count |
+-------------+--------------+
| Wednesday   | 153          |
| Thursday    | 145          |
| Monday      | 128          |
| Saturday    | 124          |
| Friday      | 124          |
| Tuesday     | 123          |
| Sunday      | 108          |
+-------------+--------------+

How many commits that changed data have been made to each table?

select table_name, count(distinct commit_hash) as commit_count 
  from dolt_diff 
 where data_change=true 
group by table_name 
order by table_name asc;
+------------------------+--------------+
| table_name             | commit_count |
+------------------------+--------------+
| case_details           | 452          |
| cases                  | 372          |
| characteristics_age    | 2            |
| ...                    | ...          |
+------------------------+--------------+

When was the first and last data-changing commit made to each table in our database? (For more info on Dolt's support for SQL window functions, check out our other blog posts on SQL Window Functions and SQL Window Framing)

select distinct table_name, 
                first_value(date) over (partition by table_name order by date asc) as first_update, 
                first_value(date) over (partition by table_name order by date desc) as last_update 
  from dolt_diff 
 where data_change=true 
 order by table_name asc;
+-----------------------+-----------------------------------+-----------------------------------+
| table_name            | first_update                      | last_update                       |
+-----------------------+-----------------------------------+-----------------------------------+
| case_details          | 2020-03-03 13:31:50.452 -0800 PST | 2020-08-05 11:51:20.362 -0700 PDT |
| cases                 | 2020-02-05 16:49:28.432 -0800 PST | 2020-10-12 18:30:45.112 -0700 PDT |
| characteristics_age   | 2020-03-02 10:28:06.836 -0800 PST | 2020-03-02 12:28:17.496 -0800 PST |
| ...                   | ...                               | ...                               |
+-----------------------+-----------------------------------+-----------------------------------+

That's a wrap!

I hope you enjoyed this first post exploring how the dolt_diff system table enables you to understand what's changing in your database. In future posts, we'll keep digging deeper and show how you can use more Dolt features to view diffs of the data in your tables, see how your database's schema has changed, and query where the current version of your table's data came from.

We love helping customers discover this Dolt magic and helping them find ways to harness it for their applications. There's a lot you can do with programmatic access to diff and history information, and we're excited to continue adding more features.

Do you have opinions on how data versioning and history features could be helpful in your work? Are there more Dolt features that would help you and your team? Come join us on Discord and let us know!
We’d love to hear from you and learn about your use cases and ideas.

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt

Or join our mailing list to get product updates.