Dolt Diff Magic: Part 3 – Diffing tables at two revisions

SQL
12 min read

Welcome back to the third post in this series exploring the tricks you can do with Dolt's magical versioning features! 🪄

In the first post of this series, we explored how the dolt_diff system table enables you to see what tables have changed, when they changed, and who changed them. In the second post of this series, we explored the dolt_diff_$tablename system tables and demonstrated how you can see each individual data change in your tables.
In this post, we're exploring use cases that require comparing a table's data at two revisions in order to calculate all the data differences between those two points in the commit history. As we'll see, this is a particularly useful tool for cases such as merging data across branches.

Dolt Magic

Introducing...

Before we jump into examples, let's quickly introduce the two Dolt features that we'll be using to compare table data at two revisions...

dolt_commit_diff_$tablename System Table

The primary tool we'll use to compare a table's data at two revisions is the
dolt_commit_diff_$tablename system table. This system table exists for every user created table in your database.
It works a bit differently from dolt_diff_$tablename that we looked at in part two of this series:

  • You must specify a to_commit and from_commit as filter expressions when using this table, otherwise you will get an error.
  • The commits you can work with aren’t limited to the current checked out branch – you can specify any two commits from any branches in your database. This makes this system table very useful for diffing changes to a table before you merge from another branch, as we'll see shortly.
  • dolt_commit_diff_$tablename shows you the cumulative diff between the two commits you specified – not a log of each individual delta applied in a commit. In other words, if you change a single row 10 times between to_commit and from_commit, you’ll only see one row returned from dolt_commit_diff_$tablename with the cumulative diff of all the changes to that row between those two commits. (If you don't remember how this is different from the dolt_diff_$tablename system table, check out the second post in this series.)

Let's look at a quick example, using the DoltDiffMagic database hosted on DoltHub. We can use the dolt_commit_diff_inventory system table to compare any two revisions of the data in our inventory table. By specifying two revisions in the to_commit and from_commit filter expressions, we can see exactly how each row in our table changed, including the previous values in each changed row at from_commit, the new values in each row at to_commit, commit metadata, and how the row changed (added, removed, or modified).

To try this out locally, install Dolt, clone the DoltDiffMagic database (dolt clone jfulghum/DoltDiffMagic), then launch the Dolt SQL shell from that new directory (dolt sql) to connect to the database. Alternatively, you can access the database on DoltHub and run this query and view the results directly from your browser.

select * 
  from dolt_commit_diff_inventory 
 where to_commit='d4q5m9ks9f8b4eq1rjskicnsaiq343cn' and 
       from_commit='0s185scjtbok85bg834embv7f7ueabkd';
+------------+----------+------------+----------+----------------------------------+-----------------------------------+-----------+------------+--------------+------------+----------------------------------+----------------------------------+-----------+
| to_name    | to_price | to_item_id | to_count | to_commit                        | to_commit_date                    | from_name | from_price | from_item_id | from_count | from_commit                      | from_commit_date                 | diff_type |
+------------+----------+------------+----------+----------------------------------+-----------------------------------+-----------+------------+--------------+------------+----------------------------------+----------------------------------+-----------+
| NULL       | NULL     | NULL       | NULL     | d4q5m9ks9f8b4eq1rjskicnsaiq343cn | 2022-03-31 16:02:54.613 +0000 UTC | soup      | 2          | 1            | 91         | 0s185scjtbok85bg834embv7f7ueabkd | 2022-03-18 21:24:44.55 +0000 UTC | removed   |
| blender    | 15       | 3          | 99       | d4q5m9ks9f8b4eq1rjskicnsaiq343cn | 2022-03-31 16:02:54.613 +0000 UTC | blender   | 20         | 3            | 80         | 0s185scjtbok85bg834embv7f7ueabkd | 2022-03-18 21:24:44.55 +0000 UTC | modified  |
| camera     | 50       | 4          | 6        | d4q5m9ks9f8b4eq1rjskicnsaiq343cn | 2022-03-31 16:02:54.613 +0000 UTC | camera    | 50         | 4            | 7          | 0s185scjtbok85bg834embv7f7ueabkd | 2022-03-18 21:24:44.55 +0000 UTC | modified  |
| walkman    | 10       | 5          | 11       | d4q5m9ks9f8b4eq1rjskicnsaiq343cn | 2022-03-31 16:02:54.613 +0000 UTC | walkman   | 10         | 5            | 12         | 0s185scjtbok85bg834embv7f7ueabkd | 2022-03-18 21:24:44.55 +0000 UTC | modified  |
| puppy chow | 15       | 6          | 8        | d4q5m9ks9f8b4eq1rjskicnsaiq343cn | 2022-03-31 16:02:54.613 +0000 UTC | NULL      | NULL       | NULL         | NULL       | 0s185scjtbok85bg834embv7f7ueabkd | 2022-03-18 21:24:44.55 +0000 UTC | added     |
| gum        | 1        | 7          | 12       | d4q5m9ks9f8b4eq1rjskicnsaiq343cn | 2022-03-31 16:02:54.613 +0000 UTC | NULL      | NULL       | NULL         | NULL       | 0s185scjtbok85bg834embv7f7ueabkd | 2022-03-18 21:24:44.55 +0000 UTC | added     |
+------------+----------+------------+----------+----------------------------------+-----------------------------------+-----------+------------+--------------+------------+----------------------------------+----------------------------------+-----------+

 

dolt_diff() Table Function

We'll primarily be using the dolt_commit_diff_$tablename system table in this post, but I also want to introduce a new feature we recently added to access diff information in SQL: the dolt_diff() system table function.

This new interface is very similar to the dolt_commit_diff_$tablename system table that we just learned about, but there are two key differences:

  • Instead of being exposed as a table, this functionality is exposed as a table function. This is our first table function in Dolt 🎉, and there are some restrictions on how table functions can currently be used in queries. We hope to expand table function support and add other table functions, such as JSON_TABLE in the future.
  • The dolt_diff() table function allows you to see the exact to and from schemas of your data. This is particularly helpful when you need to look at two revisions of your table that have different schemas. We'll see an example of this later.

Let's test out the dolt_diff() table function to calculate the same diff from the dolt_commit_diff_$tablename example above. You can run this query in your local clone of the DoltDiffMagic database, or you can execute it directly on DoltHub.

Notice that, as expected, the same diff data is returned, but the syntax for using the dolt_diff() table function is slightly different. We'll see a more advanced example of dolt_diff() later that shows why you'd want to consider using the dolt_diff() system table instead of the dolt_commit_diff_$tablename system table.

select * 
  from dolt_diff("inventory", '0s185scjtbok85bg834embv7f7ueabkd', 'd4q5m9ks9f8b4eq1rjskicnsaiq343cn');
+------------+----------+------------+----------+----------------------------------+-----------------------------------+-----------+------------+--------------+------------+----------------------------------+----------------------------------+-----------+
| to_name    | to_price | to_item_id | to_count | to_commit                        | to_commit_date                    | from_name | from_price | from_item_id | from_count | from_commit                      | from_commit_date                 | diff_type |
+------------+----------+------------+----------+----------------------------------+-----------------------------------+-----------+------------+--------------+------------+----------------------------------+----------------------------------+-----------+
| NULL       | NULL     | NULL       | NULL     | d4q5m9ks9f8b4eq1rjskicnsaiq343cn | 2022-03-31 16:02:54.613 +0000 UTC | soup      | 2          | 1            | 91         | 0s185scjtbok85bg834embv7f7ueabkd | 2022-03-18 21:24:44.55 +0000 UTC | removed   |
| blender    | 15       | 3          | 99       | d4q5m9ks9f8b4eq1rjskicnsaiq343cn | 2022-03-31 16:02:54.613 +0000 UTC | blender   | 20         | 3            | 80         | 0s185scjtbok85bg834embv7f7ueabkd | 2022-03-18 21:24:44.55 +0000 UTC | modified  |
| camera     | 50       | 4          | 6        | d4q5m9ks9f8b4eq1rjskicnsaiq343cn | 2022-03-31 16:02:54.613 +0000 UTC | camera    | 50         | 4            | 7          | 0s185scjtbok85bg834embv7f7ueabkd | 2022-03-18 21:24:44.55 +0000 UTC | modified  |
| walkman    | 10       | 5          | 11       | d4q5m9ks9f8b4eq1rjskicnsaiq343cn | 2022-03-31 16:02:54.613 +0000 UTC | walkman   | 10         | 5            | 12         | 0s185scjtbok85bg834embv7f7ueabkd | 2022-03-18 21:24:44.55 +0000 UTC | modified  |
| puppy chow | 15       | 6          | 8        | d4q5m9ks9f8b4eq1rjskicnsaiq343cn | 2022-03-31 16:02:54.613 +0000 UTC | NULL      | NULL       | NULL         | NULL       | 0s185scjtbok85bg834embv7f7ueabkd | 2022-03-18 21:24:44.55 +0000 UTC | added     |
| gum        | 1        | 7          | 12       | d4q5m9ks9f8b4eq1rjskicnsaiq343cn | 2022-03-31 16:02:54.613 +0000 UTC | NULL      | NULL       | NULL         | NULL       | 0s185scjtbok85bg834embv7f7ueabkd | 2022-03-18 21:24:44.55 +0000 UTC | added     |
+------------+----------+------------+----------+----------------------------------+-----------------------------------+-----------+------------+--------------+------------+----------------------------------+----------------------------------+-----------+

Diffs in Action

Alright, now that you know about the dolt_commit_diff_$tablename system tables and the dolt_diff() system table function, let's put them to use in a couple of realistic scenarios where we need to inspect the differences in our data.

Merging Branches

The first big use case for diffing table data at two revisions that we'll look at today is merging changes across branches. Branches are a powerful feature in Dolt that allows you to logically organize and isolate your changes until they're ready to be merged with your main branch, or any other branch. When you merge changes across branches, you need to review those changes before merging to catch any problems. Using dolt_commit_diff_$tablename makes it super easy to view those diffs as part of your merging process.

Continuing our inventory example from the previous post in this series, let's say that our store does quarterly inventory checks to count all the physical inventory and update the state in our database. Counting the inventory is done incrementally on a separate branch (inventory/2022-Q2) and then after all the inventory is counted, those changes are merged to the main branch to go live.

If you checkout both the main and inventory/2022-Q2 branches, you can run this query locally against Dolt, or you can execute it directly on DoltHub in your web browser.

select to_name, to_item_id, from_count, to_count
  from dolt_commit_diff_inventory 
 where from_commit=HASHOF('main') 
   and to_commit=HASHOF('inventory/2022-Q2');
+----------+------------+------------+----------+
| to_name  | to_item_id | from_count | to_count |
+----------+------------+------------+----------+
| nintendo | 2          | 9          | 8        |
| walkman  | 5          | 11         | -11      |
| gum      | 7          | 12         | 13       |
+----------+------------+------------+----------+

It looks like the data entry for walkman inventory was messed up! Someone must have accidentally entered that negative sign when recording the inventory count. Good thing we diffed our branch against main before we merged these changes over!

Comparing Data Across Schema Changes

Generally, we recommend using dolt_commit_diff_$tablename, but when the schema of a table has significant changes between the to_commit and from_commit revisions, you can use the dolt_diff() system table function to see the full schema as it existed at both the to_commit and from_commit revisions of your table.

Let's check out an example of diff'ing across schema changes. In our DoltDiffMagic database, we've got another table called store_hours that holds a text representation of the store's hours:

+-------------+--------------+------+-----+---------+-------+
| Field       | Type         | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+-------+
| day_of_week | int          | NO   | PRI |         |       |
| hours       | varchar(255) | NO   |     |         |       |
+-------------+--------------+------+-----+---------+-------+

On the v2 branch of our database, we've moved to a new and improved schema:

+-------------+------+------+-----+---------+-------+
| Field       | Type | Null | Key | Default | Extra |
+-------------+------+------+-----+---------+-------+
| day_of_week | int  | NO   | PRI |         |       |
| open        | int  | YES  |     |         |       |
| close       | int  | YES  |     |         |       |
+-------------+------+------+-----+---------+-------+

When we use dolt_commit_diff_$tablename, the schema at the tip of the current branch is used to create the to_ and from_ columns. As you can see below, when we run the query below with dolt_commit_diff_store_hours, we can't see all the columns in the two versions of the tables, because the schema has diverged significantly.

select to_day_of_week, from_hours, to_hours 
  from dolt_commit_diff_store_hours 
 where from_commit=hashof("main") and to_commit=hashof("v2");
+----------------+------------+----------+
| to_day_of_week | from_hours | to_hours |
+----------------+------------+----------+
| 0              | 12 to 6    | NULL     |
| 1              | 10 to 6    | NULL     |
| 2              | 10 to 6    | NULL     |
| 3              | 10 to 6    | NULL     |
| 4              | 10 to 6    | NULL     |
| 5              | 10 to 6    | NULL     |
| 6              | closed     | NULL     |
+----------------+------------+----------+

The dolt_diff() table function is able to generate the to_ and from_ columns more flexibly though – it includes a to_ column for every column in the table's schema at the to revision, and a from_ column for every column in the table's schema at the from revision you specify, allowing you to see the full changes between the two revisions.

select to_day_of_week, from_hours, to_open, to_close
  from dolt_diff("store_hours", "main", "v2");
+----------------+------------+---------+----------+
| to_day_of_week | from_hours | to_open | to_close |
+----------------+------------+---------+----------+
| 0              | 12 to 6    | 12      | 18       |
| 1              | 10 to 6    | 10      | 18       |
| 2              | 10 to 6    | 10      | 18       |
| 3              | 10 to 6    | 10      | 18       |
| 4              | 10 to 6    | 10      | 18       |
| 5              | 10 to 6    | 10      | 18       |
| 6              | closed     | NULL    | NULL     |
+----------------+------------+---------+----------+

Presto!

That wraps up this third post on Dolt's diff features. I hope you enjoyed learning about the dolt_commit_diff_$tablename system table, the dolt_diff() system table function, and how they let you easily diff the contents of your tables across two revisions or branches in your Dolt database.

Got questions on how to get started with Dolt, or how to use any of Dolt's diff features? Come join us on Discord and let us know! We’d love to connect with you and help you get started using Dolt!

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt

Or join our mailing list to get product updates.