Diffing Documents with DOLT_JSON_DIFF

FEATURE RELEASE

6 min read

Dolt is a version controlled SQL database: think Git but for data. This includes the ability to create and merge branches with automatic merge resolution for concurrent changes.

Dolt also works well as a version-controlled document store: if your database contains structured data types like JSON, Dolt will merge concurrent changes to those too.

In order to facilitate JSON merging, we wrote a custom serialization format optimized for data reads and point updates, and a custom algorithm for comparing documents stored in this format.

Some users got really excited about this and asked for the ability to generate these diffs themselves. We thought that was a great idea.

So we’re exposing this diffing functionality in the form of the new DOLT_JSON_DIFF system table function.

Dolt has many special system tables and system table functions: in fact, the primary way of interacting with Dolt’s version control features is through the SQL engine, via special tables, procedures, and functions. While this may seem unusual, we’ve discovered this is actually a great way to access and manipulate this data, since it leverages the full expressivity of SQL and enables the user to easily compose system information and use it to write queries.

DOLT_JSON_DIFF offers a great example of this concept. We’re going to show how this new table function can be used to visualize changes to JSON documents over time.

Example 1: A Simple Query#

Let’s start by looking at the simplest possible example: comparing two documents embedded as constants within a query:

SET @BEFORE_DOC = '{
    "contains_added_key": {},
    "contains_edited_key": {"key": "original_value"},
    "contains_removed_key": {"key": "removed_value"}
}';
SET @AFTER_DOC = '{
    "contains_added_key": {"key": "added_value"},
    "contains_edited_key": {"key": "new_value"},
    "contains_removed_key": {}
}';
SELECT * FROM DOLT_JSON_DIFF(@BEFORE_DOC, @AFTER_DOC);

The result of running this query looks like this:

+-----------+----------------------------+------------------+---------------+
| diff_type | path                       | from_value       | to_value      |
+-----------+----------------------------+------------------+---------------+
| added     | $.contains_added_key.key   | NULL             | "added_value" |
| modified  | $.contains_edited_key.key  | "original_value" | "new_value"   |
| removed   | $.contains_removed_key.key | "removed_value"  | NULL          |
+-----------+----------------------------+------------------+---------------+

This is a good point to talk about the schema of this table function:

+-----------------+---------+
| field           | type    |
+-----------------+---------+
| diff_type       | TEXT    |
| path            | TEXT    |
| from_value      | JSON    |
| to_value        | JSON    |
+-----------------+---------+

diff_type is equal to one of the strings "added", "modified", or "removed", and describes what kind of change is happening in the row.
path is a string conforming to MySQL’s JSON Path syntax, describing the location in the document where the change occurred.
from_value is the JSON value that existed at that path prior to the change (or NULL if the change is an addition)
to_value is the JSON value that existed at that path after the change (or NULL if the change is a removal)

In practice, however, you’re not going to be comparing two specific documents, but rather documents read from a table. You may want to call DOLT_JSON_DIFF once for every row of some other table. This can be accomplished with a lateral join.

Example 2: Identifying all Unstaged Changes#

Consider we start with a table inventory in a database on main branch.

Here is the schema of inventory at the tip of main:

+----------+-------------+------+-----+---------+-------+
| Field    | Type        | Null | Key | Default | Extra |
+----------+-------------+------+-----+---------+-------+
| pk       | int         | NO   | PRI | NULL    |       |
| name     | varchar(50) | YES  |     | NULL    |       |
| metadata | json        | YES  |     | NULL    |       |
+----------+-------------+------+-----+---------+-------+

And here’s the initial state of inventory has at the tip of main:

+----+-------+----------------------------------------------------------------+
| pk | name  | metadata.                                                      |
+----+-------+----------------------------------------------------------------+
| 1  | shirt | {"colors": ["red"] }                                           |
| 2  | shoes | {"colors": ["black"], "size": "small" }                        |
| 3  | pants | {"colors": ["blue", "beige"], "materials": ["denim", "silk"] } |
| 4  | tie   | {"colours": ["red"], "clip-on": true }                         |
+----+-------+----------------------------------------------------------------+

We then create (but don’t stage) a number of different changes, resulting in a working set that looks like this:

+----+-------+------------------+
| pk | name  | metadata.        |
+----+-------+------------------+
| 1  | shirt | {"colors": ["red", "blue"], "types": ["tee", "hawaiian"] }  |
| 2  | shoes | {"colors": ["white"], "size": "medium" } |
| 3  | pants | {"colors": ["blue"] } |
| 4  | tie | { "colors": ["red"], "clip-on": false }  |
+----+-------+------------------+

We added values to the “shirt” document, edited data in the “shoes” document, deleted data from the “pants” document, and renamed a key in the “tie” document.

If we want to get a list of every unstaged change made to any value in the metadata column, we can combine the DOLT_JSON_DIFF() table function with the DOLT_WORKSPACE_inventory table, via a lateral join:

SELECT
    to_pk as pk,
    to_name as name,
    json_diff.diff_type as json_diff_type,
    row_diff.from_metadata,
    row_diff.to_metadata,
    json_diff.path,
    json_diff.from_value,
    json_diff.to_value
FROM
    DOLT_WORKSPACE_inventory AS row_diff
    JOIN
    lateral (SELECT * FROM DOLT_JSON_DIFF(from_metadata, to_metadata)) AS json_diff
WHERE row_diff.diff_type = 'modified' and row_diff.staged = false;

The results of the query provide a summary of only the parts of the JSON documents that have changed between the staged version and the working set:

+----+-------+----------------+----------------------------------------------------------+------------------------------------------------------+-------------+------------------+--------------------+
| pk | name  | json_diff_type | from_metadata                                            | to_metadata                                          | path        | from_value       | to_value           |
+----+-------+----------------+----------------------------------------------------------+------------------------------------------------------+-------------+------------------+--------------------+
| 0  | shirt | added          | {"colors":["red"]}                                       | {"colors":["red","blue"],"types":["tee","hawaiian"]} | $.colors[1] | NULL             | "blue"             |
| 0  | shirt | added          | {"colors":["red"]}                                       | {"colors":["red","blue"],"types":["tee","hawaiian"]} | $.types     | NULL             | ["tee","hawaiian"] |
| 1  | shoes | modified       | {"colors":["black"],"size":"small"}                      | {"colors":["white"],"size":"medium"}                 | $.colors[0] | "black"          | "white"            |
| 1  | shoes | modified       | {"colors":["black"],"size":"small"}                      | {"colors":["white"],"size":"medium"}                 | $.size      | "small"          | "medium"           |
| 2  | pants | removed        | {"colors":["blue","beige"],"materials":["denim","silk"]} | {"colors":["blue"]}                                  | $.colors[1] | "beige"          | NULL               |
| 2  | pants | removed        | {"colors":["blue","beige"],"materials":["denim","silk"]} | {"colors":["blue"]}                                  | $.materials | ["denim","silk"] | NULL               |
| 3  | tie   | modified       | {"clip-on":true,"colours":["red"]}                       | {"clip-on":false,"colors":["red"]}                   | $.clip-on   | true             | false              |
| 3  | tie   | added          | {"clip-on":true,"colours":["red"]}                       | {"clip-on":false,"colors":["red"]}                   | $.colors    | NULL             | ["red"]            |
| 3  | tie   | removed        | {"clip-on":true,"colours":["red"]}                       | {"clip-on":false,"colors":["red"]}                   | $.colours   | ["red"]          | NULL               |
+----+-------+----------------+----------------------------------------------------------+------------------------------------------------------+-------------+------------------+--------------------+

Note how multiple changes in a single row of the inventory table are rendered as multiple rows in the result. When multiple keys in the same object have changed, DOLT_JSON_DIFF reports an individual diff for each key, instead of reporting a single diff for the entire object.

Arrays are diffed by considering each index of the array separately. This means that inserting or removing values in an array anywhere other than the end will shift the indexes of each element, and will be reported as a modification at each index where the value changed.

Finally, let’s look at one last example: visualizing the entire version history of a single JSON document.

Example 3: Viewing the Version History#

Similar to our previous example, we have an inventory table:

Here is the schema of inventory at the tip of main:

+----------+-------------+------+-----+---------+-------+
| Field    | Type        | Null | Key | Default | Extra |
+----------+-------------+------+-----+---------+-------+
| pk       | int         | NO   | PRI | NULL    |       |
| name     | varchar(50) | YES  |     | NULL    |       |
| metadata | json        | YES  |     | NULL    |       |
+----------+-------------+------+-----+---------+-------+

However this time we’re only going to consider a single row that changes over multiple commits:

> select * from inventory as of HEAD~2 where pk = 1
+----+-------+------------------+
| pk | name  | metadata.        |
+----+-------+------------------+
| 1  | shirt | {"colors": ["red"] }  |
+----+-------+------------------+

> select * from inventory as of HEAD~1 where pk = 1
+----+-------+------------------+
| pk | name  | metadata.        |
+----+-------+------------------+
| 1  | shirt | {"colors": ["red", "blue"], "types": ["tee"] }  |
+----+-------+------------------+

> select * from inventory as of HEAD where pk = 1
+----+-------+------------------+
| pk | name  | metadata         |
+----+-------+------------------+
| 1  | shirt | {"types": ["hawaiian"] }  |
+----+-------+------------------+

We want to track the changes that were made at each commit. We can use the system table DOLT_DIFF_$tablename to find every time that a row changed in a commit, and we can visualize the JSON changes for each by performing another lateral join with DOLT_JSON_DIFF:

SELECT
    from_commit,
    to_commit,
    to_pk as pk,
    to_name as name,
    json_diff.diff_type as json_diff_type,
    row_diff.from_metadata,
    row_diff.to_metadata,
    json_diff.path,
    json_diff.from_value,
    json_diff.to_value
FROM DOLT_DIFF_inventory AS row_diff JOIN LATERAL (SELECT * FROM DOLT_JSON_DIFF(from_metadata, to_metadata)) AS json_diff WHERE to_pk = 1;

This produces a result set that shows every change made to the document over the branch’s history, ordered by the commit that introduced that change, starting with the most recent:

+----------------------------------+----------------------------------+----+-------+----------------+------------------------------------------------------+------------------------------------------------------+-------------+----------------+---------------------+ | from_commit | to_commit | pk | name | json_diff_type | from_metadata | to_metadata | path | from_value | to_value | +----------------------------------+----------------------------------+----+-------+----------------+------------------------------------------------------+------------------------------------------------------+-------------+----------------+---------------------+ | rna6nevoe6pa14acsc54g501nk47niff | 0td6msjeflhomjg2rhtb372jjlbj6s6m | 4 | shirt | removed | {“colors”:[“red”,“blue”],“types”:[“tee”,“hawaiian”]} | {“types”:[“tee”]} | $.colors | [“red”,“blue”] | NULL | | rna6nevoe6pa14acsc54g501nk47niff | 0td6msjeflhomjg2rhtb372jjlbj6s6m | 4 | shirt | removed | {“colors”:[“red”,“blue”],“types”:[“tee”,“hawaiian”]} | {“types”:[“tee”]} | $.types[1] | “hawaiian” | NULL | | crfdq7u3cugcafh2n7rebajtmckv0mle | rna6nevoe6pa14acsc54g501nk47niff | 4 | shirt | added | {“colors”:[“red”]} | {“colors”:[“red”,“blue”],“types”:[“tee”,“hawaiian”]} | $.colors[1] | NULL | “blue” | | crfdq7u3cugcafh2n7rebajtmckv0mle | rna6nevoe6pa14acsc54g501nk47niff | 4 | shirt | added | {“colors”:[“red”]} | {“colors”:[“red”,“blue”],“types”:[“tee”,“hawaiian”]} | $.types | NULL | [“tee”,“hawaiian”] | +----------------------------------+----------------------------------+----+-------+----------------+------------------------------------------------------+------------------------------------------------------+-------------+----------------+---------------------+

Conclusion#

It’s easy to image how this is useful when you’re working with large documents, especially when the documents are much larger than the individual updates that are being made to them.

Dolt’s serialization format for JSON documents indexes them under the hood. This means that it doesn’t need to load the entire documents in order to compute these diffs. If the individual changes are small, then producing the diff is cheap and fast.

Exposing this functionality as a table function allows for flexibility in how this data is queried and used. I provided two examples here, but there’s many more scenarios where you would want to compare documents. I’m excited to see what creative applications users come up with.

If you’re not sure if DOLT_JSON_DIFF is what you’re looking for, drop us a line on Discord and we can chat about your use case. If you have other ideas for functionality that Dolt should expose via system tables, let us know and we’ll explore it together.

That’s all for now. I work on Dolt because I sincerely think it’s a cool and useful tool for solving novel problems. We want it to inspire people, which is why the project is open-source. I hope that you all find this as valuable as I do.

Blog

Example 1: A Simple Query#

Example 2: Identifying all Unstaged Changes#

Example 3: Viewing the Version History#

Conclusion#

Get started with Dolt