We started Dolt as a version controlled SQL database because we believed that all data benefits from version control, not just code. Our original tagline was “Git for Data”. I think it’s a great tagline, but it doesn’t really convey why someone might want to use Dolt, which is why we like exploring specific use cases that Dolt empowers and telling stories that way.

Recently, we’ve noticed that Dolt’s underlying design is good for more than just tables, and that version controlling data is good for more than just humans. And we reflect that in how we write about it.

Dolt is the database for agents: If you use AI agents in your workflow, you need to isolate their changes from your own, and use branch merging to integrate their changes.
Dolt is the database for versioned JSON: Dolt’s data storage works for more than just tables, and we believe that Dolt is a competitive way to efficiently store and operate on JSON data, especially if you want to version it.

We recently chatted with a user who was impressed by how Dolt was able to efficiently merge changes to large JSON documents, but expressed confusion about which kinds of merges are automatic and instant, which kinds of merges are automatic but not instant, and what types of merges produce conflicts that require manual resolution by the user.

Background#

Dolt’s biggest feature is version controlled tables. You can branch your database, make changes on both branches, and then merge those changes back together. In most cases, the changes can be merged automatically without requiring manual resolution from the user. This works because of Prolly Trees, a novel data structure that allows for space-efficient storage of large similar maps and time-efficient comparisons.[1]

We realized that the same techniques that we use for diffing and merging table histories work just as well for diffing and merging JSON documents. This means that comparing a JSON document on two different branches or two different commits should be fast. Merging two branches should automatically combine the changes made to the document on each branch, and that should also be fast. And of course, operations that modify JSON documents should be fast as well. More precisely, each of these operations should scale with the size of the changes, regardless of the total size of the document.

And this is the case for most operations. However, a small number of cases still require manual conflict resolution, or run more slowly on large documents. We believe this can be improved, and we will keep improving it if there’s user demand.

Until then, we wanted to document exactly which operations can be efficiently diffed and merged and which operations can’t. We want users to be able to have a strong intuition about what kinds of document changes are most amenable to efficient diffing and merging.

When Does Dolt Excel?#

The general rule of thumb is to imagine a document as a table that maps paths in the document to the leaf value stored at that path. This is not actually how documents are stored, but it provides a good mental model. Conflicts happen when the same path is modified on both branches.
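To make the mental model concrete, here is a minimal Python sketch that flattens a parsed JSON document into a path-to-leaf table. The `flatten` helper and the sample document are our own illustration, not Dolt's storage format, and the sketch ignores empty objects and arrays for simplicity.

```python
import json

def flatten(value, path=()):
    """Yield (path, leaf) pairs for every scalar in a parsed JSON value.

    Illustrative only: this is the mental model, not how Dolt stores JSON.
    Empty objects and arrays produce no rows in this simplified version.
    """
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, path + (key,))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten(child, path + (index,))
    else:
        yield path, value

doc = json.loads('{"user": {"name": "ada", "tags": ["x", "y"]}, "active": true}')
table = dict(flatten(doc))

for path, leaf in table.items():
    print(path, "->", leaf)
```

Each row of `table` is one path and the leaf stored there, so a change to the document shows up as a change to a small number of rows.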

This means that simply adding and removing object keys from your document will always be efficient, and will almost never result in a merge conflict unless there’s an obvious contradiction between the branches, such as setting the same leaf to two different values or editing a leaf in one branch while deleting it in the other. Manual resolution will always be required to handle such contradictions, just like in Git. In all other cases, Dolt will do exactly what you want.
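As a rough illustration of that rule, the sketch below runs a three-way merge over flat path-to-leaf maps. The `merge_paths` function, the `MISSING` sentinel, and the `$.`-style path strings are assumptions made for this example; Dolt's actual merger works on its own internal structures.

```python
MISSING = object()  # sentinel meaning "this path does not exist on this side"

def merge_paths(base, ours, theirs):
    """Three-way merge of flat {path: leaf} maps; returns (merged, conflicts)."""
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b = base.get(path, MISSING)
        o = ours.get(path, MISSING)
        t = theirs.get(path, MISSING)
        if o == t or t == b:
            result = o              # both sides agree, or only ours changed
        elif o == b:
            result = t              # only theirs changed
        else:
            conflicts.append(path)  # both sides changed this path differently
            continue
        if result is not MISSING:
            merged[path] = result
    return merged, conflicts

base = {"$.title": "draft", "$.owner": "ann"}
theirs = {"$.title": "final", "$.size": 3}  # edits title, deletes owner, adds size

# Ours only adds a new key: everything merges cleanly.
ours_clean = {"$.title": "draft", "$.owner": "ann", "$.tags.env": "prod"}
clean_merged, clean_conflicts = merge_paths(base, ours_clean, theirs)

# Ours edits the same title and edits the owner that theirs deleted.
ours_clash = {"$.title": "v2", "$.owner": "bob"}
_, clash_conflicts = merge_paths(base, ours_clash, theirs)

print(clean_merged, clean_conflicts)
print(clash_conflicts)
```

The first merge combines the key additions, the edit, and the deletion with no conflicts. The second reports both contradictions from the text: the same leaf set to two different values, and a leaf edited on one side while deleted on the other.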

Considerations When Updating a Document#

Dolt’s internal representation for JSON is designed to make most update operations fast, but there are a couple of operations that currently don’t benefit from this.

Renaming a Document Key#

Because Dolt uses the path within a document to identify values, renaming a key is equivalent to removing the original key-value pair and inserting a new key-value pair. If the value at that key is large, then the resulting deletion and insertion will be large as well. In this case, the time required will scale with the size of the value being moved.

This also means that diffing a document after a rename operation will display a large removal and a large insert, and rendering this diff will take time proportional to the size of the moved object.
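Under the path-based mental model, the cost of a rename is easy to see. The sketch below is a simplified illustration with hypothetical helper names (`two_level_paths`, `path_diff`) and a made-up document that is exactly two levels deep. It renames a key holding 1,000 leaves and counts the resulting path changes.

```python
def two_level_paths(doc):
    """Map "$.outer.inner" paths to leaves for a document exactly two levels deep."""
    return {
        f"$.{outer}.{inner}": leaf
        for outer, obj in doc.items()
        for inner, leaf in obj.items()
    }

def path_diff(before, after):
    """Return (removed, added, modified) path lists between two flat maps."""
    removed = sorted(before.keys() - after.keys())
    added = sorted(after.keys() - before.keys())
    modified = sorted(p for p in before.keys() & after.keys() if before[p] != after[p])
    return removed, added, modified

settings = {f"option_{i}": i for i in range(1000)}
before = {"settings": settings, "meta": {"version": 1}}
after = {"config": settings, "meta": {"version": 1}}  # "settings" renamed to "config"

removed, added, modified = path_diff(two_level_paths(before), two_level_paths(after))
print(len(removed), len(added), len(modified))  # 1000 1000 0
```

No leaf value changed, yet every path under the renamed key shows up once as a removal and once as an insertion, which is why the work scales with the size of the moved value.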

This currently has a similar impact on merge performance, although we believe that further improvements to the JSON merger may be able to mitigate much of this.

Inserting or Removing at the Beginning of an Array#

Inserting into the beginning of an array effectively renames every element of that array:

The previous value at array[0] is renamed to array[1],
The previous value at array[1] is renamed to array[2],
…and so on.

As a consequence, when a value is added or removed at the beginning of an array, the entire array gets rewritten. Additionally, both the differ and the merger will interpret this as a modification to every element in the array.

If you’re worried about this outcome, we recommend always appending to the end of arrays in your data. If you anticipate needing to drop arbitrary elements from an array, consider using an object with named keys instead.
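The difference between these layouts can be sketched with a position-by-position comparison. The `changed_indexes` helper and the sample data are hypothetical, and the comparison is a model of how index-based paths behave rather than Dolt's differ.

```python
def changed_indexes(before, after):
    """Indexes whose value differs when two arrays are compared position by position."""
    missing = object()  # stands in for "no element at this index"
    longest = max(len(before), len(after))
    return [
        i for i in range(longest)
        if (before[i] if i < len(before) else missing)
        != (after[i] if i < len(after) else missing)
    ]

events = [f"event-{i}" for i in range(5)]

prepend_changes = changed_indexes(events, ["event-new"] + events)
append_changes = changed_indexes(events, events + ["event-new"])
drop_first_changes = changed_indexes(events, events[1:])

# The same data as an object with named keys: dropping one entry touches one path.
keyed = {f"id-{i}": event for i, event in enumerate(events)}
keyed_without_first = {k: v for k, v in keyed.items() if k != "id-0"}
keyed_changes = [k for k in keyed if keyed.get(k) != keyed_without_first.get(k)]

print(prepend_changes)     # [0, 1, 2, 3, 4, 5]
print(append_changes)      # [5]
print(drop_first_changes)  # [0, 1, 2, 3, 4]
print(keyed_changes)       # ['id-0']
```

Appending changes a single index, while prepending or dropping the first element shifts every index. With named keys, removing an entry leaves all the other paths untouched.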

Outside of these two operations, all the other standard JSON operations will be fast.

Considerations With Automatic Merge Resolution#

Just like with Git, some changes produce merge conflicts that need to be resolved manually, even if the changes aren’t modifying the same value.

In the simplest example, imagine a JSON document that initially contains an empty array. Both branches insert a single element into this array.

Combining the changes means inserting both elements into the array, but it’s ambiguous what order they should be in. It’s also possible that the changes are incompatible: it could be the case that both branches intended for their inserted value to be the first value in the array, and putting that value second would violate the semantic meaning of the change. Without additional context, it’s simply not possible to perform this merge automatically: manual merge resolution is required.

This case will likely never be automatically merged, simply because it would require the merger to make assumptions about the ordering of the inserts, and there’s an unacceptable risk of accidentally changing the meaning of the data. If you made a similar change to a document in Git by inserting two different lines at the same line number, Git would also report a conflict. So we follow Git’s example and require manual resolution.

However, this isn’t the whole story. Dolt is actually a bit more conservative than Git when it comes to automatic merge resolution, and there are situations where Dolt will report a merge conflict that must be resolved manually, even though an analogous operation in Git would merge automatically. All of these situations have to do with both branches modifying the same array.

For example, suppose Dolt contains a JSON document with the value [0, 1, 0], and on one branch this value was edited to become [1, 0, 1]. How should Dolt interpret this diff?

A 0 was removed from the start of the array, and a 1 was inserted at the end.
A 0 was removed from the end of the array, and a 1 was inserted at the start.
All three elements in the array were individually modified.

Now assume that on another branch, only the first element of the array is modified, resulting in the array [2, 1, 0]. What happens if we merge these two branches together?

The result of the merge is different for each of these three cases.

If we assume case 1, then we have a merge conflict: the first element was modified on one branch and deleted on the other.
If we assume case 2, then we successfully merge and get the value [1, 2, 1].
If we assume case 3, then we have a merge conflict: the first element was modified on both branches and set to different values.
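Cases 1 and 2 can be checked by hand with a small sketch. Here each interpretation of the [1, 0, 1] edit is written as a script of "keep this base element" and "insert this new value" steps, and the other branch's edit is a map from base index to new value. The `merge_with_alignment` function and this script encoding are our own illustration, not how Dolt or Git represent diffs.

```python
def merge_with_alignment(base, left_script, right_mods):
    """Apply the right branch's in-place edits through one reading of the left edit.

    left_script: the left array as ("keep", base_index) or ("new", value) steps.
    right_mods:  {base_index: new_value} for elements the right branch edited.
    Returns (merged, conflicting_base_indexes).
    """
    kept = {arg for op, arg in left_script if op == "keep"}
    # An element edited on the right but dropped on the left is a contradiction.
    clashes = sorted(i for i in right_mods if i not in kept)
    if clashes:
        return None, clashes
    merged = []
    for op, arg in left_script:
        if op == "new":
            merged.append(arg)
        else:
            merged.append(right_mods.get(arg, base[arg]))
    return merged, []

base = [0, 1, 0]
right_mods = {0: 2}  # the other branch turns [0, 1, 0] into [2, 1, 0]

# Case 1: drop the leading 0, append a 1. Case 2: drop the trailing 0, prepend a 1.
case_1 = [("keep", 1), ("keep", 2), ("new", 1)]
case_2 = [("new", 1), ("keep", 0), ("keep", 1)]

result_1 = merge_with_alignment(base, case_1, right_mods)
result_2 = merge_with_alignment(base, case_2, right_mods)
print(result_1)  # (None, [0])
print(result_2)  # ([1, 2, 1], [])
```

Both scripts produce the same left-hand array, [1, 0, 1], yet one reading conflicts and the other merges cleanly.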

In the analogous Git situation of a three-line file, Git would assume either case 1 or case 2, because it will choose simpler diffs over more complicated ones. Thus, there may or may not be a merge conflict depending on the specific diff that Git chooses.
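To see a sequence differ commit to one interpretation, we can use Python's standard `difflib.SequenceMatcher`. This is not Git's diff algorithm, only a readily available stand-in that also prefers long runs of unchanged elements.

```python
from difflib import SequenceMatcher

base = [0, 1, 0]
edited = [1, 0, 1]

# autojunk is disabled so the heuristic for long sequences plays no role here.
opcodes = SequenceMatcher(None, base, edited, autojunk=False).get_opcodes()

for tag, i1, i2, j1, j2 in opcodes:
    print(tag, base[i1:i2], "->", edited[j1:j2])
```

On this input the matcher reports an insert, an equal run, and a delete rather than three replacements. It has picked one of the shift interpretations even though the other is equally short, and a merge built on top of it inherits that choice.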

In comparison, Dolt currently treats each element of an array separately, and thus will always choose case 3 and always report a merge conflict here. We can see this for ourselves with the following steps:

dolt sql -q "create table test(pk int primary key, j json)"
dolt sql -q "insert into test values (0, '[0, 1, 0]')"
dolt commit -Am "create test table"
dolt branch right
dolt checkout -b left
dolt sql -q "update test set j = '[1, 0, 1]'"
dolt commit -Am "left update"
dolt checkout right
dolt sql -q "update test set j = '[2, 1, 0]'"
dolt commit -Am "right update"
dolt merge left

This will produce the following output:

Auto-merging test
CONFLICT (content): Merge conflict in test
Automatic merge failed; 1 table(s) are unmerged.
Use 'dolt conflicts' to investigate and resolve conflicts.

And we can use the dolt conflicts cat command to see the conflict:

> dolt conflicts cat test
+-----+--------+----+---------+
|     |        | pk | j       |
+-----+--------+----+---------+
|     | base   | 0  | [0,1,0] |
|  *  | ours   | 0  | [2,1,0] |
|  *  | theirs | 0  | [1,0,1] |
+-----+--------+----+---------+

(Unfortunately, the interface doesn’t show us why these JSON values conflict, only that they do. The clarity and context of merge conflict messages is something we want to improve.)
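We can reconstruct where the conflict comes from with a small model of the element-by-element interpretation (case 3). The `merge_elementwise` function below is a sketch of that reasoning under our own simplifying assumptions (equal-length arrays, one value per index), not Dolt's merger code.

```python
def merge_elementwise(base, ours, theirs):
    """Three-way merge that treats every array index as an independent value."""
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this sketch only handles equal-length arrays")
    merged, conflicts = [], []
    for i, (b, o, t) in enumerate(zip(base, ours, theirs)):
        if o == t or t == b:
            merged.append(o)     # both sides agree, or only ours changed
        elif o == b:
            merged.append(t)     # only theirs changed
        else:
            merged.append(None)  # placeholder: both sides changed this index
            conflicts.append(i)
    return merged, conflicts

# Same values as the conflict table: base, ours (right branch), theirs (left branch).
merged, conflicts = merge_elementwise([0, 1, 0], [2, 1, 0], [1, 0, 1])
print(merged, conflicts)  # [None, 0, 1] [0]
```

In this model, indexes 1 and 2 were changed on only one side and merge cleanly, while index 0 was set to 2 on one branch and 1 on the other.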

Making this merge a conflict is a deliberate choice that we made to reduce the chance of an automatic merge that changes the meaning of the data.

We call a merge like that a mis-merge. Mis-merges are a potential risk with Git too, but the consequences are less severe: in most cases where a mis-merge happens, the resulting file is not syntactically valid source code and won’t compile. But when you’re versioning data instead of code, the risk posed by a mis-merge is higher.

This was a hypothetical example of an extreme case. There are other cases where two branches perform operations on the same array, but the intended merge result is reasonably intuitive. But because of the behavior described in the Inserting or Removing at the Beginning of an Array section above, Dolt will still usually interpret these changes as a conflict and require manual resolution.

This happens because technically, all diffs on arrays are ambiguous: when diffing two arrays, we’re trying to identify the sequence of changes that can be used to produce one array from the other. But in most cases, there are actually multiple different possible sequences that could be used, and the choice has different implications for resolving merge conflicts. We’ve decided that Dolt should not try to make assumptions about the semantic meaning of data under merge, and leave those decisions to human users.

That said, we are currently considering allowing users to automatically merge more arrays by opting in to the same Longest Common Subsequence approximation algorithms used in Git to identify the “simplest” diff between two arrays. These algorithms work well enough in most cases, but users should be mindful of the fact that mis-merges are still possible with them.

Until Next Time#

And that’s everything you need to know to understand why Dolt might not optimize for certain JSON update operations or might not automatically merge your JSON documents.

If you have any questions about whether or not this applies to you, feel free to join our Discord server and ask. If you have a feature request, let us know so we can hash it out with you: we take user requests very seriously to help us decide what to prioritize.