So you want Data Merge?

REFERENCE
7 min read

Here at DoltHub, we've had a lot of success with our "So you want..." series of blog posts helping people find Dolt when they are looking for us. Dolt is a lot of things. Dolt is a version controlled database, a Git database, Git for data, data version control, an immutable database, and a decentralized database.

One of the primary features of a version controlled database is you can merge two copies of the data stored in it in a principled way, "data merge", so to speak. Much to our surprise there is a popular product called "data merge" that dominates the search results.

Data Merge Search

We have new technology. We've come to take the "data merge" term back! This blog is our attempt.

Data Merge is not Mail Merge!

It turns out there's a product called data merge in Adobe InDesign. It was originally called mail merge and it was rebranded. WTF!

Mail merge is basic templating for letters where fields like <<Company Name>> and <<Address>> are replaced with the actual company name and address of the recipient as listed in another file, usually a spreadsheet. Mail merge is basically a bridge between a spreadsheet and a document. All the big players in office software, Adobe, Microsoft, and Google, support mail merge.

What is Data Merge?

But what about real data merge? I have data in two tables and I want to merge them together in a principled way? How would that work? What happens when there are merge conflicts?

Data generally comes in the form of tables but the concepts here apply to file-based formats like CSVs or hierarchical data formats like JSON documents. Based on my expertise and for simplicity, I'm going to discuss data merge in terms of tables.

Merge Icon

Let's assume the data has the same schema: same primary keys, column names, and column types. If the tables have different schema, merge can get a little complicated. A set of rules would need to be defined for schema merges and conflicts. Once schemas are merged, you could proceed with a data merge.

Let's explain with an example. Let's say we have a table of image labels and bounding boxes. We start with one image and two labels, "fruit" and "bowl".

CREATE TABLE image_labels (
	image_id int, 
	label varchar(100), 
	top_left_x int, 
	top_left_y int, 
	bottom_right_x int, 
	bottom_right_y int, 
	PRIMARY KEY(image_id, label), 
	CHECK(top_left_x < bottom_right_x), 
	CHECK(top_left_y < bottom_right_y)
);
INSERT INTO image_labels VALUES (1, 'bowl', 0, 5, 5, 10), (1, 'fruit', 0, 0, 5, 5);

Let's say two users wants to modify the table but they don't want to stomp on each others work. This is a data merge problem. In traditional SQL, the two may agree to coordinate through a temporary table. The first user will work on the original table and the second user will do their work in a temporary table. The first user will refine the bounding boxes we have.

UPDATE image_labels SET top_left_x=1;

The second user is interested in classifying the fruit in the bowl for a different problem.

CREATE TABLE image_labels2 AS SELECT * FROM image_labels;
DELETE FROM image_labels2;
INSERT INTO image_labels2 VALUES (1, 'apple', 0, 0, 2, 5), (1, 'orange', 2, 0, 5, 5);

I now have two tables, image_labels and image_labels2 that look like this:

SELECT * from image_labels;
+----------+-------+------------+------------+----------------+----------------+
| image_id | label | top_left_x | top_left_y | bottom_right_x | bottom_right_y |
+----------+-------+------------+------------+----------------+----------------+
| 1        | bowl  | 1          | 5          | 5              | 10             |
| 1        | fruit | 1          | 0          | 5              | 5              |
+----------+-------+------------+------------+----------------+----------------+

SELECT * from image_labels2;
+----------+--------+------------+------------+----------------+----------------+
| image_id | label  | top_left_x | top_left_y | bottom_right_x | bottom_right_y |
+----------+--------+------------+------------+----------------+----------------+
| 1        | orange | 2          | 0          | 5              | 5              |
| 1        | apple  | 0          | 0          | 2              | 5              |
+----------+--------+------------+------------+----------------+----------------+

In SQL, there is a way to merge data from two tables with the same schema INSERT ON DUPLICATE KEY UPDATE.

INSERT INTO image_labels 
SELECT * from image_labels2 
ON DUPLICATE KEY UPDATE 
image_labels.top_left_x = image_labels2.top_left_x,
image_labels.top_left_y = image_labels2.top_left_y,
image_labels.bottom_right_x = image_labels2.bottom_right_x,
image_labels.bottom_right_y = image_labels2.bottom_right_y;

This yields the following table:

SELECT * from image_labels
+----------+--------+------------+------------+----------------+----------------+
| image_id | label  | top_left_x | top_left_y | bottom_right_x | bottom_right_y |
+----------+--------+------------+------------+----------------+----------------+
| 1        | apple  | 0          | 0          | 2              | 5              |
| 1        | bowl   | 1          | 5          | 5              | 10             |
| 1        | fruit  | 1          | 0          | 5              | 5              |
| 1        | orange | 2          | 0          | 5              | 5              |
+----------+--------+------------+------------+----------------+----------------+

Is this a valid merge? Maybe. There are a few problems with this approach.

The first problem is conflicts. The part after ON DUPLICATE KEY UPDATE specifies what to do in the case of a key collision conflict. Do you keep what you have or take what the other table has? You need to define what to do in a single SQL statement. That can get complicated. Also, what if both tables changed different cells? You may be forced to overwrite one with the other. Conflicts are hard to deal with and a database with true data merge can help manage them.

The second problem is this approach requires two copies of the data be stored in the database without sharing any of the data between the copies. This may be untenable for large tables. An ideal solution would structurally share data between versions. A database with data merge would structurally share data between versions.

The third problem is deletes. What if you've deleted rows in one of the tables and want those merged into the other table? What if you've deleted rows in both tables? The SQL to solve this problem becomes very complex and is not easily generalizable. A database with data merge would handle deletes on multiple versions.

The fourth problem is speed. This approach requires a full table scan of table2 and an indexed lookup of each primary key for each duplicate, even if the tables are the same. An ideal merge approach would scale with the size of the differences between the two tables, not the size of the data in the tables. A database with data merge would be able to perform merges fast.

Thankfully, there is a novel data structure called a prolly tree, pioneered by the good folks who built Noms. A prolly tree is a content addressed b-tree. Prolly trees are fast to query like b-trees but also fast to compare. When you combine a prolly tree with a Merkle DAG, you can build a Git-style version controlled database you can three way merge.

A database built on prolly trees can track conflicts, structurally share data between versions, handle deleted rows, and do all this fast. We know because we built one and it's featured later in this article.

Why do you need Data Merge?

Data Quality

I recently published an opus on Data Quality Control arguing that version control and testing is a missing piece in the data quality control stack. Branch, Diff and merge are the core building blocks of version control. Utilizing data version control with merges can be a critical tool to increase data quality in your organization.

Collaboration

Asynchronous collaboration using decentralized version control works really well in source code. Two editors make changes on separate branches and merge the changes back together later. This allows for long running projects and experiments which are not possible using contemporary databases. The same style of collaboration model could work well on data by using data merge functionality.

Products

Data merge is a hard technical problem. Presently, only two databases can do it. One can do data merge on graph and document databases. The other can merge SQL databases.

Data Merge

Terminus DB

Tagline
Making Data Collaboration Easy
Initial Release
October 2019
GitHub
https://github.com/terminusdb/terminusdb

TerminusDB is a version controlled graph and document database. It applies Git concepts to graph or document databases. Think Git and MongoDB/Neo4J had a baby.

A Terminus data merge example is well documented so I won't repeat it here. When you merge and Terminus identifies a duplicate record, record information is appended. The provenance information is stored in the record so the user can pick which side of the merge to use.

TerminusDB is new and we like what we see. The company is very responsive, has an active Discord, and is well funded.

Dolt

Tagline
It's Git for Data
Initial Release
August 2019
GitHub
https://github.com/dolthub/dolt

Dolt is the first and only version controlled SQL database. Dolt is like Git and MySQL had a baby. Dolt can be run on the command line, like Git or as a running server, like MySQL.

Data merge in Dolt is built on top of prolly trees and a Merkle DAG like we explained above. Data merge can produce conflicts, structurally share data between versions, handle deleted rows, and do all this fast. Dolt can merge data and schema. Dolt merges rows across versions using primary keys.

Going back to our example, now with a version controlled database. Instead of creating a temporary table we will create a branch and use the same table. We made a commit after creating the image_labels table.

CALL DOLT_CHECKOUT('-b', 'user2');
DELETE FROM image_labels;
INSERT INTO image_labels VALUES (1, 'apple', 0, 0, 2, 5), (1, 'orange', 2, 0, 5, 5);
CALL DOLT_COMMIT('-am', 'Classified fruit');

The first user will modify main with the same SQL:

UPDATE image_labels SET top_left_x=1;
CALL DOLT_COMMIT('-am', 'Modified slightly off top corner');

Now I have two different tables on two different branches. Notice the AS OF syntax.

SELECT * from image_labels;
+----------+-------+------------+------------+----------------+----------------+
| image_id | label | top_left_x | top_left_y | bottom_right_x | bottom_right_y |
+----------+-------+------------+------------+----------------+----------------+
| 1        | bowl  | 1          | 5          | 5              | 10             |
| 1        | fruit | 1          | 0          | 5              | 5              |
+----------+-------+------------+------------+----------------+----------------+

SELECT * from image_labels AS OF user2;
+----------+--------+------------+------------+----------------+----------------+
| image_id | label  | top_left_x | top_left_y | bottom_right_x | bottom_right_y |
+----------+--------+------------+------------+----------------+----------------+
| 1        | apple  | 0          | 0          | 2              | 5              |
| 1        | orange | 2          | 0          | 5              | 5              |
+----------+--------+------------+------------+----------------+----------------+

Now let's merge and see what happens.

CALL DOLT_MERGE('user2')"
+--------------+-----------+
| fast_forward | conflicts |
+--------------+-----------+
| 0            | 1         |
+--------------+-----------+
Merge conflict detected, transaction rolled back. Merge conflicts must be resolved using the dolt_conflicts tables before committing a transaction. To commit transactions with merge conflicts, set @@dolt_allow_commit_conflicts = 1

A conflict is detected because the first user modified a row that the second user deleted. If I want to produce the above merge result I just delete the conflicts from the dolt_conflicts table but I can produce whatever merge I results want. Let's say I want to keep the update but only on the apple label and now I'm only interested in fruit.

START TRANSACTION;
CALL DOLT_MERGE('user2');
UPDATE image_labels SET top_left_x=1 where label='apple';
DELETE FROM image_labels where label='bowl';
DELETE FROM dolt_conflicts_image_labels;
CALL DOLT_COMMIT ('-am', 'Produced a merge where I am only interested in fruit');

This produces the following merge but the power is telling the user there were conflicting edits and allowing any merge to be produced.

SELECT * from image_labels;
+----------+--------+------------+------------+----------------+----------------+
| image_id | label  | top_left_x | top_left_y | bottom_right_x | bottom_right_y |
+----------+--------+------------+------------+----------------+----------------+
| 1        | apple  | 1          | 0          | 2              | 5              |
| 1        | fruit  | 1          | 0          | 5              | 5              |
| 1        | orange | 2          | 0          | 5              | 5              |
+----------+--------+------------+------------+----------------+----------------+

Obviously, we're a little biased but we think Dolt is the best of breed tool for data merge. Want to discuss? Come talk to us on Discord.

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt

Or join our mailing list to get product updates.