Merging Branches with Foreign Keys

15 min read

Dolt is a SQL database with Git-style versioning. With each new version of Dolt, we increase the number of supported SQL features. We're moving toward our goal of being a complete drop-in replacement for MySQL, while adding all of the versioning features you know and love from Git applied to a database, such as branching, diffs, merging, etc. Relational databases have not progressed over the years with versioning support in mind, so we've had to face many novel problems regarding versioning data.

Back when we first started Dolt, the schema was simple, and our rules for merging were simple: two writers on the same cell create a conflict. Additionally, two writers modifying the same column in the schema is a conflict, otherwise the merge is fully resolved. As we've added functionality such as foreign keys, check constraints, and default column values, we've had to update our merge rules to keep up. Reasoning about those merge rules is one of the more complex product problems that we face here at DoltHub.

This blog post focuses on one such problem: merging data when foreign keys are involved. We'll discuss our journey to finding a solution to the problem, as well as provide examples of how to use our solutions.

Git For Data

Version control is practically standard in modern software development, and over the years many systems have come and gone, with Git currently the most popular. As ubiquitous as version control is for source code, the same does not exist for data, which is where Dolt comes into play. For those familiar with Git's flavor of version control, Dolt allows you to create branches and merge them.

Have Branch A, Create Branch B and update, Update Branch A, Merge Branch B into A

During merging, we find all differences at the cell level and merge what can be merged, and throw a conflict when it cannot be resolved (similar to changing the same line to different values in Git). These conflicts provide an easy way for the user to see the rows in question, how they conflict, and then decide how to resolve the conflict. In Git, these conflicts are all that need to be accounted for, and once they're resolved you're "done" with the merge. For source code, that does not necessarily mean that your code is correct or compilable, but it is up to the user to resolve these issues. With data, there are sometimes constraints that are placed on the data, and one such constraint is the foreign key constraint.

Foreign Keys and Merging

Foreign keys are imperative in databases to maintain integrity of the data. Not only do they enforce restrictions on the data, but they may also modify the data to ensure that data conforms to the restrictions. They're generally always enabled, and only disabled when changes are made that will eventually conform to the restrictions of the foreign key, when the intermediate steps do not. In Dolt, we originally chose to view a merge as an operation that should eventually produce data that conforms to foreign keys (and other constraints), with eventually being at the time of commit.

The Temporary Solution

To implement this eventuality, we chose to enforce foreign keys at commit time. This was handled by iterating over every row of every table that had a foreign key constraint, and verifying that the row conformed to all foreign key constraints. At the end of the scan, we displayed the rows that violated the constraints and their respective tables, allowing a user to fix them. This solution was adequate at the time for our users and internal use cases, but we eventually ran into the most obvious bottleneck: full table scans.

We host bounties where you can get paid for sourcing data every month, and as these repositories reached into the gigabytes, our workflow for merging pull requests became very slow. We changed the row scanning logic to operate on the differences since the previous commit, which significantly sped up our scan since we are only looking at a subset of the data. This did mean that we were no longer looking at all the data, so if you forced a commit that had rows that violated foreign keys then they became impossible to find. We found this acceptable as forcing a commit is intentional, although we did introduce a new CLI command (verify-constraints) that still did a full table scan. This lasted us quite a while, but was only useful to CLI users. Those that made use of the equivalent SQL commands (Dolt allows access to CLI commands through SQL using specially-named functions) needed a new solution, and we needed to flesh out what it means to merge with foreign keys having referential actions.

The Real Solution

Conflicts are implemented in Dolt as a set of special system tables that users may view through the CLI or query through SQL, allowing functional parity between our two primary forms of interaction. After some internal discussion, this looked like a good model to follow for foreign key violations, and we decided on the dolt_constraint_violations system table, with each table having its own dolt_constraint_violations_TABLENAME system table. Per the name, we intend to use this table for all constraint violations, so we can also catch and store unique index violations and CHECK violations here as well.

Next we decided on what to do regarding referential actions. In a standard SQL database under normal operation, statements are processed serially. In the context of foreign keys, this means that any statements that may alter other rows outside of the target row(s) will produce a deterministic result. A merge, however, processes all rows concurrently, therefore we do not have an order upon which to apply referential actions from foreign keys, and some kind of order is required.

In some scenarios we can arbitrarily pick an order that may seem "correct", such as applying the referential actions from the parent table before checking for violations on the child table. That fails when considering chained foreign keys (where changing A affects B which affects C and so on), circular foreign keys (where changing A affects B which affects A), some forms of self-referential keys, etc. Even for those it's possible to arbitrarily pick some ordering (first in the chain, alphabetically, etc.) Still, are any of these correct? It's impossible to answer definitively, as choosing an ordering is choosing among several equally correct relationships with the data, and correctness is dependent upon the application of the database. As a result, we decided to not apply any referential actions. This gives the user full control over what to do for their particular use case, and with the power of SQL and all of its features (including stored procedures), workflows can be created that automatically handle constraint violations of any nature.

In summary, when you merge two branches, we first apply the changes only considering unique index violations. Next we iterate over the differences brought by the merge result (compared to the ancestor) and append all constraint violations to a special system table. From there, we simply check if the dolt_constraint_violations system table is empty before proceeding with a commit (and you can still force commits with violations). We still provide the CLI command (renamed to constraints verify) if you wish to do a full table scan, and we also added SQL functions to do the same.

Examples

Now that we've covered the history and where we currently are, let's look at a few examples for dealing with constraint violations. Feel free to download Dolt and follow along! Each example will assume that we're starting from an empty folder.

Simple Foreign Key Constraint Violation

First let's go over a simple foreign key constraint violation and how it may be resolved.

$ dolt init
Successfully initialized dolt data repository.

$ dolt sql << SQL
CREATE TABLE parent(pk INT PRIMARY KEY, v1 INT, INDEX (v1));
CREATE TABLE child(pk INT PRIMARY KEY, v1 INT, CONSTRAINT fk_parent FOREIGN KEY (v1) REFERENCES parent (v1));
INSERT INTO parent VALUES (1, 1), (2, 2);
INSERT INTO child VALUES (1, 1);
SQL
Rows inserted: 3 Rows updated: 0 Rows deleted: 0

$ dolt add -A

$ dolt commit -m "Ancestor Commit"
commit 5lntcccvdqaqeo9i73epeoeukpr5cd5k
Author: Daylon Wilkins <daylon@dolthub.com>
Date:   Wed Jul 21 12:00:00 -0700 2021

        Ancestor Commit

Here we've created two tables and a foreign key, and then committed this to the master branch. Next, let's create a branch and modify both the master branch and this other branch. Each branch's data will be valid according to the foreign key.

$ dolt branch other

$ dolt sql -q "DELETE FROM parent WHERE pk = 2;"
Query OK, 1 row affected

$ dolt add -A

$ dolt commit -m "Removed row on master"
commit tnhjjmffaa28d9chfjl7qj6kvv5b15gs
Author: Daylon Wilkins <daylon@dolthub.com>
Date:   Wed Jul 21 12:00:00 -0700 2021

        Removed row on master


$ dolt checkout other
Switched to branch 'other'

$ dolt sql -q "INSERT INTO child VALUES (2, 2);"
Query OK, 1 row affected

$ dolt add -A

$ dolt commit -m "Added row on other"
commit t2j26e9r4jn63vgb88s7f8cfh20qh2e1
Author: Daylon Wilkins <daylon@dolthub.com>
Date:   Wed Jul 21 12:00:00 -0700 2021

        Added row on other


$ dolt checkout master
Switched to branch 'master'

Now we've removed a parent row in the master branch, and added a reference to that same row in the other branch. When we merge, we're going to add that new row to our master branch, but as the parent row no longer exists, we'll have a constraint violation.

$ dolt merge other
Updating tnhjjmffaa28d9chfjl7qj6kvv5b15gs..t2j26e9r4jn63vgb88s7f8cfh20qh2e1
Auto-merging child
CONSTRAINT VIOLATION (content): Merge created constraint violation in child
Automatic merge failed; fix constraint violations and then commit the result.
Constraint violations for the working set may be viewed using the 'dolt_constraint_violations' system table.
They may be queried and removed per-table using the 'dolt_constraint_violations_TABLENAME' system table.

$ dolt sql -q "SELECT * FROM dolt_constraint_violations;"
+-------+----------------+
| table | num_violations |
+-------+----------------+
| child | 1              |
+-------+----------------+

$ dolt sql -q "SELECT * FROM dolt_constraint_violations_child;"
+----------------+----+----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| violation_type | pk | v1 | violation_info                                                                                                                                                                                                     |
+----------------+----+----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| foreign key    | 2  | 2  | {"Columns": ["v1"], "ForeignKey": "fk_parent", "Index": "v1", "OnDelete": "RESTRICT", "OnUpdate": "RESTRICT", "ReferencedColumns": ["v1"], "ReferencedIndex": "v1", "ReferencedTable": "parent", "Table": "child"} |
+----------------+----+----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

$ dolt sql -q "SELECT * FROM parent;"
+----+----+
| pk | v1 |
+----+----+
| 1  | 1  |
+----+----+

$ dolt sql -q "SELECT * FROM child;"
+----+----+
| pk | v1 |
+----+----+
| 1  | 1  |
| 2  | 2  |
+----+----+

You can see that the parent and child tables are not in alignment according to the foreign key, and we have two new system tables to help us out. dolt_constraint_violations tells us every table in the database that has a constraint violation, and how many violations each one has. Every table has its own system table for constraint violations, so the child table has the dolt_constraint_violations_child system table (parent also has one but it's empty). dolt_constraint_violations_child tells us the type of the violation, and gives us some info in the form of a JSON string. For foreign keys, we're including all of the relevant details about the foreign key in the info to help some use cases.

To resolve our constraint violation, let's insert the missing row into the parent, and then clear the violations table.

$ dolt sql -q "INSERT INTO parent SELECT pk, v1 FROM dolt_constraint_violations_child;"
Query OK, 1 row affected

$ dolt sql -q "DELETE FROM dolt_constraint_violations_child;"
Query OK, 1 row affected

$ dolt sql -q "SELECT * FROM dolt_constraint_violations;"
+-------+----------------+
| table | num_violations |
+-------+----------------+
+-------+----------------+

$ dolt sql -q "SELECT * FROM parent;"
+----+----+
| pk | v1 |
+----+----+
| 1  | 1  |
| 2  | 2  |
+----+----+

You must remember to clear any resolved constraint violations from the respective system table, as this is not done automatically. We are investigating ways to do this automatically, but in the meantime it is a manual operation. Don't worry about accidentally committing with constraint violations though, as we report an error on such attempts. Additionally, dolt_constraint_violations is a read-only table that is automatically updated.

Let's finish our merge and get our new commit in!

$ dolt status
On branch master
All conflicts and constraint violations fixed but you are still merging.
  (use "dolt commit" to conclude merge)

Changes not staged for commit:
  (use "dolt add <table>" to update what will be committed)
  (use "dolt checkout <table>" to discard changes in working directory)
        modified:       child
        modified:       parent

$ dolt add -A

$ dolt commit -m "Merged other and fixed constraint violations"
commit goprqqk18ns284qakr7elb9b4lbtg0os
Merge: tnhjjmffaa28d9chfjl7qj6kvv5b15gs t2j26e9r4jn63vgb88s7f8cfh20qh2e1
Author: Daylon Wilkins <daylon@dolthub.com>
Date:   Wed Jul 21 12:00:00 -0700 2021

        Merged other and fixed constraint violations

Constraints Verify Command

Perhaps you're not doing a merge, but you want to check for any constraint violations. We've got two ways to do this, through the CLI and through SQL functions, and I'll go over both of them. First we'll create our tables and foreign key.

$ dolt init
Successfully initialized dolt data repository.

$ dolt sql << SQL
CREATE TABLE parent(pk INT PRIMARY KEY, v1 INT, INDEX (v1));
CREATE TABLE child(pk INT PRIMARY KEY, v1 INT, CONSTRAINT fk_parent FOREIGN KEY (v1) REFERENCES parent (v1));
INSERT INTO parent VALUES (1, 1);
INSERT INTO child VALUES (1, 1);
SQL
Rows inserted: 2 Rows updated: 0 Rows deleted: 0

$ dolt add -A

$ dolt commit -m "Created foreign key"
commit 3l7alshco7cetlspql2npep976ni9lpm
Author: Daylon Wilkins <daylon@dolthub.com>
Date:   Wed Jul 21 12:00:00 -0700 2021

        Created foreign key

Now let's intentionally insert an invalid row, commit, and then do it again.

$ dolt sql << SQL
SET foreign_key_checks = 0;
INSERT INTO child VALUES (2, 2);
SET foreign_key_checks = 1;
SQL
Rows inserted: 1 Rows updated: 0 Rows deleted: 0

$ dolt add -A

$ dolt commit --force -m "Added a bad row"
commit n1bbnst51svvvsqpg8v58pftn8nnhjhs
Author: Daylon Wilkins <daylon@dolthub.com>
Date:   Wed Jul 21 12:00:00 -0700 2021

        Added a bad row


$ dolt sql << SQL
SET foreign_key_checks = 0;
INSERT INTO child VALUES (3, 3);
SET foreign_key_checks = 1;
SQL
Rows inserted: 1 Rows updated: 0 Rows deleted: 0

Now let's run our constraints verify command. I'll also run the equivalent CONSTRAINTS_VERIFY() SQL function for the sake of example, however it won't add anything as the system table will have already been populated by the CLI command.

$ dolt constraints verify child
All constraints are not satisfied.

dolt_constraint_violations_child
+----------------+----+----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| violation_type | pk | v1 | violation_info                                                                                                                                                                                                     |
+----------------+----+----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| foreign key    | 3  | 3  | {"Columns": ["v1"], "ForeignKey": "fk_parent", "Index": "v1", "OnDelete": "RESTRICT", "OnUpdate": "RESTRICT", "ReferencedColumns": ["v1"], "ReferencedIndex": "v1", "ReferencedTable": "parent", "Table": "child"} |
+----------------+----+----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

$ dolt sql -q "SELECT CONSTRAINTS_VERIFY('child');"
+-----------------------------+
| CONSTRAINTS_VERIFY('child') |
+-----------------------------+
| 0                           |
+-----------------------------+

$ dolt constraints verify parent

$ dolt sql -q "SELECT CONSTRAINTS_VERIFY('parent');"
+------------------------------+
| CONSTRAINTS_VERIFY('parent') |
+------------------------------+
| 1                            |
+------------------------------+

constraints verify and CONSTRAINTS_VERIFY() lets you specify which tables you want to verify (omitting tables will verify all tables). This is why verifying child returns an error message (0 for false/fail in SQL), and yet verifying parent returns nothing (1 for true/success in SQL). In addition to the displayed error message, the constraint violation was also written to the dolt_constraint_violations_child system table.

You'll notice that this only caught our (3, 3) row, even though (2, 2) is also a constraint violation, and that's because we compare against the previous commit by default. To check the contents of the entire table, we need to add the --all parameter (or call CONSTRAINTS_VERIFY_ALL() in SQL).

$ dolt constraints verify --all child
All constraints are not satisfied.

dolt_constraint_violations_child
+----------------+----+----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| violation_type | pk | v1 | violation_info                                                                                                                                                                                                     |
+----------------+----+----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| foreign key    | 2  | 2  | {"Columns": ["v1"], "ForeignKey": "fk_parent", "Index": "v1", "OnDelete": "RESTRICT", "OnUpdate": "RESTRICT", "ReferencedColumns": ["v1"], "ReferencedIndex": "v1", "ReferencedTable": "parent", "Table": "child"} |
| foreign key    | 3  | 3  | {"Columns": ["v1"], "ForeignKey": "fk_parent", "Index": "v1", "OnDelete": "RESTRICT", "OnUpdate": "RESTRICT", "ReferencedColumns": ["v1"], "ReferencedIndex": "v1", "ReferencedTable": "parent", "Table": "child"} |
+----------------+----+----+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

The CLI has a parameter that disables writing the results to the system table (--output-only), but the equivalent does not exist for SQL (as the SQL function does not output anything).

Unique Index Constraint Violations

Dolt will handle more than just foreign key constraint violations, and we can also catch unique index (or unique key/constraint) violations. This will only ever occur on merge, as any insert operations will immediately fail when a unique index is violated. Let's create a table and insert non-unique values to different branches.

$ dolt init
Successfully initialized dolt data repository.

$ dolt sql -q "CREATE TABLE example(pk INT PRIMARY KEY, v1 INT UNIQUE);"

$ dolt add -A

$ dolt commit -m "Created table"
commit qajvj54jd5e37bvp78qq1953fe4qj1h7
Author: Daylon Wilkins <daylon@dolthub.com>
Date:   Wed Jul 21 12:00:00 -0700 2021

        Created table


$ dolt branch other

$ dolt sql -q "INSERT INTO example VALUES (1, 1);"
Query OK, 1 row affected

$ dolt add -A

$ dolt commit -m "Inserted into master"
commit 4nf3k4sbjvar5u849n38nsjrme46srk6
Author: Daylon Wilkins <daylon@dolthub.com>
Date:   Wed Jul 21 12:00:00 -0700 2021

        Inserted into master


$ dolt checkout other
Switched to branch 'other'

$ dolt sql -q "INSERT INTO example VALUES (2, 1);"
Query OK, 1 row affected

$ dolt add -A

$ dolt commit -m "Inserted into other"
commit 7j6s9l65pvo418kiug049or9i8h0o66e
Author: Daylon Wilkins <daylon@dolthub.com>
Date:   Wed Jul 21 12:00:00 -0700 2021

        Inserted into other


$ dolt checkout master
Switched to branch 'master'

$ dolt merge other
Updating 4nf3k4sbjvar5u849n38nsjrme46srk6..7j6s9l65pvo418kiug049or9i8h0o66e
Auto-merging example
CONSTRAINT VIOLATION (content): Merge created constraint violation in example
Automatic merge failed; fix constraint violations and then commit the result.
Constraint violations for the working set may be viewed using the 'dolt_constraint_violations' system table.
They may be queried and removed per-table using the 'dolt_constraint_violations_TABLENAME' system table.

$ dolt sql -q "SELECT * FROM dolt_constraint_violations_example;"
+----------------+----+----+-----------------------------------+
| violation_type | pk | v1 | violation_info                    |
+----------------+----+----+-----------------------------------+
| unique index   | 2  | 1  | {"Columns": ["v1"], "Name": "v1"} |
+----------------+----+----+-----------------------------------+

$ dolt sql -q "SELECT * FROM example"
+----+----+
| pk | v1 |
+----+----+
| 1  | 1  |
+----+----+

As can be seen, merge was unable to insert (2, 1) from the branch other into our branch master, and it was written to the constraint violations system table instead. One subtle different between unique index and foreign key constraint violations is that unique index violations will only contain rows that do not exist in the actual tables, while foreign key violations will only contain rows that do exist in the actual tables.

Conclusion

As Dolt matures, we continually work toward solving problems that are unique to a database with versioning features. For foreign key constraint violations, we're looking into adding built-in tooling to assist users in resolving the violations, such as specifying a merge ordering to allow referential actions to process.
As with any tool, many improvements are iterative, and are also driven by a need or desire to improve existing functionality. We're proactive in attempting to find bugs and faults (such as with the use of our fuzzer), but some solutions and fixes, such as finding the best way to handle foreign keys during a merge, come from just using the product. You can stay up-to-date on our progress by following our releases, and you can directly interact with us by joining our Discord server. We're very attentive to feedback, and you can help us grow Dolt to work perfectly with your specific use case. We hope you'll join us for the ride!

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt