Will it merge?

September 11, 2023

9 min read

Dolt DB is the world's first fully-versioned SQL relational database. It allows you to track the history of your data in every row and every column of all your tables. In the same way that Git enables you to branch, merge, fork, and diff your source code, Dolt enables you to branch, merge, fork, and diff your relational data.

One of the really cool features of Dolt and Git is the ability to merge changes between branches. Dolt has always had support for merging data, and earlier this year, we launched support for merging schema changes with our new, performance-focused storage engine, too. That means, if you merge branch other into branch main, and you've added or removed columns, changed column defaults, etc, then you can expect those same schema changes to be applied to the main branch, in addition to merging all the data. Merging schema changes isn't necessarily harder than merging data, but there are quite a few edge cases that need to be considered when merging schema changes.

Will it Merge?!?

Dolt may not be able to turn an iPhone X into powder like the infamous Blendtec blenders can, but being able to merge schema changes between branches is arguably more useful anyway. Let's take a look at a few examples of schema changes and see how well Dolt can merge them...

Adding/Removing Non-Primary Key Columns

Adding new columns to a table is one of the most common schema changes as your application and database evolve, so it was one of the first cases we added automatic schema merge support for. You can add and remove columns on a branch and Dolt will merge the changes back together automatically for you.

In the example below, the dev development branch adds two new columns to a table, while the main branch removes a column from the same table. If you want to follow along, install Dolt, create a new directory, run dolt init, then start up a SQL shell with dolt sql and run the commands below.

-- create some test data for us to work with
create table t (pk int primary key, color varchar(100), size varchar(100), age int);
insert into t values (1, "blue", "large", 42);
call dolt_commit('-Am', 'adding table t on main');

-- create a development branch off of main
call dolt_branch('dev');

-- other changes happen on main after our dev branch has been created
alter table t drop column size;
insert into t values (2, "brown", 2);
call dolt_commit('-Am', 'removing the size column from table t on main');

-- on our dev branch, we add some new columns for a feature we're working on 
call dolt_checkout('dev');
alter table t add column location varchar(100);
alter table t add column price float default (10.0);
insert into t values (100, "purple", "medium", 2, "France", 22);
call dolt_commit('-am', 'adding purple to our enum on dev');

-- switch back to our main branch and merge in the dev branch
call dolt_checkout('main');
call dolt_merge('dev');

Now that we've merged the branches together, let's take a look at table t and verify that the size column has been removed and the location and price columns have been added. Because we specified a column default value for the price field, we should also see that default value applied for rows 1 and 2 that were created before we added the price field.

show create table t;
+-------+------------------------------------------------------------------+
| Table | Create Table                                                     |
+-------+------------------------------------------------------------------+
| t     | CREATE TABLE `t` (                                               |
|       |   `pk` int NOT NULL,                                             |
|       |   `color` varchar(100),                                          |
|       |   `age` int,                                                     |
|       |   `location` varchar(100),                                       |
|       |   `price` float DEFAULT (10.0),                                  |
|       |   PRIMARY KEY (`pk`)                                             |
|       | ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_bin |
+-------+------------------------------------------------------------------+

select * from t;
+-----+--------+-----+----------+-------+
| pk  | color  | age | location | price |
+-----+--------+-----+----------+-------+
| 1   | blue   | 42  | NULL     | 10    |
| 2   | brown  | 2   | NULL     | 10    |
| 100 | purple | 2   | France   | 22    |
+-----+--------+-----+----------+-------+

Sure enough, Dolt has correctly merged together our data changes and schema changes from our main and dev branches.

Adding new `ENUM` and `SET` values

Enums and sets are super handy data types to help make your data easier to understand as well as ensure that only valid values are entered into fields. It's not uncommon to need to expand enums and sets, and this is another schema change that Dolt can automatically merge for you. The only catch is that the new enum or set values must be added to the end of the enum or set. This ensures that existing data that uses those enum values (which are internally stored as integers IDs, btw) don't need to be translated to new IDs. That's an optimization we could certainly add in the future if customers ask for it. Let's take a look at an example of a schema merge where we've added a new value to the end of an enum...

-- create some test data for us to work with
create table t (pk int primary key, color enum('green', 'red', 'blue'));
insert into t values (1, 'red');
call dolt_commit('-Am', 'adding table t on main');

-- create a branch for a new feature we're going to work on
call dolt_branch('feature1');

-- more changes happen on main while we're building our feature 
insert into t values (2, 'blue');
call dolt_commit('-am', 'inserting a second row to t on main');

-- our feature requires supporting a new color, so we add it to the end of our enum 
call dolt_checkout('feature1');
alter table t modify column color enum('green', 'red', 'blue', 'purple');
insert into t values (100, 'purple');
call dolt_commit('-am', 'adding purple to our enum on feature1');

-- switch back to our main branch and merge in the feature1 branch
call dolt_checkout('main');
call dolt_merge('feature1');

Now for the moment of truth... Dolt has merged our branches together, so let's verify that our data looks correct:

show create table t;
+-------+------------------------------------------------------------------+
| Table | Create Table                                                     |
+-------+------------------------------------------------------------------+
| t     | CREATE TABLE `t` (                                               |
|       |   `pk` int NOT NULL,                                             |
|       |   `color` enum('green','red','blue','purple'),                   |
|       |   PRIMARY KEY (`pk`)                                             |
|       | ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_bin |
+-------+------------------------------------------------------------------+

select * from t;
+-----+--------+
| pk  | color  |
+-----+--------+
| 1   | red    |
| 2   | blue   |
| 100 | purple |
+-----+--------+

The schema of our table now contains the new purple element in our color enum, and when we look at our data, we see that row 100 has the correct value of purple. You could repeat this example with sets and as long as you add the new set values to the end of the existing set, you'll be able to automatically merge in that datatype change as well.

Expanding `VARCHAR` fields

Another common schema change is enlarging a column to a wider type that can store larger values. For example, maybe you started off with a VARCHAR(50) column, and during development, you realized that you need more room in that field. Dolt can now automatically merge schema changes where you widen string types. For example, if you started with VARCHAR(50), you can alter that to be VARCHAR(100) or CHAR(200) or even TEXT and when you go to merge those changes back into your main branch, Dolt will seamlessly apply the schema change and migrate any existing data to the new type definition.

-- create some test data for us to work with
create table t (pk int primary key, description varchar(50));
insert into t values (1, 'little fluffly clouds');
call dolt_commit('-Am', 'adding table t on main');

-- create a sprint-10 branch off of main 
call dolt_branch('sprint-10');

-- more changes happen on main while we're working on our sprint-10 branch 
insert into t values (2, 'metallic spheres in colour');
call dolt_commit('-am', 'inserting a second row to t on main');

-- our development works requires increasing the size of the description field to hold more data 
call dolt_checkout('sprint-10');
alter table t modify column description TEXT;
insert into t values (100, 'this is a really, really, really, really, really, really long description that would not fit in our previous VARCHAR(50) column');
call dolt_commit('-am', 'alter the description column to a TEXT type on sprint-10');

-- switch back to our main branch and merge in the sprint-10 branch
call dolt_checkout('main');
call dolt_merge('sprint-10');

Our merge completed successfully, with no data or schema conflicts, but let's take a closer look at the schema of table t to confirm that our VARCHAR(50) column was changed to a TEXT column, and then take a look at the data in our table and make sure it's what we're expecting.

show create table t;
+-------+------------------------------------------------------------------+
| Table | Create Table                                                     |
+-------+------------------------------------------------------------------+
| t     | CREATE TABLE `t` (                                               |
|       |   `pk` int NOT NULL,                                             |
|       |   `description` text,                                            |
|       |   PRIMARY KEY (`pk`)                                             |
|       | ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_bin |
+-------+------------------------------------------------------------------+

select * from t;
+-----+---------------------------------------------------------------------------------------------------------------------------------+
| pk  | description                                                                                                                     |
+-----+---------------------------------------------------------------------------------------------------------------------------------+
| 1   | little fluffly clouds                                                                                                           |
| 2   | metallic spheres in colour                                                                                                      |
| 100 | this is a really, really, really, really, really, really long description that would not fit in our previous VARCHAR(50) column |
+-----+---------------------------------------------------------------------------------------------------------------------------------+

We can see the schema change was correctly applied and if we looked at the raw bits stored for our table, we'd see that the values for the description field are now stored as TEXT data, instead of VARCHAR data, which means they are stored "out of band" from the table data and instead of the raw value, the table data stores pointers to other locations where the real content is stored.

Expanding `BINARY` into `BLOB`

Similar to the VARCHAR/CHAR/TEXT support above, binary type changes can now be automatically merged as well. This means you can start with a VARBINARY(50) column and alter it to be BINARY(51) or BLOB and Dolt will automatically merge in that schema change and update existing data when you're ready to merge in those type changes. This example is extremely similar to the VARCHAR to TEXT example above, so we'll leave it as an exercise for the reader to try it out.

What Doesn't Merge?

Just like when merging files with Git, some merges are just more difficult than others...

Some merges are difficult

We've optimized several common cases that our customers have told us are important to them, and these types of schema changes now merge smoothly and automatically. However, there are still many ways you can alter a table's schema that will require manual intervention to resolve conflicts and complete the merge. Let's look at two quick examples.

Shrinking a column's type

As an example... let's try to undo our change above to widen a VARCHAR(10) column into a TEXT column. We saw above how Dolt's schema merge support was able to expand that column, but going the other way requires a little more work. The reason is simple – when widening a column, it's guaranteed that all existing data will be able to fit into the new column, but when shrinking a column, there's a risk that some existing data might not fit in the column. In these cases, dolt_merge() will report a schema conflict and ask the user to step in and resolve the conflict:

-- Move back to the sprint-10 branch and change the description field back to varchar(50)
-- Note: we must first fix the row with data longer than 50 chars, otherwise we'll get an error
call dolt_checkout('sprint-10');
update t set description=substring(description, 0, 50) where pk=100;
alter table t modify column description VARCHAR(50);
call dolt_commit('-am', 'changing description to varchar(50) on sprint-10 branch');

-- Go back to the main branch to prepare to merge in the changes from the sprint-10 branch
call dolt_checkout('main');

-- Turn off @@autocommit so that any merge conflicts don't get automatically rolled back
set @@autocommit=0;

This time, when we call dolt_merge(), our merge doesn't complete successfully, and instead we see that a conflict is reported.

call dolt_merge('sprint-10');
+------+--------------+-----------+
| hash | fast_forward | conflicts |
+------+--------------+-----------+
|      | 0            | 1         |
+------+--------------+-----------+

Fortunately, it's still really easy to handle these merges – you just need to manually get the schemas in sync, which also gives you a chance to correct any data that wouldn't fit in the new schema. We'll abort the merge we're currently in, bring the schemas into sync, then re-run our merge.

-- Abort the current merge so we can fix the schema conflict that was reported above
call dolt_merge('--abort');

-- On the main branch, apply the same schema changes that were made to the sprint-10 branch
-- Note: the dolt_patch() function can be useful to get an idea of the schema changes needed 
update t set description=substring(description, 0, 50) where pk=100;
alter table t modify column description VARCHAR(50);
call dolt_commit('-am', 'changing description to varchar(50) on main branch');

Now, when we re-run our merge, we see that it completed successfully and created a new commit with our changes merged:

call dolt_merge('sprint-10');
+----------------------------------+--------------+-----------+
| hash                             | fast_forward | conflicts |
+----------------------------------+--------------+-----------+
| 1i19l897j39am18oq1aau3t6o3a9l7j5 | 0            | 0         |
+----------------------------------+--------------+-----------+

We could optimize the general case of making a column more "narrow", for example, by attempting the change and only reporting a conflict error if the data won't fit in the new type. This hasn't been important to our customers yet though, so we'll hold off on this feature until they tell us they need it.

Altering primary key fields

Another good example of a schema change we don't yet automatically merge is changing the primary key columns on a table. This case is more complicated because we need to map rows from the right table version, left table version, and ancestor table version during a merge in order to tell how each row changed, and that matching is done by primary key. So, if the primary key structure changed on one side of the merge, there's extra translation to do when mapping the rows. Extending a primary key with additional fields is not an uncommon operation, so it's likely we'll tackle that subset of the problem as our first step in the future.

Conclusion

There are a TON of ways to modify SQL tables and columns! We don't support automatic schema change merging for every one of those cases yet, but we want to keep expanding support for more cases. If there are specific types of schema changes that would help you if they were easier to merge, let us know by filing an issue on GitHub and we'll be happy to take a look and see what we can do! Or, feel free to swing by the DoltHub Discord and ask us any schema or data merging questions you have.

Blog

Will it Merge?!?

Adding/Removing Non-Primary Key Columns

Adding new ENUM and SET values

Expanding VARCHAR fields

Expanding BINARY into BLOB