Introducing Type Changes

FEATURE RELEASESQL
6 min read

Dolt is a SQL database with Git-style versioning. With each new version of Dolt, we increase the number of supported SQL features, moving toward our goal of being a complete drop-in replacement for MySQL, while adding all of the versioning features you know and love from Git applied to a database, such as branching, diffs, merging, etc. In our latest release, we added support for changing a column's type, and in this blog post I'll briefly go over the significance of this functionality in Dolt.

Changing a Column's Type?

Dolt relies on schemas to define the structure of its tables, with a schema consisting of a set of columns. Each column has a set of properties such as a name and constraints, but it also has a type. This type forms the baseline of what is allowed in the column, and also how to interpret the values of the column.

For a simplified example, let's look at the SMALLINT type. All values belonging to a column with that type will be an integer ranging from -32,768 to 32,767. If you attempted to insert 100,000 into the column, then an error would be returned as it is outside the range of valid values. Additionally, any clients that connect to the database and request data from the table will know to interpret all values belonging to that column as an integer.

Now building on the example, let's say that later on the creator decides that they need to store integers larger than the maximum 32,767 of SMALLINT, such as 1,000,000. Well MEDIUMINT allows for integers up to 8,388,607, so this would be a great type to replace it with. In any SQL database, you would run any of the column modification commands, such as ALTER TABLE table_name MODIFY COLUMN column_name MEDIUMINT, so the user would do just that, changing the column's type from SMALLINT to MEDIUMINT. The database would internally make the necessary adjustments so that it can properly support the larger range of values, but this would all be relatively invisible to the user.

Significance to Versioning

Well, those internal adjustments are a bit different for Dolt compared to traditional SQL databases. Unlike others, Dolt is fully versioned, meaning you get commits, branching, merging, diffs, etc. Like Git, but specifically made for data. We store our row values keyed by a numeric tag, rather than depending on the column name. This way, you can rename the column or reorder the columns, and we don't have to rewrite every row at the storage layer. Table rewrites are especially expensive in Dolt, as we store the table both before and after the rewrite, effectively doubling storage (assuming the previous data was committed). Tags are also generated deterministically, so two people working on their own branches can add the same column, and we know to treat them as the same column!

So what do tags have to do with changing a column's type? Well, at the lower level where the diff functionality operates, Dolt works with a different set of types, and the SQL types are mapped to those internal types. Each tag is implicitly associated with an internal type, therefore changing the SQL type may necessitate changing the internal type, which would necessitate changing the tag. If we didn't do this, then two revisions of the database would have values with the same tag and different internal types, meaning we would be unable to compare them for diffs and merges. Thus additional work had to be done to ensure that tag changes are as infrequent as possible, so that logical SQL type changes—such as SMALLINT to MEDIUMINT—results in as few changes to the underlying storage. To note, when a type change results in a tag change, Dolt treats it as a full table rewrite by dropping the column with the old type, and adding a new column with the same name, the new type, and the equivalent data.

Drop-in Replacement for MySQL

We define "equivalent data" as what MySQL converts values to when changing types. As one of our goals is to be a drop-in replacement for MySQL servers, this extends to what a MySQL user expects to happen to their data when they change a column. There are quite a lot of types that are supported, and sadly we could not find much documentation explicitly stating how types translate their values between each other.

So, we took it upon ourselves and wrote a framework that tests a local MySQL installation. For each type (including variations of parameterized types such as BIT), we generate several permutations on a set of values that are valid for that type, and run a MODIFY COLUMN statement to change to all of the other types. The framework records the success or failure, along with the values that the data was changed to if it was successful. With this data, it generates tests that we import into Dolt, and allows us to develop our conversion logic directly against the output of MySQL. With this approach, we're able to generate thousands and thousands of tests, increasing our confidence as a drop-in replacement, and also nicely exercising our code due to thorough testing.

Example Time!

Let's run through creating a table and changing the type a few times. First, let's create a table and insert some data.

$ dolt init
Successfully initialized dolt data repository.

$ dolt sql <<SQL
CREATE TABLE example (
    pk int PRIMARY KEY,
    val int
);
INSERT INTO example VALUES (0, 3), (1, 4), (2, 5);
SQL
Rows inserted: 3 Rows updated: 0 Rows deleted: 0

We can see our data using a simple SELECT query. dolt sql accepts many different forms of input, so we'll use the query flag this time.

$ dolt sql -q "SELECT * FROM example;"
+----+-----+
| pk | val |
+----+-----+
| 0  | 3   |
| 1  | 4   |
| 2  | 5   |
+----+-----+

Let's commit our table, which looks the same as committing source files in Git.

$ dolt add -A

$ dolt commit -m "Committing our example table"
commit ttltdmtqrugd2m2ugvtribbavr3up0m5
Author: Daylon Wilkins <daylon@dolthub.com>
Date:   Wed Feb 10 12:00:00 -0800 2021

       Committing our example table

Now let's change the type. Both MODIFY COLUMN and CHANGE COLUMN will work, so let's go with MODIFY COLUMN.

$ dolt sql -q "ALTER TABLE example MODIFY COLUMN val BIGINT;"

Now our table differs from what was committed. We can check it out using the diff command, just like with Git.

$ dolt diff
diff --dolt a/example b/example
--- a/example @ qc29k1n56jfnp9ohkdj072fbnon6gmb7
+++ b/example @ 5kflv187dt2ss7upnl2ehg695jjf6jki
 CREATE TABLE example (
   `pk` INT NOT NULL
<   `val`    INT
>   `val` BIGINT
    PRIMARY KEY (pk)
 );

+-----+----+-----+
|     | pk | val |
+-----+----+-----+
+-----+----+-----+

Just as you'd expect, it shows that we updated the type of the val column. As these two types have the same internal type, the data at the storage layer was not changed. Let's change the type to a string type rather than an integer type this time, which will change the internal type.

$ dolt sql -q "ALTER TABLE example MODIFY COLUMN val VARCHAR(20);"

$ dolt diff
diff --dolt a/example b/example
--- a/example @ qc29k1n56jfnp9ohkdj072fbnon6gmb7
+++ b/example @ rsr21l6hft4ju8hhu31jjk4upaifni7n
 CREATE TABLE example (
   `pk` INT NOT NULL
-   `val` INT
+   `val` VARCHAR(20)
    PRIMARY KEY (pk)
 );

+-----+----+------+------+
|  <  | pk |      | val  |
|  >  | pk | val  |      |
+-----+----+------+------+
|  <  | 0  | NULL | 3    |
|  >  | 0  | 3    | NULL |
|  <  | 1  | NULL | 4    |
|  >  | 1  | 4    | NULL |
|  <  | 2  | NULL | 5    |
|  >  | 2  | 5    | NULL |
+-----+----+------+------+

Now our diff looks a bit different. No longer does the schema show that val was updated, it now shows val as a drop-and-add. In addition, we now have a data diff. The values appear to be the same, but we've gone from integers to strings, where our string values are representative of the integers. You'll also notice that our from type is INT rather than BIGINT. This is because we never committed our previous type change, so the change was only in the working set. Our diff is, by default, against the previous commit, which is the same behavior that Git users would expect.

Conclusion

Allowing a column's type to change while properly interacting with all of Dolt's versioning capabilities greatly expands the value of commits in relation to current data. Even recently, we've used the new type changing functionality in our bounties, allowing us to actively respond to the demands of our participants. You can stay up-to-date on our progress by following our releases, and you can directly interact with us by joining our Discord server. We hope you'll join us for the ride!

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt

Or join our mailing list to get product updates.