Schema Overriding

March 22, 2024

8 min read

DoltDB is the world's first fully-versioned relational database. You can branch, merge, diff, push, and pull your relational data in the same ways that Git allows you to work with source code files. If you're curious what sorts of use cases that enables, check out some of the many ways customers are using Dolt.

One of the key features of Dolt is that you can access your data (and your data's schema) at any point along your database's commit graph. You can go back in time to see how your data and schema looked two years ago, and features like branches and tags makes accessing different points in that history very easy. In this blog post, we're taking a look at a new feature, called "schema overriding", that allows you to look at your data through the lens of another schema.

Schema Overriding

Let's start off with a really simple example to introduce this feature. In fact, we'll reuse the same example from GitHub that was originally used to request this feature. In that issue, the customer did a nice job describing the problem succinctly:

"I want to have some assurance or safety in writing queries that will be backwards-compatible as I add or remove columns during table migrations across commits."

As a concrete example, they described a table called people with the following schema at earlier commits in the database:

+-----------+------+------+-----+---------+-------+
| Field     | Type | Null | Key | Default | Extra |
+-----------+------+------+-----+---------+-------+
| Id        | int  | NO   | PRI | NULL    |       |
| Name      | text | NO   |     | NULL    |       |
+-----------+------+------+-----+---------+-------+

Then, as the database evolved, a Birthdate field was added to the table in later commits:

+-----------+------+------+-----+---------+-------+
| Field     | Type | Null | Key | Default | Extra |
+-----------+------+------+-----+---------+-------+
| Id        | int  | NO   | PRI | NULL    |       |
| Name      | text | NO   |     | NULL    |       |
| Birthdate | date | YES  |     | NULL    |       |
+-----------+------+------+-----+---------+-------+

The customer wanted an easier way to query the data, without having to manually adjust queries for all the different schema versions in the data's history. In this example the customer wants to use this query, which works with the current, latest schema of the people table:

select Name, Birthdate from people;

However, if he tries to work with older data from earlier commits that don't have the Birthdate field, that query returns an error about the missing Birthdate field:

select Name, Birthdate from people;
column "Birthdate" could not be found in any table in scope

Schema overriding solves this problem – it automatically maps your data to an overridden schema, so that you don't have to manually update your queries to run against a different schema. Let's take a look at how we can use schema overriding for this simple example...

Let's assume that we're working on a branch called olderData and we want to query the older data on this branch using the schema as it currently exists on the main branch. The first thing we need to do is set the @@dolt_override_schema variable to a commit that contains the schema we want to use. We could use a branch or tag name, or a specific commit, or even an ancestor reference such as myOtherBranch~~. In this case, we're just using the name of our branch that contains our most recent schema. Note that when we set a schema override, it's in place for all tables as long as that session variable is set. If you want to remove the schema override, you can set the variable to NULL.

SET @@dolt_override_schema='main';

At this point, we've got a schema override in place, and if we query any of our data on this olderData branch, it'll automatically be mapped to the schema present at the tip of the main branch. We can now run the same query above that was previously giving us an error:

select Name, Birthdate from people;
+-----------+-----------+
| Name      | Birthdate |
+-----------+-----------+
| Frank     | NULL      |
| Columbia  | NULL      |
| Dr. Scott | NULL      |
+-----------+-----------+
3 rows in set (0.00 sec)

Et voila! Instead of returning an error like before, the query runs and simply fills in NULL values for the Birthdate column, since that column doesn't actually exist in the data on the olderData branch.

A Bigger Example

Let's look at a slightly larger example that's more representative of something you might see in a real world system.

In this example, we're running a shop and using Dolt to track our inventory. We have a new "inventory valuation" report that we've put together to better understand how much investment we have sitting around in unsold inventory, and we want to go back in time and run this report to see how our inventory valuation has changed over the past year. However, we've made several changes to our inventory schema recently, so the query won't run in its current form on older commits.

Let's take a look at our current schema, how it has changed over time, and then see how we can use schema overriding to run our report for older data.

Here's what our schema looks like today... We've got a table called Inventory with the following schema:

dolt/main> describe Inventory;
+--------------+--------------+------+-----+---------+-------+
| Field        | Type         | Null | Key | Default | Extra |
+--------------+--------------+------+-----+---------+-------+
| inventory_id | int          | NO   | PRI | NULL    |       |
| product_id   | int          | YES  |     | NULL    |       |
| quantity     | int          | YES  |     | NULL    |       |
| checked_by   | varchar(255) | YES  |     | NULL    |       |
+--------------+--------------+------+-----+---------+-------+

and a table called Products with the following schema:

dolt/main> describe Products;
+--------------+--------------+------+-----+---------+-------+
| Field        | Type         | Null | Key | Default | Extra |
+--------------+--------------+------+-----+---------+-------+
| product_id   | int          | NO   | PRI | NULL    |       |
| product_name | varchar(255) | YES  |     | NULL    |       |
| cost         | decimal(6,2) | YES  |     | NULL    |       |
| description  | text         | YES  |     | NULL    |       |
+--------------+--------------+------+-----+---------+-------+

Here's the inventory valuation query we're running to see how much we have invested currently in unsold product:

SELECT
    P.product_name,
    P.description,
    I.quantity,
    (I.quantity * P.cost) AS total_value
FROM
    Inventory I
INNER JOIN
    Products P ON I.product_id = P.product_id
ORDER BY
    total_value DESC;

Running this on our current data at the tip of main works great!

dolt/main> SELECT
        ->     P.product_name,
        ->     P.description,
        ->     I.quantity,
        ->     (I.quantity * P.cost) AS total_value
        -> FROM
        ->     Inventory I
        -> INNER JOIN
        ->     Products P ON I.product_id = P.product_id
        -> ORDER BY
        ->     total_value DESC;
+--------------+-------------------------+----------+-------------+
| product_name | description             | quantity | total_value |
+--------------+-------------------------+----------+-------------+
| Water gun    | Simulates a rain storm  | 64       | 256.00      |
| Confetti     | For that special moment | 84       | 126.00      |
| Glow Stick   | When you need a light   | 101      | 101.00      |
| Rice         | Throw it at newlyweds   | 21       | 42.00       |
| Toast        | May I propose a toast?  | 80       | 40.00       |
+--------------+-------------------------+----------+-------------+

But... as soon as we go back to earlier commits, we get errors about columns not being found. Note that we're useing a revision database to lock to a specific commit. This is one of the many ways Dolt enables you to work with different commits, tags, and branches in your database's commit history.

use `dolt/c5nhjj20mv1bm3vsbdnjctnevd4sdnap`;
Database changed

dolt/c5nhjj20mv1bm3vsbdnjctnevd4sdnap> SELECT
        ->     P.product_name,
        ->     P.description,
        ->     I.quantity,
        ->     (I.quantity * P.cost) AS total_value
        -> FROM
        ->     Inventory I
        -> INNER JOIN
        ->     Products P ON I.product_id = P.product_id
        -> ORDER BY
        ->     total_value DESC;
table "p" does not have column "product_id"

Before we get to the punch line and turn on a schema override, let's take a look at what the schema looks like in this older commit and see how the tables have changed.

Here's the Inventory table:

dolt/c5nhjj20mv1bm3vsbdnjctnevd4sdnap> describe Inventory;
+------------+------+------+-----+---------+-------+
| Field      | Type | Null | Key | Default | Extra |
+------------+------+------+-----+---------+-------+
| id         | int  | NO   | PRI | NULL    |       |
| product_id | int  | YES  | MUL | NULL    |       |
| quantity   | int  | YES  |     | NULL    |       |
+------------+------+------+-----+---------+-------+
3 rows in set (0.00 sec)

And here's the Products table:

dolt/c5nhjj20mv1bm3vsbdnjctnevd4sdnap> describe Products;
+-------+--------------+------+-----+---------+-------+
| Field | Type         | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+-------+
| id    | int          | NO   | PRI | NULL    |       |
| name  | varchar(255) | YES  |     | NULL    |       |
| cost  | float        | YES  |     | NULL    |       |
+-------+--------------+------+-----+---------+-------+

Looking at those schemas, we see several changes were made:

The id column in the Inventory table was renamed to inventory_id
A new checked_by column was added to the Inventory table
The id column in the Products table was renamed to product_id
The name column in the Products table was renamed to product_name
A new description column was added to the Products table
The cost column in the Products table was changed from float to decimal(6,2)

Let's set a schema override so that we lock to the schema as it exists on the main branch, and then run our query again.

dolt/c5nhjj20mv1bm3vsbdnjctnevd4sdnap> set @@dolt_override_schema='main';
dolt/c5nhjj20mv1bm3vsbdnjctnevd4sdnap> SELECT
        ->     P.product_name,
        ->     P.description,
        ->     I.quantity,
        ->     (I.quantity * P.cost) AS total_value
        -> FROM
        ->     Inventory I
        -> INNER JOIN
        ->     Products P ON I.product_id = P.product_id
        -> ORDER BY
        ->     total_value DESC;
+--------------+-------------+----------+-------------+
| product_name | description | quantity | total_value |
+--------------+-------------+----------+-------------+
| Water gun    | NULL        | 80       | 320.00      |
| Confetti     | NULL        | 119      | 178.50      |
| Glow Stick   | NULL        | 154      | 154.00      |
| Rice         | NULL        | 54       | 108.00      |
| Toast        | NULL        | 0        | 0.00        |
+--------------+-------------+----------+-------------+

It worked! Thanks to @@dolt_override_schema, we were able to easily run our query on older data without having to manually adjust the query to account for the schema changes.

Limitations

Schema overriding works well with changes such as renaming, adding, and removing columns. However, as the schema changes become more complex, it's more likely that a query will fail when using schema overriding. One, probably obvious, example is when a column has been added and you're using that column in a condition in your query, such as a filter or a join condition. Since the column added in the new version of the schema won't exist in older commits, those values will be NULL, so your conditions likely won't work correctly, since they depend on that column having real data. However, for many cases where the query doesn't use columns that were added in conditions, schema overriding can help make it easier to run queries on older data without having to rewrite them.

There are a few additional limitations when using schema overriding. If one of these is a total deal-breaker for you, let us know by sending us an issue on GitHub or dropping into our Discord server, and we'll be happy to see what we can do to help you out.

Read-only – when a schema override is in place, the database becomes read-only in that session. Attempting to write data or execute DDL will result in an error about the database being read-only.
System tables – Dolt system tables do not honor schema overrides.
Collation changes – Collations affect how data is sorted, including the order rows are stored on disk. Mapping data from one collation to another collation requires extra processing to ensure the returned results are sorted according to the mapped collation. This extra processing is not supported yet, so collation changes will not appear when overriding a schema.
Column defaults – If the overridden schema has added new columns with column defaults, those column defaults do not currently get applied when that column is queried.
Untrackable column changes – Some column changes can't be tracked across commits currently. Dolt is able to track column renames and column type changes, but if a column is both renamed and has its type changed, then Dolt is not able to track that the new column is logically the same column as in the older commit.

Wrap Up

The ability to track all the data and schema changes in your database over its complete history is extremely powerful. Dolt provides many ways to use that history to compute diffs, audit changes, create branches and tags off of older data, and we're excited to add schema overriding to the list of ways to make it easier to access and use your historical data.

Schema overriding is a new feature, so we're eager to hear from you about how you're using it, any snags you hit, or any ideas you have for how we can improve it. Feel free to hit us up on GitHub with an issue or swing by our Discord server and chat with us.

Blog

Schema Overriding

A Bigger Example

Limitations

Wrap Up

Get started with Dolt