Extending Shared Datasets

USE CASE
7 min read

Dolt is the world's first fully-versioned, relational database. You can branch, merge, diff, push, and pull your relational data in all the same ways that Git allows you to work with source code files.

One of the powerful features of Dolt is how it enables people to share, collaborate on, and extend datasets. With Dolt, it's easy to take a shared dataset, clone it locally, then modify the data as needed. You can keep your changes in a fork or clone of the database and easily merge in new changes from the origin database. For example, you can add new tables to track additional information, update existing data, or tune indexes for your needs. It's easy to customize your copy of the database, and best of all, you can still continue to sync changes from the origin database easily and efficiently.

This is a stark contrast to traditional ways of updating forks of shared datasets. Without Dolt, there's a good chance you'd end up downloading the full dataset each time you wanted to update your instance. It's a pretty big bummer to have to download a full dataset just to get a couple of new rows that were added! In addition to that, if you've made changes to your local copy, then you'll have to either manually reapply those changes or create and maintain your own scripts that replay any modifications you need.

With Dolt, this really is as simple as running: dolt pull. When you run that, Dolt will use its extremely efficient prolly tree diffing logic to calculate exactly what changes you need to pull down from the origin database, and then it's going to merge those changes into your local database. That means you'll never have to pull down any data you already have locally and any changes you've committed to your local copy will be kept. This gives you a really easy way to keep your local copy of the database in sync with the origin database, while still keeping all your custom changes. The only exception to this is if you make changes that conflict with new changes made in the origin database. In that case, Dolt will let you know about the conflict and you'll need to resolve it before finishing the merge. In common cases where you're simply extending or augmenting the existing data, you'll never have to worry about conflicts.

DoltHub: A Place to Share and Collaborate on Dolt Databases

Speaking of shared datasets... we're so passionate about the benefits of decentralized, version-controlled databases, that in addition to the Dolt database engine itself, we also built DoltHub – a place to explore, share, and collaborate on public or private datasets. DoltHub is to Dolt databases, as GitHub is to Git repositories. There's even the familiar Pull Request workflow that you probably already know and love from GitHub.

You can find a ton of shared, public datasets online at DoltHub. One popular dataset is post-no-preference/stocks. This Dolt database contains US stock symbol information, daily stock values for open, high, low, and close prices, as well as information on stock splits and stock dividends. This Dolt database is maintained by a DoltHub customer, and he's done a great job keeping the data updated daily for several years now. (btw, he also makes another great Dolt database for stock options that you can find on DoltHub, too!)

Example: Adding a New Secondary Index

Let's walk through a concrete example and see how this really works...

Using the post-no-preference/stocks Dolt database I mentioned above, we want to write an application that works with the daily high price for each stock. Our application will use those daily high prices to try and predict future daily high prices, so we need to be able to quickly query the daily high prices for each stock symbol and also to quickly identify which stocks have had high prices over specific thresholds. The primary key on the table already lets us efficiently query the daily high prices for a specific stock symbol on a specific date, but it doesn't help us efficiently find when and which stocks had a high price over a specific threshold. To do this efficiently, we'll need to add a new secondary index on the high column of the ohlcv table.

Let's start by cloning the database from DoltHub and starting up a Dolt SQL shell:

dolt clone post-no-preference/stocks
cd stocks
dolt sql 

To test out merging in updates to our new secondary index, we're going to create a new local branch that is based off of the data from two days ago. Otherwise, we'd have to wait for the origin database to update before we could show how to merge in updates. At the end of this example, we'll merge in the latest updates and show how efficiently Dolt is able to merge those updates into our secondary index.

call dolt_branch('branch1', (select commit_hash from dolt_log where date < DATE_SUB(NOW(), INTERVAL 2 DAY) limit 1));
call dolt_checkout('branch1');

Let's break down what just happened there before we go any further... we're using the dolt_branch() stored procedure to create a new branch called branch1 and we're passing in a commit hash to tell Dolt what commit that new branch should point to. Without passing in that second parameter, Dolt would, by default, create the new branch pointing at the HEAD of the branch we have currently checked out. Instead of passing in a literal commit hash value, we're using a SQL subquery to look into the dolt_log system table and find the first reachable commit that is older than 2 days ago. After that we just call the dolt_checkout() stored procedure to switch over to that new branch.

To be sure our new branch was created correctly, let's take a look at the dolt_branches system table and see what it says:

stocks> select * from dolt_branches;
+---------+----------------------------------+--------------------+-----------------------------------+---------------------+--------------------------+--------+--------+
| name    | hash                             | latest_committer   | latest_committer_email            | latest_commit_date  | latest_commit_message    | remote | branch |
+---------+----------------------------------+--------------------+-----------------------------------+---------------------+--------------------------+--------+--------+
| branch1 | imb2u9oek9pf1hqk8ilqte9dbdajf0at | post-no-preference | post.no.preference@protonmail.com | 2023-12-11 09:00:38 | symbol 2023-12-10 update |        |        |
| master  | l8440mjkrdpph8e6hauu6vvj8677ggg7 | post-no-preference | post.no.preference@protonmail.com | 2023-12-13 09:46:38 | ohlcv 2023-12-12 update  | origin | master |
+---------+----------------------------------+--------------------+-----------------------------------+---------------------+--------------------------+--------+--------+

This looks good! We see our new branch and by looking at the latest_commit_date column, we can see that it is indeed pointing to a commit two days older than the other branch.

Now, let's take a look at the ohlcv table and see what indexes it has on it:

stocks> show create table ohlcv;
+-------+------------------------------------------------------------------+
| Table | Create Table                                                     |
+-------+------------------------------------------------------------------+
| ohlcv | CREATE TABLE `ohlcv` (                                           |
|       |   `date` date NOT NULL,                                          |
|       |   `act_symbol` varchar(16383) NOT NULL,                          |
|       |   `open` decimal(7,2),                                           |
|       |   `high` decimal(7,2),                                           |
|       |   `low` decimal(7,2),                                            |
|       |   `close` decimal(7,2),                                          |
|       |   `volume` bigint,                                               |
|       |   PRIMARY KEY (`date`,`act_symbol`),                             |
|       |   KEY `ohlcv_act_symbol_idx` (`act_symbol`)                      |
|       | ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_bin |
+-------+------------------------------------------------------------------+

We can see that the primary key covers the date and stock symbol, and that there's one secondary index that covers just the stock symbol. However, there's no secondary index that tracks the high column that our application needs to work with. That means, when we write queries again that column, they're not going to have the benefit of using an efficient index, and they'll end up scanning more rows than is necessary.

Fortunately, it's very easy for us to fix that in our local clone and still easily stay in sync with updates from the origin database. All we have to do is add that new secondary index on our clone, and then create a new Dolt commit in our commit graph for that change. On my machine, it took about three minutes to build this new index from scratch:

alter table ohlcv add key(high);
call dolt_commit('-am', 'Adding a new index for daily highest price per stock symbol');

Now that we've got our new index created and committed, let's take a look at the experience of pulling in updates to the data. If we were working off of the main branch for this database, then we'd simply run dolt pull to pull in the latest changes and merge it with our local changes, but remember that we're working off of a branch that is two days behind the main branch so that we can easily simulate new data coming in, without having to wait for the origin database to be updated. Because of how Dolt (and Git) work, these two workflows are almost identical – dolt pull internally runs dolt fetch followed by dolt merge, but with our setup, we can skip the fetch and just run dolt merge directly.

call dolt_merge('master');
+----------------------------------+--------------+-----------+
| hash                             | fast_forward | conflicts |
+----------------------------------+--------------+-----------+
| bmlfjiptqpp25n895ogl9k6bvrekodns | 0            | 0         |
+----------------------------------+--------------+-----------+
1 row in set (1.39 sec)

On my machine, merging in the latest changes and updating the new secondary index took a whopping 1.4s! 🎉 That's pretty darn impressive, both in the time it took and also in how easy it was to perform. This is especially true when you contrast this to older approaches of pulling in updates from shared datasets where you'd probably have to start a fresh database, pull down and import the ENTIRE shared dataset, then reapply all your customizations, including building that new secondary index from scratch every time.

Wrap up

Hopefully this post showed you a small taste of the many benefits of using a version-controlled, decentralized database like Dolt. Being able to extend a shared dataset and then quickly and efficiently pull down updates that get seamlessly merged in to your local copy is one of the many ways Dolt makes working with data easier and more fun.

If you haven't tried out Dolt yet, install it, explore some datasets on DoltHub, and let us know what you think! If you hit any problems or think of feature requests, we'd love to hear about it! You can send us an issue on GitHub, or swing by our Discord server and chat with us. Our dev team hangs out on Discord every day while we're working and we're always happy to talk about databases, versioning, and data sharing! 🤓

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt

Or join our mailing list to get product updates.