Dolt's New Format is live on DoltHub!


Dolt is a MySQL-compatible database with killer version control features. Its data diff, branch, and merge features give applications version control functionality at the database layer, and its Git-like interface makes it one of the most developer-friendly databases too. Over the last year, the team at DoltHub has been working on a new database storage format that improves Dolt's performance, bringing it closer to MySQL.

The new format currently benchmarks at about 3x slower than MySQL, compared with 8.3x slower for the old format. The interface is 100% compatible, so no changes other than a migration are required. You can read more about how we benchmark Dolt and how the new format is different in the footnotes.

Today, we're officially launching support for the new format on DoltHub! 🎉 Any new database created through the DoltHub UI will use the new format, and you can also push existing new format databases to DoltHub.

New format databases have a special __DOLT__ badge:

List badge

Detail badge

How did we add support for the new format on DoltHub?

When we swapped out the storage format, any machinery that used the storage representation of a row directly had to be redone. That representation is typically a wrapper around a byte buffer with functionality to decode the row into Go types. Some of the machinery could be changed to use a non-storage row representation (a slice of Go types). The rest had to be rewritten completely.

One difficulty was getting DoltHub's test suite running in the new format. As part of our tests, we have static Dolt databases checked into the repository. The first issue was that these Dolt databases were old and difficult to migrate to the new format. I had to check out old versions of Dolt and run migration scripts with custom modifications. Overall I felt pretty dirty doing that kind of thing, but in the end it worked. 🤷

The second issue was that all of the commit hashes changed in the new format. Because the row data of each commit is represented differently, the commit hashes necessarily changed. Any test that used a commit hash had to be rewritten. For each test, a mapping between commits was created:

// The left column contains new format hashes, the right column contains 
// the corresponding old format hash.

cmMap := mapCommitsAcrossFormats(
  "mqp9g4usos98lc7jcpvpg481sgbqpu22", "96g44hrs94m4elbi9l7nc1g5fqqbvi6n",
  "7k5k4naglndud8ipg8ecd187krbnfj2c", "9ad33lc8kvsf262c7uqcnfdbt7dllrvq",
  "ds0pftbqfj676ef0mkbl5b6ihgj60g1k", "v077lj2oeqgqekpn4786o4rlbcsn5km3",
)

// Reference commit #1 regardless of the format.
commit1 := cmMap["mqp9g4usos98lc7jcpvpg481sgbqpu22"]
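The helper itself isn't shown in this post. As a rough sketch of how such a helper could work, the version below assumes it returns a map keyed by the new format hash whose value is the hash valid for the format under test. The useNewFormat switch and the function body are assumptions made for illustration, not DoltHub's actual test code.

```go
package main

import "fmt"

// useNewFormat is a hypothetical switch. A real test suite would derive it
// from the storage format of the database under test.
var useNewFormat = true

// mapCommitsAcrossFormats takes alternating (new format, old format) hashes
// and returns a map keyed by the new format hash. Each value is the hash
// that applies to the format currently under test.
func mapCommitsAcrossFormats(pairs ...string) map[string]string {
	if len(pairs)%2 != 0 {
		panic("mapCommitsAcrossFormats: hashes must come in pairs")
	}
	m := make(map[string]string, len(pairs)/2)
	for i := 0; i < len(pairs); i += 2 {
		newHash, oldHash := pairs[i], pairs[i+1]
		if useNewFormat {
			m[newHash] = newHash
		} else {
			m[newHash] = oldHash
		}
	}
	return m
}

func main() {
	cmMap := mapCommitsAcrossFormats(
		"mqp9g4usos98lc7jcpvpg481sgbqpu22", "96g44hrs94m4elbi9l7nc1g5fqqbvi6n",
	)
	// Tests always index by the new format hash.
	fmt.Println(cmMap["mqp9g4usos98lc7jcpvpg481sgbqpu22"])
}
```

Keying on one format keeps every test readable in a single set of hashes while still letting the suite run against either format.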

It was pretty painful to build these mappings. First, a list of all commits was dumped using SELECT * FROM dolt_commits; then the commit messages were used to manually identify matching commits. A possible improvement here is to automatically encode this mapping during the migration.
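The matching was done by hand, but the same idea can be sketched in code. The example below assumes each dump has been reduced to a hash and a message per commit, and pairs commits whose message appears exactly once in both formats. The commit type, the matchByMessage function, and the placeholder hashes are all hypothetical. Messages that repeat or have no counterpart are returned for manual review rather than guessed.

```go
package main

import "fmt"

// commit holds the two fields of a dumped commit that this sketch needs.
type commit struct {
	Hash    string
	Message string
}

// matchByMessage pairs new format commits with old format commits that share
// a message. It returns the pairs (new hash -> old hash) and the messages
// that could not be matched uniquely, in first-seen order.
func matchByMessage(newCommits, oldCommits []commit) (map[string]string, []string) {
	newByMsg := make(map[string][]string)
	for _, c := range newCommits {
		newByMsg[c.Message] = append(newByMsg[c.Message], c.Hash)
	}
	oldByMsg := make(map[string][]string)
	for _, c := range oldCommits {
		oldByMsg[c.Message] = append(oldByMsg[c.Message], c.Hash)
	}

	pairs := make(map[string]string)
	var unresolved []string
	seen := make(map[string]bool)
	for _, c := range newCommits {
		n, o := newByMsg[c.Message], oldByMsg[c.Message]
		if len(n) == 1 && len(o) == 1 {
			pairs[n[0]] = o[0]
			continue
		}
		if !seen[c.Message] {
			seen[c.Message] = true
			unresolved = append(unresolved, c.Message)
		}
	}
	return pairs, unresolved
}

func main() {
	// Placeholder hashes and messages, not real commits.
	newCommits := []commit{{"newhash1", "Initialize data repository"}, {"newhash2", "Add rows"}}
	oldCommits := []commit{{"oldhash1", "Initialize data repository"}, {"oldhash2", "Add rows"}}
	pairs, unresolved := matchByMessage(newCommits, oldCommits)
	fmt.Println(pairs, unresolved)
}
```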

The most difficult task was reimplementing the scoreboard calculation for DoltHub's data bounties. The scoreboard compresses authorship information of the entire database. It has to track which rows were inserted by a bounty participant and who else modified individual cells of that row. All of this is done without storing additional logs or metadata. For each PR, a diff is calculated and that is used to attribute each cell to a participant. This is what a scoreboard typically looks like:

Scoreboard

You can view the full bounty here
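To make the attribution idea concrete, here is a minimal sketch that replays per-PR diffs in merge order and credits each cell to the last participant who changed it. This is not DoltHub's scoreboard implementation. The cellKey and prDiff types, the last-writer-wins rule, and the one-point-per-cell scoring are assumptions chosen to keep the example small.

```go
package main

import "fmt"

// cellKey identifies one cell: a row's primary key plus a column name.
type cellKey struct {
	RowKey string
	Column string
}

// prDiff is a hypothetical, simplified diff for one merged PR: the author
// and the cells whose values the PR inserted or modified.
type prDiff struct {
	Author  string
	Changed []cellKey
}

// attribute replays PR diffs in merge order and records, for every cell,
// the last author who changed it. No extra log is kept: the result is
// derived entirely from the diffs.
func attribute(diffs []prDiff) map[cellKey]string {
	owner := make(map[cellKey]string)
	for _, d := range diffs {
		for _, c := range d.Changed {
			owner[c] = d.Author
		}
	}
	return owner
}

// scoreboard counts how many cells are currently attributed to each author.
func scoreboard(owner map[cellKey]string) map[string]int {
	scores := make(map[string]int)
	for _, author := range owner {
		scores[author]++
	}
	return scores
}

func main() {
	diffs := []prDiff{
		// alice inserts a row with two cells.
		{Author: "alice", Changed: []cellKey{{"row1", "price"}, {"row1", "city"}}},
		// bob later corrects one cell of that row.
		{Author: "bob", Changed: []cellKey{{"row1", "price"}}},
	}
	fmt.Println(scoreboard(attribute(diffs)))
}
```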

How to use the new format on DoltHub

If you don't have an existing database, create one here and run a SQL query or import a CSV. On the CLI, you can use dolt init --new-format and then push the database to DoltHub.

If you have an existing database on DoltHub, follow these instructions to migrate it:

  1. Clone your database locally using dolt clone.

  2. Migrate the database using dolt migrate.

  3. Create a new database on DoltHub to house your new format database.

  4. Add the remote to Dolt by following the instructions on the database page:

Push existing

  5. Delete your old Dolt database on DoltHub and inform any colleagues of the new remote URL.

What's Next?

We're nearing the end of the release process for the new format. We are testing it using the us-housing-prices-v2 bounty. The bounty database is ~85GB, which has helped uncover some issues and made us more confident about a 1.0 release. Soon, we'll also add a migration button to DoltHub so you can migrate your databases with one click.

The team at DoltHub has been eager to get the new format released. The performance wins have been super exciting to see and we can't wait for more users to switch over. Come talk to us on Discord if you have any questions!

Additional Reading

If you'd like to learn more about Dolt's new storage format:
