Migrating Dolt's Binary Format

REFERENCE
5 min read

Dolt is a MySQL-compatible version-controlled database. It's the only SQL database with branches that you can diff, merge and sync.

Over the past year, the team at DoltHub has been working on a new storage engine for Dolt. Earlier this summer we released an alpha version of the new storage engine as a preview for the Dolt community. Since then, we've been gathering feedback and tinkering with the internals to improve correctness and performance.

One of the major changes with the new storage engine is that it reads and writes data in a new binary serialization format. Changing the binary format is necessary to continue to improve Dolt's performance, but it also means making a disruptive breaking change for users. To make the transition as smooth as possible, we've built a binary migration tool that is packaged into new Dolt releases. Today's blog will dive into how and why we're migrating Dolt's binary format, and how you can run a migration yourself.

Motivation

Dolt's current storage engine evolved from Noms, an open-source data store that introduced Prolly Trees. Building on top of Noms allowed us to get a head start on early versions of Dolt, but it also meant that we inherited its design decisions. From the beginning, Dolt used a relational data model with tables and static schemas. In contrast, the Noms used a schemaless NoSQL-like model that prioritized flexibility and self-describing data. As performance became a larger priority for Dolt, we began to optimize Noms for the access patterns that were most important for us. Initially, we made major strides in our OLTP benchmarks. Eventually, however, we realized that the serialization format was a limiting factor to continued performance gains. Roughly a year ago, we set about designing a new storage engine and serialization format purpose-built for Dolt. The result is a storage engine that's more than twice as fast across our standard benchmarks.

+------------------------+-------+-------+----------+
|  Old Format Read Tests | MySQL |  Dolt | Multiple |
+------------------------+-------+-------+----------+
| covering_index_scan    |  2.00 |  6.55 |      3.3 |
| groupby_scan           | 12.30 | 22.69 |      1.8 |
| index_join_scan        |  1.12 | 18.61 |     16.6 |
| index_scan             | 30.26 | 63.32 |      2.1 |
| oltp_point_select      |  0.15 |  0.60 |      4.0 |
| oltp_read_only         |  2.97 |  9.73 |      3.3 |
| select_random_points   |  0.30 |  1.39 |      4.6 |
| select_random_ranges   |  0.35 |  1.55 |      4.4 |
| table_scan             | 30.81 | 58.92 |      1.9 |
| types_table_scan       | 68.05 | 569.6 |      8.4 |
+------------------------+-------+-------+----------+
|   Reads Mean Multiple  |               |      5.0 |
+------------------------+---------------+----------+

+------------------------+-------+-------+----------+
| Old Format Write Tests | MySQL | Dolt  | Multiple |
+------------------------+-------+-------+----------+
| bulk_insert            | 0.001 | 0.001 |      1.0 |
| oltp_delete_insert     |  2.91 | 20.00 |      6.9 |
| oltp_insert            |  1.44 |  8.28 |      5.7 |
| oltp_read_write        |  5.18 | 37.56 |      7.3 |
| oltp_update_index      |  1.44 |  9.56 |      6.6 |
| oltp_update_non_index  |  1.50 |  6.67 |      4.4 |
| oltp_write_only        |  2.22 | 26.68 |     12.0 |
| types_delete_insert    |  3.07 | 155.8 |     50.7 |
+------------------------+-------+-------+----------+
|  Writes Mean Multiple  |               |     11.8 |
+------------------------+---------------+----------+
| Overall Mean Multiple  |               |      8.1 |
+------------------------+---------------+----------+
+------------------------+-------+-------+----------+
|  New Format Read Tests | MySQL |  Dolt | Multiple |
+------------------------+-------+-------+----------+
| covering_index_scan    |   2.0 |  2.81 |      1.5 |
| groupby_scan           |  12.3 | 24.83 |      2.0 |
| index_join_scan        |  1.12 |  7.98 |      7.0 |
| index_scan             | 30.26 | 44.17 |      1.4 |
| oltp_point_select      |  0.15 |  0.50 |      3.3 |
| oltp_read_only         |  2.97 |  8.74 |      2.9 |
| select_random_points   |  0.30 |  0.83 |      2.8 |
| select_random_ranges   |  0.35 |  1.23 |      3.5 |
| table_scan             | 30.81 | 52.89 |      1.7 |
| types_table_scan       | 68.05 | 539.7 |      7.7 |
+------------------------+-------+-------+----------+
|   Reads Mean Multiple  |               |      3.4 |
+------------------------+---------------+----------+

+------------------------+-------+-------+----------+
| New Format Write Tests | MySQL | Dolt  | Multiple |
+------------------------+-------+-------+----------+
| bulk_insert            | 0.001 | 0.001 |      1.0 |
| oltp_delete_insert     |  2.91 | 11.04 |      3.5 |
| oltp_insert            |  1.44 |  2.91 |      1.9 |
| oltp_read_write        |  5.18 | 17.63 |      3.4 |
| oltp_update_index      |  1.44 |  4.91 |      3.0 |
| oltp_update_non_index  |  1.50 |  5.28 |      3.3 |
| oltp_write_only        |  2.22 |  8.58 |      3.8 |
| types_delete_insert    |  3.07 | 12.75 |      4.2 |
+------------------------+-------+-------+----------+
|  Writes Mean Multiple  |               |      3.0 |
+------------------------+---------------+----------+
| Overall Mean Multiple  |               |      3.2 |
+------------------------+---------------+----------+

The new Dolt storage engine is faster across every test that we measure. For use cases working with TEXT fields or other large data (types_delete_insert) it's ten times faster! On the whole, writes get proportionally faster than reads when comparing the two storage engines. While not captured in these metrics, the decreased latency of write queries has a major impact on transactional throughput. Anecdotally, we've also seen dramatic improvements in Dolt system table performance after switching to the new storage engine. We plan to detail these differences in an upcoming post.

Migrating to the New Format

In order to realize the performance wins of the new format, Dolt users need migrate their databases from the existing Dolt format. In a traditional database, this process would likely be as simple as exporting data to an archive, upgrading, and reimporting the data. However, Dolt's unique history and versioning features complicate its migration story. A single Dolt database may have thousands of branches and commits, and we need to maintain all of that information as we roll forward to the new format. Thankfully we have a clean conceptual model for how to do this: rebase.

Dolt doesn't have an explicit rebase command, but it exposes rebase like functionality through the dolt filter-branch command. Filter-branch operates on a database by first rewinding a portion of the history, and then editing each commit as it replays the history. Dolt's new format migration tool works much the same way, except the transformation applied to each commit is made only to the physical representation of the persisted data. At a logical level, the data in each table is unchanged. To make this operation performant we leverage Dolt's efficient diff operation to minimize the amount of work needed to transform each commit. During the "replay" stage of the rebase we diff the current commit against its parent to find the novel data and migrate only these rows.

The migration operation is built to be self-validating. The logical equivalence of each row and each table is checked during the migration using our SQL engine go-mysql-server. After each commit is migrated, each of its tables are validated to ensure that they contain the same rows pre- and post-migration.

Try It Today!

The latest version of Dolt, v0.40.25, is the beta release for the new storage engine and includes the new dolt migrate command. The new binary format is stable as of this release and no compatibility-breaking changes will be made going forward. Running on the new storage engine, Dolt is 99.74% correct against sqllogictests. The newer storage engine is still an opt-in feature for Dolt, but will become the default later this year. If you want to use the new format for a fresh Dolt database use dolt init --new-format when initializing. And if you're unsure of which format a database is using, you can run dolt version to check.

Rewriting the storage engine has been a major undertaking for the DoltHub team, but the results speak for themselves. It's a major step on our journey toward performance parity with MySQL. Our goal in transitioning formats is to ensure the process is seamless for Dolt users. If you have any questions or concerns with the migration please file a Github issue or speak with us directly on Discord.

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt

Or join our mailing list to get product updates.