Dolt's New Storage Engine


In the beginning, there was Noms. The creation of Aaron Boodman and Attic Labs, Noms introduced Prolly Trees, a novel search index structure that supports diff, merge and sync operations. Noms development has since been halted, but its contributions live on in the open source community. Replicache and Dolt both use Prolly Tree based storage engines.

DoltHub was founded in 2018 with the vision of creating an open platform for data sharing and collaboration. We wanted to create Git-for-Data, and we built Dolt as our central collaboration tool. From the beginning, Dolt was built on Noms. It's why Dolt is the only SQL database that can branch, diff, merge, push, and pull. As DoltHub has grown, our focus has shifted from data-sharing to becoming a production database. This week we launched Hosted Dolt, our first cloud offering of Dolt. As our aims have changed, performance has become a major focus of development. Optimizing indexed access in our storage engine has been particularly important for improving Dolt's efficiency, and we've made significant progress on catching up to MySQL on benchmarks.

Incrementally optimizing Noms has provided substantial gains in efficiency, but we've reached a point of diminishing returns. One major barrier is the need to maintain compatibility with Noms' binary serialization format. Altering this format is a major breaking change for Dolt customers and would require a database migration, but we believe it's the best path forward. Today we're announcing the alpha release of Dolt's new storage engine!

Purpose-Built Storage Engine

Storage engines are at the core of database architecture. They're responsible for indexing data, managing transaction concurrency and isolation, and providing durability guarantees. They're often a key differentiator between databases because design decisions at the storage layer impact the features and performance characteristics of the overall system.

Noms is described as a "decentralized database", a syncable peer-to-peer data store indexed with Prolly Trees. Noms' data format is schema-less, somewhat like a NoSQL database. In general, Noms' storage layer was designed for maximum flexibility, and its data model was reflected in its binary format. However, this flexibility comes at a cost and is largely unnecessary in the context of a relational database like Dolt.

By removing the binary compatibility requirement, we were able to design a new storage engine purpose-built for Dolt. Our new engine borrows heavily from Noms, but its architecture is focused on OLTP access patterns. Specifically, we wanted fast access to indexed data and minimal memory allocation overhead. To achieve this, we wrote a new serialization format for tuples and Prolly Trees. Let's dive into the details.

Static Schemas

Noms' self-describing serialization format encodes type information into the serialized data. Its type system uses type accretion to build up a graph of type descriptors in which each object contains metadata about its children. Each serialized value encodes both its own type and the types of all of its child values. For large indexes, these type descriptors add noticeable overhead in both serialization size and CPU time.

Dolt stores row data in a Prolly Tree with Map<Tuple,Tuple> semantics: key fields are grouped into one tuple, and non-key fields are grouped into another. During index maintenance we enforce map semantics by guaranteeing key uniqueness within the tree. Because Dolt is a relational database, every row has the same layout, corresponding to the static schema of its table. Storing schema information out-of-band allows a much more compact encoding, reducing the size of our indexes. A compact encoding format is important for reducing IO overhead when reading from persistent storage, and when caching index nodes in memory. However, our new encoding format ends up roughly the same size as the old format because of trade-offs we made to optimize deserialization.
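To make the key/value split concrete, here is a minimal sketch of partitioning a row into a key tuple and a value tuple using an out-of-band schema. The `Schema`, `Column`, and `SplitRow` names are hypothetical illustrations, not Dolt's actual types.

```go
package main

import "fmt"

// Column describes one field of a static, out-of-band schema.
type Column struct {
	Name  string
	IsKey bool
}

type Schema []Column

// SplitRow partitions a row's fields into a key tuple (primary key
// fields) and a value tuple (non-key fields), preserving column order.
// Neither tuple needs to carry type tags: the schema is stored once,
// outside the row data.
func SplitRow(sch Schema, row []any) (key, value []any) {
	for i, col := range sch {
		if col.IsKey {
			key = append(key, row[i])
		} else {
			value = append(value, row[i])
		}
	}
	return key, value
}

func main() {
	sch := Schema{
		{Name: "id", IsKey: true},
		{Name: "name", IsKey: false},
		{Name: "age", IsKey: false},
	}
	k, v := SplitRow(sch, []any{42, "ada", 36})
	fmt.Println(k, v) // [42] [ada 36]
}
```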

Fast Random Access

After the type system, the biggest changes in the new storage engine come from the value encoding format. The focus of these changes was to create a format that supports fast, zero-allocation decoding and fast random access. Random access in this context means the ability to access data in the middle of an object without first scanning the entire prefix of the object. Let's look at how index data is serialized to see what's changed and why it matters.

Once serialized, tuples and Prolly Tree nodes have a similar structure. They're essentially container objects that organize their elements as an array: nodes hold a collection of key and value tuples, and tuples hold a collection of fields. This holds for both the old and new formats. The key difference in the new format is that nodes and tuples serialize a set of offsets at the end of their buffer, after the elements themselves. These offsets act as back-pointers into the buffer that enable random access: any value can be located in O(1) by looking up the offsets marking its beginning and end. The zeroth offset can be omitted because the start of the buffer is already known. The new structure of a Tuple looks like this:

| Value 0 | Value 1 | ... | Value K | Offset 1 | Offset 2 | ... | Offset K | Count |

Compared to Tuples in the old format:

| Type 0 | Length 0 | Value 0 | Type 1 | Length 1 | Value 1 | ... | Type K | Length K | Value K |

In the old format, random access also uses offsets, but these offsets must be computed on the fly during deserialization. When an object is read from disk, it's scanned front to back to compute the locations of its elements. Computing offsets in this way means that the serialized data is smaller, but it comes with a cost on the read path: a new buffer must be allocated to store the offsets. It also means that random access is O(n) when an object is first read from persistent storage. Once cached in memory, these offsets can be reused.
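The read-path scan can be sketched as follows. This is an illustrative toy, assuming a 1-byte type tag and a 1-byte length per field; the `encodeOldFormat` and `computeOffsets` names are hypothetical.

```go
package main

import "fmt"

// encodeOldFormat serializes fields with inline type and length
// headers, mimicking the old layout:
//   | type 0 | length 0 | value 0 | type 1 | length 1 | value 1 | ...
func encodeOldFormat(fields ...string) []byte {
	var buf []byte
	for _, f := range fields {
		buf = append(buf, 0x01, byte(len(f))) // type tag, length
		buf = append(buf, f...)
	}
	return buf
}

// computeOffsets scans the buffer front to back to find where each
// field begins. The scan is O(n), and the offset slice is a fresh
// allocation on every cold read.
func computeOffsets(buf []byte) []int {
	var offs []int
	for pos := 0; pos < len(buf); {
		offs = append(offs, pos)
		pos += 2 + int(buf[pos+1]) // skip header plus value bytes
	}
	return offs
}

func main() {
	buf := encodeOldFormat("id", "name", "age")
	fmt.Println(computeOffsets(buf)) // [0 4 10]
}
```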

Benchmark Results

Let's run some benchmarks to get a sense of how the new format performs. Each benchmark was run against the same in-memory persistence layer, so the difference between the old and new storage engines comes down to their serialization formats alone. If you're interested, the source code for the benchmarks is here. The first benchmark does point lookups of individual rows; the second does point inserts. Each benchmark is run at index scales of 10K, 100K, and 1 million rows.

goos: darwin
goarch: amd64
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkMapGet/benchmark_maps_10k/benchmark_new_format_reads-12         	  824058	      1822 ns/op	       0 B/op	       0 allocs/op
BenchmarkMapGet/benchmark_maps_10k/benchmark_old_format_reads-12         	  199551	      5893 ns/op	    2783 B/op	      47 allocs/op
BenchmarkMapGet/benchmark_maps_100k/benchmark_new_format_reads-12        	  663870	      1869 ns/op	       0 B/op	       0 allocs/op
BenchmarkMapGet/benchmark_maps_100k/benchmark_old_format_reads-12        	  127732	      9615 ns/op	    3702 B/op	      66 allocs/op
BenchmarkMapGet/benchmark_maps_1M/benchmark_new_format_reads-12          	  508759	      3162 ns/op	       0 B/op	       0 allocs/op
BenchmarkMapGet/benchmark_maps_1M/benchmark_old_format_reads-12          	   58594	     19413 ns/op	    5114 B/op	      88 allocs/op
goos: darwin
goarch: amd64
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkMapUpdate/benchmark_maps_10k/benchmark_new_format_writes-12         	   14779	     68341 ns/op	   19563 B/op	      54 allocs/op
BenchmarkMapUpdate/benchmark_maps_10k/benchmark_old_format_writes-12         	    1684	    660560 ns/op	  486386 B/op	    3204 allocs/op
BenchmarkMapUpdate/benchmark_maps_100k/benchmark_new_format_writes-12        	   16516	     72802 ns/op	   21005 B/op	      54 allocs/op
BenchmarkMapUpdate/benchmark_maps_100k/benchmark_old_format_writes-12        	    1294	   1054807 ns/op	  713440 B/op	    4905 allocs/op
BenchmarkMapUpdate/benchmark_maps_1M/benchmark_new_format_writes-12          	   12213	     96054 ns/op	   28961 B/op	      54 allocs/op
BenchmarkMapUpdate/benchmark_maps_1M/benchmark_old_format_writes-12          	    1219	   1025194 ns/op	  814927 B/op	    5089 allocs/op

On the read path we're between 3x and 6x faster. On the write path we're consistently about 10x faster!

Future Work

The new storage engine and serialization format are still experimental, but we're releasing an early version to users in order to gather feedback and to preview future performance gains. It's currently behind a feature flag, but you can turn it on by setting the environment variable DOLT_DEFAULT_BIN_FORMAT=__DOLT_1__ and creating a new database with dolt init.
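Concretely, trying the alpha format on a fresh database looks like this (the directory name is arbitrary):

```shell
# opt in to the new (alpha) storage format
export DOLT_DEFAULT_BIN_FORMAT=__DOLT_1__

# the flag applies to newly created databases
mkdir new-format-db && cd new-format-db
dolt init
```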

While the new format is in development, the full feature set of Dolt is not yet supported; in particular, many CLI utilities have not yet been ported. The focus of this alpha release is dolt sql and dolt sql-server, which are largely supported with the exception of some DDL and foreign key operations. A general release of the new storage engine and storage format will be available later this year and will include a migration path for existing databases. If you have any questions or want more information about the storage engine roadmap, join us on our Discord!
