In the beginning, there was Noms. The creation of
Aaron Boodman and Attic Labs, Noms introduced Prolly
Trees, a novel search index structure that supports diff, merge and sync operations. Noms development has since been
halted, but its contributions live on in the open source community. Replicache and
Dolt both use Prolly-Tree-based storage engines.
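A rough sketch of the idea behind Prolly Trees helps explain those properties: chunk boundaries are chosen by hashing the data itself, so identical runs of keys always chunk the same way, and two trees holding mostly the same data share most of their chunks. The hash function and boundary rule below are illustrative assumptions, not Noms' actual parameters.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// chunkBoundary decides whether a key ends a chunk by hashing the key and
// testing the hash against a probability threshold. The decision depends
// only on the key itself, never on its position, so the same keys always
// produce the same boundaries. That determinism is what lets two trees
// with mostly shared data reuse chunks, making diff and sync cheap.
func chunkBoundary(key string, avgChunkSize uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32()%avgChunkSize == 0
}

// chunk splits a sorted run of keys into content-defined chunks.
func chunk(keys []string, avgChunkSize uint32) [][]string {
	var chunks [][]string
	var cur []string
	for _, k := range keys {
		cur = append(cur, k)
		if chunkBoundary(k, avgChunkSize) {
			chunks = append(chunks, cur)
			cur = nil
		}
	}
	if len(cur) > 0 {
		chunks = append(chunks, cur)
	}
	return chunks
}

func main() {
	keys := []string{"user:001", "user:002", "user:003", "user:004", "user:005", "user:006"}
	// Boundary decisions depend only on key bytes, so re-chunking the
	// same keys always yields the same chunks.
	fmt.Println(len(chunk(keys, 2)))
}
```

Because inserting a key perturbs only the chunk it lands in, most of a tree's nodes are untouched by an edit and can be shared structurally between versions.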
DoltHub was founded in 2018 with the vision of creating an open platform for data sharing and collaboration. We wanted
to create Git-for-Data, and we built Dolt as our central collaboration tool. From the beginning, Dolt was built on Noms.
It's why Dolt is the only SQL database that can branch, diff, merge, push, and pull. As DoltHub has grown, our focus has
shifted from data-sharing to becoming a production database. This week we launched
Hosted Dolt, our first cloud offering of Dolt. As our aims have
changed, performance has become a major focus of development. Optimizing indexed access in our storage engine has been
particularly important for improving Dolt's efficiency, and we've made significant progress on catching up to MySQL.
Incrementally optimizing Noms has provided substantial gains in efficiency, but we've reached a point of diminishing
returns. One major barrier is the need to maintain compatibility with Noms' binary serialization format. Altering
this format is a major breaking change for Dolt customers and would require a database migration, but we believe it's
the best path forward. Today we're announcing the alpha release of Dolt's new storage engine!
Purpose-Built Storage Engine
Storage engines are at the core of database architecture. They're responsible for indexing data, managing transaction
concurrency and isolation, and providing durability guarantees. They're often a key differentiator between databases
because design decisions at the storage layer impact the features and performance characteristics of the overall system.
Noms is described as a "decentralized database", a syncable peer-to-peer data store indexed with Prolly Trees. Noms'
data format is schema-less, somewhat like a NoSQL database. In general, Noms'
storage layer was designed for maximum flexibility, and its data model was reflected in its binary format. However, this
flexibility comes at a cost and is largely unnecessary in the context of a relational database like Dolt.
By removing the binary compatibility requirement, we were able to design a new storage engine purpose-built for Dolt.
Our new engine borrows heavily from Noms, but its architecture is focused on OLTP access patterns. Specifically, we
wanted fast access to indexed data and minimal memory allocation overhead. To achieve this, we wrote a new
serialization format for tuples and Prolly Trees. Let's dive into the details.
Noms' self-describing serialization format encodes type information into the serialized data. Its type system uses
type accretion to build up a graph of
type descriptors in which each object contains metadata about its children. Each serialized value encodes both its own
type and the types of all of its child values. For large indexes, these type descriptors add noticeable overhead,
both in serialization size and in CPU time.
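As a concrete illustration of that per-value overhead, here is a hypothetical self-describing encoding in the spirit of what's described above; the one-byte type tags and (type, length, value) layout are invented for this sketch and are not Noms' actual wire format.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Hypothetical one-byte type tags, invented for illustration.
const (
	typeUint32 = 0x01
	typeString = 0x02
)

// encodeSelfDescribing writes each field as (type, length, value), so a
// reader needs no external schema, but every row repeats the metadata.
func encodeSelfDescribing(id uint32, name string) []byte {
	buf := make([]byte, 0, 16)
	buf = append(buf, typeUint32, 4)
	var tmp [4]byte
	binary.BigEndian.PutUint32(tmp[:], id)
	buf = append(buf, tmp[:]...)
	buf = append(buf, typeString, byte(len(name)))
	buf = append(buf, name...)
	return buf
}

func main() {
	row := encodeSelfDescribing(7, "dolt")
	// 4 bytes of (type, length) metadata wrap 8 bytes of actual data.
	fmt.Println(len(row)) // prints: 12
}
```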
Dolt stores row data in a Prolly Tree with
Map<Tuple,Tuple> semantics: key fields are grouped in one tuple, and
non-key fields are grouped into another. During index maintenance, we enforce map semantics by guaranteeing key
uniqueness within the tree. Because Dolt is a relational database, each row has the same layout, corresponding to the
static schema of the table. Storing schema information out-of-band creates a much more compact encoding, reducing the
size of our indexes. A compact encoding format is important for reducing IO overhead when reading from persistent
storage, and when caching index nodes in memory. However, our new encoding format is roughly the same size as the old
format because of decisions we made to optimize deserialization.
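For contrast with a self-describing format, here is a sketch of a schema-aware encoder for a hypothetical two-column (uint32 id, string name) row: because the layout is fixed by the table's schema and stored out-of-band, only the values themselves are written. The column types and helper name are assumptions for illustration, not Dolt's actual encoding.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeWithSchema serializes a hypothetical (uint32 id, string name) row.
// The layout is dictated by the table's schema, which lives out-of-band,
// so no type tags or descriptors are repeated per row. (In the new format,
// variable-length fields are located via the tuple's offset table rather
// than inline lengths; that part is omitted here.)
func encodeWithSchema(id uint32, name string) []byte {
	buf := make([]byte, 0, 4+len(name))
	var tmp [4]byte
	binary.BigEndian.PutUint32(tmp[:], id)
	buf = append(buf, tmp[:]...)
	buf = append(buf, name...)
	return buf
}

func main() {
	row := encodeWithSchema(7, "dolt")
	// 8 bytes of pure data; a self-describing (type, length, value)
	// encoding of the same row would spend extra bytes on every field.
	fmt.Println(len(row)) // prints: 8
}
```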
Fast Random Access
After the type system, the biggest changes in the new storage engine come from the value encoding format. The focus of
these changes was to create a format that supports fast, zero-allocation decoding and fast random access. Random access
in this context means the ability to access data in the middle of an object without first scanning the entire prefix of
the object. Let's look at how index data is serialized to see what's changed and why it matters.
Once serialized, tuples and Prolly Tree nodes have a similar structure. They're essentially container objects that
organize their elements as an array: nodes hold a collection of key and value tuples, tuples hold a collection of
fields. This holds for both the old and new formats. The key difference is that in the new format, nodes and tuples
serialize offsets at the end of their buffers after their elements have been serialized. Offsets act like back pointers
within the buffer that enable random access: any value can be accessed in O(1) by looking up the offsets of its
beginning and end. The zeroth offset can be omitted because the start of the buffer is already known. The new structure of a
Tuple looks like this:
| Value 0 | Value 1 | ... | Value K | Offset 1 | Offset 2 | ... | Offset K | Count |
Compared to Tuples in the old format:
| Type 0 | Length 0 | Value 0 | Type 1 | Length 1 | Value 1 | ... | Type K | Length K | Value K |
In the old format, random access also uses offsets, but these offsets must be computed on the fly during
deserialization. When an object is read from disk it's scanned front to back to compute the locations of its elements.
Computing offsets on the fly keeps the serialized data smaller, but it
comes at a cost on the read path: a new buffer must be allocated to store the computed offsets. It also means that random
access is O(n) the first time an object is read from persistent storage. Once the object is cached in memory, its offsets can be reused.
Let's run some benchmarks to get a sense of how the new format performs. Each benchmark was performed against the same
in-memory persistence layer; the only difference between the old and new storage engines is their serialization
formats. If you're interested, the source code for the benchmarks is here.
The first benchmark does point lookups of individual rows; the second does point inserts. Each benchmark
is run at index scales of 10K, 100K, and 1 million rows.
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkMapGet/benchmark_maps_10k/benchmark_new_format_reads-12 824058 1822 ns/op 0 B/op 0 allocs/op
BenchmarkMapGet/benchmark_maps_10k/benchmark_old_format_reads-12 199551 5893 ns/op 2783 B/op 47 allocs/op
BenchmarkMapGet/benchmark_maps_100k/benchmark_new_format_reads-12 663870 1869 ns/op 0 B/op 0 allocs/op
BenchmarkMapGet/benchmark_maps_100k/benchmark_old_format_reads-12 127732 9615 ns/op 3702 B/op 66 allocs/op
BenchmarkMapGet/benchmark_maps_1M/benchmark_new_format_reads-12 508759 3162 ns/op 0 B/op 0 allocs/op
BenchmarkMapGet/benchmark_maps_1M/benchmark_old_format_reads-12 58594 19413 ns/op 5114 B/op 88 allocs/op
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkMapUpdate/benchmark_maps_10k/benchmark_new_format_writes-12 14779 68341 ns/op 19563 B/op 54 allocs/op
BenchmarkMapUpdate/benchmark_maps_10k/benchmark_old_format_writes-12 1684 660560 ns/op 486386 B/op 3204 allocs/op
BenchmarkMapUpdate/benchmark_maps_100k/benchmark_new_format_writes-12 16516 72802 ns/op 21005 B/op 54 allocs/op
BenchmarkMapUpdate/benchmark_maps_100k/benchmark_old_format_writes-12 1294 1054807 ns/op 713440 B/op 4905 allocs/op
BenchmarkMapUpdate/benchmark_maps_1M/benchmark_new_format_writes-12 12213 96054 ns/op 28961 B/op 54 allocs/op
BenchmarkMapUpdate/benchmark_maps_1M/benchmark_old_format_writes-12 1219 1025194 ns/op 814927 B/op 5089 allocs/op
On the read-path we're between 3x and 6x faster. On the write-path we're consistently 10x faster!
The new storage engine and serialization format are still experimental, but we're releasing an early version to users
in order to gather feedback and to preview future performance gains. It's currently behind a feature flag: you can
turn it on by setting the environment variable
DOLT_DEFAULT_BIN_FORMAT=__DOLT_1__ and creating a new database with dolt init.
While the new format is in development, the full feature set of Dolt is not yet supported; in particular,
many CLI utilities have not yet been ported. The focus of this alpha release is
dolt sql and
dolt sql-server, which
are largely supported, with the exception of some DDL and foreign key operations. A general release of the new storage
engine and storage format will be available later this year and will include a migration path for existing databases.
If you have any questions or want more information about the storage engine roadmap, join us on our Discord.