Why We Chose Flatbuffers

Dolt is a versioned SQL database with Git-like functionality, including clone, push, pull, branch, merge and diff. It stores relational table data on disk as tuples in prolly trees, and it serializes structured data like commits, tags and database schemas.

Primarily motivated by performance, we have recently been working on implementing a new storage format for Dolt. As part of that work, we've adopted a model where every chunk Dolt stores is a flatbuffers message. In this post, we look at how we are using flatbuffers in the new storage format and why we think it's a good choice for this use case.

Background

Dolt was originally developed on top of a versioned non-SQL database named noms. Noms had a type system that included primitives such as numbers and strings, as well as aggregate data structures like maps, lists, sets, structs and blobs. Noms serialized all of its data using a custom self-describing format built loosely around type tags and length prefixes.

By building directly on top of noms, Dolt adopts its storage format, using noms structs for structured data like commits and table schemas, and using maps of tuples for storing relational table data.

As Dolt implemented richer SQL features and saw more adoption for OLTP SQL database use cases, it became clear that allocation and deserialization overhead was a significant performance cost on the read path, and that serialization overhead was similarly costly on the write path.

We started experimenting with new ways of storing table data which would allow us to reduce allocation and CPU overhead. We knew that we could afford to be less general purpose and less featureful than the original noms format. Because allocation and GC overhead were a particular pain point of the old noms format, we paid close attention to being able to read and perform fixed-width-field updates on table data without much allocation overhead.

Investigations and Choices

Because it is in the hot path for query execution, our original investigations explored replacing the noms map and tuple serialization with custom serialization/deserialization code focused on tuple storage in particular. We started by experimenting with a number of tuple formats and some simple ways of serializing arrays of them into chunks. We were quickly able to proof-of-concept some meaningful performance wins, showing that substantial query execution bottlenecks could be moved away from binary codec concerns. That motivated us to continue pursuing the work.

Once we had a hand-coded, optimized codec for clustered indexes, we started doing further work to convert other types of prolly maps to the new tuple format. Dolt stores a lot of prolly maps that are not exactly clustered indexes for table data: things like secondary indexes, keyless table data, merge conflicts, and commit closures. As we adapted these new use cases, we quickly realized how much we had relied on the flexibility and rich feature set of noms' type system when developing Dolt.

At the same time, we knew that we had a relatively small and slowly changing set of types of structs and maps that we actually create and process as part of Dolt. As a team, we have a decent amount of experience with protobufs from previous jobs and from building gRPC services and clients that we run internally at DoltHub. In general, we like the schema-first codegen approach for application-specific persisted structured data.

So we took the hand-rolled tuple storage formats we had developed and experimented with converting them to schema-first serialization formats like protobufs. As we evaluated formats and tooling, we were looking for:

  • Good golang support.
  • Low allocation overhead and good performance on serialization and deserialization.
  • A reasonable path to zero-copy query execution.
  • Mature, adopted, and actively developed.
  • Schema-first with a decently ergonomic developer experience.
  • A good backward compatibility story. It should be easy to add fields and let new versions of Dolt read databases created by older versions.

Forward compatibility is less important to us — if old versions of Dolt can read databases created by newer versions, that's great, but it's not critical that this remains possible. We are interested in adding new data types and improving Dolt in ways that break forward compatibility, and we are OK with having users upgrade their version of Dolt if they want to read or collaborate on databases that are created by newer versions of Dolt.

We also closely compared the space overhead of the serialized data. Both noms and protobufs use varint encoding, for example, while flatbuffers does not.
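
To make that trade-off concrete, a small integer like a tree height stays small under varint encoding but always occupies its full fixed width as a flatbuffers scalar field. A quick, self-contained illustration using Go's standard library varint encoder (the numbers are illustrative, not measurements from Dolt):

package main

import (
    "encoding/binary"
    "fmt"
)

func main() {
    // A small value like a tree height: a couple of bytes as a varint,
    // always eight bytes as a fixed-width uint64 field.
    var buf [binary.MaxVarintLen64]byte
    n := binary.PutUvarint(buf[:], 300)
    fmt.Printf("varint: %d bytes, fixed uint64: 8 bytes\n", n) // varint: 2 bytes
}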

After comparing the options, we felt that we had found a nice sweet spot in flatbuffers. Flatbuffers is an open source project developed at Google which provides schema-first binary serialization in a format designed to allow direct access to existing serialized messages without allocation or a deserialization step. We really liked the ability to do zero-allocation, in-place "deserialization" and its backward compatibility story fit our needs perfectly.

How We Used Flatbuffers

Every chunk in the new format is a serialized flatbuffers table. We use flatbuffers file identifiers to identify the type of the root table in the chunk. For structured data and fixed-schema prolly maps, we created table definitions that closely mirror the use-case in Dolt. For example, this is the table definition for a Commit:

table Commit {
  // hash addr of the root value associated with the commit.
  root:[ubyte] (required);
  height:uint64;

  // An ordered list of hash addrs of the parents; [0:20] is the addr
  // of the first parent, [20:40] the addr of the second, etc.
  parent_addrs:[ubyte] (required);

  // hash addr of the root chunk of the CommitClosure table.
  parent_closure:[ubyte];

  // Commit metadata
  name:string (required);
  email:string (required);
  description:string (required);
  timestamp_millis:uint64;
  user_timestamp_millis:int64;
}

We have similar definitions for tables like RootValue, Table, Conflicts, MergeArtifacts, TableSchema, etc.

You can see from the above schema that we are somewhat intentional about avoiding indirection — parent_addrs:[Hash] with table Hash { ... } is too much overhead, for example. This somewhat unfortunately means the accessors for parent_closure, which is a single address, look exactly like the accessors for parent_addrs, which is actually a list of addresses.
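
One nice consequence of the file identifiers mentioned above is that a reader can tell what kind of root table a chunk holds before touching any generated code, because the standard flatbuffers layout places the 4-byte identifier right after the 4-byte root offset at the start of the buffer. A minimal sketch of the idea; the "DCMT" identifier here is hypothetical, not necessarily what Dolt's schemas declare:

package main

import "fmt"

// fileID extracts the 4-byte flatbuffers file identifier, which sits
// immediately after the 4-byte root table offset.
func fileID(buf []byte) string {
    if len(buf) < 8 {
        return ""
    }
    return string(buf[4:8])
}

func main() {
    // A made-up chunk prefix: little-endian root offset, then the identifier.
    chunk := []byte{16, 0, 0, 0, 'D', 'C', 'M', 'T'}
    switch fileID(chunk) {
    case "DCMT": // hypothetical identifier for Commit chunks
        fmt.Println("chunk holds a Commit")
    default:
        fmt.Println("unknown chunk type")
    }
}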

Dolt stores relational table data in prolly maps. The fields of the tuples in any given prolly map are based on the schema of the table. So tuple encoding and decoding is not done by flatbuffers. Instead, we use flatbuffers messages to store the byte arrays that are the serialized tuples. Currently the primary index node definition looks like this:

table ProllyTreeNode {
  // sorted array of key items
  key_items:[ubyte] (required);
  // item offsets for |key_items|
  // first offset is 0, last offset is len(key_items)
  // the second tuple is key_items[key_offsets[1]:key_offsets[2]]
  key_offsets:[uint16] (required);
  // item type for |key_items|; enum to encode the tuple format encoding itself.
  key_type:ItemType;

  // array of value items, ordered by paired key
  value_items:[ubyte];
  // item offsets for |value_items|
  // first offset is 0, last offset is len(value_items)
  value_offsets:[uint16];
  // item type for |value_items|
  value_type:ItemType;
  // offsets for each address (if any) in |value_items|
  // (eg value tuples containing out-of-line BLOB addresses)
  value_address_offsets:[uint16];

  // array of chunk addresses
  //  - subtree addresses for internal prolly tree nodes
  //  - value addresses for AddressMap leaf nodes
  address_array:[ubyte];

  // array of uvarint encoded subtree counts
  subtree_counts:[ubyte];
  // total count of prolly tree
  tree_count:uint64;
  // prolly tree level, 0 for leaf nodes
  tree_level:uint8;
}

All of the tuples for the keys go back to back in key_items, and delimiters for the individual tuples in the items array go in key_offsets. This worked better for us than modeling the Tuple as a flatbuffer message, where the space and indirection overhead becomes much too large. You can also see we're still somewhat sensitive about space overhead — on most read and update paths we do not need to access the subtree_counts and so it worked out better for us to varint encode those potentially large and rarely used integers instead of storing them in uint64s, for example.
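
To make the offset scheme concrete, here is a toy sketch of slicing the i-th key tuple out of a leaf node. In Dolt the two slices come from the flatc-generated vector accessors for the schema above, but the indexing works the same way:

package main

import "fmt"

// keyTuple returns the i-th serialized key tuple, following the layout in
// the schema comments: key_offsets has one more entry than there are tuples,
// starting at 0 and ending at len(key_items).
func keyTuple(keyItems []byte, keyOffsets []uint16, i int) []byte {
    return keyItems[keyOffsets[i]:keyOffsets[i+1]]
}

func main() {
    // Three made-up tuples of lengths 2, 3 and 1 packed back to back.
    keyItems := []byte{0xA, 0xB, 0xC, 0xD, 0xE, 0xF}
    keyOffsets := []uint16{0, 2, 5, 6}
    for i := 0; i < len(keyOffsets)-1; i++ {
        fmt.Printf("tuple %d: % x\n", i, keyTuple(keyItems, keyOffsets, i))
    }
}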

Benchmark Setup and Results

In order to further evaluate our choices vs. the original encoding that Dolt used, we wrote some small benchmarks to measure wall-clock time and memory allocation overhead for decoding and accessing leaf database pages. In the tests, we generate an array of same-sized prolly-tree leaf nodes, each encoding a mapping from an (int64, int64) tuple to another (int64, int64) tuple.

The first test looks at deserialization overhead associated with something like a table scan. We measure how long it takes to deserialize and scan all the tuples in a single page, accessing the first field from the key tuple and the first field from the value tuple.
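
Structurally, each of these is an ordinary Go benchmark. Here is a minimal sketch of the shape, with trivial stand-ins (leafNodeBytes, scanLeafNode) in place of the real per-format decode-and-scan work:

package scanbench

import "testing"

// leafNodeBytes and scanLeafNode are stand-ins so the sketch compiles; in the
// real benchmarks they are a pre-serialized leaf node and the format-specific
// decode-and-scan routine.
func leafNodeBytes() []byte { return make([]byte, 4096) }

func scanLeafNode(buf []byte) (sum int) {
    for _, b := range buf {
        sum += int(b)
    }
    return
}

// BenchmarkScan measures wall-clock time and allocations for decoding and
// scanning a single leaf page.
func BenchmarkScan(b *testing.B) {
    buf := leafNodeBytes()
    b.ReportAllocs()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _ = scanLeafNode(buf)
    }
}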

We run versions of the benchmark for three different formats: the original noms sequence-based encoding, using the noms tuple encoding; the new flatbuffers-based encoding, using the new tuple encoding; and a hypothetical protobuf encoding, using largely the same message definition as our flatbuffers schema and the new tuple encoding.

BenchmarkScanNoms-8         	   19893	     60952 ns/op	   21086 B/op	     484 allocs/op
BenchmarkScanFB-8         	  183279	      7199 ns/op	     544 B/op	       5 allocs/op
BenchmarkScanProto-8      	  153792	     10043 ns/op	    6912 B/op	      22 allocs/op

As you can see, compared to the new tuple format, the old noms format was not ideal for storing normalized relational data in prolly trees. This was a major motivation for our work on the new format.

The flatbuffers and protobuf benchmarks are quite a bit closer to each other. A few things are going on here: 1) flatbuffers is over 28% faster than protobufs on this benchmark, and 2) flatbuffers absolutely wins on the number of allocations it needs to perform to do this work, and on the number of bytes allocated per scanned leaf chunk.

One place where flatbuffers can show even better is partial reads, where it avoids the overhead of processing the entire message. In our next benchmark we access the chunk and then access a single, random tuple within it, instead of scanning every tuple.

BenchmarkOrdinalNoms-8      	  172642	      7271 ns/op	    1375 B/op	      15 allocs/op
BenchmarkOrdinalFB-8      	 1635603	       748.6 ns/op	       0 B/op	       0 allocs/op
BenchmarkOrdinalProto-8   	  481422	      3989 ns/op	    6368 B/op	      17 allocs/op

As you can see, the cost for this operation is quite a bit cheaper across all three formats than the full scan of the chunk. And especially in allocations, the old format does a lot better here — many of the allocations in the old format are for accessing tuples or map entries, not the map itself. Still, flatbuffers is far and away the fastest of the three, and the protobuf implementation has to allocate just as it does in the full scan case.

For the sake of completeness, we also include a benchmark which performs a lookup of a tuple in the leaf page. The lookup is of a tuple which does not appear in the page, at a random position in the page.
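
A lookup over the flatbuffers leaf layout is essentially a binary search across the key offsets. A self-contained sketch, assuming keys that sort bytewise (Dolt's real tuple comparison is type-aware, so this is only illustrative):

package main

import (
    "bytes"
    "fmt"
    "sort"
)

// findKey binary-searches the sorted key tuples of a leaf page for target,
// using the same key_items/key_offsets layout as the schema above.
func findKey(keyItems []byte, keyOffsets []uint16, target []byte) (int, bool) {
    n := len(keyOffsets) - 1
    i := sort.Search(n, func(j int) bool {
        return bytes.Compare(keyItems[keyOffsets[j]:keyOffsets[j+1]], target) >= 0
    })
    found := i < n && bytes.Equal(keyItems[keyOffsets[i]:keyOffsets[i+1]], target)
    return i, found
}

func main() {
    keyItems := []byte{0x01, 0x03, 0x07} // three single-byte "tuples"
    keyOffsets := []uint16{0, 1, 2, 3}
    fmt.Println(findKey(keyItems, keyOffsets, []byte{0x05})) // 2 false: absent
}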

BenchmarkLookupNoms-8       	  128235	      9549 ns/op	    2451 B/op	      28 allocs/op
BenchmarkLookupFB-8       	 1127490	      1100 ns/op	       0 B/op	       0 allocs/op
BenchmarkLookupProto-8    	  400362	      3630 ns/op	    6367 B/op	      16 allocs/op

In all three benchmarks, flatbuffers in golang compares favorably. We've also found it decently ergonomic and easy to work with.

Current Work and Future Directions

Overall, we've been very happy with the choice to use flatbuffers here. It massively reduced development time compared to rolling our own use-case-optimized serialization, while keeping performance and space overhead relatively minimal. That said, there were a few changes we had to make and a few speed bumps we came across in our adoption.

First of all, golang support is decently mature and is developed in the main flatbuffers repository, but it lags a little bit in features. For example, if arrays in structs were supported in golang, it's possible we would have modeled some fields of Commit above as something like:

struct Addr {
  addr:[ubyte:20];
}

table Commit {
  root:Addr;

  parents:[Addr];
  parent_closure:Addr;
  ...
}

That would have been more self-documenting and would have slightly reduced indirection and storage overhead for our commits. However, this feature isn't supported in the golang code generator yet and so it wasn't an option.

Relatedly, there are some things the generated golang code does that are not super efficient. For now, we are developing against a fork of the flatbuffers compiler because we needed more control over initializing new table values without having to allocate the table itself on the heap.
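
To give a sense of the kind of control we wanted: the generated GetRootAsX helpers allocate a fresh table wrapper on the heap for every message they open, whereas positioning a caller-owned flatbuffers.Table avoids that allocation entirely. A sketch of the pattern against the flatbuffers Go runtime (this is illustrative, not Dolt's actual generated code):

package chunkread

import (
    flatbuffers "github.com/google/flatbuffers/go"
)

// initRoot positions a caller-owned table at the root of buf without putting
// the wrapper on the heap. A generated GetRootAsX helper instead does
// `x := &X{}; x.Init(buf, ...)` and returns the pointer, which typically
// escapes to the heap.
func initRoot(tab *flatbuffers.Table, buf []byte) {
    tab.Bytes = buf
    tab.Pos = flatbuffers.GetUOffsetT(buf)
}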

We've also extended the flatbuffers generated code to allow us to loudly fail when we come across fields in a table that were added after the running version of Dolt was compiled. In many contexts, forward compatibility and being able to work with unknown fields is a great feature. But for our current approach to binary compatibility, we can't safely work with the database if we have a client that doesn't understand all the fields that are present.
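
One way such a check can be expressed is to inspect the root table's vtable, whose length reveals the highest field slot any writer has ever set. This is only a sketch of the idea against the raw format, not necessarily how our actual extension is implemented:

package compatcheck

import (
    flatbuffers "github.com/google/flatbuffers/go"
)

// hasUnknownFields reports whether a table's vtable declares more field slots
// than this build knows about, i.e. whether some newer writer set a field we
// cannot interpret.
func hasUnknownFields(tab flatbuffers.Table, knownFieldSlots int) bool {
    // A table begins with a signed offset back to its vtable.
    vtable := flatbuffers.UOffsetT(flatbuffers.SOffsetT(tab.Pos) - flatbuffers.GetSOffsetT(tab.Bytes[tab.Pos:]))
    // The vtable's first entry is its own length in bytes: two leading 2-byte
    // entries (vtable length, table length) followed by one 2-byte slot per
    // field, up to the highest field that was actually written.
    vtableLen := int(flatbuffers.GetVOffsetT(tab.Bytes[vtable:]))
    return (vtableLen-4)/2 > knownFieldSlots
}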

One final observation on the flatbuffers golang ecosystem is that the golang compiler itself is much less aggressive about optimizing and inlining than a modern C++ compiler. Additionally, slices and bounds checking are far from free. When using flatbuffers generated code from C++, it's idiomatic to Verify all the offsets within a message when you first encounter it and then to use non-range-checked accessors into the raw byte array. In the golang generated code, everything operates on 24-byte slices instead of 8-byte pointers, and bounds checks are applied at almost every slice access, especially when you cross function call boundaries. Moreover, the golang compiler does not aggressively inline non-trivial functions. The end result of all of this is that accessing flatbuffers fields is notably slower than accessing native fields, by a factor that is significantly larger than for similar access patterns in C++. For now, this has not been a problem for us, but it's definitely something we would love to circle back on and engage with the community about in the future.

Overall, flatbuffers have been a huge productivity win for implementing Dolt's new storage format and we've been very happy with them. If you notice similarities between what we've done and a use case you have, we strongly recommend giving flatbuffers a look. In the meantime, you can try out the new format for Dolt databases by following these instructions, which include details about migrating existing databases and creating new ones.
