
DoltHub is the home of multiple version-controlled databases, including Dolt, Doltgres, and Doltlite. These databases enable users to branch and merge their data just as they would with their source code. We are dipping our toe in the document database world with DumboDB, and we’re excited to share some updates about the storage layer in DumboDB.
Last week, we discussed the addition of garbage collection to DumboDB. In that post, I acknowledged that there was a discrepancy between the amount of disk space used by DumboDB and Dolt. When storing equivalent amounts of data, Dumbo’s disk usage was about 6x that of Dolt. I couldn’t explain it at the time, and I promised to investigate the issue.
This week, I have an answer for you, and a release of DumboDB that includes multiple fixes for this issue. Let’s dive in!
TL;DR;#
There were three primary causes for the difference in disk usage between DumboDB and Dolt:
- DumboDB was using a separate chunk for every document. This resulted in a large number of small chunks, which is inefficient for storage.
- The test data exacerbated the issue, as it was dominated by documents which were only 20 bytes in size.
- Extended JSON metadata increased the size of each document by roughly 20%.
These issues have been addressed in the 0.2 release of DumboDB, and the disk usage is now virtually identical to Dolt when storing the same data.
One Chunk Per Document#
In Dolt, data is stored in chunks, which are typically around 4KB in size. You can store many rows of a typical table in 4KB, and we have plenty of machinery to do that efficiently. In DumboDB prior to version 0.2, each JSON document was stored in its own chunk. This means that a document of only 20 bytes was still given its own address (another 20 bytes), and the Prolly tree exploded into a huge number of tiny chunks.
Addressing this was pretty straightforward. We leveraged the existing chunking machinery in Dolt to store multiple documents in a single chunk. This resulted in a significant reduction in the number of chunks, and therefore a significant reduction in disk usage. With this change alone, we saw DumboDB’s disk usage drop from 6x to 2x that of Dolt.
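To get a feel for how much the chunk count changes, here is a rough sketch. It assumes a simple greedy packer with a hard 4096-byte target, which is a stand-in for illustration only: Dolt's real chunking machinery picks chunk boundaries differently, and the function names are made up for this example.

```python
TARGET_CHUNK_BYTES = 4096  # "around 4KB"; an assumed hard limit for this sketch


def chunks_one_per_document(doc_sizes):
    """Pre-0.2 behavior as described in the post: every document is its own chunk."""
    return len(doc_sizes)


def chunks_packed(doc_sizes, target=TARGET_CHUNK_BYTES):
    """Greedy packing: open a new chunk when the next document would not fit."""
    chunks, used = 0, 0
    for size in doc_sizes:
        if chunks == 0 or used + size > target:
            chunks += 1
            used = 0
        used += size
    return chunks


docs = [20] * 10_000  # ten thousand 20-byte documents
before = chunks_one_per_document(docs)
after = chunks_packed(docs)
print(f"one chunk per document: {before} chunks")
print(f"packed into ~4KB chunks: {after} chunks")
```

Under these assumptions, 204 of the 20-byte documents fit in one chunk, so the chunk count falls by a couple of orders of magnitude.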
It’s worth calling out that this behavior was actually not as designed. The original plan was to use Dolt’s chunking machinery to store multiple documents in a single chunk. This is a “write-only” code base, though, and the agent did not follow the plan. I’m not blaming the agent; I mention this to explain why I decided to test this in the first place. When coding with agents, you are left with a black box which you will never have time to read completely. All we can do is look at the behavior of the black box and see if it matches expectations. In this case, the behavior did not match my expectations, which is what led me down the path of investigating the storage format.
At the end of the day, this improvement in storage efficiency is really a story about how Dolt does storage correctly, and DumboDB just needs to follow its example.
This fix is a change in the format of data on disk and therefore is not backward compatible. 0.2.0 is a breaking release, and you will need to start with a fresh database to take advantage of this change. At this stage, we don’t have enough users to merit supporting a migration path. Let us know on Discord if you want help!
Test Data#
The storage test harness was pretty naive about what data should actually look like. This was simply a mistake on my part. The test data was dominated by documents which were only 20 bytes in size. All Dumbo documents require an additional 20 bytes of overhead for the address, so the storage overhead was huge: a 20-byte document plus a 20-byte address takes 40 bytes. See how this accounts for the 2x difference in disk usage after the chunking fix?
This was also an easy fix. I updated the test data to have a matrix of documents with varying sizes, from 100 bytes up to 1KB. This is more representative of real-world data, and it also allows us to see the benefits of the chunking fix more clearly. With this change, we saw DumboDB’s disk usage drop to 1.1x that of Dolt. This makes sense because we have that 20-byte overhead per document, but now the documents are large enough that the overhead is not as significant.
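The arithmetic behind those numbers is simple enough to sketch. This is a back-of-the-envelope model that only counts the 20-byte address per document and ignores every other source of overhead, so it will not reproduce the measured ratios exactly. The function name is hypothetical.

```python
ADDRESS_BYTES = 20  # per-document address overhead, as stated in the post


def overhead_ratio(doc_bytes, address_bytes=ADDRESS_BYTES):
    """Stored bytes divided by payload bytes when each document carries one address."""
    return (doc_bytes + address_bytes) / doc_bytes


for size in (20, 100, 1024):
    print(f"{size:>5}-byte document: {overhead_ratio(size):.2f}x")
```

A 20-byte document comes out at 2x, a 100-byte document at 1.2x, and a 1KB document at about 1.02x, which brackets the roughly 1.1x we measured on the mixed test data.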
Extended JSON Metadata#
At this stage, I was pretty satisfied with the results. For an Alpha-stage product, being 10% larger than our reference product is pretty good. However, I had a nagging feeling that there was more juice to squeeze.
We need to digress a little bit first. MongoDB operates on data called BSON. BSON is a binary representation of JSON documents, and it includes some additional metadata to support MongoDB’s features: native encoding of timestamps, regular expressions, ObjectIds, and more. All MongoDB client-side drivers convert human-readable JSON into BSON, and all wire protocols and storage on disk are in BSON.
Since DumboDB is a MongoDB-compatible document database, speaking BSON is necessary. Dolt has a lot of existing code to work with JSON documents, specifically MySQL JSON, which is slightly different from vanilla JSON. We decided to leverage this existing code to store the documents on disk, but this meant that we needed to convert the BSON documents from the MongoDB driver into JSON documents that can be stored in Dolt.
MongoDB has a specification for storing BSON losslessly in JSON documents, called Extended JSON. The high-level idea is that if you are storing a value that has type ambiguity, you need to wrap it in an object with a single key that indicates the type. For example, if you want to store the number 42, you would store it as {"$numberInt": "42"}. The wrapper records that 42 is a four-byte integer, rather than a string or an eight-byte long. Human-readable JSON can only store 6 primitive types of data, whereas BSON can store 21. Each wrapped type expands the amount of data when you are forced to encode it as Extended JSON.
How about an example? Let’s say you have a document that looks like this:
{ x: 42 }
BSON’s binary format for the document is pretty compact. Every document starts with the 4-byte length of the document, followed by a series of elements, one per field, each laid out as <type><key><value>, and then a null byte at the end of the document. The type is a single byte that indicates the type of the value, the key is a null-terminated string, and the value is the binary representation of the value. In this case, the type byte for a 32-bit integer is 0x10, the key is x followed by a null byte, and the value is 42 as a 4-byte little-endian integer.
The BSON for this document would look like this in hex:
\x0c\x00\x00\x00 \x10 \x78\x00 \x2a\x00\x00\x00 \x00
└──────┬───────┘  │   └──┬──┘  └──────┬───────┘  │
  Doc Length     Type Key "x"     Value 42      End
All told, this document takes up 12 bytes on disk in BSON format.
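If you want to check the byte layout yourself, here is a minimal hand-rolled encoder using only Python's standard library. It handles just the one case discussed here (a single field holding a 32-bit integer) and is not how DumboDB or any MongoDB driver encodes BSON. The function name is made up for this sketch.

```python
import struct


def encode_int32_document(key: str, value: int) -> bytes:
    """Encode a one-field BSON document whose value is a 32-bit integer."""
    # Element layout: type byte 0x10, null-terminated key, little-endian int32.
    element = b"\x10" + key.encode("utf-8") + b"\x00" + struct.pack("<i", value)
    body = element + b"\x00"  # trailing null byte ends the document
    # The length prefix counts itself (4 bytes) plus everything after it.
    return struct.pack("<i", 4 + len(body)) + body


doc = encode_int32_document("x", 42)
print(doc.hex(" "))
print(len(doc), "bytes")
```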
When we convert this document to extended JSON for storage in Dolt, it looks like this:
{ "x": { "$numberInt": "42" } }
If you drop the whitespace, this document ends up being 25 bytes. The BSON traveling over the wire is 12 bytes, but the JSON we are storing on disk is 25 bytes, more than double. A small integer like 42 is a pretty extreme example, especially when you consider that strings don’t have this problem. Don’t walk away from this believing that Extended JSON is 2x more expensive than BSON, because that’s not the case for all documents. When we looked at our test data, we found that Extended JSON expanded our test documents by about 10%-30%, depending on the document contents.
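The 25-byte figure is easy to reproduce. The sketch below wraps integers the way the example above does and measures the compact JSON size. It is deliberately partial: it covers only integers, strings, booleans, null, objects, and arrays, and the helper names are hypothetical rather than anything from DumboDB or a MongoDB driver.

```python
import json

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def to_extended_json(value):
    """Wrap integers in Extended JSON type wrappers; pass simple types through."""
    if value is None or isinstance(value, (bool, str)):
        return value  # no wrapper needed for these types
    if isinstance(value, int):
        wrapper = "$numberInt" if INT32_MIN <= value <= INT32_MAX else "$numberLong"
        return {wrapper: str(value)}
    if isinstance(value, dict):
        return {key: to_extended_json(item) for key, item in value.items()}
    if isinstance(value, list):
        return [to_extended_json(item) for item in value]
    raise TypeError(f"type not covered by this sketch: {type(value).__name__}")


def compact_size(document) -> int:
    """Size in bytes of the Extended JSON form with all optional whitespace removed."""
    text = json.dumps(to_extended_json(document), separators=(",", ":"))
    return len(text.encode("utf-8"))


print(compact_size({"x": 42}))          # the integer example from the post
print(compact_size({"name": "dumbo"}))  # a string field gets no wrapper
```

The string document comes out the same size as its plain JSON form, which is why string-heavy documents see far less expansion than the integer example.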
There is an additional concern here which was in the back of my mind: What are the performance implications of converting between BSON and extended JSON? If we need to convert BSON from the wire into extended JSON on every write, then convert it back into BSON on every read, that’s gotta cost some CPU cycles, right?
Time for an experiment.
Native BSON#
DumboDB is completely written by Claude Code. I’m not gonna lie, I rarely look at the code directly. We’ve talked about this before, and we here at DoltHub believe there is a place for this methodology. One of the main things it shines at is rapid prototyping and experimentation. Given this, I decided to see how hard it would be to use what we had learned at Dolt about storing JSON documents effectively, and port that to native BSON storage in DumboDB.
At a high level, Dolt stores JSON documents in a way that allows for structural sharing of large documents. So if you have a 10KB document and you update a small field in the document, chunks are reused between the old and new versions of the document, which saves a lot of space on disk. This matters in a version-controlled database because we are going to store all versions of the document forever.
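Here is a toy illustration of the idea, not Dolt's implementation. It assumes fixed-size 1KB chunks and SHA-256 addresses purely for simplicity. Dolt's actual chunk boundaries and addressing scheme differ, and every name below is invented for the sketch.

```python
import hashlib

CHUNK_BYTES = 1024  # fixed chunk size, for illustration only


def store_document(store: dict, data: bytes) -> list:
    """Split data into chunks, store each under its content hash, return the addresses."""
    addresses = []
    for start in range(0, len(data), CHUNK_BYTES):
        chunk = data[start:start + CHUNK_BYTES]
        address = hashlib.sha256(chunk).digest()  # any content hash would do here
        store[address] = chunk  # identical chunks land on the same key
        addresses.append(address)
    return addresses


store = {}
version1 = bytes(i % 251 for i in range(10 * CHUNK_BYTES))  # ~10KB filler, all chunks distinct
version2 = bytearray(version1)
version2[5000] ^= 0xFF  # a small in-place update to one "field"

v1_chunks = store_document(store, version1)
v2_chunks = store_document(store, bytes(version2))
shared = sum(a == b for a, b in zip(v1_chunks, v2_chunks))

print(f"chunks per version: {len(v1_chunks)}")
print(f"chunks shared between versions: {shared}")
print(f"chunks actually stored: {len(store)}")
```

Two versions of a ten-chunk document cost eleven chunks instead of twenty, because only the chunk containing the edit is new.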
Given that we had a reference implementation of structural sharing for JSON documents in Dolt, we were able to port that code to work with BSON documents in DumboDB. One wrinkle is that in vanilla JSON documents, there is no length prefix. It’s just a series of nested objects and arrays. In BSON, every document has a 4-byte length prefix at the beginning of the document. This is also the case for sub-documents and arrays (which are encoded as documents whose keys are the element indexes). The reason this is important is that when you update a large BSON document, say, for example, by inserting an element in an array, the length of the array changes, and so does the length of every document that contains it. That means that you can’t just update one chunk; you may need to modify several, because each affected length prefix may live in a different chunk. So we couldn’t just use the existing structural sharing code from Dolt; we had to modify it to account for the length prefix in BSON documents.
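To make the ripple effect concrete, here is a minimal BSON encoder sketch covering only nested documents, arrays, and 32-bit integers. It is a hand-rolled illustration under those assumptions, not DumboDB's code, and the helper names are made up.

```python
import struct


def encode_document(fields: dict) -> bytes:
    """Encode a dict as a BSON document: length prefix, elements, trailing null."""
    body = b"".join(encode_element(k, v) for k, v in fields.items()) + b"\x00"
    return struct.pack("<i", 4 + len(body)) + body


def encode_element(key: str, value) -> bytes:
    name = key.encode("utf-8") + b"\x00"
    if isinstance(value, dict):
        return b"\x03" + name + encode_document(value)
    if isinstance(value, list):
        # An array is encoded as a document keyed by "0", "1", "2", ...
        indexed = {str(i): item for i, item in enumerate(value)}
        return b"\x04" + name + encode_document(indexed)
    if isinstance(value, int) and not isinstance(value, bool):
        return b"\x10" + name + struct.pack("<i", value)
    raise TypeError(f"type not covered by this sketch: {type(value).__name__}")


def length_prefix(data: bytes, offset: int = 0) -> int:
    """Read the little-endian int32 length prefix found at the given offset."""
    return struct.unpack_from("<i", data, offset)[0]


before = encode_document({"tags": [1, 2]})
after = encode_document({"tags": [1, 2, 3]})

# The array's own prefix sits after the outer prefix (4), type byte (1), and "tags\0" (5).
ARRAY_OFFSET = 4 + 1 + 5
print("outer length:", length_prefix(before), "->", length_prefix(after))
print("array length:", length_prefix(before, ARRAY_OFFSET), "->", length_prefix(after, ARRAY_OFFSET))
```

Appending one element changes two length prefixes here, and a deeper document would have one more changed prefix per level of nesting.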
We had two options to consider: 1) On the write path, adjust all the appropriate length prefixes. 2) Drop the length from the data and recalculate it on the read path.
Given how fast it is to experiment in this new world of agentic coding, I did both. Both were a significant improvement compared to the extended JSON solution. A lot depended on the operation in question, but there were cases where we went from a 7x multiplier down to a 2x multiplier. That’s a huge improvement, and it made the decision to migrate to BSON pretty easy. Not all operations had that level of improvement, but by and large, BSON is an improvement over extended JSON. There was not a single benchmark that got worse with native BSON.
As far as the choice between (1) and (2), solution (1) was a little slower on writes and barely faster on reads. Solution (2) was a little faster on writes and barely slower on reads. The differences were negligible. We went with (1) because it enables streaming of documents, thus preventing us from having to read the entire document into memory on every read. Also, reads are generally more common than writes, so we wanted to optimize for reads.
I embarked on this BSON side quest mostly because I was curious about the performance implications, and that paid off. But what about the storage implications? When I took the chunking fix and test data updates mentioned above, and then added native BSON storage, I saw DumboDB’s disk usage drop to 0.998x that of Dolt, a variance that isn’t statistically significant.
In other words, when it comes to storage, DumboDB is now on par with Dolt, specifically when you use Dolt to store JSON documents. SQL tables will always be more compact because there are no field names to store for every row. Document databases have a lot of flexibility, but that flexibility comes at the cost of storage overhead. Nevertheless, we put a lot of effort into optimizing JSON storage in Dolt. So much so that we are convinced we have the best JSON implementation of any SQL database. DumboDB being on par sets it up to be a very capable version-controlled document database.
Try DumboDB 0.2.0#
We know that Alpha software is a little scary. But surely you have a side project you’d like to try out a version-controlled document database on? Maybe you want to build a wiki, or a CMS, or a blog engine, or a todo list app. Maybe you just want to play around with the technology and see what you can build with it. Whatever your use case, we encourage you to give DumboDB a try! It’s not producing bloated storage artifacts anymore, and we’d love to see what you build with it.
Want to learn more about Dolt and Dumbo? Hop on our Discord to ask questions and nerd out about version-controlled databases!