Dumbo: Under the Hood

Last week, we announced DumboDB, a new database that combines the best of MongoDB and Git. In this post, we'll go behind the scenes to see what makes DumboDB tick, and show you how to look under the hood yourself.

NoSQL on Dolt

Dolt is a SQL database. DumboDB is a NoSQL database. DumboDB is built on top of Dolt, so surely DumboDB must just be issuing SQL queries under the hood, right? Nope.

Dolt’s SQL engine is very capable, but it’s complex. I wanted to ensure that DumboDB’s interaction with its own data didn’t force us to build up a SQL abstract syntax tree and then execute it. Instead, we wanted a NoSQL database that just happened to use Dolt’s storage format. To do that, we built a custom query engine that translates MongoDB queries directly into operations against Dolt’s storage layer.
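As a rough illustration of that distinction (toy code, not DumboDB's actual engine), a Mongo-style find() can be handled as direct reads against stored rows, with no SQL text, parser, or AST anywhere in the path:

```python
# Toy model: a collection's primary index as a mapping from _id to document.
primary_index = {
    1: {"_id": 1, "name": "Alice"},
    2: {"_id": 2, "name": "Bob"},
}

def find(filter_doc):
    # An _id equality filter becomes a single point read on the primary index.
    if set(filter_doc) == {"_id"}:
        doc = primary_index.get(filter_doc["_id"])
        return [doc] if doc else []
    # Anything else falls back to a scan of the primary index.
    return [d for d in primary_index.values()
            if all(d.get(k) == v for k, v in filter_doc.items())]
```

The query is interpreted as storage operations directly; there is no intermediate SQL representation to build or optimize away.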

Git and Dolt are both content-addressable storage systems. This means that every piece of data is stored using an identifier calculated from its cryptographic hash. The Prolly Trees that we create and manage to make your database work are tightly coupled to the idea of content-addressed data. The code we use to serialize our data must be rock solid in order to ensure that the same data always produces the same hash and that different data produces different hashes. This is critical for ensuring the integrity of the data and for enabling features like branching and merging.
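To make that concrete, here is a minimal sketch of content addressing in Python. It assumes Dolt-style 32-character base32 addresses built from a truncated SHA-512 digest; the alphabet and truncation length are assumptions for illustration, but the property that matters (same bytes in, same address out) holds regardless:

```python
import base64
import hashlib

# Assumed encoding details, for illustration only: 20-byte truncated
# SHA-512 digest, rendered in a digits-first base32 alphabet.
DOLT_B32 = "0123456789abcdefghijklmnopqrstuv"
STD_B32 = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"

def content_address(data: bytes) -> str:
    digest = hashlib.sha512(data).digest()[:20]   # 160 bits
    std = base64.b32encode(digest).decode()       # exactly 32 chars, no padding
    return std.translate(str.maketrans(STD_B32, DOLT_B32))

# Identical bytes always yield the identical address; changing even one
# byte yields a different one.
```

Because the address is a pure function of the bytes, two branches holding identical data share the same chunks, and a merge can compare subtrees by comparing addresses.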

We’ve put a tremendous amount of effort into ensuring that Dolt’s content-addressable storage system is robust and reliable. By building DumboDB on top of Dolt’s storage, we can leverage this robust system to ensure that DumboDB’s data is stored reliably. This strategy is in stark contrast to DoltLite which kept SQLite’s SQL layer and rewrote Dolt’s storage in C. I’m right and Tim is wrong. DumboDB for life.

FlatBuffers All the Way Down

To understand how this works, we need to understand how Dolt stores data. All data in a Dolt database is modelled using FlatBuffers, which are an efficient way to serialize structured data. People familiar with Google’s Protobufs will be right at home with FlatBuffers. Dolt’s FlatBuffers are defined here. These FlatBuffers cover all information you would push and pull between databases, including commit objects, schema information, foreign key constraints, and so forth. To make it concrete, let’s look at the commit FlatBuffer definition, which is used to store commit information in Dolt:

table Commit {
  // hash addr of the root value associated with the commit.
  root:[ubyte] (required);
  height:uint64;

  parent_addrs:[ubyte] (required);

  parent_closure:[ubyte];

  name:string (required);
  email:string (required);
  description:string (required);
  timestamp_millis:uint64;
  user_timestamp_millis:int64;
  signature:string;
  committer_name:string;
  committer_email:string;
}

This FlatBuffer defines the structure of a commit object in Dolt. There are multiple [ubyte] fields, which are used to store hash addresses that reference other serialized data. The other fields are fairly self-explanatory, containing metadata about the commit such as the author’s name and email, a description of the commit, and timestamps.

The critical piece is the root value; this is where all of your database state is stored. This is akin to the treeId in Git. It’s an address that points to another FlatBuffer - the AddressMap - which is a mapping of table names to their corresponding data values. Each table’s data value is another FlatBuffer that contains the actual data for that table:

table Table {
  // address of schema.
  schema:[ubyte] (required);

  // an embedded row map;
  primary_index:[ubyte] (required);

  // Entries map from index names to addresses of index maps.
  secondary_indexes:[ubyte]; // Embedded AddressMap

  conflicts:Conflicts;

  // address of artifacts
  artifacts:[ubyte];
}

It turns out that Dumbo’s “collections” of key-value pairs map cleanly onto the Table FlatBuffer. The schema is always two columns: the first is the key and the second is the value. Better yet, the additional fields let us support three-way merges and conflict resolution. Collections are just simple tables. That’s it. Dumbo’s indexes are happy to live as secondary indexes in the Table FlatBuffer, and Dumbo’s “documents” are just rows in the primary index.
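The whole chain, from commit to root value to AddressMap to table to document, can be sketched with plain dicts standing in for content-addressed chunks (illustrative only; none of this is Dolt's real API):

```python
import json

# Toy chunk store: dict keys stand in for content-hash addresses.
store = {
    "commitA": {"root": "rootA"},
    "rootA":   {"tables": {"customers": "tblCustomers"}},  # the AddressMap
    "tblCustomers": {
        "schema": "schKV",            # two columns: key, value
        "primary_index": {            # key -> serialized document
            1: json.dumps({"_id": 1, "name": "Alice", "email": "alice@example.com"}),
        },
        "secondary_indexes": {"email_1": "idxEmail"},
    },
}

def find_document(commit_addr, collection, key):
    # commit -> root value -> AddressMap -> Table -> row in the primary index
    root = store[store[commit_addr]["root"]]
    table = store[root["tables"][collection]]
    return json.loads(table["primary_index"][key])
```

Every hop is an address lookup, which is why a commit's single root address is enough to pin down the entire database state.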

Kicking the Tires

To show you how Dumbo data maps onto Dolt’s storage system, let’s fabricate some data and look at it… with Dolt.

Create Some Data

If you haven’t already, install DumboDB. I mean, it came out last week, so you must have it already, right?

Start your server. We are going to be poking at the data on disk, so I suggest you do this with a directory you create and have full access to:

mkdir dumbo_data
dumbodb --data-dir dumbo_data

We’re going to be working with a Dumbo database called mybiz. To create a Dumbo database, we connect to it. We’ll do so in a separate terminal. We’ll use the collections customers and orders, and we’ll insert some documents.

$ mongosh mongodb://localhost/mybiz
...
------
   The server generated these startup warnings when booting
   2026-05-11T22:15:48.171Z: Powered by DumboDB v0.1.0.
   2026-05-11T22:15:48.171Z: Star Us! https://github.com/dolthub/dumbodb
------

mybiz> db.customers.insertMany([
   { _id: 1, name: "Alice", email: "alice@example.com" },
   { _id: 2, name: "Bob", email: "bob@example.com" }
 ])
{ acknowledged: true, insertedIds: { '0': 1, '1': 2 } }
mybiz> db.orders.insertMany([
   { order_id: 101, customer_id: 1, items: ["Laptop"], total: 1200 },
   { order_id: 102, customer_id: 1, items: ["Mouse"], total: 25 },
   { order_id: 103, customer_id: 2, items: ["Monitor"], total: 300 }
])
{
  acknowledged: true,
  insertedIds: {
    '0': ObjectId('6a025655a4237f79d869142c'),
    '1': ObjectId('6a025655a4237f79d869142d'),
    '2': ObjectId('6a025655a4237f79d869142e')
  }
}

Next, let’s create an index on the customer email field, then commit everything:

mybiz> db.customers.createIndex({ email: 1 }, { unique: true })
email_1
mybiz> db.runCommand({dumboCommit: 1})
{
  commitId: 'f8rj10hkgvekcfh1cb3qru2mhnqskfht', // Remember this Commit ID, we'll be looking at it in a second.
  branch: 'main',
  message: 'dolt commit',
  author: 'dumbodb <dumbodb@dumbodb>',
  timestamp: ISODate('2026-05-11T22:27:00.776Z'),
  committer: 'dumbodb <dumbodb@dumbodb>',
  committerTimestamp: ISODate('2026-05-11T22:27:00.776Z'),
  ok: 1
}
mybiz>

Now that you’ve created some data, shut down your server (CTRL+C in the terminal where it’s running).

Tricking Dolt

If you look at the files created in the dumbo_data directory, you’ll see something like this:

$ tree dumbo_data
dumbo_data
├── admin
│   ├── journal.idx
│   ├── LOCK
│   ├── manifest
│   ├── oldgen
│   └── vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
└── mybiz
    ├── journal.idx
    ├── LOCK
    ├── manifest
    ├── oldgen
    └── vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv

Every MongoDB server has an admin database, and we see that here. We also see the mybiz database that we created. Each of these databases is a Dolt database, kind of. Specifically, it’s the noms portion of a Dolt repository. You can read about the anatomy of a Dolt repository here. For historical reasons that go back more than 10 years, the noms directory is where all Dolt data is stored. But there isn’t a noms directory here. What gives?

DumboDB uses a slightly streamlined on-disk file hierarchy, compared to Dolt. Dolt’s top-level directory is .dolt, which contains a noms subdirectory for database data, in addition to other files which are all local configuration. DumboDB stores everything in the data directory. Instead of mybiz/.dolt/noms, we have mybiz/. DumboDB won’t store any additional files outside of the mybiz directory; local configuration is stored directly as FlatBuffers.

The fact that the DumboDB server directory mybiz is identical to a .dolt/noms directory allows us to do a little hack. We can trick Dolt into accepting Dumbo data by creating a .dolt directory, symlinking the mybiz directory into it as noms, and creating a repo_state.json file:

$ mkdir -p dumbo_data/dolt_like/.dolt
$ ln -s "$(pwd)/dumbo_data/mybiz" dumbo_data/dolt_like/.dolt/noms
$ printf '{\n  "head": "refs/heads/main",\n  "remotes": {},\n  "backups": {},\n  "branches": {}\n}\n' > dumbo_data/dolt_like/.dolt/repo_state.json

After this is done, you can run dolt command-line commands against the dolt_like directory and see the data that Dumbo created:

$ cd dumbo_data/dolt_like
$ dolt log
commit f8rj10hkgvekcfh1cb3qru2mhnqskfht (HEAD -> main)
Author: dumbodb <dumbodb@dumbodb>
Date:  Mon May 11 15:27:01 -0700 2026

        dolt commit

commit 396piaisu9pfdb040vo1bfet3al99rvb
Author: dumbodb <dumbodb@dumbodb>
Date:  Mon May 11 15:20:54 -0700 2026

        Initialize database

You can see that the commit ID we saw when running the dumboCommit command is the same one that dolt log shows.

The dolt show --no-pretty command allows us to see the raw FlatBuffer data for the commit:

$ dolt show --no-pretty f8rj10hkgvekcfh1cb3qru2mhnqskfht
SerialMessage {
	Name: dumbodb
	Desc: dolt commit
	Email: dumbodb@dumbodb
	Timestamp: 2026-05-11 15:27:00.776 -0700 PDT
	UserTimestamp: 2026-05-11 15:27:00.776 -0700 PDT
	Height: 2
	RootValue: {
		#qefsbrdcac7b0giu1imu83gec8rem2li
	}
	Parents: {
		#396piaisu9pfdb040vo1bfet3al99rvb
	}
	ParentClosure: {
		#jerpt0l9pd1et6d0ksr21jh6d2gacuhv
	}
}

That is the same FlatBuffer structure that we mentioned above. Similarly, we can see the FlatBuffer data for the root value:

$ dolt show --no-pretty qefsbrdcac7b0giu1imu83gec8rem2li
SerialMessage {
	FeatureVersion: 7
	ForeignKeys: #00000000000000000000000000000000
	Tables: AddressMap {
		customers: #pd286pptv91msqh42g4q21orqunmncuq
		orders: #vt95ldk13suna63bc50ubvf35ub0pad5
	}
}

The two collections we created, customers and orders, are represented as tables in the root value. If we look at the customers table, we can see the secondary index we created on the email field:

$ dolt show --no-pretty pd286pptv91msqh42g4q21orqunmncuq
SerialMessage {
	Schema: #r7gpgrhi7uj7lam4u7h0f3js60hhhje1
	Primary index: #0528ucv84ce0lta7cd2slnnr0kvn8pdl
	Secondary indexes: AddressMap {
		email_1: #m7ue3cmolfkoa7b12i1hr0kuftkkjjr8 // This is the index on the email field
	}
}

$ dolt show --no-pretty m7ue3cmolfkoa7b12i1hr0kuftkkjjr8
SerialMessage {
	Blob - {"name":"email_1","keys":[{"field":"email"}],"unique":true,"map_root":"1948de1e21f339203c5316c2348340fe3d9918e8"}
}

That map_root is yet another FlatBuffer that represents the index on the email field. We can probably stop here; you get the idea. Even though DumboDB is a NoSQL database, it’s still using Dolt’s storage format under the hood.

So It’s Just a Dolt Database?

No. There is a subtle difference which I haven’t acknowledged yet. The secondary index is stored in the FlatBuffer, but the Dolt SQL engine is completely unaware of it, so it won’t use the index when executing queries.

$ dolt sql -q 'show create table customers'
+-----------+------------------------------------------------------------------+
| Table     | Create Table                                                     |
+-----------+------------------------------------------------------------------+
| customers | CREATE TABLE `customers` (                                       |
|           |   `_id` binary(20) NOT NULL,                                     |
|           |   `doc` json NOT NULL,                                           |
|           |   PRIMARY KEY (`_id`)                                            |
|           | ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_bin |
+-----------+------------------------------------------------------------------+

Whereas in DumboDB, you can list the indexes on the collection:

mybiz> db.customers.getIndexes()
[
  { v: 2, key: { _id: 1 }, name: '_id_' },
  { v: 2, key: { email: 1 }, name: 'email_1', unique: true }
]

This is all to say that the storage format is the same, but the query execution is different. The DumboDB query engine understands the indexes that DumboDB creates and can use them to execute queries efficiently, whereas the Dolt SQL engine is unaware of those indexes and can’t use them. So while it would be nifty to say “hey you can execute SQL queries against DumboDB,” that isn’t really the case.
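Here is a toy sketch of the difference. The email_1 index lets DumboDB answer an email lookup with two point reads, while an engine unaware of the index has no choice but to scan every row. The names mirror the example data above; this is not DumboDB's actual engine:

```python
# Primary index: _id -> document. Secondary index: email -> _id.
primary_index = {
    1: {"_id": 1, "name": "Alice", "email": "alice@example.com"},
    2: {"_id": 2, "name": "Bob", "email": "bob@example.com"},
}
email_1 = {"alice@example.com": 1, "bob@example.com": 2}

def find_by_email_indexed(email):
    # DumboDB's path: point read on the secondary index, then the primary index.
    pk = email_1.get(email)
    return primary_index.get(pk)

def find_by_email_scan(email):
    # An engine that doesn't know about email_1 must scan the whole collection.
    return next((d for d in primary_index.values() if d["email"] == email), None)
```

Both return the same document; only the amount of work differs, and the gap grows with the size of the collection.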

Why?

Humans wrote Dolt. We’ve committed years of effort to build a robust product that has had millions of hours of usage. Through that time we’ve actually rewritten the storage engine multiple times, and the Prolly Tree representations have been better for it.

AI coding agents wrote DumboDB, and that will continue to be the case for the most part. By building on top of a storage system that we understand well, we can be more confident in the robustness of DumboDB. Anyone who uses coding agents knows that they sometimes enthusiastically assemble very compelling smoke-and-mirrors demos. To guard against this, we hard-code tests requiring that simple data written by DumboDB be readable by Dolt, which gives us confidence that the data is being stored correctly. We also watch other signals, like zero extraneous files that aren’t part of Dolt’s on-disk format. But we aren’t naive enough to assume we can keep a one-to-one linkage between the two databases, and that has never been a goal. We want to be able to evolve DumboDB in ways that make sense for a NoSQL database, and that will definitely mean diverging from Dolt’s SQL engine. The storage format is a different story. With more evolution we may find updates we need to make to the FlatBuffers to better support DumboDB, but those will almost certainly be easier to make in place with the current storage format, and Dolt will simply ignore them.

What’s Next!?

Currently, DumboDB does not support garbage collection. As in Dolt, many intermediate objects get created while work is being done, and they are no longer needed once the user commits their changes. These objects tend to fill up the journal file (that’s the vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv file you see in the data directory). The Dolt storage system has a mechanism to clean this up and store data chunks in more persistent files that can be pushed and pulled between databases.
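That cleanup is essentially a reachability problem, which can be sketched as a mark phase over the chunk graph (a toy store model, not Dolt's actual GC):

```python
# Toy chunk store: address -> addresses it references. "head" is the
# current commit; root1/tbl1 are superseded intermediate objects that
# no live commit references anymore.
store = {
    "head":  ["root2"],
    "root2": ["tbl2"],
    "tbl2":  [],
    "root1": ["tbl1"],
    "tbl1":  [],
}

def live_set(roots):
    """Mark phase: everything reachable from the branch heads is live."""
    live, stack = set(), list(roots)
    while stack:
        addr = stack.pop()
        if addr in live:
            continue
        live.add(addr)
        stack.extend(store.get(addr, []))
    return live

# Sweep: anything not reachable from a head can be dropped from the journal.
garbage = set(store) - live_set(["head"])
```

The danger is in getting the root set wrong: miss a branch head and you delete data a commit still depends on, which is exactly why it helps that this logic has already been battle-tested in Dolt.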

This is another benefit of depending on Dolt’s storage system. Garbage collection is a hard problem that carries the risk of throwing out important data if done incorrectly. Thanks to all the work done on garbage collection in Dolt, Dumbo will be ahead of the curve on this. DumboDB will support automatic garbage collection in release 0.2. It took until 1.75 to get that in Dolt!

Conclusion

We don’t want DumboDB to be a black box we have no understanding of, which can be a risk with coding agents. We want to be able to look under the hood and see how it works, and we want you to be able to do that too. We hope this gives you confidence in the robustness of the foundation underneath DumboDB.

Hop on our Discord to ask questions and nerd out about version-controlled databases!