DumboDB Indexes: They Work Now!

I have to confess: a couple of months ago, when I pulled the trigger and announced DumboDB 0.1, I told a fib. I said that DumboDB had indexes. It did have them, but I was skeptical about how correct they actually were. Given that it was a 0.1 release of an entirely vibe-coded product, I think it’s reasonable to say expectations were low. Now, after a fair amount of testing and development, I can safely say that many aspects of the initial index implementation were incorrect.

Today, I’m going to tell you all the ways they were broken, and give those brave souls who have tried DumboDB a really good reason to upgrade to version 0.3.0.

To honor the SpaceX IPO, we will join Dolty on a journey of discovery and exploration of the DumboDB index bugs of yesterday. SpaceX has taught us that it’s important to blow up some rockets in order to learn how to build better ones.

DumboDB Index Bugs of Yesterday

Until now, DumboDB had a number of bugs in its indexes. They fell into two categories: MongoDB parity, and version-control correctness. We’ll cover each of these in turn.

MongoDB Parity

DumboDB is a MongoDB clone, and as such, it has to be compatible with MongoDB. The interfaces used to create indexes pretty much worked, and the simplest queries involving only one index would go faster. But beyond that, there were several correctness issues that would honestly make your application unusable:

  • deleteMany would not drop any keys from the index, so queries would return deleted documents.
  • updateMany had the same issue as above.
  • .explain() would report a full collection scan for any collection that had more than one index.
  • Regardless of what the .explain() output was, the query planner would always choose a full collection scan over an index scan.
  • .hint() with the $natural key would panic.

This list spans the full spectrum from “facepalm, I can’t believe I released this” to “Does anyone use .hint()?” Taken together, though, it was a pretty bad experience for anyone trying to use indexes in DumboDB.

There were additional known gaps in support, specifically the lack of support for partial indexes. But those were missing features, not correctness issues like the ones listed above. Dissecting the issues, it was clear that a lot more work was required to get this thing off the ground.
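
To make the first bullet concrete, here is a minimal toy model, not DumboDB’s actual code, of how a secondary index goes stale when deleteMany removes documents without removing their index keys. The ToyCollection class and every name in it are hypothetical and exist only for illustration.

```javascript
// Toy in-memory collection with one secondary index.
// Hypothetical names throughout; this is not DumboDB's implementation.
class ToyCollection {
  constructor(indexedField) {
    this.indexedField = indexedField;
    this.docs = new Map();  // _id -> document
    this.index = new Map(); // indexed value -> Set of _ids
  }

  insert(doc) {
    this.docs.set(doc._id, doc);
    const key = doc[this.indexedField];
    if (!this.index.has(key)) this.index.set(key, new Set());
    this.index.get(key).add(doc._id);
  }

  // Delete every document matching the predicate. When maintainIndex is
  // false, the index keeps its stale entries, mimicking the bug.
  deleteMany(predicate, maintainIndex = true) {
    let deleted = 0;
    for (const [id, doc] of [...this.docs]) {
      if (!predicate(doc)) continue;
      this.docs.delete(id);
      deleted++;
      if (maintainIndex) {
        const key = doc[this.indexedField];
        const ids = this.index.get(key);
        ids.delete(id);
        if (ids.size === 0) this.index.delete(key);
      }
    }
    return deleted;
  }

  // Index lookup: the _ids the index claims match the value.
  findIdsByIndex(value) {
    return [...(this.index.get(value) ?? [])];
  }

  // Invariant check: every index entry must point at a live document.
  staleIndexEntries() {
    const stale = [];
    for (const [key, ids] of this.index) {
      for (const id of ids) {
        if (!this.docs.has(id)) stale.push({ key, id });
      }
    }
    return stale;
  }
}

function demo(maintainIndex) {
  const items = new ToyCollection('sku');
  items.insert({ _id: 1, sku: 'S-1' });
  items.insert({ _id: 2, sku: 'S-2' });
  items.deleteMany((doc) => doc.sku === 'S-1', maintainIndex);
  return { hits: items.findIdsByIndex('S-1'), stale: items.staleIndexEntries() };
}

const buggy = demo(false); // index still points at the deleted document
const fixed = demo(true);  // index lookup comes back empty
```

With maintainIndex set to false, the index still claims that _id 1 matches sku 'S-1' even though the document is gone, which is the “queries return deleted documents” symptom. The staleIndexEntries check is the kind of invariant a test can assert after every write.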

We addressed all of these issues in the same way we built the first version of DumboDB: by writing a lot of parity tests. To briefly summarize the methodology, we wrote a set of tests that would run against both MongoDB and DumboDB, and we compared the results. There are known feature gaps in DumboDB, and they are marked as such in the test suite.
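
The shape of such a parity harness can be sketched in a few lines. This is an illustration under stated assumptions, not the actual DumboDB test suite: the two “backends” are plain functions standing in for MongoDB and DumboDB so the sketch runs without a server, and runParity, knownGap, and the test case names are all hypothetical.

```javascript
// Minimal parity-harness sketch. `reference` stands in for MongoDB and
// `candidate` for DumboDB; both are hypothetical in-memory functions.
function runParity(cases, reference, candidate) {
  const report = { passed: [], failed: [], skipped: [] };
  for (const testCase of cases) {
    if (testCase.knownGap) {
      // Known feature gaps are recorded, not counted as failures.
      report.skipped.push(testCase.name);
      continue;
    }
    // JSON comparison is order-sensitive, which is fine for this toy.
    const expected = JSON.stringify(reference(testCase.op));
    const actual = JSON.stringify(candidate(testCase.op));
    (expected === actual ? report.passed : report.failed).push(testCase.name);
  }
  return report;
}

const docs = [{ _id: 1, sku: 'S-1' }, { _id: 2, sku: 'S-2' }];

// Reference behavior: filter by sku, ignore any hint.
const reference = (op) => docs.filter((d) => d.sku === op.sku);

// Deliberately buggy candidate: a hint makes it drop the filter.
const candidate = (op) => (op.hint ? docs : docs.filter((d) => d.sku === op.sku));

const cases = [
  { name: 'find by sku', op: { sku: 'S-1' } },
  { name: 'find by sku with hint', op: { sku: 'S-1', hint: 'by_sku' } },
  { name: 'partial index', op: { sku: 'S-1' }, knownGap: true },
];

const report = runParity(cases, reference, candidate);
console.log(report);
```

Keeping known gaps in a separate skipped bucket means a missing feature never hides among real regressions, and enabling a test later is a one-line change.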

Ultimately, Claude and I produced a plan that was converted into Beads. The plan included enabling 27 tests in the parity harness related to indexes and query planning that DumboDB failed outright. In addition, there were 48 new tests in the parity suite to cover much more surface area. All told, it amounted to about 100 beads. Claude worked over the course of 12 hours or so to get DumboDB to pass every one of those indexing tests. Agents for the win! How maintainable is the code? Only time will tell!

Version-Control Correctness

With the MongoDB parity issues fixed, we turned our attention to the version-control correctness issues. These also fell on a spectrum of “facepalm”:

  • Unique index support was dodgy at best, and merging could produce duplicate entries in a unique index. Indexes were effectively corrupted.
  • Merging branches resulted in a full index re-write. No structural sharing was happening, and the merge was not performant.
  • The resolution workflow for merge conflicts was unusable. All values in the dumboConflicts response were null.

These issues were more subtle to discover. There is no way to parity-test such features because MongoDB does not have version control. Determining that index data was not being structurally shared required instrumentation that would count the number of chunks stored in different scenarios. We also added instrumentation to Dolt itself, and ran through merge scenarios to determine that Dolt was doing a lot more structural sharing than DumboDB.
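
The chunk-counting idea can be sketched as follows. This is a simplified illustration, not the instrumentation we added: it splits index entries into fixed-size chunks and addresses each chunk by the SHA-256 of its contents, whereas the real storage layer does not use fixed-size chunks. The chunkHashes and sharingStats helpers are hypothetical names.

```javascript
const crypto = require('node:crypto');

// Split sorted index entries into fixed-size chunks and address each chunk
// by the hash of its contents. Fixed-size chunking is an assumption made
// for brevity; it is not how Dolt or DumboDB chunk data.
function chunkHashes(entries, chunkSize = 4) {
  const hashes = new Set();
  for (let i = 0; i < entries.length; i += chunkSize) {
    const chunk = JSON.stringify(entries.slice(i, i + chunkSize));
    hashes.add(crypto.createHash('sha256').update(chunk).digest('hex'));
  }
  return hashes;
}

// Count how many chunks of `after` already existed in `before` (shared)
// and how many would have to be written fresh.
function sharingStats(before, after) {
  const oldHashes = chunkHashes(before);
  const newHashes = chunkHashes(after);
  let shared = 0;
  for (const h of newHashes) if (oldHashes.has(h)) shared++;
  return { total: newHashes.size, shared, written: newHashes.size - shared };
}

// 16 index entries make 4 chunks of 4.
const before = Array.from({ length: 16 }, (_, i) => [`S-${String(i).padStart(2, '0')}`, i]);

// Scenario 1: change one entry in place. Only its chunk differs.
const edited = before.map((entry) => [...entry]);
edited[5] = ['S-05', 500];
const stats = sharingStats(before, edited);

// Scenario 2: insert one entry at the front. Every fixed-size chunk
// boundary shifts, so no chunk matches the old version.
const shifted = sharingStats(before, [['S-', -1], ...before]);

console.log(stats, shifted);
```

A low shared count after a small logical change is the signal to look for. The second scenario also shows why the chunking scheme matters: with naive fixed-size chunks, a single insert defeats sharing entirely.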

More testing and instrumentation for the win. Asking Claude to go deep on Dolt’s source code and port it, or better yet use it directly, was the solution here. It required a fair amount of human verification and hand-holding to ensure that the code was ported correctly, and that the tests were passing. But we got there, and DumboDB now has a fully functional index implementation that is both MongoDB-compatible and structurally shared.

The last remaining issue was the merge conflict resolution workflow. That required a bit more work because it exposed the limitations of the existing workflow. Ultimately I decided that the information previously available to users for resolving data conflicts was not sufficient, so I added a new dumboConflicts response that gives users more to work with. Where before users had no mechanism to resolve conflicts in merging indexes, now they do. The new dumboConflicts response provides the following information:

mydb> db.runCommand({ dumboConflicts: 1 })
{
  collections: [
    {
      collection: 'items',
      conflicts: [
        {
          conflictId: 'T9/nJ47hKUBV7x1XJSUHJg',
          type: 'uniqueKeyCollision',              // Currently uniqueKeyCollision and documentEdit
                                                   // are the only two conflict types.
                                                   // More will be added in the future.
          reason: {
            code: 'uniqueKeyCollision',
            // Human-readable message useful in third-party applications to assist with conflict resolution.
            message: `unique index "by_sku": branch 'main' (ours) and branch 'feature' (theirs) both have sku = "S-1"`,
            index: 'by_sku',
            key: { sku: 'S-1' }
          },
          // Full document of each member of the three-way merge.
          base: null,
          ours: { _id: 10, doc: { _id: 10, sku: 'S-1' }, diffType: 'added' },
          theirs: { _id: 20, doc: { _id: 20, sku: 'S-1' }, diffType: 'added' }
        }
      ]
    }
  ],
  ok: 1
}

The response now gives users something they can act on. This example tells you that both branches created a document with the same unique key, and that you need to resolve the conflict. You can do that using dumboResolveConflict, which lets you choose ours, theirs, or a custom resolution.
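
As a rough sketch of how a third-party tool might consume this response, the snippet below walks the structure shown in the sample above and applies a simple policy. The summarizeConflicts and chooseSide helpers are hypothetical, and the “keep the lower _id” policy is only an example. The sketch deliberately stops short of calling dumboResolveConflict, because its argument shape is not covered in this post.

```javascript
// Walk a dumboConflicts-style response and produce one line per conflict.
// Field names follow the sample response; the helpers are hypothetical.
function summarizeConflicts(response) {
  const lines = [];
  for (const { collection, conflicts } of response.collections) {
    for (const c of conflicts) {
      const ours = c.ours ? c.ours._id : null;
      const theirs = c.theirs ? c.theirs._id : null;
      lines.push(`${collection} ${c.conflictId} ${c.type}: ours=${ours} theirs=${theirs}`);
    }
  }
  return lines;
}

// Example policy (an assumption, not DumboDB behavior): for unique-key
// collisions keep the document with the lower _id; leave anything else
// for a human to decide.
function chooseSide(conflict) {
  if (conflict.type !== 'uniqueKeyCollision') return 'manual';
  return conflict.ours._id <= conflict.theirs._id ? 'ours' : 'theirs';
}

// Trimmed copy of the sample response.
const response = {
  collections: [
    {
      collection: 'items',
      conflicts: [
        {
          conflictId: 'T9/nJ47hKUBV7x1XJSUHJg',
          type: 'uniqueKeyCollision',
          reason: { code: 'uniqueKeyCollision', index: 'by_sku', key: { sku: 'S-1' } },
          base: null,
          ours: { _id: 10, doc: { _id: 10, sku: 'S-1' }, diffType: 'added' },
          theirs: { _id: 20, doc: { _id: 20, sku: 'S-1' }, diffType: 'added' },
        },
      ],
    },
  ],
  ok: 1,
};

const summary = summarizeConflicts(response);
const decision = chooseSide(response.collections[0].conflicts[0]);
console.log(summary, decision);
```

Returning 'manual' for unrecognized conflict types keeps the policy safe as more types are added later.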

We went from being dead in the water with merge conflicts to having a workflow to address them. Yay!

Next Launch

You’ll note that I haven’t said a word about query performance. That’s because I haven’t gotten to test it yet! I’m leaving town for a week, so I figured it was better to get this out. Having the correct contents and query results is step one. I’ll jump on performance when I get back!

We are still testing. As with rockets, it is a long way from liftoff to orbit. DumboDB is significantly more usable now than it was a week ago. You don’t need to enable or turn on anything to get the corrected indexes and new functionality. Upgrade to DumboDB 0.3.0, and it just works.

Want to learn more about Dolt and Dumbo? Hop on our Discord to ask questions and nerd out about version-controlled databases!
