Saying goodbye to the LD1 storage format

We’re building Dolt, the world’s first version-controlled SQL database. Dolt hit 1.0 in 2023, which meant, among other criteria, that we promised forwards storage compatibility:

Dolt 1.0 will be backwards compatible with all further 1.X versions.

Now it’s 2026 and we’re on the verge of releasing Dolt 2.0. In preparation for this release, we’re pursuing some work we’ve long put off: finally removing support for the pre-1.0 format, which we referred to internally and in binary logs as LD1 in honor of our original company name.

Read on for details of why we took this step and how we accomplished it.

The challenge

In 2022, we first landed Dolt’s current storage format in alpha release. The new storage format diverged radically from what came before, partially summarized here with a before and after view of how tuple values in rows are stored:

Before:

+--------+----------+---------+--------+----------+---------+-----+--------+----------+---------+
| Type 0 | Length 0 | Value 0 | Type 1 | Length 1 | Value 1 | ... | Type K | Length K | Value K |
+--------+----------+---------+--------+----------+---------+-----+--------+----------+---------+

After:

+---------+---------+-----+---------+----------+----------+-----+----------+-------+
| Value 0 | Value 1 | ... | Value K | Offset 1 | Offset 2 | ... | Offset K | Count |
+---------+---------+-----+---------+----------+----------+-----+----------+-------+

Storing tuples in the new format was part of a series of changes to how values are serialized to disk that collectively resulted in over a 5x speedup in our internal benchmarks. We pursued these changes primarily for performance reasons, and it worked: Dolt is now as fast as MySQL on sysbench.

But these changes came at a cost. Because the old and new storage formats were incompatible with one another, and existing paying customers were running their production databases on the old format, we needed to support both storage formats in parallel. We did this the typical way in many programming languages, by introducing interfaces that abstracted away the differences between the two implementations. For example, here’s how we define a table’s storage:

type Table interface {
	HashOf() (hash.Hash, error)
	GetSchemaHash(ctx context.Context) (hash.Hash, error)
	GetSchema(ctx context.Context) (schema.Schema, error)
	SetSchema(ctx context.Context, sch schema.Schema) (Table, error)
	GetTableRows(ctx context.Context) (Index, error)
	GetTableRowsWithDescriptors(ctx context.Context, kd, vd *val.TupleDesc) (Index, error)
	SetTableRows(ctx context.Context, rows Index) (Table, error)
	GetIndexes(ctx context.Context) (IndexSet, error)
	SetIndexes(ctx context.Context, indexes IndexSet) (Table, error)
	GetArtifacts(ctx context.Context) (ArtifactIndex, error)
	SetArtifacts(ctx context.Context, artifacts ArtifactIndex) (Table, error)
	GetAutoIncrement(ctx context.Context) (uint64, error)
	SetAutoIncrement(ctx context.Context, val uint64) (Table, error)
	DebugString(ctx context.Context, ns tree.NodeStore) string
}

Then, under the hood, we had two different implementations of a table: NomsTable for the old storage format, and DoltTable for the new one. The same pattern was repeated for all the other objects needed to materialize data to disk: schemas, indexes, foreign keys, commits, etc.

This all sounds fine so far, except that due to limitations on the time and effort we were willing to expend on this “temporary” state of affairs, these abstractions didn’t fully capture all the necessary differences between the two implementations. This is regrettable but understandable: various database operations tend to be tightly coupled to their on-disk representations for reasons of performance. This meant that, in practice, there were many places in library code where we switched on the storage type of a database. It looked like this:

if types.IsFormat_DOLT(tm.vrw.Format()) {
	tbl, stats, err = mergeProllyTable(ctx, tm, mergeSch, mergeInfo, diffInfo)
} else {
	tbl, stats, err = mergeNomsTable(ctx, tm, mergeSch, rm.vrw, opts)
}

This situation made the “temporary” dual-state of two supported storage formats much harder to unwind. It wasn’t a simple matter of deleting the defunct interface implementations. Rather, we had to carefully disentangle hundreds of different library functions, most of which were not so helpfully named as in the above example, to determine which of them were still used by actual production code in the new storage format. And there were additional difficulties:

  • Hundreds of tests declared in the old storage format
  • Functionality spread across five different repositories
  • Thousands of databases shared publicly on DoltHub, including many of our own, using the old storage format. They would need to be migrated to the new format before we could remove support for it.

In short, removing support for the LD1 format was a daunting prospect. So why do it?

Why bother?

Removing old code paths and deleting defunct code is a lot of work. It’s just sitting there, not hurting anyone, maybe making your binary slightly larger. Why bother?

Software engineers love clean code and they hate “tech debt”. But at DoltHub, we don’t work for ourselves; we work for our customers. Customers don’t see code, and they couldn’t care less about “tech debt.” They just want a product that works well and their features shipped on time. So if you propose to spend time “paying down tech debt” rather than delivering new features, fixing bugs, or improving performance, you need to justify it with a business reason.

In our case, the business reason was that the dual code paths made it very difficult to reason about what functionality was actually in use, which in turn made it very difficult to change that code and therefore to build new features on top of it. This was especially true deep in the stack, such as where we serialize data to disk.

In particular: for the Dolt 2.0 release, I am adding support for adaptive encoding, which we implemented for the Postgres-compatible version of the database, Doltgres. Dolt should have it too, because it makes the storage and retrieval of TEXT and BLOB data much faster in the majority of use cases. Customers are continually surprised that TEXT types carry a performance penalty relative to VARCHAR, but they do. Adaptive encoding eliminates that penalty for many customers, so I want to add it.

But doing so in a way that works for existing customers requires the ability to change the encoding of a column independent of its declared SQL type. My first day digging around in the schema-encoding layer in pursuit of these changes left me with more questions than answers, and after a few more days of study I realized that a majority of the complexity and the code in this layer was in service of the old storage format. Even worse, I couldn't change it without hunting down and eliminating the many, many places those interfaces were used. What started as a limited, targeted pruning of a single interface to make my alternate schema serialization scheme possible quickly spiralled into changes that would make Dolt panic if it encountered an LD1 format database.

When I saw just how far-reaching the changes required to accomplish my feature were, I became convinced it was time to bite the bullet and unwind the “temporary” dual code paths that had been in place for four years. Our 2.0 release was our last window to stop supporting the pre-1.0 storage format, which meant the time was now.

Making the changes

Most of these changes were done the old-fashioned way: using an IDE and command line tools like grep to hunt down references to functions, then making changes by hand. There's not really an automated way to do this kind of change at scale. Coding agents are happy to try, but because of the widespread and delicate nature of the change, you end up spending a lot of time closely examining their work, which for this kind of task can often be slower than simply using the functions of your IDE. But there were a couple of exceptions where tool use sped me up:

  • Hundreds of test cases had been effectively defunct for several years, since they were running on a storage format that wasn’t used in production anywhere. They were testing… something. But not what we wanted. Coding agents were able to convert many of these tests for me in the background while I did other work. Because these tests weren’t doing anything useful in the first place, errors or omissions in their conversion didn’t bother me much, making this an ideal task for an LLM.
  • The deadcode command was useful throughout the project for finding functions and methods that were unused by any main program.
  • To migrate the few thousand old-format databases still on DoltHub, I wrote a bunch of scripts that called an admin-only endpoint to migrate them automatically. Where this failed due to size or other unreliability, I migrated them on my local machine with similar scripts.
  • Once the top-level usages of the old storage format had been safely removed, it became much more tractable to instruct coding agents to begin pruning the now-unused parts of the storage layer. Because the code base was structured so that this part of the code was restricted to its own packages, it was relatively quick work to confirm, simply by reviewing the changed file paths, that the agent hadn't made any inappropriate changes that could impact a production database.

After being impressed with Claude’s result in migrating some tests to the new format, I told it to reward itself with a poem, which likewise impressed me enough to share here:

[Image: Claude Code's poem]

The final result: since last month, Dolt has shed around 100k lines of code, or around 20% of the repo.

% git diff --shortstat v1.81.6
   736 files changed, 22856 insertions(+), 114385 deletions(-)
% sloc .

---------- Result ------------

            Physical :  425520
              Source :  330333
             Comment :  48633
 Single-line comment :  47115
       Block comment :  1521
               Mixed :  2461
 Empty block comment :  81
               Empty :  49096
               To Do :  536

Number of files read :  1500

This makes our binary a bit smaller, which is always nice. But more importantly, it makes it much simpler to reason about various library functions and therefore to add new features. Overall I've spent well over a month in pursuit of this goal, which tells you how certain I was that it was worth doing. It's satisfying work in its own right, but hard to justify unless coupled with an important business goal.

Conclusion

The moral of this story: "temporary" changes are permanent and hard to unwind. Oftentimes, the best way to deal with them is to not deal with them at all and just live with the consequences of the past. It's only when the weight of those past decisions becomes impossible to bear that you should take action this drastic. And even then, you should have a really important reason for doing so.

Want to learn more about Dolt, the world's first version-controlled SQL database? Visit us on the DoltHub Discord, where our engineering team hangs out all day. Hope to see you there.