Did you know that Git has a garbage collector? It does, but most Git users will never encounter it because most usage patterns never need it. Even when it runs, it often runs transparently to the user.
Garbage collection is one of those features that, when it works well, users never have to think about. It runs silently in the background, and users never realize it's there. If a user even notices that garbage collection exists, something has usually already gone wrong.
Dolt is the first SQL database to have Git’s version control semantics. Like Git, Dolt’s data model is a repository of content-addressed chunks. When a chunk no longer has any references to it, it can safely be deleted. And like Git, Dolt accomplishes this cleanup via garbage collection.
Dolt does a lot of things right with garbage collection:
- Dolt’s GC runs automatically in the background while the server is running.
- Dolt’s GC doesn’t “stop the world”. It doesn’t disconnect or stall existing clients, and it doesn’t block new connections. Clients can continue to read and write to the database while GC is running.
- Dolt uses a generational GC in order to speed up subsequent GC runs: chunks that appear in a Dolt commit are assumed to stick around and get moved to a separate “oldgen” file. Oldgen chunks are exempted from future automatic GC runs.
Dolt uses a standard tree-walk approach: walk the graph of references, copy every reachable chunk into a new file, then delete the old file. The core logic of the garbage collector is straightforward and fits in a single 80-line method. All of the complexity comes from managing the fact that active connections are reading and writing data while GC runs.
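That copying walk can be sketched in a few lines of Python. To be clear, this is an illustration of the general technique, not Dolt's actual code: the "store" here is just a dict keyed by content address, and `refs_of` is a hypothetical stand-in for decoding a chunk's outgoing references.

```python
import hashlib

def chunk_id(data: bytes) -> str:
    """Content address: chunks are keyed by a hash of their bytes."""
    return hashlib.sha1(data).hexdigest()

def collect_garbage(old_store: dict, roots: list, refs_of) -> dict:
    """Copy every chunk reachable from `roots` into a fresh store.

    `old_store` maps chunk id -> bytes; `refs_of(data)` returns the
    chunk ids that a chunk references. Anything not copied is garbage
    and disappears when the old store is deleted.
    """
    new_store = {}
    stack = list(roots)
    while stack:
        cid = stack.pop()
        if cid in new_store:          # already copied; skip
            continue
        data = old_store[cid]
        new_store[cid] = data         # copy the live chunk
        stack.extend(refs_of(data))   # walk its outgoing references
    return new_store                  # the old store can now be deleted
```

`refs_of` is supplied by the caller because how references are encoded inside a chunk is a detail of the storage format. Note that this sketch keeps the "already copied" record (`new_store`'s key set) in memory, which is exactly the property that caused trouble at scale, as described below.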
In the vast majority of cases, this just worked, but there was one case where the garbage collector could cause issues. If a user made a very large number of transactions or wrote a large amount of data without running GC, the time to run the garbage collector began to scale with the number of transactions and with the total amount of data written. Worse, the memory requirements of GC scaled the same way.
To be clear, this was not a common occurrence. I'm talking about hundreds of millions of transactions, or hundreds of gigabytes of data written, since the last GC. Since Dolt's server mode runs GC automatically, most use cases can never accumulate that much garbage.
But there were fringe cases where it could happen: for example, the server might be running a very old version of Dolt, from before GC was automatic. Or a user might use Dolt's command-line utilities to import a colossal amount of data while the server wasn't running. In those fringe cases, users could end up in a scenario where the next GC run became the bottleneck.
In the absolute worst case, GC used so much memory that the operating system terminated the server. And since GC wasn't resumable, it would restart from the beginning the next time the user launched the server, where it could get terminated again. With enough activity since the last GC, it could become impossible for GC to ever complete, because it required more memory than the system had available.
We weren’t satisfied with this outcome. We want users to use Dolt in all manner of ways, not just the common use cases. We want to ensure that no matter what happens, Dolt users never end up in a state where they can’t run their database. So we fixed this.
Introducing Incremental GC
As of Dolt v1.86.6, we’ve added a new configuration setting for garbage collection: incremental garbage collection.
When this feature is enabled, Dolt changes how it writes to storage during GC. Instead of writing a single large file containing all of the processed chunks, Dolt will write multiple smaller files. In the event that GC is interrupted, Dolt will use any already-written files to prevent redundant work.
If combined with the configuration setting for memory-mapped chunk files, incremental GC also reduces the memory requirements of GC. By using these files as a record of which chunks have already been written, Dolt removes the need to keep that record in memory.
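To illustrate the idea, here is a toy Python sketch of a resumable, file-backed output sink. This is not Dolt's storage format or API; the file layout, names, and pickle encoding are all invented for the illustration. The point is the shape of the technique: output is rotated across several capped-size files, and the "has this chunk already been written?" check consults the finished files on disk rather than a growing in-memory set.

```python
import os
import pickle

class IncrementalSink:
    """Toy sketch: GC output rotated across capped-size files.

    On a resumed run, chunks found in earlier files are skipped, so an
    interrupted GC doesn't redo finished work, and the record of what
    has been written lives on disk rather than in memory.
    """

    def __init__(self, dir_path: str, max_file_size: int):
        self.dir = dir_path
        self.max = max_file_size
        os.makedirs(dir_path, exist_ok=True)
        self.current = {}   # chunks buffered for the file being built
        self.size = 0

    def _files(self):
        return sorted(f for f in os.listdir(self.dir) if f.endswith(".chunks"))

    def already_written(self, cid: str) -> bool:
        # Stand-in for consulting memory-mapped finished files: the
        # lookup goes to disk, not to a big in-memory set.
        for name in self._files():
            with open(os.path.join(self.dir, name), "rb") as f:
                if cid in pickle.load(f):
                    return True
        return cid in self.current

    def write(self, cid: str, data: bytes):
        if self.already_written(cid):
            return                      # resumed run: skip redundant work
        self.current[cid] = data
        self.size += len(data)
        if self.size >= self.max:       # file reached the size threshold
            self.flush()

    def flush(self):
        if not self.current:
            return
        name = "%06d.chunks" % len(self._files())
        with open(os.path.join(self.dir, name), "wb") as f:
            pickle.dump(self.current, f)
        self.current, self.size = {}, 0
```

In the real system, a linear scan over every finished file would be far too slow; this is where memory-mapping the chunk files pays off, since the lookup can go through indexed, OS-cached pages instead of application memory.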
For Dolt servers, this setting is enabled by modifying your server's config.yaml to include a line that sets the size threshold at which Dolt writes incremental chunk files. An example config that sets the threshold to 1GB looks like this:
```yaml
behavior:
  auto_gc_behavior:
    incremental_file_size: 1000000000
```
If you’re manually running GC, you can also configure this by passing the --incremental-file-size flag to your invocation.
Here’s what it looks like from a MySQL client:
```sql
> CALL DOLT_GC('--incremental-file-size', 1000000000);
```
And here’s what it looks like in Dolt’s command line utilities:
```shell
dolt gc --incremental-file-size 1000000000
```
Again, most users don’t need to worry about this setting at all. But the existence of this setting provides an escape hatch in the event that a database has so many garbage chunks that it creates performance issues.
Currently, the default behavior is to not enable incremental garbage collection unless an incremental file size is explicitly provided. In future versions of Dolt, we may change this to make incremental garbage collection the default behavior.
And that’s all for today. As always, if you have any questions, curiosities, feature requests, or if you just want to chat, you should always feel free to join our Discord server. We’d love to chat with you.