Hunting a Memory Leak

Here at DoltHub, we built Dolt, the world's first version-controlled database. Dolt lets you branch, fork, clone, merge, and diff your relational data, in all the same ways Git lets you work with files.

Recently, a customer reported that memory usage for their Dolt SQL Server was growing slowly but consistently. Even though Dolt is built on Golang's managed runtime, which doesn't require applications to explicitly release allocated memory, that certainly doesn't mean Go applications are immune to memory leaks! In this blog post, we show how we worked with this customer to collect additional information and ultimately track down and fix the leak.

Collecting Information

The first thing we always want to do with any customer-reported issue is to get a repro case. Once we can reliably repro an issue, we can get it under the debugger and make quick progress on the fix. However, sometimes for complex issues it can be difficult to create a repro case without access to the same data and environment where the customer is seeing the issue.

Fortunately, we have some good tools for collecting information in these cases...

To better understand memory usage, we can use the memory profiling tools in the Golang runtime. In particular, we can collect multiple heap memory profiles as the process' memory is growing and then see a diff of the memory profiles to better understand where the memory is going. Let's take a look at how to enable memory profiling in a Dolt SQL Server and how to collect a profile.
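
We'll walk through collecting profiles below, but to preview the payoff: once you have two heap profiles saved over time (say heap1.pprof and heap2.pprof, hypothetical filenames), go tool pprof can diff them for you with its -diff_base flag:

go tool pprof -diff_base=heap1.pprof heap2.pprof

In the interactive prompt that opens, the top command lists the allocation sites whose in-use memory changed the most between the two snapshots, which is usually exactly what you want when hunting a leak.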

Enable Profiling for a Dolt SQL Server

Two arguments must be specified on the Dolt command line, and must appear before the sql-server subcommand:

  • --prof mem tells Dolt to enable memory profiling in the Go runtime
  • --pprof-server causes the Go runtime to listen on port 6060 for requests from go tool pprof for profiles

Here's an example invocation:

dolt --prof mem --pprof-server sql-server --config config.yaml

If the profiler is enabled, you will see output like this:

Starting pprof server on port 6060.
Go to http://localhost:6060/debug/pprof in a browser to see supported endpoints.
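
For context, the --pprof-server flag is doing the same kind of thing any Go program can do for itself: import net/http/pprof for its side effects and serve the registered handlers on a port. Here's a minimal sketch of that pattern (not Dolt's actual implementation):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Serve the pprof endpoints on port 6060, analogous to `dolt --pprof-server`.
	go func() {
		log.Println("pprof server listening on localhost:6060")
		log.Fatal(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... the rest of the application would run here ...
	select {}
}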

Collecting Profiles

Once the Dolt SQL server is running with memory profiling enabled, you can use go tool pprof to collect a profile. You’ll need to install the Go SDK to use this tool. Ideally, you should run this directly on the host running the Dolt SQL server with profiling enabled, so that you don’t have to expose access to port 6060. You’ll need the following arguments:

  • -proto tells Go to capture the profile in the standard protobuf output format
  • -output=heap.pprof tells Go where to write the output file containing the report
  • http://localhost:6060/debug/pprof/heap tells Go where to find the profile server, as well as the resource being profiled (i.e. the heap)

Here's an example showing how to run go tool pprof to collect a heap memory profile:
go tool pprof -proto -output=heap.pprof http://localhost:6060/debug/pprof/heap

If this runs correctly, you should see output like this:

Fetching profile over HTTP from http://localhost:6060/debug/pprof/heap
Saved profile in /Users/jason/pprof/pprof.dolt.alloc_objects.alloc_space.inuse_objects.inuse_space.006.pb.gz
Generating report in heap.pprof

Notice that two files are mentioned in the command output. The first is the raw profile fetched from the server, stored in a gzipped protobuf format. Because we specified -proto on the command line, the generated report is also a gzipped protobuf, so in this case the two files are identical and it doesn't matter which one you use. As a general habit, though, work from the report file.
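
If installing the Go SDK on the server host isn't an option, there are other ways to get the same data. The /debug/pprof/heap endpoint returns the gzipped protobuf profile directly, so any HTTP client should be able to save it to a file. And from inside your own Go programs, you can write a heap profile with the runtime/pprof package; here's a minimal sketch (the writeHeapProfile helper is ours for illustration, not part of Dolt):

package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

// writeHeapProfile saves a heap profile to the given path in the same
// gzipped protobuf format that go tool pprof produces.
func writeHeapProfile(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	runtime.GC() // run a GC first so the profile reflects live memory
	return pprof.WriteHeapProfile(f)
}

func main() {
	if err := writeHeapProfile("heap.pprof"); err != nil {
		log.Fatal(err)
	}
}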

Limitations of Heap Memory Profiling

So, what happened after we got some memory profiles from the customer and took a look at how memory was changing? Unfortunately, there wasn't an obvious answer, so we started looking for other clues.

Another good source of information is the Prometheus metrics exposed by a Dolt SQL Server. Using these, we took a closer look at the go_memstats_heap_alloc_bytes metric, which shows how many bytes of heap memory are currently in use. When we looked at this metric, we were a bit surprised to see that heap memory usage for the Dolt SQL Server process wasn't actually growing. The customer had already confirmed that overall memory usage on the host was steadily increasing, so where was this memory going, and what was allocating it?

It turns out there are a few limitations of Golang's memory profiling that are important to understand. The most important is that Golang memory profiles can only report on memory that the Golang runtime itself is managing. If an application uses memory allocated outside the Golang runtime, such as through a CGO dependency, that memory won't show up in the memory profile. A corollary is that if your application directly or indirectly causes the kernel to allocate memory, that memory lives in kernel space and won't show up in the Golang memory profile either.
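
One quick way to see this gap for yourself is to compare what the Go runtime thinks it's using against what the operating system has charged to the process. Memory allocated outside the runtime (by a CGO dependency, for example) shows up in the process's resident set size but not in the Go heap stats, and kernel-side memory shows up in neither, which is why we turned to host-level metrics next. A rough, Linux-only sketch:

package main

import (
	"bufio"
	"fmt"
	"os"
	"runtime"
	"strings"
)

func main() {
	// What the Go runtime knows about: its own heap.
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("Go heap in use:   %d MiB\n", m.HeapAlloc/1024/1024)
	fmt.Printf("Go runtime total: %d MiB\n", m.Sys/1024/1024)

	// What the kernel knows about: the whole process.
	f, err := os.Open("/proc/self/status")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		if strings.HasPrefix(scanner.Text(), "VmRSS:") {
			fmt.Println("OS-reported", scanner.Text())
		}
	}
}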

Kernel Memory Slabs

If the Golang process' heap usage isn't growing, then we need to expand our search for where the growing memory on the host is coming from... Dolt does have a CGO dependency, so that was one potential area to investigate, but before digging in there, we examined the overall memory usage on the host via /proc/meminfo. This virtual file is generated by the kernel to show stats on RAM, swap, slab allocation, caching, and more. Here's an example of what the data in /proc/meminfo looks like:

MemTotal: 16.00 GB
MemFree: 6.00 GB
MemAvailable: 12.00 GB
Buffers: 0.26 GB
Cached: 4.21 GB
SwapCached: 0.00 GB
Active: 1.16 GB
Inactive: 4.54 GB
Active(anon): 0.00 GB
Inactive(anon): 1.24 GB
Active(file): 1.16 GB
Inactive(file): 3.30 GB
Unevictable: 0.00 GB
Mlocked: 0.00 GB
SwapTotal: 0.00 GB
SwapFree: 0.00 GB
Dirty: 0.00 GB
Writeback: 0.00 GB
AnonPages: 0.00 GB
Mapped: 0.64 GB
Shmem: 0.00 GB
KReclaimable: 2.17 GB
Slab: 3.40 GB
SReclaimable: 2.17 GB
SUnreclaim: 1.23 GB
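
Rather than eyeballing the raw file across samples, it's handy to pull out just the slab-related lines and watch them over time. A small sketch of that (run it periodically, or wrap it in a loop):

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	// Print only the slab-related lines so growth is easy to spot across samples.
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "Slab:") ||
			strings.HasPrefix(line, "SReclaimable:") ||
			strings.HasPrefix(line, "SUnreclaim:") {
			fmt.Println(line)
		}
	}
}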

When we looked at /proc/meminfo over a few samples while the host memory was increasing, what stood out was growth in the slab cache. In Linux, the slab cache is a kernel memory management mechanism used to efficiently allocate memory for frequently used objects, such as inodes, file descriptors, or process structures. Instead of repeatedly allocating memory from the general pool (which is slow and may lead to fragmentation), the kernel uses slab caches to pre-allocate chunks of memory for specific object types. Each "slab" is a contiguous block of memory that holds multiple initialized instances of a particular object, ready for immediate use.

This approach significantly speeds up kernel operations and reduces memory waste. For example, when the system needs a new inode (a data structure that describes a file), it can pull one from the inode slab cache instead of constructing it from scratch. While this improves performance, the slab cache can grow large over time, especially on systems with high I/O activity, as objects are retained for reuse. The kernel will typically reclaim this memory under pressure, but in some long-running systems or those with aggressive caching, it can lead to surprisingly high memory use that's technically still available but reserved for potential future use.

To take a closer look at what's going on in the slab cache, we can use the slabtop tool to see a live display of slab usage, or we can look in /proc/slabinfo to see stats for the various slab caches. After looking in /proc/slabinfo, we noticed that the two caches growing the most were dentry and ext4_inode_cache, which told us that filehandles, and the directory entries and inodes they keep alive, were consuming the growing kernel memory.
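
If slabtop isn't available on the host, the same information can be pulled out of /proc/slabinfo directly. Here's a rough sketch that prints the approximate size of the two caches we cared about; the column positions assume the slabinfo 2.x format, and reading the file generally requires root:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	f, err := os.Open("/proc/slabinfo") // usually requires root
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		// Skip the version line and the column-header line.
		if len(fields) < 4 || fields[0] == "slabinfo" || fields[0] == "#" {
			continue
		}
		name := fields[0]
		numObjs, _ := strconv.Atoi(fields[2]) // <num_objs>
		objSize, _ := strconv.Atoi(fields[3]) // <objsize>, in bytes
		if name == "dentry" || name == "ext4_inode_cache" {
			sizeMiB := float64(numObjs*objSize) / (1024 * 1024)
			fmt.Printf("%-20s %8.1f MiB\n", name, sizeMiB)
		}
	}
}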

Finding the Root Cause

We're closing in on the culprit now... We know the growing memory on the host is in kernel space, and something is causing the slab cache to allocate more and more space for filehandles. A great tool for looking at which filehandles a process has open is lsof. When we took a look at its output, we saw the open filehandles to database files that we expected to see, but we also saw a few hundred unexpected open filehandles to deleted LOCK files from the stats subsystem.

Here's an excerpt from the lsof output showing some of these deleted LOCK files:

dolt      1 root 1146uW     REG     259,3         0    932002 /mydb/.dolt/stats/.dolt/noms/LOCK (deleted)
dolt      1 root 1147uW     REG     259,3         0    932176 /mydb/.dolt/stats/.dolt/noms/LOCK (deleted)
dolt      1 root 1148uW     REG     259,3         0    932348 /mydb/.dolt/stats/.dolt/noms/LOCK (deleted)
dolt      1 root 1149uW     REG     259,3         0    932521 /mydb/.dolt/stats/.dolt/noms/LOCK (deleted)

This tells us where the growing memory is coming from, and points very clearly to a filehandle leak: the stats subsystem is opening these LOCK files but never closing them.
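
You can get the same signal without lsof by reading the process's fd table directly: every entry in /proc/<pid>/fd is a symlink to whatever the descriptor refers to, and targets of deleted files carry a "(deleted)" suffix. A quick sketch that counts them (pass the Dolt server's PID as the first argument):

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: fdleaks <pid>")
		os.Exit(1)
	}
	fdDir := filepath.Join("/proc", os.Args[1], "fd")

	entries, err := os.ReadDir(fdDir)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	deleted := 0
	for _, e := range entries {
		// Each fd entry is a symlink to the file it refers to.
		target, err := os.Readlink(filepath.Join(fdDir, e.Name()))
		if err != nil {
			continue
		}
		if strings.HasSuffix(target, "(deleted)") {
			deleted++
			fmt.Printf("fd %s -> %s\n", e.Name(), target)
		}
	}
	fmt.Printf("%d open filehandles to deleted files\n", deleted)
}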

After digging into the stats code to look for this filehandle leak, we found that when the stats system rotated its storage, it wasn't closing the LOCK file, so every rotation leaked another filehandle and kept the associated kernel objects pinned in the slab cache. We fixed the root cause, added regression tests to ensure these filehandles don't leak again, and released the fix in Dolt v1.57.1.

Conclusion

There's always something to learn from every bug hunt, and I hope you found something in this post useful for the next bug you have to track down. Familiarity with Golang's memory profiling tools, with how the kernel manages memory, and with the metadata it exposes was essential to fixing this issue. If you want to talk about Golang memory management, debugging, or just geek out about data versioning, come join our Discord and say hello.
