Dolt Gets a cgo Dependency

At DoltHub, we're building Dolt, a MySQL-compatible SQL database with Git-like functionality, including branch, diff, merge, push, pull and clone. Dolt is written in Go. For a variety of reasons, we have traditionally avoided cgo dependencies in Dolt. We recently took a dependency on cgo in Dolt to achieve better disk compression. In this blog post, we take a look at the trade-offs involved in adding cgo dependencies to a Go project and why we decided to depend on cgo in the end.

The Value of Simplicity

Go's toolchain has first-class support for compiling and calling into dependencies written in C. When cgo is enabled, go build will call out to the configured (or discovered) C compiler, C++ compiler and assembler for any C, C++ or assembly source files which appear next to the Go source files in a package. The resulting object files will all be linked together to make up the compiled Go package. In addition, cgo supports inline C definitions written in the comment immediately preceding import "C". For example, the following will compile the commented C code using the platform's C compiler and call into it from the Go code.

package main

// int add(int a, int b) {
//   return a + b;
// }
//
// long stdver() {
//   return __STDC_VERSION__;
// }
import "C"

import "fmt"

func main() {
	fmt.Println(C.add(C.int(16), C.int(16)))
	fmt.Println(C.stdver())
}

In general, creating bindings to C libraries and consuming them from Go is about as easy as could be expected. But ergonomically, it still comes with a multitude of trade-offs compared to native Go. Here are a few that loom large for us at DoltHub.

Dependencies for Developers

Today, a new developer at DoltHub, or someone coming across the project in an open-source context, can clone the repository and get started right away. The only major dependency for building Dolt and running its unit tests is a modern Go SDK. Things like go test ./... and go install ./cmd/dolt work without any configuration.

As one gets deeper into Dolt development, there are further platform dependencies we take, but they are largely confined to tests. For example, running the full bats test suite requires a number of external dependencies, including bats itself. But the first-touch experience, and being able to make significant progress and full contributions with almost no environment setup, is a big win for first-time contributors and even for our internal tooling and CI/CD workflows.

Adding a C/C++ toolchain to the mix, and potentially dependencies on external C/C++ libraries, can change things dramatically. C/C++ toolchains are quite platform-specific, and installing and configuring them is a different process from installing the Go SDK.

Entrypoints for Developers

Depending upon how the C/C++ dependency is taken, it can end up requiring installation of external third-party libraries, or a separate build process, outside of go build, that prepares all dependencies in an appropriately reproducible and versioned way.

Transitive Dependencies

Transitive dependencies can be more complicated when cgo is involved. Go's module system explicitly handles version resolution and builds a consistent set of dependencies for the entire closure of the Go dependencies. Nothing equivalent exists for C/C++ dependencies, so if two different Go packages take dependencies on the same C library, the resulting diamond problem has to be addressed explicitly.

Compatibility as a Go Module

If a cgo dependency requires coordination beyond go build to get C libraries built for a target platform, then the resulting Go module is no longer consumable as a dependent module without further coordination. This affects the ergonomics of consuming projects, which even in supported cases may need to set specific build tags and specific environment variables to get the right behavior for their build. It's nice to just import "github.com/dolthub/driver" from your Go code and get Dolt embedded as a library.

Heterogeneous Dependencies

The toolchains used to compile Dolt and its dependent libraries become inputs to its build process. So does their configuration. C/C++ toolchains are much more heterogeneous and configurable than anything we typically deal with in go build. As a simple example, just the optimization level of the C compilation can have a big impact on the resulting code and its performance, whereas in Go it's not typical to create separate debug and release builds at all. Another example is preprocessor defines, which are commonly used to configure compile-time options for C libraries. The equivalent in Go land is build tags, but we have gotten far into Dolt's development without taking any dependencies on build tags for our typical development, test and release workflows.

In the case of cgo dependencies, this naturally extends to the system libraries that make up the toolchain, including the C and C++ standard libraries, the compiler support libraries, etc. The reality is that CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build ./cmd/dolt with a specific version of the Go SDK has quite predictable behavior across a wide range of developer environments. The model for how to determine the transitive closure of dependencies that are inputs to the compile and link is quite simple. Once a C/C++ toolchain with system libraries, compiler support libraries, default settings, etc., is brought into the mix, things get quite a bit more complicated.

These external dependencies and their impact on the resulting binary can have negative consequences for developers and tooling. Across our organization we have processes that compile Go code in order to test it, profile it, benchmark it, package it for release, etc. Ideally all of those processes would use the same dependencies and result in the same artifacts. You want your regression benchmarks to look like your release binaries, for example. In a cgo-less world, it's easy to take this amazing property of the Go SDK for granted.
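One way to check that two processes really did build the same kind of artifact is to ask the binary itself. The following is a minimal sketch, not something taken from Dolt's codebase: it reads the settings the Go toolchain embeds in a binary via runtime/debug.ReadBuildInfo and prints a few of them. The helper name buildSetting is hypothetical, and the sketch assumes Go 1.18 or later, where these settings are recorded. Some kinds of builds may record fewer settings, so the helper falls back to "unknown".

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// buildSetting returns the value the Go toolchain recorded for key
// (for example "CGO_ENABLED", "GOOS" or "GOARCH") when it built the
// running binary, or "unknown" if no such setting was recorded.
// The function name is an illustrative choice, not part of Dolt.
func buildSetting(key string) string {
	info, ok := debug.ReadBuildInfo()
	if !ok {
		return "unknown"
	}
	for _, s := range info.Settings {
		if s.Key == key {
			return s.Value
		}
	}
	return "unknown"
}

func main() {
	// Print the settings most relevant to comparing two artifacts.
	for _, key := range []string{"GOOS", "GOARCH", "CGO_ENABLED"} {
		fmt.Printf("%s=%s\n", key, buildSetting(key))
	}
}
```

Logging these values at startup of a benchmark or release binary makes it easier to notice when one pipeline built with cgo enabled and another did not.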

Cross-Compilation

Go without cgo has one of the best cross-compilation stories of any language that compiles to native code that I've come across. Any Go SDK installation can create statically linked binaries for any supported target operating system and architecture, without much configuration at all. Set the environment variables GOOS=linux, GOARCH=arm64 and CGO_ENABLED=0, run go build ./cmd/dolt, and no matter what platform you are on, you get a statically linked binary for aarch64-linux.

Once cgo is involved, you need a fully configured C toolchain to be able to build and link binaries for a given target platform. At DoltHub, we have developers on Linux, macOS and Windows. And our existing development workflows do make some use of cross-compilation. For example, our infrastructure build systems are agnostic to the target architecture when building Dolt into container images. For our developers, we have lambdabats, which runs Dolt's bats tests with massive parallelism using AWS Lambda. It currently works by compiling a Dolt binary from the developer's workstation and running the target bats tests in an aarch64-linux container with the compiled Dolt version installed. Another example is our release process, which is able to run without too much toolchain or platform configuration in a simple Linux container to create binaries for x86_64-linux, aarch64-linux, x86_64-darwin, aarch64-darwin and x86_64-windows.

Release Artifact Dependencies

With cgo disabled, the Go toolchain creates static or minimal-dependency binaries for every platform it targets. For example, on Linux the binaries will be completely static, and on macOS the binaries will be dynamic, but with minimal dependencies on things like libSystem.dylib just for making system calls.

Once cgo is involved, the linkage and runtime dependencies of the resulting binary depend heavily on the specific C libraries linked, the specific flags passed to the Go toolchain while building, and various build tags that were set or unset during the build. By default, building a Linux binary with the Go SDK on a GNU/Linux host, with cgo enabled, will not create a fully statically linked binary. It will dynamically link against glibc and will depend on it at runtime for things like DNS name resolution and for implementing some of the functionality exposed in os/user.
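For the DNS case specifically, the net package lets a program ask for Go's built-in resolver even in a cgo-enabled build. Below is a minimal sketch using the PreferGo field of net.Resolver; the helper name pureGoResolver is made up for illustration. As far as I know, the netgo and osusergo build tags select the pure Go implementations for the whole binary at build time, but this sketch only shows the per-resolver, run-time option. Note that it changes which resolver is used, not how the binary is linked.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// pureGoResolver returns a resolver that asks the net package to prefer
// its built-in Go DNS client over the C library's resolver.
// The function name is hypothetical, chosen for this example.
func pureGoResolver() *net.Resolver {
	return &net.Resolver{PreferGo: true}
}

func main() {
	// Bound the lookup so the example cannot hang on a broken network.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	addrs, err := pureGoResolver().LookupHost(ctx, "localhost")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println(addrs)
}
```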

Dynamically linking against glibc can result in more uniform behavior for the resulting binary, when compared to other binaries on the host to which it's deployed. But it also makes it easy to create artifacts that depend on a newer version of glibc than what's available at the deployment site. For precompiled binaries targeting generic glibc-based Linux systems, you probably want to build for the lowest common denominator amongst the supported platforms. And even then, you will exclude more esoteric systems like musl-based distributions.

A way forward, if one wants to deliver static binaries on Linux, is to statically link against a different libc implementation. A recommended configuration might be to enable cgo, build against musl libc, and pass the toolchain flags that produce a fully statically linked binary. Musl libc supports static linking by design.
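Whichever route is taken, it is worth verifying what a release artifact actually links against rather than assuming. The following sketch uses the standard library's debug/elf package to list the shared libraries a Linux binary declares as needed at load time. The helper name importedLibraries and the command-line handling are illustrative assumptions, not part of Dolt's release tooling. An empty list is a reasonable hint that the binary is statically linked, while a dynamically linked glibc build would typically list the C library here.

```go
package main

import (
	"debug/elf"
	"fmt"
	"os"
)

// importedLibraries returns the shared libraries an ELF binary declares
// as needed at load time. An empty result suggests a static binary.
// The function name is hypothetical, chosen for this example.
func importedLibraries(path string) ([]string, error) {
	f, err := elf.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	return f.ImportedLibraries()
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: pass the path to an ELF binary as the first argument")
		return
	}
	libs, err := importedLibraries(os.Args[1])
	if err != nil {
		fmt.Println("could not inspect binary:", err)
		return
	}
	if len(libs) == 0 {
		fmt.Println("no shared library dependencies declared")
		return
	}
	for _, lib := range libs {
		fmt.Println(lib)
	}
}
```

A check like this could run as a release gate, failing the build if a binary that is supposed to be static declares any shared library.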

Source Dependencies

Go as a language ecosystem has a strong bias for source dependencies which get built to release artifacts as part of the overall build of the top level target involved in a high level command like go install, go build or go test. With fast compilation speeds and strong input-based caching, performance remains very reasonable and the reliability of the end-to-end process is high.

In many contexts, traditional C and C++ projects will take a different approach. C/C++ packages will take dependencies on release artifacts (static and dynamic libraries) which are produced by a separate build process of the depended-upon library. In addition, many C/C++ projects will have separate configure and install steps, which rely on self-probe tests and often cannot support cross-compilation well.

In the context of cgo, this can mean needing access to externally-compiled C/C++ libraries which the Go package will link against. To maintain control over your dependencies and your build process, those will typically need to come from a higher level of build coordination, and an external tool outside of go {build,install,test} will be necessary to coordinate it all. Even a dependency on something like GNU Make is less compelling for our Windows developers than just being able to run go test ./....

An option is to make the Go package which uses cgo responsible for building the C/C++ code itself. This can be made to work for some use cases by sticking to the functionality that's available in go build for building C using the configured C toolchain. If a third-party package can be made to compile for your target platforms this way, it can be compelling. But even when it works well, it often requires massaging a third-party package in particular ways (moving files around, changing include paths, shipping hard-coded config.h files, etc.), which causes deviation from the upstream third-party code and requires maintenance going forward.

IDE Support

Related to Source Dependencies, and the potential need for an external build system to coordinate C/C++ dependencies, native Go projects are easier to get working in a wide variety of IDEs than projects which rely on cgo. This is especially true for projects which adopt an external build system to coordinate the building of the C code. Sometimes this can be made to work with pre-built platform-specific libraries at specific locations.

Profiling Support

At DoltHub, we make extensive use of Go's heap and CPU profiling to investigate and improve the runtime behavior of Dolt. Getting compelling profiles out of the running C/C++ code within a cgo program is a whole different affair, typically involving different tools which are much more platform specific than runtime/pprof.StartCPUProfile. Understanding the performance impact of the cgo calls and the overhead of the stack bridging which the runtime performs to make the call is yet another area of knowledge and expertise which developers working on Dolt will need to master.

The Case For cgo

The primary reason to take a cgo dependency is to depend on an implementation that exists in C or C++ and does not exist in Go. You also might take a cgo dependency if you wanted additional performance that was only available by stepping outside of Go.

At DoltHub, we've come across a case of needing a C library dependency in the past, when we were adding better collation support to go-mysql-server. In that case, we explicitly jumped through a number of hoops to avoid cgo. You can read about how we integrate certain functionality from ICU4C into Dolt using WASM. But just this week we've intentionally taken our first cgo dependency in Dolt and we intend to keep it going forward. We took a dependency on facebook/zstd in order to get access to lightning fast and highly effective compression for our block storage layer. What's different between when we evaluated cgo for the ICU4C dependency and depending on facebook/zstd?

  1. Block compression and decompression are on the critical path of everything Dolt does. We are much more sensitive to their performance than we are to the performance of Unicode-aware collation handling.

  2. facebook/zstd is in some ways a much smaller library with a much simpler build process than ICU4C. It was easier to achieve source-only dependencies and maintain a build compatible with go build when taking the cgo dependency.

  3. Most of the ergonomic pain points mentioned above are problems for the developers of Dolt, not for the users of Dolt. Our users rely on us to build a world class SQL server that supports branch, merge, diff, push, pull and clone. They are not sensitive to how we get there.

These points combined meant that we felt it was the right tradeoff to take a cgo dependency on facebook/zstd. The Go module github.com/dolthub/dolt/go will continue to be consumable as a normal Go dependency on all platforms that facebook/zstd supports. Building Dolt from now on will require cgo to be enabled, which will in turn require a fully configured C toolchain for the target platform.

We will continue to deliver fully-statically linked Dolt binaries for {x86_64,aarch64}-linux in the releases section of Dolt's GitHub repository.

Conclusion

Even with native integration and broad support for cgo from within the Go SDK, cgo can represent a significant cost in ergonomics, tooling support, and ongoing maintenance. As part of taking a cgo dependency, you need to solve all the issues inherent in deploying production C/C++ code to your target environments. For firms that already have substantial investments in C/C++, that cost itself may be small. But firms that are not currently delivering C/C++ to their target platforms will need to pay the startup costs associated with deploying C/C++ code, including building, testing, profiling, benchmarking, license auditing, etc. In addition to the new language support, cgo itself comes with some ongoing ergonomic costs compared to developing pure Go. A decision to take a cgo dependency should be considered carefully.

If you would like to discuss Go and cgo, and the tradeoffs regarding the tooling and ergonomics of developing with cgo dependencies in general, please stop by our Discord.