Generating Golang Source Files at Build-Time with go:generate
If you want to start a shouting match over Thanksgiving dinner, try asking your drunk uncle if generating source code from templates is a code smell.
I've seen some true horror stories in my time about generated source files. I've seen generated files bloated with 100K unused symbols that brought builds to a crawl. I've stepped through machine-generated dependency injection classfiles that read like they were designed by MC Escher. And I once beheld a project that used Bazel build rules to generate other Bazel files dynamically at build-time.
Generated source code has a well-deserved stigma. It's not readable. It's not maintainable. If you find yourself using code generators in order to avoid writing the same nearly-identical code over and over and over again, then there's often either a flaw in your design or a limitation in your language.
And yet sometimes we need to use it anyway. Because sometimes it really is the best tool for the job. Sometimes you really need to do some compile-time computation and hard-code the results. Sometimes you really need to define a type from some definition file. And sometimes you're writing Golang code in 2021 and generics haven't been invented yet.
So sometimes we begrudgingly accept machine-generated source code, because for all its faults, there are situations where manual code is worse. For example, in any code that is largely mechanical boilerplate, it's very easy for a human coder to accidentally introduce a devastating typo or copy-paste error. In these cases, it's safer to just let the machine do all the boring work for you, since machines don't get tired and aren't as easily distracted by Reddit. The most common places to find generated code in real software are things like foreign function interfaces, remote procedure calls, and serialization/deserialization logic.
Generating Code at Build-Time
If you're using a generic build system like make, it's easy to add generated code to your project since you can run arbitrary commands as build rules. But this also makes the side-effects of building your project difficult to reason about, and allowing a build system to run any command or access any file could even be an attack vector.
So it's not really surprising that languages that provide a package system often provide their own build tooling as well that's tightly integrated with the language. These systems can be highly configurable and even allow for custom build-time logic, but in a way that's hopefully sandboxed, constrained, and easy to reason about.
So what does Golang do?
Go has go install as the go-to command to both download packages from the package manager and install them from source. It manages Go dependencies, but it doesn't do any kind of pre-processing or code generation. This is potentially for the best, since it keeps the tool simple and reduces potential vulnerabilities for users installing Go packages. Instead, the standard procedure is for any generated code to be included in the source repo by the developer. But since go install can't do code generation itself, developers need another tool.
For the first seven years of Go's life, developers who wanted to use generated code in their package would need to use a build system like make, with all the worrisome implications of running arbitrary scripts at build time.
But then in 2014, we got go generate, which promised to be an exciting and novel way to... run arbitrary scripts at build time.
To be clear, I don't think that go generate is bad. I know I've been a harsh critic of Golang in the past, but in this case I think go generate is perfectly fine at what it does: that is, being an incremental improvement over make for projects that don't need the full complexity of make. One of its stated goals was to remove some uses of make in the Go repo, without trying to be a full replacement for it. And in places where go generate is used, it's definitely a preferable alternative to make. It succeeds at that goal. The ways in which it differs from make are notable.
For starters, unlike make, go generate is deliberately very simple. It scans your project for comments that begin with //go:generate, and then it executes the rest of the comment as a command.[1]
Effectively, this makes your build directives part of your source files. As the name suggests, the purpose of the directives is to generate source files, although this isn't enforced: go generate can run any command, with the same environment and permissions as the go command itself.
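To make the scanning step concrete, here is a minimal sketch of a line-based directive scan. It is not the real go generate implementation: findDirectives and the sample source are hypothetical, and the real tool also splits the command into words and expands environment variables, which this sketch skips. It does follow the documented directive form (the line must start with //go:generate, with no leading whitespace), and because it never parses Go, it also picks up a matching line inside a raw string literal.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// findDirectives is a simplified, hypothetical stand-in for the scan that
// go generate performs. It returns the command text of every line that
// starts with "//go:generate" followed by a space or tab. It never parses
// Go, so it cannot tell a comment from the inside of a string literal.
func findDirectives(src string) []string {
	const prefix = "//go:generate"
	var cmds []string
	sc := bufio.NewScanner(strings.NewReader(src))
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, prefix+" ") && !strings.HasPrefix(line, prefix+"\t") {
			continue
		}
		cmds = append(cmds, strings.TrimSpace(line[len(prefix):]))
	}
	return cmds
}

func main() {
	// Sample input, invented for this illustration.
	src := "package demo\n" +
		"\n" +
		"//go:generate go run ./gen -out table.go\n" +
		"var help = `\n" +
		"//go:generate echo inside a raw string\n" +
		"`\n" +
		"    //go:generate echo indented, so not matched\n"
	for _, cmd := range findDirectives(src) {
		fmt.Println(cmd)
	}
}
```

Running this prints the first two commands and skips the indented one, since a directive has to begin at the very start of a line.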
The simplest usage is to write a Go program with a main() function and a directive that runs it:
//go:generate go run . -out ./generated.og.go
package main

import (
	"flag"
	"fmt"
	"os"
)

func main() {
	// Destination path for the generated file, passed in by the directive above.
	out := flag.String("out", "", "output file")
	flag.Parse()

	file, err := os.Create(*out)
	if err != nil {
		fmt.Fprintf(os.Stderr, "ERROR: %v\n", err)
		os.Exit(2)
	}
	defer file.Close()

	// Header marking the file as machine-generated.
	fmt.Fprintf(file, "// Code generated by scriptgen; DO NOT EDIT.\n\n")
	// Write the rest of source file here
}
It's also common to see directives that run shell scripts, or multiple directives to run the same code with different parameters:
// The go file containing this directive doesn't matter, but should be related to the functionality
// of the referenced script.
//go:generate ./regenerate.sh ./firstDirectoryToRegenerate
//go:generate ./regenerate.sh ./secondDirectoryToRegenerate
Tying build directives to relevant files also makes sense. It means that you can pass tags or regexes to the go generate command to control what runs. It also means that you can run go generate within a single package to regenerate just the files relevant to that package. But most importantly, it means that the build rules live next to the code that actually uses them, instead of all being stored together in a makefile somewhere, which makes it harder for the directives to drift from the code that uses them.
It's worth noting that go generate is completely separate from go install or go mod. It does not do any dependency analysis. It will not run automatically when you install your package. It will not install any dependencies, even if your directives only depend on Go tools. This means it likely won't run on clients that don't have your dev dependencies installed, but that's by design. Users shouldn't be running go generate as part of the build process; developers should check the generated files into their source repo. Having users run go generate on untrusted code would be a major attack vector.
go generate in the Wild
We develop Dolt, the first SQL database with Git-inspired version control capabilities. Dolt is written entirely in Go, which makes it a good case study of how go generate is used in the wild. And Dolt relies on generated source code in many of the situations outlined above where it makes sense:
- Dolt uses flatbuffers as a serialization format, and needs flatc to generate Go source files for interfacing with them. We have a shell script to generate the necessary source files. This script isn't currently used in a go generate directive, but we could add one to the doc.go file in that module if we wanted to automate rebuilding the generated files.
- Dolt uses gRPC in order to run commands against a remote server. The endpoints receive and send protobufs. We use code generation for accessing these endpoints and interacting with protobufs.
- Our custom test harness runs SQL scripts on test databases and asserts that queries have specific results. A custom code generator called ScriptGen allows us to write these test scripts in a syntax much closer to bare SQL, and then generate a Go source file containing structured data that's more amenable to the test harness.
- Additionally, we have "plan tests" that assert that specific queries result in the analyzer producing specific physical plans. These tests have some churn: sometimes changes to the analyzer can slightly modify the physical plan that gets produced. We want to be able to detect these changes when they happen, review them, and potentially update the test to use the new plan. So we wrote a custom code generator called PlanGen that runs the plan tests and then modifies the test file to update any tests whose output changed. Reviewing the resulting diff allows us to quickly see which tests had churn, and if we decide the change is acceptable, we can commit the changes.
- A large part of Dolt's query analyzer involves building an expression tree, normalizing it, and then applying transformations to it. This involves creating lots of types for the various nodes of the expression tree, and these types are just different enough that we can't efficiently express their definitions in pure Go without a lot of redundancy. So we wrote Optgen, a tool to generate type definitions for the expression tree, based on CockroachDB's generator of the same name.
- Finally, Dolt makes use of Stringer, a tool maintained by the Go team (in golang.org/x/tools) for generating space-efficient string representations for enum types. The most common use case for this is printing enum values for debugging purposes. Here's an example of a type that uses Stringer to generate its String() method, and here's the generated code.
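For readers who haven't seen Stringer output before, here is a rough, self-contained sketch of the pattern. JoinKind and its values are hypothetical and are not taken from Dolt, and the String() method is written by hand to approximate the shape of what stringer emits (one concatenated name string plus an index table) rather than copied from real generated output.

```go
package main

import (
	"fmt"
	"strconv"
)

// JoinKind is a hypothetical enum used only for this illustration. In a real
// project the directive below would produce a separate generated file; here
// the String method is hand-written in the same file so the sketch runs on
// its own (running the directive as well would define the method twice).
//
//go:generate stringer -type=JoinKind
type JoinKind int

const (
	Inner JoinKind = iota
	Left
	Cross
)

// All names are stored in one string, with an index table marking where each
// name starts and ends. This is what makes the representation compact.
const joinKindNames = "InnerLeftCross"

var joinKindIndex = [...]uint8{0, 5, 9, 14}

// String returns the constant's name, or a numeric fallback for values that
// have no name.
func (k JoinKind) String() string {
	if k < 0 || int(k) >= len(joinKindIndex)-1 {
		return "JoinKind(" + strconv.FormatInt(int64(k), 10) + ")"
	}
	return joinKindNames[joinKindIndex[k]:joinKindIndex[k+1]]
}

func main() {
	fmt.Println(Left)        // prints "Left"
	fmt.Println(JoinKind(7)) // prints "JoinKind(7)"
}
```

The appeal of generating this is that the name table can't silently fall out of sync with the constants as long as the directive is rerun whenever the enum changes.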
In addition, Dolt depends on both gRPC-go and protobuf-go as submodules, which are both developed by Google and thus can reasonably be assumed to model good practices for both Golang in general and go generate specifically. These modules also use go:generate to produce test code. You can see an example of protobuf-go using a generate directive to produce test code here, and gRPC-go uses a directive here to run this script to generate tests.
That's All For Now
I'm curious what you think. Is go generate a reasonable way to generate source files in Go projects? Is it better to just use a more general purpose build framework? Or is wanting source code generation at all a sign that you need to rethink your program's design?
If you have really strong opinions about it, you can always join our Discord and let us know. We always love meeting passionate developers.
[1] Technically, it scans your project for lines beginning with the string //go:generate. It doesn't actually parse Go, which keeps the implementation simple but also means that it will match against lines in multi-line strings. Whether this is actually a problem or not is largely a theoretical debate, since it's not like people are actually writing any code where this would make a difference.