Generics and Value Types in Golang

April 1, 2022

9 min read

Go 1.18 has been released and along with it comes long-awaited support for Generics! Generics are the most significant change to the language in years. They add a new dimension to what is otherwise a minimalist type system. From the very beginning, Golang has supported dynamic polymorphism via Interfaces. Generics now provide static polymorphism to Golang.

Here at DoltHub, we're big fans of Golang. We used it to build Dolt, a version-controlled MySQL-compatible Database you can branch, diff, and merge. Dolt is continuously improving, with the goal of reaching parity with MySQL in both performance and correctness. For this reason, we're particularly interested in the performance of Golang's various features. Today we're going to talk about Golang's implementation of Generics and what it means for CPU and memory utilization.

Update 2022-04-02

The initial post of this blog featured benchmarks that incorrectly used the Goland micro benchmarking framework. As a result, the reported benchmark metrics were off by a few decimal places. An update version of the benchmarks has been posted on Github and an updated version of the results are included below.

Generics are Slow?

Following this major release, there has been a wave of content published about Generics in Go and how they will impact existing Go projects. Among the best analysis so far is this deep-dive written by Vincent Marti. If you haven't yet read it, I highly recommend you take a look. The primary focus of the post is how Generics are implemented by the Go compiler and how those design decisions can negatively impact performance of generic go code. The post concludes by summarizing its findings in a list of best-practices for Golang Generics. Among them is "DO NOT rewrite interface-based APIs to use Generics." I'd like to add a caveat to that recommendation, but first we need to take a look under that hood and figure out what (can) make Generics slow in the first place.

Monomorphization

In most commonly used languages, static and dynamic polymorphism have little in common from an implementation standpoint. Dynamic polymorphism is most commonly implemented using virtual method tables, or vtables for short, to dynamically resolve method calls at runtime. Static polymorphism is typically implemented entirely at compile-time by generating a new version of the polymorphic function for each type used to invoke it, a process known as monomorphization. In implementing Generics, the Golang team choose a middle path they call "Dictionaries and Gcshape Stenciling". It's a combination of static monomorphization ("stenciling") and dynamic calls via vtables ("dictionaries"). Rather than stenciling a new function for every type used to invoke the function, the compiler groups invocation types by "gcshape" and generates one copy of the function for each gcshape. Generally speaking, value types each have a unique gcshape, but reference types (pointers and interfaces) all share a single gcshape. When you pass reference types to generic functions in Golang, static polymorphism turns into dynamic function dispatch.

The consequence of this design decision is a major impact on performance. It's the reason why Vincent Marti's post concludes that Generics can make your code slow. Its analysis focuses specifically on reference types and explains in great detail how and why performance degrades when they're used in combination with Generics. However, missing from the discussion is how generics interact with value types.

Generics are Fast?

We have a sense of how generics work, and we know they don't provide any clear performance wins for reference types. So the question now is if the story is any different for value types. I've previously written about the performance of value types and how they can play better with Golang's memory model than reference types. Long story short, value types make it much easier to reduce memory allocations because escape analysis in the Go compiler almost always places reference types on the heap.

Let's make this discussion more concrete by looking at some examples. First let's create some simple data structures: an interface with two implementations, one value type and one reference type.

type Array interface {
    Get(i int) uint64
    Len() int
}

type ArrayVal struct {
    elems []uint64
}

var _ Array = ArrayVal{}

func (v ArrayVal) Get(i int) uint64 {
    return v.elems[i]
}

func (v ArrayVal) Len() int {
    return len(v.elems)
}

type ArrayRef struct {
    elems []uint64
}

var _ Array = &ArrayRef{}

func (p *ArrayRef) Get(i int) uint64 {
    return p.elems[i]
}

func (p *ArrayRef) Len() int {
    return len(p.elems)
}

Next we'll implement a basic binary search algorithm (borrowed from sort.Search), first using the Array interface for dynamic polymorphism and again using the interface a type constraint for static polymorphism.

func InterfaceSearch(target uint64, arr Array) int {
	i, j := 0, arr.Len()
	for i < j {
		h := int(uint(i+j) >> 1)
		if arr.Get(h) < target {
			i = h + 1
		} else {
			j = h
		}
	}
	return i
}

func GenericSearch[T Array](target uint64, arr T) int {
	i, j := 0, arr.Len()
	for i < j {
		h := int(uint(i+j) >> 1)
		if arr.Get(h) < target {
			i = h + 1
		} else {
			j = h
		}
	}
	return i
}

Now let's add a microbenchmark to understand the differences between the data structures and algorithms. We'll start with ArrayRef and try to confirm our suspicions about generics being slow:

func init() {
    elems = make([]uint64, 500_000)
    for i := range elems {
        elems[i] = uint64(i)
    }
}

var elems []uint64

func BenchmarkReferenceSearch(b *testing.B) {
    b.Run("interface search", func(b *testing.B) {
        for idx := 0; idx < 1_000_000; idx++ {
            arr := &ArrayRef{elems: elems}
            target := uint64(idx % arr.Len())
            guess := InterfaceSearch(target, arr)
            assert.True(b, uint64(guess) == target)
        }
    })
    b.Run("generic search", func(b *testing.B) {
        for idx := 0; idx < 1_000_000; idx++ {
            arr := &ArrayRef{elems: elems}
            target := uint64(idx % arr.Len())
            guess := GenericSearch(target, arr)
            assert.True(b, uint64(guess) == target)
        }
    })
}

 % go test -bench BenchmarkReferenceSearch
goos: darwin
goarch: amd64
pkg: github.com/dolthub/dolt/go/performance/generics
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkReferenceSearch/interface_search-12            1000000000               0.1089 ns/op
BenchmarkReferenceSearch/generic_search-12              1000000000               0.1089 ns/op
PASS
ok      github.com/dolthub/dolt/go/performance/generics 2.331s

So far this makes sense. The generic function is still using dynamic dispatch under the hood, so we don't expect any performance benefits. Now let's measure ArrayVal with the same benchmark:


func BenchmarkValueSearch(b *testing.B) {
	b.Run("interface search", func(b *testing.B) {
		for idx := 0; idx < iters; idx++ {
			arr := ArrayVal{elems: elems}
			target := uint64(idx % arr.Len())
			guess := InterfaceSearch(target, arr)
			assert.True(b, uint64(guess) == target)
		}
	})
	b.Run("generic search", func(b *testing.B) {
		for idx := 0; idx < iters; idx++ {
			arr := ArrayVal{elems: elems}
			target := uint64(idx % arr.Len())
			guess := GenericSearch(target, arr)
			assert.True(b, uint64(guess) == target)
		}
	})
}

 % go test -bench BenchmarkValueSearch
goos: darwin
goarch: amd64
pkg: github.com/dolthub/dolt/go/performance/generics
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkValueSearch/interface_search-12                1000000000               0.1130 ns/op
BenchmarkValueSearch/generic_search-12                  1000000000               0.08525 ns/op
PASS

Update 2022-04-02

The initial version of these benchmarks failed to use the benchmark iteration parameter b.N. The initial metrics reported were off by a few decimal places, but they did capture the relative performance of each of these function implementations. If you're interested, you can checkout the updated version of the benchmarks on Github. Here are the updated metrics after correcting the iteration error:

% go test -bench .
goos: darwin
goarch: amd64
pkg: github.com/dolthub/dolt/go/performance/generics
cpu: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
BenchmarkSearch/interface_search_value_type-12          10129728               109.8 ns/op
BenchmarkSearch/generic_search_value_type-12            13984368                83.19 ns/op
BenchmarkSearch/interface_search_reference_type-12      10951918               110.1 ns/op
BenchmarkSearch/generic_search_reference_type-12        11169992               105.7 ns/op
PASS

The generic function implementation is faster! So what's going on, are we getting the optimized, monomorphized codegen we've been dreaming of?! Let's look at the disassembly and find out. In order to generate the stenciled versions of GenericSearch we need a helper function to invoke it with each type. Then we can use go tool compile -S to inspect the output of the generated code.

func CallGenericSearch() {
	a1 := ArrayVal{}
	GenericSearch(2, a1)
	a2 := &ArrayRef{}
	GenericSearch(2, a2)
}

For the sake of simplicity, let's look just the output corresponding to the lineif arr.Get(h) < target { from the binary search algorithm. This is where the generic function interacts with its parameterized type by calling one of the methods defined in the type constraint.

"".GenericSearch[go.shape.struct { "".elems []uint64 }_0] STEXT dupok size=299 args=0x28 locals=0x78 funcid=0x0 align=0x0
	...
	0x0092 00146 (generic_search.go:90)	MOVQ	""..dict+128(SP), R8
	0x009a 00154 (generic_search.go:90)	PCDATA	$0, $-1
	0x009a 00154 (generic_search.go:90)	MOVQ	32(R8), R8
	0x00aa 00170 (generic_search.go:90)	MOVQ	24(R8), R8
	0x00ae 00174 (generic_search.go:90)	LEAQ	""..autotmp_10+88(SP), AX
	0x00b3 00179 (generic_search.go:90)	CALL	R8
	0x00b6 00182 (generic_search.go:90)	PCDATA	$0, $-2
	0x00b6 00182 (generic_search.go:90)	MOVQ	"".target+136(SP), CX
	0x00be 00190 (generic_search.go:90)	NOP
	0x00c0 00192 (generic_search.go:90)	CMPQ	AX, CX
    ...
"".GenericSearch[go.shape.*uint8_0] STEXT dupok size=201 args=0x18 locals=0x30 funcid=0x0 align=0x0
    ...
	0x004f 00079 (generic_search.go:90)	MOVQ	""..dict+56(SP), DX
	0x0054 00084 (generic_search.go:90)	PCDATA	$0, $-1
	0x0054 00084 (generic_search.go:90)	MOVQ	32(DX), DX
	0x0064 00100 (generic_search.go:90)	MOVQ	24(DX), DX
	0x0068 00104 (generic_search.go:90)	MOVQ	"".arr+72(SP), AX
	0x006d 00109 (generic_search.go:90)	CALL	DX
	0x006f 00111 (generic_search.go:90)	PCDATA	$0, $-2
	0x006f 00111 (generic_search.go:90)	MOVQ	"".target+64(SP), CX
	0x0074 00116 (generic_search.go:90)	CMPQ	AX, CX
	...

There are some differences in how the functions' arguments are stored in the registers, but otherwise the generated code for both stencils is identical. Both versions of the function are making a call to the array to get the next element. We might have thought that function inlining was responsible for the acceleration. After all, our value type has a unique gcshape and a unique function stenciled for it GenericSearch[go.shape.struct { "".elems []uint64 }_0]. Ostensibly the compiler should know that it will only ever dispatch calls to ArrayVal.Get(), but is doesn't seem like the Go compiler knows how to make these optimizations yet.

So why is one faster than the other? Let's use a different type of compile inspection, escape analysis to find out. If we run go build -gcflags=-m ./... we can have the compiler output information about "escape analysis", or whether it chooses to place objects on the heap vs the stack. Let's use another helper function to compare both functions and both data structures:

func CallSearch() {
	v1 := ArrayVal{}
	GenericSearch(2, v1)
	r1 := &ArrayRef{}
	GenericSearch(2, r1)
	v2 := ArrayVal{}
	InterfaceSearch(2, v2)
	r2 := &ArrayRef{}
	InterfaceSearch(2, r2)
}

./generic_search.go:27:8: &ArrayRef{} escapes to heap
./generic_search.go:30:18: v2 escapes to heap
./generic_search.go:31:8: &ArrayRef{} escapes to heap

We've found our answer. Using the generic implementation of the algorithm with a value type data structure lets us keep our data on tht stack and avoid the overhead of a memory allocation. Our reference type data structure escapes to the heap regardless of the function used. This isn't strictly necessary because the reference we create here is only passed down the stack, never up the stack. However, the compiler's escape analysis fails to leverage this fact and puts it on the heap anyway. Similarly, we pass our value type data structure to the interface implementation of the function, we implicitly convert it to a reference type, and it then escapes to the heap. Only when we pass the value type to its stenciled copy of the generic function does the compiler avoid this allocation.

Wrapping Up

Static polymorphism is often thought of as a performance oriented feature. Like strong-typing, it gives the compiler with additional information, providing an opportunity to make more aggressive optimization when generating code. This is certainly true in languages like C++ that fully monomorphize their generic functions. But like the language itself, Golang's Generics don't follow the established rules about how things are supposed to be.

Golang's Generics can actually help improve performance, but again, not like you'd expect. Writing performant Go code has often meant restricting interface types to high-level constructs for exactly the reason we saw earlier: they don't play well with escape analysis and almost always cause expensive memory allocations. This leaves programmers with only concrete types and no facilities for polymorphism. Generics may be an answer to that problem.

Generics are a brand new feature in Golang and support for them is only going to improve in time. Our discussion today was entirely focused on the current implementation, but there's nothing in the language specification that keeps the current design from changing. I, for one, am excited to have my hands on them and put them into practice. If you found this post interesting, checkout out the rest of our technical series on Golang. And if you'd like to learn more about Dolt, come join us on our Discord!

Blog

Update 2022-04-02

Generics are Slow?

Monomorphization

Generics are Fast?

Update 2022-04-02

Wrapping Up

Get started with Dolt