Error handling with errgroups

October 29, 2021

6 min read

Here at DoltHub, we write a lot of software in Go. We've previously blogged a little bit about error handling and errgroup in particular, but today I wanted to share two patterns regarding errgroups that came up recently in some code changes.

Be Careful with Nested Errgroups

We love errgroups so much that sometimes we nest them! We recently came across the following pattern in our code:

type MapBuilder interface {
	// Call after all entries have been added with |AddEntry|. Returns a
	// non-nil error if any error occurs.
	Wait() (Map, error)

	// Add an entry to the Map being built. Thread-safe.
	AddEntry(MapEntry) error
}

func NewMapBuilder(ctx context.Context) MapBuilder {
	eg, ctx := errgroup.WithContext(ctx)
	ch := make(chan MapEntry)
	mb := &AsyncMapBuilder{eg, ch}
	eg.Go(func() error {
		for {
			select {
			case e, ok := <-ch:
				if !ok {
					return mb.finalizeMap()
				}
				mb.addEntry(e)
			case <-ctx.Done():
				return ctx.Err()
			}
		}
	})
	return mb
	...
}

func (b *AsyncMapBuilder) Wait() (Map, error) {
	close(b.ch)
	err := b.eg.Wait()
	if err != nil {
		return nil, err
	}
	return b.map, nil
}

Internally, the MapBuilder was using an *errgroup.Group to coordinate background goroutines responsible for I/O work, in-memory sorting and buffering, etc.

MapBuilder was being used by a higher level process which was implementing a core piece of the dolt merge logic. That process, in turn, was using an *errgroup.Group:

func Merge(ctx context.Context, base, left, right Map) (value Map, conflicts Map, err error) {
	eg, ctx := errgroup.WithContext(ctx)

	leftDiff, rightDiff := make(chan DiffEntry), make(chan DiffEntry)
	eg.Go(func() error {
		defer close(leftDiff)
		// Sends diff entries to leftDiff
		return diff.DiffMaps(ctx, base, left, leftDiff)
	})
	eg.Go(func() error {
		defer close(rightDiff)
		// Sends diff entries to rightDiff
		return diff.DiffMaps(ctx, base, right, rightDiff)
	})

	conflictsBuilder := NewMapBuilder(ctx)

	eg.Go(func() error {
		var err error
		// Builds result map by applying diffs of left and right to base.
		// Returns when |leftDiff| and |rightDiff| are closed.
		// Sends conflicts to conflictsEntries
		value, err = buildMergeMapFromDiffs(ctx, base, leftDiff, rightDiff, conflictsBuilder)
		return err
	})

	err = eg.Wait()
	if err != nil {
		return
	}

	conflicts, err = conflictsBuilder.Wait()
	return
}

When I first came across a failing test associated with this code, it didn't appear glaringly wrong. But there's definitely a problem.

The code as presented has a race condition because it calls Wait() on eg before it calls Wait() on conflictsBuilder. When we Wait() on the errgroup, it will cancel its context even when every spawned go routine has returned nil. That means the subcontext that conflictsBuilder is using also gets canceled.

We need to always performs Waits on sub-work within goroutines that belong to the errgroup itself. That way the context will not get canceled until all the work associated with the subcontexts have had an opportunity to complete successfully. A simple change is just:

	eg.Go(func() (err error) {
		// Builds result map by applying diffs of left and right to base. 
		// Sends conflicts to conflictsEntries
		value, err = buildMergeMapFromDiffs(base, leftDiff, rightDiff, conflictsEntries)
+		if err == nil {
+			conflicts, err = conflictsBuilder.Wait()
+		}
		return
	})

	err = eg.Wait()
	if err != nil {
		return
	}

-	conflicts, err = conflictsBuilder.Wait()
	return
}

A good rule of thumb for an errgroup confined to a lexical context is: anything that can error within that context should return its error from within an errgroup go routine. That way there is one single place for an error to be signaled and propagated.

Canceling an errgroup

Let's say you have an iterator interface for a SQL engine:

type RowIter interface {
	Next() (Row, error)
	Close() error
}

And you want to build an asynchronous buffering implementation that will allow for some read ahead. We can do it just with errgroups with something like:

type AsyncReadAheadRowIter struct {
	iter   RowIter
	ch     <-Row
	eg     *errgroup.Group
	cancel func()
}

func (i *AsyncReadAheadRowIter) Next() (Row, error) {
	r, ok := <-i.ch
	if !ok {
		return nil, i.eg.Wait()
	}
	return r, nil
}

func (i *AsyncReadAheadRowIter) Close() error {
	i.cancel()
	// We do not return the errgroup error here.
	// Read-ahead is only calling Next(), and if it encountered an error,
	// it should be returned at the correct place in calls to Next().
	i.eg.Wait()
	return i.iter.Close()
}

func NewAsyncReadAheadRowIter(ctx context.Context, iter RowIter) *AsyncReadAheadRowIter {
	ctx, cancel := context.WithCancel(ctx)
	eg, ctx := errgroup.WithContext(ctx)
	ch := make(chan Row, ReadAheadBufferSize)
	eg.Go(func() error {
		defer close(ch)
		for {
			r, err := iter.Next()
			if err != nil {
				return err
			}
			select {
			case ch <- r:
			case <-ctx.Done():
				return ctx.Err()
			}
		}
	})
	return &AsyncReadAheadRowIter{iter, ch, eg, cancel}
}

Special care has been taken to cancel the read ahead context from within Close() so that we don't leak the goroutine attempting to write on the results channel.

If we're familiar with the errgroup implementation, we can see that we've created two subcontexts with two cancel functions one after the other in the implementation of NewAsyncReadAheadRowIter. Having a separate context for cancellation vs the errgroup context itself can be useful in some cases, but in this case it's unnecessary. We can achieve the same result by using the errgroup itself:

type AsyncReadAheadRowIter struct {
	iter   RowIter
	ch     <-Row
	eg     *errgroup.Group
-	cancel func()
}

func (i *AsyncReadAheadRowIter) Close() error {
-	i.cancel()
+	eg.Go(func() error {
+		return context.Canceled
+	})
	// We do not return the errgroup error here. 
	// Read-ahead is only calling Next(), and if it encountered an error,
	// it should be returned at the correct place in calls to Next().
	i.eg.Wait()
	return i.iter.Close()
}

func NewAsyncReadAheadRowIter(ctx context.Context, iter RowIter) *AsyncReadAheadRowIter {
-	ctx, cancel := context.WithCancel(ctx)
	eg, ctx := errgroup.WithContext(ctx)
	ch := make(chan Row, ReadAheadBufferSize)
	eg.Go(func() error {
		defer close(ch)
		for {
			r, err := iter.Next()
			if err != nil {
				return err
			}
			select {
			case ch <- r:
			case <-ctx.Done():
				return ctx.Err()
			}
		}
	})
-	return &AsyncReadAheadRowIter{iter, ch, eg, cancel}
+	return &AsyncReadAheadRowIter{iter, ch, eg}
}

In this example, this may very well not be right tradeoff — our Close() function took on a fair amount of overhead to avoid carrying around the cancel func. But the pattern of injecting an error returning function into an errgroup to achieve cancellation is a tool that can be useful sometimes.

One particular case it has come up in our code is when you have an errgroup that the code needs to be able to cancel explicitly. On the surface, just using context.WithCancel seems to achieve that, but it can be quite subtle to structure the code and all dependent goroutines in such a way that they can distinguish between the cancellation context that is actually of interest being canceled, vs. a parent of that context being canceled.

If we are using errgroups, and we actually want fast finalization and the ability to distinguish between an explicit shutdown at this layer of the software, vs. a canceled context or a deadline exceeded from above us, we can explicitly inject an error of a distinguishable value into our errgroup:

var ErrShutdownRequested = errors.New("shutdown requested")

func Shutdown(eg *errgroup.Group) {
	eg.Go(func() error {
		return ErrShutdownRequested
	})
}

func WaitIgnoringShutdown(eg *errgroup.Group) error {
	err := eg.Wait()
	if err == ErrShutdownRequested {
		return nil
	}
	return err
}

The end result is that if the ErrShutdownRequested error is the first error the errgroup sees, that's the error the errgroup returns. And the errgroup gets to see that error before it cancels its associated context, so there is no racing on something like checking a parent context's Err() and then canceling a child context. On the balance, we find this to be a reasonable way to achieve quick shutdown and distinguish between various error cases at certain places in our software.

Conclusion

We looked at two handling patterns involving errgroups in Go. One is a caveat to be on the lookout for, where asynchronous work captures or takes a dependency on the errgroup context but the result of that work is not actually finalized within the errgroup itself. Such cases should be avoided because they are inherently racey. The second was a simple pattern for achieving recognizable shutdown of an errgroup and its Context.

We hope that seeing these patterns in the example contexts above helps you recognize them and use them in the future in your go code.

Questions/comments on how to best achieve robust error handling in your go code? Join us on Discord.

Blog

Be Careful with Nested Errgroups

Canceling an errgroup

Conclusion

Get started with Dolt