Testing Golang Code with Toxiproxy


Here at DoltHub, we use Golang to build Dolt DB, the world's first version-controlled, SQL relational database – it gives you the power and expressiveness of a SQL relational database, combined with all the versioning features you love from Git. Every three weeks we post about Golang in our blog, often talking about tips or tricks with Golang, new Golang features we're exploring, or interesting side projects we make, like SwissMap. In today's post, we're highlighting a pretty indispensable testing tool that we enjoy using to test various network resiliency cases that would otherwise be very difficult to test.

Toxiproxy

Toxiproxy is "a TCP proxy to simulate network and system conditions for chaos and resiliency testing". It's a stable and reliable testing tool that's been around a while. Toxiproxy really shines in helping you test your code under failure conditions such as when a network connection dies unexpectedly or when the available bandwidth or latency of a network connection changes unexpectedly. Toxiproxy makes it really easy, and honestly pretty fun, to test these cases.

Hello Toxic World!

Let's start by looking at a very simple example to take Toxiproxy for a spin. We've got a Golang client application that opens up a connection to a remote endpoint and downloads a message. Our test runs the client and the "remote" endpoint locally, but we want to test what happens when there is a high amount of latency in the connection and make sure our client can still handle that. Toxiproxy makes this scenario really easy to test.

Here's what our test setup looks like before we add in Toxiproxy:

[Figure: Simple test setup for TestClientDownload]

Here's the test function:

func TestClientDownload(t *testing.T) {
    startTestRemoteEndpoint(t)

    message, err := DownloadMessage()
    require.NoError(t, err)
    require.Equal(t, "Hello, this is a test response from the server!\n", message)
}

The basic concept with Toxiproxy is that you create a proxy that sits between two TCP connected ends of an application and configure your code to talk to that proxy instead of the original endpoint. By default, that proxy will pass through all the data from either direction – the "downstream" direction represents data flowing from the remote endpoint, through the proxy, and into your application, while the "upstream" direction represents data flowing from your application, through the proxy, and into the remote endpoint. Things get interesting when you start adding "toxics" to your proxy. Toxics are the chaos makers that add additional latency, or stop data from flowing, or limit the connection's bandwidth. Toxiproxy provides many different toxics, including simulating some tricky networking conditions such as TCP resets and slow connection closes.

Here's what our setup looks like when we add in Toxiproxy:

[Figure: Toxiproxy setup for TestClientDownload]

To write this test with Toxiproxy, the first thing we need to do is pull in Toxiproxy as a dependency:

go get github.com/Shopify/toxiproxy/v2

Next we start up the Toxiproxy service. Note that this service is different from the actual proxies that Toxiproxy spins up: the service is what controls and manages all the individual TCP proxies. So, no matter how many proxies you want to set up for your test, you always need to start the Toxiproxy service first. I put all of this code into a new function, called startToxiproxyService():

func startToxiproxyService(t *testing.T) {
    // Start up the Toxiproxy service that manages and controls the proxies
    metrics := toxiproxy.NewMetricsContainer(prometheus.NewRegistry())
    toxiproxyService := toxiproxy.NewServer(metrics, zerolog.Nop())
    listenErr := make(chan error, 1)
    go func() {
        listenErr <- toxiproxyService.Listen(fmt.Sprintf("localhost:%d", toxiproxyServicePort))
    }()

    // Listen blocks while the service runs, so an early return means startup failed.
    // Receiving the error over a channel avoids a data race on a shared err variable.
    select {
    case err := <-listenErr:
        require.NoError(t, err)
    case <-time.After(500 * time.Millisecond):
    }
    fmt.Printf("Toxiproxy service running on port %d \n", toxiproxyServicePort)

    // Initialize the client that lets us talk to the Toxiproxy service
    toxiClient = toxiproxyclient.NewClient(fmt.Sprintf("localhost:%d", toxiproxyServicePort))
}

After starting up the Toxiproxy service, we use it to create a TCP proxy that sits between our client and the remote endpoint:

func startTestProxy(t *testing.T) {
	proxy, err := toxiClient.CreateProxy("web",
		fmt.Sprintf("localhost:%d", toxiproxyProxyPort), // downstream
		fmt.Sprintf("localhost:%d", remoteEndpointPort)) // upstream
	if err != nil {
		panic(fmt.Sprintf("unable to create toxiproxy: %v", err.Error()))
	}
	fmt.Printf("Toxiproxy proxy started on port %d \n", toxiproxyProxyPort)

	// Configure a toxic on the proxy that adds latency
	_, err = proxy.AddToxic("high latency", "latency", "downstream", 1, toxiproxyclient.Attributes{"latency": 5_000})
	require.NoError(t, err)
}

This function creates a TCP proxy that listens on port toxiproxyProxyPort (the downstream endpoint) and connects to port remoteEndpointPort (the upstream endpoint). After we create the proxy, we call proxy.AddToxic to change its behavior. In this case, we're adding a latency toxic named "high latency" to the downstream connection. The latency attribute is in milliseconds, so 5_000 adds 5 seconds of latency, and the toxicity argument of 1 applies the toxic to every connection that goes through the proxy.
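Under the hood, the client talks to the Toxiproxy service over its HTTP API, and adding a toxic boils down to posting a small JSON document. The sketch below builds a body with the field names from the Toxiproxy HTTP API documentation (name, type, stream, toxicity, attributes). The toxicRequest struct and latencyToxicBody function are hypothetical stand-ins for illustration, not the client library's own types, so treat the exact shape as an approximation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// toxicRequest is a hypothetical struct mirroring the toxic fields described
// in the Toxiproxy HTTP API docs. It is not the client library's type.
type toxicRequest struct {
	Name       string                 `json:"name"`
	Type       string                 `json:"type"`
	Stream     string                 `json:"stream"`
	Toxicity   float32                `json:"toxicity"`
	Attributes map[string]interface{} `json:"attributes"`
}

// latencyToxicBody builds the JSON body for a downstream latency toxic that
// applies to every connection (toxicity 1).
func latencyToxicBody(name string, latencyMillis int) (string, error) {
	b, err := json.Marshal(toxicRequest{
		Name:       name,
		Type:       "latency",
		Stream:     "downstream",
		Toxicity:   1,
		Attributes: map[string]interface{}{"latency": latencyMillis},
	})
	return string(b), err
}

func main() {
	body, err := latencyToxicBody("high latency", 5000)
	if err != nil {
		panic(err)
	}
	fmt.Println(body)
	// prints {"name":"high latency","type":"latency","stream":"downstream","toxicity":1,"attributes":{"latency":5000}}
}
```

In practice there's no need to build this by hand, since proxy.AddToxic does it for us, but it helps to know that the service is just an HTTP endpoint you can also poke at with other tools.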

Here's what our new test function looks like:

func TestClientDownloadWithToxiproxy(t *testing.T) {
	startTestRemoteEndpoint(t)
	startToxiproxyService(t)
	startTestProxy(t)

	// Tell our Downloader to connect to the proxy instead of the remote endpoint
	OverridePort(toxiproxyProxyPort)

	// Try to download the message
	message, err := DownloadMessage()
	require.NoError(t, err)
	require.Equal(t, "Hello, this is a test response from the server!\n", message)
}

Notice that after we've set up the proxy, we need to call OverridePort to specify the port where our proxy is running. This is typical with Toxiproxy usage – when you insert a proxy between two network endpoints, you need to tell the client (i.e. the side creating the connection) to connect to the proxy instead of the original endpoint. After that, our test code is the same as before. When we run this code, the test passes and, as expected, takes about 5 seconds longer to run!

A Real World Example

Now that we've seen a simple example of Toxiproxy usage, let's take a look at a real-world example of how we use Toxiproxy in our codebase.

One of the places where we use Toxiproxy is in our tests for MySQL replication. When Dolt is set up as a MySQL replica, it opens a connection to a MySQL primary server, and consumes binlog events. As part of that work, we need to ensure that Dolt will correctly handle the connection dropping and will properly reconnect to the server to continue reading replication events. It's a scenario we don't expect to happen very often, but when it does happen, it's critical that Dolt can recover properly and restart the replication event stream, without messing up any data.

Let's dig into how we test this case. You can find the full test implementation in the TestBinlogReplicationAutoReconnect function. We'll walk through that code and break it down into smaller bites as we explain what's happening.

Start the Toxiproxy Service and Create a Proxy

As in the first example, we need to get our Toxiproxy service running before we do anything else. Let's look at a couple of test helper functions that configure our test environment. The configureToxiProxy() function starts the Toxiproxy service and creates a proxy that sits between a MySQL primary server and a Dolt replica server.

func configureToxiProxy(t *testing.T) {
	toxiproxyPort := findFreePort()

	metrics := toxiproxy.NewMetricsContainer(prometheus.NewRegistry())
	toxiproxyServer := toxiproxy.NewServer(metrics, zerolog.Nop())
	go func() {
		toxiproxyServer.Listen("localhost", strconv.Itoa(toxiproxyPort))
	}()
	time.Sleep(500 * time.Millisecond)
	t.Logf("Toxiproxy control plane running on port %d", toxiproxyPort)

	toxiClient = toxiproxyclient.NewClient(fmt.Sprintf("localhost:%d", toxiproxyPort))

	proxyPort = findFreePort()
	var err error
	mysqlProxy, err = toxiClient.CreateProxy("mysql",
		fmt.Sprintf("localhost:%d", proxyPort), // downstream
		fmt.Sprintf("localhost:%d", mySqlPort)) // upstream
	if err != nil {
		panic(fmt.Sprintf("unable to create toxiproxy: %v", err.Error()))
	}
	t.Logf("Toxiproxy proxy started on port %d", proxyPort)
}

Just like in the previous example, the first thing we need to do is launch the Toxiproxy service with the toxiproxy.NewServer() function. Remember that the Toxiproxy service runs on a local port and is what we communicate with in order to start, modify, or stop any Toxiproxy proxies. We instantiate a client to talk to that service using the toxiproxyclient.NewClient() function, and then we use that client to talk to the Toxiproxy service and create a proxy that will sit between our MySQL primary server and our Dolt replica server.
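The configureToxiProxy() function relies on a findFreePort() helper that isn't shown in this post. A common way to write such a helper is to bind to port 0 and let the operating system pick a port. The version below is a sketch under that assumption, not necessarily what our codebase does, and unlike the helper used above it also returns an error.

```go
package main

import (
	"fmt"
	"net"
)

// findFreePort asks the OS for an unused TCP port by binding to port 0 and
// then releasing it. Another process could grab the port before the caller
// uses it, so this is best-effort and intended for tests.
func findFreePort() (int, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer ln.Close()
	return ln.Addr().(*net.TCPAddr).Port, nil
}

func main() {
	port, err := findFreePort()
	if err != nil {
		panic(err)
	}
	fmt.Printf("free port found: %v\n", port > 0)
}
```

Picking ports dynamically like this lets several test runs share a machine without colliding on hard-coded port numbers.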

The other helper function that we'll need shortly is called turnOnLimitDataToxic(t):

// turnOnLimitDataToxic adds a limit_data toxic to the active Toxiproxy, which prevents more than 1KB of data
// from being sent from the primary through the proxy to the replica. Callers MUST call configureToxiProxy
// before calling this function.
func turnOnLimitDataToxic(t *testing.T) {
	require.NotNil(t, mysqlProxy)
	_, err := mysqlProxy.AddToxic("limit_data", "limit_data", "downstream", 1.0, toxiproxyclient.Attributes{
		"bytes": 1_000,
	})
	require.NoError(t, err)
	t.Logf("Toxiproxy proxy with limit_data toxic (1KB) started on port %d", proxyPort)
}

This function uses the mysqlProxy instance we created in the configureToxiProxy() function, and adds the limit_data toxic to it. This toxic closes a connection after a set amount of data has been passed through it. In our case, we've configured it to kick in after 1KB of data has been sent on the downstream side of the connection. In other words, once the MySQL server has sent 1KB of data to the Dolt replica, the limit_data toxic will kick in and break the connection.

Create the Test Environment

Let's look at the first few lines of the TestBinlogReplicationAutoReconnect test function and see how we put those together to test our MySQL replication code:

// TestBinlogReplicationAutoReconnect tests that the replica's connection to the primary is correctly
// reestablished if it drops.
func TestBinlogReplicationAutoReconnect(t *testing.T) {
	defer teardown(t)
	startSqlServers(t)
	configureToxiProxy(t)
	startReplication(t, proxyPort)

This code launches the MySQL primary SQL server and the Dolt replica SQL server processes, then launches the Toxiproxy service. After that, it uses the Toxiproxy client to create a proxy that sits between the MySQL primary server and the Dolt replica server. Notice that we talk through the Toxiproxy service in order to manage the individual TCP proxy instances. Here's what our test environment looks like at this point:

[Figure: Toxiproxy setup for TestBinlogReplicationAutoReconnect]

Test Connection Reestablishment

Now that we've got our test environment set up with a Toxiproxy proxy inserted into the middle of the communication between the two SQL servers, we can start actually testing the behavior we want to test.

Let's go back to TestBinlogReplicationAutoReconnect and look at the meat of our test code:

	// Get the replica started up and ensure it's in sync with the primary before turning on the limit_data toxic
	testInitialReplicaStatus(t)
	primaryDatabase.MustExec("create table reconnect_test(pk int primary key, c1 varchar(255));")
	waitForReplicaToCatchUp(t)
	turnOnLimitDataToxic(t)

	// Send 1k inserts to the primary database; the LimitData toxic will kick in during this and prevent the
	// replica from receiving all the data.
	for i := 0; i < 1000; i++ {
		value := "foobarbazbashfoobarbazbashfoobarbazbashfoobarbazbashfoobarbazbash"
		primaryDatabase.MustExec(fmt.Sprintf("insert into reconnect_test values (%v, %q)", i, value))
	}

	// Remove the limit_data toxic so that a connection can be reestablished
	err := mysqlProxy.RemoveToxic("limit_data")
	require.NoError(t, err)
	t.Logf("Toxiproxy proxy limit_data toxic removed")

	// Assert that all records get written to the table
	waitForReplicaToCatchUp(t)

	// Test data consistency...
	// ...
}

To start, this code asserts that the replication connection is set up successfully (i.e. testInitialReplicaStatus(), waitForReplicaToCatchUp()). It gets more interesting as we turn on the limit_data toxic with the turnOnLimitDataToxic(t) function. Before we turn on that toxic, all the data passes through our proxy without any modification. Once the limit_data toxic is enabled, it will eventually stop the flow of data and trigger a network error for the Dolt replica. We send 1,000 insert statements to the primary server, each carrying a 65-character string, so the replicated data far exceeds the 1KB limit and the toxic kicks in while those inserts are replicating. After we've sent all those inserts, we remove the limit_data toxic from our proxy and let the Dolt replica reconnect and catch up to the primary server.

There was a bit of setup to configure this test environment, but overall, Toxiproxy makes it really easy to test what would otherwise be a difficult case to test. Plus, once we have this test environment set up, we can reuse it to test all sorts of different connection-related issues. Knowing that this code can correctly handle connection issues is a big win, and it was a pretty fun and satisfying test to write, too!

Conclusion

Toxiproxy is a super handy testing tool that is still as useful today as it was when it was initially released almost ten years ago. Kudos to Shopify for creating it, sharing it, and maintaining it! 🙏 We use Toxiproxy in a few places already, and we're working on a few projects where we plan to use it more. For example, we're adding support for a Dolt SQL server to act as a MySQL primary and stream replication events so tools like Debezium can watch those feeds for data changes. We're also adding support for Doltgres to consume replication events from a PostgreSQL primary server. Both of those are places where we'll use Toxiproxy to ensure those rare but inevitable network problems are handled correctly.

Have you used Toxiproxy before? What other testing tools have you found to be indispensable? Come join our Discord and let us know! Our dev team hangs out on Discord during the work day, and we love talking about Golang, databases, and tools. Come say hi!
