Harnessing our SQL engine tests to run on Dolt

June 10, 2020

9 min read

Introduction

Dolt is Git for Data, and its built-in SQL engine is an open source project we recently adopted, go-mysql-server. The engine is a general-purpose SQL execution engine that lets integrators read or write to their custom data source with SQL queries by implementing a handful of golang interfaces. Both go-mysql-server and Dolt have their own suite of SQL integration tests. This blog post is about how we generalized go-mysql-server's integration tests so that they can run on Dolt (or any other integrator's database implementation).

What's a test harness?

To get go-mysql-server's tests running on Dolt, we needed to abstract a test harness that both Dolt and go-mysql-server could implement, then rewrite the tests in terms of that harness's capabilities.

"Test harness" is a relatively general term in software engineering, and it typically refers to the framework that executes test code. From wikipedia:

In software testing, a test harness or automated test framework is a collection of software and test data configured to test a program unit by running it under varying conditions and monitoring its behavior and outputs.

The kind of harness we're talking about here is somewhat more abstract than this: it's not a framework to run tests per se, but an abstraction that allows the same set of test logic to run against many different implementations of a particular problem domain. It's easiest to explain with an example.

Here's the harness that we wrote to execute the sqllogictest suite against any SQL database implementation:

// A Harness runs the queries in sqllogictest tests on an underlying SQL engine.
type Harness interface {
	// EngineStr returns the engine identifier string, used to skip tests that aren't supported on some engines. Valid
	// values include mysql, postgresql, and mssql.  See test files for other examples.
	EngineStr() string

	// Init initializes this harness to begin executing query records, beginning with a clean state for the underlying
	// database. Called once per test file before any tests are run. Harnesses are re-used between test files for runs
	// that request multiple test files, so this method should reset all relevant state.
	Init() error

	// ExecuteStatement executes a DDL / insert / update statement on the underlying engine and returns any error. Some
	// tests expect errors. Any non-nil error satisfies a test that expects an error.
	ExecuteStatement(statement string) error

	// ExecuteQuery executes the query given and returns the results in the following format:
	// schema: a schema string for the schema of the result set, with one letter per column:
	//    I for integers
	//    R for floating points
	//    T for strings
	// results: a slice of results for the query, represented as strings, one column of each row per line, in the order
	// that the underlying engine returns them. Integer values are rendered as if by printf("%d"). Floating point values
	// are rendered as if by printf("%.3f"). NULL values are rendered as "NULL".
	// err: queries are never expected to return errors, so any error returned is counted as a failure.
	// For more information, see: https://www.sqlite.org/sqllogictest/doc/trunk/about.wiki
	ExecuteQuery(statement string) (schema string, results []string, err error)
}

Anybody who wants to can implement this interface to use our test runner to evaluate their own database implementation against sqllogictest. We implemented a harness for Dolt in order to see how it does against this test standard (92% correct and climbing).

We applied this same principle to the existing tests of go-mysql-server: extract a harness interface and re-write the tests in terms of that interface. Then, write a harness implementation for each integration you want to test.

As a result, Dolt now has an additional 584 passing tests (and a few dozen skipped ones, more on that later).

Motivations: why write harnessed tests?

Writing tests this way is a lot of work. What's the point?

There are three main benefits of structuring your tests in this manner. Most of these arguments apply mainly to software like go-mysql-server, where we've defined a set of general interfaces and written the software in terms of those interfaces, that third parties are expected to then implement on their own data sources.

Correctness. Early in our process of integrating with go-mysql-server, we found a lot of bugs in the framework. Many of these bugs arose from a lack of tests, queries that just would not work. But many others were more subtle: they weren't triggered by the framework's own in-memory database implementation. They only manifested when we ran them on Dolt, due to subtle differences in implementation. In every case, after finding one of these bugs, we changed the in-memory database implementation and the tests to expose the bug. Running every new engine test we write against both database implementations dramatically increases the chance that we find bugs.
Verification. When you write a framework like go-mysql-server, there's usually no easy way for integrators to determine if their implementation is correct. They have to write their own set of tests, which might miss important cases that will only get caught later on in the integration process, or when a customer finds a bug. Writing harnessed tests means that your integrators have a built-in way to test that they've correctly implemented the interfaces you define.
Synergy. By writing harnessed tests, we've opened up the possibility of distributing the work of testing the framework to everyone else using the software, not just the core developer team. Testing can never prove the absence of bugs, only their presence. The more integrators that test their implementations against these harnessed tests, the better chance we have of finding bugs. Not only does every independent implementor increase the chance of finding a bug in the framework, but each of them now has a direct incentive to help improve the set of core tests. This is the true spirit of open-source software development. Writing tests that everyone can adopt and run themselves aligns everyone's incentives in the same direction.

But even if you aren't testing a framework that other people implement, writing harnessed tests can have benefits, by making tests easier to read and reason about. It provides a clean separation of pure test logic from implementation logic. Consider this chunk of code, where we test the queries on the engine under a variety of conditions. We express each test condition as a different harness implementation, then feed the harnesses into the same test method:

var numPartitionsVals = []int{
	1,
	testNumPartitions,
}
var indexBehaviors = []*indexBehaviorTestParams{
	{"none", nil, false},
	{"unmergableIndexes", unmergableIndexDriver, false},
	{"mergableIndexes", mergableIndexDriver, false},
	{"nativeIndexes", nil, true},
	{"nativeAndMergable", mergableIndexDriver, true},
}
var parallelVals = []int{
	1,
	2,
}

func TestQueries(t *testing.T) {
	for _, numPartitions := range numPartitionsVals {
		for _, indexInit := range indexBehaviors {
			for _, parallelism := range parallelVals {
				testName := fmt.Sprintf("partitions=%d,indexes=%v,parallelism=%v", numPartitions, indexInit.name, parallelism)
				harness := newMemoryHarness(testName, parallelism, numPartitions, indexInit.nativeIndexes, indexInit.driverInitializer)

				t.Run(testName, func(t *testing.T) {
					enginetest.TestQueries(t, harness)
				})
			}
		}
	}
}

Structuring the test this way makes it very easy to see the separation of these concerns.

Extracting a harness from the engine tests

Before starting this process, go-mysql-server's engine tests were a single 6,500 line file of test data and code. Like most software, it started in pretty reasonable shape, then slowly accreted features and cruft over time until its size and complexity became a major stumbling block for new developers trying to contribute. One of the major motivations for tackling this giant project was to bring some order to the existing engine tests, to make it more likely that other people would be able to understand and contribute to them.

So the first step was to break up the monolith into reasonable pieces. There's still work to do here, but we now have a pretty reasonable file-level separation of concerns for these tests, with the largest (the select test queries and expected results) clocking in at under 3,000 lines. Organizing these select tests into separate files based on theme and functionality is work for another day.

Next, we had to determine the common set of functionality that was being used for these tests, and how to represent it as an interface. Here's what we came up with:

// Harness provides a way for database integrators to validate their implementation against the standard set of queries
// used to develop and test the engine itself. See memory_engine_test.go for an example.
type Harness interface {
	// Parallelism returns how many parallel go routines to use when constructing an engine for test.
	Parallelism() int
	// NewDatabase returns a new sql.Database to use for a test.
	NewDatabase(name string) sql.Database
	// NewTable takes a database previously created by NewDatabase and returns a table created with the given schema.
	NewTable(db sql.Database, name string, schema sql.Schema) (sql.Table, error)
	// NewContext allows a harness to specify any sessions or context variables necessary for the proper functioning of
	// their engine implementation. Every harnessed engine test uses the context created by this method, with some
	// additional information (e.g. current DB) set uniformly. To replicated the behavior of tests during setup,
	// harnesses should generally dispatch to enginetest.NewContext(harness), rather than calling this method themselves.
	NewContext() *sql.Context
}

This is the base level of capability that integrators need to provide to be able to run the tests. They have to be able to provide a new instances of their Database implementation, and create new instances of their Table implementation on them. The test suite will handle creating tables for each test and inserting rows into them. A harness also needs to know how to create the right Context required for their database implementation to function -- things like creating the right kind of Session implementation that mirrors how you would instantiate an engine in production.

After defining the harness interface, we set about rewriting the test logic in terms of the harness, rather than using a particular database implementation directly. This was a grind to get right, but not actually difficult. It involved replacing lots of code like this:

tables["mytable"] = memory.NewPartitionedTable("mytable", sql.Schema{
        {Name: "i", Type: sql.Int64, Source: "mytable", PrimaryKey: true},
        {Name: "s", Type: sql.Text, Source: "mytable"},
    }, numPartitions)

insertRows(
    t, tables["mytable"],
    sql.NewRow(int64(1), "first row"),
    sql.NewRow(int64(2), "second row"),
    sql.NewRow(int64(3), "third row"),
)

With code that defers to the harness, like this:

table, err = harness.NewTable(myDb, "mytable", sql.Schema{
    {Name: "i", Type: sql.Int64, Source: "mytable", PrimaryKey: true},
    {Name: "s", Type: sql.Text, Source: "mytable", Comment: "column s"},
})

if err == nil {
    InsertRows(t, NewContext(harness), mustInsertableTable(t, table),
        sql.NewRow(int64(1), "first row"),
        sql.NewRow(int64(2), "second row"),
        sql.NewRow(int64(3), "third row"))
} else {
    t.Logf("Warning: could not create table %s: %s", "mytable", err)
}

Lots and lots of small transformations like this made the engine tests usable by any database implementation.

Partial passing

One wrinkle when developing a test suite for your integrators is that you want to give them an out if their implementation can't support the full capabilities of the framework. You don't want their tests to fail just because they only implemented the 95% of the functionality they needed. There are a couple ways to get this desirable effect.

The first way is simple: break the test suite into multiple discrete parts, so that integrators can choose which tests they run. For example, Dolt can't pass the CREATE TABLE tests because we don't support one of the types being created. So we skip them with a note as to why:

func TestCreateTable(t *testing.T) {
    t.Skipf("Skipping: no support for BLOB type")
    enginetest.TestCreateTable(t, newDoltHarness(t))
}

But this strategy doesn't work in all cases, most particularly in the massive set of queries and expected results. It's just not practical to make every query into its own skippable test method. For these cases, we introduce the SkippingHarness interface to let integrators choose which queries to skip:

// SkippingHarness provides a way for integrators to skip tests that are known to be broken. E.g., integrators that
// can't handle every possible SQL type.
type SkippingHarness interface {
    // SkipQueryTest returns whether to skip a test of the provided query string.
    SkipQueryTest(query string) bool
}

For Dolt, we skip a bunch of tests that we can't support yet. Over time we'll drive the number of skipped tests down, and it will be an important metric to how close we are to supporting the entire set of SQL functionality.

// Logic to skip unsupported queries
func (d *doltHarness) SkipQueryTest(query string) bool {
    lowerQuery := strings.ToLower(query)
    return strings.Contains(lowerQuery, "typestable") || // we don't support all the required types
        strings.Contains(lowerQuery, "show full columns") || // we set extra comment info
        lowerQuery == "show variables" || // we set extra variables
        strings.Contains(lowerQuery, "show create table") // we set extra comment info
}

We also define three extensions of the Harness interface for integrators that want to test particular optional parts of the framework, mostly related to indexes. The test framework will skip those particular tests unless the Harness is one of these extended interface types.

Finally, because the engine tests must test every type supported by the engine, we choose to be lenient in our error handling during test setup, trusting those errors to surface in the results of actual tests later on. For example, we ignore any errors when creating tables.

Conclusion

Writing harnessed tests is a lot of work, but the rewards can be very substantial. Dolt relies on go-mysql-server for its SQL functionality, and Dolt's success is very tightly coupled to the correctness and performance of the go-mysql-server project. So for us, automated tests that make sure that the engine works with Dolt are an obvious win. But the benefits don't stop there. We believe this testing strategy will ultimately make the go-mysql-server project much more successful, independent of Dolt itself.

Dolt is a great way to learn SQL, and also learn Git at the same time. Download Dolt today to see how satisfying it is to work with real data and not be scared of making a mistake.

Blog