Using Dolt to Find Test Regressions

6 min read

Dolt is Git for data. It's a database that lets you clone, fork, branch, merge and diff. This is a really cool technology that has a lot of uses, but today we're going to focus on just one: using Dolt SQL to find regressions in test results.

Dolt SQL performance repository

A few months ago, we blogged about getting Dolt up to one nine of SQL correctness, and how we publish the results of our sqllogictest performance in a public Dolt repository. Since then we've continued to refine our use of sqllogictest, including running a nightly job that automatically compares the result from the current tip of master with the latest release, and lets us know if any tests that used to pass broke. Let's dive into how we achieved this result.

sqllogictest refresher

sqllogictest is a set of over 6 million SQL queries and their expected results. It was originally developed by sqlite for testing sqlite, but we found that their test runner was badly out of date and didn't have the functionality we needed. So we forked the project, wrote a golang test runner, and updated the test results with MySQL 8's behavior. We wrote an extensible test harness that makes it easy for you to run the tests on your own database implementation, so check it out!

sqllogictests are defined in giant text files, with thousands of entries that look like this:

statement ok
INSERT INTO tab4 SELECT * FROM tab0

query I rowsort label-0
SELECT pk FROM tab0 WHERE col3 < 6 OR col3 < 1 AND col3 BETWEEN 2 AND 7
----
1
2
3

To track how Dolt performs against these tests, we record results into the following schema:

dolt_sql_performance> desc nightly_dolt_results;
+---------------+--------------+------+-----+---------+-------+
| Field         | Type         | Null | Key | Default | Extra |
+---------------+--------------+------+-----+---------+-------+
| test_file     | LONGTEXT     | NO   | PRI |         |       |
| line_num      | BIGINT       | NO   | PRI |         |       |
| duration      | BIGINT       | NO   |     |         |       |
| query_string  | LONGTEXT     | YES  |     |         |       |
| result        | LONGTEXT     | NO   |     |         |       |
| error_message | LONGTEXT     | YES  |     |         |       |
| version       | VARCHAR(255) | NO   |     |         |       |
+---------------+--------------+------+-----+---------+-------+

As you can see, the results table is keyed by the name of the test file and line of the test query, and the rest of the columns record the results of that test statement.

Here's a sample row for a passing test:

dolt_sql_performance> select * from nightly_dolt_results where result = 'ok' limit 1;
+-------------------+----------+----------+-----------------+--------+---------------+------------------------------------------+
| test_file         | line_num | duration | query_string    | result | error_message | version                                  |
+-------------------+----------+----------+-----------------+--------+---------------+------------------------------------------+
| evidence/in1.test | 30       | 0        | SELECT 1 IN (2) | ok     | <NULL>        | ab45d7febf3c27c975e9e7c8a78a83d6fd827756 |
+-------------------+----------+----------+-----------------+--------+---------------+------------------------------------------+

And here's a sample row for a failing test:

dolt_sql_performance> select * from nightly_dolt_results where result = 'not ok' limit 1;
+-------------------+----------+----------+-----------------------------------+--------+------------------------------------------------------+------------------------------------------+
| test_file         | line_num | duration | query_string                      | result | error_message                                        | version                                  |
+-------------------+----------+----------+-----------------------------------+--------+------------------------------------------------------+------------------------------------------+
| evidence/in1.test | 99       | 0        | SELECT null IN (SELECT * FROM t1) | not ok | Incorrect result at position 0. Expected 0, got NULL | ab45d7febf3c27c975e9e7c8a78a83d6fd827756 |
+-------------------+----------+----------+-----------------------------------+--------+------------------------------------------------------+------------------------------------------+

Analyzing regressions in the nightly release

Storing test results in a queryable form is powerful, because you can write arbitrary queries to analyze what's changed. We recently had a correctness regression and needed to figure out what had gone wrong. So we just cloned the repository and ran a query:

dolt_sql_performance> select nightly.query_string, nightly.error_message 
                   -> from nightly_dolt_results nightly join releases_dolt_results release 
                   -> on nightly.test_file = release.test_file and nightly.line_num = releases.line_num 
                   -> where nightly.result = 'not ok' and release.result='ok' limit 1;
+-------------------------------------------------------------------------------------------------------------------------------+---------------------------------------+
| query_string                                                                                                                  | error_message                         |
+-------------------------------------------------------------------------------------------------------------------------------+---------------------------------------+
| SELECT * FROM tab0 WHERE - + 63 IN ( col3, - col3 * col1 + + ( + + col4 ), 67 * col3 + - + CAST( - col1 AS DECIMAL ) + col0 ) | Unexpected error invalid type: string |
+-------------------------------------------------------------------------------------------------------------------------------+---------------------------------------+

This is suggestive of what the error is. But I know there were over a thousand tests that started failing from this regression, and I'd like to examine all of them. So to analyze their various causes, I'll rerun the query, but group by error message:

dolt_sql_performance> select nightly.error_message, count(*) 
                   -> from nightly_dolt_results nightly join releases_dolt_results release 
                   -> on nightly.test_file = release.test_file and nightly.line_num = release.line_num 
                   -> where nightly.result = 'not ok' and release.result='ok' 
                   -> group by 1 order by 2 desc limit 10;
+-----------------------------------------------------------------+----------+
| error_message                                                   | COUNT(*) |
+-----------------------------------------------------------------+----------+
| Unexpected error invalid type: string                           | 1878     |
| Incorrect result at position 0. Expected 1, got 0               | 6        |
| Incorrect result at position 0. Expected 3, got 0               | 6        |
| Incorrect result at position 0. Expected -1, got 0              | 4        |
| Incorrect result at position 0. Expected -3, got 0              | 3        |
| Incorrect result at position 0. Expected -61, got -62           | 1        |
| Incorrect result at position 0. Expected -8, got -9             | 1        |
| Incorrect result at position 0. Expected 150, got 3             | 1        |
| Incorrect result at position 1. Expected 3, got 0               | 1        |
| Incorrect result at position 0. Expected -468.000, got -0.000   | 1        |
+-----------------------------------------------------------------+----------+

With this query, it's easy to see that nearly all the regressed tests share a single cause. Digging a little deeper reveals that all the newly failing test queries involve the DECIMAL type, which we added better support for in our last release. So with just two queries, we have achieved high confidence of the root cause of the regression.

Verifying development results with Dolt diff

Storing test results in a queryable form is cool enough on its own. But Dolt's unique capabilities as a diffable database with versions makes this pattern especially powerful.

Earlier this week I made some pretty extensive changes to our MySQL engine, and I wanted to know if I broke anything. So I re-ran the logictest suite against my local build, and exported the results into a JSON file. Then I updated my local clone of the performance repository with the new results:

% dolt table import -u releases_dolt_results ./dolt.json

This command updates the table named with data from the file given, updating my working set:

% dolt status
On branch regressions
Changes not staged for commit:
(use "dolt add <table>" to update what will be committed)
(use "dolt checkout <table>" to discard changes in working directory)
    modified:       releases_dolt_results

To see what's changed, I can query the dolt_diff system table for the updated table like so:

dolt_sql_performance> select to_test_file, to_line_num, to_query_string, to_error_message 
                   -> from dolt_diff_releases_dolt_results 
                   -> where from_result = 'ok' && to_result = 'not ok';

This will give me all updated rows where the previous commit had an ok result and the working set has a not ok result. (There weren't any).

Conclusion

We hope you'll agree that using Dolt to store and analyze test results is a powerful tool for any software project. It lets you analyze results and regressions just by writing plain SQL queries: no need to build a dashboard or write custom log scraping logic. Download Dolt today to try it out yourself!

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt