AI SQL Testing


We've been collaborating a lot lately with the University of California Berkeley Computer Science department. First, they published a paper about how databases need branches to support agents, which we obviously agreed with. Unfortunately, UC Berkeley missed Dolt in their research. We reached out to their team, and they are currently investigating Dolt.

Now, another group from UC Berkeley CS has published a new paper about AI SQL testing. For over a month, we've been working with the authors, Qiuyang Mang (GitHub user joyemang33) and Runyuan He (GitHub user momoway), to fix bugs identified by their testing tool, Argus. This article summarizes the paper and walks through some of the Dolt bugs their unique testing approach identified.

The Approach

I highly encourage you to read the original paper as it dives deeper than I will in this article. For a technical paper, the prose is fairly accessible.

The current state of the art in SQL logic testing is to create N equivalent queries, run them all, and make sure they all return the same results. A set of equivalent queries is called an Oracle. The example from the paper is this table setup and two SQL queries:

/* Set up */
CREATE TABLE t1(c0 BOOLEAN, c1 INT, c2 INT ...);
CREATE TABLE t2(c0 BOOLEAN, ...);
CREATE TABLE t3(c0 BOOLEAN, ...);

/* Query 1 */
SELECT t2.c0 FROM t2, t3 LEFT JOIN t1 ON [t1:BOOLEAN];

/* Query 2 */
WITH c AS (SELECT * FROM t1 WHERE [t1:BOOLEAN])
SELECT t2.c0 FROM t2 CROSS JOIN t3 CROSS JOIN c
UNION ALL
SELECT t2.c0 FROM t2 CROSS JOIN t3
WHERE NOT EXISTS (SELECT 1 FROM c);

The [t1:BOOLEAN] syntax stands for any SQL snippet that returns a Boolean value. t1 in this case is the placeholder's variable name, not to be confused with the table t1. Then, as a way to increase test coverage, you make a number of substitutions using equivalent SQL snippets. For the Boolean example above, these are snippets like true, 1 != 0, and 2 <> 1.

So the full test suite becomes:

/* Set up */
CREATE TABLE t1(c0 BOOLEAN, c1 INT, c2 INT ...);
CREATE TABLE t2(c0 BOOLEAN, ...);
CREATE TABLE t3(c0 BOOLEAN, ...);

/* Query Set 1 */
SELECT t2.c0 FROM t2, t3 LEFT JOIN t1 ON true;
SELECT t2.c0 FROM t2, t3 LEFT JOIN t1 ON 1 != 0;
SELECT t2.c0 FROM t2, t3 LEFT JOIN t1 ON 2 <> 1;

/* Query Set 2 */
WITH c AS (SELECT * FROM t1 WHERE true)
SELECT t2.c0 FROM t2 CROSS JOIN t3 CROSS JOIN c
UNION ALL
SELECT t2.c0 FROM t2 CROSS JOIN t3
WHERE NOT EXISTS (SELECT 1 FROM c);
WITH c AS (SELECT * FROM t1 WHERE 1 != 0)
SELECT t2.c0 FROM t2 CROSS JOIN t3 CROSS JOIN c
UNION ALL
SELECT t2.c0 FROM t2 CROSS JOIN t3
WHERE NOT EXISTS (SELECT 1 FROM c);
WITH c AS (SELECT * FROM t1 WHERE 2 <> 1)
SELECT t2.c0 FROM t2 CROSS JOIN t3 CROSS JOIN c
UNION ALL
SELECT t2.c0 FROM t2 CROSS JOIN t3
WHERE NOT EXISTS (SELECT 1 FROM c);

You run this SQL and assert all six test queries return the same results. If not, you have identified a logic bug.
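To make the mechanics concrete, here is a minimal sketch of the Oracle check in Python against SQLite. The table contents and the extra always-true snippet 1 = 1 are illustrative assumptions, not taken from the paper:

```python
import sqlite3

# Metamorphic "oracle" testing sketch: substitute equivalent boolean
# snippets into two supposedly equivalent query templates and assert
# that every variant returns the same rows.
cur = sqlite3.connect(":memory:").cursor()
cur.executescript("""
    CREATE TABLE t1(c0 BOOLEAN, c1 INT, c2 INT);
    CREATE TABLE t2(c0 BOOLEAN);
    CREATE TABLE t3(c0 BOOLEAN);
    INSERT INTO t1 VALUES (1, 10, 20), (0, 30, 40);
    INSERT INTO t2 VALUES (1), (0);
    INSERT INTO t3 VALUES (1);
""")

# The two queries the oracle claims are equivalent, each with a hole
# {p} for the boolean predicate.
q1 = "SELECT t2.c0 FROM t2, t3 LEFT JOIN t1 ON {p}"
q2 = """
    WITH c AS (SELECT * FROM t1 WHERE {p})
    SELECT t2.c0 FROM t2 CROSS JOIN t3 CROSS JOIN c
    UNION ALL
    SELECT t2.c0 FROM t2 CROSS JOIN t3
    WHERE NOT EXISTS (SELECT 1 FROM c)
"""

# Equivalent always-true snippets substituted into the hole.
snippets = ["1 = 1", "1 != 0", "2 <> 1"]

all_results = [
    sorted(cur.execute(q.format(p=s)).fetchall())
    for q in (q1, q2)
    for s in snippets
]
# Any disagreement among the six result sets flags a logic bug.
assert all(r == all_results[0] for r in all_results)
```

SQLite passes this particular check; pointing the same harness at another engine only requires swapping the connection.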

The problem with this testing approach is that Oracles and equivalent SQL snippets have traditionally been generated by hand. This is time-consuming and limits the number of test cases.

UC Berkeley's new paper proposes using Large Language Models (LLMs) to generate Oracle queries and equivalent SQL snippets. The paper calls the resulting tool Argus. Argus was successful in the testing done for the paper, finding new bugs in Dolt, DuckDB, MySQL, Postgres, and TiDB. Generating an Oracle costs $3 on average, and a reusable suite of 100,000 equivalent SQL snippets costs $12 to build. Moreover, the test case throughput achieved was two orders of magnitude greater than that of the traditional approach.

The Bugs

Now, let's focus on the bugs Argus found in Dolt. Bugs identified by the Argus testing process were filed against the Dolt GitHub repository by user momoway. We have a 24-hour correctness bug fix pledge, so after the first few bugs, we reached out to momoway to ask whether he was running a test suite or whether these were real queries blocking adoption. Given the nature of the queries, we suspected the former. momoway turned out to be Runyuan He of UC Berkeley, who confirmed that he was indeed running a test suite. Runyuan introduced us to Qiuyang Mang, the primary author of the paper.

Dolt bugs

Over the next two weeks, 19 bugs were submitted, and as of today we've closed all 19. The bugs contained concise reproduction steps. Argus homed in on weaknesses in Dolt's SQL analyzer, the part of the engine that resolves, validates, and optimizes queries.

Several of these bugs exposed errors in our join logic. Let's take a look at an example.

No Results for Left and Right Joins on Empty Table Subquery

Runyuan filed this bug for the following join queries, which returned no results when they should both have returned every row of table t.

CREATE TABLE t(c INT);
INSERT INTO t VALUES (1);
SELECT t.c FROM t LEFT JOIN (SELECT t.c FROM t WHERE FALSE) AS subq ON TRUE;
SELECT t.c FROM (SELECT t.c FROM t WHERE FALSE) AS subq RIGHT JOIN t ON TRUE;

This bug stemmed from the analyzer converting every join with an ON TRUE condition into a cross join. A cross join returns all possible combinations of rows from the left and right sides of a join. An inner join returns all combinations of rows that satisfy the join condition, so if the condition always evaluates to true, converting the inner join to a cross join is a sound optimization. It is not sound for left joins, where the left side must still be returned regardless of a match on the right; the same logic applies to right joins, where the right side must still be returned regardless of a match on the left. When left and right joins were converted to cross joins, rows were incorrectly filtered out whenever one side was empty (in this case, all the rows, producing an empty result).
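The difference between the two join shapes is easy to demonstrate. As a rough sketch against SQLite (which implements the correct semantics), with the portable 1 = 1 and 1 = 0 standing in for TRUE and FALSE:

```python
import sqlite3

# Reproduce the reported query shape against SQLite rather than Dolt.
cur = sqlite3.connect(":memory:").cursor()
cur.executescript("""
    CREATE TABLE t(c INT);
    INSERT INTO t VALUES (1);
""")

# The left join must keep every row of t even though subq is empty.
left_join = cur.execute(
    "SELECT t.c FROM t LEFT JOIN "
    "(SELECT t.c FROM t WHERE 1 = 0) AS subq ON 1 = 1"
).fetchall()

# A cross join with the same empty subquery returns nothing, which is
# exactly the wrong answer the ON TRUE conversion produced.
cross_join = cur.execute(
    "SELECT t.c FROM t CROSS JOIN "
    "(SELECT t.c FROM t WHERE 1 = 0) AS subq"
).fetchall()
```

Here left_join holds the row of t while cross_join is empty, so rewriting the left join as a cross join changes the query's meaning.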

Once we identified the root cause of this bug, we were able to fix it quickly. Furthermore, while reviewing this bug to write this blog post, we discovered a similar bug with full outer joins. Like left and right joins, full outer joins return all rows from both sides regardless of matches and therefore shouldn't be converted to cross joins either. Not only did Argus find bugs for us to fix, it also prompted us to uncover more bugs ourselves, further improving Dolt's correctness.

19 Bugs! Sounds Bad...

We decided to get on a call with Qiuyang because we were curious how many tests he had run to generate the 19 bugs. Here at DoltHub, we are concerned with correctness percentage; Dolt scores 100% correctness on the open sqllogictest benchmark. All we were seeing was the numerator: bugs. We also wanted to know the denominator: tests run. Did the first 100 test cases generate 19 bugs? Or was it more like 10M test cases?

Qiuyang said that at first, Argus was finding roughly one bug per 2M test cases. After we fixed the reported bugs, the rate dropped to one bug per 12M test cases. We started at 99.99995% (six nines) correct and improved to 99.999992% (seven nines) correct. Thank you Qiuyang!
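As a sanity check, the correctness rate implied by a bug rate of one per N tests is just 100 * (1 - 1/N):

```python
# Correctness percentage implied by one logic bug per N generated tests.
for n in (2_000_000, 12_000_000):
    print(f"1 bug per {n:>10,} tests -> {100 * (1 - 1 / n):.6f}% correct")
```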

We run sqllogictest on every Dolt pull request to ensure we do not regress correctness. We were interested in gathering all the test cases Argus had generated so we can run similar regression tests. Qiuyang provided us with a test suite of 2M passing tests. We have not integrated it into our testing process yet, but it is on the to-do list.

Conclusion

Thank you to Qiuyang Mang and Runyuan He for using their new AI-powered SQL testing tool, Argus, to find SQL logic bugs in Dolt. Give their paper a read to learn more. Questions about Dolt's correctness testing approach? We happily field any and all questions on our Discord. Stop by.
