Three 9's of Correctness

SQL
3 min read

Dolt is a SQL database with a custom storage layer designed for Git semantics. Dolt supports fast branching, diffing, and merging, plus all of the features MySQL provides.

One of the ways we track our compatibility with MySQL are sqllogictests, a collection of about 6 million correctness tests. We reached 90% correctness in 2019, 99% correctness in 2021, and have now reached 99.9% correctness.

This blog will give some examples of how the recent features contributed to correctness.

Background

Parsing a SQL query enforces one layer of SQL language semantics. But queries that are valid SQL strings might still not be executable.

"Binding", or name resolution, is the process of applying specific database catalogs, schemas, and SQL semantics to generic abstract syntax trees (AST). A query that passes through binding should have column and table names validated, types checked, and other mandatory semantics like aggregation determinism verified.

We won't discuss binding a whole lot in this blog, but most of the correctness fixes described are sourced from improvements in our ability to resolve well-formed queries. With that note, let's get to the queries!

Join

Many subtle but simple joins used to fail in Dolt until recently:

create table a (b int);
create table c (d int);

select * from a join c c1 on b = d join c c2 on c1.d = c2.d;
ambiguous column name "d", it's present in all these tables: c1, c2

The b = d filter resolves in MySQL because only left-deep tables are available for definition lookups. In other words, when we built b = d, only tables a and c1 have been defined so far. The definition ambiguity is only present after we have built c2, at which point we have to qualify c1.d = c2.d to avoid an error.

Dolt now supports queries that depend on implicit left to right naming conventions.

In-Scope Lateral References

Lateral references are select clauses with subquery expressions that access other columns in the same SELECT clause. As with the previous example, subqueries can access fields defined to the left:

select x,y,
  (select a from ab where x = a) as a,
  (select u from uv where u = a) as u
from xy;

The "order" of resolving here is xy, a, and then u. The FROM columns are available for reference when we build a. Likewise, a has been defined by the time we build u.

Dolt first started supporting this query a year ago with some of Jason's work.

Lateral joins

Dolt now supports lateral joins. Lateral references can often be written equivalently as lateral joins:

select x, y from xy,
lateral (select a from ab where x = a) as dt1,
lateral (select u from uv where u = a) dt2;

The query above is a rewrite of the previous example, as long as dt1 and dt2 only return 1 row. dt still has access to x,y, dt2 still has access to x, y, a, and the output still includes all columns. If dt1 returns 2 values, the output will double in number because a lateral join behaves like a cross join.

LATERAL clauses make naming separation explicit, which is often more predictable as an execution plan. We are making execution more reliable over time by converting lateral subqueries to lateral joins.

Aggregation

Dolt used to panic when aggregation functions were used outside of SELECT clauses:

select sum(x+1)+1 from xy group by x having avg(x) > 1 order by 1
panic: unresolved column is a placeholder node, but Type was called

goroutine 1 [running]:
github.com/dolthub/go-mysql-server/sql/expression.(*UnresolvedColumn).Type(...)
	/Users/max-hoffman/go/pkg/mod/github.com/dolthub/go-mysql-server@v0.16.1-0.20230815232241-704a2c0709d3/sql/expression/unresolved.go:66
...

Aggs require reassembling to convert SQL clauses into well-formed query plans. The schematic below gives an idea of how aggregate functions need to be separated and grouped from standard expressions before being stitched back together again:

agg-binding

Dolt now supports most aggregation queries. We are still working on the long-tail of the most complicated sorts (see q11 on page 4 here for an example!).

Using

The last feature James recently added is USING join support.

NATURAL JOIN joins two tables with equality filters between columns of the same name, eliminating the redundant columns of the right hand side:

select * from xy t1 natural join xy t2;
+---+---+
| x | y |
+---+---+
| 1 | 1 |
| 2 | 2 |
+---+---+

USING joins do the same thing, except you can select a subset of columns for equijoining/redundancy elimination:

select * from xy t1 join xy t2 using (x,y);
+---+---+
| x | y |
+---+---+
| 1 | 1 |
| 2 | 2 |
+---+---+

select * from xy t1 join xy t2 using (x);
+---+---+---+
| x | y | y |
+---+---+---+
| 1 | 1 | 1 |
| 2 | 2 | 2 |
+---+---+---+

Summary

This blog marks three 9's of SQL correctness and discusses some of the significant classes of queries we fixed to reach this goal. All of these fixes are related to a name binding refactor, which should continue to be a source of correctness and performance wins for Dolt moving forward. Stay tuned for the next update, when we reach 4 9's of correctness!

If you have any questions about Dolt, databases, or Golang performance reach out to us on Twitter, Discord, and GitHub!

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt

Or join our mailing list to get product updates.