Dolt is a SQL database with a custom storage layer designed for Git
semantics. Dolt supports fast branching, diffing, and merging, plus all
of the features MySQL provides.
One of the ways we track our compatibility with MySQL are
sqllogictests, a collection of about 6 million correctness tests.
90% correctness in 2019,
99% correctness in
have now reached 99.9% correctness.
This blog will give some examples of how the recent features
contributed to correctness.
Parsing a SQL query enforces one layer of SQL language semantics. But
queries that are valid SQL strings might still not be executable.
"Binding", or name resolution, is the process of applying specific
database catalogs, schemas, and SQL semantics to generic abstract
syntax trees (AST). A query that passes through binding should have column and
table names validated, types checked, and other mandatory semantics like
aggregation determinism verified.
We won't discuss binding a whole lot
in this blog, but most of the correctness fixes described are sourced
from improvements in our ability to resolve well-formed queries.
With that note, let's get to the queries!
Many subtle but simple joins used to fail in Dolt until recently:
create table a (b int);
create table c (d int);
select * from a join c c1 on b = d join c c2 on c1.d = c2.d;
ambiguous column name "d", it's present in all these tables: c1, c2
b = d filter resolves in MySQL because only left-deep
tables are available for definition lookups. In other words, when we
b = d, only tables
c1 have been defined so far. The
definition ambiguity is only present after we have built
c2, at which
point we have to qualify
c1.d = c2.d to avoid an error.
Dolt now supports queries that depend on implicit left to right
In-Scope Lateral References
Lateral references are select clauses with subquery expressions that
access other columns in the same SELECT clause. As with the previous
example, subqueries can access fields defined to the left:
(select a from ab where x = a) as a,
(select u from uv where u = a) as u
The "order" of resolving here is
a, and then
u. The FROM
columns are available for reference when we build
a has been defined by the time we build
Dolt first started supporting this query a year ago with some of
Dolt now supports lateral
Lateral references can often be written equivalently as lateral joins:
select x, y from xy,
lateral (select a from ab where x = a) as dt1,
lateral (select u from uv where u = a) dt2;
The query above is a rewrite of the previous example, as long as
dt2 only return 1 row.
dt still has access to
dt2 still has
x, y, a, and the output still includes all columns. If
returns 2 values, the output will double in number because
a lateral join behaves like a cross join.
LATERAL clauses make naming separation explicit, which is often more
predictable as an execution plan. We are making execution more reliable
over time by converting lateral subqueries to lateral joins.
Dolt used to panic when aggregation functions were used outside of SELECT
select sum(x+1)+1 from xy group by x having avg(x) > 1 order by 1
panic: unresolved column is a placeholder node, but Type was called
goroutine 1 [running]:
Aggs require reassembling to convert SQL clauses into well-formed query
plans. The schematic below gives an idea of how aggregate functions need
to be separated and grouped from standard expressions before being
stitched back together again:
Dolt now supports most aggregation queries. We are still working on the long-tail
of the most complicated sorts (see q11 on page 4
here for an example!).
The last feature James recently
added is USING join support.
NATURAL JOIN joins two tables with equality filters between columns of
the same name, eliminating the redundant columns of the right hand side:
select * from xy t1 natural join xy t2;
| x | y |
| 1 | 1 |
| 2 | 2 |
USING joins do the same thing, except you can select a subset of columns
for equijoining/redundancy elimination:
select * from xy t1 join xy t2 using (x,y);
| x | y |
| 1 | 1 |
| 2 | 2 |
select * from xy t1 join xy t2 using (x);
| x | y | y |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
This blog marks three 9's of SQL correctness and discusses some of the
significant classes of queries we fixed to reach this goal. All of these
fixes are related to a name binding refactor, which should continue to
be a source of correctness and performance wins for Dolt moving forward.
Stay tuned for the next update, when we reach 4 9's of correctness!
If you have any questions about Dolt, databases, or Golang performance
reach out to us on Twitter,