Memoizing Joins

November 4, 2022

9 min read

Dolt is a relational database with Git versioning primitives. Dolt versions data where Git versions code. Most of our users are interested in transactional workloads, but we also have customers with heavier analytical queries. Joins have been the bottleneck for success here, and we have undergone two join search rewrites in response to increasing customer demand. Zach wrote the first version of join planning almost two years ago, Aaron followed up with ordering improvements and hints in 2021, and today we discuss the third iteration.

The new join search:

Doubles our join table count capacity.
Enumerates associative, commutative, right-deep, and bushy join plans to more aggressively search for an optimum.
Uses costing to globally optimize plans, as opposed to heuristics that only target pre-engineered workflows.
Adds HashJoin, SemiJoin, AntiJoin, and FullJoin operators.

We spend much of this blog talking about the data structure used to absorb the complexity, a memo, which categorizes and groups join trees based on common inputs and outputs. Grouping plans by logical properties simplifies the way we represent joins, and is a path towards making go-mysql-server’s analyzer faster and more efficient in the future.

Background

A "relational operator" is a fancy word for a simple SQL query with:

One source of input rows.
Internal logic for manipulating or filtering those rows.
An output "schema" that captures the shape of result rows.

For example, consider a table xy with columns x and y. The query SELECT * from xy where x > 0 has one input relation, xy, and we expect output rows of shape (x, y). Most queries follow this pattern, even aggregations like SELECT sum(x) from x where x > 0 that pack more logic between input and output rows.

Joins are also relational operators, but they have an arbitrary number of input sources! On top of table number we have to consider varying types of filter conditions, join types (ex: LEFT_JOIN, SEMI_JOIN), and competing physical operators (ex: NESTED_LOOP_JOIN vs HASH_JOIN). Finding the lowest "cost" join trees bottom-up sounds simple, but balloons into an O(n!) problem. In practice, we make tradeoffs to find a reasonable plan quickly.

Our latest improvement to Dolt’s join search is an expansion of our "intermediate representation" of join trees. Like a regular relational operator, we assume join orders output the same rows and columns not just for the root node, but every possible join subtree. These subtrees are divided into "expression groups", which represent a collection of equivalent plans from the outside looking in. We call the collection of expression groups a "memo".

This all sounds really complicated, and it is. The concepts are best explored through example. Without further ado, we will dive into an example to show how we use this data structure to divide join search into exploration and costing phases.

Building a memo

The first step of join planning enumerates a series of simple tree rearrangements. We consider this example query:

select * from ab
inner join uv on a = u
inner join xy on x = u;

There are three tables: ab, uv, xy; and two join "edges": a = u, x = u. The edge join types are important, but INNER_JOIN is not subject to special rearrangement rules so we will ignore that detail for now.

Lets convert the original join plan to a memo:

memo:
├── G1: (tableScan: ab)
├── G2: (tableScan: uv)
├── G3: (innerJoin 1 2)
├── G4: (tableScan: xy)
└── G5: (innerJoin 3 4)

Our three tables have their own memo groups:G1, G2 and G4. Two groups represent our join edges: G3 represents [ab,uv], and G5 is a group for [ab,uv,xy]. G5 is the root of our join tree, joining all three tables and producing the final query output.

In the original plan, G3 only contains (innerJoin 1 2): an inner join between G1 and G2, or more simply, INNER_JOIN(ab, uv), Similarly, there is only one plan in G5, the inner join between G3 ([ab,uv]) and G4 (xy). Joins always default to "left deep" trees: the right-relation terminates in a tablescan, and the left-relation expands to accommodate the rest of the tree.

The magic happens when we start populating join groups with more ways of executing those relations. For example, we can switch the join order of the seed plan in G3 because INNER_JOIN(uv, ab) is valid. We note this in the memo:

memo:
├── G1: (tableScan: ab)
├── G2: (tableScan: uv)
├── G3: (innerJoin 1 2) (innerJoin 2 1)
├── G4: (tableScan: xy)
└── G5: (innerJoin 3 4)

We can also add new expression groups. For example, the edge INNER_JOIN(?, ?, 'x = u') can be used to join xy and uv, even though it joined xy and [ab,uv] in the original plan. In memo-speak, the group G6: (innerJoin 4 2) is valid. Not a valid root, but a valid subtree of the join:

memo:
├── G1: (tableScan: ab)
├── G2: (tableScan: uv)
├── G3: (innerJoin 2 1) (innerJoin 1 2)
├── G4: (tableScan: xy)
├── G5: (innerJoin 3 4)
└── G6: (innerJoin 4 2)

We can connect our new group G6 the tree root (the complete result set), with a join to ab (G1): (innerJoin 6 1):

memo:
├── G1: (tableScan: ab)
├── G2: (tableScan: uv)
├── G3: (innerJoin 2 1) (innerJoin 1 2)
├── G4: (tableScan: xy)
├── G5: (innerJoin 3 4) (innerJoin 6 1)
└── G6: (innerJoin 4 2)

G5 now has two distinct paths for the full result set: [ab,uv]xy and [xy,uv]ab.

Exhausting the search reveals the full "core enumeration space", a maximally expanded set join groups¹, commutations², and associations³:

memo:
├── G1: (tableScan: ab)
├── G2: (tableScan: uv)
├── G3: (innerJoin 2 1) (innerJoin 1 2) (innerJoin 1 2)
├── G4: (tableScan: xy)
├── G5: (innerJoin 4 3) (innerJoin 3 4) (innerJoin 6 1) (innerJoin 1 6) (innerJoin 7 2) (innerJoin 2 7)
├── G6: (innerJoin 4 2) (innerJoin 2 4)
└── G7: (innerJoin 4 1) (innerJoin 1 4)

We simplified and omitted many details for this example⁴. Refer to the paper here for a more thorough specification of the core search space.

Exploring Physical Operators

Join exploration continues into opportunistic physical (execution) operators for each plan.

Nested loop joins

NESTED_LOOP_JOIN is the default execution operator. Small table joins are best executed as two "for loops" comparing all rows between each relation. The downside is an O(n^2) runtime; every row in the left relation scans the entire right relation. But who cares about n-squared if the tables have two rows? The overhead of constructing a better execution strategy will overwhelm the benefit.

Lookup (indexed) joins

A LOOKUP_JOIN is the preferred operator for queries that return a small set of rows using a table index.

We apply LOOKUP_JOINs when 1) the right-side of a join is a tablescan, and 2) the join condition aligns with an index on that right-hand relation. In simpler words, a lookup join is best when there is an index to instantly satisfy the join filter.

For example, G3 provides LOOKUP_JOIN alternatives for every plan from the core search space. G3 joins ab and uv, so both commutations have a tablescan on the right-relation. And both ab and uv have indexes aligning with the join condition, a = u ((a) and (u), respectively).

memo:
├── G1: (tableScan: ab)
├── G2: (tableScan: uv)
├── G3: (lookupJoin 1 2) (lookupJoin 2 1) (innerJoin 2 1) (innerJoin 1 2)
├── G4: (tableScan: xy)
├── G5: (lookupJoin 3 4) (lookupJoin 7 1) (lookupJoin 6 2) (innerJoin 4 3) (innerJoin 3 4) (innerJoin 6 2) (innerJoin 2 6) (innerJoin 7 1) (innerJoin 1 7)
├── G6: (lookupJoin 1 4) (lookupJoin 4 1) (innerJoin 4 1) (innerJoin 1 4)
└── G7: (lookupJoin 2 4) (lookupJoin 4 2) (innerJoin 4 2) (innerJoin 2 4)

Hash joins

We use hash joins when one relation is small enough to materialize in memory, and we can use the materialized index for a lookup join. The main difference between LOOKUP_JOIN and HASH_JOIN is we have to build the in memory index for every query. HASH_JOIN also hashes nested join trees.

With a HASH_JOIN we only read each relation from disk once. HASH_JOIN is asymptotically faster than NESTED_LOOP_JOIN, but the overhead of building the hash map can offset that benefit. For example, a HASH_JOIN will be strictly more expensive than a NESTED_LOOP_JOIN if the left-relation only has one row. In that scenario, the overhead of hashing never leads to a runtime benefit.

HASH_LOOKUPS work for arbitrary subtrees, so every default NESTED_LOOP join in our example has a corresponding HASH_LOOKUP⁵:

memo:
├── G1: (tableScan: ab)
├── G2: (tableScan: uv)
├── G3: (hashJoin 1 2) (hashJoin 1 2) (hashJoin 2 1) (lookupJoin 1 2) (lookupJoin 2 1) (innerJoin 2 1) (innerJoin 1 2)
├── G4: (tableScan: xy)
├── G5: (hashJoin 3 4) (hashJoin 1 7) (hashJoin 7 1) (hashJoin 2 6) (hashJoin 6 2) (hashJoin 3 4) (hashJoin 4 3) (lookupJoin 3 4) (lookupJoin 7 1) (lookupJoin 6 2)  (innerJoin 4 3) (innerJoin 3 4) (innerJoin 6 2) (innerJoin 2 6) (innerJoin 7 1) (innerJoin 1 7)
├── G6: (hashJoin 1 4) (hashJoin 4 1) (lookupJoin 1 4) (lookupJoin 4 1) (innerJoin 4 1) (innerJoin 1 4)
└── G7: (hashJoin 2 4) (hashJoin 4 2) (lookupJoin 2 4) (lookupJoin 4 2) (innerJoin 4 2) (innerJoin 2 4)

Costing

Now we have a full "forest" of join plans rooted at G5. Every plan in the root group produces the same result set, but we can only choose one to execute.

We recursively optimize the root group from tablescan leaves upwards. Consider optimizing G3, which will be invoked after trying to cost (hashJoin 3 4) in the root group. We must choose between INNER_NESTED_LOOP join, HASH_JOIN, and LOOKUP_JOIN. We compare the three physical operators between ab(rows=n) and uv(rows=m) as follows:

NESTED_LOOP_JOIN: n x m, a for loop for each table reads rows from disk for comparison.
HASH_JOIN: n + m, we read m once, load the rows into a hash table, and probe while scanning n
LOOKUP_JOIN (unique): n, scan n, and probe on-disk index to find match

There is no strict hierarchy that enforces LOOKUP_JOIN < HASH_JOIN < NESTED_LOOP_JOIN, unfortunately. The cardinality of tables, presence of filter conditions, and the lookup index all affect cost. But our rough estimates steer us away from more expensive plans.

In this case, LOOKUP_JOIN is both available and preferred for both edges. Selecting which order now comes down to table size. The cost of each lookup will be the size of the left table, so we want to make every join's left as small as possible. ab < uv <, so the cheapest plan ends up being (lookupJoin 3 4) with a cost size(ab):

memo:
├── G1: (tableScan: ab)
├── G2: (tableScan: uv)
├── G3: (lookupJoin 1 2)
├── G4: (tableScan: xy)
└── G5: (lookupJoin 3 4)

The final plan gets converted into an equivalent execution tree:

LookupJoin((xy.x = uv.u) AND (ab.a = xy.x))
 ├─ LookupJoin(ab.a = uv.u)
 │   ├─ Exchange
 │   │   └─ Table(ab)
 │   │       └─ columns: [a b]
 │   └─ IndexedTableAccess(uv)
 │       ├─ index: [uv.u]
 │       └─ columns: [u v]
 └─ IndexedTableAccess(xy)
     ├─ index: [xy.x]
     └─ columns: [x y]

Summary

In this blog, we tracked join search populating a memo group with alternative plans. After exhausting order transformations and execution operators, groups are "costed" to recursively estimate the lowest cost plan.

We followed a simple example to focus on high-level concepts, but the reordering rules give interesting emergent properties for more complicated queries. Deep joins, subqueries, non-INNER_JOIN operators, and compound filters introduce edge cases that effect plan generation and optimal join selection.

There is still a lot of improvements left to add, but we encourage you to give it a try and let us know what you find! Reach out to us on Discord, GitHub, or Twitter if you would like to chat!

For example, an expression group [ABCD] might include: [AB]x[CD], Cx[ABD], and [ABC]xD. Sub-definitions in brackets, like [AB], are themselves memo groups that can encapsulate multiple join orders (AxB and BxA). Enumerating plans for [ABCD]’s parents higher in the tree defers worrying about [ABCD]’s subtrees. A join operator, like INNER_JOIN(A=E), can connect every pair of subtrees as long as A and E are not on the same side. So this filter can join [ABCD]xE, but it can also join [ABC]x[DE] assuming there was a join condition to produce [ABC and [DE].↩
A commutative pair reverses table order. The commutative pair of select * from xy join uv on x = u is select * from uv join xy on x = u↩
An associative pair reverses edge order. The associative pair of select * from xy join (select * from uv join ab on a = u) on x = u is select * from (select * from xy join uv on x = u) join ab on a = u.↩
Two other transformations reorder edges: left and right-asscom, which apply commutativity plus associativity. Different edge types also have limited valid transformations. LEFT_JOIN, for example, can not commute its relations.↩
Subquery join leaves that reference columns from an outer scope cannot be cached, and therefore cannot be used as the hashed table in a HASH_LOOKUP.↩

Blog