Common Table Expressions (WITH)

8 min read

Introduction

Dolt is Git for data. It's a SQL database that you can clone, fork, branch, merge, push and pull like a Git repository. We're committed to supporting 100% of the functionality offered by Git and 100% of the functionality offered by MySQL. Today we're excited to announce support for another piece of MySQL functionality to get us a little closer to our goal: common table expressions!

What's a common table expression?

Common table expressions (CTEs) are also known by their keyword, WITH. They're basically syntactic sugar that let you avoid copy-pasting the same subquery into a large query that uses it multiple times. So for example, if you were trying to find all combinations of cities that share a zip code and start with the letter 'A', you could write it as two subqueries joined together, like this:

select * from 
    (select name, zip from cities where lower(name) like 'a%') as a_cities1
    join (select name, zip from cities where lower(name) like 'a%') as a_cities2
    on c_cities1.zip = a_cities2.zip;

For some queries, using subqueries like this is the only feasible way to express them! For example, let's say that for some reason I only want matches from the first 100 A cities, sorted by name. I could write:

select * from 
    (select name, zip from cities where lower(name) like 'a%' order by name limit 100) as first_100_a_cities1
    join (select name, zip from cities where lower(name) like 'a%' order by name limit 100) as first_100_a_cities2
    on first_100_a_cities1.zip = first_100_a_cities2.zip;

You might think that this query is equivalent:

select cities1.name, cities2.name 
    from cities cities1 
    join cities cities2 on cities1.zip = cities2.zip
    where lower(cities1.name) like 'a%' and lower(cities2.name) like 'a%'
    order by name limit 100;

But it's not! The first query gives you cities with matching zips from among the set of the first 100 cities starting with A. The second gives you the first 100 cities starting with A that have matching zips. A subtle difference, but depending on the query it might be vital.

CTEs let you write the query so you specify the subquery exactly once, rather than repeating yourself. This is pretty important as your subqueries keep getting bigger, since you only have to edit them in one place. Here's the more complicated query written with CTEs:

with first_100_a_cities as 
    (select name, zip from cities where lower(name) like 'a%' order by name limit 100)
    select * from first_100_a_cities cities1 join first_100_a_cities cities2
    on cities1.zip = cities2.zip

That's much nicer!

Nested CTEs

You can do even more complicated things with CTEs, like define a CTE that refers to another. This is legal:

with first_100_a_cities as
    (select name, zip from cities where lower(name) like 'a%' order by name limit 100),
    first_ab_city as (select name from first_100_a_cities where name like 'ab' limit 1)
    select ...

And of course a subquery expression can define its own CTEs (that themselves can refer to outer-scoped CTEs). This is legal:

with first_100_a_cities as
    (select name, zip from cities where lower(name) like 'a%' order by name limit 100)
        select name, 
        (with first_ab_city as (select name from first_100_a_cities where name like 'ab' limit 1) 
            select name from first_ab_city)
        from first_100_a_cities join...

Implement CTEs

The simplest way to implement CTEs is (I think) by treating them as strict substitutions, kind of like the C preprocessor. But unlike the C preprocessor, we're not just going to do simple textual substitutions. We're going to substitute the entire instantiated query plan.

The first step is just parsing a CTE. They're basically the same as Subqueries, just with a different syntax to declare them. So let's declare a new With node type that contains one or more CTEs represented as subqueries, and populate it.

func ctesToWith(ctx *sql.Context, cteExprs sqlparser.TableExprs, node sql.Node) (sql.Node, error) {
	ctes := make([]*plan.CommonTableExpression, len(cteExprs))
	for i, cteExpr := range cteExprs {
		var err error
		ctes[i], err = cteExprToCte(ctx, cteExpr)
		if err != nil {
			return nil, err
		}
	}

	return plan.NewWith(node, ctes), nil
}

func cteExprToCte(ctx *sql.Context, expr sqlparser.TableExpr) (*plan.CommonTableExpression, error) {
	cte, ok := expr.(*sqlparser.CommonTableExpr)
	if !ok {
		return nil, ErrUnsupportedFeature.New(fmt.Sprintf("Unsupported type of common table expression %T", expr))
	}

	ate := cte.AliasedTableExpr
	_, ok = ate.Expr.(*sqlparser.Subquery)
	if !ok {
		return nil, ErrUnsupportedFeature.New(fmt.Sprintf("Unsupported type of common table expression %T", ate.Expr))
	}

	subquery, err := tableExprToTable(ctx, ate)
	if err != nil {
		return nil, err
	}

	columns := columnsToStrings(cte.Columns)

	return plan.NewCommonTableExpression(subquery.(*plan.SubqueryAlias), columns), nil
}

func convertSelect() {
...

	// Finally, if common table expressions were provided, wrap the top-level node in a With node to capture them
	if len(s.CommonTableExprs) > 0 {
		node, err = ctesToWith(ctx, s.CommonTableExprs, node)
		if err != nil {
			return nil, err
		}
	}
}

The idea is that if a SELECT statement defines any CTEs, we'll wrap the entire query plan in a new With node that captures those expressions.

Now we just need an analyzer rule that knows how to take a With node and do the substitution of the subquery everywhere it appears in the query. Because the With node being at the top level might interfere with other analyzer rules, we want to do this as soon as possible, before most other phases of analysis. So we add our new rule very early, even before we do things like resolve tables.

var OnceBeforeDefault = []Rule{
	{"load_stored_procedures", loadStoredProcedures},
	{"resolve_views", resolveViews},
	{"resolve_common_table_expressions", resolveCommonTableExpressions},
	{"resolve_tables", resolveTables},
	{"load_check_constraints", loadChecks},
	{"resolve_set_variables", resolveSetVariables},
	{"resolve_create_like", resolveCreateLike},
	{"resolve_subqueries", resolveSubqueries},
	{"resolve_unions", resolveUnions},
	{"resolve_describe_query", resolveDescribeQuery},
	{"check_unique_table_names", checkUniqueTableNames},
	{"validate_create_trigger", validateCreateTrigger},
	{"validate_create_procedure", validateCreateProcedure},
	{"assign_info_schema", assignInfoSchema},
}

Now for the rule itself, we just perform the substitution. The only tricky thing about this is the special case of subqueries, which means we have to do several passes.

// resolveCommonTableExpressions operates on With nodes. It replaces any matching UnresolvedTable references in the
// tree with the subqueries defined in the CTEs.
func resolveCommonTableExpressions(ctx *sql.Context, a *Analyzer, n sql.Node, scope *Scope) (sql.Node, error) {
	_, ok := n.(*plan.With)
	if !ok {
		return n, nil
	}

	return resolveCtesInNode(ctx, a, n, scope, make(map[string]sql.Node))
}

func resolveCtesInNode(ctx *sql.Context, a *Analyzer, n sql.Node, scope *Scope, ctes map[string]sql.Node) (sql.Node, error) {
	with, ok := n.(*plan.With)
	if ok {
		var err error
		n, err = stripWith(ctx, a, with, ctes)
		if err != nil {
			return nil, err
		}
	}

	// Transform in two passes: the first to catch any uses of CTEs in subquery expressions
	n, err := plan.TransformExpressionsUp(n, func(e sql.Expression) (sql.Expression, error) {
		sq, ok := e.(*plan.Subquery)
		if !ok {
			return e, nil
		}

		query, err := resolveCtesInNode(ctx, a, sq.Query, scope, ctes)
		if err != nil {
			return nil, err
		}

		return sq.WithQuery(query), nil
	})
	if err != nil {
		return nil, err
	}

	// Second pass to catch any uses of CTEs as tables, and CTEs in subqueries (caused by CTEs defined in terms of
	// other CTEs). Because we transform bottom up, CTEs that themselves contain references to other CTEs will have to
	// be resolved in multiple passes. For two CTEs, cte1 and cte2, where cte2 is defined in terms of cte1 it works like
	// this: cte2 gets replaced with the Subquery alias of its definition, which contains a reference to cte1. On the
	// second pass, that reference gets resolved to the subquery alias of cte1's definition. Then we're done.
	// We iterate until the tree stops changing, or until we hit our limit.
	var cur, prev sql.Node
	cur = n
	for i := 0; i < maxCteDepth && !nodesEqual(prev, cur); i++ {
		prev = cur
		cur, err = transformUpWithOpaque(prev, func(n sql.Node) (sql.Node, error) {
			switch n := n.(type) {
			case *plan.UnresolvedTable:
				lowerName := strings.ToLower(n.Name())
				if ctes[lowerName] != nil {
					return ctes[lowerName], nil
				}
				return n, nil
			case *plan.SubqueryAlias:
				newChild, err := resolveCtesInNode(ctx, a, n.Child, scope, ctes)
				if err != nil {
					return nil, err
				}

				return n.WithChildren(newChild)
			default:
				return n, nil
			}
		})

		if err != nil {
			return nil, err
		}
	}

	return cur, nil
}

There's a little bit of complexity hidden here around naming the columns in the schema of the CTE, but mostly this is it. Let's look at what this transformation function does to the query as represented as a tree of nodes. This example query is lifted straight out of an engine test meant to exercise the functionality above. The query:

WITH mt1 as (select i,s FROM mytable)
    SELECT mtouter.i, 
    (with mt2 as (select i,s FROM mt1) select s from mt2 where i = mtouter.i+1) 
    FROM mt1 as mtouter where mtouter.i > 1 order by 1;

The CTE named mt1 has a tree representation that looks like this:

SubqueryAlias(mt1)
    └─ Project(i, s)
        └─ UnresolvedTable(mytable)

During analysis, you can see this definition get plugged into the larger query. Any UnresolvedTable named mt1 gets replaced by the definition of the CTE. Also note that these parts of the tree are wrapped with TableAlias nodes, since every instance of a CTE must have a distinct alias in a query.

INFO: once-before/0/resolve_views:
With(mt1 AS Project(i, s)
 └─ UnresolvedTable(mytable)
)
 └─ Sort(1 ASC)
     └─ Project(mtouter.i, (With(mt2 AS Project(i, s)
         └─ UnresolvedTable(mt1)
        )
         └─ Project(s)
             └─ Filter(i = mtouter.i + 1)
                 └─ UnresolvedTable(mt2)
        ) as (with mt2 as (select i,s FROM mt1) select s from mt2 where i = mtouter.i+1))
         └─ Filter(mtouter.i > 1)
             └─ TableAlias(mtouter)
                 └─ UnresolvedTable(mt1)
INFO: once-before/0/resolve_common_table_expressions:
Sort(1 ASC)
 └─ Project(mtouter.i, (Project(s)
     └─ Filter(i = mtouter.i + 1)
         └─ SubqueryAlias(mt2)
             └─ Project(i, s)
                 └─ SubqueryAlias(mt1)
                     └─ Project(i, s)
                         └─ UnresolvedTable(mytable)
    ) as (with mt2 as (select i,s FROM mt1) select s from mt2 where i = mtouter.i+1))
     └─ Filter(mtouter.i > 1)
         └─ TableAlias(mtouter)
             └─ SubqueryAlias(mt1)
                 └─ Project(i, s)
                     └─ UnresolvedTable(mytable)

Like many extensions to the query engine, the idea is simple but the execution is riddled with minor details that are hard to get right. Tests are essential!

Always Be Testing

Conclusion

Support of common table expressions brings Dolt one step closer to our goal of being a 100% drop-in replacement for MySQL. In practice, we are already there for a surprisingly large number of use cases. But getting that last 10% of support is going to take the last 90% of the work.

Try Dolt today, or come join us on Discord to talk to us and other customers. Tell us what SQL features you want to see developed next!

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt