Copying all of MySQL's dumbest decisions

7 min read

Dolt is Git for data, a SQL database that you can fork, clone, branch, and merge.

Those are the features you won't find in any other SQL database. But what about the features you can find in any other database? What if you need those to work in Dolt too? We are committed to making you happy, even when we have to go to crazy lengths to do so.

Dolt's goal is to replicate 100% of Git's features combined with 100% of MySQL's features. This gives us a compatibility matrix that looks roughly like this:

Git / MySQL compatibility matrix

Nobody ever said achieving greatness would be easy. We suffer because we care.

The Dolt drop-in replacement guarantee

We advertise Dolt as a 100% compatible, drop-in replacement for MySQL. And for most customers, this is actually true! They can take their existing MySQL-backed application, change their connection string to point to Dolt, and everything basically works. When they find that something doesn't just work, they tell us and we can usually get a fix out the same week.

We take the drop-in replacement guarantee very seriously and very literally, even when it means replicating some very odd behavior of MySQL. This is the story of keeping one such promise.

MySQL return schemas

Fire up your MySQL shell and try a few basic queries. Here are two that are equivalent. Note what MySQL names the result columns.

mysql> select current_user;
+--------------+
| current_user |
+--------------+
| zachmu@%     |
+--------------+
1 row in set (0.00 sec)

mysql> select CURRENT_USER();
+----------------+
| CURRENT_USER() |
+----------------+
| zachmu@%       |
+----------------+
1 row in set (0.00 sec)

Hrm, that's interesting. MySQL echoes the exact text, including case, of your select string back in the schema of the result set. And it's not just the shell doing this: if you try this over the wire you'll notice the same thing. And it gets weirder.

mysql> select CURRENT_user(   );
+-------------------+
| CURRENT_user(   ) |
+-------------------+
| zachmu@%          |
+-------------------+
1 row in set (0.00 sec)

mysql> select 1 + 1       + 2;
+-----------------+
| 1 + 1       + 2 |
+-----------------+
|               4 |
+-----------------+
1 row in set (0.01 sec)

OK, so that's kind of cute. It copies the spacing exactly too. But that behavior isn't part of any SQL standard or other spec and it's totally undocumented, so it couldn't possibly matter in production, right?

poor naive padme

The standard C# MySQL connector

We were working to onboard a prospective customer that used a curious pattern when accessing SQL result sets. They're a C# shop, and were using the standard C# client libraries for querying databases. Those libraries typically do something that looks like this:

string queryString =
    "SELECT OrderID, CustomerID FROM dbo.Orders;";

using (SqlConnection connection =
    new SqlConnection(connectionString))
{
    SqlCommand command =
        new SqlCommand(queryString, connection);
    connection.Open();

    SqlDataReader reader = command.ExecuteReader();

    // Call Read before accessing data.
    while (reader.Read())
    {
        var col1 = reader[0];
        var col2 = reader[1];
    }

    // Call Close when done reading.
    reader.Close();
}

So far so good. But our prospective customers were using a slight twist on the SqlReader interface. C# is a pretty interesting language: in addition to allowing you to treat the reader like an array and access columns by offset, you can also treat it like a Dictionary<string, object> and access those values by column name.

You can probably imagine where this is going. They were running code that looked something like this:

string queryString =
    "SELECT current_user";

using (SqlConnection connection =
    new SqlConnection(connectionString))
{
    SqlCommand command =
        new SqlCommand(queryString, connection);
    connection.Open();

    SqlDataReader reader = command.ExecuteReader();

    while (reader.Read())
    {
        var user = reader["current_user"];
        ...
    }

    // Call Close when done reading.
    reader.Close();
}

This worked fine running against MySQL, but it broke in Dolt. The version of Dolt we gave them to use at the time returned a result set column named CURRENT_USER() from the above query. Since select current_user is effectively an alias for SELECT CURRENT_USER(), that should work fine, right?

tfw it didn't work fine

Fixing the bug

We didn't know how many other potential customers were using this library or access pattern, and we didn't want to wait to find out. If we found one client relying on this behavior more or less by accident, there were surely others. The 100% drop-in replacement guarantee means that we dropped what we were doing to unblock the onboarding by fixing the bug.

Unfortunately this was not as easy as you might assume from the outside.

The first place this was broken was the parser, where we had a couple different problems. For one thing, it treated the names of many functions as keywords, and would disregard the case of the function expression outright. It looked something like this:

function_call_generic:
  MAX openb distinct_opt argument_expression_list_opt closeb
  {
    $$ = &FuncExpr{Name: "MAX", Distinct: $3 == DistinctStr, Exprs: $4}
  }

That "MAX" really needs to be $1 or else we stomp on the case. Easy enough to fix, just lots of places that needed fixing.

The next issue was capturing the lexer position in the input stream so that we could capture spacing of the original select expression, which the tokenizer helpfully and naturally removes for us, since it's not syntactically relevant. This involves introducing a couple new grammar rules and some lexer magic that's fiddly to get right but not very hard to understand conceptually.

select_expression_list:
  lexer_old_position select_expression lexer_old_position
    {
        if ae, ok := $2.(*AliasedExpr); ok {
            ae.StartParsePos = $1
            ae.EndParsePos = $3-1
        }
        $$ = SelectExprs{$2}
    }
| select_expression_list ',' lexer_old_position select_expression lexer_old_position
    {
        if ae, ok := $4.(*AliasedExpr); ok {
            ae.StartParsePos = $3
            ae.EndParsePos = $5-1
        }
    $$ = append($$, $4)

lexer_position:
  {
    $$ = yyPosition(yylex)
  }

lexer_old_position:
  {
    $$ = yyOldPosition(yylex)
  }

Why do we need both lexer_position and lexer_old_position? Beats us! Generated look-ahead LR parsers are kind of inscrutable, all we can tell you is that we need one token in some cases and the other in other cases. It will surely break some day but we have plenty of tests to catch it when it does.

Finally we just need a little bit of clunky code to extract the exact text of a select expression after a successful parse:

// For select statements, capture the verbatim select expressions from the original query text
func captureSelectExpressions(sql string, tokenizer *Tokenizer) {
	if s, ok := tokenizer.ParseTree.(SelectStatement); ok {
		s.walkSubtree(func(node SQLNode) (kontinue bool, err error) {
			if node, ok := node.(*AliasedExpr); ok && node.EndParsePos > node.StartParsePos {
				_, ok := node.Expr.(*ColName)
				if ok {
					// column names don't need any special handling to capture the input expression
					return false, nil
				} else {
					node.InputExpression = trimQuotes(strings.TrimLeft(sql[node.StartParsePos:node.EndParsePos], " \n\t"))
				}
			}
			return true, nil
		})
	}
}

It's a kludge, but it works and we have tests to prove it.

Next we just needed to fix up the parser layer in go-mysql-server to use these expressions. To keep things simple, we just wrap any select expressions in an alias of the original query text.

		if selectExprNeedsAlias(e, expr) {
			return expression.NewAlias(e.InputExpression, expr), nil
		}

...

func selectExprNeedsAlias(e *sqlparser.AliasedExpr, expr sql.Expression) bool {
	if len(e.InputExpression) == 0 {
		return false
	}

	// We want to avoid unnecessary wrapping of aliases, but not at the cost of blowing up parse time. So we examine
	// the expression tree to see if is likely to need an alias without first serializing the expression being
	// examined, which can be very expensive in memory.
	complex := false
	sql.Inspect(expr, func(expr sql.Expression) bool {
		switch expr.(type) {
		case *plan.Subquery, *expression.UnresolvedFunction, *expression.Case, *expression.InTuple, *plan.InSubquery:
			complex = true
			return false
		default:
			return true
		}
	})

	return complex || e.InputExpression != expr.String()
}

But nothing is ever easy: adding these aliases exposed a bunch of related bugs in how the engine handled aliases, which took over 1,000 lines of code changes to fix. I'm not mad. The best part of finding a bug is a customer not finding it first.

Then all that was left was fixing the many, many tests (Dolt has many thousands of SQL tests) that broke when their result set schemas changed.

The finished product

This took longer to get right than I would like to admit, but now you can run super wonky select queries in Dolt and get back the exact same result schema as MySQL.

# Welcome to the DoltSQL shell.
# Statements must be terminated with ';'.
# "exit" or "quit" (or Ctrl-D) to exit.
dolt> select Now(    );
+--------------------------------------+
| Now(    )                            |
+--------------------------------------+
| 2021-06-14 19:51:22.408681 -0700 PDT |
+--------------------------------------+
dolt> select concat(   "a", CONCAT    ("b",   "c" )  );
+-------------------------------------------+
| concat(   "a", CONCAT    ("b",   "c" )  ) |
+-------------------------------------------+
| abc                                       |
+-------------------------------------------+

Try Dolt today

Hopefully this little story has helped to convince you that when we say Dolt is a 100% compatible, drop-in replacement for MySQL, we really mean it. If you have a MySQL-backed application that could benefit from branch and merge, fork and clone, instant rollback, or any of the other version control features that you can't find in any other SQL database, why not give Dolt a shot?

Sound interesting? Come join us on Discord to say hi and let us know what you think!

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt

Or join our mailing list to get product updates.