Replicating Unexpected Behavior

REFERENCEGOLANG
7 min read

I'm an engineer at DoltHub, and for the past month, I've been working on matching Dolt's regular expression capabilities with MySQL. This is important for us, as Dolt is a drop-in replacement for MySQL, that gives you all the versioning features that you're used to from Git. Toward the end of my last blog post, I mentioned that regular expressions in MySQL don't take collations into consideration at all.

MySQL uses ICU to handle regular expressions, so one of our motivations for choosing to implement ICU in Dolt was so that we'd match MySQL. After all, a drop-in replacement should behave identically when given the same queries, and a regular expression may return different results if the underlying engines are different. This is worse than an error, as the only way to catch it is to compare the output against MySQL, which few users would do. Thankfully, this has been fixed, but the journey there was not easy.

Getting Collation Support

I'd recommend reading my last blog, as I detail our journey of adding ICU, a C/C++/Java library (icu4c & icu4j), to Go. One thing I didn't mention in that post, is that there was another alternative for adding proper regex support, and that was to write our own regex engine. At the time, this seemed attractive, and I actually started writing a regex engine at first, as the goal was to allow regular expressions to properly handle collations. Go's built-in regex engine does not have any support for collations, and it uses RE2 syntax, not to forget that it has different capabilities than ICU, so it was a non-starter. ICU seemed like it would be a challenge to implement, so writing an engine seemed like the best path forward. It wasn't until I discovered that ICU had collation support that I decided to attempt to integrate it over writing our own.

Queue the greater-than-a-month-long journey of trying to integrate ICU into go-mysql-server (GMS), which is the library that implements MySQL's functionality. At every step of the way, the largest battle was with integrating ICU's data, which is where their collations are stored. On top of the trouble, the data file was over 30MB, which would be a massive size increase to our binary. ICU provides a tool that allows one to strip out unneeded data, but I needed to get everything working first before I could worry about that step (and hope it didn't add further unexpected complications).

Eventually, I decided on compiling ICU to WebAssembly using Emscripten, with wazero's runtime providing the module execution for GMS. After getting it running, I took a look through MySQL's source code on GitHub to see how they approached their collation implementation using ICU. I eventually found where they were using ICU's regex functionality, and collations were nowhere to be seen. That...can't be right?

Doubting My Sanity

One of the initial issues that adding collations uncovered was how our regex implementation did not respond to collations. How can that happen when the code apparently shows that they're not using collations?

I went back to the documentation, as I remembered seeing a note regarding collations. I found it in their regex documentation. The relevant lines read as follows:

By default, regular expression operations use the character set and collation of the expr and pat arguments when deciding the type of a character and performing the comparison. If the arguments have different character sets or collations, coercibility rules apply as described in Section 10.8.4, “Collation Coercibility in Expressions”. Arguments may be specified with explicit collation indicators to change comparison behavior.

The terminology seems pretty clear in that collations are used in regexes. It even explicitly states that they can be used to "change comparison behavior". So the documentation states that collations change regex behavior, and I remembered observing this to be true during my exploratory phase before I began the project proper. I went back and looked at a demo query that I wrote:

SELECT REGEXP_LIKE('ss', 'ß' COLLATE utf8mb4_0900_as_ci); -- Returns 1
SELECT REGEXP_LIKE('ss', 'ß' COLLATE utf8mb4_0900_as_cs); -- Returns 0

From the above queries, it's very clear to see that the output is changing due to the collation changing. So the documentation is right, my small tests verify this, so why is MySQL's actual code not corroborating this?

Unexpected Missing Functionality

I was puzzled for quite a while, and began reading the ICU documentation regarding string search. What a coincidence that I found the letter ß in their documentation, and I finally figured out that ß and ss are the same letter at different casings. I'm only familiar with the English language, and am therefore unfamiliar with characters outside of English. If these vary in case, could it be that collations only set the case-sensitivity? That would explain why I couldn't find anything collation-related, as this check could be done before we reached the point of parsing regular expressions. To confirm this, I needed something that was not considered a different case, but evaluated to the same value in one collation but a different value in another collation. Thankfully, some of the collations that I added to GMS have a map I can look through, which makes it easy to find different characters with the same "value".

CREATE TABLE test (v1 VARCHAR(255) COLLATE utf8mb4_0900_as_ci);
INSERT INTO test VALUES ('、');
SELECT * FROM test WHERE v1 = '﹑'; -- Returns ('、')

Great, these characters are viewed as being equal to each other in this collation. Let's now check with a different collation.

CREATE TABLE test (v1 VARCHAR(255) COLLATE utf8mb4_0900_as_cs);
INSERT INTO test VALUES ('、');
SELECT * FROM test WHERE v1 = '﹑'; -- Empty return

And this collation treats them differently. Lastly, let's make sure that they're not different casings of each other.

SELECT STRCMP('、' COLLATE utf8mb4_0900_as_ci, '﹑'), STRCMP('、' COLLATE utf8mb4_0900_as_ci, UPPER('﹑')), STRCMP('、' COLLATE utf8mb4_0900_as_ci, LOWER('﹑')); -- Returns (0, 0, 0)
SELECT STRCMP('、' COLLATE utf8mb4_0900_as_cs, '﹑'), STRCMP('、' COLLATE utf8mb4_0900_as_cs, UPPER('﹑')), STRCMP('、' COLLATE utf8mb4_0900_as_cs, LOWER('﹑')); -- Returns (-1, -1, -1)

Exactly what we need. Now, to see if my case-sensitivity hypothesis is correct.1

CREATE TABLE test (v1 VARCHAR(255) COLLATE utf8mb4_0900_as_ci);
INSERT INTO test VALUES ('、');
SELECT * FROM test WHERE v1 = '﹑'; -- Returns ('、')
SELECT * FROM test WHERE v1 REGEXP '﹑'; -- Empty return

Looks like my hypothesis was correct. Just for fun, what about the non-regex LIKE filter? It's not a full regular expression implementation, but it is still a form of pattern matching. It would even be reasonable to assume that LIKE internally converts its input to a regular expression, since it's functionality is a strict subset of a regex.

SELECT * FROM test WHERE v1 LIKE '﹑'; -- Returns ('、')

Even LIKE works with collations. Only regular expressions do not work with collations.

But We're a Drop-In Replacement

I don't think anybody knows that regular expressions in MySQL don't use collations. From the documentation, it appears that even the developers don't know (or at least some of them don't). We've even gotten bug reports in Dolt stating that we're missing collation support for REGEXP_LIKE. It definitely feels wrong, but that brings up one of Dolt's missions, which is to be a drop-in replacement for MySQL. We chose MySQL as it's the most popular database, and our versioning feature set mimics Git, so that there's zero learning curve to using Dolt if you're familiar with both of them. dolt checkout, dolt branch, dolt merge, etc., but we operate on MySQL tables instead of code.

That brings up the question of expectation. If we're a drop-in replacement, does that also mean that we should replicate this unexpected behavior? Strictly speaking from a compatibility standpoint, we should, as you'll get the same results in both Dolt and MySQL, meaning you don't have to worry about slightly different results in some circumstances. From the expectation standpoint, it's likely that some users are already receiving unexpected results in MySQL, and they aren't aware due to the lack of errors being raised as the expression is valid. There have been other cases where we've deviated from MySQL, as the behavior of MySQL clearly didn't match user expectation, but I'm not sure how relevant that is here. It's also possible that nobody is actually running into this collation issue, and it's all an academic exercise.

Some decisions have no right or wrong answer, and those are often the toughest decisions to make. C++ is an example where backward-compatibility is a priority, and many measures have been taken to ensure that compatibility. And yet, sometimes, breaking changes are made. Or worse, slight alterations to the underlying logic in some functions cause slightly different results, which cause bugs that end up being extremely difficult to debug.

For now, we're choosing to replicate MySQL for this instance. If we begin to get bug reports, then it will become clear that we need to implement collation support for regular expressions. The reality is that most users will never even know (except for those that read this blog post).

Conclusion

Although I've experienced a lot of roadblocks this past month while implementing proper regular expression support, it has overall resulted in some of the most growth I've experienced as an engineer. This was a task unlike anything I've tackled before, and I know the knowledge I've gained while working on this will pay off in the future. If you've enjoyed this blog, feel free to check out some of our other blogs! To stay up to date, you can follow us on Twitter, or you can chat directly with us through Discord.


  1. REGEXP is equivalent to REGEXP_LIKE. The former syntax is reserved for filters, while the latter is the function implementation that may be used anywhere. Internally, they call the same code.

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt

Or join our mailing list to get product updates.