Finally Adding Collations

SQL
10 min read

Here at DoltHub, our centerpiece is Dolt, which fuses a MySQL-compatible database with Git-style versioning capabilities. After you install Dolt, all it takes are a few commands to have a running server:

mkdir demo
cd demo
dolt init
dolt sql-server

Your MySQL-backed applications will require no change, as we aim to be completely compatible with MySQL, which includes all of MySQL's unique quirks (and features). However, you now have access to the full power that versioning provides, and as Git is the most popular versioning software ever, you can apply that same knowledge to Dolt; there is practically zero learning curve. But with Dolt, you're versioning your data rather than your source code. Want to run tests against production without risk? Make a new branch off of production and test without fear. Curious to know who made those changes to that one table? dolt blame has you covered. Used to MySQL's implementation of character sets and collations for strings? Dolt even supports this, however getting this in has been a journey.

Implementing MySQL

I mentioned that Dolt is MySQL-compatible, and that's because we're not a wrapper around MySQL. We're completely re-implementing MySQL's feature set over a custom storage engine, which presents a lot of interesting problems to solve. The MySQL documentation is a great resource for learning how to use MySQL, but it does not contain any information on how to implement MySQL. Often times this isn't an issue, as it is fairly clear on what the observed behavior should be. Take indexes as an example. When they're used correctly, statements execute quicker. Worst-case scenario is that they're not used, and statements are slower.

Character sets and collations are a bit different. As a very brief summary, character sets define how to interpret a sequence of bytes as a sequence of characters, and collations define how characters sort in relation to each other. For example, saying that the sequence of bytes [1,2] is equivalent to the string "ab" is the job of a character set, and a collation states that "a" comes before "b" when sorting.

CREATE TABLE case_sensitive (col VARCHAR(20) COLLATE utf8mb4_0900_bin);
CREATE TABLE case_insensitive (col VARCHAR(20) COLLATE utf8mb4_0900_ai_ci);
INSERT INTO case_sensitive VALUES ('A'), ('B'), ('a'), ('b');
INSERT INTO case_insensitive VALUES ('A'), ('B'), ('a'), ('b');

The above creates two tables, one that sorts different casings of the same letter as different characters (case-sensitive), and another that only cares about the letters and not the casing (case-insensitive). Although they both have the same data and we run the same queries, they return different results.

SELECT * FROM case_sensitive WHERE col = 'A';
+-----+
| col |
+-----+
| A   |
+-----+

SELECT * FROM case_insensitive WHERE col = 'A';
+-----+
| col |
+-----+
| A   |
| a   |
+-----+

SELECT * FROM case_sensitive WHERE col < 'b';
+-----+
| col |
+-----+
| a   |
| A   |
| B   |
+-----+

SELECT * FROM case_insensitive WHERE col < 'b';
+-----+
| col |
+-----+
| A   |
| a   |
+-----+

The case-sensitive collation sorts uppercase letters before lowercase letters, which is why the col < 'b' filter still returns an uppercase 'B'.

The stakes are also completely different compared to indexes. In the best-case scenario, people don't really notice or care that they exist. In the worst-case scenario, data may be corrupted and queries may return the wrong results (which may or may not be caught!). Data corruption is unacceptable, and incorrect results can lead to catastrophic circumstances, so it is imperative that we get character sets and collations right.

First Contact

This leads us back to December 11, 2019, where I first added parsing support for character sets and collations. At the time, we only truly supported the utf8mb4 character set, along with the utf8mb4_0900_bin collation, as they match up perfectly with Go's internal string encoding and sort order. It was an effective stop-gap though, as many tools threw collations into statements, but didn't really make use of them. Of course we needed to add actual support for them, but there are 42 character sets and 272 collations. Not only does it sound like an incredible amount of work, but we didn't even know where to start. Going back to the documentation, it lists all of the rules for using character sets and collations, but states nothing regarding any internal details.

So it was put on hold. Whenever we were deciding the next bit of functionality to add, character sets and collations always came up, but as the project was so large, it kept getting pushed back. Fast-forward nearly 3 years and Dolt has grown so much that we can't delay it any further. Items that used to be higher priority have been finished, and now character sets and collations are next-in-line. I knew it was going to be a challenge, and yet I still underestimated it.

How It Was Done

When starting, I already knew that the documentation wasn't going to give me any direct answers. This left me with one other alternative, which was to get the data from the MySQL program itself. Thankfully, MySQL is open-source, so my first stop was their GitHub. Perhaps I could get some clues on what they referenced to implement character sets and collations, but I was only greeted by arrays of hexadecimal numbers spanning thousands of lines. Now, that is how they implement these things, but it's practically unparsable just by looking at the code. With enough time and effort, I could break down the arrays into something I could understand, but I reckon it would require study of the entire repository. So, I tried thinking of a different way, and I came up with a potential solution.

Go implements the full Unicode standard, so if I could find a way to convert every Unicode character to a MySQL character in a character set, I could potentially map out every character set. Thankfully, SQL is a powerful language, so I was confident that I could find some query to give me the information I needed. First, I wanted to get an idea of how Go's Unicode implementation compares to MySQL's Unicode implementation, so I came up with the following query:

SELECT CAST(CONVERT(_utf8mb4 0x____ USING utf8mb4) AS BINARY);

I knew that Go's Unicode implementation was based on UTF8 4-byte, which matched MySQL's utf8mb4 character set, but I didn't know if they were exactly the same. By converting each Go character to a hexadecimal and returning the MySQL result as a binary string, I could compare the byte representations of both Go and MySQL. Then, by looping over every valid UTF8 character in Go, I was able to confirm that they both have the same binary representation of their characters; in MySQL terms, they have the same character set. One limitation of this approach is that MySQL may support more Unicode characters than Go, but with over 1.1 million characters in Go already, I don't think this is very likely.

Next, I wanted to confirm that Go's sort priority matched that aforementioned utf8mb4_0900_bin collation, as up to that point we didn't have any hard evidence proving it to be true. For Go, each successive character sorts after the previous, so I used MySQL's STRCMP to see if this held true for the assumed collation.

SELECT STRCMP(CONVERT(_utf8mb4 0x____ USING utf8mb4) COLLATE utf8mb4_0900_bin, CONVERT(_utf8mb4 0x____ USING utf8mb4) COLLATE utf8mb4_0900_bin);

It turns out it does!

With confirmation that Go's internal string encoding and sort order matched two of MySQL's options, I was able to finally craft a query that would give me the byte representation of any character in any character set:

SELECT CAST(CONVERT(_utf8mb4 0x____ USING ____) AS BINARY);

You'll notice that this is the same query as earlier, except that the character set after USING will change to whatever character set I'm interested in. There were a few other things that I had to consider, such as not all character sets implementing the full Unicode standard. Some implement only the ASCII range, some implement the extended ASCII range, and others implement ASCII + some subsection of Unicode.

With this, I'm able to take advantage of "implementation details", or things that we do under the hood that users don't know about. I've mentioned Go's built-in strings multiple times, as that is what Dolt is built on. They're easy to work with as they're built directly into the language, and I didn't want to complicate my fellow engineers' lives by removing standard Go strings from the codebase. Instead, I'm converting all incoming strings to utf8mb4, so we can use them like normal Go strings internally, and then convert them back to their original character set when needed. This also makes converting between character sets trivial, as it just requires changing the public character set.

With character sets out of the way, we had one final hurdle, which were collations. As we internally represent strings as normal Go strings, collations need to sort utf8mb4-encoded strings rather than strings belonging to their actual character set. I already had a map from the character set to utf8mb4, so it was a matter of finding the right query. This, however, was a bit trickier, as the "easiest" approach didn't quite line up exactly with what actually using MySQL was reporting. The "easiest" approach is to use the WEIGHT_STRING() function:

SELECT HEX(WEIGHT_STRING(CONVERT(_utf8mb4 0x____ USING ____) COLLATE ____));

WEIGHT_STRING() returns the sorting weight of any given string. Give it individual characters, and you get the weight for that specific character. The issue, however, is that not all characters return a weight. In those cases, I use a secondary query to compare it against a sorted list of characters:

SELECT STRCMP(CONVERT(_utf8mb4 0x____ USING ____) COLLATE ____, CONVERT(_utf8mb4 0x____ USING ____) COLLATE ____);

It turns out that relatively few characters do not return a weight, so this allows me to catch those and assign them a weight based on what they compare with.

And with that, after validating all of my results by running them back through MySQL, I now have all of the data to recreate character sets and encodings in Dolt!

Generated Code

In the last section I listed raw queries, but I wasn't typing those by hand for every character. Instead, I wrote and open-sourced a tool called collation-extractor. I give it a character set and/or collation, and it maps everything into data structures for me. With those data structures, I then generate files to insert into go-mysql-server, which is the MySQL-server-implementation that interfaces with Dolt. I knew that the generated code was correct, but there was a major issue with the size. The first character set I ever generated was roughly 90 megabytes! I had built a literal map going from every input character (the character set) to every output character (Go's string encoding), and that map had over 1.1 million entries! go-mysql-server was now taking so long to compile with the new file that it wasn't feasible to add anymore, so I had to rework the generated files.

I won't go into too much detail (feel free to glean the source code), but I noticed that stretches of some character sets would line up with the utf8mb4 character set, so I encode those stretches as offsets. For collations the story is similar, as most collations sort "a" < "b" and so on, therefore such sections are simply offsets. There are heuristics that determine how large a range must be to become an offset, as otherwise the performance of the generated code would be impacted. A map lookup is relatively quick compared to checking which range a character fits in, but a map is far more inefficient from a code-size perspective, so it's a delicate balance to achieve. You can see an example of a generated character set here, and an example of a generated collation here.

What's Left?

Although we fully support character sets and collations now, that does not mean that we've actually added every one of them just yet. As there are hundreds of them and the tool takes some time to run for each individual one, we add them on request. As well, we need to implement collation support for pattern matching, as our current regex parser implicitly uses Go's sort order. Overall, it has been very challenging to add support for character sets and collations, but this has been one of the most satisfying endeavors that I've gone on in my engineering career.

If you enjoyed reading this blog, perhaps check out some of our other blogs! You can keep up to date with us through Twitter, or you can chat directly with us through Discord. You can even check out our ever-growing list of open databases at DoltHub!

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt

Or join our mailing list to get product updates.