Sometimes, the most complicated solution may end up being the simplest.
This sounds like a contradiction, but it depends on where a project's priorities lay.
I'm an engineer here at DoltHub, and about 9 months ago I implemented collation support for our main product: Dolt.
As Dolt is a drop-in replacement for MySQL for those that want versioning features (think Git for databases using MySQL's dialect), collations, at least for me, were kind of a tipping point where we've gone from supporting most of the "major" features to focusing on the more "hardcore" features that users want.
Every database user depends on collations, but most aren't even aware that they exist...except for those that do know, and usually those users really need proper collation support.
So, I implemented them from scratch.
I won't go over that journey here as I already wrote another blog about that, but a part that I didn't mention in that blog is that it broke our regex support.
Well, I say "broke", but our support was kind of iffy to begin with.
One collation that MySQL has is called
utf8mb4_0900_bin, and it just so happened to line up perfectly with Go's built-in string handling, which is the language that Dolt is implemented in.
This let us get away with enforcing
utf8mb4_0900_bin as the only supported collation for a long time.
Consequently, we used Go's built-in regex support, which does differ from MySQL's regex support as Go uses the RE2 syntax, while MySQL uses ICU.
Adding collation support broke our assumptions about the string type, and therefore meant that we needed to implement proper regex support.
Well, that would be simple, right?
We just need to add the ICU library to Dolt!
Learning What "Simple" Means
Well, for anyone familiar with ICU, you may have already spotted a complication.
ICU is maintained as a C/C++/Java library (icu4c & icu4j), and Dolt is a Go application.
I've already embedded a Go library in C, so I just need to do the reverse, right?
If only "the reverse" was as straightforward.
The first core issue that I noticed is that people don't really compile ICU all that often.
For many, it's shipped as a library that's either already a part of the system, or a dependency that is added by a program.
Thankfully, the documentation is very robust.
Sadly, it's geared toward a very specific workflow, which is to use
These are time-tested solutions for C that don't make a lot of sense in the world of Go.
While we could introduce a new build process for Dolt, we can't quite introduce one for go-mysql-server.
We don't just maintain Dolt.
We also maintain GMS, which is the library that implements all of MySQL's behavior.
The library is built such that Dolt is an integrator that implements all of its versioning features.
This is great for a separation of concerns, as it gives us a potential path toward supporting other SQL dialects in the future.
This also means that other users and companies use GMS for their own needs since it's a near feature-complete MySQL interface.
Such a change would force all of these downstream users to adopt this new build process for their projects, which would be met with quite a bit of unhappiness/outrage.
To get around this, we could precompile ICU into a static library.
Except...we'd need to do that for every supported platform and architecture combination that Go supports.
Not only would that be very difficult to do, but it would greatly increase the size of our repository, so that approach would not work.
We do have another route forward, which uses CGo.
CGo uses the C/C++ compilers defined in the environment, and it seems reasonable to assume that many developers will already have these installed, so there'd be less pushback in that way.
If that were the only issue, then it seems like it'd be an obvious choice, but there was a bigger elephant in the room: the fact that we'd have to modify ICU so that it does not rely on
configure file is nearly 10k lines long, and handles all of the platform differences that we'd somehow need to consider.
Even better, CGo has a limitation, in that all files need to be in the same directory for a package.
Even with all of this, I didn't see a better way forward, and spent a few weeks to "successfully" get it working in Windows...and only Windows.
At the very least, I got it working, and so this approach was actually possible!
All would have been well, except for that smaller elephant I mentioned earlier: CGo uses the compilers defined in the environment.
I didn't realize at the time, but after talking to one of our cofounders, Aaron, I learned that CGo breaks native cross-compilation, as it requires additional setup.
This is something that we heavily take advantage of within DoltHub, and is a pretty big side-effect to force upon adopters of GMS.
Aaron, after hours, threw out the idea of WebAssembly, and at first I thought he was joking.
Turns out, he wasn't joking, and after looking more into it, it actually seemed like the best solution.
So we've got a C/C++ library that could be included directly into Go, but instead we'll compile it to a WASM module.
We'll then run that module using a runtime, and that runtime will be contained within our Go code.
This seems like a textbook case of overcomplication, and yet it actually is simpler than the previous efforts!
Massive shoutout to wazero, as their library is exactly what we need: a zero-dependency WebAssembly runtime.
Without them, this approach would not have been as big of an improvement over the CGo implementation.
With the runtime taken care of, all that's left is to compile ICU into a WASM module.
Emscripten made that easy by porting ICU, so that all I'd need to do is provide the argument
With those two things done, we had ICU in a WASM module, along with a runtime to execute it!
From here, it was quite straightforward to get it working.
I gave it an empty C++ file so that it had something to "compile", and out popped our module!
Using Go's embed functionality, I was able to inject the module into Go so that it appears as a simple byte slice, which fulfills our single-binary approach (no need to load a module from the filesystem).
From there, wazero compiled our byte slice and exposed our exported functions via functions that take a variable number of
uint64s, so I had to manually map each function's expected input.
Overall, the workflow was quite similar to working with C code from within Go (remembering to
free after allocations, etc.).
With it all hooked up, I ran it, and it worked!
It actually worked!
I gave it regular expressions, and it returned the results that I expected it to!
At this point, I'd been working on this project for over a month, so one could imagine how happy I was to have finally gotten to this point.
At DoltHub, I specialize in taking long-form projects, but generally I'm making progress during the majority of the process.
This was the first one where I felt like I was running into brick wall after brick wall, but it was finally working.
Now that it's working like I expect, let's check the performance...and wow, it's slow.
Like 50-100x slower than Go's built-in regex.
As bad as that sounds, the real-world impact is far less, as there's overhead in fetching data from tables, and other forms of processing that occur before and after the regex portion.
But this was still painfully slow.
After digging further into it, the slow part was primarily the function calls that transition from Go to WASM and back.
If I could reduce the number of calls, then I could massively speed up execution.
Thankfully, Aaron came to the rescue again, and mentioned that we could take advantage of the global stack variable to skip some
I modified the module (using the text representation) to expose the global stack variable, and put every small allocation on the stack by moving the stack pointer and directly copying our values to the memory (which wazero exposes for us).
This brought the multiplier down to around 10x, which is still a lot, but after factoring in the rest of the program, only results in around a 15% speed reduction (for regular expressions only).
Considering we gained complete regex compatibility, I think that tradeoff is worth it, plus there are additional optimizations that can be made to bring that down even further for general use cases.
One such optimization is having a "fixed" memory location to store smaller strings, which would get rid of more
free calls for the majority of use cases (like strings under 10MB).
What About Collations?
I spent quite a bit of time at the beginning of the blog talking about collations, and they didn't show up at all after the first section.
Besides gaining full syntax and capability compatibility by going with ICU, the other reason for choosing ICU was due to MySQL's collation support.
MySQL supports collations, and they use ICU, therefore ICU must be able to handle collations, right?
They do, and they have an entire section of their document dedicated to ICU data, which contains collations.
My naivety assumed that MySQL took advantage of this collation support, and so a substantial amount of time was spent trying to compile this data into the program.
Earlier, when I mentioned that it took me weeks to get a CGo build working in Windows, 80% of that time was spent getting the data into the build process.
It wasn't until I had the WASM module running that I discovered that regular expressions in MySQL don't take collations into consideration at all.
I have a future blog post that goes a bit more into this, so I'll leave the rest for that post.
It's been a long journey to get proper regex support, but it's finally in Dolt.
Regular expressions will behave exactly as they do in MySQL (including nifty features like backreferencing).
In case you're curious, we've also open-sourced our regex repository, so feel free to check it out!
To stay up to date with future information regarding Dolt, you can follow us on Twitter, or you can chat directly with us through Discord.