Implementing Full-Text Indexes

August 14, 2023

10 min read

A few weeks ago, we announced our initial implementation of Full-Text indexes. Dolt uses a bespoke SQL engine, which allows us to have a Git-influenced versioned database with the performance that would be expected of a production-ready database. This also means that, in order to fulfill our goal of being a drop-in replacement for MySQL 8.0, we must implement all of MySQL's features within our engine.

In this blog post, I'll go over how we implemented Full-Text indexes in Dolt's SQL engine.

What are Full-Text indexes?

Before we go over the implementation, we should briefly go over what Full-Text indexes are. Full-Text indexes provide an efficient way to search through text. A FULLTEXT index is declared over one or more text columns, and those columns form the basis of a document. A document is the concatenation of values from a row, defined by the columns of the FULLTEXT index. Documents are broken down into words, and a reference is created between each word and any documents that contain the word.

The Full-Text search engine may be invoked using the MATCH ... AGAINST ... expression, which will match any given words to their parent documents. This can be far more powerful than a regex-based search, as words do not have to be in order, and also the text is normalized (according to MySQL's rules on normalization).

This search efficiency is especially evident when there are a large number of documents, or the documents themselves are very large. The tradeoff is that we pay a penalty during insertion, as we break each document down into its individual words.

Now, let's walk through how Full-Text indexes are implemented, from creation to execution.

Defining a normal index

Creating a standard index is relatively straightforward in Dolt. I previously wrote a blog post on how they're implemented, but I'll simplify it a bit here. Say we have the following table:

CREATE TABLE example1 (`col1` BIGINT PRIMARY KEY, `col2` BIGINT, `col3` BIGINT, INDEX idx (`col2`));

This creates two "tables". The first is as defined, with three columns. The second contains the indexed column (col2) and a unique identifier¹ for each row, which is the primary key in this case (col1). If we were to represent the second table as a SQL statement, it would look like the following:

CREATE TABLE example1_idx (`col2` BIGINT, `col1` BIGINT, PRIMARY KEY (`col2`, `col1`));

Primary keys are always stored in a sorted order, which allows us to quickly find any row as long as we have a key. We can also use a prefix of the key, meaning we can find all primary key values using just col2 on our second table.

SELECT * FROM example1 WHERE col2 = 4;

This query will use the value 4 as a prefix for the primary key on our second table. This will let us get all matching rows on the second table, upon which we'll extract the original primary key (col1) from our second table's results to get all of the rows from our original table.

This vastly speeds up the majority of searches that filter on the indexed columns (using the WHERE clause, joins, etc.), however it comes with the added complexity of doubling all insertions, deletions, and updates. This, however, differs from a Full-Text index.

Defining a Full-Text index

Determining the key columns

As mentioned in the overview, FULLTEXT indexes must be declared over text columns. These indexes may contain only a single column, or they may contain multiple columns, just like normal indexes. When a Full-Text index spans multiple columns, it treats those columns as though it were a single column by concatenating all values and inserting a space between those values. The resulting string is referred to as a document, and documents are the unit that all Full-Text functions revolve around.

Now that we have documents, we still need a way to uniquely refer to their parent row within the original table. MySQL chooses to create a hidden FTS_DOC_ID column, which is an auto-incrementing column containing a UNIQUE INDEX. This acts as a sort of hidden primary key, and allows the MySQL developers to program against the assumption that all tables have a unique identifier. This works well enough in MySQL, however it would cause problems in Dolt, as auto-incrementing columns add an insertion-dependence into the data, which makes merging tricky (the same data, inserted in a different order, would create merge conflicts due to differing auto-increment values). Since MySQL's FTS_DOC_ID column is hidden anyway, we decided to do without it, and instead deal with the added complexity of working with tables that do not define any usable keys.

When the original table defines a primary key, then it is used as-is, similar to how they're used in a normal index. As a special case, a UNIQUE index that does not allow NULL values is essentially the same as a primary key, so we also treat those indexes as though they were primary keys. If the original table does not have a primary key, then we use the hash of the row's contents. This third strategy does not let us uniquely lookup a row, as multiple rows could have the same contents. To clarify, we cannot perform a lookup on the original table using the hash of a row², as we would have to hash each row's contents to find a match, which is just a full-table scan. This means that we lose the ability to filter rows for our matches when using such "keyless" tables. Although not preferable, this does allow us to bypass the auto-increment issue, as all values within the table are accounted for by the users, meaning we are not introducing potential merge issues as a by-product of adding Full-Text indexes. With the three strategies in place, we cover all possible tables, and we have the columns we'll use to identify rows with (hereafter referred to as key columns).

The extra tables

Unlike normal indexes that create a single secondary table, we create five tables.

The first table handles the configuration for Full-Text indexes on that table, meaning it is shared by all Full-Text indexes that are declared for the specific table. Its SQL equivalent would look like the following:

CREATE TABLE fts_config (`id` INT PRIMARY KEY, `stopword_table` VARCHAR(2048) NOT NULL, `use_stopword` TINYINT NOT NULL);

These table names will be used later to refer to each specific table.

The second table handles the position of each word within each document. This table is primarily used in query expansion, which is not a feature that we've implemented yet, but it enables word adjacency on matches. Its SQL equivalent would look like the following:

CREATE TABLE fts_position (`word` VARCHAR(84), /*key columns,*/ `position` BIGINT UNSIGNED, PRIMARY KEY (`word`, /*key columns,*/ `position`));

The commented out /*key columns*/ represents the key columns as mentioned earlier.

The third table handles the number of times a word appears for a given document. If a word appears in a document 4 times, then this table will store that 4. This is used when calculating relevancy, which we'll discuss later. Its SQL equivalent would look like the following:

CREATE TABLE fts_doc_count (`word` VARCHAR(84), /*key columns,*/ `doc_count` BIGINT UNSIGNED NOT NULL, PRIMARY KEY (`word`, /*key columns*/));

The fourth table handles the number of times a word appears in a document. If a word appears in 7 different documents, then this table will store that 7. This table is also used when calculating relevancy. Its SQL equivalent would look like the following:

CREATE TABLE fts_global_count (`word` VARCHAR(84) PRIMARY KEY, `global_count` BIGINT UNSIGNED NOT NULL);

Since this table stores information about a word relative to the entire index, it does not need to reference the key columns (similar to the configuration table).

The fifth table stores each row's hash, the number of times the row occurs within the table, and the number of unique words within the document for that row. This table ensures that we're iterating over the correct number of documents. For tables that contain a primary or applicable unique key, the occurrence value will always be one, while all other tables may store values greater than one. This may seem unnecessary for primary and unique keyed tables, however it allows us to greatly simplify our logic by not making a distinction between keyed and keyless tables. The unique words are used when calculating relevancy. Its SQL equivalent would look like the following:

CREATE TABLE fts_row_count (`row_hash` VARCHAR(84) PRIMARY KEY, `row_count` BIGINT UNSIGNED NOT NULL, `unique_words` BIGINT UNSIGNED NOT NULL);

Match expression and search modes

As mentioned earlier, the Full-Text search engine is invoked using the MATCH ... AGAINST ... expression. This is an example query:

SELECT * FROM some_table WHERE MATCH(some_col) AGAINST ('Cleopatra');

The parenthesis after MATCH specify the columns that will form a document. These columns do not have to be declared in the same order as an existing Full-Text index. An index just has to exist on the columns.

The parenthesis after AGAINST specify the search terms, or the input words. These must be string literals, and cannot be columns, function return values, etc. If no search mode is specified, then the search mode IN NATURAL LANGUAGE MODE is implied. This means the above query is equivalent to:

SELECT * FROM some_table WHERE MATCH(some_col) AGAINST ('Cleopatra' IN NATURAL LANGUAGE MODE);

There are four search modes.

IN NATURAL LANGUAGE MODE
- This mode simply matches the input words with the words found in the document.
IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION
- This mode performs two searches in order to find words that are associated with the input words. The first search is similar to the default natural language mode. The second search then analyzes the documents returned in the first search to find words that appear to be associated with the input words. The final set of documents are based on the results of both searches.
WITH QUERY EXPANSION
- This performs the query expansion portion, but does not use the same ruleset as natural language mode.
IN BOOLEAN MODE
- Allows defining a custom ruleset to determine relevancy. This is similar to a specialized form of regex. Can be used to exclude documents that contain certain words.

Of the four modes available, we currently only support the default mode.

Match expressions can be used as a filter (following a WHERE clause), where the resulting relevancy value is used to determine whether a row matches. Match expressions may also be used as an output value, which will return the relevancy. This is an example query:

SELECT MATCH(some_col) AGAINST ('Cleopatra') FROM some_table;

This returns the relevancy for every row in the table (which will be zero for all non-matching rows).

Filtering

When a match expression is used as a filter, it functions similarly to filtering on normal indexes by reducing the number of rows that are being evaluated. As Full-Text indexes are structured differently, they only filter when used by a match expression. Comparison, joins, etc. do not make use of Full-Text indexes for filtering.

The filtering itself is dependent on the search mode. Modes such as boolean are always evaluated over every row. However, since only the default mode is implemented for now, we only have one case to consider. First, we verify that the Full-Text index being used references a primary or unique key. If it does not, then we revert to a full-table scan. If it does, then we search fts_doc_count for the input words, which will return our key columns (the fts_position table would work as well). With the key columns, the process is the same as having a primary key with a normal index.

Relevancy

Regardless of whether a filter is used or a full-table scan is used, we need to verify that a document matches the input words for the given search mode. As we currently only implement the default search mode, we will ignore the other modes.

First, we grab the word count for the row's document by fetching it from our fts_doc_count table. This technically could be computed on the fly, but a document could be hundreds of megabytes in size, so it's far more efficient to do it once and lookup the result. This lookup also verifies that the word exists in the document, which allows us to bail early if we cannot find a match in the fts_doc_count table.

Next, we grab the word's count across all documents from our fts_global_count table, and the number of unique words from our fts_row_count table. These values are then used to calculate the relevancy, which is a value that represents how "accurate" a document is compared to the input words. A higher relevancy should correspond to a "better" match. This is often done using some form of tf-idf (term frequency-inverse document frequency). MySQL's exact formula is not stated within the documents, therefore we have not yet settled on our exact algorithm. The exact number is less important, rather it's the relationship between documents that matters, and therefore we're looking to find the best algorithm that captures that (while adhering as close as possible to MySQL's output). Expect this to change in the future (MySQL's output is also unstable across versions).

Conclusion

With the relevancy discussed, all of our Full-Text index implementation has been covered! MySQL's documentation goes into great detail on the output of Full-Text indexes, however very little is said about the implementation. The majority of our implementation was born from trial and error of attempting to understand why MySQL made the decisions that they did, and inferring what might be the most likely implementation, while also considering our unique aspect of being a versioned database.

If you've enjoyed this blog, feel free to check out some of our other blogs! To stay up to date, you can follow us on Twitter, or you can chat directly with us through Discord.

Tables without primary keys have hidden unique identifiers that the indexes use.↩
As mentioned in the other footnote, Dolt does have a unique identifier for keyless tables, and it could be used in place of hashing a row's contents. The main issue is due to the divide between the SQL engine and Dolt. I treat them as the same in this blog to simplify the explanation, however they are two separate projects. The engine is developed in concert with Dolt, but Dolt is technically an integrator. With normal indexes, the engine relies on the integrator to manage the secondary table, and therefore Dolt is able to work with its special unique identifier. This also means that all integrators must manage their secondary tables, greatly increasing their workload. In contrast, the SQL engine handles all of the Full-Text secondary tables, which simplifies the job of every integrator. This comes with the downside of not being able to access these hidden types though. We could allow it, but then we'd have to completely redesign our engine's type system to accept arbitrary non-MySQL-conformant types, which would be a potentially massive undertaking. In addition, this split architecture could allow us to include multiple SQL engines, one of which could be a PostgreSQL-compatible engine, without having to make many changes on the Dolt side as they would hook into the same integration points.↩

Blog