Prefix Indexes


If you haven't heard, Dolt is a SQL database with Git versioning. A couple of months ago, a customer asked for prefix indexes, so we implemented them. In this blog, we'll discuss how to use prefix indexes, their benefits, and their limitations.

Prefix Index Support

Recently, we added prefix index support for all string and byte string column types: BINARY, VARBINARY, CHAR, VARCHAR, BLOB, and TEXT. Prefix indexes are very similar to secondary indexes in that they index data in a B-tree for faster reads, at the cost of slower writes. However, a prefix index covers only the first N characters (or bytes) of a string column. Additionally, prefix indexes are the only way to create indexes on out-of-band string types like BLOB and TEXT.
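For example, the statements below show the syntax. The table and column names here are hypothetical; the parenthesized number after the column name is the prefix length:

mysql> create table articles (body text);
mysql> alter table articles add index (body(10));  -- index only the first 10 characters of body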

As a wise man once asked, “why waste time say lot word when few word do trick?”

Benchmarking

To see how much prefix indexes improve performance, I wrote a small benchmark. I create four tables: two use a text column, and two use varchar(32). For each type, one of the tables has a prefix index defined over the column (_idx), while the other doesn't (_no_idx). Then, I randomly generate 10000 batches of 10 strings of length 32, for a total of 100000 probably unique strings. Lastly, the same strings are inserted into all four tables, giving us four tables with identical row data.

Here is the Python benchmark setup:

# Create a new connection
import mysql.connector
mydb = mysql.connector.connect(
  host="localhost",
  user="root",
  password="",
  port=3307
)
cursor = mydb.cursor(buffered=True)

import random
import string

# generate a random string of length k
def gen(k):
    return ''.join(random.choices(string.ascii_lowercase + string.ascii_uppercase, k=k))

# generate a comma-separated list of batch_size parenthesized string literals
def gen_batch(batch_size):
    res = []
    for i in range(batch_size):
        res.append("('{}')".format(gen(32)))
    return ','.join(res)

cursor.execute("use tmp")
cursor.execute("drop table if exists txt_tbl_idx")
cursor.execute("drop table if exists txt_tbl_no_idx")
cursor.execute("drop table if exists var_tbl_idx")
cursor.execute("drop table if exists var_tbl_no_idx")

cursor.execute("create table txt_tbl_idx (t text)")
cursor.execute("create table txt_tbl_no_idx (t text)")
cursor.execute("create table var_tbl_idx (v varchar(32))")
cursor.execute("create table var_tbl_no_idx (v varchar(32))")

for i in range(10000):
    s = gen_batch(10)
    cursor.execute("insert into txt_tbl_idx values {}".format(s))
    cursor.execute("insert into txt_tbl_no_idx values {}".format(s))
    cursor.execute("insert into var_tbl_idx values {}".format(s))
    cursor.execute("insert into var_tbl_no_idx values {}".format(s))
cursor.execute("commit")

cursor.close()
mydb.close()

Picking Prefix Lengths

We haven't actually added any indexes yet, because we haven't decided on a good prefix length. Ideally, a prefix should be as short as possible while still uniquely identifying each row in the table. So, we'll try a range of prefix lengths and see how well they separate the data (the best possible result is 100000, one distinct prefix per row).

mysql> select count(distinct left(t, 1)) from txt_tbl_idx;
+----------------------------+
| count(distinct left(t, 1)) |
+----------------------------+
| 52                         |
+----------------------------+
1 row in set (0.00 sec)

mysql> select count(distinct left(t, 2)) from txt_tbl_idx;
+----------------------------+
| count(distinct left(t, 2)) |
+----------------------------+
| 2704                       |
+----------------------------+
1 row in set (0.00 sec)

...
    
mysql> select count(distinct left(t, 6)) from txt_tbl_idx;
+----------------------------+
| count(distinct left(t, 6)) |
+----------------------------+
| 100000                     |
+----------------------------+

An important thing to note is that you don't have to pick a prefix length that uniquely identifies every row; you can accept a performance hit in exchange for a smaller disk footprint. In this case, I'm choosing a prefix length of 6.

mysql> alter table txt_tbl_idx add index (t(6));
mysql> alter table var_tbl_idx add index (v(6));

One caveat: if your column data is very similar, a prefix index may not be a good fit. For example, a string column full of web URLs is not an ideal candidate for prefix indexing, since they all start with https://, http://, www., etc., meaning a short prefix has trouble telling them apart.
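If you're unsure, one way to sanity-check a candidate prefix length is to compare the number of distinct prefixes against the total row count; a ratio close to 1 means the prefix is nearly unique per row. Here's the idea, using a hypothetical urls table:

mysql> select count(distinct left(u, 10)) / count(*) as selectivity from urls;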

Benchmark Results

Now we are ready to run some queries. Here's a Python script that runs a SELECT query with a LIKE clause, which takes advantage of the prefix index when it exists. Each query is run against the Dolt server 100 times, and the average time is taken.

# Create a new connection
import mysql.connector
mydb = mysql.connector.connect(
  host="localhost",
  user="root",
  password="",
  port=3307
)
cursor = mydb.cursor(buffered=True)
cursor.execute("use tmp")

import time
def measure_query(query):
    diff = 0
    num_trials = 100
    for i in range(num_trials):
        tic = time.perf_counter()
        # the buffered cursor fetches the full result set during execute(),
        # so this timed section includes row retrieval
        cursor.execute(query)
        toc = time.perf_counter()
        diff += toc - tic
        cursor.fetchall()  # drain the buffered rows before the next query
    return round(diff / num_trials, 6)

txt_idx = measure_query("select * from txt_tbl_idx where t like 'la%'")
txt_no_idx = measure_query("select * from txt_tbl_no_idx where t like 'la%'")
var_idx = measure_query("select * from var_tbl_idx where v like 'la%'")
var_no_idx = measure_query("select * from var_tbl_no_idx where v like 'la%'")

print("text table with index:", txt_idx)
print("text table without index:", txt_no_idx)
print("varchar table with index:", var_idx)
print("varchar table without index:", var_no_idx)
print("text speed up:", txt_no_idx / txt_idx)
print("varchar speed up:", var_no_idx / var_idx)

cursor.close()
mydb.close()

Results for the query "select * from txt_tbl_idx where t like 'la%'" (and its equivalents on the other tables):

text table with index: 0.000283
text table without index: 0.013347
varchar table with index: 0.000558
varchar table without index: 0.008789
text speed up: 47.16254416961131
varchar speed up: 15.75089605734767

Evidently, there is a significant improvement from using indexes here. We see roughly a 47x speedup for SELECTs on the TEXT column, and a 15x speedup on the VARCHAR column. The reason for the large difference between TEXT and VARCHAR is that TEXT values are stored out-of-band. Under the hood, a TEXT field holds a reference (like a pointer) to the actual data, while a VARCHAR field stores the data inline. This means that every read of a TEXT field pays an extra cost to dereference that pointer and retrieve the data, which makes prefix indexes especially valuable for these types.
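If you want to double-check that a query is using the prefix index rather than scanning the table, you can inspect its plan with EXPLAIN (the exact plan output depends on your Dolt version, so I won't reproduce it here):

mysql> explain select * from txt_tbl_idx where t like 'la%';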

Drawbacks

However, it is important to note that the above is a best-case scenario.

Results for the query "select * from txt_tbl_idx where t > 'A'" (and its equivalents on the other tables):

text table with index: 0.344052
text table without index: 0.472808
varchar table with index: 0.442481
varchar table without index: 0.461156
text speed up: 1.3742341273993466
varchar speed up: 1.0422052020312738

Here, we see that using the index did improve our results, but not by much. There are two reasons for this: (1) TableScans are great for reading huge sections of a table, and (2) TableScans are parallelized, while IndexedLookups are not. Our query actually returns every row in the table, so we're effectively doing a full table scan, which is what a query on an index-less table would do anyway.
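You can confirm this by counting the matching rows; since every generated string starts with an ASCII letter, the range predicate should match the entire table:

mysql> select count(*) from txt_tbl_idx where t > 'A';  -- expect 100000, i.e. every row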

As mentioned earlier, while reading from a table with prefix indexes is usually faster, it comes at the cost of slower writes. Running the insert benchmark from before, but with the indexes defined beforehand, yields these results:

txt_tbl_idx avg write time: 0.001838
txt_tbl_no_idx avg write time: 0.000984
var_tbl_idx avg write time: 0.001565
var_tbl_no_idx avg write time: 0.000873

It takes roughly 1.8 times longer to write to the tables with indexes than to those without. Lastly, it takes about 0.31 seconds to add a prefix index to the benchmark tables after they've been filled with rows.

Conclusion

There are clearly improvements to be made here and in many other places for Dolt, but, once again, we are another step closer to matching MySQL's feature set and performance. If you have any questions about Dolt or find any bugs, you can reach out on Discord and GitHub!
