Introducing Secondary Indexes

FEATURE RELEASESQL
5 min read

Dolt is a SQL database with Git-style versioning. We're constantly adding new and exciting SQL features, and secondary indexes are one of them! This blog goes over what they are, why they're useful, and how they're implemented in Dolt.

What are indexes?

Indexes are a data structure that is used by many relational databases to allow for quick retrieval of a value given a key. They're set on specific columns of a table (usually in a specific order as well if multiple columns are used), and generally speed up read queries that involve those columns. Typically indexes are implemented as additional tables that reference the target table, but in all cases, they rely upon the data that is stored in their target table.

For example, given the following table:

pk val0 val1
1 11 21
2 12 22
3 13 23
4 14 24
5 15 25

An index on val0 may look like:

val0 pk
11 1
12 2
13 3
14 4
15 5

This means that they're not always perfect for every table, as they incur a performance penalty to write operations on those tables, since the indexes must be kept in sync with their target table's data. If a table is primarily for storage of data, with a relatively small amount of retrieval of that data, then an index may hurt more than help. However, for tables that are heavily read from, indexes are the best option to improve query performance.

For the user, indexes usually don't involve any extra bookkeeping after their creation. The database updates the indexes for you, allowing you to keep the same workflow as before their creation.

Different types of indexes

Perhaps the most common index is usually one that isn't referred to as such: the primary key! It has special significance in that it is used to determine the uniqueness of a row in a table. Even when a database allows you to create a table without a primary key, they'll usually create a hidden one for internal usage, since many storage engines require it.

Of course, primary keys may be one of the most well-known indexes, but they're far from the only one. We call all of these other indexes "secondary indexes". For this article, I'll just mention three others, and I'll start with the column index. It may be applied to a single column, and it supports the majority of commonly-used data types, such as INT, DATETIME, VARCHAR, etc. Next are composite indexes, which use multiple columns, and generally the order of the columns is important. This means that an index that declares columns with the order a, b is different from an index that declares the order b, a. The data types allowed are the same as with column indexes.

Lastly, we have unique indexes. These are either column or composite indexes, however they have a unique constraint on them. This constraint enforces uniqueness amongst the values or value combinations within a table. For example, if a unique index is put on a column in a table, and that column has the value 5 on any row, then no other row in that table is allowed to have that value 5. This probably sounds pretty similar to how a primary key functions, and that observation is a good one! Some databases will actually make use of a unique index as the primary key if one is provided during table creation and no explicit primary key is specified.

How they're implemented in Dolt

In Dolt, table data is implemented as a sorted map, with the key being a tuple of the primary keys, and the value being a tuple of all of the other columns. For indexes, this idea is modified a bit. For the key tuple, we take the columns that are indexed and append the primary keys to them (if they're not a part of the indexed columns), and store that as the key for the map. The value is left empty, which essentially turns the map into a set.

For a visual example, let's pretend we have a table created with the following statement:

CREATE TABLE `employees` (
  `id` BIGINT PRIMARY KEY,
  `first` VARCHAR(20),
  `last` VARCHAR(20),
  `favorite_number` DOUBLE,
  INDEX `idx_fav_num` (`favorite_number`)
);

With employees having the following data:

id first last favorite_number
1 Petey Cruiser 3.3
2 Anna Mull 1.0
3 Paul Molive 7.0
4 Paige Turner 1.0
5 Bob Frapples 5.5

Our employees map looks like:

key value
(1) (Petey, Cruiser, 3.3)
(2) (Anna, Mull, 1.0)
(3) (Paul, Molive, 7.0)
(4) (Page, Turner, 1.0)
(5) (Bob, Frapples, 5.5)

Thus, our idx_fav_num index map looks like:

key value
(1.0, 2) ()
(1.0, 4) ()
(3.3, 1) ()
(5.5, 5) ()
(7.0, 3) ()

If you wanted to query everyone whose favorite number is 1.0, you might write:

SELECT `first`, `last` FROM `employees` WHERE `favorite_number` = 1.0;

If we did not have a secondary index, then we would have to scan the entire table, row by row, to find all rows that list a favorite number being 1.0. However, by using the index, we can jump to where we first find 1.0, and iterate until we find a different number. If instead you wanted to find someone with a favorite number of 6.3, then we can easily find that no values exist between 5.5 and 7.0, and return an empty result immediately.

In this example, iterating through all five rows would be very quick, however when the row count approaches the thousands or millions, the performance savings that indexes bring get larger and larger. To illustrate this, I ran the following command:

dolt sql -q "CREATE TABLE test (pk BIGINT PRIMARY KEY, val0 BIGINT);"

I imported 10,000,000 rows into this test table, and ran:

time dolt sql -q "select * from test where val0 = 4586105337100680032"
+---------+---------------------+
| pk      | val0                |
+---------+---------------------+
| 7113158 | 4586105337100680032 |
+---------+---------------------+

real    0m22.828s
user    0m0.000s
sys     0m0.015s

Next, I added an index by running:

dolt sql -q "CREATE INDEX idx_val0 ON test(val0);"

I then re-ran the above query:

time dolt sql -q "select * from test where val0 = 4586105337100680032"
+---------+---------------------+
| pk      | val0                |
+---------+---------------------+
| 7113158 | 4586105337100680032 |
+---------+---------------------+

real    0m0.063s
user    0m0.000s
sys     0m0.015s

Adding an index brought the read time down from 22.8 seconds to 63 milliseconds, which is a massive improvement.

As mentioned earlier, indexes make writes more expensive. To measure how much more expensive, I ran an additional 1,000,000 INSERT statements on the aforementioned table. Without an index, it took roughly 1 minute and 6 seconds. With an index, it took 4 minutes and 48 seconds. We're still improving our performance with writes, and we're already working hard to bring you these enhancements!

What's next for indexes?

Right now, Dolt has full support for column, composite, and unique indexes. In the future, we plan to add support for index prefixes (also known as partial indexes), fulltext indexes, and spatial indexes. We are working hard to expand our index support, so check our release notes on GitHub to stay updated! Better yet, if you need a new index type, implement it and send us a pull request, or file and issue telling us about your use case.

Conclusion

Indexes are a major step forward for Dolt's performance. They allow for a substantial read performance, with the tradeoff of slower writes. The journey to a complete index implementation is still on-going, and I hope you will join us along this ride!

If you enjoyed reading this blog, why not check out Dolt, or browse our repositories here on DoltHub? We have a great collection of data that you may find interesting, and a pool of ever-growing community-powered datasets to view, pull, and collaborate on!

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt

Or join our mailing list to get product updates.