Indexing Keyless Tables


Introduction

Dolt is a new relational database with Git-style versioning and offline debuggability. To be used as an application database, Dolt must be a drop-in replacement for other SQL databases. We started with MySQL, the most widely used SQL dialect, and have been inching towards 100% compliance over the last year.

At the end of 2020 Dolt added keyless tables, a way to store rows without specifying a unique primary key. Storing data with a primary key is more efficient and user-friendly, but sometimes we need keyless tables. When importing a CSV file, for example, we may not know which field, if any, to designate as the primary key.

Today we will talk about our newest feature: adding secondary indexes to keyless tables! After showing how to use this feature, we will discuss specific applications that depend on this pattern, and how our secondary indexes compare to more common B-tree structures. If you are interested in the storage-layer trade-offs of relational databases, you might also find the deep dive at the end interesting.

Demo

First, we download Dolt:

sudo bash -c 'curl -L https://github.com/dolthub/dolt/releases/latest/download/install.sh | bash'

Next we create a keyless table:

$ dolt init
Successfully initialized dolt data repository.
$ dolt sql -q "create table kl (a int, b int, c int)"
$ dolt sql -q "insert into kl values (0,0,0), (1,1,1), (1,1,1)"
Query OK, 3 rows affected
$ dolt sql -q "select * from kl"
+---+---+---+
| a | b | c |
+---+---+---+
| 1 | 1 | 1 |
| 1 | 1 | 1 |
| 0 | 0 | 0 |
+---+---+---+

We duplicate the table for a reference comparison later:

$ dolt sql -q "create table kl_idx as select * from kl"

And now add an index to the copy, kl_idx:

$ dolt sql -q "alter table kl_idx add index a_idx (a)"

Performing lookups on these tables yields identical results:

$ dolt sql -q "
    select * from kl
    where a = 1 and
    a in (select a from kl_idx where a = 1)"
+---+---+---+
| a | b | c |
+---+---+---+
| 1 | 1 | 1 |
| 1 | 1 | 1 |
+---+---+---+

The difference lies in how the rows are retrieved. Our first table performs a table scan to find the WHERE-filtered rows, while our secondary index on kl_idx performs point lookups:

$ dolt sql -q "explain select * from kl where a = 1"
+---------------------------------------+
| plan                                  |
+---------------------------------------+
| Filter(kl.a = 1)                      |
|  └─ Projected table access on [a b c] |
|      └─ Exchange(parallelism=8)       |
|          └─ Table(kl)                 |
+---------------------------------------+

$ dolt sql -q "explain select * from kl_idx where a = 1"
+--------------------------------------------------+
| plan                                             |
+--------------------------------------------------+
| Filter(kl_idx.a = 1)                             |
|  └─ Projected table access on [a b c]            |
|      └─ IndexedTableAccess(kl_idx on [kl_idx.a]) |
+--------------------------------------------------+

Indexes make lookups faster but writes slower. This is often the appropriate trade-off, as most tables are read more often than they are written.

Motivation

Keyless tables and secondary indexes facilitate data migrations. Conversions between data formats (like file imports) and schema changes (like changing a table's primary key) temporarily store data in unoptimized states.

Object-relational mappers (ORMs) make heavy use of keyless tables. For example, Pandas uses SQLAlchemy under the hood to import a table. The snippet below (taken from the Dolt with Popular DataFrames blog) creates a Pandas DataFrame and writes it to a running Dolt server:

In [1]: import pandas as pd
   ...: from sqlalchemy import create_engine
   ...: engine = create_engine("mysql+pymysql://root@localhost/test_db", echo=False)

In [2]: with engine.begin() as connection:
   ...:     df = pd.DataFrame({"name" : ["User 4", "User 5"]})
   ...:     df.to_sql("users", con=connection, if_exists="append")

In [3]: engine.execute("SELECT * FROM users").fetchall()
Out[3]: [(1, 'User 5'), (0, 'User 4')]

We were lazy about specifying a schema and primary key, so SQLAlchemy defaults to using the Pandas index when creating the users table. These are the queries that SQLAlchemy runs under the hood:

> DESCRIBE `users`;
error running query: table not found: users
> CREATE TABLE users (
	`index` BIGINT,
    name TEXT
  );
> CREATE INDEX ix_users_index ON users (`index`);
> INSERT INTO users (`index`, name) VALUES (0, 'User 4'),(1, 'User 5');
> SELECT * FROM users;
[INT64(1) TEXT("User 5")]
[INT64(0) TEXT("User 4")]

And it isn't just Pandas that depends on this behavior. For example, the model and experiment tracking platform MLflow runs a series of table creates and alters when initializing a storage backend using the Alembic migration library. If you use Django, FastAPI, or another Python web framework, you probably run SQLAlchemy migrations in the background.

How Indexes Work

A Bit of History: B-Tree vs. B+Tree

Early B-Tree implementations used nonclustered indexes to separate the index from the row. A tree traversal here ends with a pointer to a row address in a separate heap. Concerns are divided between two data structures on disk, which can be useful, but range reads are interrupted by disk seeks into the row heap.

A newer generation of index storage, called a clustered index, or B+Tree, stores the index and row in one tuple. The consolidated structure is bigger than a nonclustered index, but most of the time this ends up being more efficient. Indexes can be compressed, and coupling the two facilitates sequential reads.

Our default primary key tables are clustered indexes. Keyless tables with a secondary index are nonclustered indexes, the original B-Tree design. In theory this marks a clear path towards implementing keyless indexes. But reasoning about the relationships between row and index on disk can get hairy.

Storing Indexes on Disk

Primary Key Tables

Here is the friendly view of a clustered index with the same data as our demo keyless table (without duplicates):

$ dolt sql -q "create table pk (a int primary key, b int, c int)"
$ dolt sql -q "insert into pk values (0,0,0), (1,1,1)"
$ dolt sql -q "select * from pk"
+---+---+---+
| a | b | c |
+---+---+---+
| 0 | 0 | 0 |
| 1 | 1 | 1 |
+---+---+---+

The tuples fetched by SELECT are stored as a map from primary key to row value:

     Map<Union<>,Union<>> - map {
       (1821,0): (8313,0,16287,0),
       (1821,1): (8313,1,16287,1),
     }

The first row is (0,0,0), and the second is (1,1,1). Substituting human-readable fields for column tags makes this a bit easier to understand:

     Map<Union<>,Union<>> - map {
       ('a',0): ('b',0,'c',0),
       ('a',1): ('b',1,'c',1),
     }

There is no separate data structure for our primary key index. The index is the first tuple in our storage map. Index lookups are point lookups by default: we access rows directly using the map key (i.e. the primary key). Range reads are efficient because we identify bounds with the primary key and sequentially iterate neighboring rows.
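To make the clustered layout concrete, here is a minimal Python sketch (a toy model, not Dolt's actual storage code) of the map above as a structure sorted by primary key, with binary search standing in for tree traversal:

```python
from bisect import bisect_left

# Hypothetical model of a clustered index: rows live in a list sorted
# by primary key, mirroring the map {('a',0): ('b',0,'c',0), ...}.
table = [
    ((0,), (0, 0, 0)),
    ((1,), (1, 1, 1)),
]

def point_lookup(table, pk):
    # Binary search on the map key (the primary key) reaches the row
    # directly -- no extra indirection.
    i = bisect_left(table, ((pk,),))
    if i < len(table) and table[i][0] == (pk,):
        return table[i][1]
    return None

def range_scan(table, lo, hi):
    # Identify bounds with the primary key, then iterate neighboring
    # rows sequentially.
    start = bisect_left(table, ((lo,),))
    end = bisect_left(table, ((hi + 1,),))
    return [row for _, row in table[start:end]]
```

The names and list-based layout are illustrative; the point is that the row sits directly at the end of a key search, and a range is a contiguous slice.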

Keyless Tables

Keyless tables write extra metadata to compensate for the lack of a primary key. A row's hash is its storage key, and the value tuple includes row cardinality (duplication factor):

     Map<Union<>,Union<>> - map {
       (2251799813690248,34a469f8-1447-a00b-b33e-7386544e4f5a): (2251799813690249,1,6546,0,1134,0,12666,0),
       (2251799813690248,d3df65c6-c47b-23fc-63ef-5ffa9fe594e4): (2251799813690249,2,6546,1,1134,1,12666,1),
     }

The row is at the end of the line: (6546,0,1134,0,12666,0) -> ('a',0,'b',0,'c',0). This (2251799813690248,34a469f8-1447-a00b-b33e-7386544e4f5a) is the MD5 hash of the row (('hash', 34a4...)), and this (2251799813690249,2) is the row cardinality (('card',2)).

The row hash is the only lookup key into a keyless table. Queries like select * from kl where a = 0 can only be answered with a full table scan.
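The same layout can be sketched in Python (again a toy model rather than Dolt's implementation): the storage key is a hash of the row, the value carries the cardinality, and the only way to answer a filter is to visit every entry:

```python
import hashlib

def row_hash(row):
    # Dolt hashes the encoded row tuple; hashing repr() is a stand-in.
    return hashlib.md5(repr(row).encode()).hexdigest()

kl = {}  # row hash -> (cardinality, row)

def insert(table, row):
    card, _ = table.get(row_hash(row), (0, row))
    table[row_hash(row)] = (card + 1, row)  # duplicates bump cardinality

def table_scan(table, predicate):
    # No primary key means no point lookups: every entry is visited,
    # and cardinality expands duplicates back into individual rows.
    out = []
    for card, row in table.values():
        if predicate(row):
            out.extend([row] * card)
    return out

for row in [(0, 0, 0), (1, 1, 1), (1, 1, 1)]:
    insert(kl, row)
```

Inserting (1,1,1) twice stores one entry with cardinality 2, and a scan for a = 1 yields it back as two rows.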

Secondary Indexes

We looked at row level record storage in the last section. Now we turn our attention to secondary index structures.

Adding a secondary index to our primary key table stores a new mapping from b, the lookup key, to a, the primary key:

$ dolt sql -q "alter table pk add index b_idx (b)"

We save ('b',0,'a',0) and ('b',1,'a',1), ordered in the lookup direction.

     Map<Union<>,Union<>> - map {
       (8313,0,1821,0): (),
       (8313,1,1821,1): (),
     }

The new mapping facilitates lookups from b -> a via b_idx, and then a -> row through the primary key.
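A toy Python version of this two-step lookup (hypothetical names, not Dolt internals) looks like:

```python
# The clustered map stores a -> row; the secondary index b_idx stores
# (b, a) keys with empty values, ordered in the lookup direction.
clustered = {0: (0, 0, 0), 1: (1, 1, 1)}
b_idx = {(0, 0): (), (1, 1): ()}

def lookup_by_b(b):
    # Step 1: b -> a via the index entries whose first field matches.
    pks = [a for (bv, a) in b_idx if bv == b]
    # Step 2: a -> row through the primary key map.
    return [clustered[a] for a in pks]
```

The index entry carries no value of its own; the primary key embedded in the index key is enough to finish the lookup in the clustered map.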

Using the same design, a keyless secondary index maps from a lookup value to the keyless row hash:

     Map<Union<>,Union<>> - map {
       (6546,0,2251799813690248,34a469f8-1447-a00b-b33e-7386544e4f5a): (2251799813690249,1),
       (6546,1,2251799813690248,d3df65c6-c47b-23fc-63ef-5ffa9fe594e4): (2251799813690249,2),
     }

The index includes the lookup value, hash, and cardinality: (a=0, hash=34a4...): (card=1). When executing select * from kl where a = 0, the search traverses a -> hash via the a_idx, and hash -> row via primary row storage. This is not as fast as a primary key, but we can now do point lookups into keyless tables.
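Sketching the keyless path the same way (a toy model with illustrative names), the traversal is a -> hash via the index, then hash -> row via row storage, with cardinality expanding duplicates:

```python
import hashlib

def row_hash(row):
    # Stand-in for hashing the encoded row tuple.
    return hashlib.md5(repr(row).encode()).hexdigest()

# Row storage: hash -> (cardinality, row)
rows = {
    row_hash((0, 0, 0)): (1, (0, 0, 0)),
    row_hash((1, 1, 1)): (2, (1, 1, 1)),
}
# Secondary index a_idx: (lookup value, hash) -> cardinality
a_idx = {
    (0, row_hash((0, 0, 0))): 1,
    (1, row_hash((1, 1, 1))): 2,
}

def lookup_by_a(a):
    out = []
    for (av, h), card in a_idx.items():
        if av == a:
            # Follow the hash into row storage and expand duplicates.
            out.extend([rows[h][1]] * card)
    return out
```

One extra hop compared to the primary key path, but still a point lookup rather than a scan.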

Trade-offs and Considerations

We conclude by highlighting some edge cases in practice. First, indexed deletes are pathological without storing cardinality in the index. Executing delete from kl_idx where a = 1 would leave the database in an inconsistent state if we deleted the index entry but only one copy of the row (leaving the row with card = 1). To address this, we need to surface two lookups when the cardinality is two, and n lookups for card = n. The correct behavior hits the row twice, reducing the cardinality to zero. Storing cardinality in the index lets us do indexed deletes correctly.
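A toy sketch of a cardinality-aware delete (same illustrative layout as above, not Dolt's code) shows why the count must live in the index: it tells us how many deletes to surface against row storage.

```python
import hashlib

def row_hash(row):
    return hashlib.md5(repr(row).encode()).hexdigest()

rows = {  # hash -> (cardinality, row)
    row_hash((0, 0, 0)): (1, (0, 0, 0)),
    row_hash((1, 1, 1)): (2, (1, 1, 1)),
}
a_idx = {  # (lookup value, hash) -> cardinality
    (0, row_hash((0, 0, 0))): 1,
    (1, row_hash((1, 1, 1))): 2,
}

def delete_where_a(a):
    for (av, h), card in list(a_idx.items()):
        if av == a:
            del a_idx[(av, h)]
            count, row = rows[h]
            count -= card  # surface card deletes, not just one
            if count == 0:
                del rows[h]  # all duplicates gone
            else:
                rows[h] = (count, row)
```

Calling delete_where_a(1) removes the index entry and both duplicate rows in one pass, leaving no orphaned card = 1 row behind.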

Second, structural sharing works best when keys change infrequently. This is true for row keys as well as index keys. If we store index cardinality next to the lookup field and hash, we lose structural sharing and amplify writes on duplicate inserts. Placing the cardinality as the secondary index map value (separate from the other fields) minimizes unnecessary duplication.

Future

Migrations for libraries like SQLAlchemy include conversions between clustered and nonclustered indexes. In a version-controlled database, there is no structural sharing between equivalent primary key and keyless tables. Diffing and merging in particular will have special constraints across table migrations.

Summary

Secondary indexes on keyless tables are a way to create a nonclustered index table. The main difference between clustered and nonclustered tables is an extra level of indirection. Data and index are decoupled in a nonclustered index, requiring an extra lookup to reach data.

Secondary indexes are structured the same for clustered and nonclustered tables. Primary key tables have a default index that simplifies storage and search. Keyless rows generate overhead to compensate for the absence of a primary key. This overhead complicates keyless rows and their secondary indexes in the same way.

You might not use this feature directly (we recommend setting primary keys when possible), but your ORM and migration software probably does!

If you are interested in learning more about relational databases, want to try out Dolt, or have feedback please reach out on our Discord! We ship new features and integrations for users every month, and would love to hear about your experiences with databases.
