Keyless Tables in Dolt


Dolt is a tool built for collaboration and data distribution: it's Git for Data. Git versions files; Dolt versions tables. Today, we're announcing support for keyless tables in Dolt.

Strongly typed schemas are both the best and the worst part of relational databases. Their structure allows for a rich query language and helps enforce data validity. However, their rigidity can be a barrier to ingesting data. Enforcing constraints on data types and table membership provides certainty to downstream users, at the cost of data cleaning and validation for data producers. Perhaps the most fundamental constraint is the PRIMARY KEY: the requirement that each row in a table have a unique identifier. Some RDBMSes support tables without primary keys, but in order to support its version control capabilities, Dolt previously required every table to define a primary key. We've now removed this requirement and support both keyed and keyless tables.

"Write queries not code"

Version control and SQL functionality make Dolt a powerful tool for accessing and working with datasets. Users can simply dolt clone dolthub/us-president-precinct-results and immediately access the data via a SQL interface. The hard work comes up front, though. Modeling a dataset into a relational schema, validating its contents, and defining constraints is time-consuming. A big part of the challenge is writing the ETL code to prepare the data for import. It's boring, tedious work, and when you succeed, you often throw that code away. We have a saying at DoltHub: "Write queries not code". Most of this error-prone ETL work can actually be accomplished in SQL, which is far more intuitive to write. Keyless tables are the perfect tool for this workflow because any CSV can be imported into Dolt without a key. dolt table import will even infer column types for numeric and chronological data. If a type can't be determined, the data will still be imported as a string type. Let's take a look at an example. We'll import example.csv and then transform it into the table we want.

% dolt table import -c mytable example.csv
  Rows Processed: 6, Additions: 6, Modifications: 0, Had No Effect: 0
  Import completed successfully.
% dolt schema show mytable
  mytable @ working
  CREATE TABLE `mytable` (
    `Name` longtext,
    `Age` longtext NOT NULL,
    `Breed` longtext NOT NULL
  ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
% dolt sql -q "SELECT * FROM mytable;"
  +--------+-----+-----------+
  | Name   | Age | Breed     |
  +--------+-----+-----------+
  | Fido   | 5   | terrier   |
  | Pongo  | 99  | dalmation |
  | NULL   | 8   | boxer     |
  | Spot   | 11  | labrador  |
  | Echo   | *** | beagle    |
  | Lassie | 6   | collie    |
  +--------+-----+-----------+

It appears that the CSV contains data describing dogs, but it also contains some errors. Schema inference failed to interpret Age as a numeric type because of the error in the fifth row, but we can see this column should be an int. We can also see an Age value of 99 that is certainly incorrect. Let's create a new table with the correct schema:

% dolt sql
# Welcome to the DoltSQL shell.
# Statements must be terminated with ';'.
# "exit" or "quit" (or Ctrl-D) to exit.
dolt> CREATE TABLE puppers (
   ->   Name varchar(20),
   ->   Age int,
   ->   Breed varchar(20),
   ->   PRIMARY KEY (Name)
   -> );

From here we can incrementally clean the data and insert it into the new table:

% dolt sql
dolt> DELETE FROM mytable WHERE Age NOT REGEXP '^[0-9]+$';
  Query OK, 1 row affected
dolt> DELETE FROM mytable WHERE Age > 20;
  Query OK, 1 row affected
dolt> DELETE FROM mytable WHERE name IS NULL;
  Query OK, 1 row affected
dolt> INSERT INTO puppers SELECT name, age, breed FROM mytable;
  Query OK, 3 rows affected
dolt> SELECT * FROM puppers;
  +--------+-----+----------+
  | Name   | Age | Breed    |
  +--------+-----+----------+
  | Fido   | 5   | terrier  |
  | Spot   | 11  | labrador |
  | Lassie | 6   | collie   |
  +--------+-----+----------+
dolt> DROP TABLE mytable;

We now have a correct dataset without a single line of traditional ETL! Simply dolt commit -am "initial import" and dolt push -u origin master, and anyone can clone and use this data.

Maps and Multisets

Let's dig into the implementation details of keyless tables and how they differ from primary-key tables. Tables in Dolt are implemented on top of a key-value storage layer where keys and values are both byte arrays. For primary-key tables, the key array is a tuple of the primary-key fields, and the value array is a tuple of the non-primary-key fields. This is quite similar to a traditional relational database persisted with a B-tree storage engine. In particular, MySQL's InnoDB storage engine uses these same Map<Tuple,Tuple> semantics.
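As a rough illustration of these Map<Tuple,Tuple> semantics, here is a minimal Python sketch of a keyed table. This is hypothetical and simplified: Dolt's actual storage layer serializes tuples to byte arrays inside a Prolly-tree, not a Python dict.

```python
# Hypothetical sketch of Map<Tuple, Tuple> semantics for a primary-key
# table, e.g. puppers(Name PRIMARY KEY, Age, Breed). The key is the tuple
# of primary-key fields; the value is the tuple of the remaining fields.
table = {}

def put_row(name, age, breed):
    # Writing the same key again replaces the row: the primary key
    # guarantees at most one row per key tuple.
    table[(name,)] = (age, breed)

put_row("Fido", 5, "terrier")
put_row("Spot", 11, "labrador")
put_row("Fido", 6, "terrier")  # same key: an update, not a duplicate row

assert table[("Fido",)] == (6, "terrier")
assert len(table) == 2
```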

Instead of a B-tree, however, row data in Dolt is persisted in a Prolly-tree, a cross between a B-tree and a Merkle tree. Prolly-trees are the foundation of Dolt's version control system. They support efficient diff and merge algorithms at the storage layer, allowing Dolt's branching model to scale to hundreds of gigabytes. Diff and merge match rows by key tuple to determine which rows have changed between two versions of a table. The need for rows to have a consistent identity is what motivated the requirement that tables define a primary key.
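The matching-by-key idea behind diff can be sketched as follows. This is a simplified, linear illustration: a real Prolly-tree diff compares subtree hashes and skips identical subtrees rather than scanning every row.

```python
# Illustrative sketch: match rows by key tuple across two revisions of a
# keyed table, classifying each row as added, removed, or modified.
def diff(old, new):
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    modified = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return added, removed, modified

# Two revisions: Rex was deleted, Spot was added, and Fido was updated.
v1 = {("Fido",): (5, "terrier"), ("Rex",): (3, "boxer")}
v2 = {("Fido",): (6, "terrier"), ("Spot",): (11, "labrador")}
added, removed, modified = diff(v1, v2)

assert added == {("Spot",): (11, "labrador")}
assert removed == {("Rex",): (3, "boxer")}
assert modified == {("Fido",): ((5, "terrier"), (6, "terrier"))}
```

Because the key tuple gives each row a stable identity, a change to Fido's Age shows up as a cell-wise modification rather than an unrelated delete and insert.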

Supporting both SQL and version-control functionality for keyless tables was a minimum requirement for their design. Without a unique key to identify rows, we need some other means of comparing table revisions. In MySQL, keyless tables are implemented using a hidden row id as the primary key. The hidden row id increments on each insert, storing the rows in insertion order. Mimicking this design in Dolt would support SQL functionality, but it would complicate diffing and merging tables: assigning an extrinsic primary key to each row leaves us with no way to match rows across revisions.

After much discussion, we chose to implement keyless tables using multiset semantics. Prolly-trees storing keyless tables are defined as Map<Tuple,Int>, where the row's fields form the key and the value is the number of duplicate copies of the row. By using the row itself as the key, we maintain the ability to efficiently match rows across table revisions. Multiset semantics also mean that we maintain history-independence at the storage layer: two tables with the same contents will always have the same on-disk representation, regardless of what operations were used to create the current table state. A direct trade-off of this history-independence is that we can't track a table's insertion order as MySQL does. Another sacrifice of multiset semantics is that we can't track updates to individual rows. In primary-key tables we know the identity of each row and can perform cell-wise diffs of UPDATE queries. In keyless tables, UPDATEs are converted into a pair of DELETE and INSERT operations.
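Here is a simplified Python sketch of those multiset semantics, including the DELETE-plus-INSERT decomposition of an UPDATE. Again, this is illustrative only: Dolt stores the counts inside a Prolly-tree, not a Counter.

```python
from collections import Counter

# Hypothetical sketch of multiset (Map<Tuple, Int>) semantics for a
# keyless table: the whole row is the key, and the value counts how many
# duplicate copies of that row exist.
table = Counter()

def insert(row):
    table[row] += 1

def delete(row):
    if table[row] > 0:
        table[row] -= 1
        if table[row] == 0:
            del table[row]  # keep the map free of zero counts

def update(old_row, new_row):
    # With no row identity, an UPDATE decomposes into DELETE + INSERT.
    delete(old_row)
    insert(new_row)

insert(("Fido", 5, "terrier"))
insert(("Fido", 5, "terrier"))  # duplicates are allowed and counted
update(("Fido", 5, "terrier"), ("Fido", 6, "terrier"))

assert table[("Fido", 5, "terrier")] == 1
assert table[("Fido", 6, "terrier")] == 1
```

Note that the same final table contents always produce the same map, no matter the order of inserts and deletes, which is exactly the history-independence property described above.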

Conclusion

Keyless tables are a big step forward for Dolt in both utility and MySQL compatibility. They are one of many features that allow data to be easily ingested and incrementally structured within a SQL context. Working with data is hard, especially when you're stuck in a world of CSVs and fragile ETL pipelines. We believe Dolt is the answer. Every change is tracked, every row is shareable. If you have any questions or feedback, join us on our Discord and let us know!
