Three Ways to Import Data into Dolt


Dolt is the first database that versions data like Git versions code. We focused on a few key areas for improving Dolt this year: 1) sharpening MySQL feature parity, 2) smoothing the first-hour experience, and 3) chasing MySQL's performance. Today we will talk about importing data into Dolt, which for some users is both a performance bottleneck and a negative first-hour experience. This blog discusses what we learned building a new system for measuring import performance, and shares improved intuitions about which imports are fast, which are slow, and why.

We start with a survey of different import options before digging into the results of our benchmarks.


Bulk Loading Data into Dolt

An "import" loads data from an external format into a database. We bulk insert data when restoring databases from dumps, converting a MySQL database to a Dolt database, converting a set of CSV files into a database, or simply adding a large set of changes into a preexisting database.

There are three ways to achieve the same result: SQL dumps (i.e. dolt sql < dump.sql), dolt table import, and LOAD DATA INFILE.

SQL Dump

The first way to import data simply runs a SQL script:

$ dolt sql < dump.sql

dolt sql runs every statement in dump.sql when fed through standard input. Queries in a SQL script do not have to be inserts, but an import dump.sql will look something like this:

create table xy (
    x int primary key,
    y varchar(32)
);
insert into xy values
    (1, '1'),
    (2, '2'),
    (3, '3'),
    (4, '4');
insert into xy values
    (5, '5'),
    (6, '6'),
    (7, '7'),
    (8, '8');

A dump defines and then inserts data into tables. Statements are separated by semicolons for parsing, and by newlines for convenience.
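The semicolon-splitting step can be sketched naively in a few lines (a real SQL parser must also handle semicolons inside string literals and comments, so this is illustration only):

```python
# Naive sketch of statement splitting: break on semicolons and drop
# blank fragments. Not safe for semicolons inside string literals.
def split_statements(script: str) -> list[str]:
    return [s.strip() for s in script.split(";") if s.strip()]

dump = "create table xy (x int primary key, y varchar(32));\n" \
       "insert into xy values (1, '1'), (2, '2');"
for stmt in split_statements(dump):
    print(stmt)
```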

All together, the import looks something like this:

$ dolt init
$ dolt sql < dump.sql
Query OK, 4 rows affected (0.00 sec)
Query OK, 4 rows affected (0.00 sec)
$ dolt sql -q "select * from xy"
+---+---+
| x | y |
+---+---+
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
| 5 | 5 |
| 6 | 6 |
| 7 | 7 |
| 8 | 8 |
+---+---+

If you prefer strings rather than files, the batch (-b) flag uses a slightly different syntax for the same effect:

$ dolt sql -b -q "
insert into xy values (9, '9');
insert into xy values (10, '10');
"

Table import

The second way to bulk load data uses a Dolt command line function: dolt table import. I usually divide this path into two steps. First, I create the table schema:

dolt sql -q "create table xy (x int primary key, y varchar(32));"

dolt table import can also infer schemas, but I prefer the SQL-native format for setting a schema's columns, primary key, and any indexes.

The second step either -c creates, -u updates, or -r replaces the table. These three will have the same result on our empty table, but a replace will be repeatable on a non-empty table:

$ cat data.csv
x,y
1,1
2,2
3,3
4,4
5,5
6,6
7,7
8,8
$ dolt table import -r xy data.csv
Rows Processed: 8, Additions: 8, Modifications: 0, Had No Effect: 0
Import completed successfully.
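A loose dictionary analogy for the three modes (not Dolt's implementation, just the semantics of -c, -u, and -r, with the table modeled as a map from primary key to row):

```python
# Loose analogy: "table" maps primary key -> row; "rows" is the file data.
def import_create(rows):
    # -c: build a brand new table from the file
    return dict(rows)

def import_update(table, rows):
    # -u: upsert incoming rows into the existing table
    out = dict(table)
    out.update(rows)
    return out

def import_replace(table, rows):
    # -r: drop the old contents, keep the schema, load the file
    return dict(rows)

existing = {1: "old"}
print(import_update(existing, {1: "new", 2: "2"}))
print(import_replace(existing, {2: "2"}))
```

This is why -r is the repeatable choice on a non-empty table: its result depends only on the incoming file, never on what was there before.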

LOAD DATA INFILE

The previous methods import data using Dolt's command line interface. The final method requires a running server and a client connection. The LOAD DATA command works on both Dolt and MySQL provided the server has been configured to allow it, which is not the default in either (see the docs for details).

First, we start a server in a new directory that holds our data file:

$ mkdir mydb
$ cd mydb
$ cat > data.csv <<EOF
x,y
1,1
2,2
3,3
4,4
5,5
6,6
7,7
8,8
EOF
$ dolt sql-server
Starting server with Config HP="localhost:3306"|T="28800000"|R="false"|L="info"|S="/tmp/mysql.sock"

Next, we connect to this database. We whitelist the local directory /Users/max-hoffman/mydb for load-data access when establishing the client connection:

$ mysql -u root -h localhost --port 3306 --load-data-local-dir /Users/max-hoffman/mydb mydb

Inside the SQL shell, we will reconfigure the server to permit LOAD DATA INFILE commands, which requires root authorization:

mysql> SET GLOBAL local_infile=1;
Query OK, 1 row affected (0.00 sec)

Finally, we can create our table and import data.csv:

mysql> create table xy (x int primary key, y varchar(32));
Query OK, 0 rows affected (0.02 sec)

mysql> LOAD DATA LOCAL INFILE '/Users/max-hoffman/mydb/data.csv'
       INTO TABLE xy
       FIELDS TERMINATED BY ',' ENCLOSED BY ''
       LINES TERMINATED BY '\n'
       IGNORE 1 LINES;
Query OK, 8 rows affected (0.01 sec)

My file is located at /Users/max-hoffman/mydb/data.csv, and the additional configuration indicates 1) the fields are separated by commas, 2) rows are separated by line breaks, and 3) we should not import the first (header) row.
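For intuition, the same parsing rules can be mirrored with Python's csv module (hypothetical in-memory data standing in for data.csv):

```python
import csv
import io

# FIELDS TERMINATED BY ',' -> delimiter=","
# ENCLOSED BY ''           -> no quote character expected
# LINES TERMINATED BY '\n' -> default line handling
# IGNORE 1 LINES           -> skip the header row
data = "x,y\n1,1\n2,2\n3,3\n"
reader = csv.reader(io.StringIO(data), delimiter=",")
next(reader)  # IGNORE 1 LINES: drop the header
rows = [(int(x), y) for x, y in reader]
print(rows)  # [(1, '1'), (2, '2'), (3, '3')]
```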

MySQL has a wealth of options to configure permissions and execution of LOAD DATA. If you find Dolt missing a specific configuration option or pathway, let us know and we will be happy to add it.

Benchmarking Performance

The Good

The good news is that Dolt trails MySQL's import performance by roughly the same margin as the rest of our benchmarks, about 2x. This number will improve naturally as we make all of Dolt faster. But there is room to tighten imports independently if the 2x ends up bottlenecking users!

test_name        detail         row_cnt  sorted  mysql_time  sql_mult  cli_mult
col type         datetime        200000       1        0.87      2.21      2.31
col type         varchar         200000       1        0.75      2.40      2.77
config width     2 cols          200000       1        0.81      1.73      1.80
config width     32 cols         200000       1        2.07      1.69      2.43
config width     8 cols          200000       1        1.02      1.78      2.47
pk type          float           200000       1        1.02      1.77      1.49
pk type          int             200000       1        0.83      1.76      1.75
pk type          varchar         200000       1        1.72      1.92      2.03
row count        1.6mm          1600000       1        6.00      2.00      2.10
row count        400k            400000       1        1.54      1.90      1.99
row count        800k            800000       1        3.02      1.95      2.06
secondary index  four index      200000       1        4.52      1.43      1.51
secondary index  no secondary    200000       1        0.97      1.72      1.96
secondary index  one index       200000       1        1.31      1.76      1.89
secondary index  two index       200000       1        2.44      1.52      1.59
sorting          shuffled 1mm   1000000       0        6.65      1.90      1.94
sorting          sorted 1mm     1000000       1        6.26      2.03      2.04

Each row is a particular import configuration described by the test name, detail, row count, and a sorted flag. The next column is the MySQL time in seconds to perform the import. The last two columns are Dolt time multipliers for LOAD DATA INFILE over a SQL connection (sql_mult) and dolt table import on the command line (cli_mult). For example, the 1.6 million row count test takes 6.0 seconds in MySQL, about 12.0 seconds in Dolt via SQL, and about 12.6 seconds with the Dolt CLI.
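To recover absolute Dolt timings from the table, multiply the MySQL baseline by the multiplier columns; a quick sketch using the 1.6mm row:

```python
# dolt_time = mysql_time * multiplier, using the "row count 1.6mm" row.
mysql_time, sql_mult, cli_mult = 6.0, 2.0, 2.1

dolt_sql_time = mysql_time * sql_mult  # LOAD DATA INFILE over SQL
dolt_cli_time = mysql_time * cli_mult  # dolt table import on the CLI
print(f"sql: {dolt_sql_time:.1f}s, cli: {dolt_cli_time:.1f}s")
```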

The Bad

Both MySQL and Dolt fall short importing a certain kind of dump file. Consider a "batched" import:

insert into xy values
    (1, '1'),
    (2, '2'),
    (3, '3'),
    (4, '4');

versus an "unbatched" import:

insert into xy values (1, '1');
insert into xy values (2, '2');
insert into xy values (3, '3');
insert into xy values (4, '4');

Batching runs 30x faster in MySQL and 130x faster in Dolt, even though the files themselves are almost the same. Moving a row to disk is fast, but Dolt makes strict isolation and durability guarantees between queries. Improving the lifecycle machinery to streamline transaction throughput is hard.

But don't fret! We created a helper tool with a workaround. This insert batcher reduces the number of transaction setup/teardowns by combining insert queries.

After installing Go we can create a batched file:

$ go version
go version go1.19 darwin/amd64

$ go install github.com/max-hoffman/insert-batcher@latest

$ cat dump.sql
create table xy (x int primary key, y int);
insert into xy values (1, '1');
insert into xy values (2, '2');
insert into xy values (3, '3');
insert into xy values (4, '4');
insert into xy values (5, '5');
insert into xy values (6, '6');
insert into xy values (7, '7');
insert into xy values (8, '8');

$ go run github.com/max-hoffman/insert-batcher@latest \
    -in dump.sql \
    -out batch.sql \
    -b 4

$ cat batch.sql
create table xy (x int primary key, y int);
insert into xy values (1, '1'), (2, '2'), (3, '3'), (4, '4');
insert into xy values (5, '5'), (6, '6'), (7, '7'), (8, '8');

The default batching factor -b of 500 should amortize the transaction overhead for most imports.
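The core transform can be sketched like this (a hypothetical simplification of the tool: consecutive single-row INSERTs into the same table are merged into one multi-row INSERT of at most batch_size rows; the real tool parses SQL more robustly):

```python
import re

# Matches a single-row insert: captures the table name and value tuple.
INSERT_RE = re.compile(r"insert into (\w+) values (\(.*\));", re.IGNORECASE)

def batch_inserts(lines, batch_size=500):
    out, table, values = [], None, []

    def flush():
        nonlocal table, values
        if values:
            out.append(f"insert into {table} values {', '.join(values)};")
        table, values = None, []

    for line in lines:
        m = INSERT_RE.match(line.strip())
        if not m:
            flush()  # non-insert statement: emit any pending batch first
            out.append(line.rstrip("\n"))
            continue
        if m.group(1) != table or len(values) >= batch_size:
            flush()  # table changed or batch is full
            table = m.group(1)
        values.append(m.group(2))
    flush()
    return out

dump = [
    "create table xy (x int primary key, y int);",
    "insert into xy values (1, '1');",
    "insert into xy values (2, '2');",
    "insert into xy values (3, '3');",
    "insert into xy values (4, '4');",
    "insert into xy values (5, '5');",
]
for stmt in batch_inserts(dump, batch_size=4):
    print(stmt)
```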

server  detail       row_cnt   time
dolt    batch sql      10000   0.14
dolt    by line sql    10000  17.96
mysql   batch sql      10000   0.10
mysql   by line sql    10000   3.58
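Working back from this 10,000-row run, the measured ratios land near the 130x and 30x figures quoted above:

```python
# Batched vs. per-line import times from the table above (seconds).
dolt_by_line, dolt_batch = 17.96, 0.14
mysql_by_line, mysql_batch = 3.58, 0.10

print(f"dolt speedup:  {dolt_by_line / dolt_batch:.0f}x")   # ~128x
print(f"mysql speedup: {mysql_by_line / mysql_batch:.0f}x") # ~36x
```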

The Ugly

Blob stands for "binary large object", which in MySQL corresponds to the TEXT, JSON, and BLOB column types. We distinguish between blobs and VARCHAR for sizing reasons. A block of memory typically contains a range of rows, but a blob typically uses several blocks' worth of memory on its own. Instead of disrupting the compactness of the regular tree, we write blobs to separate storage. Data blocks contain rows of mostly inlined values, and reference blob files written elsewhere.

This is an example of a schema that will be written inline. All fields are laid out next to each other in row-oriented storage.

create table xy (
  x int primary key,
  y varchar(30),
  z varchar(30),
  w varchar(30)
);

This schema writes every y field as a separate file:

create table xy (
  x int primary key,
  y text,
  z varchar(30),
  w varchar(30)
);

When we started this work, the blob schema was more than 10x slower on import!

detail   row_cnt   time
no blob   200000   1.45
1 blob    200000  18.43
2 blobs   200000  61.77

There are some good reasons why blobs take longer to write than VARCHAR. We write a blob as a tree, not a linear sequence of bytes. The blob tree nodes are pooled with the rest of the table's tree nodes, so blobs usually expand the total node count by a factor of 10-100x. If your TEXT fields fit in a VARCHAR, you are always better off sticking with VARCHAR!
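A rough model of the node-count blowup, assuming fixed ~4KB nodes (Dolt actually uses content-defined chunking with variable node sizes, so these numbers are only illustrative):

```python
# Illustrative only: assume ~4KB tree nodes. A VARCHAR value is inlined
# in its row's leaf node; a TEXT value becomes its own subtree of nodes.
NODE_SIZE = 4096

def blob_node_count(value_len: int) -> int:
    # leaf nodes holding the bytes, plus one root if more than one leaf
    leaves = max(1, -(-value_len // NODE_SIZE))  # ceiling division
    return leaves if leaves == 1 else leaves + 1

# A 64KB TEXT value alone needs 17 nodes; 200k such rows add millions
# of nodes to a table whose inline rows fit in far fewer.
print(blob_node_count(64 * 1024))  # 16 leaves + 1 root = 17
```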

Even so, many MySQL dumps include TEXT columns that we need to support for compatibility. We discovered some not-so-good reasons why blobs were slow, and managed to streamline this edge case in a recent PR:

detail   row_cnt   time
no blob   200000   1.25
1 blob    200000   2.23
2 blobs   200000   2.64

Stay tuned for a follow-up blog about how we optimized blob import throughput by 10x.

Summary

We gave a survey of how to import data into Dolt. We discussed how Dolt's import speed compares to MySQL's, which paths are fast, which are prone to foot-guns, and which are slow and require more consideration.

Next time, we will follow up with a performance-focused blog detailing how we streamlined a particularly hairy edge case: TEXT column imports.

We track import performance on each release. If anyone finds gaps in our testing or has particular edge cases they would like us to address, we encourage you to reach out on Discord, GitHub, or Twitter!
