Benchmarking Dolt's Import Performance

SQL
5 min read

Introduction

If you want to work with a database, you probably want to get some data into it first. At Dolthub, we've spent a good chunk of time working on our table import feature which lets you import CSV, JSON, XLSX, and Parquet data into Dolt. In the past couple of months, we've added various improvements to our import path by reducing garbage size and increasing correctness. As our team has experimented with larger databases and as our users have gotten deeper with our product, we received complaints that imports were getting slower with very large CSV files (>10m rows). We weren't sure just how fast users should expect import to be. How much slower are Dolt's imports compared to other databases like MySQL? We've had success using sysbench to guide users on query performance. We think a similar benchmark would work well for tracking import performance. In this blog post, we'll discuss the benchmarking setup we put together to evaluate Dolt's performance vis-a-vis MySQL and some potential performance optimizations.

The Setup

We considered the following set of independent variables:

  • The schema definition
  • The number of rows in each file
  • The key sort order of each file

We measured the following dependent variables:

  • The amount of time the import took to complete
  • The Rows per Second imported by the table import command

We decided to generate 6 different CSV files according to the following schema:

CREATE TABLE `test` (
  `pk` int NOT NULL,
  `c1` bigint DEFAULT NULL,
  `c2` char(1) DEFAULT NULL,
  `c3` datetime DEFAULT NULL,
  `c4` double DEFAULT NULL,
  `c5` tinyint DEFAULT NULL,
  `c6` float DEFAULT NULL,
  `c7` varchar(255) DEFAULT NULL,
  `c8` varbinary(255) DEFAULT NULL,
  `c9` text DEFAULT NULL,
  PRIMARY KEY (`pk`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_bin;
  • A CSV file with 100k rows in sorted key order
  • A CSV file with 100k rows in random key order
  • A CSV file with 1m rows in sorted key order
  • A CSV file with 1m rows in random key order
  • A CSV file with 10m rows in sorted key order
  • A CSV file with 10m rows in random key order

These files are loaded into both Dolt and MySQL database instances in a containerized environment running in a Dedicated Kubernetes Node. This process is very similar to what we wrote about in our previous blog about performance-benchmarking.

Results

Below is a chart comparing MySQL version 8.0.22 to our most recent Dolt version of 0.40.19. We are about 10x slower than MySQL with bulk imports.

|      name       | program | version | from_time | from_rps | program | version | to_time | to_rps  | rps_multiplier |
|-----------------|---------|---------|-----------|----------|---------|---------|---------|---------|----------------|
| 100k-random.csv | dolt    | 0.40.19 | 13s       | 7528.9   | mysql   | 8.0.22  | 1s      | 71466.8 | 9.5            |
| 100k-sorted.csv | dolt    | 0.40.19 | 13s       | 7561.9   | mysql   | 8.0.22  | 1s      | 74863.4 | 9.9            |
| 1m-random.csv   | dolt    | 0.40.19 | 2m31s     | 6617.3   | mysql   | 8.0.22  | 13s     | 74216.1 | 11.2           |
| 1m-sorted.csv   | dolt    | 0.40.19 | 2m30s     | 6646.1   | mysql   | 8.0.22  | 12s     | 81445.6 | 12.3           |
| 10m-random.csv  | dolt    | 0.40.19 | 27m55s    | 5971.5   | mysql   | 8.0.22  | 2m16s   | 73594.2 | 12.3           |
| 10m-sorted.csv  | dolt    | 0.40.19 | 28m1s     | 5950.1   | mysql   | 8.0.22  | 2m15s   | 73986.3 | 12.4           |

These results were a little surprising to us. When we look at our sysbench write performance results we typically found ourselves at around 6x slower than MySQL. Why was MySQL LOAD DATA so much better than our import path?

I started profiling dolt table import and investigating MySQL's implementation of LOAD DATA to get a sense of what was going on. Here's an example Pprof that I worked with.

pprof

If you've studied GoLang performance profiles before you'll probably notice some pretty big problems in this profile.

  1. We use two different encodings of a SQL "row". One is a query engine specific implementation and the other is storage engine specific. There is overhead in converting between those two types as indicated by the usage of DoltKeyValueMappingFromSqlRow
  2. We perform a substantial amount of type conversions. File rows are read in as strings and then converted into both of our row formats. The import tool itself spent nearly 5 minutes doing time.Parse to interpret our datetime strings.
  3. Memory allocation is exacerbated by an unoptimized row implementation that also requires a wide amount of type conversions when performing engine operations.

From an implementation perspective, we're also missing some features that can improve our performance.

  • Chunking the relevant import file and leveraging concurrency when reading and writing it.

  • Allowing for Prolly tree merge operations. Right now our Prolly tree can only be materialized by one thread. If we could concurrently manifest and merge Prolly trees we could have many concurrent threads writing to the storage engine versus the single-threaded import process. MySQL does something similar with B-Trees where they can have multiple threads manifesting different B-Trees that then get merged.

Conclusion

Dolt is well on its way to becoming a serious 1.0 database that is useful in any engineering organization. As we mature, we are going to spend more resources creating benchmarks ranging from systems tables to our aggregation queries. And of course, we'll make these benchmarks faster :). If database performance is your thing feel free to hop on our Discord and learn more!

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt

Or join our mailing list to get product updates.