JSON Showdown: Dolt vs Postgres
Dolt is the first truly version controlled database. It's "Git, but for SQL data." But it's not just Git for SQL data. The same techniques Dolt uses for efficient version control also make it really good at handling JSON. That's why I've been advocating for Dolt as a competitive option for SQL engines that need to store and index large JSON documents, even if you don't care about version control. (You really should care about version control though.)
I've already talked about how Dolt represents JSON, and I won't repeat all the details here. But the core concept is that Dolt builds a partial index on JSON documents that allows the engine to quickly find a specific part of the document without having to load the entire document into memory. Storing this index in a prolly tree lets Dolt also efficiently perform mutation operations on these documents, and even make copies with minor changes while only requiring storage for the parts that changed. This allows us to achieve fast performance while keeping the document in the standard JSON format, eliminating the need for costly conversion operations when the document is returned from queries or exported from the database.
And it's not all talk: last time I put Dolt head to head against MySQL and showed how Dolt beats MySQL at pretty much everything you might want to do with large JSON documents. But MySQL has never been known for its speed. If we really want to assess Dolt, we need to pit it against a more heavyweight opponent.
We're in the process of teaching the Dolt engine the PostgreSQL dialect. Our current target is to launch Doltgres next January. And once we do, Dolt will become a drop-in replacement for Postgres. Between its speed, extensibility, and ubiquity, Postgres is essentially the reigning champion of open-source SQL engines. So how does Dolt compare in the JSON department?
This post will repeat the structure of the MySQL post. First, I'll introduce the syntax that Postgres uses for indexing JSON documents. Then, we'll take a look at Postgres's code to understand how it represents documents and their indexes. And finally, we'll run benchmarks to see how it performs for common operations.
Part 1: What does using JSON in Postgres look like?
Postgres is a bit special: most SQL dialects have one `JSON` data type, but Postgres has two.
CREATE TABLE users(userId INT PRIMARY KEY, metadata1 JSON, metadata2 JSONB);
What's the difference? `JSON` columns are text columns whose contents are required to be valid JSON, while `JSONB` is a bespoke binary format similar to BSON. This makes `JSONB` optimized for document lookups and updates. However, this comes at the cost of slower imports and exports, since the format needs to be converted to and from regular JSON text whenever it enters or leaves the database.
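The practical difference shows up as soon as a document is stored: `JSON` preserves the input text exactly, while `JSONB` normalizes it on the way in.

```sql
SELECT '{"b": 1, "a": 2, "a": 3}'::json;
-- {"b": 1, "a": 2, "a": 3}   (stored verbatim, duplicate keys and all)
SELECT '{"b": 1, "a": 2, "a": 3}'::jsonb;
-- {"a": 3, "b": 1}           (normalized: keys deduplicated and sorted)
```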
Postgres offers a large number of operations on JSON data, which are outlined here. Some of these operations are limited to the `JSONB` type.
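For a taste, here are a few common ones, run against the `users` table above (the `name`, `address`, and `city` keys are invented for illustration):

```sql
-- Works on both JSON and JSONB: extract a nested value as text
SELECT metadata1->'address'->>'city' FROM users;
-- JSONB only: return a copy of the document with one field replaced
SELECT jsonb_set(metadata2, '{name}', '"Bob"') FROM users;
-- JSONB only: containment test
SELECT userId FROM users WHERE metadata2 @> '{"name": "Bob"}';
```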
Suppose you have a JSON column where every document in the column has a value at a certain path, and you want to index just that value. In my Dolt vs MySQL showdown, I advocated for using virtual generated columns. But Postgres doesn't support virtual generated columns. If you want to index that value, you'd need to either use a stored generated column:
ALTER TABLE users ADD COLUMN name TEXT GENERATED ALWAYS AS (metadata->>'name') STORED;
CREATE INDEX name_idx ON users(name);
Or a functional index:
CREATE INDEX name_idx ON users((metadata->>'name'));
Stored generated columns have the downside that they take up extra space in your table. Functional indexes have the downside that it can be difficult to reason about whether a query will use them: often queries that look like they ought to use a functional index can't, for subtle reasons.
In my benchmarking, many of the queries would only be optimized by a stored generated column index, and not a functional index. If you use a functional index, make sure to have Postgres `EXPLAIN` your queries before using them in your application, so you can be sure that the execution plan will actually optimize the query.
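For example, a quick sanity check against the `name_idx` functional index defined above might look like the following; you want the plan to show an index scan rather than a sequential scan:

```sql
EXPLAIN SELECT userId FROM users WHERE metadata->>'name' = 'Alice';
-- Want:  Index Scan using name_idx on users ...
-- Not:   Seq Scan on users ...
```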
But this isn't the only kind of index that Postgres offers for JSON. There's also a special kind of index that can be created on `JSONB` columns but not on `JSON` columns: the Generalized Inverted Index (GIN).
A GIN is an index that maps each component of a composite value onto a list of table rows. In cases where the set of keys is different for each row, it tracks which rows contain each key. When adding a GIN index to your JSON column, you don't need to choose a specific key: the GIN will index all the keys in the document. This doesn't come for free though: building and updating GIN indexes is much slower than normal indexes.
There are two types of GIN indexes on `JSONB` columns: the default, and the `jsonb_path_ops` operator class. Each optimizes different types of queries. The Postgres documentation provides more details on the difference.
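For reference, creating each flavor on the `users` table from earlier looks like this (the index names are my own):

```sql
-- Default operator class: indexes keys and values separately
CREATE INDEX metadata_gin ON users USING GIN (metadata2);
-- jsonb_path_ops operator class: indexes key-value paths
CREATE INDEX metadata_gin_paths ON users USING GIN (metadata2 jsonb_path_ops);
```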
Part 2: Postgres Under the Hood
Postgres is written in C. It's heavily optimized, makes extensive use of raw pointers and direct memory copies, and isn't the easiest to read. Fortunately, it's well documented and has lots of useful comments. I've done my best to understand it.
The source code for the `JSON` type is here. As we can see, it's almost entirely conversion functions and validation logic: within the storage system itself, `JSON` columns are identical to `TEXT` columns. Given this, we should expect importing and exporting data to/from these columns to be fast. But using them in queries will likely be slow, since the entire document will need to be read before it can be interpreted or manipulated.
The implementation of the `JSONB` type is significantly more complicated and spread out over multiple files. The in-memory representation of the `JSONB` type is described here, and a serialized version here. This implies that `JSONB` documents are stored as a single serialized blob, and don't share storage with other documents in the database.
Part 3: The Benchmarks
All of the tests were conducted on an Apple M2 Pro with 32 GB of RAM. All reported times are in seconds.
Here we explore how both Postgres and Dolt perform when working with multiple different document sizes. In each case, the total database size remains the same, but the size of the documents (and the total number of documents) varies. Each document differs only in a single field, which is used by the test queries.
I'm not aware of any standard benchmark that specifically tests operations on large JSON documents, so we're going to roll our own. Our data will be documents generated by the following method, based on the von Neumann ordinals:
CREATE TABLE vn(pk int primary key, j json);
INSERT INTO vn(
    with recursive cte (pk, j) as (
        select 0, JSON_OBJECT("l", 0)
        union all
        select pk+1, JSON_SET(JSON_INSERT(j, CONCAT("$.k", pk), j), "$.l", pk+1)
        from cte
        where pk <= 20
    )
    select * from cte
);
Why not use realistic data? Wouldn't that be more meaningful?
We choose this over a more realistic data set because it helps ensure that we're not accidentally relying on the test data having some specific shape. The documents generated by this process span a range of sizes and include both deeply nested objects, shallow objects with many fields, and everything in between. Any engine that can handle this should be able to handle any other shape of data.
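To make the shape concrete: each row's document embeds every previous row's document, so the documents roughly double in size with each step. Working through the CTE by hand, the first few rows come out like this:

```sql
-- pk = 0: {"l": 0}
-- pk = 1: {"l": 1, "k0": {"l": 0}}
-- pk = 2: {"l": 2, "k0": {"l": 0}, "k1": {"l": 1, "k0": {"l": 0}}}
```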
Test 1: Exporting JSON
We want to see how quickly the engine can load, format, and return large JSON documents. For workflows where all the document manipulation is done by a client, this is the most common use case, so it needs to be fast.
Both Postgres and Dolt have functionality for dumping tables into `.sql` script files, so we'll use those tools and benchmark them:
- Postgres:
time pg_dump -t {TABLE} > json.sql
- Dolt:
time dolt dump -fn json.sql
| | 64 KB * 1024 | 256 KB * 256 | 1 MB * 64 | 4 MB * 16 | 16 MB * 4 |
| --- | --- | --- | --- | --- | --- |
| Dolt | 0.326±0.005 | 0.314±0.009 | 0.314±0.005 | 0.326±0.009 | 0.354±0.005 |
| Postgres JSON | 0.176±0.05 | 0.186±0.019 | 0.176±0.005 | 0.178±0.004 | 0.184±0.016 |
| Postgres JSONB | 0.57±0.022 | 0.616±0.042 | 0.600±0.008 | 0.608±0.021 | 0.632±0.034 |
Test 2: Importing JSON
We want to see how quickly the engine can parse JSON and store it in a table. For workflows where all the document manipulation is done by a client, this is the second most common use case, so it needs to be fast.
- Postgres:
time psql < json.sql
- Dolt:
time dolt sql < json.sql
| | 64 KB * 1024 | 256 KB * 256 | 1 MB * 64 | 4 MB * 16 | 16 MB * 4 |
| --- | --- | --- | --- | --- | --- |
| Dolt | 3.986±0.047 | 3.996±0.038 | 4.028±0.125 | 3.992±0.051 | 4.134±0.096 |
| Postgres JSON | 0.568±0.004 | 0.56±0.001 | 0.564±0.009 | 0.568±0.008 | 0.578±0.008 |
| Postgres JSONB | 1.698±0.051 | 1.402±0.181 | 1.35±0.089 | 1.728±0.070 | 1.782±0.053 |
Test 3: Copying a document with minor changes.
Copying documents and making minor changes to them is useful if you're trying to track the history of a document over time. If this can be done quickly, it's also a strong indicator that the database is using structural sharing, where the redundant parts of the different document versions are being reused instead of copied.
Postgres does not have a way to make minor changes to a document in the `JSON` format. The easiest way is to convert it to `JSONB`, make the change there, and convert it back. This is obviously slower.
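Concretely, that round trip for a `JSON` column looks something like this (a sketch, using the `jsonTable` from the benchmark commands below):

```sql
-- Cast to jsonb, mutate, cast back to json
SELECT pk, jsonb_set(j::jsonb, '{l}', to_jsonb(1))::json FROM jsonTable;
```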
- Postgres:
time psql -c "create table jsonTable2 as select pk, JSONB_SET(j, '{l}', to_jsonb(1)) from jsonTable;"
- Dolt:
time dolt sql -q 'create table jsonTable2 as select pk, JSON_SET(j, "$.l", 1) from jsonTable;'
| | 64 KB * 8092 | 256 KB * 2048 | 1 MB * 512 | 4 MB * 128 | 16 MB * 32 |
| --- | --- | --- | --- | --- | --- |
| Dolt | 0.966±0.032 | 0.224±0.009 | 0.078±0.018 | 0.03±0.001 | 0.024±0.09 |
| Postgres JSON | 16.896±0.378 | 18.178±1.128 | 20.518±1.007 | 22.074±0.774 | 21.82±0.629 |
| Postgres JSONB | 9.012±0.188 | 7.892±0.584 | 8.59±1.121 | 9.274±0.561 | 9.322±0.410 |
Test 4: Extracting a sub-value from a JSON column (no index)
In theory, this should be fast for both engines, but Postgres is held back by needing to load the entire document from disk. In smaller tables where the operating system can hold the entire database in memory, the first lookup is slow but subsequent lookups are fast.
- Postgres:
time psql -c "select j->>'l' from test;'
- Dolt:
time dolt sql -q 'select j->>"$.l" from test;'
Why is Dolt faster for large documents? That seems weird.
Since Dolt only needs to load and parse the necessary parts of a document, total document size doesn't matter for most operations. Because we keep the total amount of data constant while varying how many documents it's split across, larger documents mean fewer rows, and therefore less work for the engine. Postgres, by comparison, appears to scale in runtime with the total size of the data set, regardless of how that data is distributed.
| | 64 KB * 8092 | 256 KB * 2048 | 1 MB * 512 | 4 MB * 128 | 16 MB * 32 |
| --- | --- | --- | --- | --- | --- |
| Dolt | 0.020±0.001 | 0.012±0.001 | 0.011±0.001 | 0.011±0.001 | 0.01±0.001 |
| Postgres JSON | 3.138±0.043 | 3.134±0.015 | 3.194±0.050 | 3.222±0.063 | 3.18±0.063 |
| Postgres JSONB | 0.092±0.004 | 0.08±0.07 | 0.114±0.005 | 0.074±0.005 | 0.138±0.013 |
Test 5: Filtering on a JSON key (no index)
- Postgres:
time psql -c "select pk from test where j->>'l' = 10;"
- Dolt:
time dolt sql -q 'select pk from test where j->>"$.l" = 10;'
| | 64 KB * 8092 | 256 KB * 2048 | 1 MB * 512 | 4 MB * 128 | 16 MB * 32 |
| --- | --- | --- | --- | --- | --- |
| Dolt | 0.262±0.004 | 0.07±0.001 | 0.022±0.004 | 0.012±0.001 | 0.01±0.004 |
| Postgres JSON | 3.172±0.038 | 3.152±0.031 | 3.228±0.119 | 3.266±0.101 | 3.152±0.090 |
| Postgres JSONB | 0.084±0.005 | 0.078±0.004 | 0.118±0.008 | 0.074±0.005 | 0.138±0.011 |
Test 6: Ordering by a JSON key (no index)
- Postgres:
time psql -c "select pk from test order by j->>'l';"
- Dolt:
time dolt sql -q 'select pk from test order by j->>"$.l";'
| | 64 KB * 8092 | 256 KB * 2048 | 1 MB * 512 | 4 MB * 128 | 16 MB * 32 |
| --- | --- | --- | --- | --- | --- |
| Dolt | 0.268±0.004 | 0.076±0.005 | 0.030±0.001 | 0.012±0.004 | 0.010±0.001 |
| Postgres JSON | 3.154±0.022 | 3.154±0.024 | 3.272±0.114 | 3.268±0.107 | 3.15±0.101 |
| Postgres JSONB | 0.090±0.001 | 0.08±0.007 | 0.114±0.005 | 0.074±0.005 | 0.138±0.013 |
Test 7: Creating an index
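The statements being timed look roughly like the following sketch (column and index names are mine; `test` is the table from the earlier queries, and the Dolt statements use its MySQL dialect):

```sql
-- Dolt (MySQL dialect): virtual generated column plus a secondary index
ALTER TABLE test ADD COLUMN l INT GENERATED ALWAYS AS (j->>"$.l") VIRTUAL;
CREATE INDEX l_idx ON test(l);

-- Postgres: stored generated column plus an index
ALTER TABLE test ADD COLUMN l TEXT GENERATED ALWAYS AS (j->>'l') STORED;
CREATE INDEX l_idx ON test(l);

-- Postgres: GIN indexes (JSONB columns only)
CREATE INDEX j_gin ON test USING GIN (j);
CREATE INDEX j_gin_paths ON test USING GIN (j jsonb_path_ops);
```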
| | 64 KB * 8092 | 256 KB * 2048 | 1 MB * 512 | 4 MB * 128 | 16 MB * 32 |
| --- | --- | --- | --- | --- | --- |
| Dolt (virtual column index) | 0.634±0.011 | 0.168±0.004 | 0.054±0.009 | 0.022±0.004 | 0.024±0.001 |
| Postgres (JSONB, stored column) | 0.528±0.151 | 0.54±0.041 | 0.518±0.050 | 0.488±0.026 | 0.566±0.042 |
| Postgres (JSONB, GIN) | 10.666±0.088 | 11.026±0.210 | 11.172±0.135 | 11.148±0.213 | 11.852±0.045 |
| Postgres (JSONB, GIN `jsonb_path_ops`) | 10.666±0.088 | 11.026±0.210 | 11.172±0.135 | 11.148±0.213 | 11.852±0.045 |
| Postgres (JSON, stored column) | 3.194±0.059 | 3.182±0.036 | 3.27±0.111 | 3.27±0.129 | 3.148±0.064 |
| Postgres (JSON, functional index) | 1.6±0.016 | 3.142±0.030 | 3.194±0.097 | 3.24±0.131 | 3.112±0.043 |
Once the indexes are created, a query that uses the index becomes faster than we can measure. A query that doesn't use the index has a runtime that is the same as on a non-indexed table.
During testing, queries would use indexes under the following conditions:
| | extract | filter | order |
| --- | --- | --- | --- |
| Dolt virtual column index | ✅ | ✅ | ✅ |
| Postgres JSON stored column index | ✅ | ✅ | ✅ |
| Postgres JSON functional index | ❌ | ❔¹ | ❌ |
| Postgres JSONB stored column index | ❌ | ✅ | ❌ |
| Postgres JSONB functional index | ❌ | ❌ | ❌ |
| Postgres JSONB GIN | ❌ | ✅² | ❌ |
| Postgres JSONB GIN `jsonb_path_ops` | ❌ | ✅² | ❌ |
¹: In my tests, Postgres's `JSON` type with a functional index would use the index to optimize queries with a filter, but only when the number of rows is sufficiently large. I suspect that this happens because Postgres isn't factoring document size into its planning, so it underestimates the cost of a full table scan on a small number of large documents.
²: GIN indexes won't optimize a normal filter query (`where j->>'key' = value`), but can optimize the equivalent "contains" query (`where j @> '{"key": value}'`). A normal GIN index also has a limitation in that it only maps keys to rows, not key-value pairs. This means it can help reduce the number of rows that need to be scanned by eliminating rows that don't contain a matching key, but a scan is still required to determine which rows with the correct key also contain the correct value. Thus, the performance of a GIN index can be highly dependent on the shape of the document, more so than a traditional index. If the key is rare, it can dramatically improve the query. However, when the key is known to exist in every row, adding a GIN index provides no benefit, and actually makes lookup operations slightly slower.
The `jsonb_path_ops` GIN index works differently and does store key-value pairs. This makes it as fast as a traditional index for filter operations, provided that the "contains" query syntax is used, with the bonus that it works on every possible key. However, a `jsonb_path_ops` GIN index is slower to build and can't be used to check whether a document contains a key (ignoring the value).
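To make the rewrite concrete, here's the filter from Test 5 in both forms, against our `test` table:

```sql
-- Not accelerated by either GIN flavor:
SELECT pk FROM test WHERE j->>'l' = '10';
-- Accelerated by both GIN flavors:
SELECT pk FROM test WHERE j @> '{"l": 10}';
```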
Test 8: Required Disk Space
Aside from runtime performance, another important metric is storage space. Database engines can typically reduce storage space in two ways: compression and structural sharing. Dolt uses both, while Postgres appears to only use compression.
| Size on disk (MB) | 64 KB * 8092 | 256 KB * 2048 | 1 MB * 512 | 4 MB * 128 | 16 MB * 32 |
| --- | --- | --- | --- | --- | --- |
| Dolt | 14 | 2.6 | 2.3 | 2.2 | 4 |
| Postgres JSON | 15 | 14 | 14 | 13 | 13 |
| Postgres JSONB | 37 | 32 | 29 | 27 | 27 |
Conclusion
We can draw the following conclusions about how Dolt and Postgres compare when it comes to handling large JSON documents:
- Between Postgres's two JSON formats, `JSON` is much faster for importing and exporting, and `JSONB` is much faster for queries and manipulations. Which to use likely depends on your expected use case.
- Postgres's `JSON` format is faster than Dolt at importing and exporting, but slower than Dolt at every other operation across all document sizes.
- Postgres's `JSONB` format is faster than Dolt at running queries when document sizes are small, but Dolt overtakes it once the test documents surpass around 256 kilobytes.
- Dolt is noticeably slower than Postgres at importing JSON documents. This is because Dolt builds a prolly tree to partially index each document as it's imported. We believe there's a lot of room to optimize this operation if it ever turns out to be a bottleneck for users.
- When storing multiple similar documents, Dolt uses significantly less storage space than Postgres.
- GIN indexes are a useful feature that Postgres has and Dolt doesn't, but they are extremely slow to build and are only useful when a JSON document's keys aren't known in advance. Dolt can add GIN indexes if this turns out to be a desired feature.
Additionally, Postgres has a maximum document size of 1 GB for `JSON` and 255 MB for `JSONB`. Dolt has no limit on document size.
Overall, Dolt holds its own even against Postgres. It especially shines in its ability to compactly store similar documents and efficiently copy documents and make changes to them. It's a great candidate for any workflow that involves storing and manipulating multiple versions of a document, such as history tracking or version control.
We can confidently say that if you're using Postgres with a JSON-heavy workflow, you should consider switching to Dolt instead.
Still not sure if Dolt is right for you? Follow us on Twitter, or join our Discord and we'll be happy to figure out how Dolt can help your use case.