Challenges with Prolly Trees and Columnar Storage

Dolt is a MySQL-compatible SQL database that supports Git-like version control functionality, including branch, merge, diff, clone, push and pull. It is deployed in a wide variety of use cases: as an OLTP database for applications that expose version control over their domain model, and as a working database and rendezvous point for agentic workflows, where agents make changes directly against the database on individual branches that can be reviewed and merged independently.

Once customers see and interact with this model, they are often interested in Dolt-like features for OLAP workloads as well. Currently, Dolt is a single-system database with row-oriented storage. It supports cluster replication for high availability, and shallow clones for working more cheaply with subsets of a database when those subsets split along branch boundaries and the full history is irrelevant to a given workload. In general, though, a full copy of the database must fit on a single computer for Dolt to be appropriate for your use case. Vertical scaling can take you a ways, but for many OLAP use cases this is not an acceptable constraint. It's also contrary to the evolution of analytic databases over the last few decades.

In this blog post, I talk a little bit about Dolt's current storage model and compare it to columnar storage, a storage model that is popular in OLAP-centered databases and makes very different tradeoffs. The goal is to explore some of the challenges Dolt's model faces in adopting columnar storage. Let's jump right in.

Row-oriented vs Columnar Storage

SQL databases, both for transactional and analytic workloads, generally present an abstraction consisting of a collection of tables, where each table is defined to have a set of columns, each with a name and a type. For each table, there is a collection of tuples corresponding to the records in that table. We might have a customer table like:

CREATE TABLE customer (
  customer_id   INT PRIMARY KEY,
  name          VARCHAR(255),
  email_address VARCHAR(255),
  phone_number  VARCHAR(255)
);

The SQL implementation will be responsible for mapping the records in this table to actual storage on a physical storage device. There are a number of ways this could be represented in memory and on disk, but two very popular representations are row-oriented and columnar storage. In row-oriented storage, all of the columns of each tuple are stored together. So we might have, on disk, a serialization that follows this structure:

1,Aaron,aaron@dolthub.com,
2,Tim,tim@dolthub.com,+1(206)867-5309
3,Brian,,

For each record, we store the values for each of the columns, all adjacent in storage. We can lay these tuples out on disk however we want, as long as we can find all of the ones which correspond to the records in the table, but the important point for this discussion is that we are encoding each row with all of its fields contiguously and in the same place. The number of tuples we will store is equal to the number of rows in the table.
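
To make the layout concrete, here is a minimal sketch in Go (the language Dolt itself is written in). The customer type and encodeRow function are illustrative names, not Dolt's actual encoding; a real engine serializes to a compact binary format rather than comma-separated strings:

package main

import (
    "fmt"
    "strings"
)

type customer struct {
    customerID   int
    name         string
    emailAddress string
    phoneNumber  string
}

// encodeRow serializes all columns of one record contiguously,
// mirroring the CSV-like layout above.
func encodeRow(c customer) string {
    return strings.Join([]string{
        fmt.Sprint(c.customerID), c.name, c.emailAddress, c.phoneNumber,
    }, ",")
}

func main() {
    rows := []customer{
        {1, "Aaron", "aaron@dolthub.com", ""},
        {2, "Tim", "tim@dolthub.com", "+1(206)867-5309"},
        {3, "Brian", "", ""},
    }
    for _, r := range rows {
        fmt.Println(encodeRow(r)) // one contiguous tuple per row
    }
}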

For columnar storage, we instead store each column independently:

1,2,3
Aaron,Tim,Brian
aaron@dolthub.com,tim@dolthub.com,
,+1(206)867-5309,

Each of these is a tuple as well, but there is a fixed number of them. We will store as many tuples as there are columns in the table. Each tuple will have an entry for every row in the table. To be able to reconstruct a given row in the table, we need to store the tuples in the same sort order for each column, as we have above.
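
A minimal sketch of the same idea in Go: one slice per column, all kept in the same order, so that row i can be reconstructed by reading position i from every column. The variable names are illustrative:

package main

import "fmt"

func main() {
    // One logical tuple per column, all in the same sort order.
    customerID := []int{1, 2, 3}
    name := []string{"Aaron", "Tim", "Brian"}
    email := []string{"aaron@dolthub.com", "tim@dolthub.com", ""}
    phone := []string{"", "+1(206)867-5309", ""}

    // Reconstruct the second row by taking position 1 from every column.
    i := 1
    fmt.Println(customerID[i], name[i], email[i], phone[i])
}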

Logically, this is the columnar storage model. In practice, a storage engine does not need to store these tuples contiguously in storage. It can cut each column tuple up into larger or smaller runs of values as needed. As long as we can find the runs of values, know which runs correspond to each other for purposes of sort order and record reconstruction, and can reconstruct the column tuples of the table from them, we can make the runs whatever size we want.

Often, a columnar storage engine will define a sort order for the table and store the records of the table within sets of tuples sorted by that order. As records are inserted and changed, it can manage the large logical tuples associated with a given column's values much as an LSM tree manages its sorted record sets, merging multiple smaller sorted tuples into larger contiguous tuples that contain all of the values.
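
The core operation in that maintenance strategy is merging sorted runs. Here is a toy two-way merge in Go, the same shape of work an LSM-style engine performs when compacting smaller sorted tuples into larger ones (mergeRuns is a hypothetical helper for illustration, not a real Dolt or LSM API):

package main

import "fmt"

// mergeRuns merges two sorted runs of values into one larger sorted run.
func mergeRuns(a, b []int) []int {
    out := make([]int, 0, len(a)+len(b))
    i, j := 0, 0
    for i < len(a) && j < len(b) {
        if a[i] <= b[j] {
            out = append(out, a[i])
            i++
        } else {
            out = append(out, b[j])
            j++
        }
    }
    out = append(out, a[i:]...)
    return append(out, b[j:]...)
}

func main() {
    fmt.Println(mergeRuns([]int{1, 4, 9}, []int{2, 3, 10}))
    // Output: [1 2 3 4 9 10]
}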

The design space for an analytic-optimized distributed SQL engine is huge, but there are some good reasons why columnar storage is popular in that space:

  1. Analytic workloads can have very wide tables. In the literature, it's not rare to have tables with tens of thousands of columns, for example.
  2. Wide tables used in analytic workloads are often very sparse. It is common for only a small percentage of columns to be non-NULL in any given row.
  3. The queries used by analytic workloads often select only a small portion of the columns in the table.
  4. Column tuples typically compress better than similarly sized sets of row tuples. The reason is that all of the data in a column tuple has the same type and represents the same semantic domain. There are type- and data-specific compression strategies that can yield very good results, such as run-length encoding and delta encoding (see the sketch just after this list).
  5. Query execution engines can often generate very efficient code, from a CPU utilization and memory bandwidth standpoint, when applying simple predicates, filters and aggregations to a column whose values are scanned in columnar format. It can be better from a cache-efficiency standpoint and it can present much better opportunities for vectorization. Similar concerns come up for in-memory data layout in a program: array of structures vs. structure of arrays.

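Here is the sketch promised in point 4: a toy run-length encoder in Go for a single column of values. Real engines use far more sophisticated, type-specific codecs, but even this toy shows why a sparse or repetitive column compresses so well:

package main

import "fmt"

type run struct {
    value string
    count int
}

// rleEncode collapses consecutive repeated values into (value, count) runs.
func rleEncode(col []string) []run {
    var runs []run
    for _, v := range col {
        if n := len(runs); n > 0 && runs[n-1].value == v {
            runs[n-1].count++
        } else {
            runs = append(runs, run{v, 1})
        }
    }
    return runs
}

func main() {
    // A sparse column: mostly NULL (represented here as ""), as is
    // typical of wide analytic tables.
    col := []string{"", "", "", "US", "US", "", "", ""}
    fmt.Println(rleEncode(col)) // [{ 3} {US 2} { 3}]
}
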
Combined, these advantages have resulted in columnar storage seeing wide usage across the industry when it comes to analytic databases. Google BigQuery, Amazon Redshift, ClickHouse, and Snowflake all use columnar storage. Open source columnar formats, such as Apache Parquet, ORC, and Arrow, are widely used across open source analytic tools like Spark and Hive.

Dolt — History Independence, Content-Addressed, Copy on Write

Columnar storage represents a unique challenge for Dolt's approach to versioning a SQL database. Dolt stores a complete copy of every committed revision of a database, keeping each table's tuples in row-oriented storage in a copy-on-write, disk-based, sorted Merkle tree we call a Prolly Tree. Dolt currently materializes these trees for every table and index in the database on every transaction commit. Dolt takes advantage of the following properties of a Prolly Tree to implement its versioning functionality:

  • History Independence — this means that, for a given set of values in the tree, we always get the same tree, with the same internal nodes and block boundaries, regardless of what operations we perform, and in what order, to arrive at those values. (The toy chunker after these bullets sketches how content-defined block boundaries produce this property.)

  • Content-Addressed — this means that the contents of a tree uniquely determine its content address, which we get from the Merkle tree structure of the tree itself. History independence is an important requirement here, since it means that two logically identical trees also have the same Merkle tree structure and the same content address.
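
To make these properties a little more concrete, here is a toy content-defined chunker in Go. A block boundary is declared whenever a running hash of the bytes since the last boundary hits a fixed pattern, so boundaries depend only on content, and each chunk's content address is simply a hash of its bytes. This is a deliberate simplification for illustration, not Dolt's actual chunking or hashing scheme:

package main

import (
    "crypto/sha256"
    "fmt"
)

// chunk splits data at content-defined boundaries: the split points are a
// function of the bytes themselves, so identical content always produces
// identical chunks, regardless of how that content was arrived at.
func chunk(data []byte) [][]byte {
    var chunks [][]byte
    var h uint32
    start := 0
    for i, b := range data {
        h = h*31 + uint32(b) // toy running hash, reset at each boundary
        if h%64 == 0 {       // content-defined boundary, ~64-byte chunks
            chunks = append(chunks, data[start:i+1])
            start = i + 1
            h = 0
        }
    }
    if start < len(data) {
        chunks = append(chunks, data[start:])
    }
    return chunks
}

func main() {
    for _, c := range chunk([]byte("...some table bytes...")) {
        // Each chunk's content address is just a hash of its bytes.
        fmt.Printf("%x\n", sha256.Sum256(c))
    }
}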

In order to efficiently diff and merge two versions of a table, and to efficiently push and pull changes to and from remotes, Dolt wants trees with identical content to be structurally identical. It achieves this by tracking all database state as content addresses of the relevant components of the database, and by materializing the history-independent Prolly Trees for each table and index at each transaction commit. Transactions, mutations and branch operations which independently arrive at the same contents for a table will also arrive at the same Merkle tree structure, regardless of where those operations run or what form and order they take. Making use of the recursive nature of the Prolly Tree, this model gets you O(1) branching, and diff and merge that scale with the size of the change instead of the size of the table.

It's fair to say that Dolt is trading off write efficiency to get its version control capabilities. Dolt misses out on taking as much advantage of write buffering and write-optimized data structures like LSM trees as a traditional SQL engine would, partly because of its requirement to materialize the full tree, and realize a content address for it, so often.

A close examination of the columnar storage layout will show that it is also, in general, trading off write efficiency in order to gain the read-side advantages it exhibits on analytic workloads. If you have a 256 column table and you insert a row into it, the most naive row-oriented storage typically requires you to write that row into storage and potentially update the spine of the B-tree that keeps track of which tuples are in the tree. If you insert that same row into a column-oriented database, you need to update at least 256 independent places on disk, potentially encoding the presence of mostly NULLs and inserting the new values into the relevant value runs as appropriate. Because of this trade-off, analytic databases which use columnar storage almost invariably make use of batching, flushing and merging operations which amortize these writes and keep throughput high and scalable.
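
A sketch of that write amplification in Go: a single logical insert touches every column's storage, which is why columnar engines buffer rows in memory and flush them in large batches. The columnarTable type here is a hypothetical in-memory stand-in for per-column on-disk regions:

package main

import "fmt"

type columnarTable struct {
    cols [][]string // one slice (ultimately, one on-disk region) per column
}

// insert appends one value to every column: N columns means N touched
// structures for a single logical row.
func (t *columnarTable) insert(row []string) {
    for i, v := range row {
        t.cols[i] = append(t.cols[i], v)
    }
}

func main() {
    t := &columnarTable{cols: make([][]string, 256)}
    t.insert(make([]string, 256)) // one row, 256 column updates
    fmt.Println(len(t.cols), "columns touched by a single-row insert")
}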

The compounding write-efficiency trade-offs here, combined with Dolt's inability to pursue the same amortization and cost-hiding strategies which prevail in production-grade analytic SQL engines, mean that Dolt's current approach to branching and merging SQL databases is not a great match for columnar storage. It's simply too expensive to materialize each new column tuple on every write. This is especially true given that columnar storage typically opts for large block sizes, which improves the efficiency of scans but, in the context of Dolt, incurs an even higher write and storage amplification factor.

The Future of OLAP and Dolt

For now, we're pretty confident that columnar storage is not a great fit for Dolt's current implementation, and we know that there are some scaling and architectural bottlenecks for OLAP workloads in Dolt. However, we continue to explore and research in this space, because we know that Dolt's core features would be awesome for scalable OLAP use cases as well. It's a major open question at DoltHub whether we can find a hybrid approach that allows for more write buffering but still gets us the branch, diff, merge, push and pull functionality we are looking for. And we continue to explore just how far we could get if we had a distributed execution engine and some better optimizations for columnar storage in our Prolly Tree implementations, at least for customers whose primary write needs are met by large ingestions that are not too latency sensitive.

If you have a use case for Dolt, whether it be agentic, OLAP, or OLTP, or if you just want to learn more about Prolly Trees, columnar storage, or anything else database related, don't hesitate to drop by our Discord.
