PRODUCTS

KEYWORDS

Adaptive Encoding in Dolt 2.0

Earlier this month we announced Dolt 2.0, the second major release of the world’s first and only version-controlled database. It contains a lot of cool features and enhancements from 1.0, and today we want to do a deeper dive on one of them: adaptive encoding.

What is adaptive encoding?#

Databases store values on disk in compact structures so that the query executor can predictably read a lot of them very quickly when running queries. The database system limits the maximum size of these structures to make these performance guarantees possible.

But what happens when you want to store a value that’s much bigger than the limit for these structures, like the full text of Moby Dick?

In most database systems, there are special types for large values, and they get stored “out-of-band” in a separate file, with a pointer to that file in the main row structure. In MySQL, this applies to several different types:

  • TEXT, TINYTEXT, MEDIUMTEXT, LONGTEXT
  • BLOB, TINYBLOB, MEDIUMBLOB, LONGBLOB
  • JSON
  • GEOMETRY

For these types, because the database doesn’t enforce a length and can’t ensure that a value will fit in the main row structure on disk, all values of these types are stored out-of-band. This works great but comes at a serious performance penalty: the query engine must perform an additional disk operation when reading or writing each value.

Adaptive encoding is a storage optimization for large types. It takes adavantage of the fact that many values of these types are small and can fit in-line with other row data, rather than in a separate file. For such small values, the data is stored inline alongside data for types like INT and VARCHAR. Large values are still stored out-of-band as before. This gets rid of the performance penalty associated with these types as long as the values are small.

This idea has been around for a long time. The most widely used implementation is probably in PostgreSQL, where they call it The Oversized Attribute Storage Technique, or TOAST. Doltgres has had support for adaptive encodings for about a year. Now Dolt has them in 2.0.

Why does it matter?#

In a word: performance. Fewer disk operations means better performance for both reads and writes. The difference is especially pronounced in table scans on tables of these types. In our benchmarks, Dolt can perform table scans 40% faster than MySQL for tables that include these large types.

It also simplifies the job of schema design. Traditionally, database application designers have had to carefully consider the types they used for string data to make sure all values would fit, without incurring the out-of-band performance penalty. Adaptive encoding makes this choice much less important. Choose the data type that best fits the data, and we’ll make sure it’s fast.

How does it work?#

Simply create a new table or column of type TEXT, BLOB, JSON, etc. with Dolt 2.0 to get the performance benefits of adaptive encoding. Values that can fit into main row storage will be stored there, and larger values will continue to be stored out of band.

An example of adaptive encoding

If you want to read a much more detailed deep-dive on how this feature works in Dolt, check out Nick’s announcement blog from last year.

How can I customize the behavior of this feature?#

Also new in Dolt 2.0: a new table configuration parameter you can use to tune the target size of rows that contain adaptive values. It looks like this:

CREATE TABLE t1 
    (pk INT PRIMARY KEY,
    value TEXT)
    TARGET_ROW_SIZE=2048;

This new table parameter controls how large a row can get before Dolt begins storing eligible values out-of-band. The default is relatively small and optimizes for storing many rows in a single chunk on disk, which improves performance for most workloads. But for some database schemas or workloads, especially tables with TEXT, JSON or other large values with total size between 2KB and 64KB, it could make sense to store more of these values inline whenever there’s space. Here’s a table with several JSON documents that targets 48KB row size.

CREATE TABLE json_store
    (pk INT PRIMARY KEY,
    summary JSON,
    details JSON)
    TARGET_ROW_SIZE=49512;

For compatibility with customers more familiar with Postgres, TOAST_TUPLE_TARGET is provided as an alias. The value cannot exceed 64K.

Conclusion#

Adaptive encoding provides Dolt 2.0 a nice performance boost for customers using large types. This is one area where we have substantially advanced beyond what MySQL is capable of, and we’re excited to offer these benefits to customers.

Have questions about database storage techniques? Want to learn more about Dolt or Doltgres? Visit us on the DoltHub Discord, where our engineering team hangs out all day. Hope to see you there.