Virtual column support in Dolt

November 3, 2023

4 min read

Introduction

We're writing Dolt, the world's first version controlled SQL database. Dolt advertises itself as a drop-in replacement for MySQL, so if your current application works on MySQL you can swap in Dolt on the backend with no code changes. It should just work, out of the box.

For the last several weeks, I've been hard at work adding one of the last missing pieces of functionality: virtual columns. It's ready for you to try out in the latest release of Dolt, so go ahead and download your copy and follow along.

What are virtual columns?

Virtual columns are a subtype of generated columns. Generated columns cannot be specified in an INSERT or UPDATE statement: their value is always generated by the server, usually based on other columns in the row. You can choose to either store these values in the rows physically, or else compute them every time they're needed. Let's look at an example, straight from the MySQL docs.

CREATE TABLE triangle (
  sidea DOUBLE,
  sideb DOUBLE,
  sidec DOUBLE GENERATED ALWAYS AS (SQRT(sidea * sidea + sideb * sideb))
);
INSERT INTO triangle (sidea, sideb) VALUES(1,1),(3,4),(6,8);

mysql> SELECT * FROM triangle;
+-------+-------+--------------------+
| sidea | sideb | sidec              |
+-------+-------+--------------------+
|     1 |     1 | 1.4142135623730951 |
|     3 |     4 |                  5 |
|     6 |     8 |                 10 |
+-------+-------+--------------------+

The above schema definition creates a virtual column whose value is always derived from the other two columns, and can never be specified manually. If you try to insert into the column anyway, you get an error:

mysql> insert into triangle values (3,4,5);
ERROR 3105 (HY000): The value specified for generated column 'sidec' in table 'triangle' is not allowed.

By default, generated columns are virtual, which means the values aren't written to disk, just computed whenever needed. But you can make them stored to disk by adding the STORED keyword to the column definition.

CREATE TABLE triangle (
  sidea DOUBLE,
  sideb DOUBLE,
  sidec DOUBLE AS (SQRT(sidea * sidea + sideb * sideb) STORED)
);

The GENERATED ALWAYS clause is also optional, and can be omitted.

What are generated columns used for?

What we've shown so far is interesting mostly as a form of data quality control you build into your schema. Because generated columns cannot be explicitly specified, they are effectively read-only, which gives you assurance for values that should never be updated manually.

But mostly what people use generated columns for is to be able to index documents, particularly JSON documents. Because you can create indexes on virtual columns, it makes it possible to combine JSON documents with fast indexed access on one a field. Let's look at an example.

virtual> create table books (
    bookdata json, 
    isbn varchar(100) as (bookdata->>'$.isbn'), 
    key isbn_key (isbn)
);

virtual> insert into books (bookdata) values ('{
      ->             "isbn":"9781593279509",
      ->             "title":"Eloquent JavaScript, Third Edition",
      ->             "subtitle":"A Modern Introduction to Programming",
      ->             "author":"Marijn Haverbeke",
      ->             "published":"2018-12-04T00:00:00.000Z",
      ->             "publisher":"No Starch Press",
      ->             "pages":472,
      ->             "description":"JavaScript lies at the heart of almost every modern web application, from social apps like Twitter to browser-based game frameworks like Phaser and Babylon. Though simple for beginners to pick up and play with, JavaScript is a flexible, complex language that you can use to build full-scale applications.",
      ->             "website":"http://eloquentjavascript.net/"
      ->         }');
Query OK, 1 row affected (0.00 sec)
virtual> insert into books (bookdata) values ('        {
      ->             "isbn":"9781593277574",
      ->             "title":"Understanding ECMAScript 6",
      ->             "subtitle":"The Definitive Guide for JavaScript Developers",
      ->             "author":"Nicholas C. Zakas",
      ->             "published":"2016-09-03T00:00:00.000Z",
      ->             "publisher":"No Starch Press",
      ->             "pages":352,
      ->             "description":"ECMAScript 6 represents the biggest update to the core of JavaScript in the history of the language. In Understanding ECMAScript 6, expert developer Nicholas C. Zakas provides a complete guide to the object types, syntax, and other exciting changes that ECMAScript 6 brings to JavaScript.",
      ->             "website":"https://leanpub.com/understandinges6/read"
      ->         }');
Query OK, 1 row affected (0.00 sec)
virtual> insert into books (bookdata) values ('        {
      ->             "isbn":"9781484200766",
      ->             "title":"Pro Git",
      ->             "subtitle":"Everything you neeed to know about Git",
      ->             "author":"Scott Chacon and Ben Straub",
      ->             "published":"2014-11-18T00:00:00.000Z",
      ->             "publisher":"Apress; 2nd edition",
      ->             "pages":458,
      ->             "description":"Pro Git (Second Edition) is your fully-updated guide to Git and its usage in the modern world. Git has come a long way since it was first developed by Linus Torvalds for Linux kernel development. It has taken the open source world by storm since its inception in 2005, and this book teaches you how to use it like a pro.",
      ->             "website":"https://git-scm.com/book/en/v2"
      ->         }');
Query OK, 1 row affected (0.00 sec)

Here I've inserted some JSON book data I found on github. I've declared a virtual column on the isbn field of each JSON object and put an index on it, so if I want to look up a book by its ISBN number I can do so using a fast index lookup:

virtual> select bookdata->>'$.title' from books where isbn = '9781593277574';
+----------------------------+
| bookdata->>'$.title'       |
+----------------------------+
| Understanding ECMAScript 6 |
+----------------------------+
1 row in set (0.00 sec)

Using the EXPLAIN keyword, I can see that it's indeed using my virtual column index to do the lookup:

virtual> explain select bookdata->>'$.title' from books where isbn = '9781593277574';
+----------------------------------------------------------------------+
| plan                                                                 |
+----------------------------------------------------------------------+
| Project                                                              |
|  ├─ columns: [json_unquote(json_extract(books.bookdata, '$.title'))] |
|  └─ Filter                                                           |
|      ├─ (books.isbn = '9781593277574')                               |
|      └─ IndexedTableAccess(books)                                    |
|          ├─ index: [books.isbn]                                      |
|          └─ filters: [{[9781593277574, 9781593277574]}]              |
+----------------------------------------------------------------------+

This has become a relatively popular pattern for application developers who for whatever reason have document data in their relational tables. It's the only performant way to query for a match on fields inside such documents. And the virtual columns take no storage space (although secondary indexes on them of course do).

Conclusion

Virtual columns are one of the last big SQL language features we were missing, and we're excited to get it into your hands to start playing with. Let us know what you think!

Have questions about Dolt or virtual columns? Join us on Discord to talk to our engineering team and meet other Dolt users.

Blog

Introduction

What are virtual columns?

What are generated columns used for?

Conclusion

Get started with Dolt