Dolt now supports Column Statistics

Dolt is a version-controlled database that combines the version control features of Git with the SQL functionality of MySQL. In this release, we added column statistics, getting us one step closer to MySQL's feature set. Our analyzer can use these statistics to generate better query plans. In this blog post, we'll go over how to calculate column statistics and what they mean for performance.

Generate Column Statistics

You can generate column statistics by running analyze table <table_name>. The calculated statistics can be found in the information_schema.column_statistics table. Currently, the schema displayed is a modified version of MySQL's:

stats_db> describe information_schema.column_statistics;
+----------------+-----------------+------+-----+---------+-------+
| Field          | Type            | Null | Key | Default | Extra |
+----------------+-----------------+------+-----+---------+-------+
| SCHEMA_NAME    | longtext        | NO   |     | NULL    |       |
| TABLE_NAME     | longtext        | NO   |     | NULL    |       |
| COLUMN_NAME    | longtext        | NO   |     | NULL    |       |
| MEAN           | double          | NO   |     | NULL    |       |
| MIN            | double          | NO   |     | NULL    |       |
| MAX            | double          | NO   |     | NULL    |       |
| COUNT          | bigint unsigned | NO   |     | NULL    |       |
| NULL_COUNT     | bigint unsigned | NO   |     | NULL    |       |
| DISTINCT_COUNT | bigint unsigned | NO   |     | NULL    |       |
| BUCKETS        | longtext        | NO   |     | NULL    |       |
+----------------+-----------------+------+-----+---------+-------+

Here is an example of generating some column statistics from a simple table:

stats_db> create table t (i int primary key);
stats_db> insert into t values (1), (2), (3), (4), (5), (6);
Query OK, 6 rows affected
stats_db> analyze table t;
+-------+---------+----------+----------+
| Table | Op      | Msg_type | Msg_text |
+-------+---------+----------+----------+
| t     | analyze | status   | OK       |
+-------+---------+----------+----------+
stats_db> select * from information_schema.column_statistics;
+-------------+------------+-------------+------+-----+-----+-------+------------+----------------+---------------------------------------------------------------------------------------------------------------------+
| SCHEMA_NAME | TABLE_NAME | COLUMN_NAME | MEAN | MIN | MAX | COUNT | NULL_COUNT | DISTINCT_COUNT | BUCKETS                                                                                                             |
+-------------+------------+-------------+------+-----+-----+-------+------------+----------------+---------------------------------------------------------------------------------------------------------------------+
| stats_db    | t          | i           | 3.5  | 1   | 6   | 6     | 0          | 6              | [[1.00, 1.00, 0.17],[2.00, 2.00, 0.17],[3.00, 3.00, 0.17],[4.00, 4.00, 0.17],[5.00, 5.00, 0.17],[6.00, 6.00, 0.17]] |
+-------------+------------+-------------+------+-----+-----+-------+------------+----------------+---------------------------------------------------------------------------------------------------------------------+

The BUCKETS column is a representation of a histogram, where each sub-array is a bucket of the form [LOWER_BOUND, UPPER_BOUND, FREQUENCY]. In the example above, each of the 6 distinct values appears exactly once, so every bucket has a frequency of 1/6 ≈ 0.17. These histograms are cached and used internally to help predict how many rows a query will touch. The other columns, like MIN and MAX, are pretty self-explanatory for most readers who made it this far.
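
To see how the frequencies respond to skew, here's a quick sketch using a table with a repeated value. The table name skewed is just for illustration, and the exact bucket output may vary by Dolt version:

stats_db> create table skewed (id int primary key, v int);
stats_db> insert into skewed values (1, 10), (2, 10), (3, 10), (4, 20);
Query OK, 4 rows affected
stats_db> analyze table skewed;
stats_db> select BUCKETS from information_schema.column_statistics where TABLE_NAME = 'skewed' and COLUMN_NAME = 'v';
+---------------------------------------------+
| BUCKETS                                     |
+---------------------------------------------+
| [[10.00, 10.00, 0.75],[20.00, 20.00, 0.25]] |
+---------------------------------------------+

Value 10 accounts for 3 of the 4 rows, so its bucket carries a frequency of 0.75.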

Performance Implications

Generating optimal query plans is hard; in fact, it's NP-hard. Having statistics, however, can make query planning faster and cost estimates more accurate. Currently, our cost estimates are simplified: they use only the number of rows returned by each operation.

For example, let's examine the cost of a filtered join between tables a and b.

This is how tables a and b are created:

stats_db> create table a (x int primary key);
stats_db> insert into a values (1), (2), (3);
stats_db> create table b (y int primary key);
stats_db> insert into b values (4), (5), (6);

This is the expected result from running a filtered join:

stats_db> select * from a join b where a.x = 1 and b.y = 4;
+---+---+
| x | y |
+---+---+
| 1 | 4 |
+---+---+

Since our cost estimates only consider row counts, we expect to see a cost of 1.00.

Without calling analyze on either a or b, this is the plan and estimated cost we get:

stats_db> explain select * from a join b where a.x = 1 and b.y = 4;
+----------------------------------------------------------------+
| plan                                                           |
+----------------------------------------------------------------+
| Estimated Cost: 9.00                                           |
| CrossJoin                                                      |
|  ├─ Projected table access on [x]                              |
|  │   └─ IndexedTableAccess(a on [a.x] with ranges: [{[1, 1]}]) |
|  └─ Projected table access on [y]                              |
|      └─ IndexedTableAccess(b on [b.y] with ranges: [{[4, 4]}]) |
+----------------------------------------------------------------+

Without a histogram, the analyzer has no idea what the distribution of values in columns a.x and b.y looks like, so it defaults to assuming every row will be touched, assigning a cost of 3.00 to each of a.x and b.y. As a result, we get an estimated cost of 3.00 * 3.00 = 9.00 (row counts are multiplied for a cross join), which is pretty far from what we expect.
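
Those fallback numbers are just the per-table row counts, which you can confirm directly:

stats_db> select count(*) from a;
+----------+
| count(*) |
+----------+
| 3        |
+----------+
stats_db> select count(*) from b;
+----------+
| count(*) |
+----------+
| 3        |
+----------+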

Here is the plan after calling analyze on both tables:

stats_db> analyze table a, b;
+-------+---------+----------+----------+
| Table | Op      | Msg_type | Msg_text |
+-------+---------+----------+----------+
| a     | analyze | status   | OK       |
| b     | analyze | status   | OK       |
+-------+---------+----------+----------+
stats_db> explain select * from a join b where a.x = 1 and b.y = 4;
+----------------------------------------------------------------+
| plan                                                           |
+----------------------------------------------------------------+
| Estimated Cost: 1.00                                           |
| CrossJoin                                                      |
|  ├─ Projected table access on [x]                              |
|  │   └─ IndexedTableAccess(a on [a.x] with ranges: [{[1, 1]}]) |
|  └─ Projected table access on [y]                              |
|      └─ IndexedTableAccess(b on [b.y] with ranges: [{[4, 4]}]) |
+----------------------------------------------------------------+

Now, the analyzer has a good idea of the distribution of values in tables a and b, so it knows that there is only 1 row with value 1 in a.x and only 1 row with value 4 in b.y. Thus, we are able to provide a much more accurate estimate of the number of rows returned from this operation, which better reflects the actual amount of work required to complete the query.
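
You can inspect the statistics the analyzer is drawing on by querying information_schema.column_statistics for both tables. The output here is trimmed to the relevant columns, and the exact formatting may vary:

stats_db> select TABLE_NAME, COLUMN_NAME, `COUNT`, DISTINCT_COUNT from information_schema.column_statistics where TABLE_NAME in ('a', 'b');
+------------+-------------+-------+----------------+
| TABLE_NAME | COLUMN_NAME | COUNT | DISTINCT_COUNT |
+------------+-------------+-------+----------------+
| a          | x           | 3     | 3              |
| b          | y           | 3     | 3              |
+------------+-------------+-------+----------------+

With 3 rows and 3 distinct values in each column, an equality filter on either column is expected to match exactly one row.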

Here is an example of where statistics are useful for plan selection:

stats_db> explain select * from a join b where b.y < 0;
+-------------------------------------------------------------------+
| plan                                                              |
+-------------------------------------------------------------------+
| Estimated Cost: 0.00                                              |
| CrossJoin                                                         |
|  ├─ Projected table access on [x]                                 |
|  │   └─ Exchange                                                  |
|  │       └─ Table(a)                                              |
|  └─ Projected table access on [y]                                 |
|      └─ IndexedTableAccess(b on [b.y] with ranges: [{(NULL, 0)}]) |
+-------------------------------------------------------------------+

The estimated cost above is correct, but the produced plan is inefficient. If table a had over a million rows, this plan would scan all million of them only to return an empty set. A much better plan would flip the join order from a join b to b join a; that plan would terminate almost immediately, since column b.y doesn't contain any values less than 0. Once we plumb table statistics through the optimizer, our join planner will flip the join order for you, like other modern SQL database engines do. Stay tuned.
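
In the meantime, you can flip the join order by hand. Assuming the analyzer preserves the written table order (as it does in the plans above), listing b first lets the empty indexed range on b.y short-circuit the query before a is ever scanned:

stats_db> select * from b join a where b.y < 0;

As always, check the explain output to confirm the plan you're actually getting.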

Conclusion

Dolt now supports column statistics, but this is just the beginning. Beyond providing useful metadata about your tables, column statistics are necessary for producing efficient query plans, especially during join search. Every day, Dolt gets closer to being a drop-in replacement for MySQL.

Try out column statistics on your existing Dolt tables. If for some reason you don't have any, swing by our Discord and we'll be more than happy to help you get started with Dolt!
