Automatic Statistic Updates

February 16, 2024

5 min read

Dolt is close to 100% MySQL compatible and only 2x slower than MySQL. Few customers find bugs that we cannot fix in under 24 hours, and the difference between a 10ms query in Dolt that is 5ms in MySQL is usually OK. But on rare occasions we still have users reporting ten table joins that takes 10 seconds in MySQL and 10 minutes in Dolt.

As a result, we have been building defenses against nightmare-scenario execution plans. In our case that means database statistics and costed query planning. In the short term statistics improve some queries using better cardinality estimates. But more importantly, in the long term, costed query planning protects customers with data distributions anomalies that perturb query performance.

We pride ourselves on same day turnaround for most bugs, and now we can build custom statistics types, collection patterns, and optimizations for a new class of performance issues.

Automatic statistics collection is the last remaining piece to safely use database statistics in production. Statistics are still off by default, but Dolt version >=1.34.0 includes options to automatically maintaining statistics in background threads.

Background

Maintaining database statistics includes two problems: accuracy and freshness. Accuracy is the ability of histograms to usefully estimate query cardinality. Previous blogs focused on how statistics accuracy contributes to indexscan and join selection performance. Freshness is how far the contents of a database stray from histogram summaries in between updates.

Most databases only rely on histogram accuracy to optimize a handful of queries. On the other hand, maintaining statistics freshness impacts the entire server. We have two goals to balance: 1) keep statistics accurate enough, 2) avoid starving the server of useful work.

Different databases have different approaches for how to integrate statistics management. Postgres updates freshness via its garbage collection VACUUM interface. MySQL has an innodb_stats_auto_recalc that runs a background update thread. Both use counters to measure freshness and trigger updates when the counters exceed a threshold. A table "recalculate" or "refresh" in these databases samples rows from disk and uses HyperLogLog sketches to estimate distinct and null counts.

The way Dolt configures, estimates freshness, and performs updates is a bit different than traditional relational databases. This blog will focus specifically on how to configure and use statistics auto refresh, saving the technical deep dive for a future blog. All of the features here can also be found in our docs.

Demo

Setup

We will init a database with dolt version >= 1.34.0:

mkdir db1
cd db1
dolt init
dolt sql

Configure

Statistic updates need to consider 1) statistics deviate from reality over time, and 2) updates are expensive. The two levers we have over these goals are: 1) how often we consider refreshing stats (update interval), and 2) what freshness threshold triggers updates (update threshold).

We will set the update interval to 10 seconds and refresh threshold to 10% (~10% of the table has changed) to start (note: these are often used with set @@PERSIST.<var> = <val> syntax to outlast server restarts):

set @@GLOBAL.dolt_stats_auto_refresh_interval = 10;
set @@GLOBAL.dolt_stats_auto_refresh_threshold = .10;

Now let's add a table and data.

create table table1 (x int primary key, y int, key (y,x));
insert into table1 values (-5, 0), (-4, 0), (-3, 0), (-2,0), (-1,0);
-- Query OK, 5 rows affected (0.01 sec)

The table has two indexes we will collect histograms for, primary(x) and yx(y,x). Note that we have also added data, so the update "threshold" is satisfied. But no statistics have been collected because we have not run ANALYZE or a refresh thread yet:

select * from dolt_statistics;
-- no statistics found

Starting A Stats Thread

If we want statistics to initialize on start up we could add a persisted variable and restart:

dolt sql -q "set @@PERSIST.dolt_stats_auto_refresh_enabled = 1;"`

I am going to stick to a single session in this demo. We will use the stats controller functions to avoid restarting servers and manage the end-to-end lifecycle from a client. The only read-only controller is dolt_stats_status(). The status function currently shows the last event for a database (this interface is unstable and will likely become a table function soon). We will use it periodically to monitor what the threads are doing:

call dolt_stats_status();
-- +------------------------+
-- | message                |
-- +------------------------+
-- | no active stats thread |
-- +------------------------+

We will start a refresh thread and then query a few times until we reach the 10 second limit:

-- start refresh thread
call dolt_stats_restart();
-- +----------------------------------------------+
-- | message                                      |
-- +----------------------------------------------+
-- | restarted stats collection: refs/statistics/ |
-- +----------------------------------------------+

select * from dolt_statistics;
-- no statistics found
select * from dolt_statistics;
-- no statistics found
select * from dolt_statistics;
-- no statistics found
select * from dolt_statistics;
-- no statistics found
select * from dolt_statistics;
-- +---------------+------------+------------+----------+---------+----------------------------------+-----------+----------------+------------+---------+---------+-------------+-----------------+---------------------+------+------+------+------+-----------+
-- | database_name | table_name | index_name | position | version | commit_hash                      | row_count | distinct_count | null_count | columns | types   | upper_bound | upper_bound_cnt | created_at          | mcv1 | mcv2 | mcv3 | mcv4 | mcvCounts |
-- +---------------+------------+------------+----------+---------+----------------------------------+-----------+----------------+------------+---------+---------+-------------+-----------------+---------------------+------+------+------+------+-----------+
-- | db1           | table1     | primary    | 0        | 1       | tocmg4b8gvt0e0tigsveoc9cq0s8od8r | 5         | 5              | 0          | x       | int     | -1          | 1               | 0001-01-01 00:00:00 | -1   | -4   | -3   |      | 1,1,1     |
-- | db1           | table1     | yx         | 0        | 1       | c8r6u82rs47ck34r28qvotnjhm4uhg1e | 5         | 5              | 0          | y,x     | int,int | 0,-1        | 1               | 0001-01-01 00:00:00 | 0,-1 | 0,-4 | 0,-3 |      | 1,1,1     |
-- +---------------+------------+------------+----------+---------+----------------------------------+-----------+----------------+------------+---------+---------+-------------+-----------------+---------------------+------+------+------+------+-----------+

After 10 seconds the update runs, the threshold is exceeded, and our statistics appear.

But maybe 10 seconds is too slow. Lets set the interval to 0, which frees the background thread to run continuously. We will need to restart the thread to pickup this configuration change:

-- change the update interval
set @@GLOBAL.dolt_stats_auto_refresh_interval = 0;
call dolt_stats_restart();
-- +----------------------------------------------+
-- | message                                      |
-- +----------------------------------------------+
-- | restarted stats collection: refs/statistics/ |
-- +----------------------------------------------+

Now we'll update some rows and confirm the immediate update:

-- update rows
update table1 set y = 1 where x < 0;
-- Query OK, 5 rows affected (0.00 sec)
-- Rows matched: 5  Changed: 5  Warnings: 0

-- statistics updates should appear immediately
call dolt_stats_status();
-- +---------------------------------------------------+
-- | message                                           |
-- +---------------------------------------------------+
-- | updated to hash: cp4sc7qgkqton854cuqigtmvf6h5o0j9 |
-- +---------------------------------------------------+

In practice statistics updates are asynchronous, and large tables will have a less snappy responsiveness. In this case, queries in-between the start and finalization of the statistics update see the old statistics.

Stop A Stats Thread

The dolt_stats_stop() procedure stops a thread, discontinuing stats collection:

call dolt_stats_stop();
-- +---------------------+
-- | message             |
-- +---------------------+
-- | stopped thread: db1 |
-- +---------------------+
call dolt_stats_status();
-- +-----------------------+
-- | message               |
-- +-----------------------+
-- | cancelled thread: db1 |
-- +-----------------------+

Stopping a thread does not mutate the database statistics.

Dropping Database Statistics

The call dolt_stats_drop() procedure deletes database statistics from disk:

call dolt_stats_drop();
-- +---------------------------+
-- | message                   |
-- +---------------------------+
-- | deleted stats ref for db1 |
-- +---------------------------+
select * from dolt_statistics;
-- no statistics found

This procedure is "irreversible" in that you need to rerun ANALYZE or start an update thread to build a new set of statistics:

analyze table table1;
-- +--------+---------+----------+----------+
-- | Table  | Op      | Msg_type | Msg_text |
-- +--------+---------+----------+----------+
-- | table1 | analyze | status   | OK       |
-- +--------+---------+----------+----------+
call dolt_stats_status();
-- +---------------------------------------------------+
-- | message                                           |
-- +---------------------------------------------------+
-- | updated to hash: cp4sc7qgkqton854cuqigtmvf6h5o0j9 |
-- +---------------------------------------------------+

For large databases or databases with many tables, re-running ANALYZE can be slow.

Database Branches

Statistics are serialized and persisted as Dolt tables, but are not versioned. A database will only have one set of statistics collected from the working set of the active branch at the time of the latest statistics update. In other words, all branches on a server currently share the same database statistics. The branch for a refresh thread tracks the default branch, select active_branch() for new sessions. Changing the default branch changes the statistics source.

Different databases have distinct sets of statistics. A sql-server apples the global statistics configuration on startup equally to all databases. The controller functions give finer control over individual database threads.

Conclusion

Dolt database statistics can now be managed by automatic refresh threads. This blog walks through how to setup, configure, and manage these threads. Refer to previous blogs for how statistics can improve index and join performance.

The next blog in the series will talk about how Dolt's unique storage format lends itself to efficient statistics updates.

Until then, if you have any questions about Dolt, databases, or Golang performance reach out to us on Twitter, Discord, and GitHub!

Blog