Querying Historical Data with AS OF Queries

7 min read

Dolt is Git for data. It's a SQL database that lets you branch, merge, and fork your data just like you would a Git repository. In previous blog posts we announced how you can use special system tables to query the history of your database. Today, we're excited to announce a simpler syntax for exploring the history of your datasets: AS OF queries.

SQL AS OF clauses

SQL introduced the AS OF syntax in SQL 2011. Its basic form is easy to understand:

SELECT * FROM myTable AS OF TIMESTAMP("2019-03-01");

As you would expect, this query returns rows from the table as it existed on the date provided.

Like many SQL features, its adoption and implementation has varied significantly across different vendors. The above simple syntax is actually a deviation from the SQL spec, adopted by both Oracle and Microsoft, but not MySQL / MariaDB. The details vary a lot too. In MySQL and SQL Server, you have to use a special syntax when you create your tables to enable versioning. In Oracle, you just turn on versioning on the entire database, and old-enough versions get garbage collected automatically. Postgres doesn't support versioned tables out of the box, requiring you to install a plugin first.

AS OF in Dolt

Dolt's versioning is a graph of commits, just like Git. So in addition to timestamp queries like in other database, you can query a Dolt database with Git-like refs in your AS OF clauses. The following are all valid Dolt SQL queries:

SELECT * FROM myTable AS OF TIMESTAMP("2020-03-01");
SELECT * FROM myTable AS OF 'HEAD~';
SELECT * FROM myTable AS OF 'HEAD~20';
SELECT * FROM myTable AS OF 'feature-branch';
SELECT * FROM myTable AS OF '4gunscn1j6ijtsj121m7bjaj4j8grjmv';

The latter four examples all use dolt refs instead of a timestamp, and work the way you would expect.

SELECT * FROM myTable AS OF 'HEAD~';

This query selects rows from the table as it existed at the commit prior to the current HEAD commit. It's equivalent to doing this:

% dolt checkout -b prior-commit HEAD~
% dolt sql -q "SELECT * FROM myTable"

Similarly, for other queries:

SELECT * FROM myTable AS OF 'HEAD~20';

This query selects rows from the table as it existed 20 commits before the current HEAD.

SELECT * FROM myTable AS OF 'feature-branch';

This query selects rows from the table as it exists on the current HEAD of the feature-branch branch.

SELECT * FROM myTable AS OF '4gunscn1j6ijtsj121m7bjaj4j8grjmv';

This query selects rows from the table as it exists on the commit with the hash 4gunscn1j6ijtsj121m7bjaj4j8grjmv.

A real-world example

To demonstrate this feature, let's run some queries on DoltHub's Coronavirus dataset.

To start, we'll clone the dataset so we can run SQL on it:

% dolt clone dolthub/corona-virus
% cd corona-virus

Let's use the mortality_rate view we introduced a few weeks ago to examine how mortality of the disease has changed over time.

Here's the current mortality rates per country:

doltsql> select * from mortality_rates limit 10;
+----------------+----------------+--------------+-------+--------+-----------+---------------------+
| country        | state          | last updated | cases | deaths | recovered | mortality_rate      |
+----------------+----------------+--------------+-------+--------+-----------+---------------------+
| Sudan          |                | 2020-03-18   | 2     | 1      | 0         | 0.49751243781094534 |
| United Kingdom | Cayman Islands | 2020-03-19   | 3     | 1      | 0         | 0.33222591362126247 |
| Netherlands    | Curacao        | 2020-03-19   | 3     | 1      | 0         | 0.33222591362126247 |
| Guyana         |                | 2020-03-17   | 7     | 1      | 0         | 0.14265335235378032 |
| Ukraine        |                | 2020-03-19   | 16    | 2      | 0         | 0.12492192379762647 |
| Guatemala      |                | 2020-03-19   | 9     | 1      | 0         | 0.11098779134295228 |
| Algeria        |                | 2020-03-19   | 87    | 9      | 32        | 0.10343638662222733 |
| San Marino     |                | 2020-03-18   | 119   | 11     | 4         | 0.09242920762961096 |
| US             | South Dakota   | 2020-03-17   | 11    | 1      | 0         | 0.09082652134423251 |
| Cuba           |                | 2020-03-19   | 11    | 1      | 0         | 0.09082652134423251 |
+----------------+----------------+--------------+-------+--------+-----------+---------------------+

As you can see, the highest mortality rates are currently places with a small number of deaths and inadequate testing. Let's see what the picture looked like a week ago:

doltsql> select * from mortality_rates as of timestamp('2020-03-13') limit 10;
+-----------+-------------------+--------------+-------+--------+-----------+---------------------+
| country   | state             | last updated | cases | deaths | recovered | mortality_rate      |
+-----------+-------------------+--------------+-------+--------+-----------+---------------------+
| Morocco   |                   | 2020-03-11   | 5     | 1      | 0         | 0.1996007984031936  |
| Bulgaria  |                   | 2020-03-11   | 7     | 1      | 0         | 0.14265335235378032 |
| Panama    |                   | 2020-03-11   | 8     | 1      | 0         | 0.12484394506866417 |
| US        | South Dakota      | 2020-03-11   | 8     | 1      | 0         | 0.12484394506866417 |
| Australia | Western Australia | 2020-03-11   | 9     | 1      | 0         | 0.11098779134295228 |
| Iraq      |                   | 2020-03-11   | 71    | 7      | 15        | 0.09857766511758906 |
| Albania   |                   | 2020-03-11   | 12    | 1      | 0         | 0.08326394671107411 |
| US        | Washington        | 2020-03-11   | 366   | 29     | 1         | 0.07923280784678015 |
| US        | Florida           | 2020-03-11   | 28    | 2      | 0         | 0.07140307033202427 |
| Italy     |                   | 2020-03-11   | 12462 | 827    | 1045      | 0.0663616864374206  |
+-----------+-------------------+--------------+-------+--------+-----------+---------------------+

Still topped by places with small death counts and limited testing, but now Italy and Washington State crack the top 10.

For reference, if we limit the list to places with signficant testing, mortality looks like this:

doltsql> select * from mortality_rates as of timestamp('2020-03-13') where cases > 100 limit 10;
+---------+--------------+--------------+-------+--------+-----------+----------------------+
| country | state        | last updated | cases | deaths | recovered | mortality_rate       |
+---------+--------------+--------------+-------+--------+-----------+----------------------+
| US      | Washington   | 2020-03-11   | 366   | 29     | 1         | 0.07923280784678015  |
| Italy   |              | 2020-03-11   | 12462 | 827    | 1045      | 0.0663616864374206   |
| China   | Hubei        | 2020-03-11   | 67773 | 3046   | 49134     | 0.0449441451692938   |
| Iran    |              | 2020-03-11   | 9000  | 354    | 2959      | 0.039333289629678185 |
| China   | Hainan       | 2020-03-08   | 168   | 6      | 159       | 0.035712159990476756 |
| China   | Heilongjiang | 2020-03-11   | 482   | 13     | 436       | 0.026970394805087033 |
| China   | Hong Kong    | 2020-03-11   | 126   | 3      | 65        | 0.023807634314736925 |
| Spain   |              | 2020-03-11   | 2277  | 54     | 183       | 0.023715310868199963 |
| Japan   |              | 2020-03-11   | 639   | 15     | 118       | 0.023473811051470242 |
| China   | Tianjin      | 2020-03-10   | 136   | 3      | 131       | 0.022057201676347327 |
+---------+--------------+--------------+-------+--------+-----------+----------------------+
doltsql> select * from mortality_rates where cases > 100 limit 10;
+----------------+----------------+--------------+-------+--------+-----------+----------------------+
| country        | state          | last updated | cases | deaths | recovered | mortality_rate       |
+----------------+----------------+--------------+-------+--------+-----------+----------------------+
| San Marino     |                | 2020-03-18   | 119   | 11     | 4         | 0.09242920762961096  |
| Italy          |                | 2020-03-19   | 41035 | 3405   | 4440      | 0.08297792543489084  |
| Indonesia      |                | 2020-03-19   | 311   | 25     | 11        | 0.08038326741905405  |
| Philippines    |                | 2020-03-19   | 217   | 17     | 8         | 0.07833740380627621  |
| Iran           |                | 2020-03-19   | 18407 | 1284   | 5710      | 0.06975603316345241  |
| Iraq           |                | 2020-03-19   | 192   | 13     | 43        | 0.06770480704129994  |
| US             | Washington     | 2020-03-19   | 1376  | 74     | 0         | 0.05377867893401937  |
| United Kingdom | United Kingdom | 2020-03-19   | 2689  | 137    | 65        | 0.05094811845251598  |
| Spain          |                | 2020-03-19   | 17963 | 830    | 1107      | 0.046206064573810296 |
| China          | Hubei          | 2020-03-19   | 67800 | 3130   | 57682     | 0.04616518493138866  |
+----------------+----------------+--------------+-------+--------+-----------+----------------------+

Disturbingly, Italy's mortality rate has actually gotten even worse in the last week.

Comparing alternate versions of your data

So far, every query we've demonstrated can be done in any normal SQL database from most vendors (maybe with a little bit more setup work). But now let's examine a use case where Dolt has unique capabilities and really shines.

Earlier this week, we discussed how we were using branches to manage data of questionable but potentially useful quality. In that blog post, to query the alternate version of the data, you had to drop out of the SQL shell and checkout the branch, like so:

% dolt checkout master
Switched to branch 'master'
% dolt sql -q "select * from mortality_rate_by_age_range"
...
% dolt checkout case_details_virological_dot_org
% dolt sql -q "select * from mortality_rate_by_age_range"

But as of this week's Dolt release, you can query the alternate data without leaving the SQL shell, just by naming the branch in the AS OF clause.

doltsql> select * from mortality_rate_by_age_range AS OF 'case_details_virological_dot_org';

Future work

There's a lot more work to do for this kind of query. SQL 2011 supports several different types of temporal queries that we haven't yet implemented, such as selecting all rows between two revisions:

SELECT * FROM t FOR SYSTEM_TIME BETWEEN (NOW() - INTERVAL 1 YEAR) AND NOW();

If you are interested in Dolt and want to us to prioritize these use cases, please let us know!

We also have a fair amount of bugs to fix. For example, our engine can't currently join a table to itself. This makes it impossible to join rows in a table to earlier revisions of themselves, which is an incredibly useful feature we're working hard to unlock.

Also please note: the examples in this blog entry use features that are brand new and have not yet been released. If you're reading this on the publication date, AS OF timestamp queries are currently only available by building from source. We'll have a release early next week that includes timestamp support.

Conclusion

Dolt is a versioned database that now lets you easily query earlier revisions of your data with no additional setup work. But unlike other versioned databases, Dolt lets you keep alternate or experimental versions of your data on different branches. Querying those alternate versions or their prior revisions is as simple as including an AS OF clause in your SQL. Download Dolt today to try it out yourself!

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt