Introducing SQL VIEW Support in Dolt

8 min read

Dolt is a SQL database with Git-style versioning and distribution. The most recent releases of Dolt introduced support for SQL views that are stored as part of, and versioned along with, a Dolt repository. This provides a great way for data sets to distribute pre-built queries that might be useful to consumers of the data — things like aggregations, pivots and joins often appear in SQL view statements. Views can also be a good way to build up more complicated SQL queries from easy-to-understand and self documenting components.

Let's take a look at how the new view functionality in Dolt SQL works.

A Dataset to Work With

To demonstrate views we'll work with a simple dataset that tracks the outbreak of Coronavirus. The dataset comes from a Google Sheet published by Johns Hopkins University which we mirror in a Dolthub repository. We can checkout the repository and get our bearings with the data easily:

shell$ dolt clone dolthub/corona-virus
cloning https://doltremoteapi.dolthub.com/dolthub/corona-virus
112 of 112 chunks complete. 0 chunks being downloaded currently.
shell$ cd corona-virus
shell$ dolt sql
# Welcome to the DoltSQL shell.
# Statements must be terminated with ';'.
# "exit" or "quit" (or Ctrl-D) to exit.
doltsql> show tables;
+--------------+
| Table        |
+--------------+
| cases        |
| dolt_schemas |
| places       |
+--------------+
doltsql> describe cases;
+------------------+----------+------+-----+---------+-------+
| Field            | Type     | Null | Key | Default | Extra |
+------------------+----------+------+-----+---------+-------+
| observation_time | DATETIME | NO   | PRI |         |       |
| place_id         | BIGINT   | NO   | PRI |         |       |
| confirmed_count  | BIGINT   | YES  |     |         |       |
| recovered_count  | BIGINT   | YES  |     |         |       |
| death_count      | BIGINT   | YES  |     |         |       |
+------------------+----------+------+-----+---------+-------+

As described in the repository's README, this table mirrors the three Johns Hopkins sheets linked above. The rows of the sheets have been mapped to a place_id and the sheet has been unpivoted into a SQL table by mapping each observation column into an observation_time row.

Digging In

This makes the table nice to work with for some types of queries. For example, the following shows how cases have progressed in the province of Hubei, where Wuhan is located:

doltsql> SELECT cases.* FROM cases NATURAL JOIN places
         WHERE places.province_state = 'Hubei'
         ORDER BY observation_time ASC;
+----------+-------------------------------+-----------------+-----------------+-------------+
| place_id | observation_time              | confirmed_count | recovered_count | death_count |
+----------+-------------------------------+-----------------+-----------------+-------------+
| 1        | 2020-01-21 22:00:00 +0000 UTC | 270             | <NULL>          | <NULL>      |
| 1        | 2020-01-22 12:00:00 +0000 UTC | 444             | <NULL>          | <NULL>      |
| 1        | 2020-01-23 12:00:00 +0000 UTC | <NULL>          | 28              | <NULL>      |
| 1        | 2020-01-24 00:00:00 +0000 UTC | 549             | 31              | 24          |
| 1        | 2020-01-25 00:00:00 +0000 UTC | 729             | 32              | 32          |
| 1        | 2020-01-25 12:00:00 +0000 UTC | 761             | <NULL>          | 40          |
| 1        | 2020-01-25 22:00:00 +0000 UTC | 1052            | 42              | 52          |
| 1        | 2020-01-26 11:00:00 +0000 UTC | 1058            | <NULL>          | <NULL>      |
| 1        | 2020-01-26 23:00:00 +0000 UTC | 1423            | 44              | 76          |
| 1        | 2020-01-27 09:00:00 +0000 UTC | <NULL>          | 45              | <NULL>      |
| 1        | 2020-01-27 20:30:00 +0000 UTC | 2714            | 47              | 100         |
| 1        | 2020-01-28 13:00:00 +0000 UTC | <NULL>          | 52              | <NULL>      |
| 1        | 2020-01-28 18:00:00 +0000 UTC | 3554            | 80              | 125         |
| 1        | 2020-01-29 14:30:00 +0000 UTC | <NULL>          | 88              | <NULL>      |
| 1        | 2020-01-29 21:00:00 +0000 UTC | 4586            | 90              | 162         |
| 1        | 2020-01-30 11:00:00 +0000 UTC | 4903            | <NULL>          | <NULL>      |
| 1        | 2020-01-31 14:00:00 +0000 UTC | 5806            | 141             | 204         |
| 1        | 2020-02-01 10:00:00 +0000 UTC | 7153            | 168             | 249         |
| 1        | 2020-02-02 21:00:00 +0000 UTC | 11177           | 295             | 350         |
| 1        | 2020-02-03 21:00:00 +0000 UTC | 13522           | 386             | 414         |
| 1        | 2020-02-04 09:40:00 +0000 UTC | <NULL>          | 396             | <NULL>      |
| 1        | 2020-02-04 22:00:00 +0000 UTC | 16678           | 522             | 479         |
| 1        | 2020-02-05 09:00:00 +0000 UTC | <NULL>          | 537             | <NULL>      |
| 1        | 2020-02-05 23:00:00 +0000 UTC | 19665           | 651             | 549         |
| 1        | 2020-02-06 09:00:00 +0000 UTC | <NULL>          | 712             | <NULL>      |
| 1        | 2020-02-07 20:13:00 +0000 UTC | 22112           | 867             | 618         |
| 1        | 2020-02-07 22:50:00 +0000 UTC | 24953           | 1115            | 699         |
| 1        | 2020-02-08 10:24:00 +0000 UTC | <NULL>          | 1218            | <NULL>      |
| 1        | 2020-02-08 23:04:00 +0000 UTC | 27100           | 1440            | 780         |
| 1        | 2020-02-09 10:30:00 +0000 UTC | <NULL>          | 1480            | <NULL>      |
| 1        | 2020-02-09 23:20:00 +0000 UTC | 29631           | 1795            | 871         |
| 1        | 2020-02-10 11:00:00 +0000 UTC | <NULL>          | 1854            | <NULL>      |
+----------+-------------------------------+-----------------+-----------------+-------------+

As you can see, some updates for the spreadsheet include observations for certain fields and not others. This makes it slightly inconvenient to answer the question "what is the most recent observation for each count in each location", for example. For that reason, the dataset travels alongside with some views that can easily provide exactly those answers.

Introducing Views

Views in a Dolt repository travel alongside the dataset and are versioned in exactly the same way. Currently, you can see what views are available in a dolt repository by selecting from the dolt system table dolt_schemas.

doltsql> select type, name from dolt_schemas;
+------+-------------------+
| type | name              |
+------+-------------------+
| view | current           |
| view | current_cases     |
| view | current_deaths    |
| view | current_recovered |
+------+-------------------+

As you can see, we store the view definitions in this table. The table is modified by CREATE VIEW and DROP VIEW statements, but it is versioned alongside the Dolt repository. That means it's available for Dolt CLI interactions like dolt add, dolt blame and dolt reset. When you create a view, you'll see a new row in the dolt_schemas table when you run dolt diff. You should dolt add the dolt_schemas table to your commit if you want the view to travel alongside your dataset.

Let's dig a little bit more into the views that are included in the corona-virus dataset. Selecting from the current view, we can get a row for each location in the dataset, and that locations most recent non-NULL observation for each of the three fields:

doltsql> select * from current limit 10;
+----------------+-----------+-------+--------+-----------+
| country        | state     | cases | deaths | recovered |
+----------------+-----------+-------+--------+-----------+
| Mainland China | Hubei     | 29631 | 871    | 1854      |
| Mainland China | Guangdong | 1159  | 1      | 167       |
| Mainland China | Zhejiang  | 1092  | 0      | 242       |
| Mainland China | Henan     | 1073  | 6      | 191       |
| Mainland China | Hunan     | 879   | 1      | 208       |
| Mainland China | Anhui     | 830   | 3      | 88        |
| Mainland China | Jiangxi   | 771   | 1      | 105       |
| Mainland China | Jiangsu   | 492   | 0      | 81        |
| Mainland China | Chongqing | 473   | 2      | 66        |
| Mainland China | Shandong  | 466   | 1      | 66        |
+----------------+-----------+-------+--------+-----------+

We can inspect how the view is defined by selecting its schema fragment from the dolt_schemas table:

doltsql> select fragment from dolt_schemas where name = 'current';
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| fragment                                                                                                                                                                                                                                                                                                                                                                                                               |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| select current_cases.country, current_cases.state, current_cases.cases, current_deaths.deaths, current_recovered.recovered from current_cases left join current_deaths on current_cases.country=current_deaths.country and current_cases.state=current_deaths.state left join current_recovered on current_cases.country=current_recovered.country and current_cases.state=current_recovered.state order by cases desc |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Which is a little hard to read, but just reformatting gives us:

SELECT
  current_cases.country,
  current_cases.state,
  current_cases.cases,
  current_deaths.deaths,
  current_recovered.recovered
FROM
  current_cases
LEFT JOIN
  current_deaths
  ON current_cases.country=current_deaths.country
    AND current_cases.state=current_deaths.state
LEFT JOIN
  current_recovered
  ON current_cases.country=current_recovered.country
    AND current_cases.state=current_recovered.state
ORDER BY
  cases DESC

Each one of current_cases, current_deaths and current_recovered looks similar. They each return the most recent observation time for each location which has an actual observation for a given field.

This also highlights how views can help build up a complex query from self-contained, self-documenting pieces. The query plan for select * from current can give a good overview of what's going on when Dolt computes the results:

doltsql> explain select * from current;
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan                                                                                                                                                                                            |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| SubqueryAlias(current)                                                                                                                                                                          |
|  └─ Sort(current_cases.cases DESC)                                                                                                                                                              |
|      └─ Project(current_cases.country, current_cases.state, current_cases.cases, current_deaths.deaths, current_recovered.recovered)                                                            |
|          └─ LeftJoin(current_cases.country = current_recovered.country AND current_cases.state = current_recovered.state)                                                                       |
|              ├─ LeftJoin(current_cases.country = current_deaths.country AND current_cases.state = current_deaths.state)                                                                         |
|              │   ├─ SubqueryAlias(current_cases)                                                                                                                                                |
|              │   │   └─ Sort(cases ASC)                                                                                                                                                         |
|              │   │       └─ Project(country, state, MAX(convert(cases.observation_time, datetime)) as last updated, cases)                                                                      |
|              │   │           └─ GroupBy                                                                                                                                                         |
|              │   │               ├─ Aggregate(places.country_region as country, places.province_state as state, MAX(convert(cases.observation_time, datetime)), cases.confirmed_count as cases) |
|              │   │               ├─ Grouping(cases.place_id)                                                                                                                                    |
|              │   │               └─ Filter(NOT(cases.confirmed_count IS NULL))                                                                                                                  |
|              │   │                   └─ IndexedJoin(cases.place_id = places.place_id)                                                                                                           |
|              │   │                       ├─ cases                                                                                                                                               |
|              │   │                       └─ places                                                                                                                                              |
|              │   └─ SubqueryAlias(current_deaths)                                                                                                                                               |
|              │       └─ Sort(deaths DESC)                                                                                                                                                       |
|              │           └─ Project(country, state, MAX(convert(cases.observation_time, datetime)) as last updated, deaths)                                                                     |
|              │               └─ GroupBy                                                                                                                                                         |
|              │                   ├─ Aggregate(places.country_region as country, places.province_state as state, MAX(convert(cases.observation_time, datetime)), cases.death_count as deaths)    |
|              │                   ├─ Grouping(cases.place_id)                                                                                                                                    |
|              │                   └─ Filter(NOT(cases.death_count IS NULL))                                                                                                                      |
|              │                       └─ IndexedJoin(cases.place_id = places.place_id)                                                                                                           |
|              │                           ├─ cases                                                                                                                                               |
|              │                           └─ places                                                                                                                                              |
|              └─ SubqueryAlias(current_recovered)                                                                                                                                                |
|                  └─ Sort(recovered DESC)                                                                                                                                                        |
|                      └─ Project(country, state, MAX(convert(cases.observation_time, datetime)) as last updated, recovered)                                                                      |
|                          └─ GroupBy                                                                                                                                                             |
|                              ├─ Aggregate(places.country_region as country, places.province_state as state, MAX(convert(cases.observation_time, datetime)), cases.recovered_count as recovered) |
|                              ├─ Grouping(cases.place_id)                                                                                                                                        |
|                              └─ Filter(NOT(cases.recovered_count IS NULL))                                                                                                                      |
|                                  └─ IndexedJoin(cases.place_id = places.place_id)                                                                                                               |
|                                      ├─ cases                                                                                                                                                   |
|                                      └─ places                                                                                                                                                  |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Recap

SQL view support is a new Dolt feature that allows defined views to be stored inside and versioned alongside a Dolt repository. SQL views can be included in a repository to provide interesting ways of reframing the data, including aggregations, joins and pivots. In a Dolt repository, views definitions are stored in the dolt_schemas table and can be managed from there like any other table in a Dolt repository.

Coming Soon for Dolt View Support

This is just the start of view support in Dolt, and there's still more work to be done. Our roadmap includes the following missing functionality that will be coming shortly:

  • First Class Dolthub Support.
  • Views Appear In SHOW FULL TABLES.
  • SHOW CREATE VIEW Support.
  • ALTER VIEW Support.

Further Data Resources on Coronavirus

There are lots of resources across the Web for tracking data related to Coronavirus. Here are some useful links:

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt