Introducing Cell History Inspection on DoltHub

12 min read

Dolt and DoltHub are Git and GitHub for data. Having a versioned database makes collaborating on data more fluid and reliable in the same way that Git improves source code collaboration for software engineers. Using both Git and GitHub, engineers are able to inspect every line of code and dive into who made changes and when over time.

Dolt does the same thing for data. You can inspect the changes between versions of every cell in every table. We've exposed this functionality through our command line interface on Dolt, but we wanted to make it easier for users to get familiar with the features that make Dolt unique.

Dolt System Tables

In order to fully take advantage of what version control can do for data, we made the resulting metadata and data available through Dolt system tables. You can query this information using SQL in the same way you can query a table.

Our new cell history feature uses the dolt_diff system table, so I will use it as an example. You can see a description of all Dolt system tables in our docs.

The dolt_diff system table is a queryable table that shows the changes between versions of a row. Every table in Dolt has a corresponding dolt_diff_[tablename] system table. For example, we have a repository called dolthub/corona-virus with a table named places, which lists all places with Coronavirus cases. This is what the first 15 rows of the dolt_diff_places table look like using Dolt:

$ dolt sql -q "SELECT * FROM dolt_diff_places LIMIT 15"
+--------------+-------------------+-------------------+-------------+-------------+----------------------------------+-----------------------------------+----------------+---------------------+---------------------+---------------+---------------+----------------------------------+-----------------------------------+-----------+
| to_longitude | to_province_state | to_country_region | to_latitude | to_place_id | to_commit                        | to_commit_date                    | from_longitude | from_province_state | from_country_region | from_latitude | from_place_id | from_commit                      | from_commit_date                  | diff_type |
+--------------+-------------------+-------------------+-------------+-------------+----------------------------------+-----------------------------------+----------------+---------------------+---------------------+---------------+---------------+----------------------------------+-----------------------------------+-----------+
| <NULL>       | <NULL>            | <NULL>            | <NULL>      | <NULL>      | 0sieej4vrv3lnh8bu87n35les82piq7m | 2020-05-14 18:59:35.053 +0000 UTC | 0              | Recovered           | Canada              | 0             | 569           | qp2kccaqt3e7hdc3fcebbm18tq5j9dhm | 2020-05-14 17:38:34.042 +0000 UTC | removed   |
| 28.2336      |                   | Lesotho           | -29.61      | 590         | qp2kccaqt3e7hdc3fcebbm18tq5j9dhm | 2020-05-14 17:38:34.042 +0000 UTC | <NULL>         | <NULL>              | <NULL>              | <NULL>        | <NULL>        | 1u96d6g0h99ai7a8j4jl2a2jarff4dpu | 2020-05-14 16:32:36.568 +0000 UTC | added     |
| 43.3333      |                   | Comoros           | -11.6455    | 588         | kg3tr9s7k9832j7it9gbnrafb3n1rmkk | 2020-05-01 20:47:29.907 +0000 UTC | <NULL>         | <NULL>              | <NULL>              | <NULL>        | <NULL>        | ha2t0c6f5cfm9t4lnlaf4ak979jqiphc | 2020-05-01 20:26:18.34 +0000 UTC  | added     |
| 71.2761      |                   | Tajikistan        | 38.861      | 589         | kg3tr9s7k9832j7it9gbnrafb3n1rmkk | 2020-05-01 20:47:29.907 +0000 UTC | <NULL>         | <NULL>              | <NULL>              | <NULL>        | <NULL>        | ha2t0c6f5cfm9t4lnlaf4ak979jqiphc | 2020-05-01 20:26:18.34 +0000 UTC  | added     |
| 112.2707     | Hubei             | China             | 30.9756     | 1           | m15v3hh1oemcoh2b6d1sctervmm7j4os | 2020-04-15 10:23:41.946 +0000 UTC | 112.2707       | Hubei               | China               | 30.9756       | 1             | 7ur10o0i77m2085c4omdo8qbjhq17dki | 2020-04-15 09:47:05.476 +0000 UTC | modified  |
| 113.4244     | Guangdong         | China             | 23.3417     | 2           | m15v3hh1oemcoh2b6d1sctervmm7j4os | 2020-04-15 10:23:41.946 +0000 UTC | 113.4244       | Guangdong           | China               | 23.3417       | 2             | 7ur10o0i77m2085c4omdo8qbjhq17dki | 2020-04-15 09:47:05.476 +0000 UTC | modified  |
| 113.614      | Henan             | China             | 33.882      | 3           | m15v3hh1oemcoh2b6d1sctervmm7j4os | 2020-04-15 10:23:41.946 +0000 UTC | 113.614        | Henan               | China               | 33.882        | 3             | 7ur10o0i77m2085c4omdo8qbjhq17dki | 2020-04-15 09:47:05.476 +0000 UTC | modified  |
| 111.7088     | Hunan             | China             | 27.6104     | 4           | m15v3hh1oemcoh2b6d1sctervmm7j4os | 2020-04-15 10:23:41.946 +0000 UTC | 111.7088       | Hunan               | China               | 27.6104       | 4             | 7ur10o0i77m2085c4omdo8qbjhq17dki | 2020-04-15 09:47:05.476 +0000 UTC | modified  |
| 115.7221     | Jiangxi           | China             | 27.614      | 5           | m15v3hh1oemcoh2b6d1sctervmm7j4os | 2020-04-15 10:23:41.946 +0000 UTC | 115.7221       | Jiangxi             | China               | 27.614        | 5             | 7ur10o0i77m2085c4omdo8qbjhq17dki | 2020-04-15 09:47:05.476 +0000 UTC | modified  |
| 117.2264     | Anhui             | China             | 31.8257     | 6           | m15v3hh1oemcoh2b6d1sctervmm7j4os | 2020-04-15 10:23:41.946 +0000 UTC | 117.2264       | Anhui               | China               | 31.8257       | 6             | 7ur10o0i77m2085c4omdo8qbjhq17dki | 2020-04-15 09:47:05.476 +0000 UTC | modified  |
| 107.874      | Chongqing         | China             | 30.0572     | 7           | m15v3hh1oemcoh2b6d1sctervmm7j4os | 2020-04-15 10:23:41.946 +0000 UTC | 107.874        | Chongqing           | China               | 30.0572       | 7             | 7ur10o0i77m2085c4omdo8qbjhq17dki | 2020-04-15 09:47:05.476 +0000 UTC | modified  |
| 119.455      | Jiangsu           | China             | 32.9711     | 8           | m15v3hh1oemcoh2b6d1sctervmm7j4os | 2020-04-15 10:23:41.946 +0000 UTC | 119.455        | Jiangsu             | China               | 32.9711       | 8             | 7ur10o0i77m2085c4omdo8qbjhq17dki | 2020-04-15 09:47:05.476 +0000 UTC | modified  |
| 118.1498     | Shandong          | China             | 36.3427     | 9           | m15v3hh1oemcoh2b6d1sctervmm7j4os | 2020-04-15 10:23:41.946 +0000 UTC | 118.1498       | Shandong            | China               | 36.3427       | 9             | 7ur10o0i77m2085c4omdo8qbjhq17dki | 2020-04-15 09:47:05.476 +0000 UTC | modified  |
| 102.7103     | Sichuan           | China             | 30.6171     | 10          | m15v3hh1oemcoh2b6d1sctervmm7j4os | 2020-04-15 10:23:41.946 +0000 UTC | 102.7103       | Sichuan             | China               | 30.6171       | 10            | 7ur10o0i77m2085c4omdo8qbjhq17dki | 2020-04-15 09:47:05.476 +0000 UTC | modified  |
| 116.4142     | Beijing           | China             | 40.1824     | 11          | m15v3hh1oemcoh2b6d1sctervmm7j4os | 2020-04-15 10:23:41.946 +0000 UTC | 116.4142       | Beijing             | China               | 40.1824       | 11            | 7ur10o0i77m2085c4omdo8qbjhq17dki | 2020-04-15 09:47:05.476 +0000 UTC | modified  |
+--------------+-------------------+-------------------+-------------+-------------+----------------------------------+-----------------------------------+----------------+---------------------+---------------------+---------------+---------------+----------------------------------+-----------------------------------+-----------+

Every dolt_diff table has these columns:

+------------------+----------+
| field            | type     |
+------------------+----------+
| from_commit      | LONGTEXT |
| from_commit_date | DATETIME |
| to_commit        | LONGTEXT |
| to_commit_date   | DATETIME |
| diff_type        | LONGTEXT |
+------------------+----------+

As well as to_[column] and from_[column] columns for every column in the table.

You can view the dolt_diff_places table on DoltHub as well by using the SQL console to run a query:

Dolt diff for corona-virus places

Note: Queries are limited to 200 rows on DoltHub. For unlimited query results, you can clone this repository by installing Dolt and running dolt clone dolthub/corona-virus

Using this table, we were able to construct a query to show the history of changes in a row or cell with just a click.

Cell and Row History Inspection on DoltHub

While viewing table changes has been possible on DoltHub through SQL queries, we made it even easier. Introducing cell inspection: you can now click on a table cell and copy the value, get the row or cell history, and filter by that cell value. Clicking on a column heading also gives you the option to sort by that column.

Cell Inspection Dropdown

Not only does this make exploring a dataset easier, but it also teaches SQL and brings awareness to system tables.

Our tables use unique primary keys to tell when there are changes to a row. Clicking on a cell in a primary key column will generate a dolt_diff query that shows you the full row history, since primary keys cannot change.

Row history for places

Clicking on a cell in a non-primary key column generates a dolt_diff query that shows changes only in that particular cell.

Cell history for places

A Cell Inspection Use Case

This is a very useful feature for seeing change over time, and can help users audit data for where and when information changed.

Many of our DoltHub datasets are generated from an ETL job that automatically gets the data from its source when there have been updates, transforms it to match the Dolt table schema, and pushes the newly transformed data to DoltHub. Specifically, our dolthub/corona-virus repo runs on an ETL job that gets data from various resources on an hourly basis. You can see the import script for this dataset here.

Tim, the creator and maintainer of this dataset, was inspecting the diff for the places table, when he found a floating point import bug for the latitude and longitude columns. Using cell history, he was easily able to identify exactly what commit altered these values by looking at the history of changes for those cells.

History of latitude cell

You can get even deeper into when and who made the commit that changed a cell by viewing the dolt_history table by clicking on the link at the bottom of the dolt_diff table, which gets every commit associated with the row, as well as who authored the commit and when.

dolt_history table for cell

Tim was able to go back to his import script and fix the floating point bug. Without Git-like features such as diffs and cell inspection, Tim could have easily missed data inconsistencies created by his import script.

Making Data on DoltHub Interactive

Over the past few months we've been adding more features to DoltHub to better exhibit the benefits of a version-controlled database. Part of that is teaching users about the power of Dolt system tables in a user-friendly way, as well as exposing repository information like filesystem size and when a repo was last updated. Here are some other recent additions to DoltHub that help achieve this goal:

Conclusion

Dolt is a database that uses version control, which makes debugging data between versions easy to trace. Exposing this functionality on DoltHub is an important step to improve the data discovery interface for our users. Data can be audited using our new cell history feature on DoltHub, which shows the history of changes for each table, row, and cell.

This is just the beginning of features like this that we're building for Dolt and DoltHub. If this interests you and you'd like to stay up-to-date with new features like this, please sign up for DoltHub or join our mailing list on the right.

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt