Novel Coronavirus Dataset in Dolt: A Case for Branches

DATASET
4 min read

Here at DoltHub, we've been working on COVID-19 data since February 5, 2020. First, we started importing John Hopkins data and then we worked on assembling the largest open, regularly-updated set of case details from Singapore, Hong Kong and South Korea.

Upon publishing our case details blog, we were pointed at another case details dataset via Twitter:

virological.org case details tweet

A new set of case details data published by virological.org in a Google Sheet. This blog explains how we ingested this data in Dolt and why we chose to do what we did.

Using Dolt Branches

Dolt is Git for data. Instead of versioned files, Dolt has versioned tables; making Dolt the only database with branches. These two sets of case details data make an ideal use case for branches.

This case details data from virological.org is of lower quality than the case details data provided by the governments of Singapore, Hong Kong, and South Korea. The data is manually sourced from news reports. It's unclear how the maintainers are tracking cases over their lifecycle. For instance, say a maintainer finds a case reported in a local newspaper. How is that case tracked through hospital admission and recovery? Doing so is difficult so the collection methodology is likely to produce case detail inaccuracies. Contrast that with the data from the governments of Singapore, Hong Kong, and South Korea. The data is incredibly detailed. Based on the detail of the current data, the collection and tracking appears quite robust and diligent.

This difference in data quality from different sources makes maintaining the two datasets separately but with the same schema an ideal case for branches. We maintain the government data on the master branch. We create a new branch for the virological.org data. On every update, we merge in master so the virological.org branch has the latest Singapore, Hong Kong, and South Korea data. We then update the branch with the virological.org data, ignoring data from Singapore, Hong Kong, and South Korea because we assume the government is a better source. You can see how we do that in our import script.

With a traditional relational database, we could create a new table with this data, duplicating the views and queries for both tables. Alternatively, we could insert all the data into the same table, adding a column to denote which set it belongs to. This might be okay with one additional column but gets limited pretty quickly as you add more sets. Dolt conveniently scales to a virtually unlimited number of branches, just like Git.

Using the data

With this branch structure, all of our views and saved queries work. All we have to do to switch between the datasets is dolt checkout. We made SQL views for mortality_rate_by_age_range, mortality rate_by_sex, and mortality_rate_by_age_sex for our last blog post. If we're doing an analysis and want to see how it looks with the bigger sample size but less accurate data, we just switch branches.

Case counts and deaths with an age go from ~1,000 and 40 on the master branch to ~3,000 and 108 on the branch that combines master and the new virological.org data.

timsehn$ dolt checkout master
Switched to branch 'master'
timsehn$ dolt sql -q "select sum(cases), sum(deaths) from mortality_rate_by_age_range where age_range is not null"
+----------------------------------------+-----------------------------------------+
| SUM(mortality_rate_by_age_range.cases) | SUM(mortality_rate_by_age_range.deaths) |
+----------------------------------------+-----------------------------------------+
| 1067                                   | 40                                      |
+----------------------------------------+-----------------------------------------+
timsehn$ dolt checkout case_details_virological_dot_org
Switched to branch 'case_details_virological_dot_org'
timsehn$ dolt sql -q "select sum(cases), sum(deaths) from mortality_rate_by_age_range where age_range is not null"
+----------------------------------------+-----------------------------------------+
| SUM(mortality_rate_by_age_range.cases) | SUM(mortality_rate_by_age_range.deaths) |
+----------------------------------------+-----------------------------------------+
| 3151                                   | 108                                     |
+----------------------------------------+-----------------------------------------+

The mortality rate by age range did not change much at all comparing master to our new combined branch.

timsehn$ dolt checkout master
Switched to branch 'master'
timsehn$ dolt sql -q "select * from mortality_rate_by_age_range"
+-----------+-------+--------+----------------------+
| age_range | cases | deaths | mortality_rate       |
+-----------+-------+--------+----------------------+
| <NULL>    | 7203  | <NULL> | <NULL>               |
| 0         | 9     | <NULL> | <NULL>               |
| 1         | 15    | <NULL> | <NULL>               |
| 2         | 194   | <NULL> | <NULL>               |
| 3         | 178   | 2      | 0.01123594874384902  |
| 4         | 170   | 1      | 0.005882349480970894 |
| 5         | 229   | 6      | 0.026200861921020994 |
| 6         | 173   | 9      | 0.05202309131613218  |
| 7         | 67    | 9      | 0.1343281577191676   |
| 8         | 28    | 11     | 0.3928557398009293   |
| 9         | 4     | 2      | 0.49998750031249223  |
+-----------+-------+--------+----------------------+
timsehn$ dolt checkout case_details_virological_dot_org
Switched to branch 'case_details_virological_dot_org'
timsehn$ dolt sql -q "select * from mortality_rate_by_age_range"
+-----------+-------+--------+----------------------+
| age_range | cases | deaths | mortality_rate       |
+-----------+-------+--------+----------------------+
| <NULL>    | 46229 | <NULL> | <NULL>               |
| 0         | 60    | <NULL> | <NULL>               |
| 1         | 142   | 1      | 0.007042248561796787 |
| 2         | 533   | <NULL> | <NULL>               |
| 3         | 457   | 4      | 0.008752733314500369 |
| 4         | 536   | 3      | 0.005597013881154127 |
| 5         | 590   | 11     | 0.018644064636599216 |
| 6         | 500   | 25     | 0.049999990000002    |
| 7         | 221   | 27     | 0.12217189041995909  |
| 8         | 96    | 31     | 0.3229163302954893   |
| 9         | 16    | 6      | 0.37499765626464837  |
+-----------+-------+--------+----------------------+

The data mortality rates do not materially change, suggesting both these sources are tracking mortality pretty similarly. This is a surprising result. I would expect the government data to have a higher death rate given collection methods. One possible explanation is that crowdsourced data accurately covers deaths because those are more newsworthy. Feel free to clone a copy and play with this dataset for yourself.

This example shows how simple it is to use Dolt to work with two distinct sets of data with the same schema using branches. We think this multiple-source merging happens a lot in the data world and Dolt provides a very elegant solution.

Conclusion

We are continually updating the Coronavirus dataset on Dolt. Come back often and explore. Or feel free to rely on the dataset to back a dashboard or application. If there's any case details data we missed, please let us know at tim@dolthub.com. We hope our work here is helping in small part contain the virus.

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt

Or join our mailing list to get product updates.