April Dataset Spotlight


This blog entry is the first in a new series. Every month we will highlight some interesting datasets on DoltHub. The focus will be on new or updated datasets, but sometimes we'll shed fresh light on a classic.

For those new to Dolt and DoltHub, Dolt is Git for data. Git versions files. Dolt versions SQL tables. DoltHub is a place on the internet to share Dolt repositories.

We think the way we share data with each other is broken and we think Dolt is the fix. Whenever you see a link to a CSV, JSON, or XML file, you should think of Dolt. Whenever you see an API but want all the data, not just a few entries, you should think of Dolt. We are working hard to move data shared in these formats to Dolt. This series of blogs will update you on our progress.

COVID Tracking

Link: tomdottom/covid-tracking
Contributor: tomdottom
First Published: March 31, 2020

tomdottom has been a very active user and bug submitter. We really enjoy his work. His dataset is a mirror of the COVID Tracking Project. tomdottom added some sample queries to track the latest infection rate by state and the [USA country-wide infection rate](https://www.dolthub.com/repositories/tomdottom/covid-tracking/query/master?q=SELECT%0A%20%20%20%20date%0A%20%20%2C%20IFNULL(SUM(positive)%2C%200)%20AS%20positive%0A%20%20%2C%20IFNULL(SUM(positiveIncrease)%2C%200)%20as%20positive_increase%0A%20%20%2C%20IFNULL((100.0%20*%20SUM(positiveIncrease)%2F(SUM(positive)%20-%20SUM(positiveIncrease)))%2C%200)%20as%20percentage_increase%0AFROM%20states_daily%0AGROUP%20BY%20date%0AORDER%20BY%20date%20DESC%0A%3B). COVID-19 is on all of our minds right now and it's our pleasure to host this project.
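The country-wide query behind that link, decoded from its URL, can be tried locally. The sketch below runs it with Python's built-in `sqlite3` against a tiny in-memory `states_daily` table. The sample rows are made up for illustration, the `usa_daily_totals` helper is hypothetical, and SQLite stands in for Dolt here, so treat it as a rough approximation rather than the dataset itself.

```python
import sqlite3

# The SQL from the country-wide link, decoded from its URL (trailing semicolon dropped).
QUERY = """
SELECT
    date
  , IFNULL(SUM(positive), 0) AS positive
  , IFNULL(SUM(positiveIncrease), 0) AS positive_increase
  , IFNULL((100.0 * SUM(positiveIncrease) / (SUM(positive) - SUM(positiveIncrease))), 0) AS percentage_increase
FROM states_daily
GROUP BY date
ORDER BY date DESC
"""

# Made-up rows for illustration only: (date, state, positive, positiveIncrease).
SAMPLE_ROWS = [
    ("2020-04-01", "NY", 100, 20),
    ("2020-04-01", "CA", 50, 5),
    ("2020-04-02", "NY", 130, 30),
    ("2020-04-02", "CA", 60, 10),
]


def usa_daily_totals(rows):
    """Load rows into an in-memory table and return the per-date totals."""
    conn = sqlite3.connect(":memory:")
    # Assumed minimal schema: only the columns the query touches, plus state.
    conn.execute(
        "CREATE TABLE states_daily ("
        "date TEXT, state TEXT, positive INTEGER, positiveIncrease INTEGER)"
    )
    conn.executemany("INSERT INTO states_daily VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for date, positive, increase, pct in usa_daily_totals(SAMPLE_ROWS):
        print(date, positive, increase, round(pct, 2))
```

The percentage column divides the day's new positives by the previous cumulative total (`positive - positiveIncrease`). `IFNULL` turns the result into 0 when that total is zero, since the division then yields NULL.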

If you are looking for other COVID-19 data on DoltHub, check out dolthub/corona-virus. It's a mirror of the Johns Hopkins dataset and a collection of open case details.

English WordNet

Link: dolthub/english-wordnet
Contributor: dolthub
First Published: April 20, 2020

Princeton WordNet was one of the first datasets we published on DoltHub. Querying this data in SQL is much easier than using other interfaces. WordNet is a linked web of Synonym Sets (Synsets). It is used in many computational linguistics applications.

We discovered that the Global WordNet organization maintains an active fork of Princeton WordNet. The team made two new releases in a new format that preserves Synset IDs so Dolt can produce proper diffs. They also released an import of WordNet 3.1 in their format. We imported all three versions of English WordNet into Dolt. Each version is on its own branch. We are working with them to add Dolt as a format they support.

Open Flights

Link: dolthub/open-flights
Contributor: dolthub
First Published: March 19, 2020

Open Flights is a database of airports, airlines, flights between airports, and the types of airplanes used. Its accuracy is suspect. If you want better data, you will have to pay for it. However, you can start your project with Open Flights data and, if you think you are on to something, upgrade to a paid provider.

Stock Ticker Symbols

Link: dolthub/stock-tickers
Contributor: dolthub
First Published: March 19, 2020

This is a list of publicly traded companies and their tickers on the New York Stock Exchange (NYSE), National Association of Securities Dealers Automated Quotations (NASDAQ), and American Stock Exchange (AMEX). The closing price and other information for these securities are also imported, but that data only goes back a month or so. The data is sourced from the NASDAQ website. The versioning of the company information is particularly useful. If you want to know a company's name or ticker history and when it changed, Dolt can be useful.

Tatoeba Sentence Translations

Link: dolthub/tatoeba-sentence-translations
Contributor: dolthub
First Published: September 17, 2019

This is a classic Dolt dataset. Tatoeba is a crowdsourcing platform for translations. The database contains 8.3M sentences across 355 languages, with 17.3M translation relationships between sentences. A new release comes out once per week, and we have been importing it into Dolt since September 2019, so it has a pretty rich version history. The sample queries give you a good idea of how to query it. Here's a query to find all the sentences containing the word dolt.
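A minimal sketch of what such a query could look like, run with Python's built-in `sqlite3` on an in-memory table. The `sentences` table, its `sentence_id`, `language`, and `text` columns, the `find_dolt_sentences` helper, and the sample rows are all assumptions made for illustration. The real schema on DoltHub may differ, so check the repository's sample queries for the actual table and column names.

```python
import sqlite3

# Hypothetical schema: sentences(sentence_id, language, text).
# LOWER() makes the substring match case-insensitive regardless of collation.
QUERY = """
SELECT sentence_id, language, text
FROM sentences
WHERE LOWER(text) LIKE '%dolt%'
ORDER BY sentence_id
"""

# Made-up rows for illustration only.
SAMPLE_SENTENCES = [
    (1, "eng", "Don't be such a dolt."),
    (2, "eng", "The weather is nice today."),
    (3, "eng", "Dolt is Git for data."),
]


def find_dolt_sentences(rows):
    """Load rows into an in-memory table and return those mentioning 'dolt'."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sentences (sentence_id INTEGER, language TEXT, text TEXT)"
    )
    conn.executemany("INSERT INTO sentences VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for sentence_id, language, text in find_dolt_sentences(SAMPLE_SENTENCES):
        print(sentence_id, language, text)
```

Note that `LIKE '%dolt%'` is a substring match, so it would also return sentences containing longer words such as "doltish".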


That's it for this month. As you can see, most of the datasets are published by us. For Dolt and DoltHub to continue to exist, we need a community of data publishers to emerge. Please help us build a community by publishing. We published a blog on how to publish with SQL and another on how to publish CSVs.

That said, if you want data in Dolt format but don't have the time or expertise to import and maintain it, send us a note. We're happy to be an open data provider for your projects.


