November Dataset Spotlight


Every month we highlight some interesting datasets on DoltHub. The focus is on new or updated datasets but sometimes we shed fresh light on a classic.

For those new to Dolt and DoltHub, Dolt is Git for data. Git versions files. Dolt versions SQL tables. DoltHub is a place on the internet to share Dolt databases.

We think the way we share data with each other is broken and Dolt is the fix. Whenever you see a link to a CSV, JSON, or XML file, you should think of Dolt. Whenever you see an API but want all the data, not just a few entries, you should think of Dolt. We are working hard to move data shared in these formats to Dolt. This series of blogs will update you on our progress.

Popular Datasets

New this month: the five most viewed DoltHub datasets in November.

  1. dolthub/pa_mail_ballots_2020
  2. dolthub/nfl-play-by-play
  3. dolthub/fbi-nibrs
  4. alexis-evelyn/presidential-tweets
  5. dolthub/corona-virus

Three of these are covered below.
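
If you want local copies of these datasets, here is a minimal sketch of how you might fetch them with the Dolt command line. It assumes `dolt` is installed and on your PATH, and it relies on `dolt clone <owner>/<repo>` to pull a database from DoltHub. The helper names `clone_command` and `clone_all` are ours, made up for illustration. By default the script only prints the commands it would run.

```python
import shutil
import subprocess
from typing import List

# Repository names exactly as they appear in the list above.
POPULAR_DATASETS = [
    "dolthub/pa_mail_ballots_2020",
    "dolthub/nfl-play-by-play",
    "dolthub/fbi-nibrs",
    "alexis-evelyn/presidential-tweets",
    "dolthub/corona-virus",
]


def clone_command(repo: str) -> List[str]:
    """Build the `dolt clone` invocation for a DoltHub repo given as owner/name."""
    return ["dolt", "clone", repo]


def clone_all(repos: List[str], dry_run: bool = True) -> List[List[str]]:
    """Return the clone commands; execute them only when dry_run is False."""
    commands = [clone_command(repo) for repo in repos]
    if not dry_run:
        # Fail early with a clear message if the Dolt CLI is missing.
        if shutil.which("dolt") is None:
            raise RuntimeError("dolt CLI not found on PATH")
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Dry run: show the commands without touching the network or disk.
    for cmd in clone_all(POPULAR_DATASETS):
        print(" ".join(cmd))
```

Call `clone_all(POPULAR_DATASETS, dry_run=False)` to actually clone. Each repository lands in its own directory under the current working directory, and some of these datasets are large.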

Datasets

Pennsylvania Mail-In Ballots

Link: dolthub/pa_mail_ballots_2020
Contributor: dolthub
First Published: November 6, 2020

November was a US Presidential Election month. After the votes were counted, President-elect Biden won. President Trump disputes the election, claiming mass voter fraud without much substantial evidence. A key state was Pennsylvania, which published this mail-in voter dataset; we downloaded it on Nov. 6. We published a blog about some of the claims of fraud made online regarding this data. Since then, the data has been moved behind a login, so DoltHub is the only place to get it. The blog debunks one of the claims; the other is open-ended. On Twitter, a user named @eppievojt explained the second discrepancy even further. He may be making a guest DoltHub blog appearance in the next couple of weeks, so stay tuned. This whole experience shows how open data shared with Dolt and DoltHub can be a force for greater data collaboration.

Presidential Tweets

Link: alexis-evelyn/presidential-tweets
Contributor: alexis-evelyn
First Published: November 10, 2020

User alexis-evelyn has been really active in our Discord channel. Besides getting this dataset together, alexis-evelyn has been trying to get Dolt working on Raspberry Pi. Users like alexis-evelyn are the lifeblood of the open source community. We want to put a little spotlight on alexis-evelyn's work. If you're looking for a dataset of tweets by American presidents, check this out.

NFL Play-by-Play Data

Link: dolthub/nfl-play-by-play
Contributor: dolthub
First Published: May 19, 2020

The NFL is running a data science contest on Kaggle, with another starting in a month. We think that may be the reason this dataset is so popular. This dataset contains all NFL play-by-play data since 2000. The NFL stopped the API serving this data in May 2020. This data was originally scraped using the nflfastR package on May 18, 2020.

Conclusion

That's it for this month. For Dolt and DoltHub to continue to exist, we need a community of data publishers to emerge. Help us build a community by publishing. We published a blog on how to publish with SQL and another on how to publish CSVs.

That said, if you want data in Dolt format but don't have the time or expertise to import and maintain it, send us a note or chat with us on Discord. We're happy to be an open data provider for your projects.
