July Dataset Spotlight

Every month we highlight some interesting datasets on DoltHub. The focus is on new or updated datasets but sometimes we shed fresh light on a classic.

For those new to Dolt and DoltHub, Dolt is Git for data. Git versions files. Dolt versions SQL tables. DoltHub is a place on the internet to share Dolt repositories.

We think the way we share data with each other is broken and we think Dolt is the fix. Whenever you see a link to a CSV, JSON, or XML file, you should think of Dolt. Whenever you see an API but want all the data, not just a few entries, you should think of Dolt. We are working hard to move data shared in these formats to Dolt. This series of blogs will update you on our progress.

NFL Pick'em Data

Link: tylergothmann/NFL_PickEm_Data
Contributor: tylergothman
First Published: September 2, 2019

Shout out to our first publisher! This user put NFL Pick'em data into a Dolt database for the 2019-2020 NFL season and has returned for the 2020-2021 season. We're glad to see him back. The dataset contains NFL games and their scores, with a bent towards picking winners for gambling purposes.

USDA All Foods

Link: dolthub/usda-all-foods
Contributor: dolthub
First Published: September 30, 2019

The US Department of Agriculture maintains a list of food products and various nutritional information about them. The database is divided into raw foods like broccoli and hamburger and branded foods like Doritos. It's a good database for anyone looking into food or food products. This dataset was recently requested on Reddit so we thought we'd point it out here.

Open Elections

Link: open-elections/voting-data
Contributor: dolthub/Open Elections Partnership
First Published: June 2020

This is the first dataset where we are working with the data producer to build a version of the dataset in Dolt format. Here's a blog article about the data. We're going to keep building on this work over the next few months so stay tuned.

NOAA Weather Data

Link: dolthub/noaa
Contributor: dolthub
First Published: March 2, 2020

This dataset is an interesting one. You can read about it in this cool blog article complete with visualizations. What sets it apart is the way it is modeled in Dolt. The dataset holds the current climate information for each station on HEAD and uses the commit log to handle historical information. We're not sure if that's the best way to model time series information in Dolt but it was cool to try.
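To make the "current values on HEAD, history in the commit log" idea concrete, here is a toy sketch in Python. It is not Dolt and does not use the NOAA schema: the in-memory `commits` list, the station IDs, and the readings are all made up for illustration. It only shows the shape of the model, where the latest snapshot answers "what is the value now?" and walking the log answers "what was it before?".

```python
# Toy stand-in for a commit log. Each commit keeps a full snapshot of a
# table keyed by station ID. Hypothetical data; not how Dolt stores rows.
commits = []  # oldest first; the last entry plays the role of HEAD


def commit(snapshot, message):
    """Record a new snapshot of the table under a commit message."""
    commits.append({"message": message, "table": dict(snapshot)})


def head():
    """Return the current state of the table (the newest commit)."""
    return commits[-1]["table"]


def history(station_id):
    """Walk the log oldest to newest and collect each change in a reading.

    A reading that is identical in consecutive commits appears only once,
    because nothing changed between those commits.
    """
    readings = []
    for c in commits:
        value = c["table"].get(station_id)
        if value is not None and (not readings or readings[-1][1] != value):
            readings.append((c["message"], value))
    return readings


# Three made-up daily imports.
commit({"STATION_A": 21.5}, "2020-03-01")
commit({"STATION_A": 19.0}, "2020-03-02")
commit({"STATION_A": 19.0, "STATION_B": 7.5}, "2020-03-03")

if __name__ == "__main__":
    print(head())
    print(history("STATION_A"))
```

One trade-off the sketch makes visible: the table itself stays small and always reflects the present, but any question about the past requires walking commits rather than filtering on a timestamp column.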

USPS Crosswalk Data

Link: dolthub/usps-crosswalk-data
Contributor: dolthub
First Published: July 13, 2020

The USPS Crosswalk data is a tool provided by the United States Postal Service for mapping from ZIP codes to other kinds of geographic entities such as counties. It allows users of the data to change the spatial resolution of an existing dataset. We find this kind of data exciting precisely because it is not specifically interesting, but is rather generically useful. For an example of what we mean by that, we took a historical analysis of the IRS Sources of Income by ZIP code data, and transformed it to the county level. You can read about how we used USPS Crosswalk data to do that transformation here.
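As a rough illustration of what "changing the spatial resolution" means, the sketch below rolls ZIP-level totals up to counties with a crosswalk. The row layout (ZIP, county code, share of the ZIP that falls in that county) and every value in it are assumptions invented for this example; they are not taken from the actual crosswalk tables or from our IRS analysis.

```python
from collections import defaultdict

# Hypothetical crosswalk rows: (zip_code, county_code, ratio), where ratio
# is the assumed share of the ZIP code that lies in that county. The ratios
# for a single ZIP code sum to 1.0.
CROSSWALK = [
    ("10001", "36061", 1.0),
    ("10463", "36005", 0.75),
    ("10463", "36061", 0.25),
]

# Hypothetical ZIP-level totals, e.g. some dollar amount per ZIP code.
ZIP_TOTALS = {"10001": 200.0, "10463": 100.0}


def zip_to_county(zip_totals, crosswalk):
    """Split each ZIP-level total across counties by ratio and sum per county."""
    county_totals = defaultdict(float)
    for zip_code, county_code, ratio in crosswalk:
        if zip_code in zip_totals:
            county_totals[county_code] += zip_totals[zip_code] * ratio
    return dict(county_totals)


if __name__ == "__main__":
    print(zip_to_county(ZIP_TOTALS, CROSSWALK))
```

With these made-up numbers, the ZIP code that straddles two counties contributes 75 to one and 25 to the other, so the overall total is preserved while the grouping changes from ZIP codes to counties.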

Conclusion

That's it for this month. As you can see, most of the datasets are published by us. For Dolt and DoltHub to continue to exist, we need a community of data publishers to emerge. Please help us build a community by publishing. We published a blog on how to publish with SQL and another on how to publish CSVs.

That said, if you want data in Dolt format but don't have the time or expertise to import and maintain it, send us a note. We're happy to be an open data provider for your projects.
