August Dataset Spotlight

August 31, 2020

Every month we highlight some interesting datasets on DoltHub. The focus is on new or updated datasets, but sometimes we shed fresh light on a classic.

For those new to Dolt and DoltHub, Dolt is Git for data. Git versions files. Dolt versions SQL tables. DoltHub is a place on the internet to share Dolt repositories.

We think the way we share data with each other is broken and we think Dolt is the fix. Whenever you see a link to a CSV, JSON, or XML file, you should think of Dolt. Whenever you see an API but want all the data, not just a few entries, you should think of Dolt. We are working hard to move data shared in these formats to Dolt. This series of blog posts will update you on our progress.

FBI NIBRS Crime database

Link: dolthub/fbi-nibrs
Contributor: dolthub
First Published: August 7, 2020

The Federal Bureau of Investigation (FBI) in the United States collects voluntary reports of criminal offenses from local law enforcement agencies nationwide in a database called the National Incident-Based Reporting System (NIBRS). This dataset is topical given the current cultural focus on policing and police reform. If you're looking to ask questions about crime and policing in the United States, this is a good place to start. Dolt makes it accessible, with all 50 states cleaned and in one database. Here's the blog about the database.

AI Dungeon Transcripts

Link: dolthub/ai-dungeon
Contributor: dolthub
First Published: July 20, 2020

A theme this month is collaborative datasets. We think Dolt has the unique capability of producing datasets that thousands of people collaboratively edit. In pursuit of that theme, we bootstrapped the following two datasets and made scrapers for collaborators to add to them. The first tries to capitalize on the GPT-3 hype by producing a dataset of AI Dungeon transcripts. AI Dungeon is powered by GPT-3. We seeded this dataset, produced a scraper, and blogged about it. There has not been much interest so far, but maybe the monthly spotlight will pique some.

Open Resume Database

Link: dolthub/open-resumes
Contributor: dolthub
First Published: July 20, 2020

Similar to the AI Dungeon transcripts, we think there should be an open resume database on the internet. We started with our own resumes and we'd love more contributors. We think forks on DoltHub are a prerequisite for large-scale distributed dataset collaboration, but any motivated individual can add their information now.

Bad Words

Link: dolthub/bad-words
Contributor: dolthub
First Published: April 13, 2020

We published a blog about the Bad Words dataset back in April. This dataset contains bad words from numerous languages. Sticking with our collaborative theme, we think this is the type of dataset that begs for collaboration, since bad words change frequently and are use-case specific. You can maintain your own branch that has more or fewer bad words than master but still get updates when master changes. This is the promise of Dolt.
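To make the "own branch that still gets updates" idea concrete, here is a minimal Python sketch of a three-way merge over sets of words. It is a simplified model of the concept, not Dolt's actual merge implementation, and it assumes each word is a single row keyed by the word itself. The function name `three_way_merge` and the sample words are hypothetical.

```python
def three_way_merge(base, mine, theirs):
    """Merge two edited copies of a word set that share a common ancestor.

    base   -- the word set at the point where the branch was created
    mine   -- the word set on your own branch
    theirs -- the word set on master after upstream changes
    """
    base, mine, theirs = set(base), set(mine), set(theirs)
    # Words either side added since the common ancestor.
    added = (mine - base) | (theirs - base)
    # Words either side removed since the common ancestor.
    removed = (base - mine) | (base - theirs)
    return (base | added) - removed


# Hypothetical data: a tiny word list and two diverging edits.
base = {"darn", "heck", "drat"}
mine = {"darn", "drat", "frak"}             # removed "heck", added "frak"
theirs = {"darn", "heck", "drat", "blast"}  # master added "blast"

merged = three_way_merge(base, mine, theirs)
print(sorted(merged))
```

In this model your local removal and addition survive the merge, and the word added on master still arrives on your branch.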

ImageNet

Link: dolthub/image-net
Contributor: dolthub
First Published: October 30, 2019

The dataset that started a revolution. We managed to procure all four released versions, and each lives on its own branch. We're trying to make Dolt the place where the next ImageNet will happen, so it's only fitting that we're the best place on the internet to get the original.

Conclusion

That's it for this month. As you can see, most of the datasets are published by us. For Dolt and DoltHub to continue to exist, we need a community of data publishers to emerge. Help us build a community by publishing. We published a blog on how to publish with SQL and another on how to publish CSVs.

That said, if you want data in Dolt format but don't have the time or expertise to import and maintain it, send us a note or chat with us on Discord. We're happy to be an open data provider for your projects.
