Jails Bounty Retrospective

Dolt is the first version-controlled database in the world. It looks like a database, but can be branched, merged, and cloned like any Git repository.

A year and a half ago we launched data bounties and used Dolt's cell-lineage features to track each bounty hunter's contributions. We pay each bounty hunter in proportion to their contribution to the crowd-sourced database. We aim for projects that are hard for one person to replicate alone and that serve the public good.
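To make the idea of a proportional payout concrete, here is a minimal sketch. It assumes contributions are measured as a count of attributed cells per participant, which is our simplification for illustration and not a description of the actual scoring code. The function name `split_prize`, the participant names, and the prize amount are all hypothetical.

```python
def split_prize(cell_counts: dict[str, int], prize_cents: int) -> dict[str, int]:
    """Split a prize pool in proportion to each participant's attributed cells.

    Assumption: contribution is a plain cell count. Amounts are kept in
    integer cents, and floor division may leave a few cents undistributed.
    """
    total_cells = sum(cell_counts.values())
    if total_cells == 0:
        # Nobody contributed anything, so nobody is paid.
        return {name: 0 for name in cell_counts}
    return {
        name: prize_cents * cells // total_cells
        for name, cells in cell_counts.items()
    }


# Hypothetical example: two participants and a $10,000 pool (in cents).
payouts = split_prize({"hunter_a": 300, "hunter_b": 100}, 1_000_000)
for name, cents in payouts.items():
    print(f"{name}: ${cents / 100:,.2f}")
```

With these made-up numbers, the participant credited with three quarters of the cells receives three quarters of the pool.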

We just completed a survey of US jail and prison data.

How we did

We learned that jail data is not only hard to track down but also hard to scrape. Most jails appear to publish their data in PDF format. (One participant even noted that one PDF was an Excel spreadsheet exported to PDF.) Jails make no effort to make this data machine-readable or easy to find. At first, the data only trickled in.

What we learned

After 8 long weeks, we can tally up the data and begin to take a first look.

Completeness

Many jails offer sparse records. Some offer complete records. And many, many more offer none. (This is a pattern that we sometimes see in nationwide bounties, perhaps due to different state laws.)

By the numbers: we collected 138,446 snapshots directly from 1,022 jails. For context, there are a little over 3,000 jails in America. This means around 33% of jails are publishing some kind of data. Interestingly, hospitals, which are required to publish their chargemasters, are approximately as transparent as jails, which are not required (as far as I know) to publish any records publicly.
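The arithmetic behind that estimate can be checked in a few lines. This is a rough sketch: the snapshot and jail counts come from the tally above, while the total of 3,000 is only a lower bound ("a little over 3,000"), so the computed share is a slight overestimate. The function name `coverage_stats` is ours, for illustration.

```python
def coverage_stats(snapshots: int, jails_with_data: int, total_jails: int) -> tuple[float, float]:
    """Return (share of jails with data, average snapshots per reporting jail)."""
    share = jails_with_data / total_jails
    per_jail = snapshots / jails_with_data
    return share, per_jail


# total_jails=3000 is an assumed lower bound, so the true share is a bit smaller.
share, per_jail = coverage_stats(snapshots=138_446, jails_with_data=1_022, total_jails=3_000)
print(f"share of jails with data: {share:.1%}")
print(f"average snapshots per reporting jail: {per_jail:.0f}")
```

Using 3,000 as the denominator gives roughly 34%, which drifts down toward the 33% quoted above as the true jail count rises past 3,000.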

While a few jails offer rather complete data (a set of monthly population snapshots going back at least 20 years), more often the records are incomplete or nonexistent. How much these records reveal will become clear in our upcoming analysis.

Intriguingly, Texas(!) and California were the two most transparent states, with 70k and 30k rows respectively.

Too much of a good thing

A bit of a challenge keeps us engaged. It can drive us to learn a new skill, or sharpen our existing ones.

But bounties are hard enough for newcomers. They have to learn the basic syntax of SQL, the Git interface, and how to use Dolt, and that's before they can even start to scrape the web.

As much as finding data can feel like pulling a rabbit out of a hat, it's not magic. The data has to be there for the bounty hunters to find it. And despite digging around the web, few of our contestants struck pay dirt. Those who did were often greeted with a hard-to-scrape PDF.

Finally, even if they did find the information, the data had to meet requirements: snapshots could not be too frequent, could not be older than a certain date, and could not come from a pre-existing census. These rules were meant to prevent monster data sources from overwhelming the database (and the prize money), but they also created headaches.

What we learned was this: there is such a thing as too challenging. Bounties need to stay simple and fun.

Awards

@abmyii swooped in at the end to snatch the top prize with a PR that captured most of the Texas jails.

[Image: jails scoreboard]

Conclusion

That's it for jails, for now. Come help us put together a single database that unites the world's museum collections. For questions, check out our Discord.
