Dolt is a MySQL-compatible database that can branch, diff and merge. Every six weeks we launch a data bounty and pay contributors to build unique, public databases. Payment is based on the percentage of cell edits each participant makes.
This Wednesday, July 7, we concluded a $5,000 collaborative data bounty with The Police Data Accessibility Project. PDAP is a volunteer organization with the goal of centralizing public police data. If you're just joining us, the data bounty launch blog offers more context.
Here's how we did...
- 22 Pull Requests merged
- 6 contributors
- 162,494 cell edits
- One contributor made $4,744.69
- Every contributor will receive a payout of at least $50 for participating
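Since payment is based on each participant's share of cell edits, the split can be sketched as simple proportional arithmetic. The sketch below is illustrative only: the function name, contributor names, and edit counts are made up, and it does not model the $50 minimum payout or any other scoreboard rule.

```python
from decimal import Decimal, ROUND_HALF_UP


def proportional_payouts(cell_edits, pot):
    """Split `pot` in proportion to each contributor's cell edits.

    `cell_edits` maps a contributor name to a count of accepted cell edits.
    This is a hypothetical helper, not the actual bounty scoreboard logic.
    """
    total = sum(cell_edits.values())
    if total == 0:
        raise ValueError("no cell edits to attribute")
    pot = Decimal(pot)
    cent = Decimal("0.01")
    return {
        name: (pot * edits / total).quantize(cent, rounding=ROUND_HALF_UP)
        for name, edits in cell_edits.items()
    }


# Hypothetical edit counts, not the real PDAP numbers.
example = proportional_payouts({"alice": 900, "bob": 100}, "5000")
print(example)
```

Because each amount is rounded to the cent independently, the rounded payouts may not always sum exactly to the pot.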
What it's like to collaborate on data
For the first couple of days of every bounty, there is close scrutiny of the schema and the bounty rules. To maintain cell-wise attribution for the bounty, we cannot change the schema of a table once we've started merging submissions to master. Because of this, we made a couple of minor schema changes early on, adding constraints, and tabled the rest until after the bounty. Participants let me know what was unclear about the data or the acceptance criteria. Here is the bounty details page in its final form. Once the goal was understood, collaboration became city- and site-specific.
Foray into data validation
I worked with PDAP to validate bounty submissions. We reviewed the first PR together, and came up with a starter list of validation checks:
- Inspecting additions
- Verifying that homepage_url values don't contain excluded keywords
- Removing stray whitespace
- No changes to date_insert (unless there are additions)
- Modifications to location fields (lat, lng, city, state, county_fips)
- Spot-checking sources and confirming they are valid
These formatting and quality checks were driven by submissions, which in turn inform the validation scripts. The cycle of collaboration introduces us to edge cases and makes us smarter about the data. For PDAP Bounty V2 we will have these validations public and accessible to contributors.
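A few of the checks in the list above could be scripted roughly as follows. This is a minimal sketch, not the validation script we actually ran: the `check_row` helper, the row-as-dictionary shape, and the excluded keyword list are all assumptions made for illustration. Only the column names `homepage_url` and `date_insert` come from the checks listed above.

```python
# Hypothetical keyword list; the real excluded keywords are not given here.
EXCLUDED_KEYWORDS = ("facebook", "twitter")


def check_row(new_row, old_row=None):
    """Return a list of problems found in a submitted row.

    `new_row` is the submitted row as a dict of column name to value.
    `old_row` is the previous version of the same row, or None if the
    row is a new addition.
    """
    problems = []

    # Flag stray whitespace around any string value.
    for column, value in new_row.items():
        if isinstance(value, str) and value != value.strip():
            problems.append(f"{column}: leading or trailing whitespace")

    # homepage_url should not contain excluded keywords.
    url = (new_row.get("homepage_url") or "").lower()
    for keyword in EXCLUDED_KEYWORDS:
        if keyword in url:
            problems.append(f"homepage_url: contains excluded keyword '{keyword}'")

    # date_insert should only be set on additions, never changed on edits.
    if old_row is not None and new_row.get("date_insert") != old_row.get("date_insert"):
        problems.append("date_insert: changed on an existing row")

    return problems


print(check_row({"homepage_url": "https://example.gov/police ", "date_insert": "2021-06-01"}))
```

Checks like spot-checking a source still need a human reviewer; a script like this only catches the mechanical formatting issues before that review.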
The PDAP bounty was the first we've run where we expected, and preferred, manually entered data. We primarily intended to build out the datasets table, which serves as a reference to law enforcement URLs for public datasets. We did not restrict submissions to the datasets table alone, but also allowed the agency_types tables. This caused confusion, because that was not every table in the database. The task was harder to understand at face value when contributors were expected to work on a seemingly arbitrary subset of the database's tables. An all-or-nothing approach, or a single table, would clear up that confusion.
Prior to the bounty, most of the pdap/datasets data was entered manually. We considered restricting the use of scrapers, but decided we didn't want to limit creativity or submissions. The first bounty PR was a large scraped contribution to the agencies table, which we reviewed and merged based on the criteria above.
The problem with merging a ton of scraped data in a data bounty that we expected to be manual is that it discourages other contestants from participating. There's a lot of catching up to do to make actual money. Admittedly, we should not have accepted both scraped and manually entered data, as it greatly muddies the review process and discourages new participation.
But do we care that the cell edit distribution usually skews in favor of one or two people? We have managed to build unique, public databases. The few contributors who do the most work dominate the percentages and get paid handsomely (in the thousands). We are experimenting with restrictions on pull requests in the dolthub/menus bounty, but as you can see, we already have two contributors pulling ahead.
We will be running a v2 of this bounty with PDAP to build out the datasets table specifically, allowing only manual entry. We'll post another blog for the launch, so keep an eye out.
The same day this bounty ended, we launched a menus data bounty, which goes deeper into data validation. Drop by our Discord if you want to chat. And if you're interested, donate to PDAP to help them grow and run more bounties.