OpenElections Follow Up


As some of you are aware, we finished our first data bounty on Feb. 14 to collect US Presidential Precinct results for 2016 and 2020. On Feb. 15, we published a bounty review. The bounty review gained some distribution on HackerNews after one of the bounty participants posted it there.

Derek Willis of OpenElections discovered the bounty and took umbrage at how the OpenElections data contributed in the bounty was used and attributed. I spoke with Derek the next day, apologized, and explained my side of the story. We arrived at a good place. This blog post is my attempt to explain to all of you what I explained to him, to clear the air and make peace.

What is Dolt anyway?

Before we get into the specifics of the conflict with OpenElections, I think it's important to explain what Dolt and DoltHub are, DoltHub's motivations for bounties, and why we chose precinct-level election data for our first bounty.

What is DoltHub Inc?

We started this company two and a half years ago because we wanted to build a place on the internet to get access to interesting, maintained data. We believed that if we brought the branch/merge functionality we had in software to data, more data would be shared.

Eventually we landed on adding branch/merge functionality to a SQL database and Dolt was born. Dolt is free and open source. It's the only SQL database you can branch and merge. DoltHub is a place on the internet to share Dolt databases in the style of GitHub. We are a database company, not a data company. We're building a tool and in the long run, we intend to profit from selling support licenses for that tool.

What are Data Bounties?

We launched Dolt in August 2019 and DoltHub in September 2019. Since then, we've been trying to get people to use it. DoltHub is not a thing yet. We have hundreds of monthly users. In July 2020, we asked ourselves how we could get more people to use Dolt and DoltHub.

Our answer was data bounties: pay people to collect and transform open data into Dolt format. We act on the buy side of the bounty and make the data free and open for people to use on DoltHub. We attract bounty participants as well as consumers of the open data after the bounty is complete. We completed the development work on data bounties in November and searched for a launch dataset.

What is the Election data bounty?

On December 2, this Reddit post inspired us. We would pay $25,000 to have people wrangle data posted on state and county websites from 2020. We would start with what was in the MIT Election Lab database from 2016. We sprinted to get our house in order and launched the bounty December 14.

We imported the MIT Election Lab database poorly. Some bounty participants noticed and completed the work for us, as seen in this Pull Request (PR).

Why didn't we contact OpenElections at the time?

Oscar had attempted to get OpenElections data into Dolt earlier in 2020. We were aware of OpenElections' mission and work. In fact, we mention them in many of our blog posts about this bounty.

Oscar asked me if we should try to work with OpenElections on this before we started the bounty. I decided that we should not waste Derek's time until we saw the results. We weren't sure anyone would actually do the bounty. If we were going to fail spectacularly, I would rather fail spectacularly in private.

How did OpenElections data get into this database?

When we started the contest, there was no OpenElections data published, so the bounty participants started with results from state and county websites. Here's an example PR for 2020 Virginia results. As a rule for accepting PRs, we required that the source of the data be posted in the PR. 65 of the 80 PRs merged during the bounty are sourced from the MIT Election Lab dataset or from state and county websites, or are data cleaning PRs.

With about 10 days remaining in the bounty, a participant noticed that OpenElections had started publishing data on GitHub that we did not have in our database. The participant asked me in our Discord if it was ok to insert it. I checked the license for the data and decided it was ok as long as we attributed OpenElections in the PR. You can see the first PR here. In the end, 15 of the 80 PRs accepted were sourced from OpenElections. This represents 9.34% of the cells contributed in the bounty, whereas 90.66% are from other sources, including MIT Election Lab.

We had to use a Google Sheet to calculate the contribution from the OpenElections source, so feel free to check our work. This would be a cool DoltHub feature.
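For readers who want to check that kind of math themselves, here is a minimal sketch of the calculation, not the actual spreadsheet. It assumes you have a list of merged PRs, each tagged with a source and a count of cells contributed. The function name `share_by_source`, the source labels, and the cell counts are all made up for illustration; they are not the bounty's real numbers.

```python
from collections import defaultdict


def share_by_source(prs):
    """Return each source's percentage of all contributed cells.

    `prs` is an iterable of (pr_id, source, cells) tuples. This layout is
    an assumption for illustration, not the format of the real sheet.
    """
    totals = defaultdict(int)
    for _pr_id, source, cells in prs:
        totals[source] += cells

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}

    return {source: 100 * cells / grand_total for source, cells in totals.items()}


# Hypothetical cell counts, chosen only to make the arithmetic easy to follow.
example_prs = [
    ("pr-1", "state_or_county_site", 600),
    ("pr-2", "openelections", 100),
    ("pr-3", "mit_election_lab", 300),
]

shares = share_by_source(example_prs)
for source, pct in sorted(shares.items()):
    print(f"{source}: {pct:.2f}%")
```

The shares always sum to 100%, which is the same sanity check you can apply to the 9.34% and 90.66% figures above.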

Derek informed me that the seed database provided by MIT Election Lab includes OpenElections data. I'm not sure to what degree. If they used Dolt, we could examine the commit graph, as we did above, and determine the exact percentage.

Was that a good decision?

I think there are a few issues here. First, is it ok for the data to be in this database? Second, did we communicate the use correctly? Third, is it ok for someone to profit from inserting OpenElections data into Dolt?

Is it ok for the data to be in Dolt?

I think the answer here is a resounding yes. The spirit of open source and open data is that the more places the thing is used, the better it gets. In fact, our bounty participant made multiple PRs against OpenElections GitHub repositories; here's one.

The data on DoltHub is also free and open to use and improve. Dolt is a different consumption format than flat files stored in Git, the current distribution method of OpenElections data. Some people find SQL databases easier to use. A lot of work goes into transforming data into a common SQL schema. The resulting database is different from, and complementary to, what OpenElections offers.

Did we communicate the use correctly?

The answer here is a resounding no and that's on me. There was a series of bad decisions I made that led to Derek calling me out in the virtual public square to be tarred and feathered. First, I should have contacted OpenElections before we started to accept contributions using their data. I should have gotten permission instead of asking for forgiveness. Second, I should have credited OpenElections in the README as well as the bounty review article I posted. I apologized publicly. This blog article is a continuation of that apology.

Is it ok for someone to profit from inserting OpenElections data into Dolt?

This is sort of a gray area and one of the more interesting questions generated by this dispute. I think it's ok. It takes code and data science skills to convert the data in the OpenElections Git repositories to Dolt format. The bounty is compensating people for those skills and that effort.

Should there be a right of first refusal to the people who collected the source data in the first place? Maybe. We reached out to other folks and presented them with this opportunity for our hospital price bounty. I clearly did not do that here and that was a mistake. If bounties are successful, I think an etiquette will form around this space. Part of the tension here is forming that etiquette.

What are you doing to make this right?

I proposed a number of penances to Derek so he would understand my apology was sincere. This blog article is one of them. I hope this explanation and apology are satisfactory to most people, especially those folks who kindly volunteer their time on the OpenElections project.

More importantly, DoltHub is going to work with OpenElections to get their data into Dolt and DoltHub! We're funding the initiative with $10,000. We here at DoltHub have a ton of respect for what OpenElections has accomplished. Both of us want more data to be shared on the internet. We are mission aligned.

Conclusion

Before publishing, I ran this blog article by Derek to make sure he agreed with everything in it. He did. We look forward to working together.

If this topic interests you, we are running more data bounties. Currently we are collecting US hospital prices and US college course catalogs. Or just come by our Discord if you'd like to discuss.

Let's build an internet with more open data.
