Today, we're launching Bounties, a way to make money building Dolt databases. We'll have a follow-on blog post on Wednesday explaining the motivations for the Bounties feature. But today, we're going to cut to the chase and explain how you can make money wrangling data on DoltHub. Help us build a database of US presidential election precinct results and get paid for your contributions.
In the United States, we recently had a presidential election. Election data should be free and open. We've worked with the OpenElections project in the past, and we admire the MIT Election Lab. Inspired by these two data projects, for our first bounty we decided to offer a $25,000 prize pool for building precinct-level election data for the 2016 and 2020 US presidential elections.
We started with a cleaned version of the MIT Election Lab database. The cleaned results are in dolthub/us-president-precinct-results. We are accepting any inserted or updated rows for either the 2016 or 2020 presidential elections. The $25,000 bounty will be divided based on the percentage of cells each participant modified (last write wins) in any table in the database during the three month duration of the bounty. The bounty ends February 14, 2021, and payouts will be calculated and processed then.
The database contains four tables: candidates, counties, precincts, and vote_tallies. There are foreign key constraints, so make sure candidates, counties, and precincts are defined before adding the vote data to vote_tallies. We think structured data of this form will be immensely more useful for analysis than the patchwork of CSVs that exists today.
In the candidates table, we have the candidate names along with their fec identifiers. fec stands for Federal Election Commission, and presidential candidate data can be sourced from their website.
In the counties table, we have the county details; county and state are the keys. The MIT Elections data came with the ansi column, but we're unsure where to source that data. It looks like the census bureau uses ansi.
In the precincts table, we have precinct, state, county, and jurisdiction. All four columns are keys. There are some precincts where the county and the jurisdiction are different, but it's pretty rare. So for now, drop the 'COUNTY' or 'PARISH' from the county and call it the jurisdiction if it's the same.
Finally, in the vote_tallies table, we have the votes themselves. Make sure candidates, counties, and precincts are populated with the key columns before you insert your vote data into vote_tallies, as there are foreign key constraints into those tables.
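The ordering requirement above can be sketched with a toy example. This uses an in-memory SQLite database and a simplified, made-up subset of the schema (the real Dolt database has more columns and tables), but it shows why the precinct rows have to exist before the tallies:

```python
import sqlite3

# In-memory database with a simplified, illustrative version of the schema.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

conn.execute("""
    CREATE TABLE precincts (
        precinct TEXT, state TEXT, county TEXT, jurisdiction TEXT,
        PRIMARY KEY (precinct, state, county, jurisdiction)
    )
""")
conn.execute("""
    CREATE TABLE vote_tallies (
        precinct TEXT, state TEXT, county TEXT, jurisdiction TEXT,
        candidate TEXT, votes INTEGER,
        FOREIGN KEY (precinct, state, county, jurisdiction)
            REFERENCES precincts (precinct, state, county, jurisdiction)
    )
""")

# Inserting a tally before its precinct exists violates the constraint.
try:
    conn.execute(
        "INSERT INTO vote_tallies VALUES ('01', 'WA', 'KING', 'KING', 'X', 10)"
    )
except sqlite3.IntegrityError as e:
    print("rejected:", e)

# Define the precinct first, then the same tally insert succeeds.
conn.execute("INSERT INTO precincts VALUES ('01', 'WA', 'KING', 'KING')")
conn.execute(
    "INSERT INTO vote_tallies VALUES ('01', 'WA', 'KING', 'KING', 'X', 10)"
)
print("rows:", conn.execute("SELECT COUNT(*) FROM vote_tallies").fetchone()[0])
```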
We wanted to build an example of what a bounty contribution would entail to help get people started. I went to DoltHub and figured out which states had data already by running a query on the web. This gave me an idea of the states MIT imported for 2016.
I started by googling to see if Florida had published results for 2020. First link is the Florida Department of State. Seemed promising. Denied. Results for 2020 not up yet.
Not to be deterred, I modified my Google search a bit to see if some 2016 data existed for a state that wasn't imported yet. Having spent 12 years in Seattle, I have a soft spot for Washington State. Bingo. Both 2016 and 2020 precinct level presidential election data is published. It may be incomplete but it's something. Looks like the Washington data is published in a downloadable CSV for 2016 and 2020.
Now, I need to coerce this CSV into the schema Dolt wants. Based on my experience wrangling data, the best way to do this is to write a Python script to parse the CSV and output SQL. We have doltpy to make this easier. But if you are an Excel or Google Sheets expert, creating CSVs that match the schema of the Dolt tables and then using dolt table import -u or Upload a CSV on DoltHub works as well.
Here's the Python script I used to input the data into Dolt. I downloaded the 2016 CSV, parsed it, and inserted the results into Dolt using doltpy.
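My actual script isn't reproduced here, but the general shape of a CSV-to-SQL converter looks roughly like this. The column names and CSV layout are invented for illustration and won't match the real Washington export:

```python
import csv
import io

def csv_to_inserts(csv_text, table="vote_tallies"):
    """Parse precinct-level results and yield parameterized INSERT statements."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        # Hypothetical target columns; the real table has more.
        columns = ("precinct", "county", "candidate", "votes")
        values = (row["Precinct"], row["County"], row["Candidate"], int(row["Votes"]))
        placeholders = ", ".join(["%s"] * len(columns))
        sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
        yield sql, values

# Made-up two-row sample in the rough shape of a state results export.
sample = "Precinct,County,Candidate,Votes\n0001,KING,X,10\n0002,KING,Y,7\n"
for sql, params in csv_to_inserts(sample):
    print(sql, params)
```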
The first challenge is that the data is distributed with a two-letter "county code", so I had to manually build a county_map with the additional county information the Dolt database requires. I sourced Washington State county information from Wikipedia. I ignored lat and long in the interest of expedience; filling these columns will get someone a piece of the bounty! I ignored ansi because I couldn't figure out what ansi was. It looks like the census bureau uses ansi? We started to debate internally whether fips was sufficient in the counties table. Entering the county data, I noticed that King County, where Seattle is located, does not have precinct level data. Another way for a bounty participant to grab their share of the bounty!
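A county_map of this sort is just a hand-built dictionary. The two-letter codes below are made up for illustration (the real Washington export uses its own codes, and the full map covers all 39 counties), though the FIPS codes shown are real:

```python
# Illustrative map from a state's two-letter county codes to the extra
# county columns the database wants. Codes here are hypothetical.
county_map = {
    "KI": {"county": "KING",   "state": "WASHINGTON", "fips": "53033"},
    "PI": {"county": "PIERCE", "state": "WASHINGTON", "fips": "53053"},
}

def expand_county(code):
    """Translate a CSV county code into the columns the counties table needs."""
    try:
        return county_map[code]
    except KeyError:
        raise ValueError(f"unknown county code: {code!r}")

print(expand_county("KI")["fips"])  # 53033
```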
Next, I needed to add the precincts. jurisdiction is a primary key in this database because in some states, some precincts are in the same county but a different jurisdiction. In Washington, that doesn't seem to be the case, so by convention I just stripped 'COUNTY' off the end of the county and made that the jurisdiction.
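That convention is a one-line string transformation. A sketch, assuming uppercase county names like the ones in this database:

```python
def derive_jurisdiction(county_name):
    """Strip a trailing 'COUNTY' (or 'PARISH') to get the jurisdiction name."""
    name = county_name.strip().upper()
    for suffix in (" COUNTY", " PARISH"):
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name

print(derive_jurisdiction("KING COUNTY"))  # KING
```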
Then, I looked at the candidates. All the candidates were already in the candidates table, but I needed to map them to the names they existed under in that table. There were only seven unique candidates, so this was not much trouble.
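The mapping itself can be a small dictionary. The spellings on both sides here are hypothetical, not the actual values in the candidates table:

```python
# Map candidate names as they appear in a state CSV to the names already in
# the candidates table. Both sides are illustrative examples.
candidate_map = {
    "TRUMP / PENCE": "DONALD J TRUMP",
    "CLINTON / KAINE": "HILLARY CLINTON",
}

def canonical_candidate(csv_name):
    """Normalize a CSV candidate string to the name the database uses."""
    return candidate_map[csv_name.strip().upper()]

print(canonical_candidate("Trump / Pence"))  # DONALD J TRUMP
```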
Finally, it was a matter of putting it all together and inserting into the vote_tallies table. I chose to set writein to false and vote_mode to 'ELECTION DAY' because those variables were not represented in the CSV data I downloaded.
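Putting those defaults together, each output row can be built with a small helper. The column set here is a simplified assumption, not the exact vote_tallies schema:

```python
def tally_row(precinct, state, county, jurisdiction, candidate, votes):
    """Build one vote_tallies row, filling columns the CSV doesn't carry."""
    return {
        "precinct": precinct,
        "state": state,
        "county": county,
        "jurisdiction": jurisdiction,
        "candidate": candidate,
        "votes": int(votes),
        # Not present in the downloaded CSV, so defaulted:
        "writein": False,
        "vote_mode": "ELECTION DAY",
    }

row = tally_row("0001", "WASHINGTON", "KING", "KING", "X", "10")
print(row["vote_mode"])  # ELECTION DAY
```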
Throughout, I used dolt diff to check my work after I had made my writes. When I made a mistake, I ran dolt reset --hard and started fresh from the tip of my branch. Dolt makes writing to databases really forgiving.
There were a couple of gotchas. First, the standard MySQL connector in Python turns off autocommit by default. This does not play well with Dolt, so you need to run SET @@autocommit = 1 after you open a connection. Second, I wanted to make the script re-runnable. Dolt does not support ON DUPLICATE KEY, so I had to do a select count(*) on the primary keys of a table before I inserted; replace queries threw foreign key constraint errors for all but one of the tables. ON DUPLICATE KEY will be supported in Dolt shortly.
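The re-runnable pattern looks roughly like this. SQLite stands in for a real Dolt SQL connection in this sketch (with the MySQL connector you'd also run SET @@autocommit = 1 right after connecting, as noted above):

```python
import sqlite3

# Stand-in for a Dolt SQL connection; with the real MySQL connector you
# would first run: cursor.execute("SET @@autocommit = 1")
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE precincts (precinct TEXT, state TEXT, PRIMARY KEY (precinct, state))"
)

def insert_if_absent(conn, precinct, state):
    """Emulate ON DUPLICATE KEY: check the primary key before inserting."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM precincts WHERE precinct = ? AND state = ?",
        (precinct, state),
    ).fetchone()
    if count == 0:
        conn.execute("INSERT INTO precincts VALUES (?, ?)", (precinct, state))

# Running the same insert twice leaves one row, so the script is re-runnable.
insert_if_absent(conn, "0001", "WA")
insert_if_absent(conn, "0001", "WA")
print(conn.execute("SELECT COUNT(*) FROM precincts").fetchone()[0])  # 1
```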
Next, I made my own fork of dolthub/us-president-precinct-results on DoltHub. I set a remote named personal-origin pointing at my fork using dolt remote add personal-origin https://doltremoteapi.dolthub.com/timsehn/us-president-precinct-results, and pushed to personal-origin. Then I went to DoltHub and made a pull request from my fork to the dolthub/us-president-precinct-results repository. In the pull request, I included the information relevant to getting my work accepted, like the source of the data and the assumptions I made.
Once my pull request was accepted, I could check the current state of my bounty payout and watch it change as more pull requests were accepted. I could either be happy with the payout or do more work to earn a greater share of the bounty.
All told, it took me about four hours to insert 622,848 cells. If 10,000,000 cells are inserted over the duration of the bounty, I will have made $1,557.12 when the bounty closes on February 14, 2021. That's $389.28 per hour. If more cells are committed by other bounty participants, I make less; if fewer cells are committed, I make more. Not bad at all for a part-time data wrangling job.
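The payout arithmetic above is easy to check:

```python
# The payout math from the paragraph above.
prize_pool = 25_000
my_cells = 622_848
total_cells = 10_000_000  # hypothetical total at bounty close
hours = 4

payout = prize_pool * my_cells / total_cells
print(f"payout: ${payout:.2f}, hourly: ${payout / hours:.2f}")
# payout: $1557.12, hourly: $389.28
```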
I left the 2020 CSV for the first enterprising bounty participant who reads this. With a couple small changes to my script, you can insert a bunch of cells for the bounty and be on your way to getting your piece of that $25,000 bounty. First Pull Request wins. Get data wrangling right now!
Start Contributing Today
We are really excited to launch Bounties but we'll be even more excited to start seeing contributions come in. Come hang out with us on our Discord channel. We have a Bounties room to help get you started. Let's build the best precinct level US presidential election database out there and get paid doing it.