DoltHub is a collaboration platform for data stored in Dolt, a relational database and data storage format with Git-like version control features for structured data. Together, the vision of Dolt and DoltHub is to empower decentralized communities to combine their efforts and create high quality open datasets. As part of our journey building DoltHub we have been looking at existing "open data" projects, and one of the most exciting we have encountered is Open Elections. We decided to replicate the Open Elections dataset on DoltHub because we believe that a SQL database containing a comprehensive dataset of American elections will interest the data community in a presidential election year, and because it demonstrates the value of using Dolt as a data distribution format.
We are excited to share a DoltHub repository with precinct level voting data from 41 states. The remaining 9 states either have data issues, or are not yet present in the Open Elections dataset. We are committed to getting to 50 as soon as possible. Furthermore, we have built infrastructure for transforming the Open Elections data into the Dolt format, so as Open Elections continues to increase the scope of their coverage this dataset will improve.
Open data projects get better when people contribute, so if you are interested in this data being freely available, we highly encourage you to contribute. This blog aims to show how distributing that data as a SQL database can make those contributions even more valuable by lowering the barrier to getting insights.
Open Elections Project
The goal of the Open Elections project is simple to state:
Our goal is to create the first free, comprehensive, standardized, linked set of election data for the United States, including federal and statewide offices.
Like most interesting problems this has the familiar property of being "easy to state, and hard to solve." It is hard to solve because the United States has multiple levels of government and no standardization of reporting mechanisms and channels across the various state governments that are responsible for administering elections. States vary greatly in the granularity of their reporting formats and modernity of their data distribution mechanisms, exemplified by this (truly surreal) Tweet from the Open Elections Twitter handle showing the paper reports that one county insisted on using to provide election data:
There are 50 states in America, further divided into 3,143 counties. Each state can use subtly different terminology and provide different tabulation buckets for its voting data. Some states merely report the total votes for candidates or propositions, while others report a rich set of breakouts such as mail-in ballot counts. Not only are the schemas not uniform, but the reporting mechanisms vary widely. To create a "comprehensive and standardized" dataset, Open Elections has to coordinate a disparate set of volunteers posting CSV files to state level GitHub repositories. This is a huge effort that has to be optimized for attracting as many volunteer hours as possible, which in turn requires choices that make life easy for the volunteer, ruling out an overly restrictive toolchain.
Open Elections Data
This requirement of a lower "barrier to participation" manifests technically as using GitHub to coordinate volunteers providing CSV files. CSVs are conceptually easy to understand, and their age and ubiquity mean that nearly every data manipulation tool, from Excel down to Python libraries, can export them, so volunteers can do their data munging wherever they are most comfortable. With any open data project volunteer hours are the lifeblood, so it can be necessary to prioritize lowering barriers to participation over technical elegance.
By using a CSV file as the atomic unit of output for project volunteers, Open Elections has a model for data gathering that nicely fits into GitHub's workflows. Project maintainers can scope the required data for a state (counties, data sources, etc.), and volunteers can pick up tasks and provide one or more CSVs corresponding to the results reported by some administrative body.
The CSV files look something like this, where you can see that the filenames embed some metadata in the name:
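For instance, an illustrative sketch following the project's `date__state__election__level.csv` naming convention (the exact filename segments, column set, and values here are assumptions for illustration, and vary by state and year):

```text
20181106__ca__general__precinct.csv

county,precinct,office,district,party,candidate,votes
Alameda,1010,State Assembly,18,DEM,Smith,512
Alameda,1010,State Assembly,18,REP,Jones,301
```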
While a CSV per unit of collection makes sense from a project management standpoint, it doesn't make for the easiest data distribution story. If a user is interested in data across multiple reporting bodies, states, or years, they must write some kind of filter to effectively acquire the relevant data. Then they must use some kind of parsing tools to create a query interface, and transform the various files into the unified format for comparison. This makes it tough to answer questions like "what are the precincts with the largest swings in congressional party preference from 2016 to 2018 elections?"
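To make that concrete, here is a minimal sketch of the filtering and parsing work a user must do by hand without a database. It assumes the `YYYYMMDD__state__election__level.csv` naming convention and a `votes` column; the file contents below are invented for illustration:

```python
import csv
import io
from pathlib import Path

def parse_filename(name):
    """Extract the metadata embedded in an Open Elections-style filename,
    e.g. '20181106__ca__general__precinct.csv'."""
    date, state, election, level = Path(name).stem.split("__")
    return {"date": date, "state": state, "election": election, "level": level}

def total_votes(csv_text):
    """Sum the 'votes' column of one CSV file's contents."""
    return sum(int(row["votes"]) for row in csv.DictReader(io.StringIO(csv_text)))

# Stand-ins for files a user would have to fetch from the GitHub repos.
files = {
    "20181106__ca__general__precinct.csv": "county,precinct,candidate,votes\nAlameda,1010,Smith,120\n",
    "20161108__ca__general__precinct.csv": "county,precinct,candidate,votes\nAlameda,1010,Jones,98\n",
}

# Filter to 2018 files by parsing filenames, then parse and aggregate the
# contents -- exactly the work a SQL engine does for free with a WHERE clause.
total_2018 = sum(
    total_votes(text)
    for name, text in files.items()
    if parse_filename(name)["date"].startswith("2018")
)
print(total_2018)
```

Multiply this across thousands of files, dozens of directory layouts, and per-state schema quirks, and the appeal of a single queryable table becomes clear.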
At DoltHub we are really excited about open data projects, and we think Open Elections is an amazing example. We also think that the data could benefit from being stored in a SQL database.
Using a Database
Getting data into a database is a lot of work. When data is stored in CSVs or Excel sheets, correctness is often understood in the "human" sense. For example, humans "know" that "-" means the integer 0 in the context of voting data, but try loading it into a database and the column's integer type will reject it. At a high level this highlights the tradeoff between how "loose" a format is and how much "power" a query engine can bring to bear on the data. Databases have more rigid data formats, which in turn allows their query engines to be very powerful. Dolt is a database, so getting data stored in "looser" formats into Dolt requires some work.
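A sketch of the kind of cleaning pass this implies. The "-" convention comes from the example above; the other placeholder handling here is an assumption about conventions one might encounter, not a description of the actual import code:

```python
def normalize_votes(raw):
    """Coerce a raw vote-count cell to an integer, resolving conventions a
    human reader handles implicitly but a typed column rejects."""
    value = raw.strip()
    if value in ("", "-"):  # blank or dash conventionally means zero votes
        return 0
    return int(value.replace(",", ""))  # drop thousands separators like '1,234'

print(normalize_votes("-"))      # 0
print(normalize_votes("1,234"))  # 1234
```

Doing this once, before the data lands in a typed `votes` column, is what lets every downstream query trust `sum(votes)` without re-cleaning anything.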
We decided to embark on loading all of the Open Elections data into Dolt, distributed on DoltHub, because we wanted to bring the power of a SQL engine to the data. The "gap" between the Open Elections data as it exists on GitHub and a Dolt database represents the work needed to transform this enormous collection of CSVs into a format readable by Dolt's query engine. You can see how much work went into that by looking at the code in this repository, and it's worth pointing out that Open Elections is a generally high quality dataset. At the core of open data projects such as this one is the fact that this work is amortized as community benefit every time a user gets immediate insight from the data without doing any acquisition and cleaning work. Concretely, let's see how easy Dolt makes it to acquire a SQL interface on the Open Elections data:
$ dolt clone open-elections/voting-data && cd open-elections
1,112,799 of 1,112,799 chunks complete. 0 chunks being downloaded currently.
$ dolt sql
# Welcome to the DoltSQL shell.
# Statements must be terminated with ';'.
# "exit" or "quit" (or Ctrl-D) to exit.
open_elections> show tables;
| Table |
| national_voting_data |
A fair question to ask is what value this creates for the community of folks interested in election data. We will share some examples of how to inspect the data across states and election cycles using only SQL.
Earlier we mentioned the huge number of voting districts across states, and we can immediately make this concrete using SQL:
open_elections> select state, count(distinct(precinct)) from national_voting_data where year = 2018 group by state;
| state | COUNT(DISTINCT national_voting_data.precinct) |
| AK | 1569 |
| AL | 2267 |
| AR | 3176 |
| AZ | 1489 |
| CA | 21709 |
Each of those states has its own repository, and each repository has thousands of CSVs containing the amazing work done by volunteers to gather the relevant data. Exposing it in Dolt immediately creates a query interface for members of the data community to harvest that value without doing an ETL project. This is also extremely useful for data validation: we would expect the number of counties in a state to be stable across election cycles. For example, let's look at county count stability through time, another dimension that is spread across directories in the raw data:
open_elections> select year, count(distinct(county)) from national_voting_data where state = 'CA' group by year;
| year | COUNT(DISTINCT national_voting_data.county) |
| 2014 | 58 |
| 2015 | 5 |
| 2016 | 58 |
| 2017 | 1 |
| 2018 | 60 |
We can see here a change of two counties between 2016 and 2018, and since California has exactly 58 counties, the 2018 count of 60 immediately flags a potential data issue (the low counts in 2015 and 2017 presumably reflect off-year elections covering only a handful of counties). We can use this kind of analysis to rapidly sanity check the data with absolutely no ETL work. Dolt distributes data that is ready for questions, with no need to write parsing procedures before looking for insights.
Installing Dolt is easy; check out the installation docs for more details.
Open Elections is aiming to make voting data from all legislative elections available in CSV format on GitHub. We took that data and transformed it into a single table that captures precinct level voting data as a first step toward creating an easily distributed SQL database of national elections.