Data Integrity for Open Data

Open Data Validation

Recently an article made the rounds at our company about "data integrity" checks. The article advocates that in the absence of perfect code that never corrupts data, it's wise to have "data integrity checks" that ensure data quality. The article is written by data engineers at Slack, and it focuses on the goal of ensuring integrity of data internal to an organization. The article provides several examples of "integrity checks" in the context of adding a new table to store ZIP codes:

  • What is enforcing that a one-to-many relationship exists between zip codes and addresses?
  • Are any zip codes outside of the 0–99999 range of integers?
  • What if the code is within range but invalid according to the USPS?
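Checks like these can be expressed as queries that return violating rows. Here is a minimal sketch using SQLite from Python; the table contents and column names are invented for illustration, and a USPS validity check would additionally require a reference list of real ZIP codes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE zip_codes (zip TEXT PRIMARY KEY);
CREATE TABLE addresses (street TEXT, zip TEXT);
INSERT INTO zip_codes VALUES ('02139'), ('123456');
INSERT INTO addresses VALUES ('1 Main St', '02139'), ('2 Elm St', '99999');
""")

# Range check: zip is stored as text, so "outside 0-99999" shows up as a
# value that is not exactly five digits.
out_of_range = conn.execute(
    "SELECT zip FROM zip_codes WHERE zip NOT GLOB '[0-9][0-9][0-9][0-9][0-9]'"
).fetchall()

# Relationship check: addresses referencing a zip code that does not exist.
orphans = conn.execute(
    """SELECT street, zip FROM addresses
       WHERE zip NOT IN (SELECT zip FROM zip_codes)"""
).fetchall()

print(out_of_range)  # [('123456',)]
print(orphans)       # [('2 Elm St', '99999')]
```

An integrity check suite is then just a collection of such queries, each expected to return zero rows.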

Last week I wrote an article about open data projects, highlighting similarities and differences with open source software projects. Testing is hugely important in open source projects. The open model means that trust has to be established on technical grounds. When code is maintained within an organization a certain degree of trust in collaborator motives is often assumed, but that trust is absent in open source, elevating the importance of automated testing as an objective measure of quality.

Last week's article highlighted that "open data" often just refers to "publicly available" data, and that the presence of a single publisher to some degree short-circuits this need for testing to establish trust. Given that datasets can generally be understood to be collections of facts, the existence of a single reliable publisher obviates some of the need for validation. But we also highlighted true open data projects with a large number of collaborators. Can we draw on the example of data integrity checks internal to an organization, and the heavy use of automated testing in open source software projects, to propose ways to advertise the quality of open data projects to prospective users? Can this motivate organizations to evaluate open data projects as drop-in data dependencies for their own data infrastructure, and to become contributors to the open data ecosystem? The reasons so many organizations contribute resources to open source are often selfish: they are in a very real sense stakeholders in those open source tools. Can the same effect take hold in the open data ecosystem?

The balance of this article is devoted to answering those questions by suggesting the use of existing, well understood datasets to validate open data projects. We use Dolt and DoltHub to illustrate how to validate the values for counties in the Open Elections dataset.

Types of Correctness

The discussion above highlighted two kinds of data "correctness":

  • technical correctness: are the format and types of the data correct?
  • factual correctness: does the data represent a valid collection of facts, to the extent it is possible to actually validate that?

Let's dive into each of them in more detail.

Technical Correctness

Quoting again from the article referenced in the introduction, the authors produce a useful definition of what I called "technical correctness":

Some databases can enforce relationships at the systems level, using foreign keys, constraints, and other similar concepts, but these built-in checks and constraints represent only a small set of possible characteristics that may be modeled in any given system.

There is a hint of an analogy here to code compiling. The code might produce total gibberish, but the "form" is correct: the compiler can read it. Additionally, databases use concepts such as foreign keys and CHECK constraints to expose the ability to bake additional notions of correctness into our schema. Writing data into a database acts as a "forcing function" for certain kinds of correctness. And the more domain-specific requirements of the data that we bake into the schema via tools such as foreign keys and CHECK constraints, such as using CHECK to ensure a zip code is in the range 00000-99999, the more powerful that forcing function becomes.
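To make the "forcing function" concrete, here is a small sketch using SQLite from Python: a CHECK constraint bakes the zip code format into the schema, so malformed writes are rejected at insert time. The table and data are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Bake the zip code format directly into the schema with a CHECK constraint.
conn.execute("""
CREATE TABLE addresses (
    street TEXT NOT NULL,
    zip    TEXT NOT NULL CHECK (zip GLOB '[0-9][0-9][0-9][0-9][0-9]')
)
""")

# A well-formed row is accepted.
conn.execute("INSERT INTO addresses VALUES ('1 Main St', '02139')")

# A malformed row is rejected by the engine, not by downstream code.
try:
    conn.execute("INSERT INTO addresses VALUES ('2 Elm St', '2139')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The value of expressing the rule in the schema is that every write path is covered, rather than relying on each producer of data to run the same validation code.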

Ensuring technical correctness is one of the fundamental reasons to distribute data in the Dolt format, rather than via collections of CSVs, or an API. The requirement to define a schema and account for corner cases in types elevates the quality of the data, enables the use of a query engine, and along with the clone functionality of Dolt creates a radically better experience for the consumer of the data. When users have a delightful experience consuming data from an open data project we believe they are more likely to become stakeholders who can further improve the scope and quality of the project over time.

Data Integrity

Having put the data in a database, and imbued the schema with domain constraints (such as restricting zip codes to the range of numerical character strings 00000-99999, or foreign key constraints), we are left with the hardest kind of check: is the data actually factually correct? This is obviously a hard thing to establish for some kinds of data, as data can capture subjective opinion, or some other difficult-to-define concept.

For now we can stick to "facts" that have a clear definition of correctness. For example, the Open Elections data contains a "county" column, and it's conceptually straightforward to answer the question "Is the county provided as a value in that row actually a county in the United States?" However, operationalizing such checks against your public and private data can be more challenging.

We now turn to discussing Open Elections, an example of a true "open data" project, and how we are using Dolt to improve the quality of the data, and show how to use other datasets on DoltHub to validate the data.

Open Elections

In another recent blog post we highlighted the Open Elections data that is now on DoltHub. To briefly recap, Open Elections aims to:

Our goal is to create the first free, comprehensive, standardized, linked set of election data for the United States, including federal and statewide offices.

Our goal is to turn that data, roughly consisting of fifty state-level GitHub repositories of CSVs, into a single coherent SQL database that can be cloned and immediately analyzed across election cycles and states. Currently doing so requires "stitching" together CSVs across repositories and directories. One issue with soliciting contributions in the form of CSVs is that the CSV format does not act as a "forcing function" for valid data. While the examples in the introduction highlighted ways data in a database can be corrupt, data that is not in a database has a substantially wider set of possible kinds of corruption. Some examples include:

  • incorrectly formatted CSVs, for example an extra comma can make a row unreadable
  • encoding errors making files unreadable in certain contexts; see an example of a pull request I made against Open Elections to fix such an instance
  • data having the wrong type, or depending on the CSV parser to interpret "1,000" as the integer 1000
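A short Python sketch shows the kinds of errors a CSV-based workflow has to catch by hand, where a database would reject them on write. The file contents here are invented for illustration:

```python
import csv
import io

# A tiny CSV with two problems: one vote count is the string "1,000",
# and one row has an extra comma (an extra column).
raw = io.StringIO(
    'county,votes\n'
    'Riverside,"1,000"\n'
    'San Francisco,500,extra\n'
)

errors = []
reader = csv.reader(raw)
header = next(reader)
for line_no, row in enumerate(reader, start=2):
    if len(row) != len(header):
        errors.append((line_no, "wrong column count"))
        continue
    try:
        int(row[1])  # "1,000" is not a valid integer without parser help
    except ValueError:
        errors.append((line_no, "votes is not an integer"))

print(errors)  # [(2, 'votes is not an integer'), (3, 'wrong column count')]
```

None of these problems stop the file from being committed to a repository of CSVs; they only surface when someone tries to consume the data.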

Thus the issues identified in the introduction are the issues that remain once your data has been put through a query engine, which implicitly validates certain aspects of correctness by rejecting data that cannot be safely coerced to the schema. In loading the Open Elections data into Dolt we hoped to elevate its quality by enforcing structural and type correctness.

Having loaded the data into Dolt, we now turn to the second type of correctness, namely "data integrity." These are ultimately questions of fact, and thus far more complex. We show an example of how to use a canonical dataset from DoltHub to validate the Open Elections data, running Dolt in multi-database mode so we can run SQL across different data sources, using one to validate the other.

Dolt and DoltHub

At DoltHub we have made it our goal to assemble and maintain a large and growing catalog of "canonical" datasets. Some recent examples:

Let's use the last of these, the mirror of the Wikipedia table mapping US counties and county equivalents to their FIPS codes. We can use it to validate that the values for counties in the Open Elections data are valid counties. First let's set up the analysis by cloning the relevant data:

$ mkdir validate-open-elections && cd validate-open-elections
$ dolt clone wikipedia-mirrors/county-fips-codes
cloning https://doltremoteapi.dolthub.com/wikipedia-mirrors/county-fips-codes
39 of 39 chunks complete. 0 chunks being downloaded currently
$ dolt clone open-elections/voting-data
cloning https://doltremoteapi.dolthub.com/open-elections/voting-data
10 of 1,497,188 chunks complete. 262,144 chunks being downloaded currently.
$ dolt sql --multi-db-dir .
# Welcome to the DoltSQL shell.
# Statements must be terminated with ';'.
# "exit" or "quit" (or Ctrl-D) to exit.
> show databases;
+--------------------+
| Database           |
+--------------------+
| county_fips_codes  |
| information_schema |
| voting_data        |
+--------------------+

We are now running the Dolt SQL shell with our two datasets loaded. One is the dataset we want to validate, voting_data from Open Elections; the other is the dataset we are using to validate it. Let's check the counties in California as an example. A few things to note:

  • Open Elections data for California does not include the word "county" in counties
  • Some county names in the reference dataset include 'City and County of', which is in fact correct
  • We remove data that does not have a county specified, that is, has the value 'NA'

voting_data> select distinct(concat(county, ' County')) from national_voting_data where county != 'NA' and state = 'CA' and concat(county, ' County') not in (select county_or_equivalent from county_fips_codes.county_by_fips);
+------------------------------------------------+
| concat(national_voting_data.county, " County") |
+------------------------------------------------+
| San Francisco County                           |
| Riverside County County                        |
+------------------------------------------------+

We can easily drill into both of these examples:

voting_data> select * from county_fips_codes.county_by_fips where county_or_equivalent like 'San Francisco%';
+-------+-----------------------------------+---------------------+
| fips  | county_or_equivalent              | state_or_equivalent |
+-------+-----------------------------------+---------------------+
| 06075 | San Francisco, City and County of | California          |
+-------+-----------------------------------+---------------------+
voting_data> select distinct(county) from national_voting_data where county != 'NA' and state = 'CA' and county like 'Riverside%' ;
+------------------+
| county           |
+------------------+
| Riverside        |
| Riverside County |
+------------------+

The first is a quirk of our reference dataset, and we can verify this quickly using SQL to ensure we have the correct value for San Francisco County:

voting_data> select distinct(county) from national_voting_data where county != 'NA' and state = 'CA' and county like 'San Francisco%' ;
+---------------+
| county        |
+---------------+
| San Francisco |
+---------------+

The second has an obvious cause: there are records in the Open Elections data where the county "Riverside" has been provided as "Riverside County". While both are factually correct, as noted above, the convention in the data is to omit the word "County". We can immediately write a SQL query to update the values, create a branch, and make a pull request against the repo:

voting_data> update national_voting_data set county = 'Riverside' where county = 'Riverside County';
Query OK, 1832 rows affected
Rows matched: 1832  Changed: 1832  Warnings: 0
$ dolt checkout -b correct-california-counties
$ dolt add national_voting_data
$ dolt commit -m 'Made values for Riverside county consistent for California data'
$ dolt push origin correct-california-counties

You can inspect the resulting pull request against the DoltHub database here. This is how open source software improves: users submit improvements as pull requests when they find issues. By maintaining high quality, generally useful canonical datasets alongside more specialized datasets on DoltHub, we make it possible to make these kinds of improvements, and we provide the collaboration tools to do so with a familiar Git/GitHub workflow. In a post next week we will cover implementing a wider set of automated checks on Open Elections using canonical datasets, using our webhooks tooling alongside existing CI solutions!

Conclusion

At DoltHub we are continually creating and maintaining canonical datasets that can be used to validate more specialized datasets. We also have a webhooks feature that can be used to kick off Jenkins, or any other CI tool, providing users with the building blocks for validating public and private datasets in an automated fashion. We hope that Dolt and DoltHub can accelerate the dynamics of the open data ecosystem. With robust data validation via CI, open data projects can advertise their quality to users and organizations, and motivate those users to recognize they can drive value by contributing resources back to the projects they rely on.
