A core motivation for the DoltHub team is a belief that obtaining and distributing data should be seamless and robust. Correctness and power combined with simplicity make for positive user experiences. We want users to think in terms of queries on datasets, not the work they have to do to translate collections of CSVs or JSON files into a format which can be queried. Similarly we don't think the tedious work querying APIs and parsing the results should be required to construct useful queries. We have previously blogged about the importance of choosing an appropriate format for data distribution, and advocated for Dolt as candidate, and DoltHub as a collaboration platform for Dolt. Fundamentally, we want to reduce the amount of effort required to get insights from data.
This blog post presents a concrete example of using Dolt to make an existing public dataset, spread over many files, into a coherent database that can be obtained in a single command using
dolt clone. More generally we want to highlight the notion of a dataset as a "production resource" that can be plugged into your infrastructure in the same way a Python package can be, with near zero "marginal technical effort".
USPS Crosswalk Database
Many datasets of public interest are based on ZIP code, a United States Postal Service (USPS) description of a geographic area. However, we are often interested in answering questions using these datasets that require a different geographic unit of aggregation. For example, we host a dataset from Open Elections, which we blogged about last week. We also have an IRS dataset on sources of income. The Open Elections data is identified by geographic administrative areas such as county, district, and precinct. Essentially the administrative units of elections. The IRS data by ZIP code. This is a specific example of the general problem we started with.
If we wanted to do an analysis that involved both datasets, an example being analyzing sources of income as predictors of voting patterns, we need to find a way to map one dataset into the "ID space" of the other. This isn't just a problem in social science research, but also in many financial domains. Models are built using raw data, and the signals they generate need to be "proxied" into a tradable security in order to express the "view" the model implies. If the raw data has a different "ID space", then a mapping has to be established for the signal built on top of it to be useful.
USPS "crosswalk" data provides a way to translate from ZIP code to various other geographic administrative areas, and vice versa. To obtain that data currently requires choosing a quarter (as in one quarter of the year), and identifier type, and downloading the file. Users then need to choose how to incorporate this into their analysis (via Pandas, etc.), and download the new file when they want to update their data. We are pleased to make a Dolt database, updated quarterly, available on DoltHub. It comes with a SQL query interface and a complete history, and can be cloned easily:
$ dolt clone dolthub/usps-crosswalk-data
$ dolt sql
# Welcome to the DoltSQL shell.
# Statements must be terminated with ';'.
# "exit" or "quit" (or Ctrl-D) to exit.
usps_crosswalk_data> select * from zip_county limit 10;
| zip | county | res_ratio | bus_ratio | tot_ratio | oth_ratio | month | year |
| 00501 | 36103 | 0 | 1 | 1 | 0 | 3 | 2010 |
| 00501 | 36103 | 0 | 1 | 1 | 0 | 3 | 2011 |
| 00501 | 36103 | 0 | 1 | 1 | 0 | 3 | 2012 |
| 00501 | 36103 | 0 | 1 | 1 | 0 | 3 | 2013 |
| 00501 | 36103 | 0 | 1 | 1 | 0 | 3 | 2014 |
| 00501 | 36103 | 0 | 1 | 1 | 0 | 3 | 2015 |
| 00501 | 36103 | 0 | 1 | 1 | 0 | 3 | 2016 |
| 00501 | 36103 | 0 | 1 | 1 | 0 | 3 | 2017 |
| 00501 | 36103 | 0 | 1 | 1 | 0 | 3 | 2018 |
| 00501 | 36103 | 0 | 1 | 1 | 0 | 3 | 2019 |
We believe that using Dolt and DoltHub for datasets such as this one can create value for both researchers and engineering teams that need to rely on them. You can find the repo on the web here.
It's all too common to hear folks in the data science community lament the amount of time that they spend cleaning and munging data. A primary reason for this is the distribution formats are just formats, they do not make guarantees about data type, or structure. They certainly stipulate structure, but a database enforces structure. These are radically different levels of "guarantee". Dolt enforces structure, meaning it's more work to get data into Dolt, but these powerful guarantees enable the use of a query engine on the format.
Furthermore, data often have to be spread across multiple files, or in the case of JSON buried in an unpleasant nesting structure necessitating parsing to setup something usable. Dolt arrives as a SQL database. Thus researchers pulling data from DoltHub, especially "generically useful" datasets such as this one, get the benefit a query interface out of the box, eliminating the need for boiler plate ETL code.
The USPS Crosswalk data is a prime example of this. Most researchers aren't using it as a primary resource, it's a generically useful mapping table to enable insights to be drawn from more unique data sources, or for insights to be drawn across data sources. In this sense it has much in common with a software library, such as Pandas. We don't have to manually download the new version of Pandas, we have tools that eliminate that friction so we can focus on our analysis.
Dolt is a tool for making this data dependency simple to work with. It does this by providing a Git inspired model for distribution that's elegant, simple, and robust. Here is an example of obtaining and updating the USPS data in Python using Doltpy:
from doltpy.core import Dolt
repo = Dolt.clone('dolthub/usps-crosswalk-data')
We can then immediately explore the data in SQL, or load it into Pandas or R data-frame interfaces. We can also subscribe to the data, opening the possibility of "living" analyses. You can find examples of this, and a whole lot more, on our documentation page.
Core to our product vision for Dolt is users being able to obtain a "drop-in production resource." Specifically instead of engineers doing ETL work every time they want to integrate data into their workflows and production systems, they simply run
dolt clone and then use
dolt sql-server to standup a SQL server which works in conjunction with their existing production relational database resources. We envision users directly pulling data from a Dolt instance, or syncing to their current production database using the Python libraries we make available for that purpose. The bottom line is we want users to get data into production with no more than a few lines of code, and keep it updated with a single line of code, but without tying them into a hosted a solution. Dolt is agnostic as to where users want to host their data.
We just articulated the benefits of Dolt as a data distribution format to both researchers and engineers involved in creating and maintaining data driven systems. But what about the handoff? What happens when researcher wants to take their model, stitched together using a number of processes for parsing various data resources, and get it in production? Part of "productionalizing" a model for an engineering team can be figuring out how to turn the model's various data inputs into production resources.
Dolt solves this problem for researchers and their partners in engineering by elevating the quality of the distribution format to that of a production resource, while at the same time working within the ecosystem of relational database infrastructure.
In this post we highlighted a concrete example of a dataset that users can obtain using DoltHub, in the Dolt format, where the distribution model leads to a vastly improved researcher and developer experience. Part of our strategy for adoption is to make generally useful datasets such as this one available on the platform, and we will continue to expand our offering as time goes on. Please don't hesitate to reach out to us if you have an idea for a dataset you think would be a good fit.