"Open Data" is quickly attaining the hype and ambiguity of previous tech crazes like "Big Data" and "Block Chain".
The motivation behind Open Data is easy to understand: data is one of the most valuable but
closely guarded resources in the tech industry.
Democratizing these resources has the potential to catalyze new industries and innovation.
The ultimate goal of open data would be to replicate the magic of open source software: the ability to create online
communities dedicated to producing trusted, in-demand data resources.
Open Data has a lot of growing to do to reach the maturity of the open source software ecosystem.
Open Source projects are trusted because of the number of contributors building, fixing, and maintaining them.
Git and GitHub marked a major advancement in OSS by enabling distributed collaboration and by advancing the tools and
review processes powering that collaboration.
Automated testing, linting, and human code review create structure and certainty for OSS projects.
These tools don't exist yet for Open Data.
We created Dolt to solve these problems, and built DoltHub
as a place to collaborate on data.
We modeled them after Git and GitHub with the goal of bringing "Open Data" closer to Open Source Software.
This year we launched Data Bounties to foster Open Data projects and pay contributors for their work.
Each month we're hosting a $10,000 contest to source data and create unique public datasets.
Bounties are our solution to the Open Data problem, but we were curious what other people are doing.
This is our roundup of where Open Data is today and what's coming next:
Data catalogs represent the state of Open Data today.
These are "Open Data" platforms in the most literal sense of the term: the data is free and available to use.
Whether private or public, they're the most common source of datasets for most use cases.
Data catalogs aggregate datasets and provide some level of search indexing and categorization.
Data is distributed as static files, and these platforms generally don't support data that changes over time.
Kaggle is likely the best known of this category. It's found a niche hosting machine-learning competitions and has
some support for sharing notebooks and data analysis code. Data is commonly shared as CSV, Excel, and sometimes
SQLite databases. Kaggle has interesting features for learning and developing data science work, but the datasets
themselves aren't central. Most datasets are posted once and not updated.
Data.world is a more curated catalog. Their focus is enterprise customers looking for research datasets. Their
site is somewhat reminiscent of GitHub in that it has support for discussion and documentation around a dataset,
but it lacks the features to collaborate on the data itself. Data is distributed as CSV and Excel files, and updates
to datasets are published as separate copies. This example
from the Associated Press shows how data is maintained over time.
Government data portals are similar to data catalogs, except you already paid for them!
Data on these sites is maintained by the departments responsible for collecting it and is updated regularly.
The tradeoff is that government data is non-commercial, and navigating the user interfaces can be tedious.
Also, the federated nature of the US government means that data collection and hosting are split among federal, state,
county and city governments. The US federal government's open data site data.gov maintains a
list of state and local government data portals.
When it comes to data resources, government data represents most of the high-quality, maintained public data.
Most other data sources are either private or not well maintained. However, government agencies only maintain
datasets that they collect; they don't take requests. As we'll see with Open Elections, even some datasets that
you would expect to be distributed by a government entity are not available.
So far we've seen a lot of data publishers, but nothing in the way of data collaboration communities.
Launching and running an open data community is hard.
Open Street Map and Open Elections
have succeeded through a combination of overwhelming demand and a lot of dedicated resources.
Replicating their processes isn't scalable, but they have paved the way for future projects and helped to show
what's possible and what's challenging.
From openstreetmap.org: "OpenStreetMap is a free, editable map of the whole world that is being built by volunteers
largely from scratch." In many ways, OpenStreetMap is a response to the gated wealth of data owned by Google Maps.
The OSM project provides a free and open alternative that approaches the capabilities of its private counterpart.
However, managing the volume and complexity of this data and its change management processes is a major challenge.
Dozens of bespoke software tools have been written to manage the project,
and it likely wouldn't be possible without support from tech giants like Snap Inc., Apple, and Amazon.
Open Elections was created to aggregate and standardize US election data. As was mentioned before, government data
reporting is divided among political jurisdictions. Even federal election data is not collected at the national level.
Creating detailed precinct-level datasets of presidential election results is extremely labor-intensive, in some
cases requiring manual data-entry of paper documents, and takes months to complete.
Data Collaboration Services
There are a few entrants into the world of Open Data who are trying to make this process scalable and cost-efficient.
Information Evolution provides "managed crowdsourcing", which is essentially data collaboration as a service.
They research data collection problems and then manage projects on Amazon Mechanical Turk
to outsource data collection. This is an innovative approach to collecting novel datasets, but
isn't exactly "Open Data", as the data isn't publicly available.
Qri.io is building a data collaboration tool and hosting platform.
Their tool is built on top of IPFS and uses
content-addressing to version dataset files.
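Content-addressing means a dataset version is identified by a hash of its bytes, so the same content always resolves to the same address and any change produces a new one. Here's a minimal Python sketch of the idea (an illustration of the general technique, not Qri's or IPFS's actual implementation):

```python
import hashlib

# A toy content-addressed store: keys are SHA-256 digests of the stored bytes.
store = {}

def put(data: bytes) -> str:
    """Store data under its content hash and return the address."""
    address = hashlib.sha256(data).hexdigest()
    store[address] = data
    return address

def get(address: str) -> bytes:
    """Retrieve data by its content address."""
    return store[address]

# Two versions of a dataset get distinct addresses...
v1 = put(b"city,project\nNYC,Bridge Repair\n")
v2 = put(b"city,project,lat\nNYC,Bridge Repair,40.7\n")
assert v1 != v2

# ...while re-adding identical content deduplicates to the same address.
assert put(b"city,project\nNYC,Bridge Repair\n") == v1
```

Because addresses are derived from content rather than location, two collaborators can independently verify they hold the same dataset version just by comparing hashes, which is what makes this model syncable.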
They are currently hosting a data-collection project to map NYC Capital Projects.
As of publishing time, the project has found latitude and longitude coordinates for 2,058 out of 5,210 capital
projects in the dataset. Using syncable data versioning technology as the center of a data collaboration project
is a promising direction for Open Data.
Dolt and DoltHub are our venture into Open Data, and it's gaining momentum. Most recently, DoltHub contributors
assembled a dataset of 72.7M hospital prices. This
dataset can be used to compare prices between hospitals for common, standardized healthcare procedures. Using Dolt's
branch-and-merge collaboration model, contributors created a syncable, versioned SQL database. You can
query it right now on DoltHub.com.
Every contribution was reviewed through DoltHub's Pull Request process and approved by the bounty manager.
The power of Dolt is in structuring datasets as SQL databases, allowing data integrity to be enforced at the schema level.
Foreign keys, column types, and constraints allow users to assert facts about their dataset. The dataset is much
more powerful when you can use the full SQL query language to inspect and analyze it!
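To illustrate what schema-level enforcement buys you, here's a small sketch using Python's built-in sqlite3 module with a hypothetical hospital-prices schema (the table and column names are invented for this example, not the actual bounty schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite requires opting in to foreign key enforcement per connection.
conn.execute("PRAGMA foreign_keys = ON")

# Column types, a foreign key, and a CHECK constraint assert facts about the data.
conn.executescript("""
    CREATE TABLE hospitals (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE prices (
        hospital_id INTEGER NOT NULL REFERENCES hospitals(id),
        code        TEXT NOT NULL,
        price       REAL NOT NULL CHECK (price >= 0)
    );
""")

conn.execute("INSERT INTO hospitals VALUES (1, 'General Hospital')")
conn.execute("INSERT INTO prices VALUES (1, 'CPT-99213', 125.0)")  # valid row

# A row referencing a nonexistent hospital is rejected by the schema itself,
# before it can ever corrupt the dataset.
try:
    conn.execute("INSERT INTO prices VALUES (99, 'CPT-99213', 125.0)")
except sqlite3.IntegrityError:
    print("rejected: foreign key violation")

# Likewise, a negative price fails the CHECK constraint.
try:
    conn.execute("INSERT INTO prices VALUES (1, 'CPT-99213', -5.0)")
except sqlite3.IntegrityError:
    print("rejected: check constraint violation")
```

The point is that a reviewer merging a contribution doesn't have to manually verify these invariants; the database refuses rows that break them, which is exactly the kind of structural guarantee a CSV file can't provide.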
New technology is needed to bring the promise of Open Data to fruition. We believe that Dolt is exactly the tool for the job.
Data Bounties are in the early days, but their initial success shows great potential to revolutionize data collaboration.
We continue to iterate on Dolt and DoltHub by adding new features, improving SQL support and accelerating performance.
We would love to hear your thoughts on the future of Open Data.
If you have an idea for a data project, join us on Discord and let us know.
And of course, if you're interested in making some money, check out our current logo bounty.