Data Bounties and Open Data

"Open Data" is quickly attaining the hype and ambiguity of previous tech crazes like "Big Data" and "Block Chain". The motivation behind Open Data is easy to understand: data is one of the most valuable but closely guarded resources in the tech industry. Democratizing these resources has the potential to catalyze new industries and innovation. The ultimate goal of open data would be to replicate the magic of open source software: the ability to create online communities dedicated to producing trusted, in-demand data resources.

Open Data has a lot of growing to do to reach the maturity of the open source software ecosystem. Open source projects are trusted because of the number of contributors building, fixing, and maintaining them. Git and GitHub marked a major advancement in OSS by enabling distributed collaboration and advancing the tools and review processes that power it. Automated tests, linting, and human code review create structure and certainty for OSS projects. These tools don't exist yet for Open Data.

We created Dolt to solve these problems, and built DoltHub as a place to collaborate on data. We modeled them after Git and GitHub with the goal of bringing "Open Data" closer to Open Source Software. This year we launched Data Bounties to foster Open Data projects and pay contributors for their work. Each month we're hosting a $10,000 contest to source data and create unique public datasets. Bounties are our solution to the Open Data problem, but we were curious what other people are doing. This is our roundup of where Open Data is today and what's coming next:

Data Catalogs

Data catalogs represent the state of Open Data today. These are "Open Data" platforms in the most literal sense of the term: the data is free and available to use. Whether private or public, they're the most common source of datasets for most use cases. Data catalogs aggregate datasets and provide some level of search indexing and categorization. Data is distributed as static files, and the catalogs generally don't support data that changes over time.

  • Kaggle

    Kaggle is likely the best known of this category. It's found a niche hosting machine-learning competitions and has some support for sharing notebooks and data analysis code. Data is commonly shared as CSV, Excel, and sometimes SQLite databases. Kaggle has interesting features for learning and developing data science work, but the datasets themselves aren't central. Most datasets are posted once and not updated.

  • Data.world

    Data.world is a more curated catalog. Their focus is enterprise customers looking for research datasets. Their site is somewhat reminiscent of GitHub in that it has support for discussion and documentation around a dataset, but it lacks the features to collaborate on the data itself. Data is distributed as CSV and Excel files, and updates to datasets are published as separate copies. This example from the Associated Press shows how data is maintained over time.

  • Government Data

    Government data portals are similar to data catalogs, but you paid for it! Data on these sites is maintained by the departments responsible for collecting it and is updated regularly. The tradeoff is that government data is non-commercial and navigating the user interfaces can be tedious. Also, the federated nature of the US government means that data collection and hosting are split among federal, state, county, and city governments. The US federal government's open data site data.gov maintains a list of state and local government data portals. When it comes to data resources, government data represents most of the high-quality, maintained public data. Most other data sources are either private or not well maintained. However, government agencies only maintain the datasets that they collect; they don't take requests. As we'll see with Open Elections, even some datasets that you would expect to be distributed by a government entity are not available.

Mature Communities

So far we've seen a lot of data publishers, but nothing in the way of data collaboration communities. Launching and running an open data community is hard. OpenStreetMap and Open Elections have succeeded through a combination of overwhelming demand and a lot of dedicated resources. Replicating their processes isn't scalable, but they have paved the way for future projects and helped to show what's possible and what's challenging.

  • OpenStreetMap

    From openstreetmap.org: "OpenStreetMap is a free, editable map of the whole world that is being built by volunteers largely from scratch." In many ways, OpenStreetMap is a response to the gated wealth of data owned by Google Maps. The OSM project provides a free and open alternative that approaches the capabilities of its private counterpart. However, managing the volume and complexity of this data and its change management processes is a major challenge. Dozens of bespoke software tools have been written to manage the project, and it likely wouldn't be possible without support from tech giants like Snap Inc, Apple, and Amazon.

  • Open Elections

    Open Elections was created to aggregate and standardize US election data. As was mentioned before, government data reporting is divided among political jurisdictions. Even federal election data is not collected at the national level. Creating detailed precinct-level datasets of presidential election results is extremely labor-intensive, in some cases requiring manual data-entry of paper documents, and takes months to complete.

Data Collaboration Services

There are a few entrants into the world of Open Data who are trying to make this process scalable and cost-efficient.

  • Information Evolution

    Information Evolution provides "managed crowdsourcing", which is essentially data collaboration as a service. They research data collection problems and then manage projects on Amazon Mechanical Turk to outsource the data collection. This is an innovative approach to collecting novel datasets, but it isn't exactly "Open Data", as the data isn't publicly available.

  • Qri.io

    Qri.io is building a data collaboration tool and hosting platform. Their tool is built on top of IPFS and uses content-addressing to version dataset files. They are currently hosting a data-collection project to map NYC Capital Projects. As of publishing time, the project has found latitude and longitude coordinates for 2058 out of 5210 capital projects in the dataset. Using a syncable data versioning technology as the center of a data collaboration project is a promising direction for Open Data.

  • DoltHub Bounties

    This is our venture into Open Data, and it's gaining momentum. Most recently, DoltHub contributors assembled a dataset of 72.7M hospital prices. This dataset can be used to compare prices between hospitals for common, standardized healthcare procedures. Using Dolt's branch-and-merge collaboration model, contributors created a syncable, versioned SQL database. You can query it right now on DoltHub.com. Every contribution was reviewed through DoltHub's Pull Request process and approved by the bounty manager. The power of Dolt is structuring datasets as SQL databases, allowing data integrity to be enforced at the schema level. Foreign Keys, Column Types and Constraints allow users to assert facts about their dataset. The dataset is much more powerful when you can use the full SQL query language to inspect and analyze it!
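
To make the idea of enforcing data integrity at the schema level concrete, here is a minimal sketch. It is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for a SQL database rather than Dolt itself, and the hospitals and prices tables, their columns, and the sample rows are hypothetical, not the schema of the hospital price dataset. SQLite is also more lenient about column types than most SQL databases, so the sketch focuses on foreign keys, NOT NULL, CHECK, and primary key constraints.

```python
import sqlite3

# Hypothetical schema for illustration only. It is NOT the schema of the
# DoltHub hospital price dataset, and SQLite stands in for a SQL database.
SCHEMA = """
CREATE TABLE hospitals (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE prices (
    hospital_id INTEGER NOT NULL REFERENCES hospitals(id),
    code        TEXT    NOT NULL,
    price       REAL    NOT NULL CHECK (price >= 0),
    PRIMARY KEY (hospital_id, code)
);
"""


def open_db():
    """Create an in-memory database with the example schema."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def try_insert(conn, sql, params):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


conn = open_db()
INSERT_HOSPITAL = "INSERT INTO hospitals (id, name) VALUES (?, ?)"
INSERT_PRICE = "INSERT INTO prices (hospital_id, code, price) VALUES (?, ?, ?)"

results = {
    "hospital": try_insert(conn, INSERT_HOSPITAL, (1, "Example General")),
    "valid_price": try_insert(conn, INSERT_PRICE, (1, "A100", 250.0)),
    # Foreign key: hospital 99 does not exist.
    "unknown_hospital": try_insert(conn, INSERT_PRICE, (99, "A100", 250.0)),
    # CHECK constraint: prices cannot be negative.
    "negative_price": try_insert(conn, INSERT_PRICE, (1, "B200", -5.0)),
    # Primary key: one price per (hospital, code) pair.
    "duplicate_key": try_insert(conn, INSERT_PRICE, (1, "A100", 300.0)),
}
print(results)
```

In this sketch only the first two inserts are accepted. The other three rows are rejected by the database itself, so a contribution that breaks one of these asserted facts fails before a human reviewer ever has to catch it.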

Conclusion

New technology is needed to bring the promise of Open Data to fruition. We believe that Dolt is exactly the tool for the job.
Data Bounties are in the early days, but their initial success shows great potential to revolutionize data collaboration. We continue to iterate on Dolt and DoltHub by adding new features, improving SQL support and accelerating performance. We would love to hear your thoughts on the future of Open Data. If you have an idea for a data project, join us on Discord and let us know. And of course, if you're interested in making some money, check out our current logo bounty.
