Introducing Forks

September 18, 2020

Today, DoltHub released forks. It is the same system that GitHub uses for collaboration on over 100 million repositories contributed to by its 40+ million users. For the first time, there is a general platform for data collaboration, and we hope it moves open data to the next level.

Why Data Collaboration Matters

When we started this company in August 2018, one thing that excited us was expanding the types of businesses that you could start and succeed at. In many spaces today, it is difficult to come in and compete with the large players simply because they have huge amounts of data that isn't publicly available.

As an example, Google launched "Google Maps" in 2005 and has invested heavily in that space since then. It offered its API for free until 2012, when it began charging. If you want to come in and compete with Google Maps, you will need data that is at least as good as Google's. You can pay Google for access to their data, but at that point you are paying to make their data better, and the gap between the data that you have and the data that they have widens. You could spend the money to acquire that data yourself, but the cost of acquiring it is beyond the budget of any startup. So the only way to compete is with the help of others. OpenStreetMap is an open data project with hundreds of contributors that is used and contributed to by companies such as Apple, Facebook, Foursquare, Mapbox, MapQuest, Tesla, Wikipedia and Snapchat.

Projects such as Wikipedia, other Wikimedia projects, and OpenStreetMap have shown the power of community collaboration on data. However, there has never been a general platform that made collaborating on data feasible.

Problems with Data Collaboration

As obvious as the benefits of data collaboration are, there are very few successful collaborative data projects. The ones that have been successful created platforms for getting data mainlined using specialized processes for editing, merging, and handling conflicts that are specific to their data.

Though it is not a data product, I'll also look at how Git/GitHub approached these problems to become the largest collaborative coding platform in the world, and how Dolt/DoltHub extend that approach to provide a general-purpose collaborative data platform.

Merging and Conflicts

Any time you have multiple editors working on something together, merging and conflicts are a problem. Whether it's people collaboratively editing a document online, working on source code managed by a version control system, or editing data in a database, there is always the potential for two or more users to modify the same data.

Different systems employ different strategies for dealing with this. A simple solution is to let the last write win. Some systems force manual merges, while others have complex domain-specific rules for completely automated merges. Git and Dolt attempt to automatically merge multiple edits into one, and force manual resolution when an item cannot be merged without conflict. Dolt takes it a step further by allowing you to analyze the differences and conflicts via SQL, and then lets you write SQL to resolve them.
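To make the automatic-merge idea concrete, here is a minimal sketch of a cell-by-cell three-way merge in Python. It is an illustration only, not Dolt's actual implementation: the function name `three_way_merge` and the dict-of-rows table representation are assumptions made for this example. Each edited copy is compared against the common ancestor; a cell changed on only one side is taken automatically, and a cell changed to different values on both sides is reported as a conflict.

```python
def three_way_merge(base, ours, theirs):
    """Merge two edited copies of a table against their common ancestor.

    Each table is a dict mapping a primary key to a dict of column values
    (a hypothetical representation chosen for this sketch).
    Returns (merged, conflicts). A conflict is a (key, column) pair, or
    (key, None) when one side deleted a row that the other side modified.
    """
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:
            # Both sides agree, including the case where both deleted the row.
            if o is not None:
                merged[key] = dict(o)
        elif o == b:
            # Only "theirs" touched this row: take their version (or deletion).
            if t is not None:
                merged[key] = dict(t)
        elif t == b:
            # Only "ours" touched this row: take our version (or deletion).
            if o is not None:
                merged[key] = dict(o)
        elif o is None or t is None:
            # One side deleted the row while the other modified it.
            conflicts.append((key, None))
            merged[key] = dict(b)  # keep the ancestor row until resolved
        else:
            # Both sides changed the row: merge column by column.
            b_row = b or {}
            row = {}
            for col in sorted(set(o) | set(t)):
                bv, ov, tv = b_row.get(col), o.get(col), t.get(col)
                if ov == tv or tv == bv:
                    row[col] = ov
                elif ov == bv:
                    row[col] = tv
                else:
                    conflicts.append((key, col))
                    row[col] = bv  # keep the ancestor value until resolved
            merged[key] = row
    return merged, conflicts
```

In this sketch, two users who edit different columns of the same row are merged without intervention, while two users who change the same cell to different values get a conflict entry that someone has to resolve by hand.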

Data Quality and Trust

Any time you are working on a project that is open to the world, you will have to deal with bad actors. Some have bad intentions, others are just having a laugh, and others may be adding incorrect data unintentionally.

Different moderation strategies can be employed, each with its own strengths and weaknesses. Automated moderation systems can detect some types of data errors quickly, but they can take a lot of work to train and tune before they reliably catch erroneous changes. User-based moderation systems give control to community members and are easy and cheap to deploy, but their success varies widely with the abilities of the moderators.

GitHub and DoltHub organize their projects into repositories, and grant users different privileges. Users may be given write access to a project by one of its owners. These users are trusted by the project to maintain data quality and may make changes to the data directly. In GitHub, untrusted users may fork the data and submit changes back to the main dataset via a "Pull Request". As of today, you can do that on DoltHub too (details below).

Community Disagreement and Ownership

Even when you have a good moderation system, datasets evolve, and disagreements can arise. As an example, in 2007 OpenStreetMap had an "edit war" over the language that should be used for locations in Turkish-controlled Northern Cyprus. Wikipedia keeps a page dedicated to the lamest edit wars seen on its platform. Other disputes may be as simple as disagreements over schema or formatting.

GitHub, and now DoltHub, handle this with forks. If you do not like the direction that a project is going, you can always fork the project and take it in your own direction, while continuing to integrate changes from the project that you forked from. Additionally, you can still send PRs to get your changes pushed back onto the project you forked from. You can continue to collaborate with the entire community, even after you have taken your version of the project in another direction. One major example of a successful fork is MariaDB. In 2009, MySQL was forked after a couple of acquisitions left concerns about MySQL as an open source project. Today MariaDB is a thriving project with a robust community.

Introducing DoltHub Forks and Cross Fork Pull Requests

Today DoltHub is launching forks, and it is a leap forward for collaborative data projects. This is the first solution for open data collaboration that addresses all of these problems in a generalized way.

What is a Fork

A fork is a copy of the data that you own. You control who can modify your data, and those users determine what data gets merged. You can continue to pull changes from the repository that you forked from, and you can submit pull requests (PRs) back to it. You can use a fork as a tool to get your changes onto a repository, or you can use it to take that repository in a different direction.

What is a Pull Request

A pull request, or PR, is a request sent to the contributors of a repository to merge your changes into their repository. It encapsulates all the changes made on the source branch of your repository since its first common ancestor with the destination branch of the repository you are submitting to. Owners of the pull request's destination repository can then review and integrate these changes into their repository.
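The "first common ancestor" idea can be sketched in a few lines of Python. This is a simplified illustration and not how Git or Dolt compute it internally: the commit graph is modeled as a plain dict from a commit id to its parent ids, and the names `ancestors`, `common_ancestor`, and `pull_request_commits` are hypothetical helpers invented for the example. Histories with criss-cross merges need more careful rules than this breadth-first search.

```python
from collections import deque


def ancestors(parents, commit):
    """Return the set containing `commit` and every commit reachable from it."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def common_ancestor(parents, source, destination):
    """Walk back from `source` breadth-first until a commit that is also
    part of the destination's history is found."""
    reachable = ancestors(parents, destination)
    queue, visited = deque([source]), {source}
    while queue:
        current = queue.popleft()
        if current in reachable:
            return current
        for parent in parents.get(current, []):
            if parent not in visited:
                visited.add(parent)
                queue.append(parent)
    return None  # the two histories share nothing


def pull_request_commits(parents, source, destination):
    """Commits on the source branch made after the common ancestor."""
    base = common_ancestor(parents, source, destination)
    already_shared = ancestors(parents, base) if base is not None else set()
    return ancestors(parents, source) - already_shared
```

For example, with the history `{"a": [], "b": ["a"], "c": ["b"], "d": ["b"], "e": ["d"]}`, where `c` is the tip of the destination branch and `e` is the tip of the fork's branch, the common ancestor is `b` and the pull request would contain commits `d` and `e`.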

A Living Example

At the end of July I wrote an article about Open Resumes, where I talked about the motivations for scraping LinkedIn and the desire for an Open Resumes dataset. With the arrival of forks, I invite you to fork the dataset and send us a pull request containing your scraped LinkedIn resume. More than anything, our goal here is to show off Dolt/DoltHub as a data collaboration platform.

Wrapping up

With today's release we feel we are a step closer to being the platform that we envisioned in 2018. We have built the most important features of a collaborative data platform. We will continue to develop features to this end which will improve the experience, but the next step is to get people to start collaborating on data on the platform. We are getting ready to put our money where our mouth is. Stay tuned for some announcements that could make you real money collaborating on some of our datasets.
