Today, DoltHub released forks. Forks are the same mechanism that GitHub uses for collaboration on over
100 million repositories contributed to by more than 40 million users. For the first time there is a general platform
for data collaboration, and we hope it moves open data to the next level.
Why Data Collaboration Matters
When we started this company in August 2018, something that excited us was expanding the types of businesses that you
could start and succeed at. In many spaces today, it is difficult to come in and compete with the large players
simply because they have huge amounts of data that isn't publicly available.
As an example, Google launched "Google Maps" in 2005 and has invested heavily in that space since then. It offered
a free API until 2012, when it began charging. If you want to come in and compete with Google Maps, you will need data
that is at least as good as Google's. You can pay Google for access to their data, but at that point
you are paying to make their data better, and the gap between the data that you have and the data that they have widens. You
could spend the money to acquire that data yourself, but the cost of acquiring it is beyond the budget of any startup.
So the only way to compete is with the help of others. Open Street Maps
is an open data project with hundreds of contributors which is being used and contributed to by companies such as Apple, Facebook,
Foursquare, Mapbox, MapQuest, Tesla, Wikipedia and Snapchat.
Projects such as Wikipedia, other Wikimedia projects, and
Open Street Maps have shown the power of community collaboration
on data. However, there hasn't ever been a platform that made collaborating on data feasible.
Problems with Data Collaboration
As obvious as the benefits of data collaboration are, there are very few successful collaborative data projects. The ones
that have been successful created platforms for getting data mainlined using specialized processes for
editing, merging, and handling conflicts that are specific to their data.
Though Git and GitHub are not data products, I'll also be looking at how they approached these problems in order to
become the largest collaborative coding platform in the world, and how Dolt and DoltHub extend that approach to provide
a general purpose collaborative data platform.
Merging and Conflicts
Any time you have multiple editors working on something together, merging and conflicts are a problem. Whether it's
people collaboratively editing a document online, working on source code managed by some version control system, or
editing data in a database there is always the potential for two or more users to be modifying the same data.
There are different strategies for dealing with this employed by different systems. A simple solution is to just allow
the last write to win. Some systems might force manual merges, while others may have complex domain-specific rules for
completely automated merges. Git and Dolt attempt to automatically merge multiple edits into one, and force manual
resolution when an item cannot be merged without conflict. Dolt takes it a step further by allowing you to analyze
the differences and conflicts via SQL, and then lets you write SQL to resolve them.
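To make the difference between these strategies concrete, here is a minimal sketch of a cell-wise three-way merge for a single row, next to last write wins. It is an illustration only, not Dolt's or Git's actual merge algorithm. The function names and the sample row are hypothetical, and it assumes all three versions of the row have the same columns.

```python
from typing import Dict, Set, Tuple

Row = Dict[str, object]


def last_write_wins(base: Row, ours: Row, theirs: Row) -> Row:
    """Simplest strategy: the later write (theirs) replaces everything."""
    return dict(theirs)


def three_way_merge(base: Row, ours: Row, theirs: Row) -> Tuple[Row, Set[str]]:
    """Merge two edited copies of a row against their common ancestor.

    Returns the merged row and the set of columns that need manual resolution.
    Assumes base, ours, and theirs all contain the same columns.
    """
    merged: Row = {}
    conflicts: Set[str] = set()
    for col in base:
        b, o, t = base[col], ours[col], theirs[col]
        if o == t:
            merged[col] = o      # both sides agree, or neither changed it
        elif o == b:
            merged[col] = t      # only their side changed this cell
        elif t == b:
            merged[col] = o      # only our side changed this cell
        else:
            merged[col] = b      # both changed it differently: keep base, flag it
            conflicts.add(col)
    return merged, conflicts


# Hypothetical sample row edited by two people at once.
base = {"id": 1, "name": "Cafe Uno", "city": "Oslo"}
ours = {"id": 1, "name": "Café Uno", "city": "Oslo"}
theirs = {"id": 1, "name": "Cafe Uno", "city": "Bergen"}

print(last_write_wins(base, ours, theirs))   # our name fix is silently lost
print(three_way_merge(base, ours, theirs))   # both edits survive, no conflicts
```

With last write wins, one editor's change disappears without anyone noticing. The three-way merge keeps both edits when they touch different cells, and only asks for manual resolution when both sides changed the same cell to different values.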
Data Quality and Trust
Any time you are working on a project that is open to the world, you will have to deal with bad actors. Some have
bad intentions, others are just having a laugh, and others may be adding incorrect data.
Different moderation strategies can be employed, each with its own strengths and weaknesses. Automated moderation systems
can detect some types of data errors quickly, but they can take a lot of work to train and tune in order to have a
good hit rate for erroneous changes. User-based moderation systems give control to community members, and they are easy
and cheap to deploy, but their success varies widely with the abilities of the moderators.
GitHub and DoltHub organize their projects into repositories, and grant users different privileges. Users
may be given write access to the project by one of its owners. These users are trusted by the project to maintain
data quality and may make changes to the data directly. In GitHub, untrusted users may fork the data and submit changes
back to the main dataset via a "Pull Request". As of today, you can do that on DoltHub too (details below).
Community Disagreement and Ownership
Even when you have a good moderation system, datasets evolve, and disagreements can arise. As an example, in 2007 Open
Street Maps had an "edit war"
over the language that should be used for locations in Turkish-controlled Northern Cyprus. Wikipedia keeps a page
dedicated to the lamest edit wars seen on their platform. Other
disputes can be as simple as disagreements over schema or formatting.
GitHub, and now DoltHub, handle this with forks. In the event that you do not like the direction that a project is going
you can always fork the project, and take it in your own direction, and you can still continue to integrate changes
from the project that you forked from. Additionally, you can still send PRs to get your changes pushed back onto the
project you forked from. You can continue to collaborate with the entire community, even after you have taken your
version of the project in another direction. One major example of a successful fork is MariaDB. In 2009 MySQL was forked
after a couple of acquisitions left concerns about MySQL as an open source project. Today MariaDB is a thriving
project, with a robust community.
Introducing DoltHub Forks and Cross-Fork Pull Requests
Today DoltHub is launching forks, and it is a leap forward for collaborative data projects. This
is the first solution for open data collaboration which addresses all these problems in a generalized way.
What is a Fork
A fork is a copy of the data which you become the owner of. You control who can modify your data, and those users determine
what data gets merged. You can continue to pull changes from the repository that you forked from, and you can submit
pull requests (PRs) back to it. You can use it as a tool to get your changes onto a repository, or you can use it to
take that repository in a different direction.
What is a Pull Request
A pull request, or PR, is a request sent to the maintainers of a repository to merge your changes into their repository. It
encapsulates all the changes made on the source branch of your repository since its most recent common ancestor with
the destination branch of the repository you are submitting to. Owners of the pull request's destination repository can
then review and integrate these changes into their repository.
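One way to picture which changes a pull request carries is as the set of commits reachable from the source branch but not from the destination branch. The sketch below illustrates that idea on a toy commit graph. It is not DoltHub's implementation, and the graph, commit names, and function names are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical commit graph: commit id -> list of parent commit ids.
Graph = Dict[str, List[str]]


def ancestors(graph: Graph, commit: str) -> Set[str]:
    """Return the commit itself plus every commit reachable through parents."""
    seen: Set[str] = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen


def pull_request_commits(graph: Graph, source: str, destination: str) -> Set[str]:
    """Commits on the source branch that the destination does not have yet."""
    return ancestors(graph, source) - ancestors(graph, destination)


# The fork diverged from the upstream repository at commit "b".
#   a <- b <- c        (destination branch head: c)
#         \
#          d <- e      (source branch head in the fork: e)
graph = {"a": [], "b": ["a"], "c": ["b"], "d": ["b"], "e": ["d"]}

print(sorted(pull_request_commits(graph, source="e", destination="c")))
```

Everything up to the common ancestor `b` is already in the destination, so only `d` and `e` end up in the request for review.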
A Living Example
At the end of July I wrote an article about Open Resumes,
where I talked about the motivations for scraping LinkedIn, and the desire for an
Open Resumes dataset. With the arrival of forks I invite
you to fork the dataset, and send us a pull request containing your scraped LinkedIn resume. More than anything, our goal
here is to show off Dolt/DoltHub as a data collaboration platform.
With today's release we feel we are a step closer to being the platform that we envisioned in 2018. We have built the most
important features of a collaborative data platform. We will continue to develop features to this end which will improve
the experience, but the next step is to get people to start collaborating on data on the platform. We are getting ready
to put our money where our mouth is. Stay tuned for some announcements that could make you real money collaborating on
some of our datasets.