Bounty Attribution


On Monday we launched Bounties, a product that pays users to gather and clean data. In less than a week, our first data bounty has already shown the power of Dolt as a collaborative data platform: it has received 22 pull requests, with over 6.6M cells edited. We have all 50 states from the 2016 presidential election and, so far, 7 from 2020.

In launching the product, we wanted to build trust that our users are being fairly rewarded for their contributions, and to encourage feedback on and improvements to the way we run our bounties. To that end, we open sourced this code, and today I'll be walking you through the payment model used for our first bounty.

Goals and Design

In designing our bounty product we wanted to encourage collaboration and incentivize users to help us get the best dataset possible. We wanted to reward contributors in proportion to the value their contributions added. With these goals in mind, we developed our first bounty to pay each user based on the percentage of cell edits in the final dataset that are attributed to that user's PRs. A cell edit is attributed to the first merged commit that contains the value that makes it into the final dataset.
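
As a rough sketch of that payout rule, assuming each user's share is simply their attributed edit count divided by the total, with amounts rounded to the cent. The function name `payouts` and the rounding are assumptions for illustration, not taken from the open-sourced code:

```python
def payouts(edit_counts, prize):
    """Split `prize` in proportion to each user's attributed cell edits.

    Hypothetical helper: returns {user: (percentage, amount)}.
    """
    total = sum(edit_counts.values())
    if total == 0:
        # No attributed edits yet, so nobody is owed anything.
        return {user: (0.0, 0.0) for user in edit_counts}
    return {
        user: (round(100 * n / total, 2), round(prize * n / total, 2))
        for user, n in edit_counts.items()
    }


# Attributed edit counts from the end of the worked example later in this post.
final_counts = {"User 1": 15, "User 2": 3, "User 3": 11}
for user, (pct, amount) in payouts(final_counts, 1000).items():
    print(f"{user}: {pct:.2f}% ${amount:,.2f}")
```

Because each share is rounded on its own, the amounts may not always sum to exactly the prize; how the real scoreboard handles a leftover cent is not covered here.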

A nice quality of this model is that it also lets the community help us review and moderate the contests. If a reviewer misses a bunch of bad edits in review, another user can submit a PR reverting the bad PR, or any portion of it. When that PR gets merged, the reverted edits are re-attributed to their original commits, so the contributors who provided those values originally get their credit back.

One problem with this type of bounty is that an edit to a primary key column looks to the system like one row being deleted and a new row being added. In this case the user making the edit gets credit for every cell in the row, and the editors who previously had those cells attributed to them lose credit for their work.
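
To see why, here is a small illustration that matches rows by primary key, as a keyed diff would. The table contents and variable names are made up for the example:

```python
# Hypothetical table state keyed by primary key; values are the non-key columns.
before = {1: {"Col1": "a", "Col2": "b", "Col3": "c"}}
# Someone corrects only the key (1 -> 9); every other value is unchanged.
after = {9: {"Col1": "a", "Col2": "b", "Col3": "c"}}

deleted_rows = before.keys() - after.keys()
added_rows = after.keys() - before.keys()
# Only rows present on both sides can be compared cell by cell.
compared_rows = before.keys() & after.keys()

print(deleted_rows, added_rows, compared_rows)

# Each added row counts as one key cell plus its non-key cells.
credited_cells = sum(1 + len(after[key]) for key in added_rows)
print(credited_cells)
```

Nothing links row 1 to row 9, so the single key edit is credited as 4 new cells, and whoever was credited with row 1 loses that credit.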

Despite its problems, this is the way our first data bounty is run. We are actively discussing ways to handle the shortcomings of this system; in the meantime, we are reviewing PRs that show numerous row deletes more closely.

Calculating Percentage of Cell Edits Attributed to PRs

The overall approach here is pretty straightforward. We walk from the state of the dataset at the start of the bounty to the latest commit, tracking every changed cell and the commit the change is attributed to. If a cell changes again, we update the attribution and save the previous value and its attributed commit to a history for that cell. If the cell changes yet again, we check the history to find the first commit that contained the new value, and that commit gets the attribution.
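
The walk described above can be sketched roughly as follows. This is a minimal illustration, not the open-sourced implementation: the class, its method names, and the representation of a table state as a dict of `(row key, column)` to value are all assumptions, and commits are labeled by user name for brevity.

```python
from collections import Counter

ABSENT = object()  # marks a cell that does not exist in a given state


class AttributionTracker:
    """Hypothetical tracker; names and structure are illustrative only."""

    def __init__(self, start_state):
        # States map (row key, column name) -> value.
        self.start = dict(start_state)
        self.current = dict(start_state)
        self.owner = {}    # cell -> commit currently credited with it
        self.history = {}  # cell -> {value: first commit that contained it}

    def apply(self, commit, new_state):
        for cell in self.current.keys() | new_state.keys():
            old = self.current.get(cell, ABSENT)
            new = new_state.get(cell, ABSENT)
            if old == new:
                continue
            if new is ABSENT or new == self.start.get(cell, ABSENT):
                # Deleted, or back to its starting value: nobody is credited.
                self.owner.pop(cell, None)
                continue
            # Credit the first commit that ever held this value in this cell.
            first_seen = self.history.setdefault(cell, {})
            self.owner[cell] = first_seen.setdefault(new, commit)
        self.current = dict(new_state)

    def counts(self):
        return Counter(self.owner.values())


def table(rows):
    cols = ("Key", "Col1", "Col2", "Col3")
    return {(row[0], col): value for row in rows for col, value in zip(cols, row)}


# Replay the three merged states from the example in the next section.
tracker = AttributionTracker(table([(0, 0, 0, 0)]))
tracker.apply("User 1", table([(0, 0, 1, 1)] + [(k, 1, 1, 1) for k in range(1, 6)]))
tracker.apply("User 2", table([(0, 0, 2, 1)] + [(k, 1, 2, 1) for k in range(1, 6)] + [(6, 2, 2, 2)]))
tracker.apply("User 3", table([(0, 0, 0, 3)] + [(k, 1, 1, 3) for k in range(1, 6)] + [(6, 2, 2, 3), (7, 3, 3, 3)]))
print(tracker.counts())
```

Under these assumptions the replay ends with 15, 3, and 11 attributed edits for Users 1, 2, and 3, the same edit counts as the final scoreboard in the example.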

An Example

Typically, a repository where a bounty is being run would be seeded with a schema and a bit of data to show the type of data being looked for. For our example we will start with a Dolt database that has a single table and this initial state.

| Key | Col1 | Col2 | Col3 |
|  0  |  0   |  0   |  0   |

If User 1 forks the repository, modifies it, and creates a PR which gets merged with the following table state:

| Key | Col1 | Col2 | Col3 |
|  0  |  0   |  1   |  1   |
|  1  |  1   |  1   |  1   |
|  2  |  1   |  1   |  1   |
|  3  |  1   |  1   |  1   |
|  4  |  1   |  1   |  1   |
|  5  |  1   |  1   |  1   |

User 1 would be attributed with 22 cell edits: they added 20 cells (5 new rows of 4 cells each) and modified 2 existing cells. If we imagine a bounty being run here with a $1000 prize, you would expect the scoreboard on DoltHub to show:

| User   | Edit Count | Edit Percentage | Amount    |
| User 1 | 22         | 100.00%         | $1,000.00 |
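
The 22 edits credited to User 1 break down into added and modified cells. A small sketch of that count, assuming rows are matched by primary key and the key itself counts as a cell of a new row (`count_cell_edits` is a hypothetical helper, and it ignores deleted rows):

```python
def count_cell_edits(old_rows, new_rows):
    """Hypothetical helper: rows map primary key -> tuple of cell values."""
    added = modified = 0
    for key, new in new_rows.items():
        old = old_rows.get(key)
        if old is None:
            # A new row: every cell in it, including the key, is an added cell.
            added += len(new)
        else:
            # An existing row: count only the cells whose value differs.
            modified += sum(a != b for a, b in zip(old, new))
    return added, modified


old_rows = {0: (0, 0, 0, 0)}
new_rows = {0: (0, 0, 1, 1), **{k: (k, 1, 1, 1) for k in range(1, 6)}}
print(count_cell_edits(old_rows, new_rows))  # (20, 2)
```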

If User 2 has a PR merged with the following state:

| Key | Col1 | Col2 | Col3 |
|  0  |  0   |  2   |  1   |
|  1  |  1   |  2   |  1   |
|  2  |  1   |  2   |  1   |
|  3  |  1   |  2   |  1   |
|  4  |  1   |  2   |  1   |
|  5  |  1   |  2   |  1   |
|  6  |  2   |  2   |  2   |

User 2 would receive credit for adding 4 cells and modifying 6 Col2 cells, for a total of 10 attributed cell changes, and the total number of changes attributed to User 1 would decrease by 6. The updated scoreboard would show:

| User   | Edit Count | Edit Percentage | Amount  |
| User 1 | 16         | 61.54%          | $615.38 |
| User 2 | 10         | 38.46%          | $384.62 |

Finally, if User 3 then came in and had their PR merged with the following state:

| Key | Col1 | Col2 | Col3 |
|  0  |  0   |  0   |  3   |
|  1  |  1   |  1   |  3   |
|  2  |  1   |  1   |  3   |
|  3  |  1   |  1   |  3   |
|  4  |  1   |  1   |  3   |
|  5  |  1   |  1   |  3   |
|  6  |  2   |  2   |  3   |
|  7  |  3   |  3   |  3   |

They would receive credit for adding 4 cells and modifying 7 Col3 cells, 6 of which were previously attributed to User 1 and 1 (row 6) to User 2.
The 5 Col2 cells whose value changed from 2 back to 1 would have their attribution changed back from User 2 to User 1 (the original contributor of the value 1). Finally, row 0, Col2 changed back to the value it had when the bounty started, so no one is attributed with changing that cell. The final scoreboard would be:

| User   | Edit Count | Edit Percentage | Amount  |
| User 1 | 15         | 51.72%          | $517.24 |
| User 2 |  3         | 10.34%          | $103.45 |
| User 3 | 11         | 37.93%          | $379.31 |

Concluding

The features we provide in Dolt lend themselves to a one-of-a-kind data collaboration experience. Cellwise attribution is only possible because we've built a database that tracks the history of each cell.

We spent a good deal of time trying to find a good way to score bounties. We ran multiple bounties internally before launching with the current model. There are certainly problems that we are looking to address, but we have users actively gathering data and making money.


Come check it out, make some money, and give us your feedback on Discord so we can make this even better.
