Merging SHAQ Players for Fun and Profit

BOUNTY
7 min read

Dolt is a version controlled database. Dolt uses the Git-model of source control and applies it to the data and schema of a SQL database. That makes Dolt the first and only SQL database you can branch and merge. We also built DoltHub, a place on the internet to share and collaborate on Dolt databases. DoltHub adds forks, pull requests, and issues to the Dolt ecosystem.

After we built the above, we were looking for clever ways to bootstrap the data ecosystem on DoltHub. We came up with data bounties. We put up a cash prize to have data sourced and put into a Dolt database. We use pull requests to ensure data quality. We pay the bounty based on the amount of data added to the database. Bounties are fun, collaborative, and profitable for bounty hunters.

Our current bounty is called SHAQ. We're trying to assemble a database of every basketball player and team in the world. We want to join all the players across leagues in a single ID space. We've collected a lot of data so far, 153 leagues, 9360 teams, and 342742 players. Bounty submissions have slowed down as we've collected almost all of the readily available data.

That's great but we have a problem.

The Problem

Bounties incentivize data quantity over data quality. As a bounty hunter, you want to do just enough work to get your PR merged and have your cell edits counted on the scoreboard.

SHAQ Scoreboard

I have been the primary reviewer on five bounties to date and I definitely feel the tension. Here's a productive exchange on a recent PR. The PRs only tell half the story. We're also usually chatting on our Discord trying to get the data up to snuff. But there's only so much you can do as a reviewer. You're going to miss things. In this bounty, it's duplicate players.

Players need to be merged into same player.

SHAQ $ dolt sql -q "select concat(lower(first_name), ' ', lower(last_name)) as name, count(*) from players group by name having count(name) > 1 order by count(*) desc limit 10"
+-----------------+----------+
| name            | count(*) |
+-----------------+----------+
| brandon brown   | 37       |
| brandon johnson | 28       |
| jordan jones    | 27       |
| marko popovic   | 27       |
| jordan johnson  | 25       |
| chris williams  | 23       |
| chris johnson   | 23       |
| jordan davis    | 23       |
| justin robinson | 22       |
| jamar smith     | 22       |
+-----------------+----------+

How do we know if there are 37 Brandon Browns or 30 or 5? We need to do some manual work.

SHAQ $ dolt sql -q "select count(*) from (select concat(lower(first_name), ' ', lower(last_name)) as name, count(*) from players group by name having count(name) > 1 order by count(*) desc) as player_dupes"
+----------+
| count(*) |
+----------+
| 37644    |
+----------+

And we have to do manual work 37,644 times.

I'm not sure this will happen in this current bounty as you're not going to rack up many cell edits from doing this intense manual work.

The Fix

But let's say we didn't care about money and wanted to clean one up. How would we do that?

So, I manually scanned the list and found a rather famous duplicated player, current New Jersey Net, LaMarcus Aldridge.

SHAQ $ dolt sql -q "select * from players where lower(first_name)='lamarcus' and lower(last_name)='aldridge'"
+-----------+---------------+-------------------------------+------------+-----------+---------------+-----------+
| player_id | nba_player_id | date_of_birth                 | first_name | last_name | height_inches | weight_lb |
+-----------+---------------+-------------------------------+------------+-----------+---------------+-----------+
| 42        | 200746        | 1985-07-19 00:00:00 +0000 UTC | LaMarcus   | Aldridge  | 83            | 250       |
| 418005    | NULL          | 1985-07-19 00:00:00 +0000 UTC | Lamarcus   | Aldridge  | 83            | 265       |
+-----------+---------------+-------------------------------+------------+-----------+---------------+-----------+

Looks like the second one is incorrect and due to the lowercase 'm'.

So I'm going to see what is tied to that erroneous player_id 418005 and decide whether to delete the data or rekey it with player_id 42.

The good news there are no entries in the player_season_stat_totals table for player_id 418005.

SHAQ $ dolt sql -q "select * from player_season_stat_totals where player_id=418005"
+-----------+---------+-----------+---------+---------------+--------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----+----------------+-----------+
| player_id | team_id | season_id | minutes | games_started | games_played | 2pm | 2pa | 3pm | 3pa | ftm | fta | ast | stl | blk | tov | pts | orb | drb | trb | pf | season_type_id | league_id |
+-----------+---------+-----------+---------+---------------+--------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----+----------------+-----------+
+-----------+---------+-----------+---------+---------------+--------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----+----------------+-----------+

So, it's safe to delete lowercase 'm' LaMarcus Aldridge.

SHAQ $ dolt sql -q "delete from players where player_id=418005"
Query OK, 1 row affected

Unfortunately, because you won't end up with any cells in the final database, this change will not get you on the final scoreboard. Let's find a player that will.

Next I tried "Kiana Evans", and after a bit of Googling it seems there are indeed two Kiana Evans that played Women's college basketball, one at Marshall and one at Texas A&M.

Next I hit a duplicate, "Oscar Robles"!

SHAQ $ dolt sql -q "select * from players where lower(first_name)='oscar' and lower(last_name)='robles'"
+-----------+---------------+---------------+------------+-----------+---------------+-----------+
| player_id | nba_player_id | date_of_birth | first_name | last_name | height_inches | weight_lb |
+-----------+---------------+---------------+------------+-----------+---------------+-----------+
| 7400660   | NULL          | NULL          | Oscar      | Robles    | 73            | NULL      |
| 11500955  | NULL          | NULL          | Oscar      | Robles    | 73            | NULL      |
+-----------+---------------+---------------+------------+-----------+---------------+-----------+
SHAQ $ dolt sql -q "select * from player_season_stat_totals where player_id=7400660 or player_id=11500955"
+-----------+---------+-----------+---------+---------------+--------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----+----------------+-----------+
| player_id | team_id | season_id | minutes | games_started | games_played | 2pm | 2pa | 3pm | 3pa | ftm | fta | ast | stl | blk | tov | pts | orb | drb | trb | pf | season_type_id | league_id |
+-----------+---------+-----------+---------+---------------+--------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----+----------------+-----------+
| 7400660   | 743     | 2014      | 250     | 0             | 31           | 15  | 35  | 13  | 29  | 8   | 14  | 17  | 19  | 2   | 14  | 77  | 15  | 27  | 42  | 28 | 0              | 74        |
| 7400660   | 743     | 2015      | 69      | 1             | 10           | 5   | 15  | 0   | 12  | 5   | 10  | 7   | 5   | 0   | 6   | 15  | 1   | 6   | 7   | 7  | 0              | 74        |
| 7400660   | 743     | 2016      | 152     | 0             | 31           | 12  | 21  | 2   | 14  | 6   | 8   | 7   | 10  | 1   | 8   | 36  | 3   | 9   | 12  | 19 | 0              | 74        |
| 11500955  | 745     | 2013      | 13      | 0             | 5            | 2   | 4   | 0   | 1   | 1   | 2   | 2   | 2   | 0   | 2   | 5   | 0   | 0   | 0   | 5  | 0              | 115       |
| 11500955  | 745     | 2014      | 18      | 0             | 8            | 0   | 1   | 0   | 1   | 1   | 2   | 0   | 1   | 0   | 2   | 1   | 0   | 0   | 0   | 1  | 0              | 115       |
+-----------+---------+-----------+---------+---------------+--------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----+----------------+-----------+

It looks like Oscar Robles played in two different Mexican leagues. This is the same player. So, I pick the first player id and set all rows in player_season_stat_totals to that one. Finally, I delete the duplicate player that I removed from the player_season_stat_totals table.

SHAQ $ dolt sql -q "update player_season_stat_totals set player_id=7400660 where player_id=11500955"
Query OK, 2 rows affected
Rows matched: 2  Changed: 2  Warnings: 0
SHAQ $ dolt sql -q "delete from players where player_id=11500955"
Query OK, 1 row affected

Here's my final diff.

SHAQ $ dolt diff
diff --dolt a/player_season_stat_totals b/player_season_stat_totals
--- a/player_season_stat_totals @ beu0j7h698mseisrii1ciljmhn3dc36h
+++ b/player_season_stat_totals @ c3kghqu7ffol7i4rflvrs99koq8n486d
+-----+-----------+---------+-----------+---------+---------------+--------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----+----------------+-----------+
|     | player_id | team_id | season_id | minutes | games_started | games_played | 2pm | 2pa | 3pm | 3pa | ftm | fta | ast | stl | blk | tov | pts | orb | drb | trb | pf | season_type_id | league_id |
+-----+-----------+---------+-----------+---------+---------------+--------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----+----------------+-----------+
|  +  | 7400660   | 745     | 2013      | 13      | 0             | 5            | 2   | 4   | 0   | 1   | 1   | 2   | 2   | 2   | 0   | 2   | 5   | 0   | 0   | 0   | 5  | 0              | 115       |
|  +  | 7400660   | 745     | 2014      | 18      | 0             | 8            | 0   | 1   | 0   | 1   | 1   | 2   | 0   | 1   | 0   | 2   | 1   | 0   | 0   | 0   | 1  | 0              | 115       |
|  -  | 11500955  | 745     | 2013      | 13      | 0             | 5            | 2   | 4   | 0   | 1   | 1   | 2   | 2   | 2   | 0   | 2   | 5   | 0   | 0   | 0   | 5  | 0              | 115       |
|  -  | 11500955  | 745     | 2014      | 18      | 0             | 8            | 0   | 1   | 0   | 1   | 1   | 2   | 0   | 1   | 0   | 2   | 1   | 0   | 0   | 0   | 1  | 0              | 115       |
+-----+-----------+---------+-----------+---------+---------------+--------------+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+----+----------------+-----------+
diff --dolt a/players b/players
--- a/players @ 05moiukovgffgf4ppii9ani92e3rpnp0
+++ b/players @ q579pqtum73r5i4hja4v7jjb1i09n4jp
+-----+-----------+---------------+---------------+------------+-----------+---------------+-----------+
|     | player_id | nba_player_id | date_of_birth | first_name | last_name | height_inches | weight_lb |
+-----+-----------+---------------+---------------+------------+-----------+---------------+-----------+
|  -  | 418005    | NULL          | 1985-07-19    | Lamarcus   | Aldridge  | 83            | 265       |
|  -  | 11500955  | NULL          | NULL          | Oscar      | Robles    | 73            | NULL      |
+-----+-----------+---------------+---------------+------------+-----------+---------------+-----------+

Do it yourself

So, if run the write queries listed above on your own fork and make a PR, you will get credit for about 20 cells edited and make $50 when the bounty finishes.

We're not sure data bounties have the right incentive structure long term to facilitate this type of data cleaning task. We're open to your ideas about how we might encourage this type of cleaning work to happen on DoltHub. Come chat about your ideas with us in the #data-bounties room on our Discord.

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt

Or join our mailing list to get product updates.