$12,000 US Housing Prices Bounty

BOUNTY
4 min read

Coming off the heels of our successful museum collections data bounty is a sequel to one of our flagship projects: tracking every single housing sale record in existence.

The link to the latest bounty is here.

If you've never heard of Dolt, we use a unique technology to build databases in a distributed way. It's the same way codebases are built with Git. We were able to build the world's largest open database of hospital prices this way.

For the current bounty, we're looking for current and historical data on houses that have actually been sold. That means not current "for sale" prices on sites like Zillow, but data from primary sources like government records (which are public.)

Our new payment model will make this our best bounty yet.

Data bounties can create perverse incentives

DoltHub is the only place on the internet where anyone can get paid to scrape data. Dolt data bounties are built like Git repositories: you make a pull request (PR) against a database (like a Git repository) which is reviewed by the bounty coordinator (me, at present). If your PR gets merged, you get paid.

But our system of paying per cell edit resulted in some classic perverse incentives.

  1. PRs slowed down as time went on, since the more data is already in the database, the less subsequent additions are worth
  2. We never get the schema right on the first try, since this involves knowing the entire data landscape in advance, which is usually only possible after the bounty ends
  3. Some huge dataset "whales" were worth a lot of "points" (total cell edits) but didn't make the database that much more useful. This leads to people going after datasets that get them a lot of points, but don't make for interesting datasets.

We lobbed around a few solutions and settled on one:

We're chunking one big bounty into lots of smaller ones

In this case, we'll be running a fresh $2k bounty (for the same data) every week for six weeks, for a total of $12k.

This solves the above problems:

  1. We "restart" the bounty every week to keep you motivated to keep bounty hunting
  2. We have the chance to update the schema weekly if necessary

Plus, with this model, we can continue the bounty if things are going well, or stop it early if PRs slow down too much.

The new Dolt storage format

You'll need to get the latest Dolt to accommodate our new, faster, Dolt storage format. As beta testers of this new format, you'll probably encounter some bugs. Just be patient with us.

The sales table

There's just one important table in the bounty: the sales table. Here you can see what we're collecting (the columns have been ordered here for clarity).

CREATE TABLE `sales` (
  `state` char(2) NOT NULL,
  `property_zip5` char(5),
  `property_street_address` varchar(1024) NOT NULL,
  `property_city` varchar(255),
  `property_county` varchar(255),
  `property_id` varchar(255),
  `property_type` varchar(255),
  `property_township` varchar(255),
  `property_lat` float,
  `property_lon` float,
  `sale_datetime` datetime NOT NULL,
  `sale_id` varchar(255),
  `book` int,
  `page` int,
  `sale_price` bigint NOT NULL,
  `seller_1_name` varchar(1024),
  `seller_1_state` enum('ak','al','ar','as','az','ca','mn','mo','mp','ia','id','il','ms','mt','nc','de','fl','fm','ga','gu','hi','nj','nm','nv','ny','oh','ok','co','ct','dc','sd','tn','in','tx','ks','ky','ut','va','vi','or','pa','pr','la','ma','md','vt','wa','wi','me','mh','mi','pw','ri','sc','wv','wy','nd','ne','nh'),
  `seller_2_name` varchar(1024),
  `seller_2_state` enum('ak','al','ar','as','az','ca','mn','mo','mp','ia','id','il','ms','mt','nc','de','fl','fm','ga','gu','hi','nj','nm','nv','ny','oh','ok','co','ct','dc','sd','tn','in','tx','ks','ky','ut','va','vi','or','pa','pr','la','ma','md','vt','wa','wi','me','mh','mi','pw','ri','sc','wv','wy','nd','ne','nh'),
  `buyer_1_name` varchar(1024),
  `buyer_1_state` enum('ak','al','ar','as','az','ca','mn','mo','mp','ia','id','il','ms','mt','nc','de','fl','fm','ga','gu','hi','nj','nm','nv','ny','oh','ok','co','ct','dc','sd','tn','in','tx','ks','ky','ut','va','vi','or','pa','pr','la','ma','md','vt','wa','wi','me','mh','mi','pw','ri','sc','wv','wy','nd','ne','nh'),
  `buyer_2_name` varchar(1024),
  `buyer_2_state` enum('ak','al','ar','as','az','ca','mn','mo','mp','ia','id','il','ms','mt','nc','de','fl','fm','ga','gu','hi','nj','nm','nv','ny','oh','ok','co','ct','dc','sd','tn','in','tx','ks','ky','ut','va','vi','or','pa','pr','la','ma','md','vt','wa','wi','me','mh','mi','pw','ri','sc','wv','wy','nd','ne','nh'),
  `transfer_deed_type` varchar(100),
  `deed_date` datetime,
  `building_num_units` int,
  `building_year_built` int,
  `building_num_stories` int,
  `building_num_beds` int,
  `building_num_baths` int,
  `building_area_sqft` int,
  `building_assessed_value` bigint,
  `building_assessed_date` datetime,
  `land_area_acres` int,
  `land_area_sqft` int,
  `land_assessed_value` bigint,
  `land_assessed_date` datetime,
  `source_url` varchar(2048),
  PRIMARY KEY (`state`,`property_street_address`,`sale_datetime`),
  KEY `state` (`state`),
  CONSTRAINT `lcr47pg1` FOREIGN KEY (`state`) REFERENCES `states` (`code`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_bin

Prize structure

For each bounty you take cash out proportional to how many cell edits you create, with a max prize of $2,000. So if you create 45% of all the cell edits, you take home $900 each week.

Visit here to learn more.

What next?

The complete bounty is scheduled to run 6 weeks, until Friday Sep. 21, 2022. Fork the database and make your first pull request. We'll be waiting for you on Discord if you have any questions on how to get started. Our #data-bounties channel would love to meet you.

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt

Or join our mailing list to get product updates.