We just finished SHAQ, one of our most competitive bounties ever. On the heels of that comes an even more ambitious project: tracking the US housing market. Every. Single. Sale.
Let's track housing sales across the US
After a lull following 2009, housing prices are at an all-time high. At the same time, so is housing insecurity: nearly 3 apartments sit vacant for every homeless person in NYC. Help us put a microscope on what's happening with home, apartment, building, and land prices across the US. What's housing selling for, what's selling, and where? If we aggregate all the data into one place, maybe we can get some real insight. Help us create that database.
What we want
We're looking for current and historical data on houses that have actually been sold. That means not current "for sale" listings on sites like Zillow, but actual transaction records from official sources. As a distributed database by design, Dolt is uniquely suited for the challenge: the sum-total creativity of our bounty participants is higher than that of any lone worker.
Plus, part of the appeal in scraping primary sources is that we learn how to continuously update our data, allowing the database to become more valuable over time instead of less.
There's just one important table in the bounty: the `sales` table. Here you can see what we're collecting.
| Field | Type | Null | Key | Default | Extra |
| --- | --- | --- | --- | --- | --- |
| state | char(2) | NO | PRI | | |
| zip5 | char(5) | YES | | | |
| physical_address | varchar(1024) | NO | PRI | | |
| city | varchar(255) | YES | | | |
| county | varchar(255) | YES | | | |
| property_id | varchar(255) | YES | | | |
| sale_date | datetime | NO | PRI | | |
| property_type | varchar(255) | YES | | | |
| sale_price | bigint | NO | | | |
| seller_name | varchar(1024) | YES | | | |
| buyer_name | varchar(1024) | YES | | | |
| num_units | int | YES | | | |
| year_built | int | YES | | | |
| source_url | varchar(2048) | YES | | | |
| book | int | YES | | | |
| page | int | YES | | | |
It's tough to structure any data that relies on addresses for its keys. We made the decision to key on the entire address/sale tuple, enforcing uniqueness across it:

```sql
primary key (zip5, physical_address, sale_date)
```
This is hopefully enough to disambiguate sales, without constraining our bounty participants too much.
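Putting the schema and key together, the table definition looks roughly like this. This is a sketch in MySQL-flavored SQL (which Dolt speaks), with column types taken from the table above; it follows the primary key stated in the text, which means `zip5` must be declared `not null` even though the table above lists it as nullable:

```sql
create table sales (
    state char(2) not null,
    zip5 char(5) not null,           -- not null because it participates in the key
    physical_address varchar(1024) not null,
    city varchar(255),
    county varchar(255),
    property_id varchar(255),
    sale_date datetime not null,
    property_type varchar(255),
    sale_price bigint not null,
    seller_name varchar(1024),
    buyer_name varchar(1024),
    num_units int,
    year_built int,
    source_url varchar(2048),
    book int,
    page int,
    primary key (zip5, physical_address, sale_date)
);
```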
Due to the sheer amount of data we collect, it can be hard to moderate incoming submissions, and asking participants to clean data they've already submitted is taxing for them. This time around we get to show off our newly-implemented SQL CHECK constraints, giving instant feedback to our submitters and producing a more consistent database.
```sql
constraint valid_year_built check (
    year_built <= 2022 and
    year_built >= 1492
),
constraint valid_sale_date check (
    sale_date >= '1492-01-01' and
    sale_date <= '2022-02-28'
)
```
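With those constraints in place, an out-of-range value fails at insert time instead of slipping into the database. A hypothetical session (the address is made up, and the error text is illustrative rather than Dolt's exact message):

```sql
-- A typo like a three-digit year is rejected immediately:
insert into sales (state, zip5, physical_address, sale_date, sale_price, year_built)
values ('NY', '10001', '123 Example St', '2021-06-15', 750000, 192);
-- Error: check constraint valid_year_built violated

-- The corrected row goes through:
insert into sales (state, zip5, physical_address, sale_date, sale_price, year_built)
values ('NY', '10001', '123 Example St', '2021-06-15', 750000, 1920);
```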
You take home money proportional to how many cell edits you create, out of a total prize pool of 10,000 USD. So if you create 45% of all the cell edits, you take home 4,500 USD.
Visit here to learn more.
The bounty runs until Friday Feb. 18, 2022. Fork the database and make your first pull request. We'll be waiting for you on Discord if you have any questions on how to get started. Our #data-bounties channel would love to meet you.