Announcing the $10,000 US Businesses Bounty

This guest blog post is by Spacelove, a top performer of several of our former bounties and winner of many thousands of dollars in prize money. He agreed to be the coordinator and judge for this new data bounty.

Today we're launching a $10,000 bounty to build a free, open dataset of all businesses in the United States. DoltHub bounties are a great way to make money while learning Python, pandas, SQL, and web scraping.

There are millions of businesses in the US, but no open dataset of them exists. Some vendors maintain and sell their own lists of US businesses, but those lists can only be obtained through a subscription fee, a large one-time payment, or university access. To create a public, freely available, consolidated dataset, we'll have to build it ourselves. That's what we hope to achieve with this bounty.

The Schema

There's just one important table in the bounty: the businesses table.

+-------------------+--------------+------+-----+---------+-------+
| Field             | Type         | Null | Key | Default | Extra |
+-------------------+--------------+------+-----+---------+-------+
| name              | varchar(180) | NO   | PRI |         |       |
| business_type     | varchar(20)  | NO   | PRI |         |       |
| state_registered  | char(2)      | NO   | PRI |         |       |
| street_registered | varchar(180) | YES  |     |         |       |
| city_registered   | varchar(100) | YES  |     |         |       |
| zip5_registered   | char(5)      | YES  |     |         |       |
| state_physical    | char(2)      | YES  |     |         |       |
| street_physical   | varchar(180) | YES  |     |         |       |
| city_physical     | varchar(100) | YES  |     |         |       |
| zip5_physical     | char(5)      | YES  |     |         |       |
| filing_number     | varchar(15)  | YES  |     |         |       |
| public            | char(1)      | YES  |     |         |       |
| naics_2017        | char(6)      | YES  | MUL |         |       |
| ein               | char(9)      | YES  |     |         |       |
| sic4              | char(4)      | YES  | MUL |         |       |
| parent            | varchar(180) | YES  |     |         |       |
| website           | varchar(100) | YES  |     |         |       |
| duns              | char(9)      | YES  |     |         |       |
+-------------------+--------------+------+-----+---------+-------+

The hardest part of designing this schema was choosing a good primary key. Generally speaking, active business names should be unique within a state and within a type (LLC, Corporation, etc.). That's the hope, anyway. That's why we chose name, business_type, and state_registered as a three-part composite primary key.

An example key is (A ALASKA CRUISE TRANSFER AND TOURS LLC, LLC, AK), which comes from the data we used to seed the database.
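To see how a three-part composite key behaves, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for Dolt. The table is a cut-down version of the schema above, and the WA row is made up purely for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for Dolt (an assumption for this sketch).
# Only the three key columns and one extra column are included.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE businesses (
        name             VARCHAR(180) NOT NULL,
        business_type    VARCHAR(20)  NOT NULL,
        state_registered CHAR(2)      NOT NULL,
        city_registered  VARCHAR(100),
        PRIMARY KEY (name, business_type, state_registered)
    )
    """
)

# The example key from the seed data; the city is left NULL here.
seed_row = ("A ALASKA CRUISE TRANSFER AND TOURS LLC", "LLC", "AK", None)
conn.execute("INSERT INTO businesses VALUES (?, ?, ?, ?)", seed_row)


def try_insert(row):
    """Return True if the row was inserted, False if its key already exists."""
    try:
        conn.execute("INSERT INTO businesses VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


# Hypothetical row: same name and type, different state, so the key is new.
inserted_other_state = try_insert(
    ("A ALASKA CRUISE TRANSFER AND TOURS LLC", "LLC", "WA", None)
)
# Repeating all three key columns violates the primary key and is rejected.
inserted_duplicate = try_insert(seed_row)

row_count = conn.execute("SELECT COUNT(*) FROM businesses").fetchone()[0]
print(inserted_other_state, inserted_duplicate, row_count)
```

The practical consequence for scrapers is that two sources describing the same business must agree on all three key columns, or they will land in the table as separate rows.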

Some other fields which might need explanation:

  1. Every business has to register with the secretary of state. Upon formation it is given a number called its entity_id (or sometimes filing ID, which is equivalent). This number is unique for every business within a state, and it goes in the filing_number column.
  2. NAICS and SIC codes (naics_2017 and sic4) identify the kind of business (food service, metal mining, winery).
  3. In some states, businesses have a parent company that owns them (the parent column).
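Before opening a pull request, it can help to check scraped rows against the column widths in the schema. The sketch below is one way to do that in plain Python. The validate_row helper and the sample values are hypothetical, and it only checks the NOT NULL columns and maximum lengths shown in the table, not any additional rules from the README.

```python
# Maximum lengths copied from the schema table above.
MAX_LENGTH = {
    "name": 180, "business_type": 20, "state_registered": 2,
    "street_registered": 180, "city_registered": 100, "zip5_registered": 5,
    "state_physical": 2, "street_physical": 180, "city_physical": 100,
    "zip5_physical": 5, "filing_number": 15, "public": 1, "naics_2017": 6,
    "ein": 9, "sic4": 4, "parent": 180, "website": 100, "duns": 9,
}

# The three primary key columns are the only ones marked NOT NULL.
REQUIRED = ("name", "business_type", "state_registered")


def validate_row(row):
    """Return a list of problems; an empty list means the row fits the schema."""
    problems = []
    for field in row:
        if field not in MAX_LENGTH:
            problems.append(f"{field}: not a column in the businesses table")
    for field in REQUIRED:
        if not row.get(field):
            problems.append(f"{field}: required, but missing or empty")
    for field, limit in MAX_LENGTH.items():
        value = row.get(field)
        if value is not None and len(str(value)) > limit:
            problems.append(f"{field}: longer than {limit} characters")
    return problems


# Hypothetical row with a six-character value in a five-character ZIP column.
example_problems = validate_row({
    "name": "A ALASKA CRUISE TRANSFER AND TOURS LLC",
    "business_type": "LLC",
    "state_registered": "AK",
    "zip5_registered": "995011",
})
print(example_problems)
```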

There are a few other rules, which you can read in the README for the database.

Prize structure

We're trying out a new prize structure. Past bounties allowed the quickest scrapers to make off with the lion's share of the prize money, leaving little for the rest of the bounty hunters. This time we're using a fixed-payout scheme to keep people motivated to scrape for second, third, and fourth place.

Your place is determined by the fraction of cells that are yours in the final table.

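+------------+----------+
| Your place | You make |
+------------+----------+
| 1st        | $5,000   |
| 2nd        | $2,500   |
| 3rd        | $1,250   |
| 4th        | $625     |
| 5th        | $320     |
| 6th        | $150     |
| 7th        | $100     |
| 8th-20th   | $50      |
+------------+----------+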
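As a rough illustration of how the fixed payouts map onto the final standings, here is a small Python sketch. The contributor names and cell counts are invented, and the tie-breaking (input order) is an assumption of the sketch, not a rule of the bounty.

```python
# Payouts for the top seven places, copied from the prize table.
TOP_PAYOUTS = {1: 5000, 2: 2500, 3: 1250, 4: 625, 5: 320, 6: 150, 7: 100}
LOWER_TIER_PAYOUT = 50   # 8th through 20th place
LAST_PAID_PLACE = 20


def payout_for_place(place):
    """Return the prize in dollars for a finishing place (0 if unpaid)."""
    if place in TOP_PAYOUTS:
        return TOP_PAYOUTS[place]
    if 8 <= place <= LAST_PAID_PLACE:
        return LOWER_TIER_PAYOUT
    return 0


def rank_contributors(cell_counts):
    """Rank contributors by their share of cells in the final table.

    Returns (place, name, fraction_of_cells, payout) tuples. Ties keep their
    input order, which is an assumption made for this sketch.
    """
    total = sum(cell_counts.values())
    if total == 0:
        return []
    ranked = sorted(cell_counts.items(), key=lambda item: item[1], reverse=True)
    return [
        (place, name, count / total, payout_for_place(place))
        for place, (name, count) in enumerate(ranked, start=1)
    ]


# Hypothetical standings: number of cells attributed to each contributor.
standings = rank_contributors({"bob": 300, "alice": 600, "carol": 100})
for place, name, fraction, payout in standings:
    print(f"{place}. {name}: {fraction:.0%} of cells, ${payout:,}")
```

Unlike a proportional split, the payout here depends only on the finishing place, so the gap between first and second is the same whether the leader owns 60% of the cells or 90%.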

What next?

The bounty runs until November 1, 2021. Fork the database and make your first pull request. We'll be waiting for you on Discord if you have any questions on how to get started. Our #data-bounties channel would love to meet you.
