On January 1, 2021, a US law was passed requiring hospitals to publish their prices in human and machine readable format. We would like to assemble the best open dataset of hospital prices in the US to aid researchers. To this end, we’re launching our second bounty! The $10,000 bounty will be divided based on the percentage of cells modified, last write wins, in any table in the database during the two month duration of the bounty. The bounty ends March 1, 2021 and payouts will be calculated and processed then.
Bounties on DoltHub are a way to pay users to wrangle data. Because Dolt has Git-style versioning, interesting payout methods like percentage of cells edited are possible. We will be publishing new bounties approximately monthly so join our Discord to stay apprised!
The database consists of three tables:
cpt_hcpcs table consists of three columns,
code is the CPT or HCPCS code for the service. The CPT code is a standard maintained by the American Medical Association. HCPCS is the Medicare standard. We believe most hospitals will publish using the CPT code but HCPCS codes will be used instead if the pricing data was published using HCPCS. CPT codes are numerical while HCPCS codes start with a letter followed by 4 numbers
hospitals table has the columns
url column needs to be populated with the url where the pricing data can be found. The
npi_number is the 10 digit national provider identifier for the corresponding hospital and
name is the name of the hospital.
prices table consists of four columns:
code, with a foreign key constraint on the
npi_number, with a foreign key constraint on the
payer which is the insurer and takes on the value ‘CASH’ in the case where the payer is the individual, and
price, the price of the medical shoppable service in USD.
I started by picking a hospital, Ben Taub hospital in Houston, TX and googled
Ben Taub Hospital shoppable services, the second result had a reference to Hospital Price Transparency. Following the link, I found a zip download for the hospital’s Charge Description Master, which contains the descriptions, billing codes, and charge amounts for each service.
Unzipping the file yields an excel spreadsheet. In order to read it programmatically in python I rely on openpyxl. Next I google the NPI number for the hospital, which nets me the hospital’s NPI number and address, and I add them to the
hospitals table. Now that’s complete, I can rev up a dolt mysql server
dolt sql-server and run my script to add each of the entries in the Charge Description Master to the database.
You can look at the script here to help you get started.
Update on the US Election Bounty
The results for the US Election bounty so far have been fantastic, and there’s still one month left to vie for a piece of the prize! 6 people had 43 Pull Requests accepted. We’ve had over 10.3M cells edited in the database. All 50 states from 2016 are covered. 25 states from 2020 are covered.
If the bounty ended today, our top contributor would earn almost $12,000 and every participant would earn over $1,000. You can follow current payouts on the live scoreboard . Data wrangling for bounties can be lucrative work.
Feel free to clone the data and start analyzing it even if you don’t want to contribute. As updates are published just run dolt pull to get the latest results. Run dolt diff to see what changed. This is the power of using a version controlled database. If you have questions, stop by our Discord to talk it over with us and other bounty participants.