Hospital data for all: Part I


I announced a few months ago that we were planning to put hospital prices into a single, public, free database, giving the public access to the rates insurers negotiate with hospitals in secret.

We’ll capture these rates in two parts: first from the insurer side, and then from the hospital side.

The work for part I starts this week, and you can follow our progress by watching our public database and following our Discord chat.

The big picture

The hospital data, combined with the insurance data, will be the biggest public data repository of its kind. We hope it'll be used by journalists and policymakers alike. Here are some directions:

  • price shoppers won't be able to use the data immediately. It'll take time before we can reverse engineer something like a final cost-to-consumer, if that's even possible. (We'll try.)

  • some hospitals, especially monopoly hospitals, are bad sports when it comes to pricing. With market power comes leverage, and hospitals can potentially ask for whatever prices they want. As long as nobody's lying about their published rates (or just omitting them), we'll be able to see this in the data.

  • we have evidence (unpublished) that insurers are publishing incorrect data in their MRFs: we see rates that should be in the files but aren't, and rates that shouldn't be, but are. Combining this data with the hospital data can give us a hint as to whose side the errors are on.

Why now

There are three reasons why we're doing this in March 2023:

  • We finally have a proven data pipeline that can take the insurance MRFs and filter them down to just the information we need. We can easily use this to get prices for all public hospitals, as long as we have their NPI numbers.
  • We’ve collected the standard charge files via a data bounty. You can find those here.
  • We now have experience collecting healthcare data and have a diverse, talented pool of paid volunteers who have been doing ETL with us.

On the insurance side

Transparency in Coverage gave us access to huge, well-structured files that list contracted rates between insurance companies and healthcare providers. But those files are unworkably large and full of junk information.

We had to write a workflow that let us pull just the information we needed, and do it in a distributed way. That work became a Python project called mrfutils, which you can check out here. We used mrfutils to pull in millions of lab test prices from all over the US, and we're going to do the same for hospital prices.
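To give a feel for what "pulling just the information we need" means, here is a minimal sketch in plain Python. It is not mrfutils and does not use its API. It assumes a simplified record shape modeled on the Transparency in Coverage in-network schema (`negotiated_rates`, inline `provider_groups` with an `npi` list, and `negotiated_prices`); the function name `filter_rates_by_npi` is made up for illustration, and rates that point to providers through references instead of inline groups are not handled.

```python
from typing import Dict, Iterable, Iterator, Set


def filter_rates_by_npi(
    in_network_items: Iterable[Dict], wanted_npis: Set[int]
) -> Iterator[Dict]:
    """Yield one flat row per (billing code, NPI, price) for the wanted NPIs.

    Assumption: each item looks like a simplified in-network record with
    inline provider groups. Anything without a matching NPI is skipped.
    """
    for item in in_network_items:
        for rate in item.get("negotiated_rates", []):
            # Collect every NPI attached to this negotiated rate.
            npis = set()
            for group in rate.get("provider_groups", []):
                npis.update(group.get("npi", []))

            matched = npis & wanted_npis
            if not matched:
                continue  # none of our hospitals: drop the whole rate

            for price in rate.get("negotiated_prices", []):
                for npi in sorted(matched):
                    yield {
                        "billing_code_type": item.get("billing_code_type"),
                        "billing_code": item.get("billing_code"),
                        "npi": npi,
                        "negotiated_rate": price.get("negotiated_rate"),
                    }
```

Because the function is a generator over an iterable, it can sit behind a streaming JSON parser, so a file never has to fit in memory all at once.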

On the hospital side: "the fourth time’s the charm"

Hospital chargemasters are notoriously ugly, diverse, and hard to put into a common schema. We’ve built three hospital price databases, each better than the previous one and all with some limited success. But they float around as prototypes because we aren't sure we can really trust the data.

First, in past attempts we linked hospitals to the wrong standard charge files. This can happen when you have multiple hospitals named St. Mary, or hospitals that operate under multiple names. This time, we’ve carefully matched the links to the files in advance and added other identity checks.

Second, we assumed a schema that was too simple. Billing codes can have multiple parts and come in multiple formats. We’re going to have our volunteers keep as much data as possible, structured so that we can recover the original data if needed while still letting them make their best guesses about where the right data should go.
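One way to picture "keep the original, but also record a best guess" is the sketch below. It is only an illustration, not the project's schema: the `CodeRecord` fields, the `guess_code` helper, and the short list of recognized prefixes are all assumptions.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class CodeRecord:
    raw_code: str              # exactly as it appeared in the hospital's file
    code_type: Optional[str]   # best guess at the code system; None if unsure
    code: Optional[str]        # best guess at the code itself; None if unsure


# Hypothetical pattern: a known prefix, then spaces or colons, then the code.
_PREFIXED = re.compile(r"(CPT|HCPCS|MS-DRG)[\s:]+([A-Z0-9]+)")


def guess_code(raw: str) -> CodeRecord:
    """Parse what we can, but never throw away the raw value."""
    match = _PREFIXED.fullmatch(raw.strip().upper())
    if match is None:
        # No confident guess: keep the raw text and leave the rest empty.
        return CodeRecord(raw_code=raw, code_type=None, code=None)
    return CodeRecord(raw_code=raw, code_type=match.group(1), code=match.group(2))
```

Since `raw_code` is always stored untouched, a wrong guess can be corrected later without going back to the source file.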

Third, we allowed the collection of any chargemaster data. This time, we're only allowing chargemasters which have key datapoints.
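A gate like this can be as simple as checking a file's header before accepting it. The sketch below is an assumption about how such a check might look; the required column names are placeholders, not the actual list of key datapoints.

```python
from typing import Iterable, List, Set

# Placeholder names only; the real list of key datapoints is not given here.
REQUIRED_COLUMNS: Set[str] = {"description", "gross_charge", "payer"}


def missing_columns(header: Iterable[str], required: Set[str] = REQUIRED_COLUMNS) -> List[str]:
    """Return the required columns that the header lacks, sorted by name."""
    present = {name.strip().lower() for name in header}
    return sorted(required - present)


def is_acceptable(header: Iterable[str]) -> bool:
    """Accept a chargemaster only if every required column is present."""
    return not missing_columns(header)
```

Rejecting a file up front is cheaper than discovering halfway through the ETL that it cannot be used.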

Finally, we used to launch bounties as a surprise. This time, we’ve told everyone that it’s coming, and have been including the volunteers’ input as part of the planning process. We’re hoping that this will give us better prepared volunteers and, with their knowledge, help us build a more robust system.

This is a public project and you can help

You can donate your time or money to help us.

  1. If you have expertise, take a look at this Google sheet. It's open to the public for comments.
  2. We're looking for paid volunteers who can help us wrangle datafiles. We pay out a fixed amount per week. If you're interested in being a bounty hunter, join our Discord and ask for @spacelove.
  3. We'll also accept funds that allow us to keep sourcing data from paid volunteers indefinitely.

We'll be publishing on this periodically, so stay tuned!


