Last year I wrote about how much new health insurance data spilled onto the web (petabytes). It turned out that, for practical purposes, most of the rates could be discarded, with the usable fraction probably being less than 10%.
But it looks like even the good data might be busted. The insurance machine-readable files, or "MRFs", are broken, with contracted rates that are wrong, mislabeled, or missing.
And if you haven't heard about this, it's because the companies that work with this data are also the ones that sell it.
Here's the problem: you can't verify the published rates, because there's nothing to compare them against. So companies that have the data are spending more money on external data (like claims) to "verify" these rates, which leaves the public data pretty useless for everyone else.
And with what little we can verify, it looks wrong. The best hope we have is comparing the insurance data to the also-now-public hospital data, a happy accident that lets us cross-check the rates. Still, it leads us to a pretty grim picture of transparency in healthcare.
At DoltHub, where we build databases like codebases, we're running a data bounty, collecting rates for popular medical procedures for all US hospitals. Then we'll release the data under CC. Find out more here.
I'd cut the insurance companies some slack on compliance, except they've made my life analyzing these rates so much harder than it had to be:
- Humana published their MRFs across 11,000,000 different files, then rate-limited how fast you could download them
- Anthem/Florida Blue gave all their rates an ambiguous label ("fee schedule") which makes interpreting them impossible (see my discussion with the CMS here)
- Kaiser Permanente ignored the schema requirements for many of their files by including extra fields, mislabeled fields, or bad character encoding
- Aetna seemingly published most of their rates, but ignored my calls, emails, and tweets asking for clarification on some of the published rates
The list goes on. Anyway, this is what the insurance companies want. They've referred to their negotiated rates as "trade secrets" which give them a competitive advantage. What do they care if their published rates are wrong? It only helps them.
The first sign that something was wrong
Last year I got contacted to download and flatten some insurance data for a private company using our in-house tooling. They wanted to figure out how much cheaper they were compared to their competition. I got back to them with a list of rates for different billing codes, for them, and a bunch of other providers on their list.
They told me there must be something wrong with the data I gave them, because their own company's rates were wrong in the dataset:
- Aetna's rates matched around 45% of this company's contracted rates
- Anthem Blue Cross had varying degrees of matching, ranging from 0% to 75%, depending on the state
- United HealthCare's rates matched just 2% of the time
And this was when we included the spurious "fee schedule" rates, which elevate the match percentage. It happened over and over: for an item that should have been reimbursed at $250, United said it only reimbursed the company $5, and so on for many other items.
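To make "matched" concrete, here's a rough sketch of how a match percentage like the ones above could be computed. This is not the in-house tooling we used. The function name, the dictionary layout, and the billing codes are all made up for illustration. Only the $250 vs. $5 pair comes from the example above.

```python
from decimal import Decimal


def match_rate(contracted, published, tolerance=Decimal("0.00")):
    """Fraction of a provider's known contracted rates found in the insurer's data.

    contracted: dict of billing code -> Decimal rate the provider knows it is paid
    published:  dict of billing code -> list of Decimal rates listed by the insurer
    tolerance:  largest absolute difference still counted as a match
    """
    if not contracted:
        return 0.0
    matches = 0
    for code, true_rate in contracted.items():
        candidates = published.get(code, [])
        # A code counts as matched if ANY published rate for it is close enough.
        if any(abs(rate - true_rate) <= tolerance for rate in candidates):
            matches += 1
    return matches / len(contracted)


# Hypothetical data: two billing codes, one of which the insurer got right.
contracted = {"CODE_A": Decimal("250.00"), "CODE_B": Decimal("40.00")}
published = {
    "CODE_A": [Decimal("5.00")],                    # listed at $5, should be $250
    "CODE_B": [Decimal("40.00"), Decimal("38.00")],  # one of the listed rates is right
}

print(match_rate(contracted, published))  # 0.5
```

Counting a code as matched when any published rate agrees is the generous reading. A stricter check would require the rate to match under the right plan or label, which would push the percentages even lower.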
The data was just wrong. Was this a pattern we could reproduce?
Cross checking with hospital data
It's actually a lot harder to cross-check these rates than you would think. Insurers negotiate by tax ID (TIN), which, for a hospital, is not always easy to find. Luckily, the Transparency in Pricing Act required hospitals to put the TIN in the filenames of their price sheets, in the format
We happen to have built a database of those standard charge files, so for the fraction of hospitals that are compliant, we can extract those TINs, and then use them to filter down to just hospital contracted rates.
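Here's a minimal sketch of that extraction and filtering step. It assumes the filename starts with a nine-digit TIN, optionally written with a hyphen after the first two digits. That pattern is my assumption for illustration, so adjust it to whatever the files actually look like. The function names, the row layout, and the example filename are also hypothetical.

```python
import re

# ASSUMPTION: the filename begins with a 9-digit TIN, optionally written as
# NN-NNNNNNN, followed by a separator. Non-compliant filenames won't match.
TIN_PREFIX = re.compile(r"^(\d{2})-?(\d{7})(?=[_\-.]|$)")


def tin_from_filename(filename):
    """Return the 9-digit TIN at the start of a filename, or None if absent."""
    m = TIN_PREFIX.match(filename)
    if m is None:
        return None
    return m.group(1) + m.group(2)


def hospital_rates_only(insurer_rows, hospital_tins):
    """Keep only the insurer rows whose TIN belongs to a known hospital."""
    return [row for row in insurer_rows if row["tin"] in hospital_tins]


# Hypothetical filenames; only the first two look compliant under the assumption.
filenames = [
    "12-3456789_example-hospital_standardcharges.csv",
    "987654321_another-hospital_standardcharges.csv",
    "standardcharges.csv",
]
hospital_tins = {t for t in map(tin_from_filename, filenames) if t is not None}

insurer_rows = [
    {"tin": "123456789", "code": "CODE_A", "rate": 100.0},
    {"tin": "555555555", "code": "CODE_A", "rate": 90.0},  # not a hospital we know
]
print(hospital_rates_only(insurer_rows, hospital_tins))
```

Hospitals whose filenames don't follow the pattern simply drop out, which is why this only works for the compliant fraction.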
For convenience, I'm only going to compare insurance rates with hospitals who've also published their rates in CSV format. And to be extra sure that these rates actually get charged, I'm going to filter down to just this list of confirmed rates. That leaves only a tiny fraction of rates to compare, but they should be the highest quality.
So after untold gigabytes downloaded, I started going down the list of hospitals, in order. Here are the first hospitals that I found rates for. Because these are "negotiated, fee for service" rates, they ought to match exactly:
- Allegiance has a contract with Benefis, a hospital system. Allegiance says it pays Benefis, the hospital, $14.66 for a wrist X-ray. But that's not right — according to the hospital, that's just what the radiologist gets. The hospital gets $203.40 from Allegiance.
- Aetna has a contract with Pennsylvania Hospital. It says it pays the hospital $771.00 for an upper GI endoscopy. But the hospital claims the actual reimbursement is much higher: $1,954-$3,900, depending on the plan.
- United HealthCare has a contract with St. Joseph Medical Center. It says that it pays the hospital $200-$830 for a biopsy of the upper GI. The hospital disagrees: it says the rate is actually $2,036.72, for all the United plans.
In a table:
I've starred one rate to highlight the single close match (which is still off by 10%!) and to demonstrate how far off all the others are (when they even exist!) How come most of the rates are so far apart? How come the insurer rates are systematically low?
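To put a number on "far apart", here's a quick sketch that computes how far the insurer's figure falls below the hospital's for the three examples above. Where a range was published, I've assumed the most charitable pairing (the insurer's highest figure against the hospital's lowest), so these are lower bounds on the gap. The function name is just for illustration.

```python
def shortfall(insurer_rate, hospital_rate):
    """Fraction by which the insurer-published rate falls below the hospital's."""
    return (hospital_rate - insurer_rate) / hospital_rate


# (label, insurer-published rate, hospital-published rate), taken from the
# three examples above. For ranges, the closest pair of endpoints is used.
examples = [
    ("Allegiance / Benefis, wrist X-ray", 14.66, 203.40),
    ("Aetna / Pennsylvania Hospital, upper GI endoscopy", 771.00, 1954.00),
    ("United / St. Joseph, upper GI biopsy", 830.00, 2036.72),
]

for label, insurer_rate, hospital_rate in examples:
    gap = shortfall(insurer_rate, hospital_rate)
    print(f"{label}: insurer figure is {gap:.0%} below the hospital's")
```

Even with the charitable pairing, every one of these insurer figures comes out at less than half of what the hospital reports.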
The root problem is this: the CMS should be verifying the rates, or giving the public a way to cross-check the rates, so that insurers are forced to publish correct data. The CMS can enforce audits, or claims data should be made public to link insurers to procedures and prices.
If you can't trust the data, what's all this transparency stuff for, anyways?
What we're doing at DoltHub, and why
We're extending our coverage of the MRF data by collecting the rates for the most popular billing codes for all US hospitals. If you want to check our progress, you can click here. You can even download the data as we import it.
Once we're done collecting the insurance data, we're expanding our cross-checks by also compiling all of the hospital standard charge data into a single relational database. Click that link to check out the existing database of standard charge files. You can catch up on where we're at by checking out this Google Doc.
DoltHub doesn't care about selling this data. Our database, Dolt, which works like Git, allows us to build these databases collaboratively with people all over the world.
If you want to know more or just follow up with a question, join our active Discord or write me at firstname.lastname@example.org.
Thanks to David Gaines from CareIgnition and Rob Archibald from My Price Health for help brainstorming, fact-checking, and writing this. Both of them are experts in healthcare transparency data.