Today, I’m thrilled to announce a second round of the Hospital Price Transparency Bounty. Our first pass was illuminating; we learned a lot about the “shape” of the data, and we’re excited to let our Discord lovelies take a second crack at it. We’ve made more informed choices about the schema and the methodology for wrangling the data, in the hope that the resulting dataset will be more amenable to analysis.
I anticipate that those who contributed to the first version will get up to speed quickly with this redux of the original bounty. Newcomers, meanwhile, can rely on the groundwork laid during the first run, which should make it much simpler to jump in at round two, especially if the first round seemed too daunting.
In this blog post we’re going to look at some of the challenges that have cropped up so far in working with this hospital data, challenges that led us to host a second round of the bounty.
Scheming of a Perfect Schema
Schema changes mid-contest are rife with hassle. Tweaking the schema raises two fears at once: breaking the credit attribution of already closed PRs, and invalidating open PRs that have yet to be merged. Changing the schema while the bounty is already in the water is thus a very expensive operation, mostly to be avoided. This creates a lot of pressure to design the schema perfectly from the outset, which proves nearly impossible without deep familiarity with the yet-to-be-collected data.
The schema changes for this round center on a more refined way to deal with primary key collisions. For instance, a hospital might have multiple price points for a given procedure and payer, and our initial schema left no way to represent that. During the first run of the bounty we relied on ad hoc workarounds for issues like these. Another example: we kept one canonical description for each procedure code, which left no room for the myriad descriptions that different hospitals attach to the same code. We’d end up taking the first description encountered, which felt a bit arbitrary. A detailed account of all the changes is available in the database readme.
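To make the collision problem concrete, here is a minimal sketch in Python. The row layout, the field names, and the sample values are all hypothetical, and widening the key with the hospital's own description is just one possible fix chosen for illustration; the actual revised schema is the one documented in the database readme.

```python
# Hypothetical rows: (hospital_id, procedure_code, payer, description, price).
# None of these names or values come from the real bounty database.
rows = [
    ("H1", "99213", "Acme Health", "OFFICE VISIT EST PT LVL 3", 120.0),
    ("H1", "99213", "Acme Health", "Office/outpatient visit, established", 95.0),
    ("H2", "99213", "Acme Health", "CLINIC VISIT LEVEL 3", 140.0),
]


def load_narrow_key(rows):
    """Key on (hospital, code, payer); a second row with the same key collides."""
    table, collisions = {}, []
    for hospital, code, payer, _description, price in rows:
        key = (hospital, code, payer)
        if key in table:
            # Mimics a primary key violation: the later price point is lost.
            collisions.append(key)
            continue
        table[key] = price
    return table, collisions


def load_wide_key(rows):
    """Illustrative fix: include the hospital's description in the key."""
    table = {}
    for hospital, code, payer, description, price in rows:
        table[(hospital, code, payer, description)] = price
    return table


narrow, collisions = load_narrow_key(rows)
wide = load_wide_key(rows)
print(len(narrow), "rows kept with the narrow key;", len(collisions), "collision")
print(len(wide), "rows kept with the wide key")
```

With the narrow key, the second price point for hospital H1 has nowhere to go. With the wider key, both price points and both hospital-specific descriptions survive.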
Secondly, when the datasets and the PRs grow very large, it becomes increasingly difficult for a single reviewer to keep out cruft and mistakes. Manually inspecting a subset of each PR works when the PRs are small, but this bounty made the need for automated checking of PRs plain. We’re hard at work on the best ways to validate the database as it grows to a gargantuan size.
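As a rough idea of what automated checking could look like, here is a small Python sketch that scans a batch of rows and reports problems a reviewer would otherwise have to spot by eye. It is not the validation we run; the field names, the choice of key, and the three rules (required fields, positive prices, no duplicate keys) are assumptions made for the example.

```python
from collections import Counter

# Hypothetical field names; the real schema lives in the database readme.
REQUIRED_FIELDS = ("hospital_id", "code", "payer", "price")
KEY_FIELDS = ("hospital_id", "code", "payer")


def validate_rows(rows):
    """Return a list of human-readable problems found in a batch of row dicts."""
    problems = []

    # Rule 1: no two rows may share the same (assumed) key.
    key_counts = Counter(tuple(r.get(f) for f in KEY_FIELDS) for r in rows)
    for key, count in key_counts.items():
        if count > 1:
            problems.append(f"duplicate key {key} appears {count} times")

    for i, row in enumerate(rows):
        # Rule 2: every required field must be present and non-empty.
        for field in REQUIRED_FIELDS:
            if row.get(field) in (None, ""):
                problems.append(f"row {i}: missing {field}")

        # Rule 3: a price, when present, must be a positive number.
        price = row.get("price")
        if price is not None and (
            not isinstance(price, (int, float)) or price <= 0
        ):
            problems.append(f"row {i}: non-positive or non-numeric price {price!r}")

    return problems


sample = [
    {"hospital_id": "H1", "code": "99213", "payer": "Acme Health", "price": 120.0},
    {"hospital_id": "H1", "code": "99213", "payer": "Acme Health", "price": 95.0},
    {"hospital_id": "H2", "code": "", "payer": "Acme Health", "price": -5},
]

for problem in validate_rows(sample):
    print(problem)
```

Checks like these are cheap to run on every PR regardless of its size, which is exactly where manual spot-checking starts to break down.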
Why Go Through the Weeds?
Third, participants are not strongly incentivized to avoid mistakes or to clean them up. Instead, they maximize their payout by moving quickly, with little concern for missteps. From their perspective, what more is there than passing PR review? Going back to weed out errors is tedious and offers little reward. It’s far more lucrative to seek out a new, hefty chunk of unwrangled data than to refine what’s already been gathered. We’re considering paid issues as a way to mitigate this problem.
A New Slice of Pie
After a few weeks, participants have a rough idea of the portion of the reward they will receive, and it becomes increasingly hard to move the needle. Having multiple rounds clears the scoreboard and renews interest with a new pie to divvy up. To me this is the most compelling reason to have multi-round bounties: it gives latecomers a chance at a significant prize and invigorates the long-haulers.
If you’re interested in participating in this bounty, check the repository page and the original how-to contribute blog post, and don’t forget to pop in and say hi in our Discord.