Dolt is the world's first SQL database with Git-style
version control. One of our more popular use cases is assembling large datasets
with an army of volunteers, like with our medical pricing
bounties. The same
people who care about these large datasets typically use Python data tools
(Pandas, PyTorch, etc.) to analyze them. So we decided to travel to one of the
largest meetups of Python data science practitioners to see what we could learn.
PyData NYC took place at the Microsoft Times Square office, which is a pretty
incredible location for sightseeing and transit. The Microsoft office seemed
pretty deserted, like a lot of tech offices where remote work has become the norm
since Covid. But that was great for the purposes of the conference: more space.

Data version control talks

We were at PyData NYC to preach the gospel of data version control, and to
introduce the version control features that Dolt provides. My talk was about how
to use these to implement reproducibility in data science. The importance of
version control for data is something we've talked about several times
before, but the data science
community in general hasn't really adopted this practice, which is why the main
part of my talk was a high-level overview of why it's important in the first
place. You can watch it below.
The talk generated a lot of interest and tons of questions from the
audience. The data science world is definitely curious about what version
control tools have to offer their discipline, even if that curiosity hasn't
translated into widespread adoption yet.
To the extent that data science as a field has adopted data version control,
practitioners are mostly using a product called (appropriately) Data Version
Control, or DVC. A data scientist working on ML drug
discovery named Estefania Barreto-Ojeda gave an excellent overview of how to use
DVC in ML pipelines. Like me, she spent a good portion of her talk just making
the case for why the practice of version control matters at all. Unlike me, she
is actually a data scientist with first-hand experience using these tools in
production. Her talk is definitely worth watching for anyone in data science.
I got a chance to talk to Estefania after her session to compare notes about DVC
vs. Dolt. Her perspective is that the things Dolt can do that DVC can't (namely
diffing between versions of a dataset, merging two branches together, and
reducing storage costs) are mostly beyond the understanding or interest of data
scientists today. The main struggle is to get them to use version control on
their data in the first place, and DVC gets them the most crucial feature,
reproducibility of results, with minimal changes to their workflow. Data version
control is still in its infancy for the data science use case, and we'll keep
revisiting it as the industry continues to mature.

The other talks

As someone who is neither a data scientist nor a Python programmer, I was drawn
to talks that focused less on technical products or details and more on the
real-world results they achieved. Here are a few that I enjoyed.
Thomas Caswell, a scientist at Brookhaven National Lab, gave a very
thought-provoking talk about software development as it relates to scientific
research. Like my talk and Estefania's, his talk had a large component of
evangelism, attempting to get researchers to adopt better practices. In his
case, he wants researchers to approach the software they write more like
reusable library development than one-off scripts. It's a point that applies
to any kind of software development, not just research.
Rohit Supekar from the New York Times data science team presented on how The
Times uses machine learning to increase subscription rates. As someone who
didn't know The Times even had a data science team, much less used sophisticated
ML algorithms to get more subscribers, I found this pretty fascinating. The basic
idea is that the site gives different numbers of free articles per month to
different users in an attempt to nudge them to subscribe. They make the
decision on how many free articles to offer based on a reader's habits and inferred characteristics, and then adjust it up and down based on a prediction of what will maximize subscriptions. It's a very interesting topic and a great presentation.
One of my favorite talks was almost completely non-technical, although it used
data analysis to produce its result. It was called Chasing the Overton Window,
and analyzed the common adage that people tend to get more conservative as they
get older. To spoil the talk a bit, this isn't really true: people's beliefs
stay relatively constant as they age, but society steadily drifts leftwards in
its values over time, making older people more conservative on a relative but
not absolute basis.
PyData puts on conferences around the world
on a regular basis and releases all the videos for free on YouTube, so subscribe
to their channel for a steady stream of Python and data science talks. There's
lots of great content for anyone using Python to analyze data.
PyData was a great opportunity to learn what's happening in the data science
space and how Dolt fits in. We'll be attending other conferences on a
regular basis, so watch this blog for details.
Like the talk? Have some comments about it, or questions about Dolt?
Join us on Discord to talk to
our engineering team and meet other customers.