DoltHub goes to PyData NYC

March 20, 2023


Dolt is the world's first SQL database with Git-style version control. One of our more popular use cases is assembling large datasets with an army of volunteers, like with our medical pricing bounties. The same people who care about these large datasets typically use Python data tools (pandas, PyTorch, etc.) to analyze them. So we decided to travel to one of the largest meetups of Python data science practitioners to see what we could learn.

The conference

PyData NYC took place at the Microsoft Times Square office, which is a pretty incredible location for sightseeing and transit. The Microsoft office seemed pretty deserted, like a lot of tech offices where remote work has become the norm since Covid. But that was great for the purposes of the conference: more space for us!

Data version control talks

We were at PyData NYC to preach the gospel of data version control, and to introduce the version control features that Dolt provides. My talk was about how to use these features to make data science work reproducible. The importance of version control for data is something we've talked about several times before, but the data science community in general hasn't really adopted the practice. That's why the main part of my talk was a high-level overview of why it's important in the first place. You can watch it below.

The talk generated a lot of interest and tons of questions from the audience. The data science world is definitely curious about what version control tools have to offer their discipline, even if that curiosity hasn't translated into widespread adoption yet.

To the extent that data science as a field has adopted data version control, practitioners are mostly using a product called (appropriately) Data Version Control, or DVC. A data scientist working on ML drug discovery named Estefania Barreto-Ojeda gave an excellent overview of how to use DVC in ML pipelines. Like me, she spent a good portion of her talk just making the case for why the practice of version control matters at all. Unlike me, she is actually a data scientist with first-hand experience using these tools in production. Her talk is definitely worth watching for anyone in data science.

I got a chance to talk to Estefania after her session to compare notes about DVC vs. Dolt. Her perspective is that the things Dolt can do that DVC can't (namely diffing between versions of a dataset, merging two branches together, or reducing storage costs) are mostly beyond the understanding or interest of data scientists today. The main struggle is to get them to use version control on their data in the first place, and DVC gets them the most crucial feature, reproducibility of results, with minimal changes to their workflow. Data version control is still in its infancy for the data science use case, and we'll keep revisiting it as the industry continues to mature.
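For readers who haven't seen a dataset diff before, here is a rough idea of what "diffing between versions of a dataset" means at the row level. This is a plain-Python sketch for illustration only: it is not how Dolt or DVC implements anything, and the `diff_rows` helper, the `id` key column, and the sample pricing rows are all made up for this example.

```python
from typing import Any

Row = dict[str, Any]


def diff_rows(old: list[Row], new: list[Row], key: str) -> dict[str, list]:
    """Compare two versions of a table, matching rows on a primary-key column.

    Hypothetical helper for illustration; not part of Dolt or DVC.
    """
    old_by_key = {row[key]: row for row in old}
    new_by_key = {row[key]: row for row in new}

    # Rows whose key appears only in the new version.
    added = [new_by_key[k] for k in new_by_key if k not in old_by_key]
    # Rows whose key appears only in the old version.
    removed = [old_by_key[k] for k in old_by_key if k not in new_by_key]
    # Rows present in both versions whose contents differ, as (before, after).
    modified = [
        (old_by_key[k], new_by_key[k])
        for k in old_by_key
        if k in new_by_key and old_by_key[k] != new_by_key[k]
    ]
    return {"added": added, "removed": removed, "modified": modified}


# Two made-up versions of a small pricing table.
v1 = [
    {"id": 1, "procedure": "x-ray", "price": 120},
    {"id": 2, "procedure": "mri", "price": 900},
]
v2 = [
    {"id": 1, "procedure": "x-ray", "price": 150},
    {"id": 3, "procedure": "ct scan", "price": 700},
]

changes = diff_rows(v1, v2, key="id")
```

Here `changes` reports one added row (id 3), one removed row (id 2), and one modified row (the price change on id 1). A file-level snapshot can tell you that the two versions differ; a row-level diff tells you exactly which records changed.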

The other talks

As a non-data-scientist and non-Python programmer, I was drawn to the talks that focused less on technical products or details and more on the real-world results they achieved. Here are a few that I enjoyed.

Thomas Caswell, a scientist at Brookhaven National Lab, gave a very thought-provoking talk about software development as it relates to scientific research. Like my talk and Estefania's, his talk had a large component of evangelism, attempting to get researchers to adopt better practices. In his case, he wants researchers to approach the software they write more like reusable library development than one-off scripts. It's a point that's applicable to any kind of software development, not just research, and well worth watching.

Rohit Supekar from the New York Times data science team presented on how The Times uses machine learning to increase subscription rates. As someone who didn't know The Times even had a data science team, much less used sophisticated ML algorithms to get more subscribers, I found this pretty fascinating. The basic idea is that the site gives different numbers of free articles per month to different users in an attempt to nudge them to subscribe. They decide how many free articles to offer based on a reader's habits and inferred characteristics, and then adjust that number up or down based on a prediction of what will maximize subscriptions. It's a very interesting topic and a great presentation.

One of my favorite talks was almost completely non-technical, although it used data analysis to produce its result. It was called Chasing the Overton Window, and analyzed the common adage that people tend to get more conservative as they get older. To spoil the talk a bit, this isn't really true: people's beliefs stay relatively constant as they age, but society steadily drifts leftwards in its values over time, making older people more conservative on a relative but not absolute basis.

PyData puts on conferences around the world on a regular basis and releases all the videos for free on YouTube, so subscribe to their channel for a steady stream of Python and data science talks. There's lots of great content for anyone using Python to analyze data.

Conclusion

PyData was a great opportunity to learn what's happening in the data science space and how Dolt fits in there. We'll be attending other conferences on a regular basis, so watch this blog for details.

Like the talk? Have some comments about it, or questions about Dolt? Join us on Discord to talk to our engineering team and meet other customers.
