Dolt + KAPSARC: DoltLab in Production

Dolt is a version controlled SQL database. How would you use such a thing?

Does your organization manage a lot of disparate data from a number of different sources? Do you want to track who or what changed your data? Are people stomping on each other's changes? Do you want human review of data changes?

The King Abdullah Petroleum Studies and Research Center, or KAPSARC for short, chose Dolt to solve these problems for their researchers. KAPSARC uses DoltLab in conjunction with Dolt to get lineage, collaboration via branches, and human review via Pull Requests on some of their most important datasets. In this blog, we'll tell their story.

KAPSARC

KAPSARC is a think tank based in Riyadh, Saudi Arabia. Founded in 2007, it focuses on global energy economics and sustainability.

As part of their larger charter, KAPSARC researchers build and publish data models. The 2022 Annual Report provides a good sample of KAPSARC's goals, research, and publications. Some recent research highlights include modeling emissions in the presence of carbon pricing, a survey on energy efficient and sustainable cities, and, of course, an oil market outlook.

KAPSARC researchers collect data from a number of sources. KAPSARC hosts an extensive data portal from which you can get an idea of the extent of the data they ingest and maintain. Managing data from disparate sources is a challenge.

KAPSARC employs a number of best-of-breed tools to facilitate data collection and model construction. Researchers use spreadsheets connected to SQL databases, custom data-wrangling scripts in languages like Python, and data visualization tools like Metabase.

How Dolt helps KAPSARC

KAPSARC was an early believer in data version control. KAPSARC wanted a commit log to keep track of data lineage, data branches to allow for asynchronous collaboration between researchers, and human review of data changes. They wanted these features so much that they initially set out to build such a tool themselves.

In 2021, after realizing that building version control for data was a large technical undertaking, they discovered Dolt was under active development. Dolt provided all the versioning features they were trying to build themselves, at scale, with free and open source code.

Pavithra Kumar Shetty, Lead Developer at KAPSARC, said, "When we discovered Dolt here at KAPSARC, we realized it could replace our own data versioning solution development."

Data Lineage

Data lineage is metadata attached to data that tells you when the data was last updated, by whom or what, and why. In software, this information is often packaged in the form of a commit log and line-based differences of files. In Dolt, because the target of versioning is tables, data lineage is shown via a commit log and cell-based diffs. Using Dolt, you can tell when each cell changed and who changed it, all the way back to the inception of the database. In effect, you get an audit log of every cell.
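To make "cell-based diff" concrete, here is a minimal Python sketch of the idea. It is not Dolt's implementation and does not use Dolt's API. It assumes a table can be modeled as a dictionary from primary key to row, and the `cell_diff` helper and the sample data are hypothetical, invented purely for illustration.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]
Table = Dict[Any, Row]  # primary key -> {column: value}


def cell_diff(old: Table, new: Table) -> List[Tuple[Any, str, Any, Any]]:
    """Return (primary_key, column, old_value, new_value) for each changed cell.

    A cell that is missing on one side is reported with None on that side.
    """
    changes = []
    for key in sorted(set(old) | set(new)):
        old_row = old.get(key, {})
        new_row = new.get(key, {})
        for column in sorted(set(old_row) | set(new_row)):
            before_value = old_row.get(column)
            after_value = new_row.get(column)
            if before_value != after_value:
                changes.append((key, column, before_value, after_value))
    return changes


# Hypothetical data: two versions of a small table keyed by country code.
before = {
    "SA": {"country": "Saudi Arabia", "demand": 100},
    "AE": {"country": "UAE", "demand": 50},
}
after = {
    "SA": {"country": "Saudi Arabia", "demand": 105},
    "AE": {"country": "UAE", "demand": 50},
    "KW": {"country": "Kuwait", "demand": 20},
}

for change in cell_diff(before, after):
    print(change)
```

Unlike a line-based diff, the result points at the exact row and column that changed. Attach an author, a timestamp, and a message to each pair of versions and you have the raw material for a per-cell audit log.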

KAPSARC wanted data lineage for model reproducibility and explainability. For reproducibility, KAPSARC could rebuild any model from the exact inputs it was originally built with. For explainability, if one model performed better than another, a researcher could easily tell what data was different in the better performing model.

Data Branches

Multiple researchers may work on the same dataset or model at the same time. Without isolation, this situation generates conflicting edits. At KAPSARC, Dolt allows multiple researchers to work on separate data branches and merge their changes back to the main copy when they are ready. Moreover, multiple forecasts can live on different branches, allowing for fast comparison of results.
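The merge step can be pictured as a three-way comparison against the common ancestor of the two branches. The Python sketch below illustrates that general technique at the cell level. It is a simplified model rather than Dolt's merge algorithm: the `merge_tables` function and the forecast data are hypothetical, and a missing cell is treated the same as a NULL.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]
Table = Dict[Any, Row]  # primary key -> {column: value}


def merge_tables(
    base: Table, ours: Table, theirs: Table
) -> Tuple[Table, List[Tuple[Any, str]]]:
    """Three-way merge of two branches against their common ancestor.

    Returns the merged table and a list of (primary_key, column) conflicts.
    A conflict is a cell both branches changed to different values. In that
    case our value is kept as a placeholder until someone resolves it.
    """
    merged: Table = {}
    conflicts: List[Tuple[Any, str]] = []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        base_row = base.get(key, {})
        our_row = ours.get(key, {})
        their_row = theirs.get(key, {})
        row: Row = {}
        for column in sorted(set(base_row) | set(our_row) | set(their_row)):
            b = base_row.get(column)
            o = our_row.get(column)
            t = their_row.get(column)
            if o == t:
                value = o  # both sides agree
            elif o == b:
                value = t  # only their branch changed this cell
            elif t == b:
                value = o  # only our branch changed this cell
            else:
                conflicts.append((key, column))
                value = o
            if value is not None:
                row[column] = value
        if row:
            merged[key] = row
    return merged, conflicts


# Hypothetical forecast table keyed by year.
base = {2030: {"demand": 100, "price": 70}}
ours = {2030: {"demand": 110, "price": 70}}  # we revise demand
theirs = {
    2030: {"demand": 100, "price": 75},  # they revise price
    2031: {"demand": 120, "price": 72},  # and add a new year
}

merged, conflicts = merge_tables(base, ours, theirs)
print(merged)
print(conflicts)
```

In this example the two researchers touched different cells, so the merge keeps both edits and reports no conflicts. Had both changed the 2030 demand to different values, that cell would be flagged for a human to resolve.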

Data branches provide change isolation for researchers without causing ballooning storage costs. Thanks to Dolt's novel storage engine, data that is shared between branches is only stored once.
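One general way to get "stored once" behavior is content-addressed storage: data is split into chunks, each chunk is stored under the hash of its contents, and identical chunks on different branches resolve to the same entry. The Python sketch below is a toy model of that idea only. The `ChunkStore` class, the `snapshot` helper, and the fixed-size chunking are assumptions made for illustration, and Dolt's actual storage engine is considerably more sophisticated.

```python
import hashlib
import json
from typing import Any, Dict, List


class ChunkStore:
    """Toy content-addressed store: each chunk is kept under its SHA-256 hash."""

    def __init__(self) -> None:
        self.chunks: Dict[str, bytes] = {}

    def put(self, rows: List[Any]) -> str:
        data = json.dumps(rows, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(data).hexdigest()
        # Identical content hashes to the same key, so it is stored only once.
        self.chunks.setdefault(digest, data)
        return digest


def snapshot(store: ChunkStore, table: Dict[int, Dict[str, Any]],
             chunk_size: int = 2) -> List[str]:
    """Store a table as fixed-size chunks of rows and return the chunk hashes."""
    keys = sorted(table)
    return [
        store.put([[key, table[key]] for key in keys[i:i + chunk_size]])
        for i in range(0, len(keys), chunk_size)
    ]


store = ChunkStore()

# Hypothetical table on the main branch.
main = {1: {"v": 10}, 2: {"v": 20}, 3: {"v": 30}, 4: {"v": 40}}
# A branch that changes a single row.
branch = dict(main)
branch[4] = {"v": 45}

main_chunks = snapshot(store, main)
branch_chunks = snapshot(store, branch)

shared = set(main_chunks) & set(branch_chunks)
print(len(store.chunks), "chunks stored,", len(shared), "shared by both branches")
```

Two full copies of the table would need four chunks. Here only three are stored, because the unchanged rows hash to a chunk both branches already share. The cost of a branch grows with what it changes, not with the size of the dataset.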

Human Review

At KAPSARC, researchers wanted a way to review each other's models and forecasts. Some researchers were familiar with the Pull Request workflow from GitHub and wanted the same workflow for data. DoltLab provided Pull Request capability on premises.

Integrated human review of data changes allows KAPSARC to prevent model errors closer to the source of the change, saving research time and money. A bug caught in review is far less costly than a bug that makes it farther along the development path.

KAPSARC chose DoltLab

Once KAPSARC knew Dolt was the right fit for their collaborative research data, they had a few Dolt flavors to choose from. KAPSARC wanted a GitHub-like web interface, complete with Pull Requests, for their researchers to use. They also wanted the safety and security of hosting their data on-premises. DoltLab was a natural fit.

KAPSARC was an early adopter of DoltLab Enterprise. As part of DoltHub's support, they requested, and we delivered, a number of useful features like private repositories and data transformations on import. KAPSARC leverages DoltLab Enterprise features like Single Sign On and customized look-and-feel.

KAPSARC exposes their public Dolt databases on the internet via their DoltLab deployment. Feel free to check out their public databases!

Pavithra Kumar Shetty loves DoltLab, saying, "DoltLab is easy to deploy and maintain, giving KAPSARC all the power of DoltHub but on our own cloud deployment."

Conclusion

Dolt and DoltLab have been great additions to KAPSARC's data stack. KAPSARC leverages Dolt's data lineage and branching capabilities. KAPSARC chose to add DoltLab for an on-premises Pull Request workflow.

Sound like something that could help you at your company? Join our Discord and let's discuss your use case.
