Doltpy 2.0

7 min read

Background

Earlier in the week we talked about Dolt's "API surface area." To recap, Dolt is a relational database with version control features. Dolt has a SQL query interface implementing the MySQL dialect, as well as a command line interface (CLI) for administration tasks. The CLI also exposes the version control features, which are closely modeled on Git.

Towards the end of that post we identified the value higher level APIs could offer our users. Version controlled database operations can be complex, as managing the Dolt commit graph adds complexity to SQL operations. The goal of a higher level API is to abstract away some of that complexity, so that our users can capture the value of Dolt's version control features at lower cost. Doltpy is a Python API for interacting with Dolt. This post covers a refresh of Doltpy that offers our users higher level interfaces for interacting with Dolt.

Doltpy

Doltpy started out as an internal utility for getting data into Dolt. The original motivation, when Dolt was less mature and harder to use, was simply to write a Pandas DataFrame into a Dolt database.

Internal / Pre 1.0

At first Doltpy was basically a "toolbox" for random operations with Dolt. It started out just writing a Pandas DataFrame into Dolt, and committing it. As we started loading more complicated datasets into Dolt, and wanting to script them, we added more functions. There was essentially no "design". We just implemented stuff we needed to experiment with our database.

Doltpy ended up in the public domain because a customer asked for it. Specifically an early adopter was interested in using Dolt and DoltHub for a data science use-case. They asked us if we had any tools for using Python alongside Dolt. We quickly tidied up the code and published it to PyPi. It wasn't really documented, and it wasn't well thought out, but it solved a customer's problem, so we shipped it.

Doltpy 1.0

As more folks started to use Doltpy, and we started to use it internally for complicated tasks, we started to write more functions to cover the various tasks we were trying to perform. It became increasingly clear that we were designing software by whack-a-mole. We took a step back and tried to evaluate how to meet our internal needs and the needs of our customers, which were closely aligned, with coherent API design.

We made the decision to replicate the Dolt CLI as closely as possible, and provide some basic functions for connecting to running SQL server. This meant that whatever could be done in Dolt CLI, could be done from Python, and the function names and parameters were identical to the extent possible. Since the Dolt CLI is familiar to Git users, the Doltpy "model" was pretty easy to figure out. For example, the following snippet is familiar to a Git user, even if they don't know Dolt:

from doltpy.core import Dolt
from doltpy.core.write import bulk_import

dolt = Dolt.clone('dolthub/great-players-example')
bulk_import(dolt, 'great_players', open('path/to/great_players.csv'), ['id'], 'update')
repo.add('great_players')
repo.commit('Added some great players')

Furthermore, we also recognized that many users did not actually want to use the Dolt CLI. Indeed if they wanted the full Dolt CLI, then why use Python at all? Many Python users were already working with Python tools and data structures, and surfacing the Dolt CLI in Python didn't meet them where they were. We therefore added dolt.core.read and dolt.core.write with tools for users to write Python data structures to Dolt:

def import_list(repo: Dolt,
                table_name: str,
                data: List[Mapping[str, Any]],
                primary_keys: List[str] = None,
                import_mode: str = None):
    ...

def import_dict(repo: Dolt,
                table_name: str,
                data: Mapping[str, List[Any]],
                primary_keys: List[str] = None,
                import_mode: str = None):
    ...

These functions were the start of a "higher level API" for Dolt. They were designed to give users more succinct ways to express common read and write patterns against Dolt from within Python, using data structures they were familiar with.

Doltpy 2.0

While Doltpy 1.0 implemented some higher level interfaces for Python uses, and provided all of Dolt's functionality, it didn't do it in a very intelligent way. In our post earlier in the week we used Postgres and MySQL to show that relational databases provide an "online" SQL interface that assumes the existence of running server process, and a set of administration tools, generally in the shape of shell scripts. Doltpy 1.0 implemented features that implicitly administered the server process in a way that was confusing to the user. For example import_list referenced above started Dolt SQL Server process if one wasn't already running. More than once internal users got confused when they already had a Dolt SQL Server running on port 3306, and tried to write data to the wrong Dolt database. There are other examples, but the underlying point is there was not a clean separation between "offline" administration of Dolt, and "online" operations against the query engine.

To solve for this we decided to rethink the package layout to be reflect a clean separation. We ended up with the following modules:

doltpy/
├── doltpy.cli
├── doltpy.etl
└── doltpy.sql

The goal was to clearly separate online and offline operations against Dolt databases. Our ETL tools remained separate.

doltpy.cli

doltpy.core became doltpy.cli to make clear that this module represents a simple mapping of the Dolt CLI. We doubled down on our original decision to surface the CLI by making this module a more pure CLI-like experience and removing interactions with the Dolt SQL server. We also implemented sub-modules that provided the higher level interfaces we were aiming for without administering any SQL server processes. For example, from doltpy.cli.write:

def write_rows(
    dolt: Dolt,
    table: str,
    rows: List[dict],
    import_mode: Optional[str] = None,
    primary_key: Optional[List[str]] = None,
    commit: Optional[bool] = False,
    commit_message: Optional[str] = None,
    commit_date: Optional[datetime.datetime] = None,
)

While the parameter list is long, this function lets users script relatively succinct write operations with native Python data structures:

from doltpy.cli import Dolt
from doltpy.cli.write import write_rows
dolt = Dolt('path/to/database')
rows = [{'name': 'Roger', 'status': 'GOAT'}, {'name': 'Novak', 'status': 'aspiring-GOAT'}]
write_rows = write_rows(dolt, 'players', rows, commit=True)

Contrast this to the Doltpy 1.0 snippets in the previous section where we used bulk_import or import_list. Both required us to add and commit tables explicitly. While the Git workflow is familiar, it's not the workflow we necessarily want when automating a data pipeline in a high level scripting language. Functions like these wrap up lower level operations into the primitives our users need to incorporate Dolt into their data infrastructure. At the same time doltpy.cli remains available for users that want to script with more granular primitives.

doltpy.sql

Since doltpy.cli was to be a purely "offline" experience, we broke out the SQL functionality into its own package. There were two distinct cases, and we wanted to solve for both:

  • a user has a Dolt instance running locally, or on a remote host, and wants to connect to it to execute SQL queries
  • a user wants their Python program to manage the Dolt SQL Server process as a child process to facilitate their operations

Once our Python process has a connection, both cases are basically identical. With this in mind we implemented a class hierarchy as follows:

DoltSQLContext
DoltSQLEngineContext
DoltSQLServerContext

DoltSQLEngineContext

This class assumes the existence of Dolt SQL Server, and takes a ServerConfig that tells it how to find connect to that server. It creates a SQL Alchemy connection, and then exposes the methods of DoltSQLContext which are executed using that engine instance. Repeating the example from doltpy.cli, and assuming Dolt is running on port 3306 of localhost, and the database is called great_players:

from doltpy.sql import DoltSQLEngineContext, ServerConfig
dsec = DoltSQLEngineContext('great_players', ServerConfig())
rows = [{'name': 'Roger', 'status': 'GOAT'}, {'name': 'Novak', 'status': 'aspiring-GOAT'}]
dsec.write_rows('players', rows, commit=True)

DoltSQLServerContext

What about if we want to use the SQL query interface, but we want Doltpy to manage the server process? In that case our Dolt instance needs to be local, because we use doltpy.cli under the hood:

from doltpy.sql import DoltSQLEngineContext, ServerConfig
dolt = Dolt('path/to/database')
rows = [{'name': 'Roger', 'status': 'GOAT'}, {'name': 'Novak', 'status': 'aspiring-GOAT'}]
with DoltSQLSeverContext(dolt, ServerConfig()) as dssc:
    dssc.write_rows('players', rows, commit=True)

Note that we are mirrored the offline doltpy.cli.write function import_rows almost exactly. Both of these write paths do something similar: they write a native Python data structure to Dolt database, and associated a commit with that write. This is the primitive we are providing our users, but we realize that some of them will be using Dolt "offline" more like a VCS for data, whereas others will be using it as a database with version control features. In both cases, the user wants the same primitve.

doltpy.etl

Our ETL package remains largely unchanged, though with some minor code quality improvements. Additionally there was some refactoring required to work with doltpy.cli and doltpy.sql, since we made material breaking changes to those interfaces.

Ignoring Doltpy

Some users might which to interact with a running Dolt instance without going through Doltpy, and simply connect using a MySQL connector of their choice. Doltpy is totally optional, and merely reflects our accumulated knowledge from using Dolt in Python, and what we our users have told us. For those that want to go in their own direction, Dolt is a general purpose database which makes that totally possible.

Conclusion

Dolt's CLI provides administration tools for managing your Dolt databases, and interacting with the version control system. Dolt's SQL query interface provides a familiar query interface, and exposes Dolt's version control tools. Doltpy provides users with full access to these interfaces from a high level scripting language widely adopted in data engineering contexts. Additionally Doltpy provides higher level interfaces for common operations such as writing and committing tables, whether they want to be done offline, or offline.

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt