Dependency Management in Database Design

Codebase organization and design is a vital skill, and one I was never taught in school and had to pick up in the industry. I suspect it wasn’t taught for two reasons:

  1. Codebase design is a soft skill. Even when you can recognize a well-structured codebase, opinions differ on how to get there. It’s more art than science. Just look at the long list of incompatible practices on Wikipedia’s List of software development philosophies.
  2. The importance of codebase architecture is much more apparent with larger, more complex code bases. Most of the code written in academia may never reach the level of complexity where good organization matters.

Our project, Dolt, definitely meets that threshold where organization matters. Dolt is the first SQL database with Git-style branches; it’s revision control for your data. And we’ve put a lot of work into it: the Dolt code repository contains 762k lines of Golang code (excluding generated files), broken up into 204 different packages. We also make go-mysql-server, the SQL engine used by Dolt which itself consists of 475k lines of Golang code in 59 different packages. The code is ten years old.

That’s way more code than any one person can keep in their head. Breaking it up into packages helps, but how did we determine what the packages should be, and what code goes in what package?

For any project big enough, you’re going to want packages. Not just because they speed up your compilation time and allow for partial recompiles, but because organizing code this way inherently leads to cleaner projects that are more easily understood. The act of breaking code up into components is called modularization, and it’s an important part of code architecture. When talking about modularization, we call these individual components modules.1

A core principle of modern software design is the “single responsibility principle”, which states that each module has only one responsibility, and complex behavior is achieved by composing these modules. The intent is that if each module can be understood completely independently from the others, then modules can be developed in parallel with minimal risk of changes in one module breaking the behavior of other modules.
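As a toy illustration of that composition (nothing Dolt-specific; the example.com/toy module path and package names below are made up), here are two single-responsibility packages and a main package that composes them:

// wordcount/wordcount.go (sole responsibility: counting words)
package wordcount

import "strings"

// Count returns the number of whitespace-separated words in s.
func Count(s string) int { return len(strings.Fields(s)) }

// report/report.go (sole responsibility: formatting a labeled number)
package report

import "fmt"

// Line formats a label and a count for display.
func Line(label string, n int) string { return fmt.Sprintf("%s: %d", label, n) }

// main.go (composes the two packages; neither needs to know about the other)
package main

import (
	"fmt"

	"example.com/toy/report"
	"example.com/toy/wordcount"
)

func main() {
	n := wordcount.Count("the quick brown fox")
	fmt.Println(report.Line("words", n)) // prints "words: 4"
}

Neither package imports the other, so either one can change internally without breaking its neighbor.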

At a glance, there appear to be two main tradeoffs to modularization, although these “tradeoffs” are usually upfront costs that are outweighed by the benefit of making code maintenance significantly easier:

  1. Modularizing your codebase requires forethought and slightly increases the overall complexity of the codebase. As a project grows, assumptions made when designing the structure of the code may prove to be incorrect, resulting in either modules with multiple responsibilities that are difficult to understand, or “leaky” abstraction modules that require knowledge of their inner workings in order to use correctly. When this happens the code may need to be restructured.
  2. Modularizing your code introduces the threat of dependency cycles: if module A depends on code symbols from module B, and module B depends on module C, then C should not depend on A. And while dependency cycles are often simple enough to untangle in theory, understanding their causes in complex code bases can be a challenge, and fixing them may require tedious refactors. In many languages, including Golang, dependency cycles between modules won’t even compile.2
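To make that second point concrete, here is a minimal two-package sketch that the Go compiler rejects outright (the example.com/cycle module path is made up):

// a/a.go
package a

import "example.com/cycle/b"

func Greet() string { return "a sees " + b.Name() }

// b/b.go
package b

import "example.com/cycle/a" // compile error: import cycle not allowed

func Name() string { return a.Greet() }

The compiler refuses to build either package until one of the imports is removed, no matter how innocent each individual reference looks.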

It’s very easy for someone who doesn’t understand the layout of a codebase to accidentally introduce dependency cycles and then have trouble removing them. And dependency cycles are especially frustrating for devs because they feel like a barrier to writing clean code. In the moment, it can feel like modularization is making development harder. But it’s important to remember that:

  • Without modularization, a developer who doesn’t understand the entire codebase might not be able to contribute at all.
  • While the code that is creating the dependency cycle feels simple and clean, it’s actually introducing a new relationship between code components that will make them difficult to separate in the future.
  • Breaking a dependency cycle is often much simpler once the developer understands the responsibilities of the different components involved and their relationship to each other, and modularization makes that understanding a lot easier.

This is best demonstrated by an example. It’s a real contribution I made to Dolt where:

  • The modularization of the codebase allowed me to develop features for one component without needing to fully understand the details of other related components.
  • I was temporarily stymied by a dependency cycle.
  • Identifying the best way to resolve the dependency cycle took time, but left me with a better understanding of how different components were connected, making it time well spent.
  • Armed with this better understanding, the solution became simple.

Case Study: Foreign Key Validation

We recently added support for a feature we call nonlocal tables: essentially, a user can configure one branch such that certain table names actually resolve to a table on another branch. The core functionality was easier to implement than expected. Next we added the ability for branches to have foreign key constraints on these tables, which proved to be more challenging. We expected foreign keys to have some odd behavior here, since changes on the referenced branch could cause these constraints to become violated. Version control commands can already create similar situations, so Dolt already has a tool for circumstances where it’s not possible to prevent violations: the DOLT_VERIFY_CONSTRAINTS system procedure and the associated dolt constraints verify CLI command, which detect violations after the fact.

This command reads the database storage layer and determines whether any foreign key constraints on your branch are being violated. The logic here is much simpler than the rest of the feature, and it was straightforward enough that we didn’t give it much thought during design: once we had the ability to correctly resolve table names, all we had to do was allow the validation logic to depend on the name resolution logic. It should have been a one-line change.

And yet, figuring out how to properly expose the name resolution logic to the validator turned out to be the hard part. But why?

Well, let’s look at the package structure for Dolt. The logic for executing dolt constraints verify makes use of the following packages, among others:

  • go-mysql-server/sql - A collection of primitive types and interfaces required for running a database. Many of the types we use are defined here. This package is a great example of a common abstraction that other modules can depend on without needing to depend on each other. The tradeoff is that nothing in this package can depend on any of the modules that make use of it, which means there are still lots of interfaces that can’t live here.
    • This package contains a sql.Table interface, describing a table that a database engine can interact with.
  • dolt/go/libraries/doltcore/doltdb - This defines the core types that power Dolt’s data structures, and defines core operations on these types.3 The logic for validating foreign keys is defined in this package.
    • This package contains a doltdb.Table type, representing a table in storage. This type alone does not have the necessary context to be used by an engine, so it cannot implement sql.Table.
  • dolt/go/libraries/doltcore/sqle/dsess - This contains the logic responsible for maintaining the current state of a database session, including transactions.
  • dolt/go/libraries/doltcore/sqle - This implements a SQL engine on top of the storage layer. Evaluating references to other branches happens here, because the result of the evaluation depends on the current transaction; resolving them anywhere else could introduce concurrency issues.
    • This package contains a sqle.DoltTable type, which implements sql.Table. It can be constructed from a doltdb.Table.
  • dolt/cmd/dolt/commands/cvcmds - The implementation of the command line command for validating constraints, including foreign keys.

[Figure: a graph of the relationship between these packages and their classes]

Something else I was never taught in school: how to make proper UML diagrams.

These five packages form a clean chain of dependencies: each package depends on every package listed above it, and none of the packages below it. And it means that when developing any of these packages, as long as you don’t change the behavior of its exported functions, you can safely ignore all the packages below it.
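As a rough illustration of that chain, here is a deliberately pared-down sketch. The package names mirror the list above, but the method sets and fields are reduced to almost nothing; the real interfaces in go-mysql-server and Dolt are far larger.

// sql/table.go (corresponds to go-mysql-server/sql: the shared abstraction layer)
package sql

type Table interface {
	Name() string
	// ...plus schema, partitions, row iteration, and more in the real interface
}

// doltdb/table.go (corresponds to doltcore/doltdb: storage types, no knowledge of the engine)
package doltdb

type Table struct {
	// ...storage-level state only; not enough context to implement sql.Table on its own
}

// sqle/table.go (corresponds to doltcore/sqle: the engine layer, which imports both packages above)
package sqle

type DoltTable struct {
	tbl *doltdb.Table // the wrapped storage table
	// ...plus the session and schema context the engine needs
}

func (t *DoltTable) Name() string { return "..." }

var _ sql.Table = (*DoltTable)(nil) // the engine type satisfies the shared interface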

So let’s look back at the thing that we thought would be a one-line fix:

“all we had to do was allow the validation logic to depend on the name resolution logic”

We can now see that there are three separate problems with this proposal:

  • We can’t modify the validation logic in doltdb to depend on the new table name resolution logic… because the table name resolution logic depends on branch reference resolution, which is implemented in sqle. This is a dependency cycle.
  • The command itself is implemented in the top level package cvcmds, and is thus allowed to depend on everything. The engine has public functions that can resolve branch names, but those functions have parameters that the command couldn’t provide, because some of that context is encapsulated by the sqle package.
  • Finally, while both of these packages have methods for interacting with tables, the types they use to represent a table have different shapes and different responsibilities, and they don’t implement a common interface.

Again, it may feel like modularization is getting in our way by preventing us from calling functions or accessing state that we need. But the package layout also makes it clear that even if we could simply glue these two components together, doing so would expose their internal state to each other in a way that could be complicated to refactor later. It’s worth putting in the extra legwork now to avoid this, and doing so might even suggest ways to keep the code readable.

So given all that, what’s the cleanest way to solve these problems?

  • Could we break the cycle by cleaving off some part of the sqle package, and then having both the engine and storage layers depend on this new package? Probably not: the functionality we’re trying to isolate depends on the session management code in the dsess package, so separating it out would prevent one cycle but create another.
  • Perhaps instead, we provide a way to resolve branch references without needing access to the current transaction? Then we could put all the branch resolution code in the doltdb storage layer. This could probably be done, but it would be a major change and would need to be done very carefully. We’d have to duplicate some of the lookup logic in the engine and in storage, it would be tricky to get right, and the cost of getting it wrong could be subtle concurrency bugs: no database wants that.

Instead, the best way to avoid dependency cycles is to have both modules depend on a common “abstraction”. Usually this means an interface type. Interfaces are a great tool when you have a simple problem statement, you already know how to solve that problem, and you’re just trying to avoid introducing new dependencies.

We have a simple problem statement: resolve a table name to a table, using the new rules for referencing tables on other branches. And we already have code that solves that problem. But the logic that requires that code cannot depend on it. So dependency inversion says: depend on an abstraction. And we accomplish that in three easy steps.

Step 1: The low-level package creates an interface that describes the shape of the operation we need.

We define a new interface, TableResolver, in the doltdb package:

// TableResolver allows the user of a DoltDB to configure how table names are resolved on roots.
// This is useful because the user-backed system table dolt_nonlocal_tables allows table names to resolve to
// tables on other refs, but sqle.Database is necessary to resolve those refs.
type TableResolver interface {
	ResolveTable(ctx *sql.Context, root RootValue, tblName TableName) (table *Table, found bool, err error)
	ResolveTableCaseInsensitive(ctx *sql.Context, root RootValue, tblName TableName) (trueTableName TableName, table *Table, found bool, err error)
}
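On the doltdb side, the validation code can then accept this interface as a parameter. The function name and signature below are hypothetical, not the actual Dolt API; the important property is that doltdb only ever sees the TableResolver interface, never the sqle implementation behind it.

// Hypothetical sketch of a doltdb validation entry point that consumes the interface.
func ValidateForeignKeys(ctx *sql.Context, root RootValue, resolver TableResolver) error {
	// For each foreign key on the root, resolve the referenced table through
	// the interface, e.g.
	//   parent, found, err := resolver.ResolveTable(ctx, root, fk.ReferencedTableName)
	// then check that every child row has a matching parent row, recording any
	// rows that violate the constraint.
	return nil // real checking logic elided
}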

Step 2: The higher-level package provides an implementation.

In the sqle package, we provide an implementation. This requires adding some new functions to the package:

  • A function that can return the underlying doltdb.Table type used by the storage layer instead of preemptively constructing the higher-level type used by the engine.
  • An exported function that returns a TableResolver value for use by other packages.
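Sketched out, the sqle side might look something like the following. The identifiers here (sessionTableResolver, NewTableResolver, and the db field) are illustrative, not the actual names in the codebase:

// sessionTableResolver implements doltdb.TableResolver using the engine's session
// state, which is what knows how to follow references onto other branches.
type sessionTableResolver struct {
	db Database // engine-level database, carrying the session and transaction context
}

func (r sessionTableResolver) ResolveTable(ctx *sql.Context, root doltdb.RootValue, tblName doltdb.TableName) (*doltdb.Table, bool, error) {
	// Follow any nonlocal-table mapping (possibly onto another branch) using the
	// current transaction, then return the underlying storage table rather than
	// constructing the engine-level DoltTable.
	return nil, false, nil // real resolution logic elided
}

func (r sessionTableResolver) ResolveTableCaseInsensitive(ctx *sql.Context, root doltdb.RootValue, tblName doltdb.TableName) (doltdb.TableName, *doltdb.Table, bool, error) {
	// Same as above, but matching the name case-insensitively and reporting the
	// true table name that was matched.
	return doltdb.TableName{}, nil, false, nil // elided
}

// NewTableResolver is the exported constructor; callers only ever see the
// doltdb.TableResolver interface, never this concrete type.
func NewTableResolver(db Database) doltdb.TableResolver {
	return sessionTableResolver{db: db}
}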

Step 3: The top-level package does dependency injection.

With these changes in place, the top level cvcmds package can get a TableResolver from the engine and pass it as an additional parameter to the relevant storage layer calls. This allows the new name resolution rules to influence storage operations without creating any additional dependencies.
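Put together, the wiring in cvcmds looks roughly like this, reusing the hypothetical names from the sketches above (the real call sites differ):

// Hypothetical wiring inside the command implementation.
resolver := sqle.NewTableResolver(db)                            // ask the engine layer for a resolver
err := doltdb.ValidateForeignKeys(sqlCtx, workingRoot, resolver) // thread it into the storage-layer check
if err != nil {
	return err // or surface the detected violations to the user
}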

As presented, this seems like a simple and obvious solution. And it is a simple solution… but it’s only obvious when viewed in the context of the code’s organization. It’s obvious that this approach won’t create dependency cycles or expose internal state, but we need to understand the package boundaries to see why other, similar-looking approaches would.

The Takeaway

Implementing this feature helped me better understand the exact relationship between the many different packages Dolt uses when performing even simple database operations. It ensured that any changes I made to boundaries between packages were thoughtful and deliberate, and it helped me identify future opportunities to clean up some of these interfaces and make them more usable.

I’m not sure if good codebase architecture can be taught: maybe it can only be learned. And I definitely learned something about Dolt’s design, not just the how but the why. And that lesson is not only going to help me now as I develop Dolt, but also influence any codebase design I may do in the future.

I’m biased, but I think Dolt is a pretty well-designed piece of software. It’s not perfect and it’s had its growing pains, but it has a solid core that’s been fun to work with. Databases have a ton of complexity, and we’ve done a good job of managing that complexity such that we can continue to add cool new features.

If you have a feature you want to see in Dolt, drop us a line on Discord and we’ll scope it out. We take user requests seriously when deciding our priorities. If you’re looking for cool open-source projects to contribute to, we’re always welcoming contributors and are happy to help you get set up. We even have a good first issue tag on GitHub.

That’s all for now.

Footnotes

  1. The term “module” also has a specific definition in Go: code is divided into modules, which are themselves made up of packages. Both Go modules and Go packages are examples of modularization. In this article, I’ll still use the term “module” when talking about modularization as a general concept, but note that all the Go examples are about Go packages, not Go modules.

  2. Some languages allow modules to reference each other in specific circumstances: for instance, C++ allows two source files to reference each other’s header files, as long as the header files don’t reference each other. C++ also allows the use of “incomplete” types whose full definition is only provided at link time. Both of these can be thought of as natural cases of dependency inversion: the header files and/or incomplete type declarations are the abstraction that both modules depend on. But Golang has no equivalent concepts.

  3. This package is only responsible for the data as it exists in memory. On-disk storage is defined in dolt/go/store and its subpackages, and serializing and deserializing that format is the responsibility of dolt/go/libraries/doltcore/doltdb/durable.
