Dolt Implementation Notes — Push And Pull On a Merkle DAG

Dolt is a SQL database with Git-like functionality, including branch, merge, and diff, as well as push and pull to remotes. This is a post in a series about the internal workings of some of the core algorithms that underlie Dolt's implementation. The previous posts in this series were about:

  1. The Prolly-tree, a unique content-addressed index data structure that underlies Dolt's table storage.

  2. The Commit Graph and structural sharing in Dolt's table storage.

  3. Dolt's Diff Implementation

  4. Dolt's Merge Implementation

In this post, we explore the Merkle DAG structure underlying Dolt's commit graph and table storage a little more, and investigate how push and pull to remote repositories is implemented in Dolt.

Overview

A Dolt repository contains any number of branches and tags, where each branch or tag is a top-level reference to a particular commit. A commit, in turn, points to the value of the database at that commit and to zero or more parent commits. All data in the repository, both the commits and the table data, is stored in content-addressed Chunks, which in turn can contain references to other Chunks, forming a Merkle DAG. This was the example of a three-commit branch from our previous blog post on the commit graph:

Dolt Commit Graph

And this was the example of how the data in a single table might be broken down:

Dolt Table Value
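To make that structure concrete, here is a rough Go sketch of the shape of a commit in this DAG. This is illustrative only; the struct and field names are ours, not Dolt's actual serialization format:

// Address is the content hash that identifies a Chunk (this mirrors the
// definition used by the ChunkStore code later in this post).
type Address [20]byte

// commitSketch illustrates what a commit chunk refers to: the root value
// of the database at that commit, plus zero or more parent commits.
// All of these references are addresses of other Chunks.
type commitSketch struct {
    Root    Address   // the database value (tables, schemas) at this commit
    Parents []Address // zero or more parent commits
    Meta    []byte    // commit metadata such as author, message, timestamp
}

A branch or tag is then just a named pointer to one of these commit addresses. Because every reference is a content hash, changing any Chunk changes its address, and with it the address of everything that transitively references it; that property is what makes the structure a Merkle DAG.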

Dolt supports remotes, which means it can clone, push and pull branch and tag references from a Dolt repository stored in DoltHub or in cloud storage like AWS S3 or Google Cloud Storage. This blog post briefly explores what Dolt remotes are and how they operate under the hood.

A Dolt Remote

A Dolt repository can have multiple remote repositories configured, and each of these repositories can be fetched from and pushed to separately. Each configured Dolt remote consists of three pieces of configuration within a Dolt repository:

  1. The name of the remote. After a clone, this will be origin.

  2. Configuration for a network endpoint that implements a protocol Dolt can use as a remote. Most commonly, this is the gRPC API exported by https://doltremoteapi.dolthub.com and a DoltHub repository path like dolthub/image-net.

  3. Default fetch specs for the remote. After a clone, this will be refs/heads/*:refs/remotes/origin/*, and most users never need to interact directly with fetch specs.
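Taken together, the configuration for a single remote might look roughly like the sketch below. The struct and field names are illustrative, not Dolt's actual configuration schema:

// remoteConfig sketches the three pieces of per-remote configuration
// described above; the field names here are hypothetical.
type remoteConfig struct {
    Name       string   // e.g. "origin"
    URL        string   // e.g. "https://doltremoteapi.dolthub.com/dolthub/image-net"
    FetchSpecs []string // e.g. ["refs/heads/*:refs/remotes/origin/*"]
}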

The fetch spec is the subtlest piece of the configuration, but it's also fundamental to the way that remotes actually work. The above fetch spec says: When we fetch from origin, create a new ref in refs/remotes/origin/... for each ref we find in the remote at refs/heads/.... refs/heads/... will be all the branches in the remote, and so fetching from the remote will create corresponding refs in our local repository for each branch in the remote repository. If the remote had the branches main, bh/agi and aaron/by-zip, and we fetched from it, Dolt would create the refs refs/remotes/origin/main, refs/remotes/origin/bh/agi and refs/remotes/origin/aaron/by-zip, each pointing at the corresponding commits that the remote branches were pointing at when we ran the fetch.
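The mapping a fetch spec describes is easy to express in code. Here is a minimal sketch of expanding one remote ref through a fetch spec of the single-* glob form shown above; the helper is ours, not Dolt's, and real fetch-spec handling covers more cases:

import "strings"

// expandFetchSpec maps a remote ref through a fetch spec like
// "refs/heads/*:refs/remotes/origin/*", returning the local tracking ref
// and whether the remote ref matched the spec at all.
func expandFetchSpec(spec, remoteRef string) (string, bool) {
	parts := strings.SplitN(spec, ":", 2)
	if len(parts) != 2 {
		return "", false
	}
	srcPrefix := strings.TrimSuffix(parts[0], "*")
	dstPrefix := strings.TrimSuffix(parts[1], "*")
	if !strings.HasPrefix(remoteRef, srcPrefix) {
		return "", false
	}
	// The part matched by "*" carries over, so "refs/heads/bh/agi"
	// becomes "refs/remotes/origin/bh/agi".
	return dstPrefix + strings.TrimPrefix(remoteRef, srcPrefix), true
}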

So the remote refs namespace is separate from our local branches namespace, and Dolt keeps a local copy of the branches that we fetch from the remote. The only time that copy is updated, and generally the only time we contact the remote at all, is when we run dolt fetch (or dolt pull, which does a fetch and then a merge). And the fundamental operation involved in a fetch is:

  1. Contact the remote to list all the branches we will fetch.

  2. For each branch we will fetch, update our local repository to contain the referenced Commit Chunk and all Chunks reachable from it.

  3. Set a corresponding ref in our local repository to point to the newly fetched Commit Chunk.

Step #2 is where all the missing data actually gets copied into our local repository. Let's take a look at exactly how that happens.

Chunk Stores and DAG Traversals

As mentioned above, all the data in the repository, both Commits and the table data, is stored in content-addressed, variable-sized blocks called Chunks. The Dolt storage layer has a storage abstraction called a ChunkStore: a place where we can read, and potentially write, Chunks.

package chunk

type Address [20]byte

type Chunk interface {
    // All of the chunk addresses that are referenced by this Chunk.
    Refs() []Address
    // The contents of the Chunk.
    Bytes() []byte
}

type Store interface {
    Has(addr Address) bool
    Get(addr Address) Chunk
    Put(contents Chunk)
}

We can create a Store implementation for a remote repository that is hosted on DoltHub, and we have a Store implementation for our local repository as well. A simple recursive DAG walk that copies a given commit (or any Chunk and all of its children) into our local repository looks like:

func CopyChunks(to, from Store, a Address) {
	if !to.Has(a) {
		c := from.Get(a)
		for _, r := range c.Refs() {
			CopyChunks(to, from, r)
		}
		to.Put(c)
	}
}

This approach already has some nice properties. Ideally, if a Chunk is in a Store, all the Chunks it references would also be in the Store. We don't want our algorithm to persist any Chunks to the Store whose children we haven't already fetched and persisted, because if the algorithm gets aborted halfway through the Store could then be in an inconsistent state. So we're careful not to Put the fetched Chunk until its children are persisted.

Better Performance With Batching

In Go, the recursion is not much of a concern because goroutines have on-heap growable stacks. But if it is a concern, it's easy to translate the call stack state into an explicit stack with on-heap state as well.
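For example, the same walk can be written with an explicit stack. This is a sketch against the same Store and Chunk interfaces as above, not code from Dolt; it still only persists a Chunk after all of its children have been persisted:

// CopyChunksIterative does the same copy as the recursive CopyChunks, but
// keeps its own stack of pending Chunks on the heap instead of recursing.
func CopyChunksIterative(to, from Store, a Address) {
	type frame struct {
		chunk Chunk
		next  int // index of the next child reference to visit
	}
	if to.Has(a) {
		return
	}
	stack := []frame{{chunk: from.Get(a)}}
	for len(stack) > 0 {
		i := len(stack) - 1
		refs := stack[i].chunk.Refs()
		if stack[i].next < len(refs) {
			r := refs[stack[i].next]
			stack[i].next++
			if !to.Has(r) {
				stack = append(stack, frame{chunk: from.Get(r)})
			}
			continue
		}
		// Every child of this Chunk has been persisted; now persist it.
		to.Put(stack[i].chunk)
		stack = stack[:i]
	}
}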

The above algorithm has one glaring issue: it's very slow. If from is a remote Store, then every Get is a round-trip RPC, and as written there's no capacity for pipelining or batching. The simplest solution is to make the batching explicit. We can give Store batch methods, and make CopyChunks take a slice of Addresses instead:

type Store interface {
    // HasMany partitions the supplied addresses into ones that are
    // already present in the `Store` and ones that are missing.
    HasMany([]Address) (present, missing []Address)
    // GetMany fetches all addresses from the store and returns the
    // corresponding chunks in order.
    GetMany([]Address) []Chunk
    // PutMany persists the supplied chunks to the store.
    PutMany([]Chunk)
}

func CopyChunks(to, from Store, as []Address) {
	if len(as) == 0 {
		return
	}
	_, missing := to.HasMany(as)
	chunks := from.GetMany(missing)
	nextlevel := []Address{}
	for _, c := range chunks {
		nextlevel = append(nextlevel, c.Refs()...)
	}
	CopyChunks(to, from, nextlevel)
	to.PutMany(chunks)
}

That substantially reduces the number of round trips for a remote clone and allows for better bandwidth utilization. But it introduces two new flaws.

  1. Memory usage is potentially unwieldy. In the batched version, we're holding a potentially large number of Chunks in memory at every call to CopyChunks, and we're making a call to GetMany with an unbounded number of Addresses. Previously we were only holding one Chunk in memory at each level of the call stack.

  2. It potentially fetches the same Chunks from the from Store multiple times. Two paths of different lengths through the DAG to the same Chunk will cause to.HasMany() to report that Chunk as missing more than once, with consequent calls to GetMany() for the same addresses.

Addressing those actually gets somewhat complicated. In Dolt, for #1 we form explicit RPC batches and write the fetched Chunks to temporary chunk files so that they don't have to stay in memory. The temporary files are constructed so that they can be cheaply integrated into the to Store once all their dependencies are persisted. Addressing #2 involves adding a little bit of bookkeeping across the recursive calls. But perfect behavior with regard to #2 is actually a tradeoff between batch sizes and memory usage: once a chunk fetched at a higher level has left memory and gone into a temporary chunk file, it has to be refetched from the remote or reread from disk if it needs to be incorporated into the to Store earlier than its peers.
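As a rough illustration of the bookkeeping for #2, the batched walk can carry a set of addresses it has already requested, so that a Chunk reachable along several paths is only fetched from the remote once per walk. This is a sketch on the simplified interfaces above, not Dolt's implementation, and it also illustrates the tradeoff: a deduplicated Chunk is only persisted at the level where it was first fetched, so the strict children-before-parents ordering is relaxed for it unless the Chunk is kept in memory or refetched later.

// copyChunksDedup is the batched CopyChunks plus one piece of bookkeeping:
// a set of addresses already requested on this walk, so a Chunk reachable
// along several different paths is only fetched from the remote once.
func copyChunksDedup(to, from Store, as []Address, requested map[Address]bool) {
	if len(as) == 0 {
		return
	}
	_, missing := to.HasMany(as)
	toFetch := make([]Address, 0, len(missing))
	for _, a := range missing {
		if !requested[a] {
			requested[a] = true
			toFetch = append(toFetch, a)
		}
	}
	chunks := from.GetMany(toFetch)
	nextlevel := []Address{}
	for _, c := range chunks {
		nextlevel = append(nextlevel, c.Refs()...)
	}
	copyChunksDedup(to, from, nextlevel, requested)
	to.PutMany(chunks)
}

A caller would start the walk with an empty set, e.g. copyChunksDedup(to, from, roots, map[Address]bool{}).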

Fetch and Push

It's neat that the above algorithm works equally well whether the operation is a push or a fetch. If the remote Store is from, we're doing a fetch, and we will get new chunks from the remote into our local Store. But we can also make the remote Store to, in which case it's a push: the remote Store receives the new Chunks that were unique to our local Store. In either case, once all the Chunks are persisted in the to Store, we can update any refs appropriately, setting them to point to the newly persisted commit Chunks based on the operation we're performing and the fetch/push specs as appropriate.
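In code, the difference between the two operations is just which Store is the destination of the copy, followed by a ref update on that side. Here is a hedged sketch using the single-address CopyChunks from above, with a hypothetical SetRef helper standing in for the real ref-update machinery:

// RefStore is a Store that can also update named refs; SetRef here is a
// hypothetical stand-in, not Dolt's actual API.
type RefStore interface {
	Store
	SetRef(name string, commit Address)
}

// Fetch copies the commit at addr from the remote into the local Store,
// then points a tracking ref (e.g. refs/remotes/origin/main) at it.
func Fetch(local RefStore, remote Store, trackingRef string, addr Address) {
	CopyChunks(local, remote, addr)
	local.SetRef(trackingRef, addr)
}

// Push is the same walk with the Stores swapped: copy the commit into the
// remote, then update the remote branch ref (e.g. refs/heads/main).
func Push(local Store, remote RefStore, branchRef string, addr Address) {
	CopyChunks(remote, local, addr)
	remote.SetRef(branchRef, addr)
}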

Conclusion

Dolt remotes are a powerful feature that allows for data synchronization, collaboration, and easily maintaining local changes while tracking an upstream. Underlying the feature is a simple model for how to build up the commit graph and the table data as a Merkle DAG of content-addressed chunks. Building on top of that model allows push and fetch to both be implemented by the same elegant DAG walk. At the same time, practical engineering and performance concerns introduce opportunities for tradeoffs and optimization. Hopefully we've given you some insight into the ways that Dolt approaches and solves the issues underlying its implementation.
