Supporting Git remotes as Dolt remotes

We recently made changes in Dolt to support pushing databases to a Git remote. This makes it possible to push your Dolt database to remotes like GitHub, for example, so your version-controlled SQL database can use the same repository as your source code.

We built this feature to better support the transition of Beads and Gastown from SQLite to Dolt, by ensuring existing users of those tools could continue using their Git remotes to sync their data, without having to provision new credentials for a traditional Dolt remote like DoltHub or DoltLab.

I wrote the announcement blog last week, which covers how to get up and running using a Git remote with a Dolt database, so check that out if you haven’t seen it yet. In today’s post I’ll do a technical deep dive into how we actually pulled this feature off.

A bit about remotes#

Before supporting Git remotes as Dolt remotes, Dolt supported three categories of remotes: filesystem-based remotes, cloud-based remotes, and our full-service remote products, DoltHub and DoltLab.

Filesystem remotes allow a location on disk to act as a Dolt remote. During a remote write operation like dolt push, Dolt writes the database data to this location on disk. When a remote read operation is performed, like dolt fetch, remote data is read from this location on disk back to the Dolt client.

Cloud-based remotes use cloud-based storage, typically an object or blob store, to persist Dolt remote database data. These function similarly to filesystem-based remotes, only the underlying storage is not a filesystem but, for instance, an S3 bucket. These remotes also require users to supply cloud credentials with the proper permissions to read from and write to them. Currently, Dolt supports AWS, GCP, OCI, and Azure remotes.
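For reference, configuring one of these cloud-backed remotes looks something like the following; the table, bucket, and database names here are illustrative:

# Add an S3-backed remote using Dolt's aws:// URL scheme
$ dolt remote add origin aws://[my-dynamo-table:my-s3-bucket]/my-database

# Push to it like any other remote
$ dolt push origin main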

DoltHub and DoltLab are our remote products. These are built atop the filesystem and cloud-based remote primitives, but add a suite of rich features on top of these implementations, including a pull-request-style workflow like the one GitHub and GitLab provide for Git repositories.

At a high level, Dolt writes two types of files to a remote: immutable tablefiles, which contain all database data for the history of the database, and the manifest, a mutable file that contains references to all relevant tablefiles, meaning those that contain all reachable database data. Some tablefiles may exist in remote storage that no longer contain reachable data. These are considered garbage and are safe to delete.

Remote data organization

The diagram above shows how Dolt organizes data on a remote. The cloud represents remote storage containing the files as they might appear in cloud storage. The manifest file is shown along with a set of hash-named tablefiles. The arrows pointing from the manifest to the right show the manifest’s contents. Here you can see it contains important information like its version, storage format, lock hash, root hash, and garbage collection generation, along with references to each tablefile and their corresponding chunk count. The arrow pointing from the last tablefile to the right shows its contents which reveals the seven chunks of binary data it contains. The Dolt database below the cloud represents a Dolt client that pushes and pulls to this remote.
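To make that concrete, here’s a rough, illustrative rendering of the kind of information the manifest tracks. The field names below are descriptive placeholders, not Dolt’s actual on-disk encoding:

# Illustrative manifest contents (not Dolt's actual serialization)
version:          <manifest version>
storage format:   <storage format>
lock hash:        <hash>
root hash:        <hash>
gc generation:    <hash>
tablefiles:
  <tablefile hash>: <chunk count>
  <tablefile hash>: <chunk count>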

Fundamentally, a Dolt remote implementation needs to accept writes or uploads of tablefiles and a manifest; it must be able to serve or support downloads of these files, both in their entirety and at times in pieces; and it must support a compare-and-swap (CAS) write of files, which is crucial for writing the manifest in particular. The remote also needs to support concatenation of tablefiles, since occasionally Dolt will “conjoin” data from multiple tablefiles into a much larger new one.

The common abstraction used to implement a new remote that covers all of these requirements is the Blobstore abstraction in Dolt. It has the following methods:

type Blobstore interface {
	Exists(ctx context.Context, key string) (ok bool, err error)
	Get(ctx context.Context, key string, br BlobRange) (rc io.ReadCloser, size uint64, version string, err error)
	Put(ctx context.Context, key string, totalSize int64, reader io.Reader) (version string, err error)
	CheckAndPut(ctx context.Context, expectedVersion, key string, totalSize int64, reader io.Reader) (version string, err error)
	Concatenate(ctx context.Context, key string, sources []string) (version string, err error)
}

The methods listed above are pretty straightforward; they do exactly what their names imply. Exists returns whether a given key exists in the remote, where a key is a manifest file or tablefile. Get returns the file contents, either as a whole or in pieces. Put writes a file to the remote. CheckAndPut performs a CAS write to the remote, which is used to write the manifest specifically, and Concatenate combines multiple files into a single one.
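To illustrate how these fit together, here’s a hedged sketch of a round trip through a Blobstore implementation. The allRange value (meaning “read the whole object”) and the key name are illustrative, and error handling is abbreviated:

// Sketch only: a write followed by a full read through the interface.
func roundTrip(ctx context.Context, bs Blobstore) error {
    data := []byte("tablefile bytes")

    // Upload a tablefile-like object under a content-addressed key.
    if _, err := bs.Put(ctx, "abc123", int64(len(data)), bytes.NewReader(data)); err != nil {
        return err
    }

    // Confirm it landed.
    ok, err := bs.Exists(ctx, "abc123")
    if err != nil {
        return err
    }
    if !ok {
        return errors.New("expected key to exist")
    }

    // Read it back in full; version is the store's identifier for this write.
    rc, size, version, err := bs.Get(ctx, "abc123", allRange)
    if err != nil {
        return err
    }
    defer rc.Close()
    fmt.Println("read", size, "bytes at version", version)
    return nil
}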

To support Git-backed Dolt remotes we wanted to use this same abstraction, which we know to be correct, but do so in a way that works with Git.

Initially, when brainstorming what components would be necessary for supporting Git remotes, we came up with a design based on a local Git checkout.

In this design, there would need to be an existing remote Git repository, like on GitHub or GitLab, but crucially, the Dolt client would maintain a local checkout of this remote repository and use it to facilitate its remote actions against the Git remote.

For example, a dolt clone in this design would first run git clone against the Git remote, so that the remote repository was cloned locally and then Dolt would perform a dolt clone operation behind the scenes against the local checkout, as if it were a filesystem-based remote.

Similarly, a dolt fetch would really be a git fetch, git checkout, and git reset --hard, to make the local checkout exactly match the Git remote state; then Dolt would perform a dolt fetch against the local checkout as if it were a filesystem-based remote.

And it does not get any simpler for dolt push. In this world, Dolt would first execute a dolt push to the local checkout, which would write to the working set and commit, then run git add ., squash all changes into a single git commit, and finally git push origin --force-with-lease.
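Sketched out as commands, that push path would have looked roughly like this; the checkout remote name and branch are illustrative:

# Push Dolt data into the local checkout as if it were a filesystem remote
$ dolt push checkout main

# Squash the resulting file changes into a single Git commit
$ git add .
$ git commit -m "dolt push"

# Attempt the CAS against the real Git remote
$ git push origin --force-with-lease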

If this git push succeeded, then the CAS was won; if it did not, we’d need to reset the local checkout to match the state of the remote again by fetching and resetting from the Git remote, rebase the Dolt changes against the new state of the local checkout, and then retry the push.

Checkout design retry loop

It became pretty clear to us that a big drawback of this design was that it required complex handling of a local Git checkout, which we also believed would fail under concurrency. Under a concurrent workload, there is no way to guarantee that writes to the local Git checkout would land atomically, especially in the case of a failed dolt push, where we would need to reset and replay the writes.

It also wasn’t clear that this design fit the commonly used Blobstore abstraction, which meant we’d need a new abstraction for these remotes, one that would need to be independently verified as correct.

Fortunately, in the face of these challenges, we decided to more seriously consider a GitBlobstore implementation of the Blobstore interface, which led us to investigate how we could use Git’s internal commands to operate directly on its underlying objects in order to correctly satisfy the Blobstore interface.

Fundamentally, Git stores four types of objects and exposes primitives to work with them directly. The object types are blobs, trees, commits, and tags. We’ll also be discussing Git refs a bit later; those are not Git objects, just pointers to commit objects. For our purposes we only care about blobs, trees, and commits (and refs).

Here is way too much information about Git internals and its plumbing commands.

Git internals#

A Git blob is the raw contents of a file, and Git hashes these contents to create the object ID of the blob. These IDs are the same if the file contents are the same, even if the files have different names.

# Store some content as a blob
$ echo "hello dolt" | git hash-object -w --stdin
012edb49e0a30a498e81c4f55e3762a009e1e049

# Verify it's a blob
$ git cat-file -t 012edb49e0a30a498e81c4f55e3762a009e1e049
blob

# Read it back
$ git cat-file blob 012edb49e0a30a498e81c4f55e3762a009e1e049
hello dolt

git hash-object -w --stdin reads raw bytes from stdin, computes a SHA hash of the content, writes the object into Git’s object store (that’s the -w flag; without it Git would only print the hash without storing anything), and prints the resulting object ID. Git’s content-addressable storage guarantees that the same content always produces the same ID, no matter what you name the file.

The next command we run is git cat-file -t <oid>, which prints the type of the object, in this case blob.

Lastly, git cat-file blob <oid> dumps the raw contents back out, where you can see “hello dolt” again. hash-object and cat-file are plumbing commands you can use to inspect any object in Git’s store.

Notice there’s no filename involved at this stage. hash-object just stores raw bytes and returns an object ID. The filename-to-blob mapping is the job of a Git tree.

A Git tree is a directory listing. Each entry in a Git tree maps a filename to the object ID of either a blob or another tree (which is a subdirectory). Like blobs, trees are also content-addressed, meaning two directories with identical contents produce the same tree ID.

# Create a temporary index (so we don't touch the real one)
$ export GIT_INDEX_FILE=/tmp/dolt-index

# Start with an empty index
$ git read-tree --empty

# Add our blob to the index under the filename "hello_dolt.txt"
$ git update-index --add --cacheinfo 100644 \
    012edb49e0a30a498e81c4f55e3762a009e1e049 hello_dolt.txt 

# Write the index out as a tree object
$ git write-tree
ae461249f7f4b72aa653eae41d05ddaf0dd3c6cf

# Inspect the tree: it maps "hello_dolt.txt" to our blob
$ git ls-tree ae461249f7f4b72aa653eae41d05ddaf0dd3c6cf
100644 blob 012edb49e0a30a498e81c4f55e3762a009e1e049    hello_dolt.txt 

Git doesn’t offer a convenient way to create tree objects directly, so we build them through an index file.

The index file in Git is a mutable, flat list of entries, where each entry is a filepath, mode, and blob object ID triplet. Interestingly, the index file is where your working set changes are staged when you run git add. When you commit the changes in your working set, git commit calls git write-tree under the hood, which produces an immutable tree object. Git then creates a commit pointing to that tree object.

GIT_INDEX_FILE in the snippet above points Git at a temporary file to use as its index, and git read-tree --empty initializes it. Next, git update-index --add --cacheinfo registers our blob from earlier under the filename “hello_dolt.txt” with the standard file mode 100644. At this point, the index file contains the following:

$ git ls-files --stage
100644 012edb49e0a30a498e81c4f55e3762a009e1e049 0    hello_dolt.txt 

After this, git write-tree takes the index file and writes it to the object store as a Git tree, returning the tree’s object ID. Finally, git ls-tree lets us inspect the tree to confirm it contains the entry we expect.

If we had added a second blob to the index before calling write-tree, the resulting tree would have two entries, like a directory containing two files. The second entry could also have been another tree, a sub-tree. Sub-trees work the same way as other tree entries: the entry simply points to another Git tree. This is how nested directories are represented.
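To see a sub-tree in action, we can stage the same blob again under a nested path and write a new tree. The object IDs below are placeholders:

# Stage the blob under a nested path
$ git update-index --add --cacheinfo 100644 \
    012edb49e0a30a498e81c4f55e3762a009e1e049 nested/hello_dolt.txt

# The new root tree now contains a sub-tree entry for "nested"
$ git write-tree
<root-tree-oid>

$ git ls-tree <root-tree-oid>
100644 blob 012edb49e0a30a498e81c4f55e3762a009e1e049    hello_dolt.txt
040000 tree <subtree-oid>    nested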

The next type of object is a Git commit, which is a snapshot of a Git tree, the root tree, which may contain sub-trees. In addition to the root tree, a Git commit points to previous commits in the Git history (its parents); the first commit in a repository has no parents. Commits also contain metadata like the author, timestamp, and message.

# Create a commit with no parent (first commit)
$ git commit-tree ae461249f7f4b72aa653eae41d05ddaf0dd3c6cf \
    -m "add hello_dolt.txt"
8f3e42b1a47d9c05e6aa0be912b7085ff1c3d6e0

In the snippet above we executed git commit-tree, which takes a tree object ID and a message and produces a commit object. Since this is the first commit, it has no parent commits. If we later updated hello_dolt.txt, built a new tree, and wanted to commit again, we’d pass the -p argument to link it to the previous commit, like so:

$ git commit-tree <new-tree-oid> \
    -p 8f3e42b1a47d9c05e6aa0be912b7085ff1c3d6e0 \
    -m "update hello_dolt.txt"

These objects and commands form the full chain: blob → tree → commit. Each commit captures the complete state of the tree at that point, and the parent links form the commit history.
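You can see the whole chain by pretty-printing the commit we just created; the author and committer lines will vary by machine:

$ git cat-file -p 8f3e42b1a47d9c05e6aa0be912b7085ff1c3d6e0
tree ae461249f7f4b72aa653eae41d05ddaf0dd3c6cf
author <name> <email> <timestamp>
committer <name> <email> <timestamp>

add hello_dolt.txt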

Next, let’s quickly talk about Git refs, because they’re also an important element we’d need for a Blobstore implementation.

A Git ref is a named pointer to a commit, which means it’s just a file on disk that contains a commit’s object ID. Branches like main are also refs, where refs/heads/main, or main for short, points to the latest commit on that branch. Tags are refs too. As it happens, you can also create custom refs under any path you want, and this is actually a feature of Git we explicitly took advantage of in our GitBlobstore implementation.

# Point a ref at our commit
$ git update-ref refs/heads/main \
    8f3e42b1a47d9c05e6aa0be912b7085ff1c3d6e0

# Read it back
$ git rev-parse refs/heads/main
8f3e42b1a47d9c05e6aa0be912b7085ff1c3d6e0

In the snippet above, git update-ref creates or moves a ref to point at the provided commit. Importantly, this command also supports an atomic CAS mode where you can essentially say “update this ref to new commit X, but only if it currently points at old commit Y.” In this mode, if someone else changed the ref out from under you, the write fails atomically and you can retry. This command and its atomic mode were crucial for implementing safe concurrent writes, as you’ll see later on.
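The CAS form takes an additional old-value argument, and the update only happens if the ref currently points where you expect:

# Move main to <new-oid>, but only if it still points at <old-oid>
$ git update-ref refs/heads/main <new-oid> <old-oid>
# exits non-zero and leaves the ref untouched if main has moved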

Git refs are also what gets pushed and pulled to and from a remote when you run git push or git fetch, for example. When these commands run, they transfer Git objects and update the corresponding refs on the remote.

Like the update-ref command, these remote commands support an atomic write mode. You can use Git’s --force-with-lease flag to guarantee that your writes to the remote either completely succeed or completely fail, like so:

# Fetch the current state of the remote ref
$ git fetch --no-tags origin +refs/heads/main:refs/remotes/origin/main

# Push, but only if the remote ref still matches what we fetched
$ git push --force-with-lease=refs/heads/main:<expected-oid> \
    origin refs/heads/main:refs/heads/main

If someone else pushed changes to our remote ref between our fetch and push, --force-with-lease rejects our push attempt and we would have to retry. This mode gives us the CAS semantics we used in the Blobstore.

Hopefully this breakdown gives you an idea of how we leveraged Git’s internals to implement a GitBlobstore: a Blobstore whose methods use these Git commands to implement the behavior we need from a remote.

One implication of using these Git primitives that may not be immediately clear is that it freed us from having to manage a local Git checkout the way we originally conceived.

Because we are working directly with Git objects (blobs, trees, and commits) and Git refs, we don’t actually need a local checkout at all. Instead we can use a local bare Git repository.

A bare Git repository is a Git repository with no working tree, which means it’s essentially just the contents of the .git directory, which is where Git stores all objects, refs, and configuration.

In a normal Git repository there is the .git directory and the working tree, the actual files on disk that you edit. When you run git checkout, Git reads blobs out of its object store and writes them as real files into the working tree. This is great for humans editing code, but for our purposes it’s unnecessary overhead. The Blobstore never needs files on disk; it only needs to read and write Git objects directly.

If you’re curious you can create a bare Git repo by running git init --bare:

$ git init --bare /tmp/my-cache.git
Initialized empty Git repository in /tmp/my-cache.git/

$ ls /tmp/my-cache.git
HEAD  config  description  hooks  info  objects  refs

As you can see in the above output, there is no working tree, no checked-out files, just the Git internals. This is what we decided to use for our GitBlobstore.

Putting it all together#

With these Git primitives we can map every Blobstore method to a Git plumbing implementation.

To start, here are the key fields in a GitBlobstore instance:

type GitBlobstore struct {
    gitDir            string
    maxPartSize       uint64
    remoteName        string
    remoteRef         string
    remoteTrackingRef string
    localRef          string
    cacheHead         git.OID
    cacheObjects      map[string]cachedGitObject
    cacheChildren     map[string][]git.TreeEntry
}

The gitDir is the path to the bare Git repository we discussed above. We decided to write these to Dolt’s data directory so each Dolt database will have its own local bare repository, and clones on the same host will not clobber each other’s repo.

maxPartSize controls how large a single Git blob can be. Many Git hosting providers enforce a per-object size limit, which our implementation needs to be aware of and handle, since Dolt tablefiles can often be quite large. As I mentioned earlier, Dolt also periodically conjoins multiple tablefiles into a single, larger one, so the GitBlobstore must take care not to produce a single oversized blob when that happens.

Our plan for the implementation is to automatically chunk large tablefiles into multiple, smaller blobs, stored under the key as a directory. For example, say we encountered a tablefile larger than maxPartSize: instead of writing it as a single blob, we want to write it as a sub-tree (e.g. tablefile/0001, tablefile/0002, …) whose children are the blobs of the chunked tablefile. We would also reassemble these transparently on read.
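Under this plan, a chunked tablefile would show up in the Dolt data commit’s tree as a sub-tree instead of a single blob, something like the following. The object IDs and tablefile names are placeholders:

$ git ls-tree <root-tree-oid>
100644 blob <oid>    manifest
100644 blob <oid>    <small tablefile>
040000 tree <oid>    <large tablefile>

# The sub-tree holds the ordered parts of the large tablefile
$ git ls-tree <large-tablefile-tree-oid>
100644 blob <oid>    0001
100644 blob <oid>    0002
100644 blob <oid>    0003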

The remoteName is a named Git remote configured inside that bare repo, pointing at the user’s actual Git remote URL; this is usually “origin”.

Below remoteName are three ref fields, which is where things get really interesting.

Recall that it’s possible to create custom refs in Git. Unlike branches, custom refs are not automatically fetched, pulled, or cloned; they need to be retrieved explicitly. For this reason, we decided it was best for Dolt’s remote data to live on a custom ref so that it never interferes with the normal usage and flow of the Git repository itself. Dolt can manage the custom ref behind the scenes and users never need to worry about it.

With this in mind, we added the remoteRef field to the GitBlobstore. This is the single ref on the Git remote that stores all of Dolt’s data, essentially turning the Git remote ref into object storage. Our implementation uses the custom ref path refs/dolt/data, so we don’t collide with normal branches or tags, and the ref points to a Git commit whose tree contains the Dolt tablefiles and manifest as Git blobs.
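Because Git’s default fetch refspec only covers refs/heads/*, retrieving this ref means naming it explicitly:

# Custom refs are not fetched by default; name them explicitly
$ git fetch origin +refs/dolt/data:refs/dolt/data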

Next, the remoteTrackingRef is a local copy of the remote ref’s current state, updated every time we fetch. It tells us “what did the remote look like the last time we checked?” Tracking the state of the remote ref is crucial for our GitBlobstore, since this is the value we pass with the --force-with-lease flag when pushing to the Git remote, in order to detect whether there were concurrent changes on the remote while this instance of the GitBlobstore was building a commit.

Then, localRef is a scratchpad where our GitBlobstore writes the changes that need to be pushed to the remote. Importantly, this ref needs to be unique to a Blobstore instance (so concurrent writers don’t clobber each other’s refs when working against the same bare repository), and independent of both the remote ref and the remote tracking ref, which must always reflect the remote ref’s state.

When writing to the localRef, the GitBlobstore builds a new commit (hashes the blobs, updates the index, writes a tree, commits the tree) and then points localRef at this new commit. Then, it tries to push localRef to remoteRef with a lease check against remoteTrackingRef. This architecture ensures that if the push succeeds, it has won the CAS and everything is hunky-dory. If the push fails, someone else pushed first and won the CAS, and this instance must now fetch the new remote state, rebuild localRef on top of it, and retry the push.

Both remoteTrackingRef and localRef include a UUID so that multiple concurrent GitBlobstore instances sharing the same bare repo don’t step on each other’s refs.
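So the ref layout inside the bare repo ends up looking something like this; the exact paths below are illustrative, not necessarily the ones Dolt uses:

refs/dolt/data                 # remoteRef, lives on the Git remote
refs/dolt/tracking/<uuid>      # remoteTrackingRef, local mirror of the remote
refs/dolt/local/<uuid>         # localRef, per-instance scratchpad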

The last set of fields relates to the GitBlobstore’s internal cache, which serves two very important purposes.

The first purpose is performance. Every Blobstore read operation needs to know what’s in the commit’s tree object, which means knowing what tablefiles and manifest exist and what their object IDs are. Without a cache, we’d have to run git ls-tree against the full tree on every Get and Exists call.

The second purpose is correctness under concurrency. When a GitBlobstore instance loses a CAS race (another writer pushed to the remote first), it has to fetch the new remote state and retry. At this point the remote’s tree may contain tablefiles that weren’t there before, and without a cache, the instance that lost the CAS would need to re-read the entire tree from scratch before it could retry. But because tablefiles are immutable, and thus once cached are guaranteed to be correct and never stale, we can simply merge the new entries into the existing cache without invalidating and rebuilding the entire thing. The only exception is the manifest, the one mutable file, which does need to be overwritten in the cache when it has changed.

To this effect, cacheHead tracks which commit object ID the cache was last built from. When we fetch a new remote state, we compare the fetched commit to cacheHead; if they match, there’s nothing to do. If they don’t match, we list the new commit’s tree and merge its entries into the cache.

cacheObjects is a map from tree path (e.g. manifest, abc123def456...) to the object’s ID and type. This lets us answer Exists and Get calls by looking up the path directly in the cache without shelling out to Git.

Finally, cacheChildren maps a directory path to its immediate child entries. This is needed for reading chunked tablefiles. We look up the children here to know how many parts there are and read them in order.

With these fields in mind, here’s how each GitBlobstore method uses these Git concepts and internals to actually make Git remotes work as Dolt remotes.

Exists#

Exists checks whether a key (a tablefile or manifest) is present on the remote. Here’s the simplified logic:

func (gbs *GitBlobstore) Exists(ctx context.Context, key string) (bool, error) {
    // Sync with the remote
    err := gbs.syncForRead(ctx)
    if err != nil {
        return false, err
    }

    // Check the cache
    _, ok := gbs.cache.get(key)
    return ok, nil
}

syncForRead does the heavy lifting here.

func (gbs *GitBlobstore) syncForRead() {
    // Fetch the remote ref into our remote-tracking ref
    git fetch --no-tags gbs.remoteName +gbs.remoteRef:gbs.remoteTrackingRef

    // Resolve the remote-tracking ref to a commit OID
    remoteHead = git rev-parse gbs.remoteTrackingRef

    // Set our local ref to match (remote is source of truth for reads)
    git update-ref gbs.localRef remoteHead

    // Merge the commit's tree entries into the cache
    if remoteHead != gbs.cacheHead {
        entries = git ls-tree -r remoteHead
        for each entry {
            cache.mergeEntry(entry)   // add if new, skip if exists
            if entry is manifest {
                cache.overwrite(entry) // manifest is always refreshed
            }
        }
        gbs.cacheHead = remoteHead
    }
}

syncForRead fetches the latest state from the remote, then merges the tree entries into the cache. After this, Exists is just a local map lookup. If the remote ref doesn’t exist yet, the fetch returns “ref not found” and we treat the store as empty, so Exists returns false for everything.

Get#

Get retrieves the contents of a key from the remote, supporting both full reads and byte-range reads. Here’s the simplified logic:

func (gbs *GitBlobstore) Get(ctx context.Context, key string, br BlobRange) (io.ReadCloser, uint64, string, error) {
    // Sync with the remote (same as Exists)
    err := gbs.syncForRead(ctx)
    if err != nil {
        return nil, 0, "", err
    }

    // Look up the key in the cache
    obj, ok := gbs.cache.get(key)
    if !ok {
        return nil, 0, "", NotFound{key}
    }

    // Read the data based on object type
    switch obj.typ {
    case blob:
        // Key is a single blob, read it directly
        size = git cat-file -s obj.oid
        reader = git cat-file blob obj.oid
        return sliceToRange(reader, size, br), size, obj.oid, nil
    case tree:
        // Key is a chunked sub-tree, read parts in order
        parts = gbs.cache.children(key)  // ["0001", "0002", ...]
        reader = concatenatePartReaders(parts, br)
        return reader, totalSize, obj.oid, nil
    }
}

Like Exists, Get starts with syncForRead to make sure the cache is up to date. Then, it looks up the key in the cache. If found, the behavior depends on whether the key is a blob or a tree.

If it’s a blob, the data was small enough to fit in a single Git object based on maxPartSize, so we use git cat-file -s to get the size and git cat-file blob to stream the contents. If the caller requested a byte range (e.g. “give me bytes 1000–2000”), we slice the stream accordingly.

If it’s a tree, the data was chunked during a previous Put. We look up the children from cacheChildren, open a reader for each part in order, and concatenate them into a single stream, slicing to the requested byte range if needed.
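Here’s a hedged sketch of that chunked-read path. The catFileBlob helper (streaming git cat-file blob) and the multiCloser type are illustrative, and the byte range is passed as plain offset/length values rather than the real BlobRange type:

// Sketch only: opens each part in order and concatenates the streams.
func readChunked(parts []git.TreeEntry, offset, length int64) (io.ReadCloser, error) {
    readers := make([]io.Reader, 0, len(parts))
    closers := make([]io.Closer, 0, len(parts))
    for _, p := range parts { // parts are sorted: 0001, 0002, ...
        rc, err := catFileBlob(p.OID) // streams `git cat-file blob <oid>`
        if err != nil {
            return nil, err
        }
        readers = append(readers, rc)
        closers = append(closers, rc)
    }

    // Concatenate the parts, then slice to the requested byte range.
    r := io.MultiReader(readers...)
    if _, err := io.CopyN(io.Discard, r, offset); err != nil { // skip to range start
        return nil, err
    }
    return &multiCloser{Reader: io.LimitReader(r, length), closers: closers}, nil
}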

The version string returned from Get is the object ID, either the blob object ID or the tree object ID. This is used later by CheckAndPut to detect concurrent changes.

Put#

Put writes a key to the remote. This is the most involved method because it has to hash the data, build a new commit, and push it atomically. Here’s the simplified logic:

func (gbs *GitBlobstore) Put(ctx context.Context, key string, totalSize int64, reader io.Reader) (string, error) {
    // Sync to check if the key already exists
    err := gbs.syncForRead(ctx)
    if err != nil {
        return "", err
    }

    // If the key already exists (tablefiles are immutable), skip the write
    if key != "manifest" && gbs.cache.has(key) {
        return existingVersion, nil
    }

    // Hash the data into blob(s) up front (so retries don't re-read the reader)
    if totalSize <= gbs.maxPartSize {
        blobOID = git hash-object -w --stdin < reader
        plan = {writes: [{path: key, oid: blobOID}]}
    } else {
        // Chunk into parts
        for each chunk of maxPartSize bytes from reader {
            partOID = git hash-object -w --stdin < chunk
        }
        plan = {writes: [{path: key/0001, oid: partOID1}, {path: key/0002, oid: partOID2}, ...]}
    }

    // Push with CAS retry loop
    gbs.remoteManagedWrite(ctx, key, plan)
}

The exists check is an important optimization in this method. Since tablefiles are content-addressed and immutable, there’s no point writing one that already exists on the remote. This avoids unnecessary writes during dolt push where many tablefiles may already be present.

The hashing phase writes all the data into Git blob objects in the local bare repo before attempting the push. This is important because the io.Reader can only be read once, but if the push fails due to a CAS race we may need to retry. By hashing up front, the blob OIDs are stored in the local object store and can be reused across retries.

Last is the remoteManagedWrite loop, which is the core of the write path:

func (gbs *GitBlobstore) remoteManagedWrite(ctx, key, plan) {
    retry with exponential backoff {
        // Fetch latest remote state
        remoteHead, ok = gbs.fetchAlignAndMergeForWrite(ctx)

        // Build a new commit on top of the remote head
        if ok {
            // Load the remote head's tree into a temporary index
            git read-tree remoteHead -> tempIndex
        } else {
            // First write ever, start from an empty index
            git read-tree --empty -> tempIndex
        }

        // Add our blob(s) to the index
        for each write in plan {
            git update-index --add --cacheinfo 100644 write.oid write.path
        }

        // Write tree and commit
        treeOID = git write-tree (from tempIndex)
        if ok {
            commitOID = git commit-tree treeOID -p remoteHead -m "put <key>"
        } else {
            commitOID = git commit-tree treeOID -m "put <key>"  // no parent for first commit
        }

        // Point our local ref at the new commit
        git update-ref gbs.localRef commitOID

        // Push with lease (this is the CAS)
        git push --force-with-lease=gbs.remoteRef:remoteHead \
            gbs.remoteName gbs.localRef:gbs.remoteRef

        // If push fails (someone else pushed first), loop back to fetch
        // If push succeeds, merge new commit into cache and return
    }
}

The retry loop is the heart of the concurrency model. Each iteration fetches the latest remote state, builds a new commit on top of it, and attempts a push with --force-with-lease. The lease check ensures the remote hasn’t changed since we fetched. If it has, the push is rejected, we fetch again, rebuild the commit on top of the new state, and retry. This is exactly the compare-and-swap pattern the Blobstore interface requires.

CheckAndPut#

CheckAndPut is how the manifest gets written. It’s like Put, but with an additional precondition: the write only succeeds if the key’s current version matches an expected version. This is the application-level CAS that sits on top of the Git-level --force-with-lease CAS.

func (gbs *GitBlobstore) CheckAndPut(ctx context.Context, expectedVersion, key string, totalSize int64, reader io.Reader) (string, error) {
    var cachedPlan *putPlan

    // CAS retry loop (same remoteManagedWrite as Put)
    retry with exponential backoff {
        // Fetch latest remote state
        remoteHead, ok = gbs.fetchAlignAndMergeForWrite(ctx)

        // Check the precondition: does the key's current version match?
        actualVersion = ""
        if ok {
            obj, found := gbs.cache.get(key)
            if found {
                actualVersion = obj.oid
            }
        }

        if expectedVersion != actualVersion {
            return CheckAndPutError  // version mismatch, stop, don't retry
        }

        // Hash the data (only on first attempt, reuse on retries)
        if cachedPlan == nil {
            cachedPlan = gbs.planPutWrites(ctx, key, totalSize, reader)
        }

        // Build a new commit on top of the remote head
        //    (same as Put: read-tree, update-index, write-tree, commit-tree)
        commitOID = gbs.buildCommitForKeyWrite(ctx, remoteHead, ok, key, cachedPlan)

        // Point our local ref at the new commit
        git update-ref gbs.localRef commitOID

        // Push with lease (this is the Git-level CAS)
        git push --force-with-lease=gbs.remoteRef:remoteHead \
            gbs.remoteName gbs.localRef:gbs.remoteRef

        // If push fails, loop back to fetch and re-check version
        // If push succeeds, merge new commit into cache and return
    }
}

The key difference from Put is the version check. Before building a commit, CheckAndPut looks up the key’s current object ID in the cache and compares it to expectedVersion. If they don’t match, it returns a CheckAndPutError immediately: no push is attempted and no retry happens. This is how Dolt ensures that manifest updates are serialized. A writer first reads the manifest, computes a new one, and writes it back with CheckAndPut using the version it originally read. If another writer updated the manifest in between, the versions won’t match and the write fails.
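From the caller’s side, a manifest update under this scheme looks roughly like the following sketch. The allRange value and the error-type check are illustrative; only the “manifest” key and the CheckAndPut semantics come from this post:

// Sketch: read-modify-write the manifest with an application-level CAS.
func swapManifest(ctx context.Context, bs Blobstore, update func([]byte) []byte) error {
    for {
        // Read the current manifest and remember the version we saw.
        rc, _, version, err := bs.Get(ctx, "manifest", allRange)
        if err != nil {
            return err
        }
        current, err := io.ReadAll(rc)
        rc.Close()
        if err != nil {
            return err
        }

        // Compute the new manifest and write it back against that version.
        next := update(current)
        _, err = bs.CheckAndPut(ctx, version, "manifest", int64(len(next)), bytes.NewReader(next))
        if err == nil {
            return nil // won the CAS
        }
        if _, lostRace := err.(CheckAndPutError); !lostRace {
            return err // a real failure, not a lost race
        }
        // Lost the race: another writer moved the manifest; re-read and retry.
    }
}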

Concatenate#

Concatenate combines multiple existing keys (source blobs) into a single new key. Dolt uses this to “conjoin” small tablefiles into a larger one.

func (gbs *GitBlobstore) Concatenate(ctx context.Context, key string, sources []string) (string, error) {
    // Sync with the remote
    err := gbs.syncForRead(ctx)
    if err != nil {
        return "", err
    }

    // If the destination key already exists, skip (same as Put)
    if key != "manifest" && gbs.cache.has(key) {
        return existingVersion, nil
    }

    // Read each source blob and concatenate into a single stream
    totalSize = sum of sizes of all sources
    reader = concatenatedReader(sources)  // streams source blobs in order

    // Hash the concatenated data into blob(s)
    plan = gbs.planPutWrites(ctx, key, totalSize, reader)

    // Push with CAS retry loop (same as Put)
    return gbs.remoteManagedWrite(ctx, key, plan)
}

Concatenate is straightforward once you understand Put. It reads the source blobs from the local object store (they were already fetched during syncForRead), concatenates them into a single stream, and then follows the exact same write path as Put.

Conclusion#

And there you have it, a GitBlobstore implementation that turns any Git remote into a Dolt remote using nothing but Git plumbing commands and a local bare repository!

Though it was the major piece of engineering that made Git remotes as Dolt remotes possible, the GitBlobstore doesn’t work alone. It plugs into Dolt’s Noms Block Store (NBS) layer, the same storage engine that powers all other Dolt remotes. NBS handles the higher-level concerns, like reading and writing the manifest, coordinating tablefile uploads, and triggering conjoins at the right time, while the Blobstore underneath handles the actual storage.

By implementing the Blobstore interface, the GitBlobstore slots seamlessly into NBS, which doesn’t know and doesn’t have to care that it’s talking to a Git remote instead of S3 or a local filesystem.

We hope you enjoyed this dive into our latest feature. If you’re excited about Git remotes, or want to talk about Dolt and how you can start building with it, come by our Discord, we’d love to meet you.