Shallow Clone Support

February 21, 2024

Dolt is the first version-controlled SQL database, and a natural consequence of that is that it needs to store every value of every row that has ever existed in your database. Git is similar, and it turns out that storing every revision of a large project like the Linux kernel is no problem, thanks to delta compression and very cheap storage. Nonetheless, Git has features to help users who don't want to clone the entire history to their local copy. Until recently, Dolt users did not have that ability, but now they do!

TL;DR:

If you just want to try it out, add the --depth flag to the arguments of your clone, and pull just the history you want!

$ dolt clone --depth 5 dolthub/us-businesses

or in SQL:

call dolt_clone('--depth', 5, 'dolthub/us-businesses');

Git Capabilities

Let's review the Git capabilities before we jump into what you can do with Dolt.

Git has two primary types of clones for reducing your storage footprint:

Shallow Clone. This allows the user to specify the number of commits they want to pull down. There are also options to the fetch command which let you extend the amount of history on a given branch, or shorten the history you have so that more data gets cleaned up on the next garbage collection round. Shallow Clone operates purely at the commit level: the source content of each commit you pull is pulled in its entirety.
Partial Clone. This allows the user to specify a filter-spec, which will filter objects and skip pulling them down during clone. filter-specs can be used to limit blob sizes and tree depth, among other things. The most useful filter spec, --filter=sparse:path=<path>, was removed in 2019 for security reasons because the path needs to be evaluated on the remote server. To replace it, the --filter=sparse:oid=<blob> option was added, which requires that you know the object id of a blob that contains the paths to clone. In many ways partial clone is really only useful today if you want to avoid pulling specific artifacts, such as large test data files. Organizations which have such needs will probably document or script around this poor user experience.

What's the verdict on these features? Mixed. GitHub has an excellent write up about the pros and cons of shallow and partial clone, and there are so many caveats it's hard to know if you should use the features or not.

At the heart of the issue is the fact that source content generally grows with time. Tracking, storing, and transferring the history ends up being less impactful to the total size of the repository than you might think because the latest revision has more content than previous revisions. An additional consideration is the cost of calculating an incremental diff and the network back and forth to construct a minimal downloadable object. It is often cheaper to just pull a few pack files (Git's compressed storage format) than to do all the computational work required to save yourself some disk space.

Finally, if you decide to use a shallow or partial Git clone (and you figure out the right arguments!), there are several limitations which are pretty much all in the form of poor error messages. If you attempt git show using a commit ID which you didn't clone, the error will simply be "Commit not Found", which is kind of confusing because you got the commit from somewhere (GitHub, a build log, or whatever). It would be nice to at least get an error which says the commit id is real but your clone just doesn't contain it. All in all, Git's shallow and partial clone features are fairly niche, and have a lot of rough edges.

Dolt's Approach

So... why would Dolt consider following Git's lead given this lackluster review of these features? Answer: Databases aren't source code. Databases are generally a lot more data. Depending on your application, it may be the case that you have many updates on each row of data and the total size of the database at HEAD only grows marginally with time. This would also be the case if inserts and deletes are very common in your application. Performing a shallow clone makes a lot of sense if you have years of history which have no relation to the current state of HEAD. Especially if you have no structural sharing in your data, you could end up with much less data in your local clone than the full database.

We opted to start with shallow clone because it is simple to understand for users... and it's way easier to implement! The data model of Dolt is very amenable to this approach because each commit has a reference to a root object which contains everything for your database at the time of commit. This includes tables, indexes, and really everything you would need to perform a non-Dolt SQL operation. From the perspective of someone familiar with how databases traditionally work, the root is your entire database. Given that, it's fairly easy for us to implement Shallow Clone because the primary feature users need - querying their data - is all possible if we have the root available to us.

Partial clones do not have the same convenient data model benefits that Shallow Clones do. We do see benefit in a user being able to pull just the table they are interested in looking at, but what happens when that table has foreign keys in another table? Another example would be pulling a subset of the tables and then adding a row which would fail your primary key check had all data been available at the time of commit. There are many more examples, but suffice it to say, we have more work to do in order to make partial Dolt clones a reality.

Which brings us back to Shallow Clone Support - hey, that's the title of this post! The feature is entirely summed up with one flag: --depth n. That argument can be used on the Dolt CLI:

$ dolt clone --depth n <database url>

Or if you are running in a SQL context:

call dolt_clone("--depth", n, "<database url>");

Using this clone, you can perform SQL operations on the data. In particular, any non-Dolt related SQL statements will run as expected. If you attempt to perform a Dolt operation which requires historical information you don't have, you will get an error:

$ dolt diff HEAD~10..HEAD
Commit not found. You are using a shallow clone which does not contain the requested commit. Please do a full clone.

Finally, if you modify your Shallow Clone and commit your changes, you can push them back to the origin. This is performed just like any other push.

$ dolt commit -a -m "My wonderful shallow commit"
$ dolt push

How it Works

Shallow Clone is based on the fetch workflow. fetch works by having the client interrogate the remote to determine what is missing in the client, then attempting to download just enough data to fill the gaps. The simplified workflow:

1. The Client asks the Remote what commit is on the Remote branch.
2. The Client determines it doesn't have that commit, so it requests the "Chunk" which contains the commit. (A Chunk is a binary portion of a file in our storage system. Chunks contain binary serialized objects such as Commits and Tables, and have references to other objects.)
3. The Client inspects the Chunk, which will likely have references to other objects which reside in other Chunks. Those objects include the ancestor commits for the commit in question as well as the root value mentioned above.
4. The Client tabulates all the Chunks referenced in step 3 that it doesn't have, then requests them (going back to step 2).
5. When the Client has all the Chunks it needs, it finishes up by updating remote branch references.
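
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not Dolt's implementation: the `Chunk` class, the `fetch` function, and the short string addresses are hypothetical stand-ins, and the remote is modeled as a plain dictionary rather than a network service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical stand-in: an opaque payload plus the addresses it references."""
    payload: str
    refs: tuple = ()


def fetch(remote, local, head):
    """Copy every chunk reachable from `head` that `local` lacks.

    `remote` and `local` are dicts of address -> Chunk (an assumption made
    for this sketch). Returns the number of request rounds that were needed.
    """
    wanted = set() if head in local else {head}
    rounds = 0
    while wanted:
        rounds += 1
        # Step 2: request the missing chunks in one batch.
        batch = {addr: remote[addr] for addr in wanted}
        local.update(batch)
        # Steps 3-4: inspect the new chunks and tabulate references we still lack.
        wanted = {
            ref
            for chunk in batch.values()
            for ref in chunk.refs
            if ref not in local
        }
    return rounds


# Three commits (c1 <- c2 <- c3), each pointing at a root value that points at tables.
remote = {
    "c1": Chunk("commit 1", ("r1",)),
    "c2": Chunk("commit 2", ("c1", "r2")),
    "c3": Chunk("commit 3", ("c2", "r3")),
    "r1": Chunk("root 1", ("t1",)),
    "r2": Chunk("root 2", ("t1", "t2")),
    "r3": Chunk("root 3", ("t1", "t3")),
    "t1": Chunk("table data 1"),
    "t2": Chunk("table data 2"),
    "t3": Chunk("table data 3"),
}
# The client already has the first commit and everything it references.
local = {addr: remote[addr] for addr in ("c1", "r1", "t1")}

rounds = fetch(remote, local, "c3")
remote_branches = {"origin/main": "c3"}  # Step 5: update the remote branch reference.
```

Each pass through the `while` loop corresponds to one request to the remote, which is why the number of rounds grows with how deeply the missing objects are nested.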

An important thing to call out is that the Objects stored are all opaque to our storage system, and they all have 20 byte addresses, which are what our commit ids are. Objects which refer to other objects through references all use these 20 byte addresses, and as fetch is pulling Chunks it doesn't know a lot about what it's pulling. Making sense of the references and their types happens one level above the storage system.
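
To make the layering concrete, here is a small sketch of a content-addressed store whose keys are 20-byte addresses, with reference decoding kept in a separate layer. Everything in it is an assumption for illustration: the `ChunkStore` class, the use of a truncated SHA-512 digest as the address, and the JSON commit encoding are hypothetical, and Dolt's real serialization is binary.

```python
import hashlib
import json

ADDRESS_LEN = 20


def address_of(data):
    # Assumption: a SHA-512 digest truncated to 20 bytes, used only for illustration.
    return hashlib.sha512(data).digest()[:ADDRESS_LEN]


class ChunkStore:
    """Storage layer: maps 20-byte addresses to opaque bytes and nothing more."""

    def __init__(self):
        self._chunks = {}

    def put(self, data):
        addr = address_of(data)
        self._chunks[addr] = data
        return addr

    def get(self, addr):
        return self._chunks[addr]


# One level above storage: this layer knows what a commit looks like.
def encode_commit(root, parents, message):
    doc = {"root": root.hex(), "parents": [p.hex() for p in parents], "message": message}
    return json.dumps(doc, sort_keys=True).encode()


def commit_refs(data):
    """Decode the addresses a commit refers to: its root value, then its parents."""
    doc = json.loads(data)
    return [bytes.fromhex(doc["root"])] + [bytes.fromhex(p) for p in doc["parents"]]


store = ChunkStore()
root1 = store.put(b"root value for commit 1")
first = store.put(encode_commit(root1, [], "initial import"))
root2 = store.put(b"root value for commit 2")
tip = store.put(encode_commit(root2, [first], "update rows"))
```

The store never looks inside the bytes it holds. Only `commit_refs`, which lives above it, can tell that the chunk at `tip` points at a root value and a parent commit.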

This is how fetch has worked for a long time, and it continues to work that way now. The piece we've added is a check: if a Reference ID discovered during fetch's incremental search is in a specific set of Reference IDs, we don't pull it down. Specifically, in step 3 above, the ancestor commits are short-circuited and not pulled down.

How do we know what commits to not pull though? Dolt has a specific data structure on every commit which is called the Commit Closure. The closure is the full set of all commits which are reachable from the commit. We have this data structure in order to speed up merges because it makes finding the merge-base much faster.
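
As a rough illustration of why a precomputed closure helps, the sketch below finds a merge base with set operations instead of walking the commit graph. It is not Dolt's algorithm: `build_closures` and `merge_bases` are hypothetical names, and the closures are computed in memory here, whereas Dolt stores the closure with each commit.

```python
def build_closures(parents):
    """Map each commit to the frozenset of all commits reachable from it."""
    closures = {}

    def closure_of(commit):
        if commit not in closures:
            reachable = set()
            for parent in parents[commit]:
                reachable.add(parent)
                reachable |= closure_of(parent)
            closures[commit] = frozenset(reachable)
        return closures[commit]

    for commit in parents:
        closure_of(commit)
    return closures


def merge_bases(closures, left, right):
    """Common ancestors that are not themselves ancestors of another common ancestor."""
    common = (closures[left] | {left}) & (closures[right] | {right})
    return sorted(
        c for c in common if not any(c in closures[other] for other in common)
    )


# History: a <- b, then two branches: b <- c and b <- d <- e.
parents = {"a": [], "b": ["a"], "c": ["b"], "d": ["b"], "e": ["d"]}
closures = build_closures(parents)
bases = merge_bases(closures, "c", "e")
```

With the closures available, finding the common ancestors is a single set intersection, and no parent pointers need to be followed at merge time.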

With all this information in hand, the Shallow Clone process is effectively a breadth first traversal:

1. The Client asks the Remote what commit is on the Remote branch.
2. Build a skip list using the Commit Closure of the commit.
3. Decrease the --depth value by 1.
4. If --depth is still greater than 0, determine the parent IDs of the current level's commits, remove them from the skip list, make them the current level, and go back to step 3.
5. Once --depth reaches 0, write the items in the skip list to disk so we know in the future what commits we skipped.
6. Request all Chunks the client doesn't have, and ensure that the storage system skips all commits listed in the skip list.
7. Finish by creating a branch pointing to the first commit requested.
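
The traversal above can be sketched as follows. This is a simplified model rather than Dolt's code: the function names are hypothetical, the closure is computed on the fly instead of being read from the commit, persisting the skip list is omitted, and the loop bounds assume the tip commit counts as the first level, so a depth of 1 keeps only the tip.

```python
def closure_of(parents, commit):
    """All commits reachable from `commit`, excluding the commit itself."""
    seen, stack = set(), list(parents[commit])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def plan_shallow_clone(parents, tip, depth):
    """Return the set of commits to skip for a clone that keeps `depth` levels."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    skip = closure_of(parents, tip)  # Start by skipping every ancestor.
    level = {tip}
    remaining = depth
    while True:
        remaining -= 1
        if remaining == 0:
            break
        # Un-skip the next level of parents and descend one level.
        level = {p for commit in level for p in parents[commit]}
        skip -= level
    return skip


def shallow_fetch(refs, tip, skip):
    """Walk chunk references from `tip`, never requesting a skipped commit."""
    fetched, wanted = set(), {tip}
    while wanted:
        fetched |= wanted
        wanted = {
            ref
            for addr in wanted
            for ref in refs[addr]
            if ref not in fetched and ref not in skip
        }
    return fetched


# Linear history c1 <- c2 <- c3 <- c4, each commit referencing its own root value.
parents = {"c1": [], "c2": ["c1"], "c3": ["c2"], "c4": ["c3"]}
refs = {
    "c1": ("r1",), "c2": ("c1", "r2"), "c3": ("c2", "r3"), "c4": ("c3", "r4"),
    "r1": (), "r2": (), "r3": (), "r4": (),
}

skip = plan_shallow_clone(parents, "c4", depth=2)
fetched = shallow_fetch(refs, "c4", skip)
branches = {"main": "c4"}  # Final step: a branch pointing at the first commit requested.
```

Because a skipped commit's Chunk is never requested, the root value it references is never discovered either, which is where the savings come from.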

In contrast, a full clone is primarily impacted by your internet connection's bandwidth. This is because the protocol for a full clone is far less chatty - it's effectively a large file download. There are a few requests for information, but very little back and forth. Fetch is much more chatty, and therefore Shallow Clone is as well, so your internet connection's latency becomes more impactful. This is not to say your bandwidth doesn't matter - but latency goes from not factoring into clone performance at all to being important.
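
A back-of-the-envelope model shows how the balance shifts. The sketch below assumes that round trips are serialized and do not overlap with data transfer, and every input value in it is hypothetical, chosen only to show the shape of the trade-off rather than to describe any measured clone.

```python
def transfer_seconds(size_bytes, bandwidth_bits_per_s, round_trips, rtt_seconds):
    """Rough transfer time: time moving bytes plus time waiting on round trips."""
    payload = size_bytes * 8 / bandwidth_bits_per_s
    waiting = round_trips * rtt_seconds
    return payload + waiting


def latency_share(size_bytes, bandwidth_bits_per_s, round_trips, rtt_seconds):
    """Fraction of the total time spent waiting on round trips."""
    total = transfer_seconds(size_bytes, bandwidth_bits_per_s, round_trips, rtt_seconds)
    return round_trips * rtt_seconds / total


# Hypothetical inputs: a big download with few requests vs. a small one with many.
FULL = dict(size_bytes=10e9, round_trips=5)
SHALLOW = dict(size_bytes=1e9, round_trips=5000)
LINK = dict(bandwidth_bits_per_s=50e6, rtt_seconds=0.05)

full_share = latency_share(**FULL, **LINK)
shallow_share = latency_share(**SHALLOW, **LINK)
```

Under these made-up numbers, waiting on round trips is a negligible part of the full clone but more than half of the shallow one, even though the shallow clone moves a tenth of the data.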

Results

I'm not ashamed to say it: The results are mixed. Similar to the GitHub post I mentioned above, there are a lot of caveats. Shallow Clone's effectiveness in saving you time and disk space will depend a lot on the database you are attempting to clone and the internet connection you are using. I took a non-scientific assortment of databases on DoltHub, and cloned them in both modes using an EC2 instance in us-west-2 where DoltHub resides, and on a 50Mbps residential connection to compare. The EC2 instance is about the lowest latency/highest bandwidth connection possible, and is representative of an application running in the cloud. Shallow Clone data points were gathered using --depth 1.

Database	Full Clone Size (GB)	Shallow Clone Size (GB)	Full Clone EC2 Time (min:sec)	Shallow Clone EC2 Time (min:sec)	Full Clone 50Mbps Time (min:sec)	Shallow Clone 50Mbps Time (min:sec)
US Businesses	21.7	1.8	7:24	3:07	67:22	16:40
Hospital Price Transparency	9.0	1.8	2:33	4:47	27:14	25:40
Earnings	0.7	0.5	0:17	0:37	2:15	4:43

The takeaway is that Shallow Clone helps most if you are very limited on disk space at the destination.

The database in this set which benefits most from the use of Shallow Clone is the US Businesses database, which was a bounties database from a couple of years ago. You can see that the size of the Shallow Clone is more than 10x smaller (21.7 GB reduced to 1.8 GB), and it takes 1/4 the time to download on a 50Mbps connection (67 min to 16). An example of where Shallow Clone is barely helpful is the Earnings database, which only saved ~20% of disk space and takes twice as long to clone. The reason for the difference comes down to the type of history in each database. In the first, the data is the result of people working on and massaging the data. In the latter, the earnings are strictly additive over time.

In the interesting middle ground is the Hospital Price Transparency database. The size does reduce significantly, but if time is the concern, the results are surprising. In EC2, where bandwidth is very high and latency is low, the shallow clone takes almost twice as long as the full clone. Apparently it's faster to just get all of the data. On a slower connection, though, the timing benefit is slight, so the real benefit is saving 7 GB of local disk space.
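
For readers who want to check the ratios quoted above, here is a small script that derives them from the table. The numbers are copied from the table as published; the helper names `to_seconds` and `summarize` are just for this sketch.

```python
def to_seconds(min_sec):
    """Convert a 'min:sec' string such as '67:22' to seconds."""
    minutes, seconds = min_sec.split(":")
    return int(minutes) * 60 + int(seconds)


# (full GB, shallow GB, full EC2, shallow EC2, full 50Mbps, shallow 50Mbps)
RESULTS = {
    "US Businesses": (21.7, 1.8, "7:24", "3:07", "67:22", "16:40"),
    "Hospital Price Transparency": (9.0, 1.8, "2:33", "4:47", "27:14", "25:40"),
    "Earnings": (0.7, 0.5, "0:17", "0:37", "2:15", "4:43"),
}


def summarize(row):
    full_gb, shallow_gb, full_ec2, shallow_ec2, full_home, shallow_home = row
    return {
        # How many times smaller the shallow clone is on disk.
        "size_reduction": full_gb / shallow_gb,
        # Shallow time divided by full time; above 1 means shallow was slower.
        "ec2_time_ratio": to_seconds(shallow_ec2) / to_seconds(full_ec2),
        "home_time_ratio": to_seconds(shallow_home) / to_seconds(full_home),
    }


for name, row in RESULTS.items():
    s = summarize(row)
    print(
        f"{name}: {s['size_reduction']:.1f}x smaller, "
        f"EC2 time x{s['ec2_time_ratio']:.2f}, "
        f"50Mbps time x{s['home_time_ratio']:.2f}"
    )
```

A time ratio above 1 means the shallow clone was slower than the full clone for that database and connection.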

Try this out on your database, and let us know how it works for you!

What's Next?

The bigger picture is this: Archiving history is where we are heading. The ability to put the old stuff you don't care about into cold storage (or delete it entirely) is on our roadmap. We want to offer our users the ability to keep a configurable amount of history. In order to do that, we needed to break a critical assumption that all Dolt databases have all data in perpetuity. The 100+ files changed to make this work were primarily about breaking this assumption, which has been baked into Dolt from day one. Breaking this assumption is the first step toward a future where your database can have a rolling window of history.

Surely given the last paragraph, we are building History Archiving next, right? Actually, no. First we are going to focus some time on reducing the total size of the database by using Chunk Delta Compression. This is the primary way Git keeps its footprint small, and we have a lot of potential to improve Dolt's footprint this way as well. Delta compression could not only reduce our storage footprint, it could also allow us to significantly streamline the clone and fetch protocols. That may come next, or not. We'll see how much mileage we get from each approach, but the point is that Dolt is taking storage footprint seriously now. Cheap storage only gets us so far, and it's been fun while it lasted!

Dolt considers Shallow Clones and Archiving to be critical features, to an extent that Git never will because data is not source code. While Git's shallow clone leaves a lot to be desired and Dolt's MVP Shallow Clone is about as good, Dolt is committed to making it a core feature. If you'd like to ask us more about this feature, or any feature for that matter, please join us on Discord. We're the nicest group of Database and Version Control geeks on the internet!
