Comparing Dolt Backups with Remotes

October 8, 2021

Whether due to clumsiness, physical damage, or malicious actors, the worst thing that can happen to your data is irretrievable loss. But this is a feature release, not a postmortem, and we are excited to announce Dolt backups!

The CLI now includes dolt backup add to create, dolt backup sync to save, and dolt backup restore to recover the contents of your database.

Why back up a Dolt database?

Dolt is a versioned database with several layers of fault tolerance, just like Git. You can reverse a drop table command with dolt reset --hard HEAD, undoing the change and restoring the previous state. You can even sudo rm -rf an individual clone safely, as long as the most recent data was saved to a remote with dolt push origin main; dolt clone <remote-url> then restores your previous state. With all these recovery features, why would you also need Dolt backups?

Clones aren’t copies! There are important differences between rsyncing your .dolt folder and pushing main. In this blog we will dig into when you would use each, and how to use the new backups feature in Dolt to complement remotes.

Remotes are not backups

Remotes and backups are similar at first glance. They both copy data incrementally between remote address spaces. The storage and transmission formats are the same. And you can push every branch, head, and tag to a remote. Remotes are familiar, convenient, guard against most varieties of data loss, and are designed for flexibility. Most users will be comfortable sticking to remotes.

But remotes aren’t always enough. Regulatory and governance requirements can compel the use of backups. Remotes can’t and shouldn’t replicate uncommitted data. Backups are also useful as checkpoints before rewriting the history of your database with rebases or migrations.

In each of these cases, a brute force snapshot of the whole database is more useful than a customizable push. A backup can either be a single snapshot or rolling, but is always private to a single writer. A backup therefore copies state from a .dolt repo otherwise hidden from remotes, including staging and working sets.

In summary, backups complement remotes when we want a heightened level of protection against faults and data loss.
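
The checkpoint use case can be sketched as a small wrapper. This is a minimal illustration, not part of Dolt itself: the function name checkpoint_then is hypothetical, and it assumes a backup with the given name was already registered with dolt backup add.

```shell
#!/usr/bin/env bash
# Sketch: take a backup checkpoint before running a risky command.
# Assumption: the named backup was registered earlier with `dolt backup add`.

# checkpoint_then <backup-name> <command...>
# Syncs the backup first and runs the command only if the sync succeeded.
checkpoint_then() {
  local backup_name="$1"
  shift
  if ! dolt backup sync "$backup_name"; then
    echo "backup sync failed; skipping: $*" >&2
    return 1
  fi
  "$@"
}
```

For example, checkpoint_then backup1 dolt sql -q "drop table old_data" would sync backup1 before dropping a (hypothetical) old_data table, and would skip the drop entirely if the sync failed.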

Backups Tutorial

Create A Database Snapshot with the CLI

We will show how to use the new dolt backup command in this tutorial.

The only install needed to start is the dolt binary:

sudo bash -c 'curl -L https://github.com/dolthub/dolt/releases/latest/download/install.sh | sudo bash'

We will focus on two directories to start. One for backups, and an initial dolt repo:

$ mkdir -p repo1 backups/backup1
$ cd repo1
$ dolt init
Successfully initialized dolt data repository.

Adding a backup looks similar to adding a remote:

$ dolt backup add backup1 file://../backups/backup1
/ Tree Level: 1, Percent Buffered: 0.00% Files Written: 0, Files Uploaded: 1

And syncing a backup looks similar to pushing to a remote:

$ dolt backup sync backup1

In this simplified example, where we only created a single main branch, restoring the database will look similar to a clone:

$ cd ..
$ dolt backup restore file://./backups/backup1 repo2
$ cd repo2
$ dolt branch -a
* main
$ dolt status

But under the hood, backups and remotes do different things. A reference, or ref, in Dolt and Git is a commit hash, branch, or tag. A client interacting with a remote can only push one ref per command. In addition to copying every pushable ref, backups also copy remote tracking refs and working sets. Remote tracking refs are usually privately namespaced within a database, and working sets are copies of rows that transactions collect before committing.

We will create one of each and start a new backup cycle to highlight this behavior:

$ cd ../repo1
$ dolt branch feature
$ mkdir ../rem1
$ dolt remote add origin file://../rem1
$ dolt push origin main
$ dolt tag v1 HEAD
$ dolt sql -q "create table not_committed (a int primary key)"
$ dolt backup sync backup1

If we restore the database again, we see all of our new changes:

$ cd ..
$ dolt backup restore file://backups/backup1 repo3
$ cd repo3
$ dolt branch -a
  feature
* main
  remotes/origin/main
$ dolt status
On branch main
Untracked files:
  (use "dolt add <table|doc>" to include in what will be committed)
	new table:      not_committed

We can get almost the same thing with remotes. But backups copy everything and save uncommitted data. Hopefully this example makes the comparison concrete, and provides a little inspiration for your own apps!
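
To make "almost the same thing with remotes" concrete, here is a rough sketch of pushing every local branch one ref at a time. The helper name push_all_branches is hypothetical, and the parsing assumes dolt branch prints one branch per line with the current branch marked by "* ", as in the output shown earlier.

```shell
#!/usr/bin/env bash
# Sketch: approximate a backup with a remote by pushing each branch separately.
# Assumption: `dolt branch` prints one branch per line, current branch prefixed by "* ".

# push_all_branches <remote>
push_all_branches() {
  local remote="$1" branches branch
  # Strip the "* " marker and leading indentation, and drop empty lines.
  branches=$(dolt branch | sed -e 's/^[* ]*//' -e '/^$/d') || return 1
  # Branch names contain no whitespace, so word splitting is safe here.
  for branch in $branches; do
    # One `dolt push` per ref, stopping at the first failure.
    dolt push "$remote" "$branch" || return 1
  done
}
```

Calling push_all_branches origin in repo1 would push main and feature, but the uncommitted not_committed table would still be left behind. A single dolt backup sync backup1 covers all of it in one step.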

Daily Backups

In this second tutorial, we will make a systemd script that synchronizes our database on a timer. Different operating systems have different cron managers, and we will use a Linux setup with systemd timers here.

Our systemd script requires three files:

our run_backup.sh script
a “unit file” that executes the backup script within the systemd interface (backup.service)
and a timer that executes our unit periodically (backup.timer)

First, we will write a script that creates a database backup in the same manner as the previous tutorial:

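#!/usr/bin/bash
# Create a new timestamped Dolt backup of DB_DIR under BACKUP_DIR.
set -euo pipefail

BACKUP_DIR=/home/test/backups
DB_DIR=/home/test/repo1

cd "$DB_DIR"
# Name each backup after the current Unix time in seconds.
backup_id=$(date '+%s')
mkdir -p "${BACKUP_DIR}/${backup_id}"
# BACKUP_DIR is absolute, so file:// plus the path yields file:///home/...
dolt backup add "${backup_id}" "file://${BACKUP_DIR}/${backup_id}"
dolt backup sync "${backup_id}"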

I saved this file to /home/test/run_backup.sh and hardcoded my local test folder here. You would want to edit these accordingly if following along at home.
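
The script above creates a fresh snapshot on every run. A rolling backup, which reuses one destination, could look roughly like the sketch below. The function name sync_rolling_backup is hypothetical, and the check assumes that dolt backup with no arguments lists registered backup names one per line.

```shell
#!/usr/bin/env bash
# Sketch: a rolling backup that reuses a single name and destination.
# Assumption: `dolt backup` with no arguments prints one registered name per line.

# sync_rolling_backup <name> <url>
# Registers the backup on first use, then syncs it on every call.
sync_rolling_backup() {
  local name="$1" url="$2"
  if ! dolt backup | grep -qxF -- "$name"; then
    dolt backup add "$name" "$url" || return 1
  fi
  dolt backup sync "$name"
}
```

Run from inside the database directory, sync_rolling_backup rolling file:///home/test/backups/rolling would keep one incrementally updated copy instead of a growing list of snapshots. The name rolling and that path are placeholders.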

Next, our “unit file” written to /usr/lib/systemd/system/backup.service references our backup script:

[Unit]
Description=Runs db backup

[Service]
Type=oneshot
ExecStart=/home/test/run_backup.sh

[Install]
WantedBy=multi-user.target

And finally a timer backup.timer coupled to backup.service periodically executes the script:

[Unit]
Description=Db backup timer

[Timer]
OnCalendar=*-*-* *:*:0/5
AccuracySec=1s

[Install]
WantedBy=timers.target

We enable and start the timer to kick off backups, which are configured to run every five seconds:

$ systemctl enable backup.timer
$ systemctl restart backup.timer

After waiting a bit, we can view our growing list of backups:

$ tree backups
backups
├── 1633451635
│   ├── LOCK
│   ├── abmbvta6lclqj7dgrvon4kkgs4lf8ol3
│   ├── ajgrseim4flkk7bprt1jvec5dgpga6ag
│   ├── manifest
│   └── oldgen
├── 1633451640
│   ├── LOCK
│   ├── abmbvta6lclqj7dgrvon4kkgs4lf8ol3
│   ├── ajgrseim4flkk7bprt1jvec5dgpga6ag
│   ├── manifest
│   └── oldgen
├── backup1
│   ├── LOCK
│   ├── abmbvta6lclqj7dgrvon4kkgs4lf8ol3
│   ├── ajgrseim4flkk7bprt1jvec5dgpga6ag
│   ├── manifest
│   └── oldgen
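
Since each run adds a new directory, the list grows without bound. One way to cap it is to keep only the newest few snapshots, as in this sketch. The function name prune_backups is hypothetical. It only deletes directories on disk and leaves the backup names registered in the repo untouched.

```shell
#!/usr/bin/env bash
# Sketch: keep only the newest timestamped backup directories.
# Assumption: snapshot directories are named with epoch seconds, as in run_backup.sh.

# prune_backups <backup-root> <keep>
prune_backups() {
  local root="$1" keep="$2" name
  # Only all-digit names are considered, so hand-named backups such as
  # "backup1" are never touched. Newest first, then skip the first <keep>.
  ls -1 "$root" | grep -E '^[0-9]+$' | sort -rn | tail -n "+$((keep + 1))" |
    while read -r name; do
      rm -rf -- "${root:?}/${name}"
    done
}
```

For instance, prune_backups /home/test/backups 10 would delete all but the ten most recent timestamped directories.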

What’s Next?

Guarantees for single writers, encryption, and access control will make backups more secure. Users can currently implement these manually, for example, by adding additional steps to our systemd script. We will include more of these features at the Dolt layer in the future.
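
As one example of such a manual step, a lock directory can approximate a single-writer guarantee on one machine. This is only a sketch: the name with_backup_lock and the lock path are hypothetical, and it does not address encryption or access control.

```shell
#!/usr/bin/env bash
# Sketch: allow only one backup process at a time on a single machine.
# Assumption: all writers agree on the same lock directory path.

# with_backup_lock <lock-dir> <command...>
with_backup_lock() {
  local lock_dir="$1" status
  shift
  # mkdir is atomic: it fails if the directory already exists.
  if ! mkdir "$lock_dir" 2>/dev/null; then
    echo "another backup is running (lock: $lock_dir)" >&2
    return 1
  fi
  "$@"
  status=$?
  # Release the lock and report the command's own exit status.
  rmdir "$lock_dir"
  return "$status"
}
```

The backup script could then call with_backup_lock /tmp/dolt-backup.lock dolt backup sync backup1. If the process is killed mid-run the lock directory is left behind and has to be removed by hand.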

Extending DoltHub to automatically provision backups alongside managed servers is another useful feature we are developing. The option to custom provision remote endpoints will always exist, but we think a convenient hosted option is also useful.

Unlike MySQL, Dolt backups do not double as a format for read replication. We are currently developing other features to provide read replicas and automatic failover for Dolt SQL servers.

Conclusion

You can now back up your Dolt database separately from shared remotes. Backups and remotes are similar, but backups add an extra layer of fault tolerance and facilitate easy database restores.

A quick summary of the technical differences:

Backups capture the entire internal state of your database, whereas remotes synchronize specific branches or tags.
Backups are private snapshots, while remotes expose internal state for sharing.

We summarized examples of when you might want the flexibility of remotes, and when you need the fault tolerance of backups. We also walked through two tutorials using the new dolt backup CLI commands. The first creates a static backup manually, and the second configures a background process that automatically backs up our database on a timer.

If you are interested in learning more about Dolt, backups, or relational databases, reach out to us on Discord!
