Working with AWS Remotes

Almost two years ago, my colleague Brian blogged about how to use Dolt with different types of supported remotes. The most accessible type of Dolt remote is a database on DoltHub, but, as outlined in that previous blog, there are a number of other ways to host remote Dolt databases.

Since that blog was written, a number of users have adopted workflows that use AWS remotes directly. I wanted to take the opportunity to dive a bit deeper into:

  1. How to create and use AWS remotes.

  2. How to use IAM permissions to allow read and write access for various IAM principals to the remotes.

  3. The various operations and use cases AWS remotes support.

This will be a tutorial-style blog post, where we show how to create the AWS resources you need for AWS remotes, and then how to use them.

Getting Started

To get started you will need an AWS account and the AWS CLI installed. The AWS CLI should be configured with AWS credentials that are allowed to create and delete S3 buckets, DynamoDB tables, and IAM policies and roles. Used in isolation, the interactions in this tutorial will stay within the AWS Free Tier.

We can start by verifying that everything is set up appropriately in your environment and capturing a shell parameter we will need later:

$ aws sts get-caller-identity --output text --query Account
355726134989
$ AWS_ACCOUNT_ID=`aws sts get-caller-identity --output text --query Account`

In this tutorial I will be working against us-west-2, but you can replace the line below with whatever region you want to use:

$ export AWS_REGION=us-west-2

In order to create and use a Dolt remote in AWS, you need two things:

  1. An S3 bucket, where Dolt table data will be stored.

  2. A DynamoDB table, where Dolt stores the top level metadata about the remote, including the root hash of the entire database and which S3 files are part of the database.

Let's go ahead and create those things. We will add some randomness to our names since AWS S3 bucket names must be globally unique.

$ S3_BUCKET_NAME="dolt-aws-remote-example-"`openssl rand -hex 8`
$ aws s3api create-bucket --bucket $S3_BUCKET_NAME --create-bucket-configuration 'LocationConstraint='"$AWS_REGION"
{
    "Location": "http://dolt-aws-remote-example-7b7de821f78bb148.s3.amazonaws.com/"
}
$ DYNAMODB_TABLE_NAME="$S3_BUCKET_NAME"
$ aws dynamodb create-table \
    --table-name $DYNAMODB_TABLE_NAME \
    --attribute-definitions 'AttributeName=db,AttributeType=S' \
    --key-schema 'AttributeName=db,KeyType=HASH' \
    --provisioned-throughput ReadCapacityUnits=1,WriteCapacityUnits=1
{
    "TableDescription": {
        "AttributeDefinitions": [
            {
                "AttributeName": "db",
                "AttributeType": "S"
            }
        ],
        "TableName": "dolt-aws-remote-example-7b7de821f78bb148",
        "KeySchema": [
            {
                "AttributeName": "db",
                "KeyType": "HASH"
            }
        ],
        "TableStatus": "CREATING",
        "CreationDateTime": "2023-04-25T14:08:16.708000-07:00",
        "ProvisionedThroughput": {
            "NumberOfDecreasesToday": 0,
            "ReadCapacityUnits": 1,
            "WriteCapacityUnits": 1
        },
        "TableSizeBytes": 0,
        "ItemCount": 0,
        "TableArn": "arn:aws:dynamodb:us-west-2:355726134989:table/dolt-aws-remote-example-7b7de821f78bb148",
        "TableId": "f94680dd-cbd5-435a-b662-50133e42332b"
    }
}

Here I have created the DynamoDB table with the same name as the S3 bucket, but this is not required. It could have any name you wish. You can see below that they are specified separately in the remote URL given to Dolt.
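
The remote URL format Dolt expects is aws://[<dynamodb-table>:<s3-bucket>]/<path>, so a separately named table only changes the part before the colon. As an illustration (we will not use this table anywhere else in the tutorial), a hypothetical shared manifest table might look like:

$ aws dynamodb create-table \
    --table-name dolt-remote-manifests \
    --attribute-definitions 'AttributeName=db,AttributeType=S' \
    --key-schema 'AttributeName=db,KeyType=HASH' \
    --provisioned-throughput ReadCapacityUnits=1,WriteCapacityUnits=1
$ dolt remote add origin 'aws://[dolt-remote-manifests:'"$S3_BUCKET_NAME"']/primes'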

Our First Remote

Having created those two cloud resources, we can now create any number of Dolt remotes which use those resources for storage. Let's create a primes database and push it to a new remote backed by that bucket and DynamoDB table.

$ mkdir primes
$ cd primes
$ dolt init
Successfully initialized dolt data repository.
$ dolt sql -q 'create table primes (i int primary key, val int);'
$ dolt sql -q 'insert into primes values
(1, 2),
(2, 3),
(3, 5),
(4, 7),
(5, 11),
(6, 13),
(7, 17),
(8, 19),
(9, 23),
(10, 29),
(11, 31),
(12, 37),
(13, 41),
(14, 43),
(15, 47),
(16, 53),
(17, 59),
(18, 61),
(19, 67),
(20, 71),
(21, 73),
(22, 79),
(23, 83),
(24, 89),
(25, 97);'
Query OK, 25 rows affected (0.01 sec)
$ dolt commit -Am 'insert first 25 primes'
commit jvfj0hnej9toup6pntjmdho8rcbievtm (HEAD -> main) 
Author: Aaron Son <aaron@dolthub.com>
Date:  Tue Apr 25 14:14:57 -0700 2023

        insert first 25 primes

$ dolt remote add origin 'aws://['"$DYNAMODB_TABLE_NAME":"$S3_BUCKET_NAME"']/primes'
$ dolt push origin main:main
Uploaded 1.6 kB of 1.6 kB @ 8.68 kB/s.

We can inspect the cloud resources which were created as part of the push:

$ aws s3 ls s3://"$S3_BUCKET_NAME"/
                           PRE primes/
$ aws s3 ls s3://"$S3_BUCKET_NAME"/primes/
2023-04-25 14:15:41       1634 4f4i07s5oj4et4pu85dlkokg0r4bhnet
2023-04-25 14:15:41        321 plh9lge7etttc2qjh8stu9sia5shvvja
$ aws dynamodb scan --table-name "$DYNAMODB_TABLE_NAME" 
{
    "Items": [
        {
            "root": {
                "B": "bllmmmU4pFuyEZv1KjtTXGnO0Zg="
            },
            "lck": {
                "B": "Gb1LM9zoWn/hsfviUJ31vhTLdb0="
            },
            "vers": {
                "S": "__DOLT__"
            },
            "nbsVers": {
                "S": "4"
            },
            "specs": {
                "S": "4f4i07s5oj4et4pu85dlkokg0r4bhnet:7:plh9lge7etttc2qjh8stu9sia5shvvja:2"
            },
            "db": {
                "S": "primes"
            }
        }
    ],
    "Count": 1,
    "ScannedCount": 1,
    "ConsumedCapacity": null
}

You can see that the entry in DynamoDB associated with this remote has db="primes" as its key, and all the table files associated with this remote share a common key prefix of primes/ in S3. We will come back to this later when we talk about attenuated permissions against the cloud resources.
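
Since db is the table's hash key, you can also fetch the manifest entry for a single remote directly, rather than scanning:

$ aws dynamodb get-item \
    --table-name "$DYNAMODB_TABLE_NAME" \
    --key '{"db": {"S": "primes"}}'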

Cloning, Pushing, and Pulling

For now, let's see what it looks like to use this remote for cloning, pushing and pulling. We will clone the database into another directory:

$ cd ../
$ dolt clone 'aws://['"$DYNAMODB_TABLE_NAME":"$S3_BUCKET_NAME"']/primes' primes-copy
cloning aws://[dolt-aws-remote-example-7b7de821f78bb148:dolt-aws-remote-example-7b7de821f78bb148]/primes

and do our own writes against it:

$ cd primes-copy
$ dolt sql -q 'INSERT INTO primes VALUES
(26, 101),
(27, 103),
(28, 107),
(29, 109),
(30, 113),
(31, 127),
(32, 131),
(33, 137),
(34, 139),
(35, 149),
(36, 151);'
Query OK, 11 rows affected (0.01 sec)
$ dolt commit -am 'insert primes 26-36'
commit q487jfvt2cq8sfla9eoa7l2epu3cv7n2 (HEAD -> main) 
Author: Aaron Son <aaron@dolthub.com>
Date:  Tue Apr 25 14:24:25 -0700 2023

        insert primes 26-36

$ dolt push origin main:main
Uploaded 1.2 kB of 1.2 kB @ 7.74 kB/s.

And we can pull them from the original directory:

$ cd ../primes
$ dolt sql -q 'select count(*) from primes'
+----------+
| count(*) |
+----------+
| 25       |
+----------+

$ dolt pull origin main
Downloaded 4 chunks, 1.0 kB @ 2.81 kB/s.
Updating jvfj0hnej9toup6pntjmdho8rcbievtm..q487jfvt2cq8sfla9eoa7l2epu3cv7n2
Fast-forward
$ dolt sql -q 'select count(*) from primes'
+----------+
| count(*) |
+----------+
| 36       |
+----------+

Multiple Remotes

We can put any path on the end of the remote URL we use. The path will be used to prefix the table files stored in S3 and will become the db key for the entry stored in the DynamoDB table. For example, if we want to mirror a DoltHub repository to our AWS resources, we can do something like:

$ cd ../
$ dolt clone dolthub/standard-charge-files
cloning https://doltremoteapi.dolthub.com/dolthub/standard-charge-files
$ cd standard-charge-files
$ dolt remote add my_aws_mirror 'aws://['"$DYNAMODB_TABLE_NAME":"$S3_BUCKET_NAME"']/dolthub_mirrors/dolthub/standard-charge-files'
$ dolt push my_aws_mirror main:main
Uploaded 66 MB of 66 MB @ 40.25 B/s.
$ cd ..

The table files and DynamoDB entries for aws://[...]/dolthub_mirrors/dolthub/standard-charge-files are completely distinct from aws://[...]/primes. Because they use the same S3 bucket and DynamoDB table, they do share things like quota, billing, bucket permissions, and provisioned DynamoDB capacity.
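
If you want to convince yourself of that, you can list the mirror's S3 prefix and fetch its DynamoDB entry under its full path key:

$ aws s3 ls s3://"$S3_BUCKET_NAME"/dolthub_mirrors/dolthub/standard-charge-files/
$ aws dynamodb get-item \
    --table-name "$DYNAMODB_TABLE_NAME" \
    --key '{"db": {"S": "dolthub_mirrors/dolthub/standard-charge-files"}}'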

Attenuated Permissions

Because the data entries associated with different remotes are distinct, and because S3 and DynamoDB both support IAM policy condition keys for fine-grained access control, you can create IAM policies which only allow access to specific remotes. By restricting the available operations to reads, you can also create policies which allow only read-only access to a remote. Let's create two roles in our account: one which will allow read-only access to any Dolt remote, and one which will allow read-write access only to the primes remote.

$ aws iam create-role --role-name DoltRemoteReader --assume-role-policy-document '{"Version":"2012-10-17","Statement":{"Effect":"Allow","Principal":{"AWS":"arn:aws:iam::'$AWS_ACCOUNT_ID':root"},"Action":"sts:AssumeRole"}}'
{
    "Role": {
        "Path": "/",
        "RoleName": "DoltRemoteReader",
        "RoleId": "AROAS5I5E2G3ZEHUTE3VB",
        "Arn": "arn:aws:iam::355726134989:role/DoltRemoteReader",
        "CreateDate": "2023-04-25T21:52:51+00:00",
        "AssumeRolePolicyDocument": {
            "Version": "2012-10-17",
            "Statement": {
                "Effect": "Allow",
                "Principal": {
                    "AWS": "arn:aws:iam::355726134989:root"
                },
                "Action": "sts:AssumeRole"
            }
        }
    }
}
$ aws iam put-role-policy --role-name DoltRemoteReader --policy-name AllowReadingDoltRemotes --policy-document '{
  "Version": "2012-10-17",
  "Statement": [
    {
        "Effect": "Allow",
        "Action": [
          "s3:GetObject",
          "s3:ListBucket"
        ],
        "Resource": [
          "arn:aws:s3:::'$S3_BUCKET_NAME'",
          "arn:aws:s3:::'$S3_BUCKET_NAME'/*"
        ]
    },
    {
        "Effect": "Allow",
        "Action": [
          "dynamodb:DescribeTable",
          "dynamodb:GetItem"
        ],
        "Resource": [
          "arn:aws:dynamodb:'$AWS_REGION':'$AWS_ACCOUNT_ID':table/'$DYNAMODB_TABLE_NAME'"
        ]
    }
  ]
}'


$ aws iam create-role --role-name DoltRemotePrimesEditor --assume-role-policy-document '{"Version":"2012-10-17","Statement":{"Effect":"Allow","Principal":{"AWS":"arn:aws:iam::'$AWS_ACCOUNT_ID':root"},"Action":"sts:AssumeRole"}}'
$ aws iam put-role-policy --role-name DoltRemotePrimesEditor --policy-name AllowEditingPrimesRepo --policy-document '{
  "Version": "2012-10-17",
  "Statement": [
    {
        "Effect": "Allow",
        "Action": [
          "s3:AbortMultipartUpload",
          "s3:PutObject",
          "s3:GetObject",
          "s3:ListBucket"
        ],
        "Resource": [
          "arn:aws:s3:::'$S3_BUCKET_NAME'",
          "arn:aws:s3:::'$S3_BUCKET_NAME'/primes/*"
        ]
    },
    {   
        "Effect": "Allow",
        "Action": [
          "dynamodb:DescribeTable",
          "dynamodb:GetItem",
          "dynamodb:PutItem",
          "dynamodb:UpdateItem"
        ],
        "Resource": [
          "arn:aws:dynamodb:'$AWS_REGION':'$AWS_ACCOUNT_ID':table/'$DYNAMODB_TABLE_NAME'"
        ],
        "Condition": {
          "ForAllValues:StringEquals": {
            "dynamodb:LeadingKeys": [
                "primes"
            ]
          }
        }
    }
  ]
}'
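
As an optional sanity check before trying the roles with dolt, you can ask IAM's policy simulator whether a given action would be allowed. This assumes your own credentials are permitted to call iam:SimulatePrincipalPolicy, and the object key used below is just a placeholder:

$ aws iam simulate-principal-policy \
    --policy-source-arn arn:aws:iam::"$AWS_ACCOUNT_ID":role/DoltRemoteReader \
    --action-names s3:GetObject s3:PutObject \
    --resource-arns arn:aws:s3:::"$S3_BUCKET_NAME"/primes/example-object \
    --query 'EvaluationResults[].[EvalActionName,EvalDecision]'

We would expect s3:GetObject to come back allowed and s3:PutObject to come back implicitDeny, since DoltRemoteReader grants no write actions.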

So that we can see these in action, we need to configure profiles for the two roles in our AWS config file.

$ aws configure set profile.dolt_tutorial_read_only.region $AWS_REGION
$ aws configure set profile.dolt_tutorial_read_only.role_arn arn:aws:iam::"$AWS_ACCOUNT_ID":role/DoltRemoteReader
$ aws configure set profile.dolt_tutorial_read_only.source_profile ${AWS_PROFILE:=default}
$ aws configure set profile.dolt_tutorial_primes_editor.region $AWS_REGION
$ aws configure set profile.dolt_tutorial_primes_editor.role_arn arn:aws:iam::"$AWS_ACCOUNT_ID":role/DoltRemotePrimesEditor
$ aws configure set profile.dolt_tutorial_primes_editor.source_profile ${AWS_PROFILE:=default}
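
A quick way to confirm that role assumption works is to ask each profile who it is; the returned ARNs should reference the assumed roles:

$ aws sts get-caller-identity --profile dolt_tutorial_read_only --output text --query Arn
$ aws sts get-caller-identity --profile dolt_tutorial_primes_editor --output text --query Arn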

We can try them out from the CLI by calling dolt with the AWS_PROFILE environment variable set. For example, the dolt_tutorial_primes_editor role will not be able to clone the dolthub_mirrors/dolthub/standard-charge-files repository.

$ AWS_PROFILE=dolt_tutorial_primes_editor dolt clone 'aws://['"$DYNAMODB_TABLE_NAME":"$S3_BUCKET_NAME"']/dolthub_mirrors/dolthub/standard-charge-files' scf_should_fail
cloning aws://[dolt-aws-remote-example-7b7de821f78bb148:dolt-aws-remote-example-7b7de821f78bb148]/dolthub_mirrors/dolthub/standard-charge-files
error: failed to get remote db
cause: failed to get dynamo table: 'dolt-aws-remote-example-7b7de821f78bb148' - AccessDeniedException: User: arn:aws:sts::355726134989:assumed-role/DoltRemotePrimesEditor/1682462467004463000 is not authorized to perform: dynamodb:GetItem on resource: arn:aws:dynamodb:us-west-2:355726134989:table/dolt-aws-remote-example-7b7de821f78bb148 because no identity-based policy allows the dynamodb:GetItem action
       	status code: 400, request id: QBSN9MNKA7BLTNKT5TA4MJK0B7VV4KQNSO5AEMVJF66Q9ASUAAJG

But it can be used to clone and push primes:

$ AWS_PROFILE=dolt_tutorial_primes_editor dolt clone 'aws://['"$DYNAMODB_TABLE_NAME":"$S3_BUCKET_NAME"']/primes' primes_with_profile
$ cd primes_with_profile
$ AWS_PROFILE=dolt_tutorial_primes_editor dolt push origin main:new_branch
$ cd ..

The read-only role, on the other hand, can clone and pull everything, but cannot push anything:

$ AWS_PROFILE=dolt_tutorial_read_only dolt clone 'aws://['"$DYNAMODB_TABLE_NAME":"$S3_BUCKET_NAME"']/dolthub_mirrors/dolthub/standard-charge-files' scf_read_only
cloning aws://[dolt-aws-remote-example-7b7de821f78bb148:dolt-aws-remote-example-7b7de821f78bb148]/dolthub_mirrors/dolthub/standard-charge-files
$ cd scf_read_only
$ AWS_PROFILE=dolt_tutorial_read_only dolt fetch
$ AWS_PROFILE=dolt_tutorial_read_only dolt push origin main:fail_to_create_branch
error: push failed
cause: unknown push error; AccessDenied: Access Denied
       	status code: 403, request id: SXNXACTN79N3AP6W, host id: UyPrHzrhLNfDWExnELy1YTcGos/3pmR2KrE91qOA1p1kT6BLxIisEcLIk7IAxYYY9fm80htyqNs=
$ cd ..

Using similar policies for things like EC2 instance profiles or IAM roles for EKS service accounts, you can easily attenuate access for different pieces of your infrastructure to different Dolt databases. Similarly, you could use IAM policy substitution variables within these policies to allow credentials associated with web identity users to create their own Dolt remotes, for example.
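
To make that last idea concrete, here is a sketch of policy statements which would give each web identity user read-write access to remotes under their own key prefix. It is not something we run in this tutorial; the users/ naming convention, the placeholder ARNs, and the choice of the aws:userid policy variable are all illustrative assumptions:

{
    "Effect": "Allow",
    "Action": [
      "s3:AbortMultipartUpload",
      "s3:PutObject",
      "s3:GetObject",
      "s3:ListBucket"
    ],
    "Resource": [
      "arn:aws:s3:::<your-bucket>",
      "arn:aws:s3:::<your-bucket>/users/${aws:userid}/*"
    ]
},
{
    "Effect": "Allow",
    "Action": [
      "dynamodb:DescribeTable",
      "dynamodb:GetItem",
      "dynamodb:PutItem",
      "dynamodb:UpdateItem"
    ],
    "Resource": "arn:aws:dynamodb:<region>:<account-id>:table/<your-table>",
    "Condition": {
      "ForAllValues:StringEquals": {
        "dynamodb:LeadingKeys": ["users/${aws:userid}"]
      }
    }
}

A principal matching such a policy could push to and pull from aws://[<your-table>:<your-bucket>]/users/<their-user-id>, and nothing else.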

Backing Up to AWS

Dolt backups are snapshots of the entire database, including all working sets (changes made to tables which have not been added to a dolt commit yet), all remote refs, etc. Backups in Dolt are implemented by pushing the entire database state to a remote. We can use our AWS resources to store backups. For example, let's make a local change to our primes database. We will then back it up and restore it. The restored copy will have the local changes which were never committed to a dolt branch:

$ cd primes
$ dolt sql -q 'create table uncommitted_table (i int primary key)'
$ dolt status
On branch main
Untracked files:
  (use "dolt add <table>" to include in what will be committed)
	new table:      uncommitted_table
$ dolt backup sync-url 'aws://['"$DYNAMODB_TABLE_NAME":"$S3_BUCKET_NAME"']/backups/primes/first_backup'
Uploaded 3.9 kB of 3.9 kB @ 20.85 kB/s.

$ cd ..
$ dolt backup restore 'aws://['"$DYNAMODB_TABLE_NAME":"$S3_BUCKET_NAME"']/backups/primes/first_backup' primes_restored
$ cd primes_restored
$ dolt status
On branch main
Untracked files:
  (use "dolt add <table>" to include in what will be committed)
	new table:      uncommitted_table
$ cd ..
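
One note: dolt backup sync-url is handy for a one-off, but if you plan to take backups repeatedly, you can register the URL once as a named backup with dolt backup add and sync it by name from then on. The name aws_backup below is just a label we chose:

$ cd primes
$ dolt backup add aws_backup 'aws://['"$DYNAMODB_TABLE_NAME":"$S3_BUCKET_NAME"']/backups/primes/first_backup'
$ dolt backup sync aws_backup
$ cd ..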

Using AWS Remotes For SQL Replication

One form of Dolt replication supports read replication by pushing to a Dolt remote on every write. The read replicas are configured to pull new changes from that remote. Once the AWS resources exist, this is easy to accomplish. We are going to use our /primes remote directly as the replication remote, just for simplicity. First we start the writer.

$ dolt clone 'aws://['"$DYNAMODB_TABLE_NAME":"$S3_BUCKET_NAME"']/primes' primes_writer
$ cd primes_writer
$ dolt sql -q 'set @@persist.dolt_replicate_to_remote = origin'
$ dolt sql -q 'set @@persist.dolt_transaction_commit = 1'
$ dolt sql-server &
$ cd ..

And then we start the read replica:

$ dolt clone 'aws://['"$DYNAMODB_TABLE_NAME":"$S3_BUCKET_NAME"']/primes' primes_reader
$ cd primes_reader
$ dolt sql -q 'set @@persist.dolt_replicate_all_heads = 1'
$ dolt sql -q 'set @@persist.dolt_read_replica_remote = origin'
$ dolt sql-server --port 3308 &
$ cd ..

We can try it out using the mysql CLI:

$ mysql -u root
mysql> use primes_writer;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

mysql> insert into primes values (37, 157);
Query OK, 1 row affected (1.26 sec)

mysql> ^D
Bye

And on the read replica we should get the new value:

$ mysql -u root -h 127.0.0.1 -P 3308
mysql> use primes_reader;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> select * from primes where i = 37;
+------+------+
| i    | val  |
+------+------+
|   37 |  157 |
+------+------+
1 row in set (0.06 sec)

mysql> ^D
Bye

And be sure to shut down the server processes:

$ kill %1 %2
[1]  + done       dolt sql-server
[2]  + done       dolt sql-server --port 3308

Cleaning Up

That's the end of our examples for this blog. It's good hygiene to clean up all the resources we created here, and it will keep us from being billed for anything left behind. To do so, run the following commands:

$ aws iam delete-role-policy --role-name DoltRemotePrimesEditor --policy-name AllowEditingPrimesRepo
$ aws iam delete-role-policy --role-name DoltRemoteReader --policy-name AllowReadingDoltRemotes
$ aws iam delete-role --role-name DoltRemotePrimesEditor
$ aws iam delete-role --role-name DoltRemoteReader
$ aws s3 rm --recursive s3://$S3_BUCKET_NAME
$ aws s3api delete-bucket --bucket $S3_BUCKET_NAME
$ aws dynamodb delete-table --table-name $DYNAMODB_TABLE_NAME

You may also want to remove the [profile dolt_tutorial_read_only] and [profile dolt_tutorial_primes_editor] entries from your ~/.aws/config file at this point. You can do that with:

$ sed \
    -e '/^\[profile dolt_tutorial_primes_editor\]$/,/^\[/ d' \
    -e '/^\[profile dolt_tutorial_read_only\]$/,/^\[/ d' \
    -i.bak \
    ~/.aws/config

or you may choose to just edit the file.

Conclusion

We went through a cookbook-style tutorial on how to use AWS remotes with Dolt. After provisioning the necessary cloud resources, we were able to use our S3 bucket and DynamoDB table to support multiple distinct Dolt remote URLs. We saw how to clone, push, and pull from a remote; how to back up to one and restore from it; and how to use one as a rendezvous point in remotes-based SQL replication. By using fine-grained resource and condition specifiers in AWS IAM policies, we saw how to create IAM policies which allow read-only or read-write access to only some of the repositories stored in the S3 bucket and DynamoDB table.

If you have any questions about AWS remotes or Dolt usage in general, don't hesitate to stop by our discord to discuss it with us.
