Announcing Write Queries for DoltHub's SQL API

WEBFEATURE RELEASE
7 min read

DoltHub is a place on the internet to share, discover, and collaborate on Dolt databases. It's been our goal to add more features to make it easy to add and edit data on DoltHub. As a part of this roadmap, we're excited to announce our previously read-only SQL API is now writable!

How it works

DoltHub's SQL API exposes data via an HTTP endpoint at www.dolthub.com/api/v1alpha1. Our existing read-only url has the form:

https://www.dolthub.com/api/v1alpha1/<owner_name>/<repo_name>/<branch_name>?q=<sql_query>

Since adding, updating, and deleting data can take a bit of time to execute against larger databases, we made the writable API asynchronous. Therefore we needed to add two new HTTP endpoints for executing these kinds of queries using the SQL API.

The first endpoint starts the write operation with the database information and query. The url has the form:

https://www.dolthub.com/api/v1alpha1/<owner_name>/<repo_name>/write/<from_branch>/<to_branch>?q=<sql_query>

There are a few important things to note here:

  1. A nonexistent <to_branch> creates a new branch from the <from_branch>
  2. An empty <sql_query> merges <from_branch> into <to_branch>

This allows the user to branch, change, review, and merge data using this endpoint, similar to what the workflow would be using Dolt or DoltHub.

The first endpoint returns an operation name in the response, which can be used to poll the second endpoint until the operation has completed. This operation url has the form:

https://www.dolthub.com/api/v1alpha1/<owner_name>/<repo_name>/write?operationName=<operation_name>

Once done === true the response will include a to_commit_id and from_commit_id, which can be used to query the dolt_commit_diff_$tablename table for reviewing changes.

An example

Our SHAQ basketball bounty finished a few days ago. Tim wrote a blog about some work needed to clean it up, which mostly includes deduping players added twice with different player_ids. This could be a great use case for the writable SQL API.

Authentication

Making the SQL API writable required putting in place some kind of authentication system. Anyone can access our read-only endpoint for public databases, but making changes to a database requires write permissions. We created API tokens, which can be used in the header of an API request to authenticate. You can learn more about this process in our docs here.

Now that we have our API token, I can fork dolthub/SHAQ to my own namespace (tbantle/SHAQ) and we can move forward updating players in our SHAQ fork.

What we want to change

We can use Dolt, a query on DoltHub, or the read-only SQL API endpoint to see which players have more than one entry.

SHAQ $ dolt sql -q "select concat(lower(first_name), ' ', lower(last_name)) as name, count(*) from players group by name having count(name) > 1 order by count(*) desc limit 10"
+-----------------+----------+
| name            | count(*) |
+-----------------+----------+
| brandon brown   | 37       |
| brandon johnson | 28       |
| jordan jones    | 27       |
| marko popovic   | 27       |
| jordan johnson  | 25       |
| chris williams  | 23       |
| chris johnson   | 23       |
| jordan davis    | 23       |
| justin robinson | 22       |
| jamar smith     | 22       |
+-----------------+----------+

I scanned the list and found two entries for a college player named Dimitri Saliba in the players table, one with player_id of 714287 and the other 15404617. In the player_season_stat_totals table there are four entries for these IDs, two for his years at Western Wyoming (NJCAA DI) and two for his years at Colorado Christian Cougars (NCAA DII).

After a quick Google to make sure the same person played at both these schools, we want to remove one of the players entries for Dimitri Saliba and update the player_id in the player_season_stat_totals table. We're going to keep player_id 714287 since it has more information filled out.

Using the writable SQL API

The process of using the SQL API to make changes requires a few steps.

1. Run query on new branch

First, we want to hit the write endpoint (using the Python requests library) with a from branch (main), to branch (update-dimitri, which does not exist yet), and an UPDATE query. I'm going to put this logic in a function because we'll use it again later.

import requests

owner, repo, from_branch, to_branch = 'tbantle', 'SHAQ', 'main', 'update-dimitri'
query = '''UPDATE player_season_stat_totals SET player_id=714287 WHERE player_id=15404617'''
headers = {
    'authorization': 'token [token you copied from Settings > Tokens (starts with dhat.v1.)]'
}

def run_write_query(from_branch_name, to_branch_name, q):
  res = requests.post(
          f'https://www.dolthub.com/api/v1alpha1/{owner}/{repo}/write/{from_branch_name}/{to_branch_name}',
          params={'q': query},
          headers=headers,
      )
  return res.json()


update_res = run_write_query(from_branch, to_branch, query)
print(update_res)

The JSON response includes an operation name, which can be used to poll the operation endpoint.

{
  "query_execution_status": "Success",
  "query_execution_message": "",
  "repository_owner": "tbantle",
  "repository_name": "SHAQ",
  "to_branch_name": "update-dimitri",
  "from_branch_name": "main",
  "query": "UPDATE player_season_stat_totals SET player_id=714287 WHERE player_id=15404617",
  "operation_name": "operations/72abb56b-d478-43ae-9a2d-c9602184c7ab"
}

2. Poll operation endpoint

Now, we need a function that makes a get request to the operation endpoint, and calls this function every few seconds until the done field in the response is true.

import time

def get_operation(op_name):
    poll_res = requests.get(
        f'https://www.dolthub.com/api/v1alpha1/{owner}/{repo}/write',
        params={'operationName': op_name},
        headers=headers,
    )
    return poll_res.json()


def poll_operation(op_name):
    done = False
    while done == False:
        poll_res = get_operation(op_name)
        done = poll_res['done']
        if done:
            return poll_res
        else:
            time.sleep(3)


op_res = poll_operation(update_res['operation_name'])
print(op_res)

We'll get some query metadata when the operation is done, which includes to and from commit ids that can be used to query the dolt_commit_diff_$tablename table.

{
  "_id": "operations/72abb56b-d478-43ae-9a2d-c9602184c7ab",
  "done": true,
  "res_details": {
    "query_execution_status": "Success",
    "query_execution_message": "Query OK, 2 rows affected. Rows matched: 2  Changed: 2  Warnings: 0.",
    "owner_name": "tbantle",
    "repository_name": "SHAQ",
    "from_commit_id": "518ue176ec13qf563e87uoerkqqepood",
    "to_commit_id": "5sp7i0bov6itma2u04dpk15ui24lrigp"
  }
}

3. View diff

A lot of Dolt's functionality is exposed via SQL, so we can query commit diffs using the API. Each row in the dolt_commit_diff_$tablename table represents a row that has changed between two commits. We can use this table to view the resulting changes from our UPDATE query. To make the results easier to read, we're only going to select a few columns.

def get_dolt_commit_diff(table_name, columns, from_commit, to_commit, branch):
    col_str = ', '.join((columns)
    query = f'''SELECT {col_str}, diff_type FROM `dolt_commit_diff_{table_name}` WHERE from_commit="{from_commit}" AND to_commit="{to_commit}"'''
    diff_res = requests.get(
        f'https://www.dolthub.com/api/v1alpha1/{owner}/{repo}/{branch}',
        params={'q': query},
        headers=headers,
    )
    return diff_res.json()


details = poll_res['res_details']
cols = ['from_player_id', 'to_player_id']
get_dolt_commit_diff('player_season_stat_totals', cols, details['from_commit_id'], details['to_commit_id'], to_branch)

As with any read query, the response includes some metadata, the table schema, and the result rows.

{
  "query_execution_status": "Success",
  "query_execution_message": "",
  "repository_owner": "tbantle",
  "repository_name": "SHAQ",
  "commit_ref": "update-dimitri",
  "sql_query": "SELECT from_player_id, to_player_id, diff_type FROM `dolt_commit_diff_player_season_stat_totals` WHERE from_commit='518ue176ec13qf563e87uoerkqqepood' AND to_commit='5sp7i0bov6itma2u04dpk15ui24lrigp'",
  "schema": [
    {
      "columnName": "from_player_id",
      "columnType": "INT",
      "isPrimaryKey": false
    },
    {
      "columnName": "to_player_id",
      "columnType": "INT",
      "isPrimaryKey": false
    },
    { "columnName": "diff_type", "columnType": "TEXT", "isPrimaryKey": false }
  ],
  "rows": [
    {
      "from_player_id": "null",
      "to_player_id": "714287",
      "diff_type": "added"
    },
    {
      "from_player_id": "null",
      "to_player_id": "714287",
      "diff_type": "added"
    },
    {
      "from_player_id": "15404617",
      "to_player_id": "null",
      "diff_type": "removed"
    },
    {
      "from_player_id": "15404617",
      "to_player_id": "null",
      "diff_type": "removed"
    }
  ]
}

4. Repeat

Now that we've updated the player_id in the player_season_stat_totals table, we can delete the extra Dimitri Saliba entry in the players table. We can repeat steps 1-3 above using the functions we wrote, but changing some of the arguments.

# Run query
delete_query = 'DELETE FROM players WHERE player_id=15404617'
delete_res = run_write_query(to_branch, to_branch, delete_query)

# Poll operation
op_res = poll_operation(delete_res['operation_name'])

# View diff
details = poll_res['res_details']
get_dolt_commit_diff('players', details['from_commit_id'], details['to_commit_id'], to_branch)

5. Merge branches

Once we're satisfied with our changes, we can merge our branches using our run_write_query function without a query.

merge_res = run_write_query(to_branch, from_branch, '')
poll_res = poll_operation(merge_res['operation_name'])
print(poll_res)

We get a successful response with an empty to_commit_id.

{
  "_id": "operations/0690bd2c-024b-44fb-bd9e-7163a484d17a",
  "done": true,
  "res_details": {
    "query_execution_status": "Success",
    "query_execution_message": "",
    "owner_name": "tbantle",
    "repository_name": "SHAQ",
    "from_commit_id": "518ue176ec13qf563e87uoerkqqepood",
    "to_commit_id": ""
  }
}

If we look at our from branch (main in this case), we should see the commits generated from our queries in the log.

Commit log for main

Conclusion

Now that the SQL API is writable, you really can do anything with your data. We're still continuing to improve the SQL API, and it is still in alpha. Its shape may change over time, but the basic principle of data access through SQL will remain. If you have any questions or want to talk to us about your use case, please feel free to reach out on Discord.

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt

Or join our mailing list to get product updates.