Dolt as an Application Server


A question we have been asked numerous times is, "Can Dolt be used as an application server?" It has driven a lot of internal conversation about the use cases of a versioned database server, and has led to some very technical discussions about concurrency modes, transactions, and branch merge semantics within SQL.

One of the first use cases we've spent time addressing is using Dolt's MySQL-compatible server as a compatibility layer for existing tooling. We have invested a good deal of time in our MySQL-compatible server, and we hope that one day any tool that can speak to a MySQL server will be able to speak to a Dolt server. When thinking about this use case, we typically picture a server connected to a single client, a third-party tool in this case, running queries against Dolt on behalf of an end user. This could be a GUI rendering graphs backed by Dolt data, an ML application training on Dolt data, or an application written specifically against Dolt to unlock features you don't get with other platforms. This single-client case is what I'll be covering in this blog post, with future posts covering the more complicated use cases for Dolt as an application server.

The Application

Katie McCulloch covered reading data from Dolt in order to filter profanity in a previous blog post. I will be extending that application to support user edits to the dataset. These edits will be written and committed to the local Dolt database so that they can be pushed to DoltHub, where users can submit PRs to contribute their additions back to the dataset.

Starting the Server

The ChatBot uses DoltPy to interact with Dolt. In the previous iteration of this program, all Dolt interactions were done using Dolt CLI commands via Python's subprocess module. In this iteration we'll start a MySQL-compatible server and issue SQL commands against it to filter profanity and to add new words to the list of banned words.

import argparse
import time

import mysql.connector

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--remote-name',
                        default='dolthub/bad-words',
                        help='The DoltHub remote from which to pull a set of bad words')
    parser.add_argument('--checkout-dir',
                        default='bad-words',
                        help='The local directory to clone the remote to')
    args = parser.parse_args()

    # Pull the latest data, then start a MySQL-compatible server in the background.
    repo = clone_or_pull_latest(args.remote_name, args.checkout_dir)
    repo.start_server()
    time.sleep(1)

    cnx = None

    try:
        cnx = mysql.connector.connect(user="root", host="127.0.0.1", port=3306, database="bad_words")
        cnx.autocommit = True

        # Read the set of supported language codes.
        languages_df = repo.read_table(LANGUAGES_TABLE)
        language_codes = {x: True for x in languages_df['language_code']}

        # Enter chat loop
        chat_loop(repo, cnx, language_codes)
    finally:
        commit_new_bad_and_stop_server(repo, cnx)

After processing the command-line arguments and reading the latest version of the data from DoltHub, we call repo.start_server() to start a background server which will run for the lifetime of the application. We sleep for one second to give the server time to start (not the best approach, but it works for this simple example). We put the rest of the code in a try/finally block to make sure the server gets stopped regardless of how the program terminates. By default, the server only allows a single connection, and that connection directly affects the working set of data, which is exactly the use case I am demonstrating here.
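If you want something more robust than a fixed sleep, a small retry loop works. This is a sketch of my own, not part of the application; the wait_for_server name and retry parameters are invented for illustration:

import time

import mysql.connector
from mysql.connector import Error

def wait_for_server(retries=20, delay=0.25):
    # Retry connecting until the Dolt SQL server accepts connections,
    # rather than hoping a one-second sleep is long enough.
    for _ in range(retries):
        try:
            return mysql.connector.connect(user="root", host="127.0.0.1",
                                           port=3306, database="bad_words")
        except Error:
            time.sleep(delay)
    raise RuntimeError("Dolt SQL server did not start in time")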

One gotcha I ran into when writing this application: the official Python MySQL connector turns autocommit off by running SET @@autocommit=false; against every new connection it makes to the server. As a result, we need to explicitly set it back to true with cnx.autocommit = True. Using our new connection to the server, we read the list of supported languages and then start our chat loop.
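Alternatively, the connector can be told not to flip the flag at connection time. Passing the autocommit connection argument should have the same effect as setting the property after connecting:

# Assumed alternative: mysql.connector.connect accepts an autocommit argument.
cnx = mysql.connector.connect(user="root", host="127.0.0.1", port=3306,
                              database="bad_words", autocommit=True)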

Adding New Words or Phrases

Users of the chatbot can now add new words to the list of bad words using the syntax !bad! <language> <word or phrase>. When the chatbot sees a comment starting with !bad!, it parses the arguments, checks whether the word or phrase already exists in the bad_words table, and inserts it if it doesn't. Both the existence check and the insert use standard SQL, run against the server we started earlier:

HAS_BAD = '''SELECT count(*)
FROM bad_words
WHERE language_code='%s' AND bad_word='%s';'''

NEW_BAD_WORD_QUERY = '''INSERT INTO bad_words (language_code, bad_word)
VALUES ('%s','%s');'''

def add_bad_word(repo, cnx, language_code, word):
    # Check whether this word or phrase is already in the table.
    query_str = HAS_BAD % (language_code, word)
    cursor = repo.query_server(query_str, cnx)

    row = cursor.fetchone()

    if row[0] == 0:
        query_str = NEW_BAD_WORD_QUERY % (language_code, word)
        repo.query_server(query_str, cnx)
        print("> ChatBot: New bad word '%s' added. You can commit this upon exit." % word)
    else:
        print("> ChatBot: '%s' has already been added." % word)

Using the Latest Data

The previous iteration of the chatbot read all the bad words at the start of execution. Since we modify the list of bad words during execution, we re-read them before filtering new text.

BAD_WORDS_QUERY = '''SELECT bad_word
FROM bad_words;'''

def censor_text(text, repo, cnx):
    # Re-read the full list of bad words so any new additions take effect.
    cursor = repo.query_server(BAD_WORDS_QUERY, cnx)
    bad_words = {row[0]: True for row in cursor}

    censored = False
    censored_text = text
    for bad_word in bad_words.keys():
        # Simple substring censoring: replace each bad word with asterisks.
        if bad_word in censored_text:
            censored_text = censored_text.replace(bad_word, '*' * len(bad_word))
            censored = True

    return censored, censored_text

Of course, reading the entire table each time a statement is processed isn't optimal from a performance point of view, but it did get me thinking about cool Dolt-specific optimizations that could be made. Because Dolt stores data using a Merkle DAG, we could expose the hash which represents the table value, and check it to see if the data has changed at all since it was last read. Another optimization would be to keep the state of the table in memory, and then read from the dolt_diff_bad_words table to find only the words that have changed since the last read.
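To sketch that second idea: read the table once, cache it, and on later reads pull only the working-set additions from dolt_diff_bad_words. The get_bad_words wrapper and its module-level cache below are illustrative, not part of the application, and assume the session starts from a clean working set:

# Rows added to the working set since the last commit.
NEW_WORDS_QUERY = '''SELECT to_bad_word
FROM dolt_diff_bad_words
WHERE to_commit='WORKING';'''

_bad_words_cache = None

def get_bad_words(repo, cnx):
    global _bad_words_cache
    if _bad_words_cache is None:
        # First read: load the whole table into memory.
        cursor = repo.query_server(BAD_WORDS_QUERY, cnx)
        _bad_words_cache = {row[0] for row in cursor}
    else:
        # Later reads: fetch only words added during this session.
        cursor = repo.query_server(NEW_WORDS_QUERY, cnx)
        _bad_words_cache.update(row[0] for row in cursor)
    return _bad_words_cache

Note that committing the changes empties the diff against WORKING, so a cache like this would need to be invalidated on commit.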

Add and Commit on Exit

When a user sends the message "bye" to the chatbot, we will query the dolt_diff_bad_words table to see if there are any changes to the bad_words table. If changes exist we will display them to our user, and prompt them for a commit message. Then we add the working table to the list of tables staged for our next commit, and commit the tables before stopping the server and exiting.

CHANGE_QUERY = '''SELECT to_bad_word, to_language_code, from_bad_word, from_language_code
FROM dolt_diff_bad_words
WHERE to_commit='WORKING';'''

def commit_new_bad_and_stop_server(repo, cnx=None):
    try:
        if cnx is not None:
            # Query the diff table for uncommitted changes to bad_words.
            cursor = repo.query_server(CHANGE_QUERY, cnx)
            new_words = {row[0]: row[1] for row in cursor}

            if len(new_words) > 0:
                print('> ChatBot: %d new words added.' % len(new_words))

                for word, language_code in new_words.items():
                    print("     word: %16s, language code: %s" % (word, language_code))

                print('> ChatBot: Add a description for these changes.')

                commit_msg = input("> Me: ")

                # Stage the modified table and commit it with the user's message.
                repo.add_table_to_next_commit("bad_words")
                repo.commit(commit_msg)

                print('> ChatBot: These changes have been committed to your local master.')
                print('>        : run "dolt push origin master:<branch>" and visit dolthub.com to create a PR')
    finally:
        repo.stop_server()

Pushing Changes and Creating a PR

If we were looking to create the greatest dataset of bad language in the world, it would be critical to allow collaboration between speakers of different dialects and languages. No one person would be able to do a good job creating that dataset, and data collaboration is one of the ways we think that Dolt can improve all kinds of different datasets. If we have changes we want to contribute back to the dataset, we will push our local data to a remote branch so that we can create a PR for an admin of the dataset to review and integrate back into the main branch.

After exiting the program, you'll need to run dolt push origin master:<remote_branch_name>. This will push your changes up to the remote on a branch name of your choosing. From there you can open a PR by navigating to the DoltHub bad words repository page, selecting "Pull Requests" from the left navigation, and clicking the "Create Pull Request" button. Select your branch as the "From branch" and master as the "To branch".

Conclusion

This is the simplest case for using Dolt as an application server. At its core, this is an example of writing a tool that lets users edit data on their local machine using a custom interface, and then contribute those changes back to some master dataset. While it is a simple case, it has numerous applications. It allows non-technical data collaborators to contribute to open data projects. It allows companies to develop tools where bad data can be filtered out before it is merged to master and impacts production systems. It also gives you an audit log of who changed the data, and why. You just don't get this out of the box with other platforms; getting these kinds of features into your system would require a lot of additional software.

Don't take my word for it though. Install Dolt and download some data from DoltHub. If there is data you don't see in our catalog that you would really like to use, or if you would like to collaborate on a dataset, please reach out to us.
