F*#%! you (in 4 languages)

April 17, 2020

Dolt is to DoltHub as Git is to GitHub - except with Dolt, the unit of versioning is SQL tables. Dolt also has Git-like semantics such as pull, branch and merge. By running dolt pull in a Dolt repository, you know you are getting the most up-to-date data. Not only that, but if there are issues running it against your application, you can quickly roll back to the last working version of your dataset. In this post, I will walk through a simple Python application using Dolt’s bad-words dataset to create a chat bot and profanity filter.

Start with the data

First, I need to import bad words into a Dolt repository so that our chat bot knows which words should be considered profane. I decide to use this GitHub repository, which has one file per language, each containing a list of bad words. I run a Python script to compile the contents of each file into bad_words.csv, with each row being a language_code, bad_word pair. Then I use the dolt sql shell to create the bad_words table:

doltsql> CREATE TABLE `bad_words` (
`bad_word` VARCHAR(100) NOT NULL,
`language_code` VARCHAR(20) NOT NULL,
PRIMARY KEY (`language_code`,`bad_word`)
);
doltsql> describe bad_words;
+---------------+--------------+------+-----+---------+-------+
| Field         | Type         | Null | Key | Default | Extra |
+---------------+--------------+------+-----+---------+-------+
| bad_word      | VARCHAR(100) | NO   | PRI |         |       |
| language_code | VARCHAR(20)  | NO   | PRI |         |       |
+---------------+--------------+------+-----+---------+-------+

And import the data:

$ dolt table import -u bad_words bad_words.csv
Rows Processed: 2660, Additions: 2660, Modifications: 0, Had No Effect: 0
Import completed successfully.
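
The script that builds bad_words.csv is not shown in this post, so here is only a rough sketch of what that compilation step could look like. It assumes a local directory that contains nothing but the word-list files, each named after its language code (for example `en`), with one word or phrase per line. The function name `compile_bad_words` and both paths are hypothetical.

```python
import csv
from pathlib import Path


def compile_bad_words(source_dir, output_csv):
    """Write one (language_code, bad_word) row per non-empty line of each file."""
    rows = set()  # a set drops duplicates, matching the table's primary key
    for path in sorted(Path(source_dir).iterdir()):
        if not path.is_file() or path.name.startswith("."):
            continue
        # Assumption: the file name (minus any extension) is the language code.
        language_code = path.stem
        for line in path.read_text(encoding="utf-8").splitlines():
            word = line.strip()
            if word:
                rows.add((language_code, word))

    with open(output_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["language_code", "bad_word"])  # header matches the table columns
        writer.writerows(sorted(rows))
    return len(rows)
```

Calling `compile_bad_words("word_lists", "bad_words.csv")` would produce a file with the same two columns as the bad_words table, ready for an import like the one above.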

After I add, commit and push my branch, I submit my changes in a pull request on DoltHub. Next, I merge in 22 of the 30 open pull requests from the GitHub repository. Check out that pull request here. We now believe we have the most up-to-date bad words dataset on the internet.

Now we can run SQL queries against bad_words to understand what’s in the data:

doltsql> SELECT COUNT(*) AS total_bad_words FROM bad_words;
+-----------------+
| total_bad_words |
+-----------------+
| 4054            |
+-----------------+
doltsql> SELECT * FROM bad_words WHERE bad_word = 'poopy mcpoop face';
+---------------+-------------------+
| language_code | bad_word          |
+---------------+-------------------+
| en            | poopy mcpoop face |
+---------------+-------------------+
doltsql> SELECT COUNT(bad_word) AS count_fuck_words FROM bad_words WHERE language_code = 'en' AND bad_word LIKE '%fuck%';
+------------------+
| count_fuck_words |
+------------------+
| 70               |
+------------------+

Check out the bad-words DoltHub repo for yourself, and use the online SQL editor to see how the count of shit words and count of ass words compare. We also have a few saved queries including number of bad words by language.

Using Dolt data in my application

After setting up the data with Dolt, we load it into our chat bot Python script and iterate through the bad_word column to censor user input. Every time the script starts, we load the most up-to-date Dolt data into a Pandas DataFrame with the help of doltpy. You can check out the script here.
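
To give a feel for the censoring step, here is a minimal sketch, not the actual chat bot script. It assumes the bad words have already been pulled out of the DataFrame's bad_word column into a plain list, so the doltpy loading step is left out. The helper name `build_censor` is hypothetical, and whole-word, case-insensitive matching is my assumption about how a filter like this might behave.

```python
import re


def build_censor(bad_words):
    """Return a function that masks any of `bad_words` in a message."""
    # Longest first, so multi-word phrases win over shorter words they contain.
    ordered = sorted({w.lower() for w in bad_words if w}, key=len, reverse=True)
    if not ordered:
        return lambda message: message
    # Lookarounds keep matches on whole words: "darn" should not hit "darning".
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(w) for w in ordered) + r")(?!\w)",
        re.IGNORECASE,
    )
    # Replace each match with asterisks of the same length.
    return lambda message: pattern.sub(lambda m: "*" * len(m.group()), message)


censor = build_censor(["darn", "poopy mcpoop face"])
print(censor("Well, DARN it!"))  # prints "Well, **** it!"
```

Compiling one pattern up front means the word list is scanned once at startup rather than once per message, which fits a script that reloads the data each time it starts.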