Using Dolt to Manage Train/Test Splits

11 min read

Twitter is wonderful sometimes.

Train/Test Tweet

We don't know Aaron. He finds us on Twitter, asks a great question, makes us think, and prompts a blog post.

How can you use Dolt to manage train/test splits for your Machine Learning models? Dolt is a SQL database with Git-style versioning. DoltHub is a place on the internet to share and collaborate on Dolt databases. It seems like such a thing could help manage train/test splits.

Problems Managing Train/Test Splits

The basic idea of a train/test split is to divide your data as fairly as possible between train and test and not let information leak from test to train. Once you've assigned data to a set, you track that assignment over time. If new data comes in, you need to assign it to train and test using the same method. The most common practice for train/test splits is a random sample of something like 80% train, 20% test.

Random sampling is hard. If you have a label that is fairly uncommon, like say the label only occurs on less than 1% of the data, it's possible you can run a random sample and get none of that label in the training or test set. So, you may want to run a 80/20 split on each label. What if you have multiple labels per record? It starts to get complicated.

Moreover, there are all sorts of problems with requiring computers to generate random numbers. Most methods for storing data are inherently not random so you have to be careful with slicing arrays by index number or assuming a hashing algorithm is random.

You also have to preserve your train/test split over time. If you did something fancier than random, you need to re-run the split the same way if new data comes in. If the new data messed up the split and you need to move train records to test or vice versa, your old models will over perform on the new test set. The test data was used to train the model.

Dolt Can Help

Dolt is a versioned database. So, you get all the power of a version control system like Git and all the power of a SQL database. Both of these powers help you with train/test splits. We'll walk you through Dolt's power using the famous Iris dataset we have on DoltHub.

All these commands are standard Git and SQL so if you know those, Dolt should feel very familiar. If you don't know Git or SQL, learning Dolt can help you learn those two tools.

dolt $ dolt clone dolthub/classified-iris-measurements
cloning https://doltremoteapi.dolthub.com/dolthub/classified-iris-measurements
20 of 20 chunks complete. 0 chunks being downloaded currently.
dolt $ cd classified-iris-measurements/
classified-iris-measurements $ dolt checkout -b train-test-split
Switched to branch 'train-test-split'
classified-iris-measurements $ dolt sql -q "show tables"
+-------------------------+
| Table                   |
+-------------------------+
| classified_measurements |
+-------------------------+

We'll use the train-test-split branch so we don't mess up master. Making this change on your own branch leaves master clean for others to use, even for non-machine learning use cases. You can depend on the actual production data without needing a job to write a new copy for the machine learning team.

Adding columns and generating random variables is as easy as a couple queries. First add a new boolean is_test column.

classified-iris-measurements $ dolt sql -q "alter table classified_measurements add column is_test boolean"
classified-iris-measurements $ dolt diff
diff --dolt a/classified_measurements b/classified_measurements
--- a/classified_measurements @ 6mjvsm4it08dqgpe19k4475j0gjhle1a
+++ b/classified_measurements @ k9d51bnt1c87lso58c2jlro9b82c35bd
  CREATE TABLE classified_measurements (
    `measurement_id` BIGINT NOT NULL COMMENT 'tag:7601'
    `sepal_length` DOUBLE COMMENT 'tag:3679'
    `sepal_width` DOUBLE COMMENT 'tag:4570'
    `petal_length` DOUBLE COMMENT 'tag:2621'
    `petal_width` DOUBLE COMMENT 'tag:12612'
    `class` LONGTEXT COMMENT 'tag:12738'
+   `is_test` TINYINT COMMENT 'tag:15402'
     PRIMARY KEY (`measurement_id`)
  );

+-----+----------------+--------------+-------------+--------------+-------------+-------+---------+
|  <  | measurement_id | sepal_length | sepal_width | petal_length | petal_width | class |         |
|  >  | measurement_id | sepal_length | sepal_width | petal_length | petal_width | class | is_test |
+-----+----------------+--------------+-------------+--------------+-------------+-------+---------+
+-----+----------------+--------------+-------------+--------------+-------------+-------+---------+

Then, update the is_test column with a random 80/20 split.

classified-iris-measurements $ dolt sql -q "update classified_measurements set is_test = rand() > .8"
Query OK, 150 rows affected
Rows matched: 150  Changed: 150  Warnings: 0
classified-iris-measurements $ dolt sql -q "select count(*) from classified_measurements where is_test=1" 
+----------+
| COUNT(*) |
+----------+
| 28       |
+----------+
classified-iris-measurements $ dolt sql -q "select round((count(*)+.0000001)/150, 3) as test_percent from classified_measurements where is_test=1" 
+--------------+
| test_percent |
+--------------+
| 0.187        |
+--------------+

Notice how we inspected what you did by running dolt diff and SQL queries. If you don't like what you did, just run dolt reset --hard and go back to the original state of the master branch. You want to try two different approaches, just make two branches. If someone else on your team wants to try something different, she can have her own branch too. This is the power of Git-style version control. Everyone can have his or her own copy of the data, make modifications, and merge those modifications together later.

We'll check in our work now using dolt add and dolt commit.

classified-iris-measurements $ dolt add .
classified-iris-measurements $ dolt commit -m "Added a is_test column and applied a 20% random split"
commit orp59ejukv1njektni82822cj159j8u0
Author: Tim Sehn <tim@liquidata.co>
Date:   Mon May 04 15:02:10 -0700 2020

	Added a is_test column and applied a 20% random split

Now, let's simulate new data arriving on master.

classified-iris-measurements $ dolt checkout master
Switched to branch 'master'
classified-iris-measurements $ dolt sql -q "insert into classified_measurements select 151, sepal_length, sepal_width, petal_length, petal_width, class from classified_measurements where measurement_id=150"
Query OK, 1 row affected
classified-iris-measurements $ dolt diff
diff --dolt a/classified_measurements b/classified_measurements
--- a/classified_measurements @ ir8lgv0vrj61eib3vb90tajveq99mu7g
+++ b/classified_measurements @ k9d51bnt1c87lso58c2jlro9b82c35bd
+-----+----------------+--------------+-------------+--------------+-------------+----------------+
|     | measurement_id | sepal_length | sepal_width | petal_length | petal_width | class          |
+-----+----------------+--------------+-------------+--------------+-------------+----------------+
|  +  | 151            | 5.9          | 3           | 5.1          | 1.8         | Iris-virginica |
+-----+----------------+--------------+-------------+--------------+-------------+----------------+
classified-iris-measurements $ dolt add .
classified-iris-measurements $ dolt commit -m "Simulate new data arriving by copying measurement 150 into 151"
commit as2gej6vkvk1f9gld75ce0g3ke6mhquf
Author: Tim Sehn <tim@liquidata.co>
Date:   Mon May 04 15:09:52 -0700 2020

	Simulate new data arriving by copying measurement 150 into 151

Now, you can merge it down into your branch using dolt merge master. That data will have a NULL is_test column. You apply the same splitting logic to the new data.

classified-iris-measurements $ dolt branch
* master                                        	
  train-test-split                              	
classified-iris-measurements $ dolt checkout train-test-split
Switched to branch 'train-test-split'
classified-iris-measurements $ dolt sql -q "select count(*) from classified_measurements"
+----------+
| COUNT(*) |
+----------+
| 150      |
+----------+
classified-iris-measurements $ dolt merge master
Updating orp59ejukv1njektni82822cj159j8u0..as2gej6vkvk1f9gld75ce0g3ke6mhquf
classified_measurements | 1
1 tables changed, 1 rows added(+), 0 rows modified(*), 0 rows deleted(-)
classified-iris-measurements $ dolt sql -q "select count(*) from classified_measurements"
+----------+
| COUNT(*) |
+----------+
| 151      |
+----------+
classified-iris-measurements $ dolt sql -q "select * from classified_measurements where is_test is NULL"
+----------------+--------------+-------------+--------------+-------------+----------------+---------+
| measurement_id | sepal_length | sepal_width | petal_length | petal_width | class          | is_test |
+----------------+--------------+-------------+--------------+-------------+----------------+---------+
| 151            | 5.9          | 3           | 5.1          | 1.8         | Iris-virginica | <NULL>  |
+----------------+--------------+-------------+--------------+-------------+----------------+---------+
classified-iris-measurements $ dolt sql -q "update classified_measurements set is_test = rand() > .8 where is_test is NULL" 
Query OK, 1 row affected
Rows matched: 1  Changed: 1  Warnings: 0
classified-iris-measurements $ dolt diff
diff --dolt a/classified_measurements b/classified_measurements
--- a/classified_measurements @ 2pk26vrc533rclborq5brb3fg6rh932j
+++ b/classified_measurements @ s4tmgtve0e1ndgc3v860am8tgpteq5v8
+-----+----------------+--------------+-------------+--------------+-------------+----------------+---------+
|     | measurement_id | sepal_length | sepal_width | petal_length | petal_width | class          | is_test |
+-----+----------------+--------------+-------------+--------------+-------------+----------------+---------+
|  <  | 151            | 5.9          | 3           | 5.1          | 1.8         | Iris-virginica | <NULL>  |
|  >  | 151            | 5.9          | 3           | 5.1          | 1.8         | Iris-virginica | 0       |
+-----+----------------+--------------+-------------+--------------+-------------+----------------+---------+
classified-iris-measurements $ dolt add .
classified-iris-measurements $ dolt commit -m "Merged master and assigned train test split to new data"
commit 34oimqcpko1q706i4t423cbhf8kbprdu
Merge: as2gej6vkvk1f9gld75ce0g3ke6mhquf orp59ejukv1njektni82822cj159j8u0
Author: Tim Sehn <tim@liquidata.co>
Date:   Mon May 04 15:14:26 -0700 2020

	Merged master and assigned train test split to new data

Take a moment to marvel at how cool that was. We had data being written to master in a different schema than ours. We took that data and merged it into our schema with one command. We then applied our train test split with another command. All the while, we had diffs and a log available to make sure the right thing was happening.

Let's investigate the distribution of labels in the test and training set.

classified-iris-measurements $ dolt sql -q "select class, is_test, count(*) from classified_measurements group by class,is_test order by class, is_test"
+-----------------+---------+----------+
| class           | is_test | COUNT(*) |
+-----------------+---------+----------+
| Iris-setosa     | 0       | 43       |
| Iris-setosa     | 1       | 7        |
| Iris-versicolor | 0       | 42       |
| Iris-versicolor | 1       | 8        |
| Iris-virginica  | 0       | 38       |
| Iris-virginica  | 1       | 13       |
+-----------------+---------+----------+

Looks like we oversampled in Iris-virginica. Let's rerun the split just for that class.

classified-iris-measurements $ dolt sql -q "update classified_measurements set is_test = rand() > 0.8 where class='Iris-virginica'"
Query OK, 18 rows affected
Rows matched: 51  Changed: 18  Warnings: 0
classified-iris-measurements $ dolt diff
diff --dolt a/classified_measurements b/classified_measurements
--- a/classified_measurements @ jfkhaqmgg0ehlmnq8n6pfaof3q2fr18m
+++ b/classified_measurements @ 2pk26vrc533rclborq5brb3fg6rh932j
+-----+----------------+--------------+-------------+--------------+-------------+----------------+---------+
|     | measurement_id | sepal_length | sepal_width | petal_length | petal_width | class          | is_test |
+-----+----------------+--------------+-------------+--------------+-------------+----------------+---------+
|  <  | 102            | 5.8          | 2.7         | 5.1          | 1.9         | Iris-virginica | 0       |
|  >  | 102            | 5.8          | 2.7         | 5.1          | 1.9         | Iris-virginica | 1       |
|  <  | 109            | 6.7          | 2.5         | 5.8          | 1.8         | Iris-virginica | 1       |
|  >  | 109            | 6.7          | 2.5         | 5.8          | 1.8         | Iris-virginica | 0       |
|  <  | 112            | 6.4          | 2.7         | 5.3          | 1.9         | Iris-virginica | 1       |
|  >  | 112            | 6.4          | 2.7         | 5.3          | 1.9         | Iris-virginica | 0       |
|  <  | 114            | 5.7          | 2.5         | 5            | 2           | Iris-virginica | 0       |
|  >  | 114            | 5.7          | 2.5         | 5            | 2           | Iris-virginica | 1       |
|  <  | 119            | 7.7          | 2.6         | 6.9          | 2.3         | Iris-virginica | 1       |
|  >  | 119            | 7.7          | 2.6         | 6.9          | 2.3         | Iris-virginica | 0       |
|  <  | 121            | 6.9          | 3.2         | 5.7          | 2.3         | Iris-virginica | 1       |
|  >  | 121            | 6.9          | 3.2         | 5.7          | 2.3         | Iris-virginica | 0       |
|  <  | 122            | 5.6          | 2.8         | 4.9          | 2           | Iris-virginica | 0       |
|  >  | 122            | 5.6          | 2.8         | 4.9          | 2           | Iris-virginica | 1       |
|  <  | 123            | 7.7          | 2.8         | 6.7          | 2           | Iris-virginica | 0       |
|  >  | 123            | 7.7          | 2.8         | 6.7          | 2           | Iris-virginica | 1       |
|  <  | 126            | 7.2          | 3.2         | 6            | 1.8         | Iris-virginica | 0       |
|  >  | 126            | 7.2          | 3.2         | 6            | 1.8         | Iris-virginica | 1       |
|  <  | 130            | 7.2          | 3           | 5.8          | 1.6         | Iris-virginica | 1       |
|  >  | 130            | 7.2          | 3           | 5.8          | 1.6         | Iris-virginica | 0       |
|  <  | 131            | 7.4          | 2.8         | 6.1          | 1.9         | Iris-virginica | 1       |
|  >  | 131            | 7.4          | 2.8         | 6.1          | 1.9         | Iris-virginica | 0       |
|  <  | 138            | 6.4          | 3.1         | 5.5          | 1.8         | Iris-virginica | 1       |
|  >  | 138            | 6.4          | 3.1         | 5.5          | 1.8         | Iris-virginica | 0       |
|  <  | 139            | 6            | 3           | 4.8          | 1.8         | Iris-virginica | 1       |
|  >  | 139            | 6            | 3           | 4.8          | 1.8         | Iris-virginica | 0       |
|  <  | 140            | 6.9          | 3.1         | 5.4          | 2.1         | Iris-virginica | 1       |
|  >  | 140            | 6.9          | 3.1         | 5.4          | 2.1         | Iris-virginica | 0       |
|  <  | 142            | 6.9          | 3.1         | 5.1          | 2.3         | Iris-virginica | 0       |
|  >  | 142            | 6.9          | 3.1         | 5.1          | 2.3         | Iris-virginica | 1       |
|  <  | 145            | 6.7          | 3.3         | 5.7          | 2.5         | Iris-virginica | 0       |
|  >  | 145            | 6.7          | 3.3         | 5.7          | 2.5         | Iris-virginica | 1       |
|  <  | 147            | 6.3          | 2.5         | 5            | 1.9         | Iris-virginica | 0       |
|  >  | 147            | 6.3          | 2.5         | 5            | 1.9         | Iris-virginica | 1       |
|  <  | 149            | 6.2          | 3.4         | 5.4          | 2.3         | Iris-virginica | 1       |
|  >  | 149            | 6.2          | 3.4         | 5.4          | 2.3         | Iris-virginica | 0       |
+-----+----------------+--------------+-------------+--------------+-------------+----------------+---------+
classified-iris-measurements $ dolt sql -q "select class, is_test, count(*) from classified_measurements group by class,is_test order by class, is_test"
+-----------------+---------+----------+
| class           | is_test | COUNT(*) |
+-----------------+---------+----------+
| Iris-setosa     | 0       | 43       |
| Iris-setosa     | 1       | 7        |
| Iris-versicolor | 0       | 42       |
| Iris-versicolor | 1       | 8        |
| Iris-virginica  | 0       | 40       |
| Iris-virginica  | 1       | 11       |
+-----------------+---------+----------+

A little better. Again, this is all very efficient and safe. I'm working on my own copy without fear of messing up the master version or someone elses version. Your data science is liberated to be science not shepherding data through various pipelines, figuring out what to do with n/a columns, or wondering what version of the data produced that great model you are comparing against.

You can document all this in your README.md and check that in as well. Push the whole thing to DoltHub and collaborate remotely.

classified-iris-measurements $ dolt push origin train-test-split
Tree Level: 5 has 4 new chunks of which 0 already exist in the database. Buffering 4 chunks.
Tree Level: 5. 100.00% of new chunks buffered.
Tree Level: 4 has 6 new chunks of which 2 already exist in the database. Buffering 4 chunks.
Tree Level: 4. 100.00% of new chunks buffered.
Tree Level: 2 has 5 new chunks of which 1 already exist in the database. Buffering 4 chunks.
Tree Level: 2. 100.00% of new chunks buffered.
Tree Level: 1 has 3 new chunks of which 1 already exist in the database. Buffering 2 chunks.
Tree Level: 1. 100.00% of new chunks buffered.
Successfully uploaded 1 of 1 file(s).   

You can see the result on DoltHub. Your colleagues can easily browse the data on different branches, inspect the commit log and diffs for all your changes, and even give you feedback on a pull request if a change requires peer review.

Let's say a colleague was able to produce a better model with a different train/test split on theirbranch. Seeing a diff in the sets to start to debug what is happening is as simple as dolt diff train-test-split theirbranch. If you want to use it for yourself, just run dolt checkout theirbranch. If you want to adopt their split or see how the combination of yours and theirs might work, just merge dolt merge theirbranch. If there are any conflicts, just take yours or theirs. Take theirs like so: dolt conflicts resolve --theirs. All of these changes will be reflected in dolt log so you can easily go back to a working state.

Dolt scales to hundreds of gigabytes. We have a 300GB dataset on DoltHub. The query language becomes even more powerful as data scales beyond what a human can reasonably interpret. You can query diffs to verify changes to the dataset, either human or automated.

There is so much utility in the Git plus SQL model of Dolt. There's a reason Git has become the defacto version control standard for software development. Bring that standard to your data science workflows. What else would you like to see demonstrated? Let us know.

Conclusion

Dolt and DoltHub have practical application in managing train/test splits for Machine Learning model creation. Dolt proves powerful in creation, debugging, and collaboration on train/test splits. Try it today.

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt