Dolt and Images


Dolt is a MySQL-compatible database that uniquely supports version control operations such as branch, merge, and diff. Over the past couple of months, our customers have been asking us to support loading and versioning files, particularly images. Dolt offers several properties that make it a compelling image store.

  • A single instance that can be queried through a server
  • Lineage for understanding how a dataset has changed
  • Git semantics (clone, branch, merge, etc.) for collaboration

In this blog post we'll focus on a simple example of machine learning model experimentation and see how these properties are useful. The model we're building classifies images of handwritten digits.

Getting Started

We'll be working with MNIST, a dataset of handwritten digit images. Go ahead and download all the images from here. Create a dolt database with dolt init and make sure you have all the unzipped images in the same folder. Now let's walk through a script that loads all the images into our dolt database.

Let's start by loading our dolt database with doltcli and creating our training and testing tables.

import doltcli as dolt
import os
db = dolt.Dolt(".")

# create the training and testing tables that will hold the images
db.sql("CREATE TABLE training(name varchar(64) primary key, data longblob, label int);", result_format="csv")
db.sql("CREATE TABLE testing(name varchar(64) primary key, data longblob, label int);", result_format="csv")

Now let's load in the training images for the first five labels (0-4). Notice the use of the LOAD_FILE function, the primary way to insert file and image data into our database.

# Go through all of the training files for labels 0-4 and load them into our training table
trainingDir = './MNIST-JPG/MNIST Dataset JPG format/MNIST - JPG - training'

for i in range(0, 5):
    imgDir = os.path.join(trainingDir, str(i))

    for filename in os.listdir(imgDir):
        if filename.endswith(".DS_Store"):
            continue

        completePath = os.path.join(imgDir, filename)
        query = "INSERT INTO training VALUES ('{}', LOAD_FILE('{}'), {});".format(filename, completePath, i)

        db.sql(query, result_format=None)
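One caveat worth flagging: interpolating filenames straight into the statement with .format() breaks if a name ever contains a single quote. A small, hypothetical helper (the name build_insert is mine, not part of doltcli) that doubles quotes, the standard SQL escape, keeps the statement valid:

```python
def build_insert(table, filename, path, label):
    """Build an INSERT ... LOAD_FILE statement, escaping single quotes
    (by doubling them) so a filename like "it's.jpg" can't break it."""
    esc = lambda s: s.replace("'", "''")
    return "INSERT INTO {} VALUES ('{}', LOAD_FILE('{}'), {});".format(
        table, esc(filename), esc(path), int(label)
    )

print(build_insert("training", "it's.jpg", "./imgs/it's.jpg", 3))
# INSERT INTO training VALUES ('it''s.jpg', LOAD_FILE('./imgs/it''s.jpg'), 3);
```

The MNIST filenames are clean, so the plain .format() in the loop above works fine here; the helper just makes the loop safer to reuse on arbitrary files.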

Let's load our testing data as well.

testingDir = './MNIST-JPG/MNIST Dataset JPG format/MNIST - JPG - testing'

for i in range(0, 10):
    imgDir = os.path.join(testingDir, str(i))

    for filename in os.listdir(imgDir):
        if filename.endswith(".DS_Store"):
            continue

        completePath = os.path.join(imgDir, filename)
        query = "INSERT INTO testing VALUES ('{}', LOAD_FILE('{}'), {});".format(filename, completePath, i)

        db.sql(query, result_format=None)
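The directory-walking logic in both loops is easy to sanity-check without the real dataset. Here's a self-contained sketch (collect_images is a hypothetical helper mirroring the loops above) that runs against a throwaway directory tree standing in for the MNIST folders:

```python
import os
import tempfile

def collect_images(rootDir, labels):
    """Mirror the loading loops: return (filename, full_path, label)
    tuples from each label subdirectory, skipping .DS_Store entries."""
    rows = []
    for label in labels:
        imgDir = os.path.join(rootDir, str(label))
        for filename in sorted(os.listdir(imgDir)):
            if filename.endswith(".DS_Store"):
                continue
            rows.append((filename, os.path.join(imgDir, filename), label))
    return rows

# Fake two label directories, each with one "image", plus a stray .DS_Store
with tempfile.TemporaryDirectory() as root:
    for label in (0, 1):
        os.makedirs(os.path.join(root, str(label)))
        open(os.path.join(root, str(label), "img0.jpg"), "w").close()
    open(os.path.join(root, "0", ".DS_Store"), "w").close()
    rows = collect_images(root, range(2))
    print([(name, label) for name, _, label in rows])
    # [('img0.jpg', 0), ('img0.jpg', 1)]
```

The .DS_Store file is filtered out, and each image is paired with the label of the directory it came from, which is exactly what the INSERT loops rely on.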

Let's commit our data now.

dolt commit -am "Add data for labels 0-4"

Our Basic Model

We're going to train a simple model on our training set and then run predictions on the testing set. We won't go too deep into the model itself. The important thing to notice is how easy it is to query the image data in dolt. With dolt running in server mode, you can query this data from a remote machine instead of keeping the images local.

Go to your dolt database directory and start its SQL server with dolt sql-server --port=3307 --max_connections=10.

import numpy as np
import doltcli as dolt
from sklearn.ensemble import RandomForestClassifier
from PIL import Image
from sqlalchemy import create_engine
import io

engine = create_engine("mysql+pymysql://root@localhost:3307/ml_image")
trainingImages = []
trainingLabels = []
testingImages = []
testingLabels = []

def reshape(nparray):
    nsamples, nx, ny = nparray.shape
    nparray = nparray.reshape((nsamples,nx*ny))
    return nparray

def processRow(row):
    _, fileBlob, label = row
    img = Image.open(io.BytesIO(fileBlob))
    nImg = np.asarray(img)

    return np.asmatrix(nImg), np.asmatrix(label)

with engine.begin() as conn:
    data = conn.execute("SELECT * FROM training").fetchall()
    for row in data:
       img, label = processRow(row)
       trainingImages.append(img)
       trainingLabels.append(label)

    testingData = conn.execute("SELECT * FROM testing").fetchall()
    for row in testingData:
        img, label = processRow(row)
        testingImages.append(img)
        testingLabels.append(label)

trainingImages = reshape(np.array(trainingImages))
trainingLabels = reshape(np.array(trainingLabels))
testingImages = reshape(np.array(testingImages))
testingLabels = reshape(np.array(testingLabels))

# Create the model
clf = RandomForestClassifier(
    n_estimators=50,
    min_samples_split=2,
    n_jobs=2,
    random_state=20170428
)

# Print the results
clf.fit(trainingImages, trainingLabels.ravel())
predictions = clf.predict(testingImages)

print(np.mean(predictions == testingLabels.ravel()))
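The processRow helper above is the key bridge between Dolt and the model: it takes the raw LONGBLOB bytes fetched from the table and decodes them back into pixel arrays. Here's a minimal, self-contained sketch of that round trip, using a synthetic in-memory PNG (lossless, so equality is exact) rather than one of the MNIST JPEGs:

```python
import io
import numpy as np
from PIL import Image

# Encode a synthetic 28x28 grayscale image to bytes (this is what the
# LONGBLOB column stores), then decode it the same way processRow does.
original = (np.arange(28 * 28) % 256).astype(np.uint8).reshape(28, 28)
buf = io.BytesIO()
Image.fromarray(original, mode="L").save(buf, format="PNG")
blob = buf.getvalue()  # stand-in for row[1] fetched from the training table

decoded = np.asarray(Image.open(io.BytesIO(blob)))
print(decoded.shape, bool((decoded == original).all()))  # (28, 28) True
```

The MNIST JPEGs decode the same way; the only difference is that JPEG compression is lossy, so the stored bytes, not the exact original pixels, are the source of truth.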

Our initial accuracy is 0.98. Not bad!
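That score comes straight from the final print in the script: comparing two arrays elementwise yields booleans, and the mean of booleans (True counts as 1, False as 0) is the fraction that agree. A tiny illustration with made-up predictions:

```python
import numpy as np

# Accuracy as np.mean over a boolean array: four of five entries match.
predictions = np.array([7, 2, 1, 0, 4])
actual      = np.array([7, 2, 1, 0, 9])
print(np.mean(predictions == actual))  # 0.8
```

So 0.98 means the classifier got 98% of the test images right.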

Branch Mechanics

Now suppose we wanted to retrain our model with additional labels. Instead of modifying our current database in place, we can create a branch and upload more images.

dolt checkout -b additional-labels

Repeat the code above to load in the rest of the images (labels 5-9).

# Go through the training files for labels 5-9 and load them into our training table
trainingDir = './MNIST-JPG/MNIST Dataset JPG format/MNIST - JPG - training'

for i in range(5, 10):
    imgDir = os.path.join(trainingDir, str(i))

    for filename in os.listdir(imgDir):
        if filename.endswith(".DS_Store"):
            continue

        completePath = os.path.join(imgDir, filename)
        query = "INSERT INTO training VALUES ('{}', LOAD_FILE('{}'), {});".format(filename, completePath, i)

        db.sql(query, result_format=None)

Retraining on the expanded data, we see that our new accuracy score is 0.96. Our model still performs pretty well! Now imagine scaling this up to hundreds of experiments. Dolt manages that easily with simple checkout semantics.

We can go ahead and merge our feature branch into our master branch. This signifies that master holds the primary dataset that our deployed model was trained and tested on.

dolt commit -am "Added the remaining labels"
dolt checkout master
dolt merge additional-labels

If we ever wanted to debug where our model went wrong, we could easily look through our commit log to see what happened to the data.

commit j71r2mj7edhphhhdpfsj4a10t561u1fl
Author: vinai <vinai@dolthub.com>
Date:   Mon Aug 23 11:49:01 -0700 2021

	Added the remaining labels

commit 4n6gbdv9intphlga72prip9p6jec6kp0
Author: vinai <vinai@dolthub.com>
Date:   Mon Aug 23 11:09:26 -0700 2021

	Add data for labels 0-4

commit dlbv9qc2hb7f2im8eds7r0nu9962b34s
Author: vinai <vinai@dolthub.com>
Date:   Mon Aug 23 10:31:20 -0700 2021

	Initialize data repository

Pushing to a Remote

The last step is pushing your dolt database to a remote so anyone you choose can access your data experiments. We support DoltHub, S3, and GCP remotes. Let's stick with a DoltHub remote for now.

Go to https://www.dolthub.com/ and create a new database. Then push your local data to DoltHub as follows:

dolt remote add origin <username>/<database-name>
dolt push origin master

You can now go ahead and delete all the MNIST images stored locally on your machine! They're all in Dolt and DoltHub. More importantly, anyone you choose can clone your database and start experimenting on different branches.

You can clone the database that I put together for this blog here.

Conclusion

In this blog we walked through a simple example of versioning image datasets for machine learning use cases. Engineers and data scientists who value a single source of truth, lineage, and collaboration on their image data should seriously consider adopting Dolt. But if you just want to upload a bunch of images to DoltHub, that's totally cool as well!

If you're interested in learning more or using Dolt please join our Discord group here!
