Dolt: A Simple Example

October 14, 2019

3 min read

When Dolt and DoltHub first went into private beta, we were surprised that the Iris dataset was the dataset people first tried to put in Dolt. If you are looking for that dataset, we have uploaded it to DoltHub. In this article, we're going to show you exactly how we did that.

This dataset is a little problematic for Dolt because it does not have primary keys. Dolt tables require a primary key. Dolt uses primary keys to determine conflicts between conflicting writes.

Without primary keys it is difficult to discern from the data alone whether an update is an insert or an update. More importantly, the storage engine shares content across versions based on the primary key so we don't want those changing very frequently.

Back to the problem at hand. Here is what I did to get the data into Dolt and DoltHub.

I downloaded the dataset from here. I opened the .data file in a text editor, noticed it was a CSV, didn't have any headers on the rows, and had no obvious primary key. It looks like this.

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
…

I renamed the file iris.csv and imported that file into Google sheets. I added column headers sepal_length, sepal_width, petal_length, petal_width, and class so the Dolt CSV importer would infer the schema. These are the names of the columns I intend to use when I create the table in Dolt.
I added a column named measurement_id and made it sequentially increase as the primary key.

I debated making the primary key a hash of the five values or a sequentially increasing integer.

Say 50 more measurements are added to this dataset. If I download that new dataset, I need to make sure the current measurements are labeled 1-150 in the exact same way and the new 50 measurements get new measurement IDs if I'm to import it into Dolt and get reasonable diffs.

That likely means the data needs to be sorted in the exact same way which may be unlikely as this file seems sorted by class. It would be a bit of effort to reimport that new dataset because of this.

So, the hash method would protect against this because the key would just be the product of all 5 other columns but I could not have duplicate rows. Because this is a supervised learning dataset, I know duplicate rows are useful so I opted for using a measurement ID as the primary key.

I needed to make a new Dolt repository. I created a new directory named iris-classified-measurements and made it my current working directory. I intended to call the Dolt repository iris-classified-measurements. I exported the sheet I created as a CSV to that directory so I could import it easily later. It will not be part of the Dolt repository.
I created a Dolt repository by running dolt init. I need to create a table and some of the values in it are not strings. I know dolt import will make everything a string so I create a table using SQL so I can have some float columns:

dolt sql -q 'CREATE TABLE classified_measurements (
  measurement_id int,
  sepal_length float,
  sepal_width float,
  petal_length float,
  petal_width float,
  class varchar(255),
  primary key (`measurement_id`)
);'

I then ran dolt schema show to make sure I got the table I wanted. Dolt added some constraints and some tag numbers. Those are used for Dolt internals. We'll have a discussion on tag numbers at some point.

Now I needed to get the data into the table. I imported the data with dolt table import -u classified_measurements iris-dolt.csv. I then ran dolt sql -q 'select * from classified_measurements' to make sure the data was imported correctly.
Everything looked good so now I wanted to commit and push my dataset to DoltHub. I committed my changes by running dolt add classified_measurements and dolt commit -m "Imported very popular classified Iris measurement dataset".
I needed to create the repository on DoltHub to push to. I went to https://www.dolthub.com and created a public dataset named classified-iris-measurements under the dolthub organization.
I went back to my shell and set up DoltHub as my remote: dolt remote add origin dolthub/classified-iris-measurements. I pushed master to remote: dolt push origin master.

The dataset can be viewed on DoltHub. It can be cloned locally by running dolt clone dolthub/classified-iris-measurements.

For those of you familiar with Git, GitHub, and SQL databases, this should all seem eerily familiar. That is by design. Check out other datasets we and others have imported into DoltHub or create and upload one yourself.

Blog

Get started with Dolt