The plan for VECTOR columns
Recently, we added support for vector indexes to Dolt, our SQL database with Git-style version control.
Dolt is designed to be a drop-in replacement for MySQL, but MySQL didn't support vector indexes at the time. In fact, it still doesn't. So we modeled the syntax off of MariaDB, another database engine in the MySQL dialect that does support vector indexes. Right now, you can add a vector index like this:
ALTER TABLE products ADD VECTOR INDEX embedding_idx (embedding);
Currently, there's no dedicated VECTOR
type in Dolt. Instead, you can add vector indexes on JSON
columns, where the JSON values are arrays of floats that get interpreted as vectors. This was done for two reasons:
- We wanted to get a minimum proof of concept into users' hands, and it was simpler to use an existing data type than add a new one.
- When we started, MySQL and MariaDB didn't have a dedicated VECTOR type either. We didn't know how their vector support would evolve over time, and we wanted to make sure that whatever we did was compatible with both of them.
Then while we were implementing vector indexes, two things happened:
- Both MySQL and MariaDB added their own VECTOR type, and the two are largely compatible with each other.
- We confirmed that storing vectors as JSON and deserializing them on-demand makes building vector indexes slow.
The second bullet point didn't really surprise us; we fully expected it. But now that people can use vector indexes, making them faster by adding dedicated vector columns seems like the obvious next step.
The purpose of this article is to outline and preview what using a vector type in Dolt will look like, and to show why you should prefer a dedicated vector type. MySQL and MariaDB have APIs that are similar but vary in minor ways, mostly to do with the names and signatures of built-in functions. When possible, Dolt will support both variations, so that scripts written for either MySQL or MariaDB will work for Dolt out of the box.
Column Definitions
Both MySQL and MariaDB use the same syntax for creating vector columns: the column type VECTOR(N)
represents an vector consisting of N
single-precision (32-bit) floats. Dolt will also support this syntax, which looks like this:
CREATE TABLE products (id int primary key, embedding VECTOR(3));
This example creates a vector column where every vector has four dimensions. The minimum number of dimensions is 1, and the maximum is 16383.
Since vectors in MySQL and MariaDB are always 32 bits, we don't currently have plans to add support for other levels of precision unless a user requests it.
Creating Vectors
Both MySQL and MariaDB store vectors as a binary blob which stores each of the vector's individual float values in little-endian. Vector values are implicitly convertible from their corresponding binary values. This means the following is allowed:
INSERT INTO products VALUES (1, 0x00F0803F0000404000000040);
And vector values are displayed as binary when printed:
> SELECT embedding FROM products;
+----------------------------+
| embedding |
+----------------------------+
| 0x00F0803F0000404000000040 |
+----------------------------+
Depending on how you're importing and exporting vector data, this may be sufficient. And it's not human-readable. But if you're writing your own scripts for inserting or reading vectors, this binary format may not suffice.2 Trying to use human-readable strings won't work:
> INSERT INTO products VALUES (2, "[1.0, 2.0, 3.0]");
ERROR 6136 (HY000): Value of type 'string, size: 15' cannot be converted to 'vector' type.
In order to convert between vector blobs and human-readable strings, you use built-in functions. These functions have different names in MySQL and MariaDB but behave the same:
-
To convert a vector from JSON to binary, use
STRING_TO_VECTOR()
orTO_VECTOR()
in MySQL, orVEC_FromText()
in MariaDB. -
To convert a vector from binary to JSON, use
VECTOR_TO_STRING()
orFROM_VECTOR()
in MySql, orVEC_ToText()
in MariaDB. -
MySQL:
mysql> INSERT INTO products VALUES (2, STRING_TO_VECTOR("[1.0, 2.0, 3.0]"));
Query OK, 1 row affected (0.004 sec)
mysql> SELECT embedding from products;
+----------------------------+
| embedding |
+----------------------------+
| 0x0000803F0000004000004040 |
+----------------------------+
1 row in set (0.002 sec)
mysql> SELECT VECTOR_TO_STRING(embedding) from products;
+---------------------------------------+
| VECTOR_TO_STRING(embedding) |
+---------------------------------------+
| [1.00000e+00,2.00000e+00,3.00000e+00] |
+---------------------------------------+
1 row in set (0.001 sec)
- MariaDB:
mysql> INSERT INTO products VALUES (2, VEC_FromText("[1.0, 2.0, 3.0]"));
Query OK, 1 row affected (0.004 sec)
mysql> SELECT embedding from products;
+----------------------------+
| embedding |
+----------------------------+
| 0x0000803F0000004000004040 |
+----------------------------+
1 row in set (0.002 sec)
mysql> SELECT VEC_ToText(embedding) from products;
+---------------------------------------+
| VEC_ToText(embedding) |
+---------------------------------------+
| [1.00000e+00,2.00000e+00,3.00000e+00] |
+---------------------------------------+
1 row in set (0.001 sec)
Dolt will support both sets of function names, as aliases for each other.
Computing Vector Distance
The most important function for vectors is the distance function, which as the name suggests, computes a distance measure between two vectors. Vector indexes allow the engine to optimize queries where the results are ordered by a distance funciton.
There are multiple different types of distance, and the mechanism for specifying which type of distance you want is different between MySQL and MariaDB:
- MySQL:
SELECT * FROM products ORDER BY DISTANCE(embedding, <target vector>, "COSINE");
SELECT * FROM products ORDER BY DISTANCE(embedding, <target vector>, "EUCLIDEAN");
-- VECTOR_DISTANCE can also be used as an alias for DISTANCE
- MariaDB:
SELECT * FROM products ORDER BY VEC_DISTANCE_COSINE(embedding, <target vector>);
SELECT * FROM products ORDER BY VEC_DISTANCE_EUCLIDEAN(embedding, <target vector>);
MariaDB also has a function just called VEC_DISTANCE
, which uses the underlying vector index to determine the distance type. If there is no appropriate vector index, VEC_DISTANCE
returns an error.
Again, Dolt will support both of these syntaxes.
Creating Vector Indexes
Since MySQL doesn't currently support vector indexes, we're following MariaDB's lead.
As previously stated, you can create a vector index in either a CREATE TABLE
statement or a MODIFY TABLE ... ADD INDEX
statement, like so:
ALTER TABLE products ADD VECTOR INDEX embedding_idx (embedding);
In the future, you will be able to configure the distance type (Euclidean vs cosine distance, with Euclidean as the default). Selecting the distance type looks like this:
ALTER TABLE products ADD VECTOR INDEX embedding_idx (embedding) DISTANCE=cosine;
Additionally, you'll be able to modify the default values for this parameter if unspecified, by setting the mhnsw_default_distance
system variable.1
Miscellanous
There are a couple of things to be aware of when using VECTOR columns:
- Vectors can be compared with each other as long as they have the same dimension, but this is bytewise comparison on the bytes, not a semantic comparison on their contained values.
- Vector values cannot be compared with non-vector values, or compared with vectors of a different dimension.
- Vector columns cannot be used as keys in any kind of index (other than vector indexes).
- Vector values can be used as input to some built-in functions that accept strings or binary arrays, but not all. In MySQL, the following built-in functions support vector inputs:
BIT_LENGTH()
CHAR_LENGTH()
HEX()
LENGTH()
TO_BASE64()
AES_ENCRYPT()
COMPRESS()
MD5()
SHA1()
SHA2()
Recommendation
We are currently in the process of implementing vector types to Dolt, and will make another announcement once this is complete. Once we do, we recommend using the dedicated VECTOR type for vector indexes instead of JSON. Creating an index on vector types will be much faster and have closer parity to scripts written for MySQL. The only reason to use JSON would be if you need to use functions that the VECTOR type doesn't support, but even then you can always convert vectors to another type using the VEC_ToText
/ VECTOR_TO_STRING
/ FROM_VECTOR
function.
We'll make another announcment once vector types are fully supported in Dolt. In the meantime, we hope you found this useful. If you're working with Dolt or MySQL and have specific requirements for vector types, join our Discord and let us know! We take user requests and feedback very seriously.
- If you're using
dolt import
to import data from serialized file formats like CSV or Parquet, you won't have to worry about this at all: we'll make sure that imports work as expected for all supported file formats.↩ - MariaDB also has an
M=
parameter and a correspondingmhnsw_default_m
system variable. Dolt uses a custom implementation for our vector indexes that is designed to exhibit strong structural sharing when many versions of the index are stored in the database's commit history. As such, there is currently no equivalentM
parameter in Dolt. Setting this parameter will have no effect.↩