# Dolt Supports Spatial Indexes

Dolt is a version controlled SQL database; think Git and MySQL had a baby. Just over a year ago, we began our spatial journey with the addition of Spatial Types. We closed that gap about three months ago when we announced the support of Multi-Geometries. Now, we're one step closer to full MySQL compatibility with support for Spatial Indexes...kinda.

# What are Spatial Indexes?

Spatial Indexes are a special type of Secondary Indexes that aims to improve the performance of spatial SQL queries like `SELECT * FROM t WHERE ST_INTERSECTS(t.p, POINT(0, 0))`

.
Currently, this feature is only available after setting an environment variable; we'll get into why later on.

`export DOLT_ENABLE_SPATIAL_INDEX=1`

Here's an example of how to create a `SPATIAL`

index through the `CREATE TABLE`

syntax:

```
tmp> CREATE TABLE tbl (p POINT SRID 0 NOT NULL, SPATIAL INDEX (p));
tmp> show create table tbl;
+-------+------------------------------------------------------------------+
| Table | Create Table |
+-------+------------------------------------------------------------------+
| tbl | CREATE TABLE `tbl` ( |
| | `p` point NOT NULL SRID 0, |
| | SPATIAL KEY `p` (`p`) |
| | ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_bin |
+-------+------------------------------------------------------------------+
1 row in set (0.00 sec)
```

It's also possible through the `ALTER TABLE`

syntax if you already have a table with a `GEOMETRY`

column:

```
tmp> CREATE TABLE tbl (p POINT SRID 0 NOT NULL);
tmp> ALTER TABLE tbl ADD SPATIAL INDEX (p);
tmp> show create table tbl;
+-------+------------------------------------------------------------------+
| Table | Create Table |
+-------+------------------------------------------------------------------+
| tbl | CREATE TABLE `tbl` ( |
| | `p` point NOT NULL SRID 0, |
| | SPATIAL KEY `p` (`p`) |
| | ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_bin |
+-------+------------------------------------------------------------------+
1 row in set (0.00 sec)
```

`SPATIAL`

indexes can only be defined over `GEOMETRY`

columns (`POINT`

, `LINESTRING`

, `POLYGON`

, etc.).
Additionally, `SPATIAL`

indexes can only be defined over a single column and that column must have the constraints `SRID`

and `NOT NULL`

.

You can take advantage of `SPATIAL`

indexes by filtering over certain Spatial Relation functions, like `ST_INTERSECTS()`

.
As of Dolt version `v0.54.0`

, `ST_INTERSECTS()`

is the only function that can take advantage of `SPATIAL`

indexes.
You have to start somewhere. Over the next few weeks we will begin adding support for functions like `ST_EQUALS()`

, `ST_WITHIN()`

, etc.
Here's an example:

```
tmp> INSERT INTO tbl VALUES (POINT(0,0)), (POINT(1,1));
Query OK, 2 rows affected (0.00 sec)
tmp> select count(*) from tbl where st_intersects(p, point(0,0));
+----------+
| count(*) |
+----------+
| 1 |
+----------+
1 row in set (0.01 sec)
tmp> explain select count(*) from tbl where st_intersects(p, point(0,0));
+---------------------------------------------+
| plan |
+---------------------------------------------+
| GroupBy |
| ├─ SelectedExprs(COUNT(1)) |
| ├─ Grouping() |
| └─ Filter |
| ├─ st_intersects(tbl.p,{0 0 0}) |
| └─ IndexedTableAccess(tbl) |
| ├─ index: [tbl.p] |
| ├─ filters: [{[{0 0 0}, {0 0 0}]}] |
| └─ columns: [p] |
+---------------------------------------------+
9 rows in set (0.00 sec)
```

# Under the Hood / A Deeper Dive

## Why Spatial Indexes? (Motivation)

Let's say we had a table with a million evenly spread out `POINT`

s ranging anywhere from `(0,0)`

to `(1000, 1000)`

, and we wanted to know how many points were in the region defined by `POLYGON((0 0,1 0,1 1,0 1,0 0))`

.
So in SQL terms:

`SELECT COUNT(*) FROM point_tbl WHERE ST_INTERSECTS(p, ST_GEOMFROMTEXT('POLYGON((0 0,1 0,1 1,0 1,0 0))'));`

Naively, to answer this query we'd have to call `ST_INTERSECTS()`

for *every* `POINT`

in `point_tbl`

; that's a million times we're running the point-in-polygon algorithm.
Not to mention all the disk reads...

Logically, a `POINT`

located far away from the origin (like `(123, 456)`

) definitely does not spatially intersect with our `POLYGON`

, so there's no reason to run the expensive `ST_INTERSECTS()`

function.
With a `SPATIAL`

index, we'd be able to make these conclusions and avoid a great deal of unnecessary calculations.

## How do Spatial Indexes work?

Typically, indexes improve performance by neatly organizing the data into a B-Tree (in Dolt's case a Prolly Tree).
However, `POINT`

s don't fit so neatly into this data structure, as they are multi-dimensional (they consist of two `float64`

s `x`

and `y`

or `latitude`

and `longitude`

).

MySQL actually solves this problem by just not using a B-tree, and opting to represent their `SPATIAL`

indexes as a R-tree.
While this does work pretty well, it isn't without its drawbacks.
R-Trees are not deterministic; different insertion orders result in different trees with varying performance characteristics.
It's possible to accidentally create highly unbalanced trees, degrading performance. Additionally, R-Trees' non-determinism is not great for a versioned database that depends on maintaining history-independence.

For these very reasons, databases like Amazon Aurora and CockroachDB use a technique called Z-Ordering to map the two dimensional `POINT`

s into one dimension and place them into a B-Tree/Prolly Tree.
They claim to achieve great performance benefits and avoid certain downsides, so Dolt's going to do the same.
Specifically, Dolt requires history independence so we can merge Prolly trees.
Z-ordering is history independent so this technique fits the bill.

## Z-Ordering

As you might've guessed, it definitely matters *how* we perform this mapping.
An important factor is locality; we want `POINT`

s that are spatially nearby to still be nearby after our mapping.

A simple, but ineffective mapping is to simply concatenate the bits `x`

and `y`

.
The mapping is quick to do, but does a terrible job of preserving locality.
Let's say we're dealing with small 3-bit integers, and see how this mapping pans out:

x | y | concat | x_bits | y_bits | concat_bits |
---|---|---|---|---|---|

0 | 0 | 0 | 000 | 000 | 000 000 |

0 | 1 | 1 | 000 | 001 | 000 001 |

1 | 0 | 8 | 001 | 000 | 001 000 |

1 | 1 | 9 | 010 | 000 | 001 001 |

Concatenating works pretty well in preserving the locality between `(0, 0)`

, `(0, 1)`

, and `(0, 2)`

.
However, this pattern doesn't hold for `POINT`

s `(0, 0)`

, `(1, 0)`

, and `(2, 0)`

, as they are artificially far away.

A relatively simple, but performant mapping is to use a Z-order curve. It looks like this: I wonder where they got the name from...

Given two (or more) unsigned integers, we take their bits and interleave them. The resulting mapping is one-dimensional and maintains a high level of locality between our points; perfect for shoving into a B-tree.

x | y | z_order | x_bits | y_bits | z_order_bits |
---|---|---|---|---|---|

0 | 0 | 0 | 000 | 000 | 000 000 |

0 | 1 | 2 | 000 | 001 | 000 010 |

1 | 0 | 1 | 001 | 000 | 000 001 |

1 | 1 | 3 | 001 | 001 | 000 011 |

A Hilbert Curve is another space-filling curves that actually does a better job of spatial locality. However, due to the algorithm's complexity and relatively expensive calculation, we stuck with Z-ordering.

## Float Lexing

You may have noticed that Z-Ordering works on unsigned integers, while we have signed floats.
They're both 64 bits, but casting the raw bits of a `float64`

to a `uint64`

changes their ordering.
The signed bit is the first bit, so negative `float64`

s reinterpreted as `uint64`

s are larger than their positive counterparts.

So, we have to do another mapping; we want

`-MaxFloat64`

to be close to`0`

`0.0`

to be in the center, so`0x80...0`

`MaxFloat64`

to be close to`MaxUInt64`

, so`0xFF...F`

Luckily, the IEEE-754 standard is amicable to this lexicographic remapping.

```
// LexFloat maps the float64 into an uint64 representation in lexicographical order
// For negative floats, we flip all the bits
// For non-negative floats, we flip the signed bit
func LexFloat(f float64) uint64 {
b := math.Float64bits(f)
if b>>63 == 1 {
return ^b
}
return b ^ (1 << 63)
}
```

## Dealing with Polygons

Using float lexing and bit interleaving, we're able to convert our `POINT`

s into 16-byte Z-Addresses, ready to be placed into our Prolly Tree.
Well what about the other geometries, like `LINESTRING`

s and `POLYGON`

s that contain multiple `POINT`

s?
The trick is treating these geometries as `POINT`

s and associating a size with them; we call these `ZCell`

s.

Without going into too much detail, this is the algorithm for creating `ZCell`

s:

- For a given geometry, find it's bounding box, which consists of a
`MinPoint`

(bottom left) and a`MaxPoint`

(upper right). - Find the Z-Values of the minimum and maximum point,
`zMin`

and`zMax`

. - Find the matching prefix between
`zMin`

and`zMax`

; this is our level. - Create our Z-Cell, which looks like
`[level_byte, z_0, z_1, ...]`

.

Essentially, we've organized our geometries into 65 different levels. The larger the level, the larger the geometry; points are in level 0, and polygons that take up entire space are in level 64.

## What do Spatial Index Lookups look like?

Fortunately, our efforts to linearize the geometries so that they fit into the B-Tree makes creating lookups simple.
All `SPATIAL`

index queries provide a `GEOMETRY`

.
Using some logic from before, we can extract a bounding box from that `GEOMETRY`

.
This is essentially a range of values to search over; they go from `ZValue(minPoint)`

to `ZValue(maxPoint)`

.
Lastly, we prepend the level (creating 65 ranges) and (naively) perform this lookup in the same manner as a typical `SECONDARY`

index.

# Performance

Finally, we're able to read and write our geometries in an organized manner. So how much better is it? Well, here are some benchmarks:

table | min_point | max_point | num_selected | select_% | keyless_sec | spatial_key | multiplier |
---|---|---|---|---|---|---|---|

point | (0, 0) | (0, 0) | 0 | 0 | 0.0761 | 0.0009 | 81.1x faster |

point | (2, 2) | (4, 4) | 2 | 0.0004 | 0.1066 | 0.0009 | 109.4x faster |

point | (2, 2) | (128, 128) | 2021 | 0.4042 | 0.1119 | 0.0325 | 3.4x faster |

point | (2, 2) | (256, 256) | 8064 | 1.6128 | 0.1087 | 0.0542 | 2x faster |

point | (2, 2) | (512, 512) | 32583 | 6.5166 | 0.1118 | 0.2109 | 1.8x slower |

point | (0, 0) | (1000, 1000) | 124932 | 24.986 | 0.1121 | 0.2638 | 2.5x slower |

point | (-1000, 0) | (1000, 1000) | 249152 | 49.830 | 0.1114 | 0.5167 | 5x slower |

point | (-1000, 0) | (1000, 1000) | 382688 | 76.753 | 0.1139 | 0.9949 | 8.7x slower |

point | (-2000, -2000) | (2000, 2000) | 500000 | 100 | 0.1128 | 1.0218 | 10x slower |

point | (-1, -1) | (1, 1) | 0 | 0 | 0.1035 | 0.5115 | 5x slower |

poly_1 | (0, 0) | (0, 0) | 0 | 0 | 0.0345 | 0.0011 | 29.3x faster |

poly_1 | (2, 2) | (4, 4) | 0 | 0 | 0.0705 | 0.0013 | 53.4x faster |

poly_1 | (2, 2) | (128, 128) | 383 | 0.383 | 0.0713 | 0.0060 | 11.7x faster |

poly_1 | (2, 2) | (256, 256) | 1617 | 1.617 | 0.0742 | 0.0102 | 7.2x faster |

poly_1 | (2, 2) | (512, 512) | 6503 | 6.503 | 0.0733 | 0.0356 | 2x faster |

poly_1 | (0, 0) | (1000, 1000) | 25118 | 25.118 | 0.0705 | 0.0490 | 1.4x faster |

poly_1 | (-1000, 0) | (1000, 1000) | 49862 | 49.862 | 0.0657 | 0.1055 | 1.7x slower |

poly_1 | (-1000, 0) | (1000, 1000) | 76449 | 76.449 | 0.0614 | 0.1698 | 2.7x slower |

poly_1 | (-2000, -2000) | (2000, 2000) | 100000 | 100 | 0.0597 | 0.2321 | 3.3x slower |

poly_1 | (-1, -1) | (1, 1) | 1 | 0.001 | 0.0726 | 0.1264 | 1.7x slower |

The `point`

table consists of 500,000 `POINT`

s with random `x`

and `y`

values ranging anywhere from -1000 to 1000.
The `poly_1`

table consists of 100,000 randomly generated `POLYGON`

s (triangles) whose vertices are all anywhere from 0-2 units apart.

## Improvements

When the query range is small enough, `SPATIAL`

indexes significantly improve look ups.
For `point`

table we see speedups from 80-100x, and for the `poly_1`

we see speedups from 10-50x.

## Anti-Improvements

However, when the selection ranges get larger, our performance drastically worsens.
In fact, in these cases it's actually worse to use `SPATIAL`

Indexes...about 2-10x worse.
Additionally, there's even a degenerate case with the unit square `POLYGON((-1 -1,-1 1,1 1,1 -1,-1 -1))`

; the actual selection range is relatively tiny, but has awful performance.

Here's a graph plotting speedup multiplier against select percentage.

As you can see, there's a quick drop off in performance when we start to query large portions of our data.

# Future Work

Evidently, there's still a great deal of work to be done here.

## Z-Range Pruning

While the z-addresses preserve locality and do eliminate many spatial comparisons, a pass over the `zMin`

to `zMax`

still results in a ton of unnecessary comparisons.
Certain ranges around the "seams" or powers of 2 in the space of z-addresses, traverse over many z-addresses that are not within the original bounding box.
It is possible to break down these larger Z-Ranges into several smaller Z-Ranges using the LITMAX/BIGMIN algorithm. This is something we are currently working on.

## Level Pruning

For every range of Z-Values, we produce 65 more ranges, as there are 65 possible levels geometries could be in.
However, for tables like `point`

, every `POINT`

will be in level 0, yielding many unnecessary range searches.

## Covering Indexes

There is a performance distinction between covering and non-covering indexes.
Covering indexes can satisfy lookups with the key itself.
Meanwhile, non-covering indexes must perform another lookup to retrieve the value to satisfy the lookup.
This extra level of indirection leads to extra disk accesses, and subsequently worse performance.
Right now, `SPATIAL`

indexes are strictly non-covering as their keys are bounding boxes, which do not have enough information to restore the full `GEOMETRY`

.

## Geometry Decomposition

The S2 library performs something called Geometry Decomposition, which decomposes a `GEOMETRY`

into the set of bounding boxes that efficiently contain it. Currently, we only create one large bounding box, which, depending on the `GEOMETRY`

, may be too large.

Given the current state of their performance and the fact that implementing some of these optimizations may break compatibility between Dolt versions, `SPATIAL`

indexes are behind the feature flag mentioned above. Good thing your database is versioned 😉.

# Conclusion

If you feel like experimenting, give `SPATIAL`

indexes a try.
There are still some kinks to work out, but we are constantly improving and making Dolt better.
Remember we're open source, and will gladly take public contributions.
Feel free to drop by on our Discord with any questions or feature requests!