Benchmarking SQL Reads on EBS and S3


Recently, we conducted some quick and dirty read latency smell tests comparing reads from Amazon EBS volumes with reads from Amazon S3. We ran these to test our hypothesis that a proposed infrastructure project would be a worthwhile undertaking since, in theory, it should make reads against DoltHub databases significantly faster.

Smell testing

In this blog I'll describe what we did, how we did it, and what it means for our roadmap moving forward. But first, a bit on why we wanted to conduct these ad hoc benchmarks in the first place.

Using science

Why Benchmark?

On DoltHub, when users view a database, the data is fetched from S3, which stores the table-files that comprise database data. The fetched data is then cached in memory in DoltHub's API layer for fast retrieval on subsequent requests.

Currently, this caching implementation is DoltHub's only caching layer, and it runs in the same process as our DoltHub API server. This has worked well so far, but as we've added DoltHub features like Data Bounties and expanded support for various Dolt use cases, like Dolt as an OLTP database, we've recognized the need for faster reads on DoltHub.

Unfortunately, our current caching strategy won't deliver the read performance we're after. About 256MB of memory is allocated in the API server for data caching, but we want the option to provision caching resources independently from the resources allocated for the API itself. To fix this, we've proposed a caching implementation for DoltHub that runs as a separate service.
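
For reference, the in-process cache we have today is roughly the shape of the sketch below. This is a simplified illustration, not DoltHub's actual code: the FIFO eviction policy and the key type are assumptions, and only the ~256MB cap comes from the numbers above.

package main

import (
	"fmt"
	"sync"
)

const maxCacheBytes = 256 << 20 // ~256MB, the figure mentioned above

// chunkCache is a size-capped, in-process cache keyed by chunk hash.
type chunkCache struct {
	mu    sync.Mutex
	total int
	order []string          // insertion order; oldest entries are evicted first
	data  map[string][]byte // chunk hash -> chunk bytes
}

func newChunkCache() *chunkCache {
	return &chunkCache{data: make(map[string][]byte)}
}

func (c *chunkCache) Get(hash string) ([]byte, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	b, ok := c.data[hash]
	return b, ok
}

func (c *chunkCache) Put(hash string, b []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.data[hash]; ok {
		return
	}
	// evict the oldest entries until the new chunk fits under the cap
	for c.total+len(b) > maxCacheBytes && len(c.order) > 0 {
		oldest := c.order[0]
		c.order = c.order[1:]
		c.total -= len(c.data[oldest])
		delete(c.data, oldest)
	}
	c.data[hash] = b
	c.order = append(c.order, hash)
	c.total += len(b)
}

func main() {
	c := newChunkCache()
	c.Put("exampleChunkHash", []byte("chunk bytes"))
	if b, ok := c.Get("exampleChunkHash"); ok {
		fmt.Printf("cache hit: %d bytes\n", len(b))
	}
}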

Initially, this caching service will run as a sidecar next to the API, and the API will make HTTP requests to the service for database data. If the service doesn't have the requested data cached, the API can fall back to fetching data from S3 the same way it does today.
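
To make that read path concrete, here's a minimal sketch of how the API's chunk reads could work with a sidecar cache in front of S3. It illustrates the fallback behavior described above and is not DoltHub's actual API code: the URL scheme, the port, and the fetchFromS3 helper are all assumptions.

package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// fetchFromS3 stands in for the API's existing S3 chunk fetch path.
func fetchFromS3(hash string) ([]byte, error) {
	// ... existing S3 read logic ...
	return nil, fmt.Errorf("not implemented in this sketch")
}

var cacheClient = &http.Client{Timeout: 500 * time.Millisecond}

// getChunk tries the sidecar caching service first, then falls back to S3.
func getChunk(hash string) ([]byte, error) {
	resp, err := cacheClient.Get("http://localhost:9999/chunks/" + hash)
	if err == nil {
		defer resp.Body.Close()
		if resp.StatusCode == http.StatusOK {
			return io.ReadAll(resp.Body) // cache hit: serve from the sidecar
		}
	}
	// cache miss (or sidecar unavailable): read from S3 as we do today
	return fetchFromS3(hash)
}

func main() {
	if _, err := getChunk("exampleChunkHash"); err != nil {
		fmt.Println("read fell back to S3:", err)
	}
}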

Maintaining a decoupled caching service will make future scaling of DoltHub's caches much easier, since such a service can be scaled separately from the API and can even be deployed to a separate host with more CPU, memory, and disk if necessary.

In addition to setting up caching as a separate service, we also developed a plan to make reads from the caching service much faster. Our hypothesis is that implementing a disk-backed caching service that reads data stored locally on disk will decrease the latency for reads against DoltHub databases. To store a database's data on disk, we would download the database's appendix table-file when data is requested but not found. Then, if that same data is re-requested, we won't need to download the file again; we can simply read from it.
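
Here's a rough sketch of that download-on-miss idea, assuming a local cache directory on an EBS-backed mount. The directory layout, the downloadTableFile helper, and the byte-range lookup are placeholders for illustration, not the real implementation.

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

const cacheDir = "/var/dolthub-cache" // hypothetical EBS-backed mount

// downloadTableFile stands in for pulling a database's appendix table-file from S3.
func downloadTableFile(db, dest string) error {
	// ... download from S3 to dest ...
	return fmt.Errorf("not implemented in this sketch")
}

// readRange returns length bytes at offset from db's cached appendix table-file,
// downloading the file first if this is the first request for that database.
func readRange(db string, offset int64, length int) ([]byte, error) {
	path := filepath.Join(cacheDir, db+".appendix")
	if _, err := os.Stat(path); os.IsNotExist(err) {
		if err := downloadTableFile(db, path); err != nil {
			return nil, err // caller falls back to S3
		}
	}
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	buf := make([]byte, length)
	_, err = f.ReadAt(buf, offset)
	return buf, err
}

func main() {
	if _, err := readRange("tatoeba-sentence-translations", 0, 4096); err != nil {
		fmt.Println("cache miss path:", err)
	}
}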

Appendix table-files, a recent addition to DoltHub, are structured data files specifically optimized to improve data retrieval by maintaining an ordered copy of all data in a database. We currently generate these table-files on a schedule for all DoltHub databases, storing them in S3 alongside a database's standard table-files. We suspect, though, that if they were stored on disk instead, say on an EBS volume, the data could be read from them much faster.

Here's a simple illustration highlighting our proposed change to DoltHub's caching strategy:

Caching service proposal

The image on the left depicts the flow of data from S3 to DoltHub's API, our current caching model, and the image on the right depicts the new model, which incorporates a disk-backed caching service. The arrows' colors represent the speed of the data fetching/transfer: green is the fastest with the lowest latency, yellow is medium latency, and red is the highest latency, meaning it's the slowest.

In the left-hand image, you can see that data retrieval is fastest from the in-memory cache, but is otherwise mediocre since data is being fetched at yellow speed from S3. In the right-hand image, you can see that the API still utilizes the caching strategy represented in the left image, but has an additional service, the "Caching Service," where the API can request data before it requests data from S3.

Appendix table-files are stored in this new caching service, and the red arrow represents the time it takes to download an appendix table-file to disk if it is not already present, or cached. When that's the case, the API layer will default to reading the chunks from S3 at yellow speed. Once the appendix table-file containing the requested data is downloaded, though, reads from the caching service occur at green speed, represented by the green arrow pointing from the caching service to the API.

Conceptually this plan makes sense. For the most part, reading data from disk is faster than reading data over a network using HTTP. However, before we went and did the work of caching full table-files and serving chunks locally before remoting out to S3, we wanted to benchmark the relative performance increase, especially given the myriad of options for EBS volume types. Instead of relying on generic performance measures of throughput and latency, we knew Dolt's access patterns could be pretty specific, and we wanted to measure how much better we could do with network-attached storage. That's why we set out to quickly test our hypothesis in a series of ad hoc benchmarks before jumping headfirst into our proposed caching changes!


Read Latencies of S3 vs EBS

We currently have some monitoring around the DoltHub API's request lifecycle, so we know that the round-trip time to fetch the first bytes of data from S3 when executing a read query on DoltHub is about 30 milliseconds. For our quick benchmarking, this was good enough to get started, so we established this rough measurement as the time for EBS to beat.
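
If you want to reproduce a measurement like this without DoltHub's monitoring, something along these lines gives a rough time-to-first-byte for an S3 object using the AWS SDK for Go; the bucket, key, and region below are placeholders.

package main

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("us-west-2")}))
	svc := s3.New(sess)

	start := time.Now()
	out, err := svc.GetObject(&s3.GetObjectInput{
		Bucket: aws.String("example-table-file-bucket"), // placeholder bucket
		Key:    aws.String("example/appendix-table-file"),
		Range:  aws.String("bytes=0-0"), // only ask for the first byte
	})
	if err != nil {
		panic(err)
	}
	defer out.Body.Close()

	buf := make([]byte, 1)
	if _, err := out.Body.Read(buf); err != nil {
		panic(err)
	}
	fmt.Printf("time to first byte: %s\n", time.Since(start))
}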

To benchmark reads on an EBS volume, we turned to AWS's helpful documentation, which explains the steps you can follow to thoroughly benchmark volumes, as well as the different volume types and the trade-offs to consider between them. For our purposes, we really just wanted to see that their standard gp2 storage was faster than 30 milliseconds for random-access reads, so that's what we tested.

To test, we provisioned an EC2 instance, attached an EBS gp2 volume, and ssh'd into the machine. Next, we installed fio, a tool for benchmarking I/O on Linux. We followed AWS's instructions for initializing the volume, then ran their documented command for benchmarking gp2:

$ sudo fio --filename=/dev/nvme0n1 --name fio_test_file --direct=1 --rw=randread --bs=16k --size=1G --numjobs=16 --time_based --runtime=180 --group_reporting --norandommap
fio_test_file: (g=0): rw=randread, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=psync, iodepth=1
...
fio-3.1
Starting 16 processes
Jobs: 16 (f=16): [r(16)][100.0%][r=48.0MiB/s,w=0KiB/s][r=3074,w=0 IOPS][eta 00m:00s]
fio_test_file: (groupid=0, jobs=16): err= 0: pid=3924: Mon Jul 12 19:32:06 2021
   read: IOPS=3088, BW=48.3MiB/s (50.6MB/s)(8688MiB/180006msec)
    clat (usec): min=264, max=81045, avg=5178.24, stdev=501.60
     lat (usec): min=264, max=81045, avg=5178.42, stdev=501.60
    clat percentiles (usec):
     |  1.00th=[ 4490],  5.00th=[ 4883], 10.00th=[ 4948], 20.00th=[ 5080],
     | 30.00th=[ 5145], 40.00th=[ 5145], 50.00th=[ 5211], 60.00th=[ 5211],
     | 70.00th=[ 5276], 80.00th=[ 5342], 90.00th=[ 5407], 95.00th=[ 5538],
     | 99.00th=[ 5866], 99.50th=[ 6063], 99.90th=[ 6587], 99.95th=[ 6849],
     | 99.99th=[17171]
   bw (  KiB/s): min= 2459, max= 9504, per=6.25%, avg=3089.11, stdev=323.96, samples=5759
   iops        : min=  153, max=  594, avg=193.05, stdev=20.25, samples=5759
  lat (usec)   : 500=0.29%, 750=0.32%, 1000=0.04%
  lat (msec)   : 2=0.01%, 4=0.07%, 10=99.25%, 20=0.01%, 50=0.01%
  lat (msec)   : 100=0.01%
  cpu          : usr=0.22%, sys=0.28%, ctx=556074, majf=0, minf=241
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=556023,0,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=48.3MiB/s (50.6MB/s), 48.3MiB/s-48.3MiB/s (50.6MB/s-50.6MB/s), io=8688MiB (9110MB), run=180006-180006msec

Disk stats (read/write):
  nvme0n1: ios=555678/36, merge=0/4, ticks=2869471/231, in_queue=2208108, util=99.99%

There's a lot going on in the fio output above, but the most important line is the one reporting the clat:

    clat (usec): min=264, max=81045, avg=5178.24, stdev=501.60

clat is an abbreviation of "completion latency", which is the time between submitting a request and its completion. We can also see that the avg (average) completion latency for this benchmark is 5178.24 usecs, or microseconds, which is actually only about 5.178 milliseconds!

This means that for random reads against an EBS volume we get an average latency of about 5 milliseconds, roughly 6 times faster than our 30 millisecond round-trip times from S3 (30 ms / 5.18 ms ≈ 5.8). Based on this data, the credibility of our disk-backed caching service hypothesis greatly increased, but we still wanted to be sure we'd see these performance gains when running SQL queries against EBS in-cluster.

In-Cluster Benchmarking

For our in-cluster smell test, we deployed a Pod and installed two Dolt binaries on it. One binary was a standard Dolt client built from the latest dolt version, and the second was a hacked Dolt client.

Hacked dolt client

The hacked Dolt client is exactly the same as the standard Dolt binary we installed, except that it allows us to execute SQL queries against DoltHub databases instead of local Dolt databases, so it acts almost like a headless DoltHub SQL console. This essentially means that we could use this hacked client to benchmark reads against S3 using the Dolt CLI's sql command.

After installing the Dolt binaries, we cloned dolthub/tatoeba-sentence-translations into the Pod, added an appendix table-file to it, and constructed a SQL query that does random read access. Benchmarking from there only required timing two SELECT queries, executing one with our standard binary, dolt, and the other with our hacked binary, dolt-remoter:

$ time dolt sql -q "SELECT Text FROM translations JOIN sentences ON translations.TranslationId = sentences.SentenceId WHERE translations.SentenceID % 5000 = 0;"
...
real	0m18.093s
user	1m4.407s
sys	0m0.852s
$ time dolt-remoter sql -q "SELECT Text FROM translations JOIN sentences ON translations.TranslationId = sentences.SentenceId WHERE translations.SentenceID % 5000 = 0;"
...
real	3m8.903s
user	1m10.553s
sys	0m10.643s

And voilà! As we expected, we do see significant gains when reading from EBS with dolt versus reading from S3 with dolt-remoter, about a 10.5x improvement! At this point, our simple smell tests are smelling like freshly baked bread (which smells good!), but we aren't quite done. Our last test aims to observe these performance wins when serving the cached data from a prototype of the proposed service.

Smelling good

Prototype Benchmarking

Although we'd basically confirmed that EBS reads are significantly faster than S3 reads, we still wanted to observe the performance gains after prototyping this service.

Using the 3 minute time we observed earlier with dolt-remoter as the time for our facsimile service to beat, we stitched together a janky gRPC server with a single RPC called GetChunk that returns data (we call these pieces chunks) from tatoeba-sentence-translations' appendix table-file.

The simple server looks something like the following:

func main() {
   // open appendix table-file supplied as first argument
	f, err := os.Open(os.Args[1])
	if err != nil {
		panic(err)
	}

   // get the index
	tf := filepath.Base(os.Args[1])
	index, err := nbs.ReadTableIndex(f)
	if err != nil {
		panic(err)
	}

   // some golang reflection so we can look up
   // chunks by the unexported address type
	var a [20]byte
	vindex := reflect.ValueOf(index)
	lookupM := vindex.MethodByName("Lookup")

	addrType := lookupM.Type().In(0).Elem()
	v := reflect.NewAt(lookupM.Type().In(0).Elem(), unsafe.Pointer(&a))
	res := lookupM.Call([]reflect.Value{v})
	fmt.Println(res[0], res[1])

   // initialize prototyped service
	s := &service{
		LookupF: func(a [20]byte) (Entry, bool) {
			res := lookupM.Call([]reflect.Value{reflect.NewAt(addrType, unsafe.Pointer(&a))})
			e, f := res[0].Interface().(Entry), res[1].Bool()
			if !f {
				return iEntry{}, f
			}
			return iEntry{e.Offset(), e.Length()}, true
		},
		Reader: f,
	}

   // setup and run server
	fmt.Printf("Starting server on 9999 with appendix table-file: %s\n", tf)
	lis, err := net.Listen("tcp", fmt.Sprintf("localhost:%d", 9999))
	if err != nil {
		panic(err)
	}

	var opts []grpc.ServerOption
	grpcServer := grpc.NewServer(opts...)
	testchunkcache.RegisterTestChunkCacheServiceServer(grpcServer, s)
	err = grpcServer.Serve(lis)
	if err != nil {
		panic(err)
	}
}

// ... type definitions omitted ...

func (s *service) GetChunk(ctx context.Context, req *testchunkcache.GetChunkRequest) (*testchunkcache.GetChunkResponse, error) {
	var lookup [20]byte
	copy(lookup[:], req.Hash[:20])

   // lookup the chunk
	entry, found := s.LookupF(lookup)
	if !found {
		return &testchunkcache.GetChunkResponse{Found: false}, nil
	}

   // chunk found, read its bytes and return them
	contents := make([]byte, entry.Length())
	_, err := s.Reader.ReadAt(contents, int64(entry.Offset()))
	if err != nil {
		return nil, err
	}

	return &testchunkcache.GetChunkResponse{
		Found: true,
		Contents: contents,
	}, nil
}

There's some extraneous golang complexity going on above, but that can be ignored. The most important thing to note is that this test service will return a chunk (a piece of data) from the appendix table-file that's passed as the first argument to the server when it runs.
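
For completeness, the client side of that RPC would look roughly like the sketch below. Like the server snippet above, it omits imports and the generated testchunkcache package; the client constructor name assumes the standard protoc-gen-go naming, and fetchFromS3 is a stand-in for the existing remote read path.

// fetchFromS3 stands in for dolt-remoter's normal S3 chunk fetch.
func fetchFromS3(hash []byte) ([]byte, error) {
	// ... existing remote read logic ...
	return nil, fmt.Errorf("not implemented in this sketch")
}

// getChunk tries the prototype caching service first, then falls back to S3.
func getChunk(ctx context.Context, client testchunkcache.TestChunkCacheServiceClient, hash []byte) ([]byte, error) {
	resp, err := client.GetChunk(ctx, &testchunkcache.GetChunkRequest{Hash: hash})
	if err == nil && resp.Found {
		return resp.Contents, nil // served from the appendix table-file on disk
	}
	// chunk not cached (or the service is unreachable): fall back to S3
	return fetchFromS3(hash)
}

// dial connects to the prototype service running next to the client.
func dial() (testchunkcache.TestChunkCacheServiceClient, error) {
	conn, err := grpc.Dial("localhost:9999", grpc.WithInsecure())
	if err != nil {
		return nil, err
	}
	return testchunkcache.NewTestChunkCacheServiceClient(conn), nil
}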

So, to test our prototyped service, we started running it in-cluster, then changed dolt-remoter once more to call the GetChunk RPC and get data from the prototype service before reading any missing chunks from S3. Then we ran the same SQL read query we ran earlier:

$ time dolt-remoter sql -q "SELECT Text FROM translations JOIN sentences ON translations.TranslationId = sentences.SentenceId WHERE translations.SentenceID % 5000 = 0;"
...
real	0m27.362s
user	1m8.232s
sys	0m2.764s

And, once again, we see the speed improvement we've seen throughout all of our benchmarking efforts so far. Our time to beat was 3 minutes, and we destroyed it, coming in at about 27 seconds! This was all the proof we needed to know that our new caching service will give us much better read performance if we cache the appendix table-files on an EBS volume.

Conclusion

We've now seen that our plan to add a disk-backed appendix table-file cache to DoltHub is a credible strategy for decreasing read latencies. This will be our next infrastructure undertaking, and we're excited to see how it improves the DoltHub user experience once it's live. Stay tuned for updates; the work is underway!

Curious about Dolt, DoltHub and the versioned database of the future? There's no better place to get started than DoltHub.com where you can download Dolt, host your own public and private Dolt databases, or just clone some amazing public databases you won't find anywhere else.

Questions, comments, or looking to start backing your application with a Dolt database? Get in touch with our team here or join us on Discord!
