Optimizing Vercel Sandbox snapshots

When we recently shipped filesystem snapshots in Vercel Sandbox to let teams capture and restore a sandbox's entire filesystem state, our initial engineering focus was pure reliability. We had to ensure the system would never fail to snapshot or lose data.

Once that foundation was stable, we faced a new problem: p75 snapshot restores were taking over 40 seconds. By parallelizing the restore pipeline and adding a local cache, we brought those times down to under one second.

What a snapshot looks like on disk

Vercel Sandbox runs on the same infrastructure as our internal builds product, Hive. Each sandbox is an isolated container inside a Firecracker microVM.

A snapshot is a compressed copy of the sandbox's disk. We're working with two different files:

  • The raw disk image (.img), which can be several GBs

  • A compressed version in our custom VHS format (Vercel Hive Snapshot), which is what gets uploaded to and downloaded from S3

When you call sandbox.snapshot(), we compress the .img into a .vhs and upload it to S3. When you call Sandbox.create() with a snapshot, we download the .vhs and decompress it back. The compression cuts the file size, which matters when you're moving hundreds of MBs to GBs over the network.

Parallelize you shall

With reliability in place, we turned to performance. The restore path was painfully sequential. We'd download the entire .vhs file from S3 in a single request, wait for it to finish, then decompress it in a single thread.

The original restore pipeline: a single S3 download followed by single-threaded decompression

Snapshots range from 200MB to a few GBs, so that single S3 download alone could take anywhere from a few seconds to tens of seconds. We switched to downloading chunks in parallel using the Range HTTP header; the AWS Go SDK's transfermanager API is built for exactly this. After benchmarking various concurrency levels and chunk sizes, we ended up with 2-5x faster downloads.
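The pattern is straightforward: split the object's byte range into fixed-size chunks, fetch each chunk concurrently, and reassemble in order. A minimal sketch, with a pluggable fetch function standing in for an S3 GET carrying a Range header (names here are illustrative, not the SDK's API):

```go
package main

import (
	"fmt"
	"sync"
)

// chunkRange describes one byte span of the object, suitable for an
// HTTP "Range: bytes=start-end" request (end is inclusive).
type chunkRange struct{ start, end int64 }

// splitRanges cuts a total object size into fixed-size chunks.
func splitRanges(size, chunk int64) []chunkRange {
	var out []chunkRange
	for off := int64(0); off < size; off += chunk {
		end := off + chunk - 1
		if end >= size {
			end = size - 1
		}
		out = append(out, chunkRange{off, end})
	}
	return out
}

// downloadParallel fetches every chunk concurrently and reassembles the
// parts in order. fetch stands in for a ranged S3 GET.
func downloadParallel(size, chunk int64, fetch func(chunkRange) []byte) []byte {
	ranges := splitRanges(size, chunk)
	parts := make([][]byte, len(ranges))
	var wg sync.WaitGroup
	for i, r := range ranges {
		wg.Add(1)
		go func(i int, r chunkRange) {
			defer wg.Done()
			parts[i] = fetch(r) // each goroutine fills its own slot
		}(i, r)
	}
	wg.Wait()
	var out []byte
	for _, p := range parts {
		out = append(out, p...)
	}
	return out
}

func main() {
	object := make([]byte, 1000)
	for i := range object {
		object[i] = byte(i)
	}
	got := downloadParallel(1000, 256, func(r chunkRange) []byte {
		return object[r.start : r.end+1]
	})
	fmt.Println(len(got), len(splitRanges(1000, 256))) // 1000 4
}
```

In production the chunk size and concurrency are the levers that the benchmarking tuned: too-small chunks waste round trips, too few workers leave bandwidth on the table.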

Splitting the download into parallel S3 range requests

Next up was decompression. Our .vhs format is composed of a header and a frame for each allocated region of the disk image. Instead of decoding and decompressing frames one by one, we switched to one decoder and N decompression goroutines. This made the .vhs to .img restore 2-4x faster, depending on snapshot size.
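This is a classic fan-out: a single producer walks the frames sequentially while a pool of workers decompresses them. Because each frame maps to a distinct region of the .img, workers can write their results without locking. A sketch under those assumptions (the frame layout and function names are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

// frame is one compressed region of the disk image; in the real VHS
// format each frame also records where it lands in the .img.
type frame struct {
	index int
	data  []byte
}

// restoreFrames drains a channel fed by a single decoder and runs n
// decompression workers. decompress stands in for the per-frame codec.
func restoreFrames(frames <-chan frame, n int, decompress func([]byte) []byte, out [][]byte) {
	var wg sync.WaitGroup
	for w := 0; w < n; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for f := range frames {
				// Each frame's destination is independent, so no locking.
				out[f.index] = decompress(f.data)
			}
		}()
	}
	wg.Wait()
}

func main() {
	frames := make(chan frame)
	out := make([][]byte, 8)
	go func() { // the decoder: walks the .vhs and emits frames in order
		for i := 0; i < 8; i++ {
			frames <- frame{index: i, data: []byte{byte(i)}}
		}
		close(frames)
	}()
	restoreFrames(frames, 4, func(b []byte) []byte { return append(b, b...) }, out)
	fmt.Println(len(out), len(out[7])) // 8 2
}
```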

Fanning out decompression across multiple goroutines

With both downloading and decompressing parallelized, we still had one remaining optimization. We piped S3 range request streams directly into decompression, without writing an intermediary file to disk or waiting for the full download to complete. That cut end-to-end restore time by another 2x.

Piping S3 download streams directly into decompression, no intermediate file

We… didn't cache?

As you might have noticed, so far we've only talked about improving the slow path: retrieving a snapshot from S3 on a cache miss. In truth, we didn't have a fast path at all, so every restore was a cache miss. Yeah, we really didn't focus on performance at first.

Our sandboxes run on metal instances with NVMe disks, which means several terabytes of fast local storage that was mostly unused.

We added a local disk cache using LRU (least recently used) eviction, sized by total disk space rather than number of entries. We cache the decompressed .img directly rather than the compressed .vhs, so a cache hit skips both the download and the decompression. Once the cache fills up, the least recently used snapshots get evicted to make room.
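The interesting twist is the eviction budget: it's total bytes, not entry count, since snapshots vary from hundreds of MBs to GBs. A minimal sketch of that policy (type and method names are illustrative, and a real implementation would also handle concurrency and on-disk bookkeeping):

```go
package main

import (
	"container/list"
	"fmt"
)

// diskCache is an LRU cache of decompressed .img files, bounded by total
// bytes on disk rather than number of entries.
type diskCache struct {
	capBytes, usedBytes int64
	order               *list.List               // front = most recently used
	entries             map[string]*list.Element // snapshot ID -> list node
}

type cacheEntry struct {
	id   string
	size int64
}

func newDiskCache(capBytes int64) *diskCache {
	return &diskCache{capBytes: capBytes, order: list.New(), entries: map[string]*list.Element{}}
}

// Get marks a snapshot as recently used; a hit skips both the download
// and the decompression.
func (c *diskCache) Get(id string) bool {
	el, ok := c.entries[id]
	if ok {
		c.order.MoveToFront(el)
	}
	return ok
}

// Put records a restored image, evicting least recently used entries
// until the total size fits under the byte budget.
func (c *diskCache) Put(id string, size int64) {
	if el, ok := c.entries[id]; ok {
		c.order.MoveToFront(el)
		return
	}
	c.entries[id] = c.order.PushFront(&cacheEntry{id, size})
	c.usedBytes += size
	for c.usedBytes > c.capBytes {
		oldest := c.order.Back()
		e := oldest.Value.(*cacheEntry)
		c.order.Remove(oldest)
		delete(c.entries, e.id)
		c.usedBytes -= e.size
	}
}

func main() {
	c := newDiskCache(10) // tiny byte budget for the demo
	c.Put("base", 4)
	c.Put("feature", 4)
	c.Get("base")     // touch "base" so it stays hot
	c.Put("other", 4) // over budget: evicts "feature", the LRU entry
	fmt.Println(c.Get("base"), c.Get("feature")) // true false
}
```

Sizing by bytes means one multi-GB snapshot can evict many small ones, which matches the actual resource being protected: NVMe capacity.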

Most customers have a "base" snapshot that they reuse across many sandboxes, so we're seeing a 95% cache hit rate. On a cache hit, boot time is bounded only by starting the microVM and container.

Local NVMe cache hit rate, consistently above 90%

From 40 seconds to sub-second

p75 dropped from 40s to sub-second, and p95 went from 50s to 5s. With our cache hit rate, most sandbox boots skip the download and decompression pipeline entirely.

Snapshot restore p95 latency dropping from 50s to under 10s

We're exploring more ideas. Cache affinity would route sandboxes to metal instances that already have the requested snapshot cached, potentially eliminating the cold path for popular snapshots. But that risks thundering herds and hotspots on certain machines, so we're being deliberate about it.

Long term, we want the cold path fast enough that caching is a bonus, not a requirement.

These optimizations already power Automatic Persistence, now in beta. When you stop a named sandbox, its filesystem is automatically snapshotted and restored on resume. Sub-second restore means that cycle feels instant.

Filesystem snapshots are available today for all Vercel Sandboxes. Check the Sandbox documentation to get started.