When we recently shipped filesystem snapshots in Vercel Sandbox to let teams capture and restore a sandbox's entire filesystem state, our initial engineering focus was pure reliability. We had to ensure the system would never fail to snapshot or lose data.
Once that foundation was stable, we faced a new problem: p75 snapshot restores were taking over 40 seconds. By parallelizing the restore path and adding a local cache, we brought those restore times down to under one second.
What a snapshot looks like on disk
Vercel Sandbox runs on the same infrastructure as our internal builds product, Hive. Each sandbox is an isolated container inside a Firecracker microVM.
A snapshot is a compressed copy of the sandbox's disk. We're working with two different files:
- The raw disk image (.img), which can be several GBs
- A compressed version in our custom VHS format (Vercel Hive Snapshot), which is what gets uploaded to and downloaded from S3
When you call sandbox.snapshot(), we compress the .img into a .vhs and upload it to S3. When you call Sandbox.create() with a snapshot, we download the .vhs and decompress it back. The compression cuts the file size, which matters when you're moving hundreds of MBs to GBs over the network.
Parallelize you shall
With reliability in place, we turned to performance. The restore path was painfully sequential. We'd download the entire .vhs file from S3 in a single request, wait for it to finish, then decompress it in a single thread.
Snapshots range from 200MB to a few GBs, so that single S3 download alone could take anywhere from several seconds to tens of seconds. We used the Range HTTP header to download chunks in parallel instead. The AWS Go SDK has a transfermanager API built for exactly this. After benchmarking various concurrency levels and chunk sizes, we ended up with 2-5x faster downloads.
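The core idea can be sketched in a few lines of Go. Here `fetchRange` simulates a ranged S3 GET against an in-memory object (the transfermanager API does this splitting for you against real S3); each chunk is fetched in its own goroutine and copied into its slot of the result buffer:

```go
package main

import (
	"sync"
)

// fetchRange simulates an HTTP GET with a "Range: bytes=start-end" header
// against an in-memory object. In production this would be an S3 request.
func fetchRange(object []byte, start, end int64) []byte {
	return object[start:end]
}

// parallelDownload splits the object into chunkSize ranges and fetches
// them concurrently. Chunks land at fixed, non-overlapping offsets, so
// the goroutines can write into the shared buffer without locking.
func parallelDownload(object []byte, chunkSize int64) []byte {
	size := int64(len(object))
	out := make([]byte, size)
	var wg sync.WaitGroup
	for start := int64(0); start < size; start += chunkSize {
		end := start + chunkSize
		if end > size {
			end = size
		}
		wg.Add(1)
		go func(start, end int64) {
			defer wg.Done()
			copy(out[start:end], fetchRange(object, start, end))
		}(start, end)
	}
	wg.Wait()
	return out
}
```

A real implementation would also cap concurrency, since spawning one goroutine per chunk of a multi-GB file would open too many simultaneous connections.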
Next up was decompression. Our .vhs format is composed of a header and a frame for each allocated region of the disk image. Instead of decoding and decompressing frames one by one, we switched to one decoder and N decompression goroutines. This made the .vhs to .img restore 2-4x faster, depending on snapshot size.
With both downloading and decompressing parallelized, we still had one remaining optimization. We piped S3 range request streams directly into decompression, without writing an intermediary file to disk or waiting for the full download to complete. That cut end-to-end restore time by another 2x.
We… didn't cache?
As you might have noticed, so far we've only talked about improving the slow path: retrieving a snapshot from S3 on a cache miss. In truth, we didn't have a fast path at all, so every restore was a cache miss. Yeah, we really didn't focus on performance at first.
Our sandboxes run on metal instances with NVMe disks, which means several terabytes of fast local storage that was mostly unused.
We added a local disk cache using LRU (least recently used) eviction, sized by total disk space rather than number of entries. We cache the decompressed .img directly rather than the compressed .vhs, so a cache hit skips both the download and the decompression. Once the cache fills up, the least recently used snapshots get evicted to make room.
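A byte-budgeted LRU looks roughly like this in Go. The sketch tracks only metadata (snapshot ID and size); the real cache stores decompressed .img files on NVMe, and all names here are illustrative:

```go
package main

import "container/list"

// diskCache is an LRU sized by total bytes rather than entry count,
// since snapshots range from hundreds of MB to several GB.
type diskCache struct {
	capacity int64 // total byte budget
	used     int64
	order    *list.List               // front = most recently used
	entries  map[string]*list.Element // snapshot ID → list element
}

type cacheEntry struct {
	id   string
	size int64
}

func newDiskCache(capacity int64) *diskCache {
	return &diskCache{
		capacity: capacity,
		order:    list.New(),
		entries:  map[string]*list.Element{},
	}
}

// Get marks the snapshot as recently used and reports whether it's cached.
func (c *diskCache) Get(id string) bool {
	el, ok := c.entries[id]
	if ok {
		c.order.MoveToFront(el)
	}
	return ok
}

// Put records a snapshot, evicting least recently used entries until the
// byte budget is respected. (The real cache would also delete the .img.)
func (c *diskCache) Put(id string, size int64) {
	if el, ok := c.entries[id]; ok {
		c.order.MoveToFront(el)
		return
	}
	c.entries[id] = c.order.PushFront(&cacheEntry{id: id, size: size})
	c.used += size
	for c.used > c.capacity && c.order.Len() > 1 {
		oldest := c.order.Back()
		e := oldest.Value.(*cacheEntry)
		c.order.Remove(oldest)
		delete(c.entries, e.id)
		c.used -= e.size
	}
}
```

Sizing by bytes rather than entry count is what makes this safe when one customer's snapshot is 200MB and another's is 5GB.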
Most customers have a "base" snapshot that they reuse across many sandboxes, so we're seeing a 95% cache hit rate. On a cache hit, boot time is bounded only by starting the microVM and container.
From 40 seconds to sub-second
p75 dropped from 40s to sub-second, and p95 went from 50s to 5s. With our cache hit rate, most sandbox boots skip the download and decompression pipeline entirely.
We're exploring more ideas. Cache affinity would route sandboxes to metal instances that already have the requested snapshot cached, potentially eliminating the cold path for popular snapshots. But this risks thundering herds and hotspotting certain machines, so we're being deliberate about it.
Long term, we want the cold path fast enough that caching is a bonus, not a requirement.
These optimizations already power Automatic Persistence, now in beta. When you stop a named sandbox, its filesystem is automatically snapshotted and restored on resume. Sub-second restore means that cycle feels instant.
Filesystem snapshots are available today for all Vercel Sandboxes. Check the Sandbox documentation to get started.