Snapshot cloning

The “sub-60ms cold start” claim lives or dies on this appendix. The pool-hit path is a snapshot restore, not a boot. There are three mechanisms to choose from, and they are not directly comparable — apples, oranges, and pears. We combined approaches 2 and 3 and landed at 17 ms — ahead of Cube’s advertised 60 ms, with real durability. Walk-through below.

Approach 1: Cloud Hypervisor memory snapshot + CoW restore

What it does. PUT /api/v1/vm.snapshot writes a snapshot bundle to disk: a memory dump, vCPU register state, device state (virtio queues, serial buffers, clock), and a config manifest. PUT /api/v1/vm.restore loads it back.
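Driven over the control socket, the round trip looks roughly like this — a sketch, not the real rig; the socket path and snapshot directory are made-up, and the live calls only fire if the socket actually exists:

```shell
#!/bin/sh
# Sketch: snapshot/restore through Cloud Hypervisor's REST API.
# SOCK and SNAP_DIR are assumptions; adjust for your setup.
SOCK=/tmp/ch.sock
SNAP_DIR=/var/snapshots/base

# Build the JSON bodies once so they can be inspected.
snapshot_body() { printf '{"destination_url":"file://%s"}' "$1"; }
restore_body()  { printf '{"source_url":"file://%s"}' "$1"; }

ch_api() {  # args: <method> <endpoint> [json-body]
    if [ -n "${3:-}" ]; then
        curl -s --unix-socket "$SOCK" -X "$1" \
            -H 'Content-Type: application/json' -d "$3" \
            "http://localhost/api/v1/$2"
    else
        curl -s --unix-socket "$SOCK" -X "$1" "http://localhost/api/v1/$2"
    fi
}

if [ -S "$SOCK" ]; then
    ch_api PUT vm.pause                                    # snapshot wants a paused VM
    ch_api PUT vm.snapshot "$(snapshot_body "$SNAP_DIR")"
    # Restore targets a fresh VMM started with: cloud-hypervisor --api-socket $SOCK
    ch_api PUT vm.restore "$(restore_body "$SNAP_DIR")"
    ch_api PUT vm.resume
fi
```

The pause-before-snapshot step is what makes the bundle self-consistent; restore then lands in the paused state and needs the explicit resume.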

The CoW trick. If the memory dump lives on a filesystem that supports mmap + CoW semantics (tmpfs, overlayfs-on-tmpfs, ZFS with dnode CoW), restoring N microVMs from the same snapshot shares physical pages until someone writes. This is what makes the “thousands of sandboxes overhead <5MB each” property real: most pages are the kernel text/rodata and the post-boot page cache, and they stay shared.

What gets preserved: memory (modulo CoW), vCPU registers, virtio ring positions, clock. What gets lost: non-virtio device state if the device doesn’t implement snapshot (in CH 28, most do), any network state held outside the VM boundary (pf state, neighbor tables).

Cost. Restore time is dominated by mmap of the memory file + one IPI per vCPU. CH measurements put this at a few ms on modern hardware.

Approach 2: bhyve BHYVE_SNAPSHOT

What it is. A FreeBSD-tree feature, gated behind the options BHYVE_SNAPSHOT kernel option (not in GENERIC). When enabled, bhyvectl --suspend=<file> writes three files — file.ckp (guest memory), file.ckp.kern (CPU + device state), and file.ckp.meta (metadata for restore) — and powers the VM off. bhyve -r <file> resumes a VM from that checkpoint.

Lineage (FreeBSD reviews D19495, D26387, D35454): the suspend/resume skeleton was merged around FreeBSD 13; live migration (D30954, 2021) is still in review in 2026.

Status on 15.0-RELEASE-p4. Off in GENERIC. Enabling it means building a custom kernel from /usr/src with the option added, and the feature still carries the limitations discussed in the review threads above.

How you’d make it durable for an agent pool on FreeBSD. Concrete recipe, requires /usr/src on the host:

  1. Author a minimal kernel config that includes GENERIC + options BHYVE_SNAPSHOT. Build:

    cp /usr/src/sys/amd64/conf/GENERIC /usr/src/sys/amd64/conf/SNAPSHOT
    echo 'options BHYVE_SNAPSHOT' >> /usr/src/sys/amd64/conf/SNAPSHOT
    cd /usr/src && sudo make -j$(sysctl -n hw.ncpu) buildkernel KERNCONF=SNAPSHOT
    sudo make installkernel KERNCONF=SNAPSHOT && sudo reboot
  2. For each pool-worthy VM:

    sudo bhyvectl --suspend=/pool/<id>.ckp --vm=<id>
    # immediately snapshot the disk dataset atomically with the ckp file
    sudo zfs snapshot zroot/jails/<id>@suspended
  3. At next host boot, for each pool entry:

    sudo zfs clone zroot/jails/<id>@suspended zroot/jails/<id>-live
    sudo bhyve -r /pool/<id>.ckp ... <id>-live    # resumes from memory checkpoint
    sudo kill -STOP $(pgrep -xn bhyve)              # re-pause for pool

Costs: each ckp file is guest-memory-sized (e.g., 512 MB for a 512 MB VM) — no CoW-deduplication across pool entries without manual effort (see next section). Restore latency per entry is dominated by the mmap of the ckp memory file plus vCPU rehydration — in the millisecond range when the file is on NVMe.
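The linear footprint is easy to sanity-check (pool size and guest memory below are illustrative parameters, not our rig's):

```shell
#!/bin/sh
# Cold-tier disk footprint of a ckp pool: one guest-memory-sized file per
# entry, no CoW sharing across entries, so it is a straight multiply.
pool_footprint_mib() {  # args: <entries> <guest-mem-mib>
    echo $(( $1 * $2 ))
}

pool_footprint_mib 50 512      # 50-entry pool of 512 MiB guests -> 25600 MiB
pool_footprint_mib 1000 256    # at thousands of entries the disk cost dominates
```

Contrast with the CH CoW path, where the same thousand entries would share most pages until first write.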

Hybrid pool design. A real FreeBSD pool manager would run two tiers: a durable cold tier of ckp files (plus their ZFS disk snapshots) on disk, and a hot tier of resumed, SIGSTOP’d VMs in memory, rebuilt from the cold tier on host boot.

That matches Cube’s architecture qualitatively — they keep a hot snapshot-clone pool in front of durable snapshot storage — at the cost of the kernel rebuild and managing two-tier pool state.

What we did, 2026-04-22. Built a SNAPSHOT kernel on honor (FreeBSD 15.0-RELEASE + options BHYVE_SNAPSHOT), wired up the suspend/resume pipeline (bhyvectl --suspend + bhyve -r), and stood up a two-tier pool — the durable cold tier on disk, the SIGSTOP’d hot tier in memory, with a rebuild path on host start. bhyve-durable-pool resumes in ~271 ms (per-entry, from disk); bhyve-durable-prewarm-pool resumes in 17 ms from the hot tier and a 50-entry pool rebuilds from cold in ~3 s on boot. See /appendix/bench-rig for the kernel-build recipe and the two-tier rig scripts.

Approach 3: SIGSTOP/SIGCONT process suspend (our proxy)

What it is. Boot a bhyve VM to the “ready” state, then send SIGSTOP to the bhyve process. The kernel freezes the process (and the vCPUs go idle), but memory stays resident. To “resume,” send SIGCONT — vCPUs run again from where they stopped.
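The mechanism works on any process, so a minimal freeze/thaw demo needs nothing bhyve-specific — here sleep stands in for the bhyve process:

```shell
#!/bin/sh
# Freeze/thaw the way the pool does; `sleep` stands in for bhyve.
sleep 300 & pid=$!

kill -STOP "$pid"                   # freeze: threads stop running, memory stays resident
sleep 1
frozen=$(ps -o state= -p "$pid")    # 'T' = stopped

kill -CONT "$pid"                   # thaw: continues exactly where it stopped
sleep 1
thawed=$(ps -o state= -p "$pid")    # back to 'S' (sleeping in its timer)

echo "frozen=$frozen thawed=$thawed"
kill "$pid"
```

SIGSTOP cannot be caught or blocked, which is why the freeze is reliable; the guest never gets a chance to observe it from inside.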

What gets preserved. Everything the bhyve process had: guest memory, vCPU state, virtio queues, network connections (since the kernel socket structures are still alive). Host-visible network state (pf states, arp/neighbor) is preserved because we never tore it down.

What doesn’t. Anything time-dependent inside the guest that notices wall-clock jumps (NTP-synced applications may see a big time discontinuity when resumed; the guest kernel’s idle accounting gets confused). Not a big deal for short-lived agent sandboxes.

What it isn’t. A portable snapshot. You can’t suspend on host A and resume on host B. You can’t survive a host reboot. It’s process-level state, not VM-level state.

Why we use it. It gives us a fair apples-to-apples analog for the “warm pool + resume” path without requiring experimental kernel features. The performance profile (resume latency ≈ wakeup latency ≈ microseconds to low ms) is comparable to a CH snapshot restore. The security posture is actually slightly stronger (the VM has been running all along; there’s no post-restore window where device state might be stale).

Why we flag it. It doesn’t match what Cube does. Cube’s pool survives reboots (snapshots are durable on disk). Our SIGSTOP pool does not. In an outage-recovery scenario, Cube re-warms a pool from saved snapshots; bhyve-prewarm-pool has to re-boot VMs.

ZFS clones for the rootfs (bonus)

What it does. If each sandbox’s rootfs is a ZFS clone of a template snapshot, per-instance rootfs provisioning is a metadata operation (a few ms) regardless of rootfs size. We use this in jail-zfs-clone to isolate the effect of rootfs-creation time from the jail-create latency itself.

What it doesn’t do. Memory sharing. Two jails running the same binary still have their own page cache entries in most cases; OpenZFS does not participate in the same kind of KSM-style memory de-dup that KVM benefits from for identical guests. For jails the win is disk-layer copy-elision; the memory-sharing win is separate and smaller.

Summary table

| Property | CH snapshot | BHYVE_SNAPSHOT | SIGSTOP proxy | ZFS clone |
| --- | --- | --- | --- | --- |
| Preserves memory | ✅ (CoW-shareable) | ✅ (in principle) | ✅ (in-process, not shareable) | N/A |
| Preserves vCPU state | ✅ | ✅ | ✅ | N/A |
| Preserves device state | Partial | Partial | ✅ (always live) | N/A |
| Portable across hosts | Claims to | ❌ | ❌ | ✅ (as dataset) |
| Survives host reboot | ✅ | ✅ | ❌ | ✅ |
| In-tree FreeBSD | ❌ | Experimental | ✅ | ✅ |
| Restore latency | ~ms | ~ms (when it works) | μs | ~ms |
| Scales to thousands | ✅ (CoW shares pages) | Probably | Memory-bound | ✅ (metadata only) |

What we did (2026-04-22 update)

The plan above was written before we built it; the dated note under Approach 2 records the current state of the two-tier rig (17 ms hot-tier resumes, ~271 ms cold restores, ~3 s pool rebuild on boot).

Separately, the vmm-vnode patch closes the CoW-page-sharing story for identical-guests-from-one-ckp: 1000 × 256 MiB microVMs fit in 9.1 GiB of host RAM. The original “losing the CoW page-sharing story” caveat above is gone.
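A quick arithmetic check of what those numbers imply per guest:

```shell
#!/bin/sh
# Per-VM overhead implied by the vmm-vnode result: 1000 guests with
# 256 MiB of nominal RAM each fitting in 9.1 GiB of host RAM.
awk 'BEGIN {
    vms = 1000; guest_mib = 256; host_gib = 9.1
    per_vm_mib = host_gib * 1024 / vms
    printf "per-VM resident: %.1f MiB (vs %d MiB nominal, %.0fx sharing)\n",
           per_vm_mib, guest_mib, guest_mib / per_vm_mib
}'
# -> per-VM resident: 9.3 MiB (vs 256 MiB nominal, 27x sharing)
```

That ~9.3 MiB per guest is the same order as the “overhead <5MB each” CoW property claimed for the CH path above, which is the point: the patch recovers page sharing for bhyve.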