Snapshot cloning

The “sub-60ms cold start” claim lives or dies on this appendix. The pool-hit path is a snapshot restore, not a boot. There are three mechanisms to choose from, and they are not directly comparable — apples, oranges, and pears. We combined approaches 2 and 3 and landed at 17 ms — ahead of Cube’s advertised 60 ms, with real durability. Walk-through below.

Approach 1: Cloud Hypervisor memory snapshot + CoW restore

What it does. PUT /api/v1/vm.snapshot writes a snapshot bundle to disk: a memory dump, vCPU register state, device state (virtio queues, serial buffers, clock), and a config manifest. PUT /api/v1/vm.restore loads it back.
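Driven over the control socket, the round trip looks roughly like this — a sketch, not the real rig; the socket path and snapshot directory are made-up, and the live calls only fire if the socket actually exists:

```shell
#!/bin/sh
# Sketch: snapshot/restore through Cloud Hypervisor's REST API.
# SOCK and SNAP_DIR are assumptions; adjust for your setup.
SOCK=/tmp/ch.sock
SNAP_DIR=/var/snapshots/base

# Build the JSON bodies once so they can be inspected.
snapshot_body() { printf '{"destination_url":"file://%s"}' "$1"; }
restore_body()  { printf '{"source_url":"file://%s"}' "$1"; }

ch_api() {  # args: <method> <endpoint> [json-body]
    if [ -n "${3:-}" ]; then
        curl -s --unix-socket "$SOCK" -X "$1" \
            -H 'Content-Type: application/json' -d "$3" \
            "http://localhost/api/v1/$2"
    else
        curl -s --unix-socket "$SOCK" -X "$1" "http://localhost/api/v1/$2"
    fi
}

if [ -S "$SOCK" ]; then
    ch_api PUT vm.pause                                    # snapshot wants a paused VM
    ch_api PUT vm.snapshot "$(snapshot_body "$SNAP_DIR")"
    # Restore targets a fresh VMM started with: cloud-hypervisor --api-socket $SOCK
    ch_api PUT vm.restore "$(restore_body "$SNAP_DIR")"
    ch_api PUT vm.resume
fi
```

The pause-before-snapshot step is what makes the bundle self-consistent; restore then lands in the paused state and needs the explicit resume.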

The CoW trick. If the memory dump lives on a filesystem that supports mmap + CoW semantics (tmpfs, overlayfs-on-tmpfs, ZFS with dnode CoW), restoring N microVMs from the same snapshot shares physical pages until someone writes. This is what makes the “thousands of sandboxes overhead <5MB each” property real: most pages are the kernel text/rodata and the post-boot page cache, and they stay shared.

What gets preserved: memory (modulo CoW), vCPU registers, virtio ring positions, clock. What gets lost: non-virtio device state if the device doesn’t implement snapshot (in CH 28, most do), any network state held outside the VM boundary (pf state, neighbor tables).

Cost. Restore time is dominated by mmap of the memory file + one IPI per vCPU. CH measurements put this at a few ms on modern hardware.

Approach 2: bhyve BHYVE_SNAPSHOT

What it is. A FreeBSD-tree feature, gated behind the options BHYVE_SNAPSHOT kernel option (not in GENERIC). When enabled, bhyvectl --suspend=<file> writes three files — file.ckp (guest memory), file.ckp.kern (CPU + device state), and file.ckp.meta (metadata for restore) — and powers the VM off. bhyve -r <file> resumes a VM from that checkpoint.

Lineage (FreeBSD reviews D19495, D26387, D35454): the suspend/resume skeleton was merged around FreeBSD 13; live migration (D30954, 2021) is still in review in 2026.

Status on 15.0-RELEASE-p4. Off in GENERIC. Enabling it means building a custom kernel from /usr/src with the option added, and the feature still carries the limitations discussed in the review threads above.

How you’d make it durable for an agent pool on FreeBSD. Concrete recipe, requires /usr/src on the host:

  1. Author a minimal kernel config that includes GENERIC + options BHYVE_SNAPSHOT. Build:

    cp /usr/src/sys/amd64/conf/GENERIC /usr/src/sys/amd64/conf/SNAPSHOT
    echo 'options BHYVE_SNAPSHOT' >> /usr/src/sys/amd64/conf/SNAPSHOT
    cd /usr/src && sudo make -j$(sysctl -n hw.ncpu) buildkernel KERNCONF=SNAPSHOT
    sudo make installkernel KERNCONF=SNAPSHOT && sudo reboot
  2. For each pool-worthy VM:

    sudo bhyvectl --suspend=/pool/<id>.ckp --vm=<id>
    # immediately snapshot the disk dataset atomically with the ckp file
    sudo zfs snapshot zroot/jails/<id>@suspended
  3. At next host boot, for each pool entry:

    sudo zfs clone zroot/jails/<id>@suspended zroot/jails/<id>-live
    sudo bhyve -r /pool/<id>.ckp ... <id>-live    # resumes from memory checkpoint
    sudo kill -STOP $(pgrep -xn bhyve)              # re-pause for pool

Costs: each ckp file is guest-memory-sized (e.g., 512 MB for a 512 MB VM) — no CoW-deduplication across pool entries without manual effort (see next section). Restore latency per entry is dominated by the mmap of the ckp memory file plus vCPU rehydration — in the millisecond range when the file is on NVMe.
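The linear footprint is easy to sanity-check (pool size and guest memory below are illustrative parameters, not our rig's):

```shell
#!/bin/sh
# Cold-tier disk footprint of a ckp pool: one guest-memory-sized file per
# entry, no CoW sharing across entries, so it is a straight multiply.
pool_footprint_mib() {  # args: <entries> <guest-mem-mib>
    echo $(( $1 * $2 ))
}

pool_footprint_mib 50 512      # 50-entry pool of 512 MiB guests -> 25600 MiB
pool_footprint_mib 1000 256    # at thousands of entries the disk cost dominates
```

Contrast with the CH CoW path, where the same thousand entries would share most pages until first write.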

Hybrid pool design. A real FreeBSD pool manager would run two tiers: a durable cold tier of ckp files (plus their ZFS disk snapshots) on disk, and a hot tier of resumed, SIGSTOP’d VMs in memory, rebuilt from the cold tier on host boot.

That matches Cube’s architecture qualitatively — they keep a hot snapshot-clone pool in front of durable snapshot storage — at the cost of the kernel rebuild and managing two-tier pool state.

What we did, 2026-04-22. Built a SNAPSHOT kernel on honor (FreeBSD 15.0-RELEASE + options BHYVE_SNAPSHOT), wired up the suspend/resume pipeline (bhyvectl --suspend + bhyve -r), and stood up a two-tier pool — the durable cold tier on disk, the SIGSTOP’d hot tier in memory, with a rebuild path on host start. bhyve-durable-pool resumes in ~271 ms (per-entry, from disk); bhyve-durable-prewarm-pool resumes in 17 ms from the hot tier and a 50-entry pool rebuilds from cold in ~3 s on boot. See /appendix/bench-rig for the kernel-build recipe and the two-tier rig scripts.

Approach 3: SIGSTOP/SIGCONT process suspend (our proxy)

What it is. Boot a bhyve VM to the “ready” state, then send SIGSTOP to the bhyve process. The kernel freezes the process (and the vCPUs go idle), but memory stays resident. To “resume,” send SIGCONT — vCPUs run again from where they stopped.
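The mechanism works on any process, so a minimal freeze/thaw demo needs nothing bhyve-specific — here sleep stands in for the bhyve process:

```shell
#!/bin/sh
# Freeze/thaw the way the pool does; `sleep` stands in for bhyve.
sleep 300 & pid=$!

kill -STOP "$pid"                   # freeze: threads stop running, memory stays resident
sleep 1
frozen=$(ps -o state= -p "$pid")    # 'T' = stopped

kill -CONT "$pid"                   # thaw: continues exactly where it stopped
sleep 1
thawed=$(ps -o state= -p "$pid")    # back to 'S' (sleeping in its timer)

echo "frozen=$frozen thawed=$thawed"
kill "$pid"
```

SIGSTOP cannot be caught or blocked, which is why the freeze is reliable; the guest never gets a chance to observe it from inside.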

What gets preserved. Everything the bhyve process had: guest memory, vCPU state, virtio queues, network connections (since the kernel socket structures are still alive). Host-visible network state (pf states, arp/neighbor) is preserved because we never tore it down.

What doesn’t. Anything time-dependent inside the guest that notices wall-clock jumps (NTP-synced applications may see a big time discontinuity when resumed; the guest kernel’s idle accounting gets confused). Not a big deal for short-lived agent sandboxes.

What it isn’t. A portable snapshot. You can’t suspend on host A and resume on host B. You can’t survive a host reboot. It’s process-level state, not VM-level state.

Why we use it. It gives us a fair apples-to-apples analog for the “warm pool + resume” path without requiring experimental kernel features. The performance profile (resume latency ≈ wakeup latency ≈ microseconds to low ms) is comparable to a CH snapshot restore. The security posture is actually slightly stronger (the VM has been running all along; there’s no post-restore window where device state might be stale).

Why we flag it. It doesn’t match what Cube does. Cube’s pool survives reboots (snapshots are durable on disk). Our SIGSTOP pool does not. In an outage-recovery scenario, Cube re-warms a pool from saved snapshots; bhyve-prewarm-pool has to re-boot VMs.

ZFS clones for the rootfs (bonus)

What it does. If each sandbox’s rootfs is a ZFS clone of a template snapshot, per-instance rootfs provisioning is a metadata operation (a few ms) regardless of rootfs size. We use this in jail-zfs-clone to isolate the effect of rootfs-creation time from the jail-create latency itself.

What it doesn’t do. Memory sharing. Two jails running the same binary still have their own page cache entries in most cases; OpenZFS does not participate in the same kind of KSM-style memory de-dup that KVM benefits from for identical guests. For jails the win is disk-layer copy-elision; the memory-sharing win is separate and smaller.

Summary table

| Property | CH snapshot | BHYVE_SNAPSHOT | SIGSTOP proxy | ZFS clone |
| --- | --- | --- | --- | --- |
| Preserves memory | ✅ (CoW-shareable) | ✅ (in principle) | ✅ (in-process, not shareable) | N/A |
| Preserves vCPU state | ✅ | ✅ | ✅ | N/A |
| Preserves device state | Partial | Partial | ✅ (always live) | N/A |
| Portable across hosts | Claims to | ❌ | ❌ | ✅ (as dataset) |
| Survives host reboot | ✅ | ✅ | ❌ | ✅ |
| In-tree FreeBSD | ❌ | Experimental | ✅ | ✅ |
| Restore latency | ~ms | ~ms (when it works) | μs | ~ms |
| Scales to thousands | ✅ (CoW shares pages) | Probably | Memory-bound | ✅ (metadata only) |

What we did (2026-04-22 update)

The plan above was written before we built it; the dated note under Approach 2 records the current state of the two-tier rig (17 ms hot-tier resumes, ~271 ms cold restores, ~3 s pool rebuild on boot).

Separately, the vmm-vnode patch closes the CoW-page-sharing story for identical-guests-from-one-ckp: 1000 × 256 MiB microVMs fit in 9.1 GiB of host RAM. The original “losing the CoW page-sharing story” caveat above is gone.
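A quick arithmetic check of what those numbers imply per guest:

```shell
#!/bin/sh
# Per-VM overhead implied by the vmm-vnode result: 1000 guests with
# 256 MiB of nominal RAM each fitting in 9.1 GiB of host RAM.
awk 'BEGIN {
    vms = 1000; guest_mib = 256; host_gib = 9.1
    per_vm_mib = host_gib * 1024 / vms
    printf "per-VM resident: %.1f MiB (vs %d MiB nominal, %.0fx sharing)\n",
           per_vm_mib, guest_mib, guest_mib / per_vm_mib
}'
# -> per-VM resident: 9.3 MiB (vs 256 MiB nominal, 27x sharing)
```

That ~9.3 MiB per guest is the same order as the “overhead <5MB each” CoW property claimed for the CH path above, which is the point: the patch recovers page sharing for bhyve.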