KSM-equivalent on FreeBSD

Linux KSM (Kernel Samepage Merging) is what lets CubeSandbox claim “<5 MB overhead per instance” while actually keeping thousands of identical microVMs in RAM. FreeBSD ships no equivalent, and this is the last architectural gap between a Cube-shaped FreeBSD stack and Cube itself. This page records what we found when we actually looked.

What KSM does, briefly

Linux KSM (mm/ksm.c, documented at kernel.org) is an opportunistic content-based page deduplicator. A kernel thread, ksmd, scans pages that userspace opts in via madvise(addr, len, MADV_MERGEABLE). Candidates are tracked in two red-black trees (a stable tree of already-merged pages and an unstable tree of not-yet-merged candidates) ordered by raw page contents via memcmp, not by a hash; a per-page checksum is used only to skip pages that are still changing between scans. On a match, both virtual mappings are rewritten to point at a single read-only physical page; a later write triggers the normal CoW path and un-shares. QEMU/KVM calls MADV_MERGEABLE on guest RAM regions, so many identical VMs collapse into one set of physical pages.

What FreeBSD ships (nothing, directly)

As of FreeBSD 15.0-RELEASE, there is no KSM equivalent committed anywhere in the tree, release or head. We looked for one and found only the following.

There is identity-based sharing via vm_object shadow chains and MAP_SHARED mappings — two processes mmaping the same file or forking from a common parent share backing pages until someone writes. There is no content-based sharing.

What we measured on honor

Running 50 × 256 MB bhyve VMs resumed from per-entry bhyvectl --suspend checkpoints:

| metric | value |
| --- | --- |
| baseline free memory | 18 476 MB |
| after 50 VMs resumed | 4 114 MB free / 13 764 MB active / 3 711 MB inactive |
| Δ (active + inactive + free) consumption | ~13 100 MB |
| nominal un-shared (50 × 256 MB) | 12 800 MB |
| effective dedup | ~0% |

Each bhyve process reports ~275 MB RSS; 50 of them sum to 13.7 GB, roughly the physical consumption observed. FreeBSD's VM is doing its normal thing: each bhyve process holds its own anonymous pages for the guest's resident working set. No content dedup happens anywhere in the stack.

Each VM resumes from a ckp that is mostly byte-identical to the others (same boot, same stock FreeBSD guest), yet nothing in the chain causes those identical bytes to become shared RAM.

The three paths forward

Path A — vmm memseg backed by ckp file (small kernel patch)

The research agent’s original framing was “bhyve userspace-only, hundreds of LoC”. Going to the source showed that was wrong: the change has to land in the kernel.

So to get identity-based CoW across clones resuming from the same ckp, vmm has to accept a vnode-backed memseg instead of an anonymous one. That is a small kernel patch.

Rough size: ~200 LoC kernel + ~50 LoC userspace. Risk: vnode refcount + vm_object lifetime across VM destroy; getting that wrong panics the kernel. Testable with a pre-kernel ZFS boot environment (we already use one for the SNAPSHOT kernel swap).

Expected result if the patch works: 50-entry hot pool drops from ~13 GB to ~0.5–2 GB total, because the identical ckp pages stay in a single vnode-backed vm_object and only guest-written pages CoW into anonymous copies.

This is still the right first thing to try, because it’s one order of magnitude smaller than a full KSM port — but it’s a kernel patch with a real foot-gun surface, not the userspace afternoon the research agent’s initial report implied.

Path B — Full KSM port (bounded but expensive)

A proper FreeBSD KSM port is bounded but substantial.

Estimated size: 2 500-4 000 lines of new code in a new sys/vm/vm_ksm.c + vm_ksm.h, plus 200-400 lines of edits across vm_pageout.c, vm_fault.c, vm_object.c, vm_mmap.c, sys/sys/mman.h, vm_page.h, and the amd64/arm64 pmap.c (promotion interlock). Plus a vm.ksm.* sysctl tree, a ksmctl(8), a bhyve flag, and tests under tests/sys/vm/.

Wall time: 6-12 months for one senior FreeBSD VM developer (Mark Johnston, Konstantin Belousov, Alan Cox caliber). 18 months if the developer has to climb the pmap learning curve.

As a funded sponsorship — think $150-300K with the FreeBSD Foundation or a direct retainer — it is realistic. As an unfunded side project it is not.

Path C — ZFS ARC-only dedup (zero kernel work, partial benefit)

If checkpoint files live on ZFS with compression=lz4 or zstd, the ARC caches each compressed block once and all readers share it.

So ZFS compression shaves the cold-tier storage + resume-read cost but does not close the RAM gap. zfs set dedup=on is still not worth the RAM cost it imposes and doesn’t help the guest-memory problem either.

Useful to combine with Path A or B; not a substitute.

Result — Path A works (2026-04-22)

We wrote Path A and measured it. The patch is patches/vmm-memseg-vnode.diff, a 286-line unified diff touching sys/dev/vmm/vmm_mem.c, sys/dev/vmm/vmm_mem.h, sys/dev/vmm/vmm_dev.c, sys/amd64/include/vmm_dev.h, and lib/libvmmapi/vmmapi.c + vmmapi.h. The new function vm_alloc_memseg_vnode() wraps the vnode's vm_object in a shadow so guest writes CoW into anonymous pages while guest reads fall through to the shared file-backed pages.

The first draft panicked in vm_map_entry_delete during destroy — the shadow’s swap-charge accounting was reading a bogus object->cred. Fix: pass cred=NULL to vm_object_shadow. Shadow pages are transient and accounted at the vm_map entry layer on CoW fault, same pattern as a MAP_PRIVATE file mmap. Second install was clean; benchmarks/rigs/vmm-vnode-probe.c spawns N VMs that attach the same 256 MiB template and touch every page.

| N | naive cost | actual delta | dedup |
| --- | --- | --- | --- |
| 8 × 256 MiB | 2 048 MiB | ~306 MiB | ~85% |
| 50 × 256 MiB | 12 800 MiB | ~5 MiB above cached template | ~2500× |
| 1 000 × 256 MiB | 250 000 MiB | ~9 100 MiB | ~27× |

Direct-ioctl probe (no bhyve process) stops at N=50 because it’s the tightest sharing test. The full-bhyve follow-up goes to N=1000 — see the homepage density table and /appendix/vmm-vnode-patch. At every scale the marginal cost per additional VM is essentially constant: the vm_object shadow metadata (few KB) plus bhyve’s own process state. The last architectural gap against CubeSandbox for the “many microVMs from one checkpoint” case is closed.

What this does not give us: content-based dedup across arbitrary pages (two VMs booted from different images with coincidentally identical pages still don’t share). That’s still Path B territory. For the CubeSandbox workload — one golden checkpoint fanning out into many instances — Path A is sufficient.

The patch is not upstreamable as-is. It needs (at minimum) a bhyve integration in snapshot.c so bhyve -r uses the new ioctl instead of read()-into-anon-memseg, a sysctl or flag to gate the new behavior, and a review round from a vmm maintainer against the vm_object_shadow refcount invariants. But it exists, it’s measured, and the density argument is now a single sentence: on a patched SNAPSHOT kernel, one thousand identical 256 MiB microVMs fit in ~9 GiB of host RAM.

References