Linux KSM (Kernel Samepage Merging) is what lets CubeSandbox claim “<5 MB overhead per instance” while actually keeping thousands of identical microVMs in RAM. FreeBSD ships no equivalent, and this is the last architectural gap between a Cube-shape FreeBSD stack and Cube itself. This page is what we found when we actually looked.
## What KSM does, briefly
Linux KSM (`mm/ksm.c`, docs at kernel.org) is an opportunistic content-based page deduplicator. A kernel thread, `ksmd`, scans pages userspace opts in to via `madvise(addr, len, MADV_MERGEABLE)`. Stable candidates land in red-black trees ordered by page contents (`memcmp`-based comparison, with a checksum to filter unstable pages). On a match, both virtual mappings are rewritten to point at a single physical page marked read-only; writes trigger the normal CoW path and un-share. QEMU/KVM calls `MADV_MERGEABLE` on guest RAM regions, so many identical VMs collapse into one shared set of physical pages.
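The scan-and-merge decision can be sketched in userspace. This is a hypothetical simplification — a flat scan with a toy FNV-1a hash standing in for KSM's trees — not the kernel algorithm:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PGSZ 4096

/* Toy FNV-1a content hash -- a stand-in for KSM's tree lookups. */
static uint64_t page_hash(const unsigned char *pg)
{
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < PGSZ; i++)
        h = (h ^ pg[i]) * 1099511628211ULL;
    return h;
}

/*
 * Count the physical pages n virtual pages would need after merging:
 * the hash narrows candidates, memcmp confirms (hashes can collide),
 * mirroring KSM's "hash/checksum first, byte-compare to commit" shape.
 */
static size_t merged_page_count(unsigned char pages[][PGSZ], size_t n)
{
    size_t unique = 0;
    for (size_t i = 0; i < n; i++) {
        int dup = 0;
        for (size_t j = 0; j < i && !dup; j++)
            if (page_hash(pages[i]) == page_hash(pages[j]) &&
                memcmp(pages[i], pages[j], PGSZ) == 0)
                dup = 1;        /* would map i onto j's physical page */
        if (!dup)
            unique++;
    }
    return unique;
}
```

Two identical guest pages cost one physical page; a write to either would CoW it back out of the merged set.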
## What FreeBSD ships (nothing, directly)
As of FreeBSD 15.0-RELEASE, there is no committed KSM equivalent in head. We checked:

- Phabricator reviews at reviews.freebsd.org matching `ksm`, `samepage`, `page merge`, `page dedup` — none.
- Quarterly reports and 15.0 release notes — no mention of content-based page sharing in the VM subsystem.
- freebsd-virtualization@ archives — periodic “has anyone looked at this?” threads. The standing answer from VM maintainers: hairy interactions with `vm_object` collapse, superpage promotion, and pmap-level invariants; a narrow user base; it hasn’t been prioritized.
- Downstream forks (Netflix, Sato-san’s out-of-tree work, iXsystems, Klara) — none carrying a KSM backport we could find.
There is identity-based sharing via `vm_object` shadow chains and `MAP_SHARED` mappings — two processes mmap()ing the same file, or forking from a common parent, share backing pages until someone writes. There is no content-based sharing.
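Identity-based sharing is visible from plain POSIX userspace: two `MAP_SHARED` mappings of the same object resolve to the same backing pages, so a write through one is immediately visible through the other. A minimal sketch (temp-file path is arbitrary):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map the same file twice with MAP_SHARED and show both views alias
 * the same pages -- the identity-based sharing FreeBSD already has.
 * Returns 1 if the mappings alias, 0 if not, -1 on setup failure.
 */
static int shared_mappings_alias(void)
{
    char tmpl[] = "/tmp/ksm-demo.XXXXXX";
    int fd = mkstemp(tmpl);
    if (fd < 0)
        return -1;
    unlink(tmpl);                       /* anonymous-ish: name gone, fd live */
    if (ftruncate(fd, 4096) != 0)
        return -1;

    /* Two independent MAP_SHARED mappings of the same file object. */
    char *a = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *b = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (a == MAP_FAILED || b == MAP_FAILED)
        return -1;

    strcpy(a, "written through mapping a");
    /* b sees it: both mappings resolve to the same physical pages. */
    int aliased = (strcmp(b, "written through mapping a") == 0);

    munmap(a, 4096);
    munmap(b, 4096);
    close(fd);
    return aliased;
}
```

Content-based sharing would require the kernel to discover that two *unrelated* mappings happen to hold identical bytes — that machinery is what's missing.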
## What we measured on honor
Running 50 × 256 MB bhyve VMs resumed from per-entry `bhyvectl --suspend` checkpoints:
| metric | value |
|---|---|
| baseline free memory | 18 476 MB |
| after 50 VMs resumed | 4 114 MB free / 13 764 MB active / 3 711 MB inactive |
| Δ active + inactive + free consumption | ~13 100 MB |
| nominal un-shared (50 × 256 MB) | 12 800 MB |
| effective dedup | ~0% |
Per-bhyve RSS reports 275 MB each, which sums to 13.7 GB — roughly the physical consumption observed. FreeBSD’s VM is doing its normal thing: each bhyve process has its own anonymous pages for the guest’s resident working set. No content-dedup happens anywhere in the stack.
Despite each VM being resumed from a checkpoint (ckp) whose file bytes are mostly identical across entries (same boot, same stock FreeBSD guest), nothing in the chain causes those identical bytes to become shared RAM.
## The three paths forward
### Path A — vmm memseg backed by ckp file (small kernel patch)
The research agent’s original framing was “bhyve userspace-only, hundreds of LoC”. Going to the source: that was wrong. The real scope:
- `bhyve -r` restores by `read(ckp_fd, baseaddr, len)` in `usr.sbin/bhyve/snapshot.c:657`. `baseaddr` is a userspace mmap window onto vmm-kernel-allocated guest memory. The mmap is `MAP_SHARED` against `/dev/vmm/<name>`.
- The kernel-side memseg is `vm_object_allocate(OBJT_SWAP, …)` at `sys/dev/vmm/vmm_mem.c:190` — pure anonymous memory, no hook for file backing.
So to get identity-based CoW across clones resuming from the same ckp, vmm has to accept a vnode-backed memseg instead of an anonymous one. That’s a small kernel patch:
- `sys/dev/vmm/vmm_mem.c`: new `vm_alloc_memseg_vnode(vm, ident, len, sysmem, vp)` that calls `vnode_pager_allocate()` on the ckp file’s vnode to build the `vm_object`, then wires it into the same `vm_mem_seg` slots.
- `sys/dev/vmm/vmm_dev.c`: new `VM_ALLOC_MEMSEG_VNODE` ioctl taking `{ident, len, sysmem, fd}`. Pulls a vnode from the fd via `getvnode()`.
- `sys/dev/vmm/vmm_ioctl.h`: ioctl number.
- `lib/libvmmapi/vmmapi.c`: userspace `vm_alloc_memseg_vnode()`.
- `usr.sbin/bhyve/snapshot.c`: in `restore_vm_mem`, call the new API with the ckp fd instead of the current read-into-anon-memseg path.
Rough size: ~200 LoC kernel + ~50 LoC userspace. Risk: vnode refcount + vm_object lifetime across VM destroy; getting that wrong panics the kernel. Testable with a pre-kernel ZFS boot environment (we already use one for the SNAPSHOT kernel swap).
Expected result if the patch works: the 50-entry hot pool drops from ~13 GB to ~0.5–2 GB total, because the identical ckp pages stay in a single vnode-backed `vm_object` and only guest-written pages CoW into anonymous copies.
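The CoW behavior Path A leans on is the ordinary `MAP_PRIVATE` file-mapping semantics, observable from userspace today. A hypothetical stand-in for the vnode-backed memseg — a zero-filled temp file plays the checkpoint template, two private mappings play two resumed clones:

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Two MAP_PRIVATE mappings of one "template" file: reads fall through
 * to the shared file pages; a write CoWs into a private anonymous page,
 * leaving the template and every other mapping untouched.
 * Returns 1 if the clones are isolated, 0 if not, -1 on setup failure.
 */
static int private_cow_isolates(void)
{
    char tmpl[] = "/tmp/ckp-demo.XXXXXX";
    int fd = mkstemp(tmpl);
    if (fd < 0)
        return -1;
    unlink(tmpl);
    if (ftruncate(fd, 4096) != 0)     /* template page reads as zeros */
        return -1;

    char *vm1 = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    char *vm2 = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (vm1 == MAP_FAILED || vm2 == MAP_FAILED)
        return -1;

    vm1[0] = 'X';                     /* "guest write": CoW into anon page */
    int isolated = (vm2[0] == '\0');  /* other clone still sees template */

    munmap(vm1, 4096);
    munmap(vm2, 4096);
    close(fd);
    return isolated;
}
```

Until the write, both mappings are backed by the same file pages — which is exactly the sharing the vnode-backed memseg would give N resumed VMs.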
This is still the right first thing to try, because it’s one order of magnitude smaller than a full KSM port — but it’s a kernel patch with a real foot-gun surface, not the userspace afternoon the research agent’s initial report implied.
### Path B — Full KSM port (bounded but expensive)
A proper FreeBSD KSM sketches out as:
- Hook location: a new kernel thread, `pagemergerd`, co-located with `vm_pageout`. Scans `vm_object`s flagged mergeable.
- Opt-in: add `MADV_MERGEABLE`/`MADV_UNMERGEABLE` to `sys/sys/mman.h`; `kern_madvise` in `sys/vm/vm_mmap.c` sets a new `OBJ_MERGEABLE` flag on the `vm_object`. bhyve calls it on guest RAM from `usr.sbin/bhyve/mem.c`.
- Data structures: two red-black trees via `sys/tree.h` — a stable tree keyed by page-content hash (SipHash/xxhash), an unstable tree for candidates. Per-bucket mutex.
- Merge: `VM_OBJECT_WLOCK` both objects in address order to avoid deadlock, verify contents still match, `pmap_remove` the duplicate mapping, `pmap_enter` pointing at the canonical page with `VM_PROT_READ` only, bump the survivor’s refcount, free the duplicate. Mark with a new `PG_MERGED` bit in `vm_page.flags`.
- CoW: `vm_fault()` on a write to a `PG_MERGED` page allocates a fresh private page and drops the merged refcount.
- Hardest interaction: amd64 `pmap_promote_pde()` tries to fold 4K mappings into 2M superpages. Merged pages must be demoted, or the merger must operate above the promotion layer.
- Other races: `vm_object_collapse()` can splice shadow chains underneath the merger; `vm_page_busy_acquire()` must be held during rewrites; `OBJT_SWAP` (bhyve’s guest RAM) vs `OBJT_DEFAULT` differ in how resident pages are accounted.
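The address-ordered locking in the merge step is the standard two-lock discipline; a minimal userspace sketch, with pthread mutexes standing in for `VM_OBJECT_WLOCK` (names hypothetical):

```c
#include <pthread.h>
#include <stdint.h>

/*
 * Acquire two "object" locks in a globally consistent (address) order,
 * so two concurrent mergers working on the same pair of objects can
 * never deadlock: both always take the lower-addressed lock first.
 */
static void lock_pair_ordered(pthread_mutex_t *a, pthread_mutex_t *b)
{
    if (a == b) {                     /* merging within one object */
        pthread_mutex_lock(a);
        return;
    }
    if ((uintptr_t)a < (uintptr_t)b) {
        pthread_mutex_lock(a);
        pthread_mutex_lock(b);
    } else {
        pthread_mutex_lock(b);
        pthread_mutex_lock(a);
    }
}
```

The same ordering rule has to survive every path that takes both object locks, including the collapse and fault paths the bullet list flags as races.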
Estimated size: 2,500–4,000 lines of new code in a new `sys/vm/vm_ksm.c` + `vm_ksm.h`, plus 200–400 lines of edits across `vm_pageout.c`, `vm_fault.c`, `vm_object.c`, `vm_mmap.c`, `sys/sys/mman.h`, `vm_page.h`, and amd64/arm64 `pmap.c` (promotion interlock). Plus a `vm.ksm.*` sysctl tree, a `ksmctl(8)`, a bhyve flag, and tests under `tests/sys/vm/`.
Wall time: 6-12 months for one senior FreeBSD VM developer (Mark Johnston, Konstantin Belousov, Alan Cox caliber). 18 months if the developer has to climb the pmap learning curve.
As a funded sponsorship — think $150-300K with the FreeBSD Foundation or a direct retainer — it is realistic. As an unfunded side project it is not.
### Path C — ZFS ARC-only dedup (zero kernel work, partial benefit)
If checkpoint files live on ZFS with `compression=lz4` or `zstd`, the ARC caches compressed blocks once and all readers share them. This gives:
- Disk and ARC-level dedup — 50 identical ckps on disk compressed to one set of blocks, one ARC copy.
- No reduction in guest live memory. Once the VM faults pages in, each bhyve process still has its own anonymous pages.
So ZFS compression shaves the cold-tier storage and resume-read cost but does not close the RAM gap. `zfs set dedup=on` is still not worth the RAM cost it imposes, and doesn’t help the guest-memory problem either.
Useful to combine with Path A or B; not a substitute.
## Result — Path A works (2026-04-22)
We wrote Path A and measured it. The patch is `patches/vmm-memseg-vnode.diff` — a 286-line unified diff across six files (`sys/dev/vmm/vmm_mem.c`, `sys/dev/vmm/vmm_mem.h`, `sys/dev/vmm/vmm_dev.c`, `sys/amd64/include/vmm_dev.h`, `lib/libvmmapi/vmmapi.c` + `vmmapi.h`).
The new function `vm_alloc_memseg_vnode()` wraps the vnode’s `vm_object` in a shadow, so guest writes CoW into anonymous pages and guest reads fall through to shared file-backed pages.
The first draft panicked in `vm_map_entry_delete` during destroy — the shadow’s swap-charge accounting was reading a bogus `object->cred`. Fix: pass `cred = NULL` to `vm_object_shadow()`. Shadow pages are transient and accounted at the vm_map entry layer on CoW fault, the same pattern as a `MAP_PRIVATE` file mmap. The second install was clean; `benchmarks/rigs/vmm-vnode-probe.c` spawns N VMs that attach the same 256 MiB template and touch every page.
| N | naive cost | actual delta | dedup |
|---|---|---|---|
| 8 × 256 MiB | 2 048 MiB | ~306 MiB | ~85% |
| 50 × 256 MiB | 12 800 MiB | ~5 MiB above cached template | ~2500× |
| 1 000 × 256 MiB | 250 000 MiB | ~9 100 MiB | ~27× |
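The dedup column follows directly from the table's arithmetic — naive cost divided by measured marginal cost. A trivial check (names hypothetical):

```c
#include <assert.h>

/*
 * Dedup factor = what N un-shared copies would have cost, divided by
 * what they actually cost. A factor of K means K VMs fit in the RAM
 * one would have taken; savings = 1 - 1/K.
 */
static double dedup_factor(double naive_mib, double actual_mib)
{
    return naive_mib / actual_mib;
}
```

The 8-VM row's "~85%" is the savings form of the same number: 1 − 306/2048 ≈ 0.85.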
Direct-ioctl probe (no bhyve process) stops at N=50 because it’s the
tightest sharing test. The full-bhyve follow-up goes to N=1000 — see
the homepage density table and
/appendix/vmm-vnode-patch.
At every scale the marginal cost per additional VM is essentially constant: the `vm_object` shadow metadata (a few KB) plus bhyve’s own process state. The last architectural gap against CubeSandbox for the “many microVMs from one checkpoint” case is closed.
What this does not give us: content-based dedup across arbitrary pages (two VMs booted from different images with coincidentally identical pages still don’t share). That’s still Path B territory. For the CubeSandbox workload — one golden checkpoint fanning out into many instances — Path A is sufficient.
The patch is not upstreamable as-is. It needs (at minimum) a bhyve integration in `snapshot.c` so `bhyve -r` uses the new ioctl instead of the read()-into-anon-memseg path, a sysctl or flag to gate the new behavior, and a review round from a vmm maintainer against the `vm_object_shadow` refcount
invariants. But it exists, it’s measured, and the density argument is now one sentence: on a patched SNAPSHOT kernel, one thousand identical 256 MiB microVMs fit in ~9 GiB of host RAM.
## References
- Linux KSM admin guide
- Andrea Arcangeli — KSM intro (LWN)
- Gupta et al., Difference Engine: Harnessing Memory Redundancy in Virtual Machines, OSDI 2008
- Miłoś et al., Satori: Enlightened Page Sharing, USENIX ATC 2009
- FreeBSD VM sources: `sys/vm/vm_pageout.c`, `sys/vm/vm_fault.c`, `sys/vm/vm_object.c`, `sys/vm/vm_mmap.c`
- bhyve memory backing: `usr.sbin/bhyve/mem.c`, `lib/libvmmapi/vmmapi.c`
- reviews.freebsd.org — searched `ksm`, `samepage`, `page merge`, `page dedup`; no relevant reviews
- freebsd-virtualization@ archives