KSM-equivalent on FreeBSD

Linux KSM (Kernel Samepage Merging) is what lets CubeSandbox claim “<5 MB overhead per instance” while actually keeping thousands of identical microVMs in RAM. FreeBSD ships no equivalent, and this is the last architectural gap between a Cube-shaped FreeBSD stack and Cube itself. This page records what we found when we actually looked.

What KSM does, briefly

Linux KSM (mm/ksm.c, documented at kernel.org) is an opportunistic content-based page deduplicator. A kernel thread, ksmd, scans pages that userspace opts in via madvise(addr, len, MADV_MERGEABLE). Candidates are tracked in two red-black trees (a stable tree of already-merged pages and an unstable tree of not-yet-merged candidates) ordered by raw page contents via memcmp, not by a hash; a per-page checksum is used only to skip pages that are still changing between scans. On a match, both virtual mappings are rewritten to point at a single read-only physical page; a later write triggers the normal CoW path and un-shares. QEMU/KVM calls MADV_MERGEABLE on guest RAM regions, so many identical VMs collapse into one set of physical pages.

What FreeBSD ships (nothing, directly)

As of FreeBSD 15.0-RELEASE, there is no KSM equivalent committed anywhere in the tree, release or head. We looked for one and found only the following.

There is identity-based sharing via vm_object shadow chains and MAP_SHARED mappings — two processes mmaping the same file or forking from a common parent share backing pages until someone writes. There is no content-based sharing.

What we measured on honor

Running 50 × 256 MB bhyve VMs resumed from per-entry bhyvectl --suspend checkpoints:

| metric | value |
| --- | --- |
| baseline free memory | 18 476 MB |
| after 50 VMs resumed | 4 114 MB free / 13 764 MB active / 3 711 MB inactive |
| Δ (active + inactive + free) consumption | ~13 100 MB |
| nominal un-shared (50 × 256 MB) | 12 800 MB |
| effective dedup | ~0% |

Each bhyve process reports ~275 MB RSS; 50 of them sum to 13.7 GB, roughly the physical consumption observed. FreeBSD's VM is doing its normal thing: each bhyve process holds its own anonymous pages for the guest's resident working set. No content dedup happens anywhere in the stack.

Each VM resumes from a ckp that is mostly byte-identical to the others (same boot, same stock FreeBSD guest), yet nothing in the chain causes those identical bytes to become shared RAM.

The three paths forward

Path A — vmm memseg backed by ckp file (small kernel patch)

The research agent’s original framing was “bhyve userspace-only, hundreds of LoC”. Going to the source showed that was wrong: the change has to land in the kernel.

So to get identity-based CoW across clones resuming from the same ckp, vmm has to accept a vnode-backed memseg instead of an anonymous one. That is a small kernel patch.

Rough size: ~200 LoC kernel + ~50 LoC userspace. Risk: vnode refcount + vm_object lifetime across VM destroy; getting that wrong panics the kernel. Testable with a pre-kernel ZFS boot environment (we already use one for the SNAPSHOT kernel swap).

Expected result if the patch works: 50-entry hot pool drops from ~13 GB to ~0.5–2 GB total, because the identical ckp pages stay in a single vnode-backed vm_object and only guest-written pages CoW into anonymous copies.

This is still the right first thing to try, because it’s one order of magnitude smaller than a full KSM port — but it’s a kernel patch with a real foot-gun surface, not the userspace afternoon the research agent’s initial report implied.

Path B — Full KSM port (bounded but expensive)

A proper FreeBSD KSM port is bounded but substantial.

Estimated size: 2 500-4 000 lines of new code in a new sys/vm/vm_ksm.c + vm_ksm.h, plus 200-400 lines of edits across vm_pageout.c, vm_fault.c, vm_object.c, vm_mmap.c, sys/sys/mman.h, vm_page.h, and the amd64/arm64 pmap.c (promotion interlock). Plus a vm.ksm.* sysctl tree, a ksmctl(8), a bhyve flag, and tests under tests/sys/vm/.

Wall time: 6-12 months for one senior FreeBSD VM developer (Mark Johnston, Konstantin Belousov, Alan Cox caliber). 18 months if the developer has to climb the pmap learning curve.

As a funded sponsorship — think $150-300K with the FreeBSD Foundation or a direct retainer — it is realistic. As an unfunded side project it is not.

Path C — ZFS ARC-only dedup (zero kernel work, partial benefit)

If checkpoint files live on ZFS with compression=lz4 or zstd, the ARC caches each compressed block once and all readers share it.

So ZFS compression shaves the cold-tier storage + resume-read cost but does not close the RAM gap. zfs set dedup=on is still not worth the RAM cost it imposes and doesn’t help the guest-memory problem either.

Useful to combine with Path A or B; not a substitute.

Result — Path A works (2026-04-22)

We wrote Path A and measured it. The patch is patches/vmm-memseg-vnode.diff, a 286-line unified diff touching sys/dev/vmm/vmm_mem.c, sys/dev/vmm/vmm_mem.h, sys/dev/vmm/vmm_dev.c, sys/amd64/include/vmm_dev.h, and lib/libvmmapi/vmmapi.c + vmmapi.h. The new function vm_alloc_memseg_vnode() wraps the vnode's vm_object in a shadow so guest writes CoW into anonymous pages while guest reads fall through to the shared file-backed pages.

The first draft panicked in vm_map_entry_delete during destroy — the shadow’s swap-charge accounting was reading a bogus object->cred. Fix: pass cred=NULL to vm_object_shadow. Shadow pages are transient and accounted at the vm_map entry layer on CoW fault, same pattern as a MAP_PRIVATE file mmap. Second install was clean; benchmarks/rigs/vmm-vnode-probe.c spawns N VMs that attach the same 256 MiB template and touch every page.

| N | naive cost | actual delta | dedup |
| --- | --- | --- | --- |
| 8 × 256 MiB | 2 048 MiB | ~306 MiB | ~85% |
| 50 × 256 MiB | 12 800 MiB | ~5 MiB above cached template | ~2500× |
| 1 000 × 256 MiB | 250 000 MiB | ~9 100 MiB | ~27× |

Direct-ioctl probe (no bhyve process) stops at N=50 because it’s the tightest sharing test. The full-bhyve follow-up goes to N=1000 — see the homepage density table and /appendix/vmm-vnode-patch. At every scale the marginal cost per additional VM is essentially constant: the vm_object shadow metadata (few KB) plus bhyve’s own process state. The last architectural gap against CubeSandbox for the “many microVMs from one checkpoint” case is closed.

What this does not give us: content-based dedup across arbitrary pages (two VMs booted from different images with coincidentally identical pages still don’t share). That’s still Path B territory. For the CubeSandbox workload — one golden checkpoint fanning out into many instances — Path A is sufficient.

The patch is not upstreamable as-is. It needs (at minimum) a bhyve integration in snapshot.c so bhyve -r uses the new ioctl instead of read()-into-anon-memseg, a sysctl or flag to gate the new behavior, and a review round from a vmm maintainer against the vm_object_shadow refcount invariants. But it exists, it’s measured, and the density argument is now a single sentence: on a patched SNAPSHOT kernel, one thousand identical 256 MiB microVMs fit in ~9 GiB of host RAM.

References