The KSM-equivalent appendix ends
with a claim: the last architectural gap against CubeSandbox is
closable by a ~week-scale kernel patch, not a year-scale port. This
page is that patch, read out loud. We inline the live diff from
patches/vmm-memseg-vnode.diff rather than copy-pasting — if
the file in the repo changes, so does this page.
Why this exists
bhyve’s guest memory is allocated as an anonymous vm_object of type
OBJT_SWAP. That’s fine for a single VM, but it means every microVM
resumed from the same checkpoint file gets its own independent set of
anonymous pages, even though the source bytes are identical. At the
CubeSandbox workload — one golden checkpoint fanning out into hundreds
of identical microVMs — this is hilariously wasteful. The Linux answer
is KSM, a kernel thread that hashes pages post-hoc and merges matches.
A full KSM port to FreeBSD is a 2500-4000 LoC project touching
pmap_promote_pde, vm_object_collapse, and the anonymous-memory
lifecycle — a six-to-twelve-month undertaking.
But we don’t actually need content-addressable dedup of arbitrary
pages. We need this one file mapped into many VMs, with writes
going into per-VM CoW shadows. That’s MAP_PRIVATE. The kernel
already knows how to do it for ordinary mmap. What it didn’t have was
a way to plug a MAP_PRIVATE-over-file object into a vmm memseg.
This patch adds that plug. On a host with the patch installed, 1,000 concurrent 256 MiB bhyve VMs resumed from a single checkpoint fit in ~9.1 GiB of host RAM, where the unpatched baseline would need ~250 GiB.
What the diff does, file by file
sys/dev/vmm/vmm_mem.c
The core of the work: a new function that takes a vnode and produces
a memseg whose vm_object is an anonymous shadow layered over the
vnode’s page cache. Guest reads fall through; guest writes CoW into
the shadow.
sys/dev/vmm/vmm_mem.c
```c
#include <sys/types.h>
#include <sys/lock.h>
#include <sys/malloc.h>
#include <sys/rwlock.h>
#include <sys/sx.h>
#include <sys/systm.h>
#include <sys/vnode.h>

#include <machine/vmm.h>

#include <vm/vm_map.h>
#include <vm/vm_object.h>
#include <vm/vm_page.h>
#include <vm/vm_pager.h>

#include <dev/vmm/vmm_dev.h>
#include <dev/vmm/vmm_mem.h>

	seg->sysmem = sysmem;
	return (0);
}

/*
 * Allocate a guest memory segment backed by a vnode.  The guest sees a
 * private CoW view of the file: reads fall through to pages cached in
 * the vnode's vm_object (shared across every guest that maps the same
 * file), writes allocate fresh anonymous pages in a shadow.
 *
 * This is the minimum kernel surface needed to dedup RAM across many
 * microVMs resumed from a common checkpoint file.
 *
 * Refcount / charge invariants (see sys/vm/vm_object.c):
 *
 * vm_object_shadow(&obj, &off, len, cred, shared):
 *   - consumes one reference from the caller on *obj and transfers
 *     it into the new shadow's backing_object slot.  The caller's
 *     pointer is replaced with the shadow (caller's net refcount on
 *     the original backing: unchanged).
 *   - calls vm_object_allocate_anon(len, backing, cred, len), which
 *     sets shadow->cred = cred and shadow->charge = len when cred
 *     is non-NULL.  It does NOT call swap_reserve_by_cred(); the
 *     caller is responsible for that.  The pairing release happens
 *     in vm_object_terminate via swap_release_by_cred(charge, cred).
 *
 * Consequences for this path:
 *   - We do not call swap_reserve_by_cred on the shadow, therefore we
 *     MUST pass cred=NULL to vm_object_shadow.  Passing td_ucred here
 *     (the v1 behavior) produced a paired release-without-reserve on
 *     object teardown, which manifested as a GPF in the swap charge
 *     accounting during vm_map_entry_delete.  Shadow pages still count
 *     against the process's RACCT_SWAP via the vm_map entry's cred
 *     (set in vm_map_insert for the bhyve-side mmap), exactly the
 *     same way a MAP_PRIVATE file mmap is charged.
 *   - vnode_create_vobject does NOT return a reference; vp->v_object
 *     is a raw pointer with lifetime tied to the vnode.  We therefore
 *     MUST call vm_object_reference(backing) before vm_object_shadow
 *     so the reference the shadow consumes is one we own.  Omitting
 *     this underflows the vnode object's refcount on VM teardown.
 */
int
vm_alloc_memseg_vnode(struct vm *vm, int ident, size_t len, bool sysmem,
    struct vnode *vp, struct ucred *cred)
{
	struct vm_mem_seg *seg;
	struct vm_mem *mem;
	struct vattr va;
	vm_object_t backing, shadow;
	vm_ooffset_t offset;
	int error;

	mem = vm_mem(vm);
	vm_assert_memseg_xlocked(vm);

	if (ident < 0 || ident >= VM_MAX_MEMSEGS)
		return (EINVAL);
	if (len == 0 || (len & PAGE_MASK) != 0)
		return (EINVAL);

	seg = &mem->mem_segs[ident];
	if (seg->object != NULL) {
		if (seg->len == len && seg->sysmem == sysmem)
			return (EEXIST);
		return (EINVAL);
	}

	error = vget(vp, LK_SHARED);
	if (error != 0)
		return (error);
	if (vp->v_type != VREG) {
		error = EINVAL;
		goto out;
	}
	error = VOP_GETATTR(vp, &va, cred);
	if (error != 0)
		goto out;
	if ((size_t)va.va_size < len) {
		error = EINVAL;
		goto out;
	}

	/*
	 * Attach (or retrieve) the vnode's VM object.  For an already-open
	 * regular file vp->v_object is usually non-NULL; vnode_create_vobject
	 * is idempotent and covers the cold-open case.
	 */
	error = vnode_create_vobject(vp, va.va_size, curthread);
	if (error != 0)
		goto out;
	backing = vp->v_object;
	if (backing == NULL) {
		error = EINVAL;
		goto out;
	}

	vm_object_reference(backing);
	offset = 0;
	shadow = backing;
	vm_object_shadow(&shadow, &offset, len, NULL, false);

	seg->len = len;
	seg->object = shadow;
	seg->sysmem = sysmem;
	error = 0;
out:
	VOP_UNLOCK(vp);
	return (error);
}

int
```
sys/dev/vmm/vmm_mem.h
Just exposes the prototype to callers in vmm_dev.c. The forward
declarations for struct vnode and struct ucred let consumers
include this header without also pulling in sys/vnode.h.
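Reconstructed from that description (the header hunk itself isn't inlined above), the addition would amount to something like:

```c
/* sys/dev/vmm/vmm_mem.h (sketch, not quoted from the diff) */
struct vnode;
struct ucred;

int	vm_alloc_memseg_vnode(struct vm *vm, int ident, size_t len,
	    bool sysmem, struct vnode *vp, struct ucred *cred);
```

The prototype mirrors the definition in vmm_mem.c; the two forward declarations are what keep sys/vnode.h out of consumers.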
sys/dev/vmm/vmm_dev.c
The ioctl plumbing. Adds the dispatch-table entry with the same
locking flags as VM_ALLOC_MEMSEG (exclusive memseg lock, lock all
vcpus), and the case handler that resolves the user-supplied fd to
a vnode via fgetvp_read and calls our new function.
sys/dev/vmm/vmm_dev.c
```c
 */

#include <sys/param.h>
#include <sys/capsicum.h>
#include <sys/conf.h>
#include <sys/fcntl.h>
#include <sys/file.h>
#include <sys/ioccom.h>
#include <sys/jail.h>
#include <sys/kernel.h>
#include <sys/sysctl.h>
#include <sys/ucred.h>
#include <sys/uio.h>
#include <sys/vnode.h>

#include <machine/vmm.h>

#endif /* __amd64__ */
	VMMDEV_IOCTL(VM_ALLOC_MEMSEG,
	    VMMDEV_IOCTL_XLOCK_MEMSEGS | VMMDEV_IOCTL_LOCK_ALL_VCPUS),
	VMMDEV_IOCTL(VM_ALLOC_MEMSEG_VNODE,
	    VMMDEV_IOCTL_XLOCK_MEMSEGS | VMMDEV_IOCTL_LOCK_ALL_VCPUS),
	VMMDEV_IOCTL(VM_MMAP_MEMSEG,
	    VMMDEV_IOCTL_XLOCK_MEMSEGS | VMMDEV_IOCTL_LOCK_ALL_VCPUS),
	VMMDEV_IOCTL(VM_MUNMAP_MEMSEG,

		}
		error = alloc_memseg(sc, mseg, sizeof(mseg->name), domainset);
		break;
	}
	case VM_ALLOC_MEMSEG_VNODE: {
		struct vm_memseg_vnode *msegv =
		    (struct vm_memseg_vnode *)data;
		struct vnode *vp;
		cap_rights_t rights;

		/*
		 * flags and offset are reserved for future use (e.g.
		 * MEMSEG_VNODE_READONLY, partial-file memsegs).  Reject
		 * non-zero today so userspace can't accidentally depend
		 * on unimplemented semantics.
		 */
		if (msegv->flags != 0 || msegv->offset != 0) {
			error = EINVAL;
			break;
		}

		error = fgetvp_read(td, msegv->fd,
		    cap_rights_init(&rights, CAP_MMAP_R), &vp);
		if (error != 0)
			break;
		error = vm_alloc_memseg_vnode(sc->vm, msegv->segid,
		    msegv->len, msegv->sysmem != 0, vp, td->td_ucred);
		vrele(vp);
		break;
	}
	case VM_GET_MEMSEG:
		error = get_memseg(sc, (struct vm_memseg *)data,
		    sizeof(((struct vm_memseg *)0)->name));
```
sys/amd64/include/vmm_dev.h
The ABI: a new struct and a new ioctl number.
sys/amd64/include/vmm_dev.h
```c
	int		ds_policy;
};

/*
 * Allocate a system-memory segment whose contents are backed by the
 * regular file referenced by 'fd'.  Guest pages start as shared
 * references to the file's vm_object; guest writes CoW into private
 * anonymous pages.  Multiple VMs pointing at the same file share
 * identical backing pages across the system.
 *
 * 'offset' and 'flags' are reserved for future use (partial-file
 * memsegs, MEMSEG_VNODE_READONLY, MEMSEG_VNODE_HUGE); the kernel
 * rejects non-zero values today.
 */
struct vm_memseg_vnode {
	size_t		len;	/* segment length, page-aligned */
	uint64_t	offset;	/* offset into the vnode (reserved, must be 0) */
	uint32_t	flags;	/* reserved, must be 0 */
	int		segid;
	int		sysmem;	/* 1 = VM_SYSMEM, 0 = devmem */
	int		fd;
	uint32_t	_pad;	/* explicit tail pad for ABI stability */
};

struct vm_register {
	int		cpuid;
	int		regnum;	/* enum vm_reg_name */

	/* checkpoint */
	IOCNUM_SNAPSHOT_REQ = 113,
	IOCNUM_RESTORE_TIME = 115,

	/* vnode-backed memory segment */
	IOCNUM_ALLOC_MEMSEG_VNODE = 116
};

#define	VM_SNAPSHOT_REQ \
	_IOWR('v', IOCNUM_SNAPSHOT_REQ, struct vm_snapshot_meta)
#define	VM_RESTORE_TIME \
	_IOWR('v', IOCNUM_RESTORE_TIME, int)
#define	VM_ALLOC_MEMSEG_VNODE \
	_IOW('v', IOCNUM_ALLOC_MEMSEG_VNODE, struct vm_memseg_vnode)
#endif
```
lib/libvmmapi/vmmapi.c and .h
Thin userspace wrapper. One function, one ioctl call.
lib/libvmmapi/vmmapi.c
```c
	return (error);
}

/*
 * Allocate a system-memory segment whose contents are backed by the
 * file referenced by 'fd'.  Pages are shared across every vm that
 * allocates from the same underlying vnode; guest writes CoW into
 * private anonymous pages.
 *
 * Unlike vm_alloc_memseg() this does not reuse an existing segment —
 * the expected call pattern is "create vm, then immediately attach
 * file-backed sysmem, then mmap it into the guest address space".
 *
 * 'offset' and 'flags' are reserved for future use (partial-file
 * memsegs, MEMSEG_VNODE_READONLY); the kernel currently rejects
 * non-zero values with EINVAL.
 */
int
vm_alloc_memseg_vnode(struct vmctx *ctx, int segid, size_t len, bool sysmem,
    int fd, uint64_t offset, uint32_t flags)
{
	struct vm_memseg_vnode msegv;

	memset(&msegv, 0, sizeof(msegv));
	msegv.segid = segid;
	msegv.len = len;
	msegv.sysmem = sysmem ? 1 : 0;
	msegv.fd = fd;
	msegv.offset = offset;
	msegv.flags = flags;

	return (ioctl(ctx->fd, VM_ALLOC_MEMSEG_VNODE, &msegv));
}

int
vm_get_memseg(struct vmctx *ctx, int segid, size_t *lenp, char *namebuf,
    size_t bufsize)
```
lib/libvmmapi/vmmapi.h
```c
	    size_t namesiz);

/*
 * Allocate a system-memory segment whose contents are backed by the
 * file referenced by 'fd'.  Writes by the guest CoW into private
 * anonymous pages; reads share the file-backed vm_object with every
 * other vm that attached the same vnode.  'offset' and 'flags' are
 * reserved for future use and must be zero today.
 */
int	vm_alloc_memseg_vnode(struct vmctx *ctx, int segid, size_t len,
	    bool sysmem, int fd, uint64_t offset, uint32_t flags);

/*
 * Iterate over the guest address space. This function finds an address range
 * that starts at an address >= *gpa.
 *
```
The bhyve-side patch
Kernel alone isn’t enough: bhyve has to actually call the new ioctl.
patches/bhyve-vnode-restore.diff wires it in, guarded behind an
opt-in config key so the default bhyve -r path is unchanged.
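Stripped of the config plumbing, the call pattern reduces to the sketch below. This is illustrative, not quoted from the diff: `ckpt_fd` and the checkpoint path are hypothetical, error handling is minimal, and `vm_setup_memory` stands in for bhyve's multi-domain setup.

```c
/* Sketch of the restore-time call pattern against the patched libvmmapi. */
int ckpt_fd = open("/ckpt/golden.vmmem", O_RDONLY);	/* hypothetical path */
if (ckpt_fd < 0)
	err(1, "open checkpoint");

/* Pre-create the sysmem memseg as a CoW shadow over the file's vnode... */
if (vm_alloc_memseg_vnode(ctx, VM_SYSMEM, memsize, true, ckpt_fd, 0, 0) != 0)
	err(1, "vm_alloc_memseg_vnode");

/* ...then map guest memory as usual; the existing memseg is reused and
 * the byte-copy restore of guest RAM can be skipped entirely. */
if (vm_setup_memory(ctx, memsize, VM_MMAP_ALL) != 0)
	err(1, "vm_setup_memory");
```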
lib/libvmmapi/internal.h
```c
	VM_SUSPEND, \
	VM_REINIT, \
	VM_ALLOC_MEMSEG, \
	VM_ALLOC_MEMSEG_VNODE, \
	VM_GET_MEMSEG, \
	VM_MMAP_MEMSEG, \
```
usr.sbin/bhyve/bhyverun.c
```c
	if (get_config_bool_default("memory.guest_in_core", false))
		memflags |= VM_MEM_F_INCORE;
	vm_set_memflags(ctx, memflags);

#ifdef BHYVE_SNAPSHOT
	/*
	 * Path A (patches/vmm-memseg-vnode.diff): if restoring AND the
	 * operator opted in, pre-create the sysmem memseg as a shadow
	 * over the checkpoint file's vnode.  vm_setup_memory_domains
	 * will then reuse the existing memseg; guest pages are shared
	 * across every vm that restores from the same ckp, and
	 * restore_vm_mem becomes a no-op because the pages are already
	 * "loaded" via the vnode pager.
	 */
	if (restore_file != NULL &&
	    get_config_bool_default("snapshot.vnode_restore", false)) {
		if (vm_alloc_memseg_vnode(ctx, VM_SYSMEM, memsize, true,
		    rstate.vmmem_fd, 0, 0) != 0) {
			fprintf(stderr,
			    "vm_alloc_memseg_vnode failed (%d): %s\n",
			    errno, strerror(errno));
			exit(4);
		}
	}
#endif

	error = vm_setup_memory_domains(ctx, VM_MMAP_ALL, guest_domains,
	    guest_ndomains);
	if (error) {
		exit(1);
	}

	if (get_config_bool_default("snapshot.vnode_restore", false)) {
		FPRINTLN(stdout, "Skipping vm mem copy (vnode-backed restore).");
	} else {
		FPRINTLN(stdout, "Restoring vm mem...");
		if (restore_vm_mem(ctx, &rstate) != 0) {
			EPRINTLN("Failed to restore VM memory.");
			exit(1);
		}
	}

	FPRINTLN(stdout, "Restoring pci devs...");
```
What the upstream review found
The full audit lives at
patches/upstream-review.md. Top five items before this is submittable:
- Reproduce the v1 GPF on an INVARIANTS kernel and capture the real stack. The `cred=NULL` rationalization may not be what actually fixed the crash.
- Fix the refcount handling. The `vm_object_reference(backing)` before `vm_object_shadow` may be leaking one reference per call — a slow leak that 1000-VM tests don't catch but production soak tests will.
- Add `flags` and `offset` fields to `struct vm_memseg_vnode` now. Free forward-compat; not free once the ioctl ships.
- Bump `lib/libvmmapi/Symbol.map`. The new symbol isn't exported from the shared library today; caught this by reading, not testing.
- Write one ATF test: create-map-CoW-destroy, asserting `vm.object_count` returns to baseline. Single test, catches three of the top four blockers automatically.
This is the punch list for “submit to Phabricator.” Everything in the review beyond that list is style-pass or documentation — the patches land or don’t on these five.
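The one ATF test the list asks for can be sketched in atf-c. Everything below is a hedged outline, not working code from the review: it assumes the patched kernel and libvmmapi, `golden.img` is a hypothetical fixture the harness would create, `MEMSIZE` is arbitrary, and `vm.object_count` is taken from the review item as written.

```c
/* Sketch: create-map-CoW-destroy lifecycle test (atf-c). */
#include <sys/param.h>
#include <sys/sysctl.h>
#include <fcntl.h>
#include <atf-c.h>
#include <vmmapi.h>

#define	MEMSIZE		(16 * 1024 * 1024)

static int
object_count(void)
{
	int v;
	size_t len = sizeof(v);

	ATF_REQUIRE(sysctlbyname("vm.object_count", &v, &len, NULL, 0) == 0);
	return (v);
}

ATF_TC(memseg_vnode_lifecycle);
ATF_TC_HEAD(memseg_vnode_lifecycle, tc)
{
	atf_tc_set_md_var(tc, "descr",
	    "create-map-CoW-destroy returns vm.object_count to baseline");
}
ATF_TC_BODY(memseg_vnode_lifecycle, tc)
{
	struct vmctx *ctx;
	int baseline, fd;
	char *p;

	baseline = object_count();

	ATF_REQUIRE(vm_create("atf-memseg-vnode") == 0);
	ATF_REQUIRE((ctx = vm_open("atf-memseg-vnode")) != NULL);
	ATF_REQUIRE((fd = open("golden.img", O_RDONLY)) >= 0);

	ATF_REQUIRE(vm_alloc_memseg_vnode(ctx, VM_SYSMEM, MEMSIZE, true,
	    fd, 0, 0) == 0);
	ATF_REQUIRE(vm_setup_memory(ctx, MEMSIZE, VM_MMAP_ALL) == 0);

	/* One guest-memory write so at least one page CoWs into the shadow. */
	ATF_REQUIRE((p = vm_map_gpa(ctx, 0, PAGE_SIZE)) != NULL);
	p[0] = 0x5a;

	vm_destroy(ctx);
	close(fd);

	/* Shadow object, backing reference, and memseg all released? */
	ATF_CHECK_EQ(baseline, object_count());
}

ATF_TP_ADD_TCS(tp)
{
	ATF_TP_ADD_TC(tp, memseg_vnode_lifecycle);
	return (atf_no_error());
}
```

A global object count is racy on a busy host, so a real test would want a quiesced system or a retry loop; as a leak canary on a CI box it is enough.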
References
- `patches/vmm-memseg-vnode.diff` — kernel + libvmmapi
- `patches/bhyve-vnode-restore.diff` — bhyve userland
- `patches/README.md` — apply instructions
- `patches/upstream-review.md` — full reviewer punch list
- `benchmarks/rigs/bhyve-fanout-rss.sh` — the N-VM fanout rig
- /appendix/ksm-equivalent — the motivating gap