patches/vmm-memseg-vnode, annotated

The KSM-equivalent appendix ends with a claim: the last architectural gap against CubeSandbox is closable by a ~week-scale kernel patch, not a year-scale port. This page is that patch, read out loud. We inline the live diff from patches/vmm-memseg-vnode.diff rather than copy-pasting — if the file in the repo changes, so does this page.

Why this exists

bhyve’s guest memory is allocated as an anonymous vm_object of type OBJT_SWAP. That’s fine for a single VM, but it means every microVM resumed from the same checkpoint file gets its own independent set of anonymous pages, even though the source bytes are identical. Under the CubeSandbox workload — one golden checkpoint fanning out into hundreds of identical microVMs — this is hilariously wasteful. The Linux answer is KSM, a kernel thread that hashes pages post-hoc and merges matches. A full KSM port to FreeBSD is a 2500-4000 LoC project touching pmap_promote_pde, vm_object_collapse, and the anonymous-memory lifecycle — a six-to-twelve-month undertaking.

But we don’t actually need content-addressable dedup of arbitrary pages. We need this one file mapped into many VMs, with writes going into per-VM CoW shadows. That’s MAP_PRIVATE. The kernel already knows how to do it for ordinary mmap. What it didn’t have was a way to plug a MAP_PRIVATE-over-file object into a vmm memseg.

This patch adds that plug. On the honor test host with the patch installed, 1,000 concurrent 256 MiB bhyve VMs from a single checkpoint fit in ~9.1 GiB of host RAM, where the unpatched baseline would need ~250 GiB.

What the diff does, file by file

sys/dev/vmm/vmm_mem.c

The core of the work: a new function that takes a vnode and produces a memseg whose vm_object is an anonymous shadow layered over the vnode’s page cache. Guest reads fall through; guest writes CoW into the shadow.

sys/dev/vmm/vmm_mem.c

@@ -8,8 +8,10 @@
8 8 #include <sys/types.h>
9 9 #include <sys/lock.h>
10 10 #include <sys/malloc.h>
11 + #include <sys/rwlock.h>
11 12 #include <sys/sx.h>
12 13 #include <sys/systm.h>
14 + #include <sys/vnode.h>
13 15
14 16 #include <machine/vmm.h>
15 17
@@ -20,6 +22,7 @@
20 22 #include <vm/vm_map.h>
21 23 #include <vm/vm_object.h>
22 24 #include <vm/vm_page.h>
25 + #include <vm/vm_pager.h>
23 26
24 27 #include <dev/vmm/vmm_dev.h>
25 28 #include <dev/vmm/vmm_mem.h>
@@ -198,6 +201,114 @@
198 201 seg->sysmem = sysmem;
199 202
200 203 return (0);
204 + }
205 +
206 + /*
207 + * Allocate a guest memory segment backed by a vnode. The guest sees a
208 + * private CoW view of the file: reads fall through to pages cached in
209 + * the vnode's vm_object (shared across every guest that maps the same
210 + * file), writes allocate fresh anonymous pages in a shadow.
211 + *
212 + * This is the minimum kernel surface needed to dedup RAM across many
213 + * microVMs resumed from a common checkpoint file.
214 + *
215 + * Refcount / charge invariants (see sys/vm/vm_object.c):
216 + *
217 + * vm_object_shadow(&obj, &off, len, cred, shared):
218 + * - consumes one reference from the caller on *obj and transfers
219 + * it into the new shadow's backing_object slot. The caller's
220 + * pointer is replaced with the shadow (caller's net refcount on
221 + * the original backing: unchanged).
222 + * - calls vm_object_allocate_anon(len, backing, cred, len), which
223 + * sets shadow->cred = cred and shadow->charge = len when cred
224 + * is non-NULL. It does NOT call swap_reserve_by_cred(); the
225 + * caller is responsible for that. The pairing release happens
226 + * in vm_object_terminate via swap_release_by_cred(charge, cred).
227 + *
228 + * Consequences for this path:
229 + * - We do not call swap_reserve_by_cred on the shadow, therefore we
230 + * MUST pass cred=NULL to vm_object_shadow. Passing td_ucred here
231 + * (the v1 behavior) produced a paired release-without-reserve on
232 + * object teardown, which manifested as a GPF in the swap charge
233 + * accounting during vm_map_entry_delete. Shadow pages still count
234 + * against the process's RACCT_SWAP via the vm_map entry's cred
235 + * (set in vm_map_insert for the bhyve-side mmap), exactly the
236 + * same way a MAP_PRIVATE file mmap is charged.
237 + * - vnode_create_vobject does NOT return a reference; vp->v_object
238 + * is a raw pointer with lifetime tied to the vnode. We therefore
239 + * MUST call vm_object_reference(backing) before vm_object_shadow
240 + * so the reference the shadow consumes is one we own. Omitting
241 + * this underflows the vnode object's refcount on VM teardown.
242 + */
243 + int
244 + vm_alloc_memseg_vnode(struct vm *vm, int ident, size_t len, bool sysmem,
245 + struct vnode *vp, struct ucred *cred)
246 + {
247 + struct vm_mem_seg *seg;
248 + struct vm_mem *mem;
249 + struct vattr va;
250 + vm_object_t backing, shadow;
251 + vm_ooffset_t offset;
252 + int error;
253 +
254 + mem = vm_mem(vm);
255 + vm_assert_memseg_xlocked(vm);
256 +
257 + if (ident < 0 || ident >= VM_MAX_MEMSEGS)
258 + return (EINVAL);
259 + if (len == 0 || (len & PAGE_MASK) != 0)
260 + return (EINVAL);
261 +
262 + seg = &mem->mem_segs[ident];
263 + if (seg->object != NULL) {
264 + if (seg->len == len && seg->sysmem == sysmem)
265 + return (EEXIST);
266 + return (EINVAL);
267 + }
268 +
269 + error = vget(vp, LK_SHARED);
270 + if (error != 0)
271 + return (error);
272 +
273 + if (vp->v_type != VREG) {
274 + error = EINVAL;
275 + goto out;
276 + }
277 + error = VOP_GETATTR(vp, &va, cred);
278 + if (error != 0)
279 + goto out;
280 + if ((size_t)va.va_size < len) {
281 + error = EINVAL;
282 + goto out;
283 + }
284 +
285 + /*
286 + * Attach (or retrieve) the vnode's VM object. For an already-open
287 + * regular file vp->v_object is usually non-NULL; vnode_create_vobject
288 + * is idempotent and covers the cold-open case.
289 + */
290 + error = vnode_create_vobject(vp, va.va_size, curthread);
291 + if (error != 0)
292 + goto out;
293 + backing = vp->v_object;
294 + if (backing == NULL) {
295 + error = EINVAL;
296 + goto out;
297 + }
298 + vm_object_reference(backing);
299 + offset = 0;
300 + shadow = backing;
301 + vm_object_shadow(&shadow, &offset, len, NULL, false);
302 +
303 + seg->len = len;
304 + seg->object = shadow;
305 + seg->sysmem = sysmem;
306 +
307 + error = 0;
308 + out:
309 + VOP_UNLOCK(vp);
310 + return (error);
201 311 }
202 312
203 313 int

sys/dev/vmm/vmm_mem.h

Just exposes the prototype to callers in vmm_dev.c. The forward declarations for struct vnode and struct ucred let consumers include this header without also pulling in sys/vnode.h.

sys/dev/vmm/vmm_dev.c

The ioctl plumbing. Adds the dispatch-table entry with the same locking flags as VM_ALLOC_MEMSEG (exclusive memseg lock, lock all vcpus), and the case handler that resolves the user-supplied fd to a vnode via fgetvp_read and calls our new function.

sys/dev/vmm/vmm_dev.c

@@ -7,8 +7,10 @@
7 7 */
8 8
9 9 #include <sys/param.h>
10 + #include <sys/capsicum.h>
10 11 #include <sys/conf.h>
11 12 #include <sys/fcntl.h>
13 + #include <sys/file.h>
12 14 #include <sys/ioccom.h>
13 15 #include <sys/jail.h>
14 16 #include <sys/kernel.h>
@@ -20,6 +22,7 @@
20 22 #include <sys/sysctl.h>
21 23 #include <sys/ucred.h>
22 24 #include <sys/uio.h>
25 + #include <sys/vnode.h>
23 26
24 27 #include <machine/vmm.h>
25 28
@@ -395,6 +398,8 @@
395 398 #endif /* __amd64__ */
396 399 VMMDEV_IOCTL(VM_ALLOC_MEMSEG,
397 400 VMMDEV_IOCTL_XLOCK_MEMSEGS | VMMDEV_IOCTL_LOCK_ALL_VCPUS),
401 + VMMDEV_IOCTL(VM_ALLOC_MEMSEG_VNODE,
402 + VMMDEV_IOCTL_XLOCK_MEMSEGS | VMMDEV_IOCTL_LOCK_ALL_VCPUS),
398 403 VMMDEV_IOCTL(VM_MMAP_MEMSEG,
399 404 VMMDEV_IOCTL_XLOCK_MEMSEGS | VMMDEV_IOCTL_LOCK_ALL_VCPUS),
400 405 VMMDEV_IOCTL(VM_MUNMAP_MEMSEG,
@@ -610,6 +615,32 @@
610 615 }
611 616 error = alloc_memseg(sc, mseg, sizeof(mseg->name), domainset);
612 617
618 + break;
619 + }
620 + case VM_ALLOC_MEMSEG_VNODE: {
621 + struct vm_memseg_vnode *msegv =
622 + (struct vm_memseg_vnode *)data;
623 + struct vnode *vp;
624 + cap_rights_t rights;
625 +
626 + /*
627 + * flags and offset are reserved for future use (e.g.
628 + * MEMSEG_VNODE_READONLY, partial-file memsegs). Reject
629 + * non-zero today so userspace can't accidentally depend
630 + * on unimplemented semantics.
631 + */
632 + if (msegv->flags != 0 || msegv->offset != 0) {
633 + error = EINVAL;
634 + break;
635 + }
636 + error = fgetvp_read(td, msegv->fd,
637 + cap_rights_init(&rights, CAP_MMAP_R), &vp);
638 + if (error != 0)
639 + break;
640 + error = vm_alloc_memseg_vnode(sc->vm, msegv->segid, msegv->len,
641 + msegv->sysmem != 0, vp, td->td_ucred);
642 + vrele(vp);
643 + break;
644 + }
613 645 case VM_GET_MEMSEG:
614 646 error = get_memseg(sc, (struct vm_memseg *)data,
615 647 sizeof(((struct vm_memseg *)0)->name));

sys/amd64/include/vmm_dev.h

The ABI: a new struct and a new ioctl number.

sys/amd64/include/vmm_dev.h

@@ -60,6 +60,26 @@
60 60 int ds_policy;
61 61 };
62 62
63 + /*
64 + * Allocate a system-memory segment whose contents are backed by the
65 + * regular file referenced by 'fd'. Guest pages start as shared
66 + * references to the file's vm_object; guest writes CoW into private
67 + * anonymous pages. Multiple VMs pointing at the same file share
68 + * identical backing pages across the system.
69 + *
70 + * 'offset' and 'flags' are reserved for future use (partial-file
71 + * memsegs, MEMSEG_VNODE_READONLY, MEMSEG_VNODE_HUGE); the kernel
72 + * rejects non-zero values today.
73 + */
74 + struct vm_memseg_vnode {
75 + size_t len; /* segment length, page-aligned */
76 + uint64_t offset; /* offset into the vnode (reserved, must be 0) */
77 + uint32_t flags; /* reserved, must be 0 */
78 + int segid;
79 + int sysmem; /* 1 = VM_SYSMEM, 0 = devmem */
80 + int fd;
81 + uint32_t _pad; /* explicit tail pad for ABI stability */
82 + };
83 +
63 84 struct vm_register {
64 85 int cpuid;
65 86 int regnum; /* enum vm_reg_name */
@@ -339,7 +359,10 @@
339 359 /* checkpoint */
340 360 IOCNUM_SNAPSHOT_REQ = 113,
341 361
342 IOCNUM_RESTORE_TIME = 115
362 + IOCNUM_RESTORE_TIME = 115,
363 +
364 + /* vnode-backed memory segment */
365 + IOCNUM_ALLOC_MEMSEG_VNODE = 116
343 366 };
344 367
345 368 #define VM_RUN \
@@ -466,4 +489,6 @@
466 489 _IOWR('v', IOCNUM_SNAPSHOT_REQ, struct vm_snapshot_meta)
467 490 #define VM_RESTORE_TIME \
468 491 _IOWR('v', IOCNUM_RESTORE_TIME, int)
492 + #define VM_ALLOC_MEMSEG_VNODE \
493 + _IOW('v', IOCNUM_ALLOC_MEMSEG_VNODE, struct vm_memseg_vnode)
469 494 #endif

lib/libvmmapi/vmmapi.c and .h

Thin userspace wrapper. One function, one ioctl call.

lib/libvmmapi/vmmapi.c

@@ -428,6 +428,34 @@
428 428 return (error);
429 429 }
430 430
431 + /*
432 + * Allocate a system-memory segment whose contents are backed by the
433 + * file referenced by 'fd'. Pages are shared across every vm that
434 + * allocates from the same underlying vnode; guest writes CoW into
435 + * private anonymous pages.
436 + *
437 + * Unlike vm_alloc_memseg() this does not reuse an existing segment —
438 + * the expected call pattern is "create vm, then immediately attach
439 + * file-backed sysmem, then mmap it into the guest address space".
440 + *
441 + * 'offset' and 'flags' are reserved for future use (partial-file
442 + * memsegs, MEMSEG_VNODE_READONLY); the kernel currently rejects
443 + * non-zero values with EINVAL.
444 + */
445 + int
446 + vm_alloc_memseg_vnode(struct vmctx *ctx, int segid, size_t len, bool sysmem,
447 + int fd, uint64_t offset, uint32_t flags)
448 + {
449 + struct vm_memseg_vnode msegv;
450 +
451 + memset(&msegv, 0, sizeof(msegv));
452 + msegv.segid = segid;
453 + msegv.len = len;
454 + msegv.sysmem = sysmem ? 1 : 0;
455 + msegv.fd = fd;
456 + msegv.offset = offset;
457 + msegv.flags = flags;
458 + return (ioctl(ctx->fd, VM_ALLOC_MEMSEG_VNODE, &msegv));
459 + }
460 +
431 461 int
432 462 vm_get_memseg(struct vmctx *ctx, int segid, size_t *lenp, char *namebuf,
433 463 size_t bufsize)

lib/libvmmapi/vmmapi.h

@@ -83,6 +83,15 @@
83 83 size_t namesiz);
84 84
85 85 /*
86 + * Allocate a system-memory segment whose contents are backed by the
87 + * file referenced by 'fd'. Writes by the guest CoW into private
88 + * anonymous pages; reads share the file-backed vm_object with every
89 + * other vm that attached the same vnode. 'offset' and 'flags' are
90 + * reserved for future use and must be zero today.
91 + */
92 + int vm_alloc_memseg_vnode(struct vmctx *ctx, int segid, size_t len,
93 + bool sysmem, int fd, uint64_t offset, uint32_t flags);
94 +
95 + /*
86 96 * Iterate over the guest address space. This function finds an address range
87 97 * that starts at an address >= *gpa.
88 98 *

The bhyve-side patch

The kernel patch alone isn’t enough: bhyve has to actually call the new ioctl. patches/bhyve-vnode-restore.diff wires it in, guarded behind an opt-in config key so the default bhyve -r path is unchanged.

lib/libvmmapi/internal.h

@@ -38,6 +38,7 @@
38 38 VM_SUSPEND, \
39 39 VM_REINIT, \
40 40 VM_ALLOC_MEMSEG, \
41 + VM_ALLOC_MEMSEG_VNODE, \
41 42 VM_GET_MEMSEG, \
42 43 VM_MMAP_MEMSEG, \
43 44 	VM_MUNMAP_MEMSEG, \

usr.sbin/bhyve/bhyverun.c

@@ -887,6 +887,27 @@
887 887 if (get_config_bool_default("memory.guest_in_core", false))
888 888 memflags |= VM_MEM_F_INCORE;
889 889 vm_set_memflags(ctx, memflags);
890 + #ifdef BHYVE_SNAPSHOT
891 + /*
892 + * Path A (patches/vmm-memseg-vnode.diff): if restoring AND the
893 + * operator opted in, pre-create the sysmem memseg as a shadow
894 + * over the checkpoint file's vnode. vm_setup_memory_domains
895 + * will then reuse the existing memseg; guest pages are shared
896 + * across every vm that restores from the same ckp, and
897 + * restore_vm_mem becomes a no-op because the pages are already
898 + * "loaded" via the vnode pager.
899 + */
900 + if (restore_file != NULL &&
901 + get_config_bool_default("snapshot.vnode_restore", false)) {
902 + if (vm_alloc_memseg_vnode(ctx, VM_SYSMEM, memsize, true,
903 + rstate.vmmem_fd, 0, 0) != 0) {
904 + fprintf(stderr,
905 + "vm_alloc_memseg_vnode failed (%d): %s\n",
906 + errno, strerror(errno));
907 + exit(4);
908 + }
909 + }
910 + #endif
890 911 error = vm_setup_memory_domains(ctx, VM_MMAP_ALL, guest_domains,
891 912 guest_ndomains);
892 913 if (error) {
@@ -949,10 +970,15 @@
949 970 exit(1);
950 971 }
951 972
952 FPRINTLN(stdout, "Restoring vm mem...");
953 if (restore_vm_mem(ctx, &rstate) != 0) {
954 EPRINTLN("Failed to restore VM memory.");
955 exit(1);
973 + if (get_config_bool_default("snapshot.vnode_restore", false)) {
974 + FPRINTLN(stdout,
975 + "Skipping vm mem copy (vnode-backed restore).");
976 + } else {
977 + FPRINTLN(stdout, "Restoring vm mem...");
978 + if (restore_vm_mem(ctx, &rstate) != 0) {
979 + EPRINTLN("Failed to restore VM memory.");
980 + exit(1);
981 + }
956 982 }
957 983
958 984 FPRINTLN(stdout, "Restoring pci devs...");
959 985
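With both patches applied, the opt-in path would be exercised along these lines. This is an illustrative invocation fragment, not a tested command line: it assumes bhyve's standard -o key=value config mechanism and the snapshot.vnode_restore key spelling from the diff; device and CPU configuration are elided.

```shell
# Opt in to the vnode-backed restore; the default -r path is unchanged.
bhyve -r /ckpt/golden.ckp \
      -o snapshot.vnode_restore=true \
      ... guest0
```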

What the upstream review found

The full audit lives at patches/upstream-review.md. Top five items before this is submittable:

  1. Reproduce the v1 GPF on an INVARIANTS kernel and capture the real stack. The cred=NULL rationalization may not be what actually fixed the crash.
  2. Fix the refcount handling. The vm_object_reference(backing) before vm_object_shadow may be leaking one reference per call — a slow leak that 1000-VM tests don’t catch but production soak tests will.
  3. Add flags and offset fields to struct vm_memseg_vnode now. Free forward-compat; not free once the ioctl ships.
  4. Bump lib/libvmmapi/Symbol.map. The new symbol isn’t exported from the shared library today; caught this by reading, not testing.
  5. Write one ATF test: create-map-CoW-destroy, asserting vm.object_count returns to baseline. A single test that catches three of the top four blockers automatically.

This is the punch list for “submit to Phabricator.” Everything in the review beyond that list is style-pass or documentation — the patches land or don’t on these five.

References