patches/vmm-memseg-vnode, annotated

The KSM-equivalent appendix ends with a claim: the last architectural gap against CubeSandbox is closable by a ~week-scale kernel patch, not a year-scale port. This page is that patch, read out loud. We inline the live diff from patches/vmm-memseg-vnode.diff rather than copy-pasting — if the file in the repo changes, so does this page.

Why this exists

bhyve’s guest memory is allocated as an anonymous vm_object of type OBJT_SWAP. That’s fine for a single VM, but it means every microVM resumed from the same checkpoint file gets its own independent set of anonymous pages, even though the source bytes are identical. Under the CubeSandbox workload — one golden checkpoint fanning out into hundreds of identical microVMs — this is hilariously wasteful. The Linux answer is KSM, a kernel thread that hashes pages post-hoc and merges matches. A full KSM port to FreeBSD is a 2500-4000 LoC project touching pmap_promote_pde, vm_object_collapse, and the anonymous-memory lifecycle — a six-to-twelve-month undertaking.

But we don’t actually need content-addressable dedup of arbitrary pages. We need this one file mapped into many VMs, with writes going into per-VM CoW shadows. That’s MAP_PRIVATE. The kernel already knows how to do it for ordinary mmap. What it didn’t have was a way to plug a MAP_PRIVATE-over-file object into a vmm memseg.

This patch adds that plug. On the honor test host with the patch installed, 1,000 concurrent 256 MiB bhyve VMs from a single checkpoint fit in ~9.1 GiB of host RAM, where the unpatched baseline would need ~250 GiB.

What the diff does, file by file

sys/dev/vmm/vmm_mem.c

The core of the work: a new function that takes a vnode and produces a memseg whose vm_object is an anonymous shadow layered over the vnode’s page cache. Guest reads fall through; guest writes CoW into the shadow.

sys/dev/vmm/vmm_mem.c

@@ -8,8 +8,10 @@
8 8 #include <sys/types.h>
9 9 #include <sys/lock.h>
10 10 #include <sys/malloc.h>
11 + #include <sys/rwlock.h>
11 12 #include <sys/sx.h>
12 13 #include <sys/systm.h>
14 + #include <sys/vnode.h>
13 15
14 16 #include <machine/vmm.h>
15 17
@@ -20,6 +22,7 @@
20 22 #include <vm/vm_map.h>
21 23 #include <vm/vm_object.h>
22 24 #include <vm/vm_page.h>
25 + #include <vm/vm_pager.h>
23 26
24 27 #include <dev/vmm/vmm_dev.h>
25 28 #include <dev/vmm/vmm_mem.h>
@@ -198,6 +201,114 @@
198 201 seg->sysmem = sysmem;
199 202
200 203 return (0);
204 + }
205 +
206 + /*
207 + * Allocate a guest memory segment backed by a vnode. The guest sees a
208 + * private CoW view of the file: reads fall through to pages cached in
209 + * the vnode's vm_object (shared across every guest that maps the same
210 + * file), writes allocate fresh anonymous pages in a shadow.
211 + *
212 + * This is the minimum kernel surface needed to dedup RAM across many
213 + * microVMs resumed from a common checkpoint file.
214 + *
215 + * Refcount / charge invariants (see sys/vm/vm_object.c):
216 + *
217 + * vm_object_shadow(&obj, &off, len, cred, shared):
218 + * - consumes one reference from the caller on *obj and transfers
219 + * it into the new shadow's backing_object slot. The caller's
220 + * pointer is replaced with the shadow (caller's net refcount on
221 + * the original backing: unchanged).
222 + * - calls vm_object_allocate_anon(len, backing, cred, len), which
223 + * sets shadow->cred = cred and shadow->charge = len when cred
224 + * is non-NULL. It does NOT call swap_reserve_by_cred(); the
225 + * caller is responsible for that. The pairing release happens
226 + * in vm_object_terminate via swap_release_by_cred(charge, cred).
227 + *
228 + * Consequences for this path:
229 + * - We do not call swap_reserve_by_cred on the shadow, therefore we
230 + * MUST pass cred=NULL to vm_object_shadow. Passing td_ucred here
231 + * (the v1 behavior) produced a paired release-without-reserve on
232 + * object teardown, which manifested as a GPF in the swap charge
233 + * accounting during vm_map_entry_delete. Shadow pages still count
234 + * against the process's RACCT_SWAP via the vm_map entry's cred
235 + * (set in vm_map_insert for the bhyve-side mmap), exactly the
236 + * same way a MAP_PRIVATE file mmap is charged.
237 + * - vnode_create_vobject does NOT return a reference; vp->v_object
238 + * is a raw pointer with lifetime tied to the vnode. We therefore
239 + * MUST call vm_object_reference(backing) before vm_object_shadow
240 + * so the reference the shadow consumes is one we own. Omitting
241 + * this underflows the vnode object's refcount on VM teardown.
242 + */
243 + int
244 + vm_alloc_memseg_vnode(struct vm *vm, int ident, size_t len, bool sysmem,
245 + struct vnode *vp, struct ucred *cred)
246 + {
247 + struct vm_mem_seg *seg;
248 + struct vm_mem *mem;
249 + struct vattr va;
250 + vm_object_t backing, shadow;
251 + vm_ooffset_t offset;
252 + int error;
253 +
254 + mem = vm_mem(vm);
255 + vm_assert_memseg_xlocked(vm);
256 +
257 + if (ident < 0 || ident >= VM_MAX_MEMSEGS)
258 + return (EINVAL);
259 + if (len == 0 || (len & PAGE_MASK) != 0)
260 + return (EINVAL);
261 +
262 + seg = &mem->mem_segs[ident];
263 + if (seg->object != NULL) {
264 + if (seg->len == len && seg->sysmem == sysmem)
265 + return (EEXIST);
266 + return (EINVAL);
267 + }
268 +
269 + error = vget(vp, LK_SHARED);
270 + if (error != 0)
271 + return (error);
272 +
273 + if (vp->v_type != VREG) {
274 + error = EINVAL;
275 + goto out;
276 + }
277 + error = VOP_GETATTR(vp, &va, cred);
278 + if (error != 0)
279 + goto out;
280 + if ((size_t)va.va_size < len) {
281 + error = EINVAL;
282 + goto out;
283 + }
284 +
285 + /*
286 + * Attach (or retrieve) the vnode's VM object. For an already-open
287 + * regular file vp->v_object is usually non-NULL; vnode_create_vobject
288 + * is idempotent and covers the cold-open case.
289 + */
290 + error = vnode_create_vobject(vp, va.va_size, curthread);
291 + if (error != 0)
292 + goto out;
293 + backing = vp->v_object;
294 + if (backing == NULL) {
295 + error = EINVAL;
296 + goto out;
297 + }
298 + vm_object_reference(backing);
299 + offset = 0;
300 + shadow = backing;
301 + vm_object_shadow(&shadow, &offset, len, NULL, false);
302 +
303 + seg->len = len;
304 + seg->object = shadow;
305 + seg->sysmem = sysmem;
306 +
307 + error = 0;
308 + out:
309 + VOP_UNLOCK(vp);
310 + return (error);
201 311 }
202 312
203 313 int

sys/dev/vmm/vmm_mem.h

Just exposes the prototype to callers in vmm_dev.c. The forward declarations for struct vnode and struct ucred let consumers include this header without also pulling in sys/vnode.h.

sys/dev/vmm/vmm_dev.c

The ioctl plumbing. Adds the dispatch-table entry with the same locking flags as VM_ALLOC_MEMSEG (exclusive memseg lock, lock all vcpus), and the case handler that resolves the user-supplied fd to a vnode via fgetvp_read and calls our new function.

sys/dev/vmm/vmm_dev.c

@@ -7,8 +7,10 @@
7 7 */
8 8
9 9 #include <sys/param.h>
10 + #include <sys/capsicum.h>
10 11 #include <sys/conf.h>
11 12 #include <sys/fcntl.h>
13 + #include <sys/file.h>
12 14 #include <sys/ioccom.h>
13 15 #include <sys/jail.h>
14 16 #include <sys/kernel.h>
@@ -20,6 +22,7 @@
20 22 #include <sys/sysctl.h>
21 23 #include <sys/ucred.h>
22 24 #include <sys/uio.h>
25 + #include <sys/vnode.h>
23 26
24 27 #include <machine/vmm.h>
25 28
@@ -395,6 +398,8 @@
395 398 #endif /* __amd64__ */
396 399 VMMDEV_IOCTL(VM_ALLOC_MEMSEG,
397 400 VMMDEV_IOCTL_XLOCK_MEMSEGS | VMMDEV_IOCTL_LOCK_ALL_VCPUS),
401 + VMMDEV_IOCTL(VM_ALLOC_MEMSEG_VNODE,
402 + VMMDEV_IOCTL_XLOCK_MEMSEGS | VMMDEV_IOCTL_LOCK_ALL_VCPUS),
398 403 VMMDEV_IOCTL(VM_MMAP_MEMSEG,
399 404 VMMDEV_IOCTL_XLOCK_MEMSEGS | VMMDEV_IOCTL_LOCK_ALL_VCPUS),
400 405 VMMDEV_IOCTL(VM_MUNMAP_MEMSEG,
@@ -610,6 +615,32 @@
610 615 }
611 616 error = alloc_memseg(sc, mseg, sizeof(mseg->name), domainset);
612 617
618 + break;
619 + }
620 + case VM_ALLOC_MEMSEG_VNODE: {
621 + struct vm_memseg_vnode *msegv =
622 + (struct vm_memseg_vnode *)data;
623 + struct vnode *vp;
624 + cap_rights_t rights;
625 +
626 + /*
627 + * flags and offset are reserved for future use (e.g.
628 + * MEMSEG_VNODE_READONLY, partial-file memsegs). Reject
629 + * non-zero today so userspace can't accidentally depend
630 + * on unimplemented semantics.
631 + */
632 + if (msegv->flags != 0 || msegv->offset != 0) {
633 + error = EINVAL;
634 + break;
635 + }
636 + error = fgetvp_read(td, msegv->fd,
637 + cap_rights_init(&rights, CAP_MMAP_R), &vp);
638 + if (error != 0)
639 + break;
640 + error = vm_alloc_memseg_vnode(sc->vm, msegv->segid, msegv->len,
641 + msegv->sysmem != 0, vp, td->td_ucred);
642 + vrele(vp);
643 + break;
644 + }
613 645 case VM_GET_MEMSEG:
614 646 error = get_memseg(sc, (struct vm_memseg *)data,
615 647 sizeof(((struct vm_memseg *)0)->name));

sys/amd64/include/vmm_dev.h

The ABI: a new struct and a new ioctl number.

sys/amd64/include/vmm_dev.h

@@ -60,6 +60,26 @@
60 60 int ds_policy;
61 61 };
62 62
63 + /*
64 + * Allocate a system-memory segment whose contents are backed by the
65 + * regular file referenced by 'fd'. Guest pages start as shared
66 + * references to the file's vm_object; guest writes CoW into private
67 + * anonymous pages. Multiple VMs pointing at the same file share
68 + * identical backing pages across the system.
69 + *
70 + * 'offset' and 'flags' are reserved for future use (partial-file
71 + * memsegs, MEMSEG_VNODE_READONLY, MEMSEG_VNODE_HUGE); the kernel
72 + * rejects non-zero values today.
73 + */
74 + struct vm_memseg_vnode {
75 + size_t len; /* segment length, page-aligned */
76 + uint64_t offset; /* offset into the vnode (reserved, must be 0) */
77 + uint32_t flags; /* reserved, must be 0 */
78 + int segid;
79 + int sysmem; /* 1 = VM_SYSMEM, 0 = devmem */
80 + int fd;
81 + uint32_t _pad; /* explicit tail pad for ABI stability */
82 + };
83 +
63 84 struct vm_register {
64 85 int cpuid;
65 86 int regnum; /* enum vm_reg_name */
@@ -339,7 +359,10 @@
339 359 /* checkpoint */
340 360 IOCNUM_SNAPSHOT_REQ = 113,
341 361
342 IOCNUM_RESTORE_TIME = 115
362 + IOCNUM_RESTORE_TIME = 115,
363 +
364 + /* vnode-backed memory segment */
365 + IOCNUM_ALLOC_MEMSEG_VNODE = 116
343 366 };
344 367
345 368 #define VM_RUN \
@@ -466,4 +489,6 @@
466 489 _IOWR('v', IOCNUM_SNAPSHOT_REQ, struct vm_snapshot_meta)
467 490 #define VM_RESTORE_TIME \
468 491 _IOWR('v', IOCNUM_RESTORE_TIME, int)
492 + #define VM_ALLOC_MEMSEG_VNODE \
493 + _IOW('v', IOCNUM_ALLOC_MEMSEG_VNODE, struct vm_memseg_vnode)
469 494 #endif

lib/libvmmapi/vmmapi.c and .h

Thin userspace wrapper. One function, one ioctl call.

lib/libvmmapi/vmmapi.c

@@ -428,6 +428,34 @@
428 428 return (error);
429 429 }
430 430
431 + /*
432 + * Allocate a system-memory segment whose contents are backed by the
433 + * file referenced by 'fd'. Pages are shared across every vm that
434 + * allocates from the same underlying vnode; guest writes CoW into
435 + * private anonymous pages.
436 + *
437 + * Unlike vm_alloc_memseg() this does not reuse an existing segment —
438 + * the expected call pattern is "create vm, then immediately attach
439 + * file-backed sysmem, then mmap it into the guest address space".
440 + *
441 + * 'offset' and 'flags' are reserved for future use (partial-file
442 + * memsegs, MEMSEG_VNODE_READONLY); the kernel currently rejects
443 + * non-zero values with EINVAL.
444 + */
445 + int
446 + vm_alloc_memseg_vnode(struct vmctx *ctx, int segid, size_t len, bool sysmem,
447 + int fd, uint64_t offset, uint32_t flags)
448 + {
449 + struct vm_memseg_vnode msegv;
450 +
451 + memset(&msegv, 0, sizeof(msegv));
452 + msegv.segid = segid;
453 + msegv.len = len;
454 + msegv.sysmem = sysmem ? 1 : 0;
455 + msegv.fd = fd;
456 + msegv.offset = offset;
457 + msegv.flags = flags;
458 + return (ioctl(ctx->fd, VM_ALLOC_MEMSEG_VNODE, &msegv));
459 + }
460 +
431 461 int
432 462 vm_get_memseg(struct vmctx *ctx, int segid, size_t *lenp, char *namebuf,
433 463 size_t bufsize)

lib/libvmmapi/vmmapi.h

@@ -83,6 +83,15 @@
83 83 size_t namesiz);
84 84
85 85 /*
86 + * Allocate a system-memory segment whose contents are backed by the
87 + * file referenced by 'fd'. Writes by the guest CoW into private
88 + * anonymous pages; reads share the file-backed vm_object with every
89 + * other vm that attached the same vnode. 'offset' and 'flags' are
90 + * reserved for future use and must be zero today.
91 + */
92 + int vm_alloc_memseg_vnode(struct vmctx *ctx, int segid, size_t len,
93 + bool sysmem, int fd, uint64_t offset, uint32_t flags);
94 +
95 + /*
86 96 * Iterate over the guest address space. This function finds an address range
87 97 * that starts at an address >= *gpa.
88 98 *

The bhyve-side patch

The kernel patch alone isn’t enough: bhyve has to actually call the new ioctl. patches/bhyve-vnode-restore.diff wires it in, guarded behind an opt-in config key so the default bhyve -r path is unchanged.

lib/libvmmapi/internal.h

@@ -38,6 +38,7 @@
38 38 VM_SUSPEND, \
39 39 VM_REINIT, \
40 40 VM_ALLOC_MEMSEG, \
41 + VM_ALLOC_MEMSEG_VNODE, \
41 42 VM_GET_MEMSEG, \
42 43 VM_MMAP_MEMSEG, \
43 44 	VM_MUNMAP_MEMSEG, \

usr.sbin/bhyve/bhyverun.c

@@ -887,6 +887,27 @@
887 887 if (get_config_bool_default("memory.guest_in_core", false))
888 888 memflags |= VM_MEM_F_INCORE;
889 889 vm_set_memflags(ctx, memflags);
890 + #ifdef BHYVE_SNAPSHOT
891 + /*
892 + * Path A (patches/vmm-memseg-vnode.diff): if restoring AND the
893 + * operator opted in, pre-create the sysmem memseg as a shadow
894 + * over the checkpoint file's vnode. vm_setup_memory_domains
895 + * will then reuse the existing memseg; guest pages are shared
896 + * across every vm that restores from the same ckp, and
897 + * restore_vm_mem becomes a no-op because the pages are already
898 + * "loaded" via the vnode pager.
899 + */
900 + if (restore_file != NULL &&
901 + get_config_bool_default("snapshot.vnode_restore", false)) {
902 + if (vm_alloc_memseg_vnode(ctx, VM_SYSMEM, memsize, true,
903 + rstate.vmmem_fd, 0, 0) != 0) {
904 + fprintf(stderr,
905 + "vm_alloc_memseg_vnode failed (%d): %s\n",
906 + errno, strerror(errno));
907 + exit(4);
908 + }
909 + }
910 + #endif
890 911 error = vm_setup_memory_domains(ctx, VM_MMAP_ALL, guest_domains,
891 912 guest_ndomains);
892 913 if (error) {
@@ -949,10 +970,15 @@
949 970 exit(1);
950 971 }
951 972
952 FPRINTLN(stdout, "Restoring vm mem...");
953 if (restore_vm_mem(ctx, &rstate) != 0) {
954 EPRINTLN("Failed to restore VM memory.");
955 exit(1);
973 + if (get_config_bool_default("snapshot.vnode_restore", false)) {
974 + FPRINTLN(stdout,
975 + "Skipping vm mem copy (vnode-backed restore).");
976 + } else {
977 + FPRINTLN(stdout, "Restoring vm mem...");
978 + if (restore_vm_mem(ctx, &rstate) != 0) {
979 + EPRINTLN("Failed to restore VM memory.");
980 + exit(1);
981 + }
956 982 }
957 983
958 984 FPRINTLN(stdout, "Restoring pci devs...");
959 985
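With both patches applied, the opt-in path would be exercised along these lines. This is an illustrative invocation fragment, not a tested command line: it assumes bhyve's standard -o key=value config mechanism and the snapshot.vnode_restore key spelling from the diff; device and CPU configuration are elided.

```shell
# Opt in to the vnode-backed restore; the default -r path is unchanged.
bhyve -r /ckpt/golden.ckp \
      -o snapshot.vnode_restore=true \
      ... guest0
```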

What the upstream review found

The full audit lives at patches/upstream-review.md. Top five items before this is submittable:

  1. Reproduce the v1 GPF on an INVARIANTS kernel and capture the real stack. The cred=NULL rationalization may not be what actually fixed the crash.
  2. Fix the refcount handling. The vm_object_reference(backing) before vm_object_shadow may be leaking one reference per call — a slow leak that 1000-VM tests don’t catch but production soak tests will.
  3. Add flags and offset fields to struct vm_memseg_vnode now. Free forward-compat; not free once the ioctl ships.
  4. Bump lib/libvmmapi/Symbol.map. The new symbol isn’t exported from the shared library today; caught this by reading, not testing.
  5. Write one ATF test: create-map-CoW-destroy, asserting vm.object_count returns to baseline. A single test that catches three of the top four blockers automatically.

This is the punch list for “submit to Phabricator.” Everything in the review beyond that list is style-pass or documentation — the patches land or don’t on these five.

References