VNET jails (per-sandbox IPs)

Coppice’s jails used to run with ip4=inherit, sharing honor’s host IP. That was enough for the envd surface (the gateway owns those listeners) but it collapses the moment a user wants to expose anything from inside the jail — a Jupyter web frontend, a user web app, Chromium’s CDP on :9222. It also made pf filtering a uid/tag exercise rather than a subnet one. Step 3 of #69 moved to VNET jails on a dedicated bridge, and this page is the map of what changed.

Why we moved off ip4=inherit

Three problems, one switch.

First, listener collisions. Two sandboxes both wanting :9222 for CDP can’t have it; they share the host’s IP, and the kernel’s socket table is flat. Any serious browser-sandbox story needs per-jail addressability, full stop.

Second, pf rule gymnastics. A pass quick from any rule inside cube/sandbox-<short> matched all traffic leaving the host IP — not just this sandbox’s — because from pf’s point of view every jail was the host. Disambiguating meant tagging by uid (the per-sandbox jail uid), which works, but is fiddly and couples pf semantics to the jail backend. Source IPs are what pf wants to filter on, and the new setup gives it one.

Third, browser-sandbox (tracked as #60) is explicitly blocked without this. Playwright connects to a CDP WebSocket and expects to reach it at a stable address; no address, no browser sandbox.

Subnet plan

Two disjoint /24s on honor: 10.77.0.0/24 on cubenet0 for the bhyve pool, and 10.78.0.0/24 on coppicenet0 for jails.

The split is deliberate. bhyve rigs keep running on their own bridge with their own address plan; jails get their own space; no address reuse means no confused pf matches. The allocator lives in e2b-compat/src/ipalloc.rs and keeps its state in memory only, so it starts fresh with every gateway process.
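A minimal sketch of what such an in-memory allocator could look like. Only the names IpAllocator and allocate come from this page; the release method, the internal set, and the lowest-free-first policy are assumptions, not the contents of ipalloc.rs. The range is the one listed in the summary table, taken here as inclusive on both ends (which gives 241 addresses, so the real allocator may exclude an endpoint).

```rust
use std::collections::BTreeSet;
use std::net::Ipv4Addr;

/// In-memory pool of jail addresses on 10.78.0.0/24 (illustrative sketch).
pub struct IpAllocator {
    free: BTreeSet<u8>, // last octets that are still available
}

impl IpAllocator {
    /// Fresh pool covering 10.78.0.10 through 10.78.0.250 (assumed inclusive).
    pub fn new() -> Self {
        IpAllocator { free: (10u8..=250).collect() }
    }

    /// Hands out the lowest free address, or None when the pool is exhausted.
    pub fn allocate(&mut self) -> Option<Ipv4Addr> {
        let octet = *self.free.iter().next()?;
        self.free.remove(&octet);
        Some(Ipv4Addr::new(10, 78, 0, octet))
    }

    /// Returns an address to the pool; addresses outside the range are ignored.
    pub fn release(&mut self, ip: Ipv4Addr) {
        let [a, b, c, d] = ip.octets();
        if (a, b, c) == (10, 78, 0) && (10..=250).contains(&d) {
            self.free.insert(d);
        }
    }
}
```

Because the set lives only in the struct, a new gateway process starts with every address marked free, which is exactly the gap the reconstitution section describes.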

Per-sandbox lifecycle

On POST /sandboxes, the backend:

  1. Allocates an IP from the pool (IpAllocator::allocate).
  2. Creates an epair with ifconfig epair create, which creates epair<N>a and epair<N>b together and prints the a-end’s name.
  3. Adds the a-end to coppicenet0 on the host.
  4. Launches the jail with vnet=new and vnet.interface=epair<N>b, which hands the b-end into the jail’s fresh vnet.
  5. The jail’s exec.start (not exec.prestart, which is a landmine) assigns the IP inside the jail and runs route add default 10.78.0.1.

The exec.prestart vs exec.start distinction matters enough to spell out. exec.prestart runs on the host, before the vnet hand-off; the b-end isn’t in the jail’s stack yet, so any ifconfig or route commands poke the host’s routing table. On a host that already has a default route, route add default hits “File exists” and the jail comes up half-configured. exec.start runs inside the jail after hand-off, against a clean empty route table, and does the right thing.
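The lifecycle steps can be sketched as a function that builds the command strings without running them. CreatePlan, create_plan, and the jail name are hypothetical; the real backend in freebsd_jail.rs certainly passes more jail parameters (path, persist, and so on) and may structure this differently. The point is only where each command runs: the bridge membership on the host, the address and default route in exec.start.

```rust
use std::net::Ipv4Addr;

/// Host-side commands plus the in-jail exec.start line for one sandbox.
pub struct CreatePlan {
    pub host_cmds: Vec<String>,
    pub exec_start: String,
}

/// `epair_unit` is the N that a prior `ifconfig epair create` picked.
pub fn create_plan(jail_name: &str, epair_unit: u32, ip: Ipv4Addr) -> CreatePlan {
    let a_end = format!("epair{epair_unit}a");
    let b_end = format!("epair{epair_unit}b");

    // Runs inside the jail after the vnet hand-off, so it only ever
    // touches the jail's own (initially empty) route table.
    let exec_start =
        format!("ifconfig {b_end} inet {ip}/24 up && route add default 10.78.0.1");

    let host_cmds = vec![
        // The a-end stays on the host and joins the bridge.
        format!("ifconfig coppicenet0 addm {a_end}"),
        format!("ifconfig {a_end} up"),
        // vnet.interface moves the b-end into the jail's fresh vnet.
        format!(
            "jail -c name={jail_name} vnet=new vnet.interface={b_end} exec.start='{exec_start}'"
        ),
    ];
    CreatePlan { host_cmds, exec_start }
}
```

Nothing in host_cmds configures an address or a route, which is the property the exec.prestart landmine violates.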

Teardown reverses in the order that matters: jail -r first (which returns the b-end to the host), then ifconfig <a> destroy (which also removes the b-end, since epair pairs die together). The a-end name is stashed in a map keyed by sandbox id at create time so teardown can find it.
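A small sketch of the bookkeeping described above, assuming a plain HashMap from sandbox id to a-end name. EpairRegistry and its methods are hypothetical names, and the commands are returned as strings rather than executed.

```rust
use std::collections::HashMap;

/// Remembers which host-side epair end belongs to which sandbox.
#[derive(Default)]
pub struct EpairRegistry {
    a_ends: HashMap<String, String>, // sandbox id -> a-end interface name
}

impl EpairRegistry {
    /// Called at create time, once the a-end name is known.
    pub fn record(&mut self, sandbox_id: &str, a_end: &str) {
        self.a_ends.insert(sandbox_id.to_string(), a_end.to_string());
    }

    /// Teardown commands in the order that matters.
    /// Returns None if the sandbox was never recorded (or was already torn down).
    pub fn teardown_cmds(&mut self, sandbox_id: &str, jail_name: &str) -> Option<Vec<String>> {
        let a_end = self.a_ends.remove(sandbox_id)?;
        Some(vec![
            // Removing the jail returns the b-end to the host.
            format!("jail -r {jail_name}"),
            // Destroying the a-end takes the b-end with it.
            format!("ifconfig {a_end} destroy"),
        ])
    }
}
```

Removing the map entry as part of teardown makes a second teardown of the same sandbox a no-op instead of a destroy on an interface name that may have been reused.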

pf anchor semantics change

With ip4=inherit, a sandbox’s anchor rules read from any and relied on anchor scoping to not bleed. With VNET, every rule is source-IP-scoped:

block quick from 10.78.0.42 to <sandbox_deny>
pass  quick from 10.78.0.42 to <sandbox_allow>

The air-gapped fragment’s terminal rule goes from block quick all to block quick from 10.78.0.42 to any. Critically, lo0 inside the anchor now means the jail’s loopback, not the host’s — each VNET has its own. That’s the clean semantics we always wanted: in-jail services reach each other on 127.0.0.1 without accidentally reaching host services on the same address.
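To make the source-IP scoping concrete, here is a sketch that renders the rules above for an allocated address. The function names are hypothetical and this is not the gateway's actual rule generator; the rule text is copied from the examples on this page.

```rust
use std::net::Ipv4Addr;

/// The two table-driven rules for a sandbox, scoped to its source address.
pub fn scoped_rules(ip: Ipv4Addr) -> [String; 2] {
    [
        format!("block quick from {ip} to <sandbox_deny>"),
        format!("pass  quick from {ip} to <sandbox_allow>"),
    ]
}

/// Terminal rule of the air-gapped fragment: it only catches this sandbox's traffic.
pub fn air_gapped_terminal(ip: Ipv4Addr) -> String {
    format!("block quick from {ip} to any")
}
```

Every rendered rule names the sandbox's address as its source, so none of them depends on anchor scoping to stay out of another sandbox's traffic.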

The coppicenet0 bridge bring-up

Bridge creation plus NAT is handled by tools/coppice-net-setup.sh — idempotent, dead-man-switched. It:

  1. Creates the bridge (if missing) with 10.78.0.1/24.
  2. Appends nat-anchor "cube/*" (at top of the root ruleset, before filter rules — pf requires this ordering) and anchor "cube/*" (with the other filter anchors) to the existing root pf. Siblings like cube_policy and cube/sandbox-* are preserved.
  3. Loads a NAT rule into cube/jail-nat: nat on vm-public from 10.78.0.0/24 to any -> (vm-public).

vm-public, not re0. Honor’s root pf has set skip on re0 — meaning re0 is invisible to pf. NAT out re0 would never fire. The vm-public interface is the bhyve bridge that actually carries uplink traffic, and that’s where the NAT rule lives.

The dead-man switch is the usual daemon(8) pattern: reload rules now, schedule a revert after N seconds, cancel the revert once the smoke test passes. If the script crashes before the smoke test, root pf reverts to its previous state and ssh stays alive.
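The script itself is shell built around daemon(8); the sketch below only illustrates the same arm, test, cancel pattern in Rust, using a watchdog thread and a channel. arm_revert and its signature are made up for illustration.

```rust
use std::sync::mpsc::{self, RecvTimeoutError, Sender};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Arms a revert that fires after `timeout` unless cancelled first.
/// Returns the cancel handle and the watchdog thread; joining the
/// thread yields true if the revert ran.
pub fn arm_revert<F>(timeout: Duration, revert: F) -> (Sender<()>, JoinHandle<bool>)
where
    F: FnOnce() + Send + 'static,
{
    let (cancel, armed) = mpsc::channel::<()>();
    let watchdog = thread::spawn(move || match armed.recv_timeout(timeout) {
        // The smoke test passed and the caller cancelled: keep the new rules.
        Ok(()) => false,
        // Timed out, or the caller went away and dropped the handle
        // without cancelling: roll back.
        Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => {
            revert();
            true
        }
    });
    (cancel, watchdog)
}
```

The caller would reload the rules, call arm_revert with a closure that restores the previous ruleset, run the smoke test, and send on the cancel handle only if the test passes. Any other outcome ends in the revert.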

Startup reconstitution is punted

If the gateway restarts with live sandboxes still running, the IpAllocator forgets which IPs are in use. A fresh sandbox can in principle draw an IP that’s already live on an existing jail. The code path (reconstitute_ip_reservations) is stubbed in freebsd_jail.rs with a TODO and a sketch of what it needs to do: jls for jail names, ifconfig coppicenet0 for a-end members, map those to b-end IPs inside each jail, re-register with the allocator.
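One piece of that sketch is easy to show: recovering the a-end members from ifconfig coppicenet0 output and deriving the matching b-end names. The helpers below are hypothetical and are not the stubbed reconstitute_ip_reservations; they assume the usual "member: <ifname> flags=..." lines that ifconfig prints for a bridge. Reading each b-end's address inside its jail and re-registering it with the allocator is left out.

```rust
/// Extracts bridge member interface names from `ifconfig <bridge>` output.
pub fn bridge_members(ifconfig_output: &str) -> Vec<String> {
    ifconfig_output
        .lines()
        .filter_map(|line| line.trim().strip_prefix("member: "))
        .filter_map(|rest| rest.split_whitespace().next())
        .map(str::to_string)
        .collect()
}

/// Maps a host-side a-end (epair<N>a) to the b-end that lives in the jail.
/// Returns None for anything that is not an epair a-end.
pub fn b_end_of(a_end: &str) -> Option<String> {
    let unit = a_end.strip_prefix("epair")?.strip_suffix('a')?;
    if !unit.is_empty() && unit.chars().all(|c| c.is_ascii_digit()) {
        Some(format!("epair{unit}b"))
    } else {
        None
    }
}
```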

In practice this rarely bites: gateway restarts typically tear sandboxes down as a side effect, and when they don’t, the next collision surfaces fast. It’s a correctness gap, not an operational one, and it stays on the list.

DNS via local_unbound on the bridge gateway

With VNET the jail’s /etc/resolv.conf is inherited from honor at clone time. Honor’s is nameserver 127.0.0.1 — fine for honor (its own local_unbound listens there), but inside a VNET jail that 127.0.0.1 is the jail’s empty loopback, not the host’s resolver. DNS just dies.

Fix: run local_unbound (FreeBSD base, no ports) on the bridge gateway 10.78.0.1:53 as well as 127.0.0.1, and bake nameserver 10.78.0.1 into the jail-template ZFS datasets so every future clone inherits the right answer. tools/coppice-net-setup.sh drops a sentinel-wrapped /var/unbound/conf.d/coppice.conf that adds the bridge interface plus access-control: 10.78.0.0/24 allow, and rewrites forward.conf to forward . to 1.1.1.1 and 9.9.9.9 while preserving any LAN-scoped forward-zone blocks (so honor’s own host mybox.lan keeps working).
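As a rough illustration of the unbound configuration involved, the sketch below renders the two fragments as strings. The function names are hypothetical, and the sentinel wrapping and the preservation of LAN-scoped forward-zone blocks that the script performs are not modelled. It lists 127.0.0.1 explicitly on the assumption that an explicit interface: line replaces unbound's localhost default.

```rust
/// Server-side additions: listen on loopback and the bridge gateway,
/// and allow queries from the jail subnet.
pub fn bridge_resolver_conf(bridge_ip: &str, subnet: &str) -> String {
    format!(
        "server:\n\tinterface: 127.0.0.1\n\tinterface: {bridge_ip}\n\taccess-control: {subnet} allow\n"
    )
}

/// Root forward zone pointing at the given upstream resolvers.
pub fn root_forward_zone(upstreams: &[&str]) -> String {
    let mut out = String::from("forward-zone:\n\tname: \".\"\n");
    for addr in upstreams {
        out.push_str(&format!("\tforward-addr: {addr}\n"));
    }
    out
}
```

With the values from this page, the calls would be bridge_resolver_conf("10.78.0.1", "10.78.0.0/24") and root_forward_zone(&["1.1.1.1", "9.9.9.9"]).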

The split-binding is dead-man-by-construction: the host’s own resolver path is unchanged, so an unbound crash or misconfig won’t take honor’s DNS with it — worst case, new sandboxes can’t resolve until the service recovers. Air-gap compatibility rides free on step 7’s pass quick from <ip> to 10.78.0.1 rule — DNS queries are just more bridge-gateway traffic. Adding interface: to unbound requires a full service local_unbound restart, not a reload; the script detects this and restarts only when a new interface needs to be bound.

Air-gapped fragment learns to pass the gateway

The air-gapped fragment (see air-gapped) installs a blanket block quick from <ip> to any as the terminal rule. Under ip4=inherit this worked because pass quick on lo0 covered gateway traffic — the sandbox and gateway both lived on the host’s loopback. Under VNET, the gateway is at 10.78.0.1 from the sandbox’s point of view, which isn’t loopback and isn’t covered by the other pass rules. A follow-up commit in step 7 added pass quick from <ip> to 10.78.0.1 between the loopback pass and the DNS allowlist, so air-gapped sandboxes stay reachable for envd, metadata, and any other host-side control-plane service bound on the bridge. “Air-gapped” means no external internet, not no gateway.
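Since pf's quick rules are first-match, the ordering is the whole story, so here is a sketch of the fragment's shape. air_gapped_fragment is a hypothetical helper; the DNS allowlist rules are passed in as opaque strings because their exact form is not given on this page.

```rust
use std::net::Ipv4Addr;

/// Renders the air-gapped fragment in evaluation order.
pub fn air_gapped_fragment(ip: Ipv4Addr, dns_allowlist: &[String]) -> Vec<String> {
    let mut rules = vec![
        // Loopback pass (now the jail's own lo0).
        "pass quick on lo0".to_string(),
        // Gateway pass: keeps envd, metadata and DNS on the bridge reachable.
        format!("pass quick from {ip} to 10.78.0.1"),
    ];
    // Caller-supplied DNS allowlist rules come after the gateway pass.
    rules.extend(dns_allowlist.iter().cloned());
    // Terminal rule: everything else from this sandbox is dropped.
    rules.push(format!("block quick from {ip} to any"));
    rules
}
```

If the gateway pass came after the terminal block, it would never match, and the sandbox would lose its control plane along with the internet.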

Summary

| component | value | note |
|---|---|---|
| bhyve pool subnet | 10.77.0.0/24 on cubenet0 | Unchanged by #69. Disjoint from jail subnet by design. |
| jail subnet | 10.78.0.0/24 on coppicenet0 | New in #69 step 3. |
| bridge gateway | 10.78.0.1/24 | Configured on coppicenet0; jail’s default route points here. |
| allocator range | 10.78.0.10 – 10.78.0.250 | IpAllocator in e2b-compat/src/ipalloc.rs. 240 concurrent sandboxes. |
| NAT anchor | cube/jail-nat | nat on vm-public from 10.78.0.0/24 to any -> (vm-public). Not re0: pf skips re0. |
| root pf hooks | nat-anchor "cube/*" + anchor "cube/*" | Installed by tools/coppice-net-setup.sh, dead-man-switched. |
| per-sandbox anchor shape | from 10.78.0.<M> | Rules source-IP-scoped. lo0 now means jail’s own loopback. |
| DNS resolver | local_unbound on 10.78.0.1:53 | Base-system unbound, also listens on 127.0.0.1 so honor’s own DNS is untouched. Forwards . to 1.1.1.1 + 9.9.9.9. Template /etc/resolv.conf points at 10.78.0.1. |
| startup reconstitution | punted | TODO in freebsd_jail.rs; gateway restart with live jails may double-allocate. |

What this unblocks

Cross-refs: wildcard DNS for the SDK-routing side of the story (gateway still owns envd, regardless of per-jail IPs), air-gapped for the pf fragment that uses the new source-IP scoping, and eBPF → pf for the broader pf-as-policy story.