The hardest port. Four substrate-level questions: what do CubeNet’s three BPF programs actually do, which of that maps to pf, where does dummynet help, and where do you have to walk away.
What CubeNet actually does
From CubeNet/src/ and CubeNet/cubevs/:
- `nodenic.bpf.c` — attached to the host NIC. Classifies ingress packets to the right sandbox by matching against a map keyed on external address + port → sandbox TAP.
- `mvmtap.bpf.c` — attached to each sandbox's TAP. Per-sandbox L2/L3 accept/drop, SNAT rewrite.
- `localgw.bpf.c` — attached to the host-side bridge path for inter-sandbox traffic that stays on the local node. Implements the "local gateway" fast path that bypasses the host network stack for same-node sandbox-to-sandbox flows.
Go glue in cubevs/:
- `tap.go` — create/destroy TAPs, link into the VNET equivalent.
- `tc.go` — attach BPF programs via TC (classic clsact, or tcx if the kernel supports it).
- `snat.go` — populate SNAT maps keyed by source port / sandbox ID.
- `netpolicy.go` — populate per-sandbox allow/deny maps.
- `port.go` — HostPort → sandbox-port mapping.
- `reaper.go` — stale-entry cleanup.
Policy is expressed as BPF map entries, mutated from userspace, read by the BPF programs on each packet.
The pf / VNET / dummynet translation
TAP + VNET
Straight port:
- `tap(4)` (FreeBSD) ≈ Linux `tun`/`tap` with `IFF_TAP`.
- VNET jail ≈ network namespace.
- `bridge(4)` or `if_vxlan`/netgraph bridge ≈ Linux bridge.
- `epair(4)` ≈ `veth`.
No gap of consequence. vishvananda/netlink in the Go glue would be
replaced with freebsd-net/ifconfig or netlink emulation (FreeBSD has
gained a limited rtnetlink-compat shim since 13.x; coverage is incomplete
but adequate for create/destroy/up/down).
L3/L4 allow/deny
pf rules or pf tables.
- Static-ish policy (per-template, changes infrequently): a pf anchor per sandbox, loaded once at sandbox create, modified by `pfctl -a cube/<id> -f -`. Reload cost: ~ms per change.
- Dynamic set membership (changing allow-lists): pf tables. Add/remove addresses with `pfctl -t cube_allow_<id> -T add/delete`. Per-entry mutation cost: microseconds. This is the closest thing FreeBSD has to eBPF maps for membership queries.
- Per-flow state: pf’s state engine is good at this — stateful keep-state rules match the common case.
- Arbitrary packet mutation: not pf’s job. netgraph custom nodes cover this at the cost of C kernel-module development.
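As a sketch, a per-sandbox anchor body could be generated like this; the function name, table name, and 10.77.0.0/24 subnet are illustrative assumptions, not CubeNet or cubevs code. The generated text is what would be piped into `pfctl -a cube/<id> -f -`.

```shell
# Sketch: emit a per-sandbox pf anchor ruleset (persistent deny table plus
# minimal allows). Names and addresses are illustrative, not from the tree.
gen_anchor_rules() {
  id="$1"     # sandbox id, e.g. sandbox-5
  tap_ip="$2" # sandbox TAP address
  cat <<EOF
table <deny_${id}> persist
block in quick from <deny_${id}>
block out quick to <deny_${id}>
pass in quick from ${tap_ip} to 10.77.0.0/24 keep state
pass out quick from ${tap_ip} to any keep state
EOF
}

rules="$(gen_anchor_rules sandbox-5 10.77.0.15)"
printf '%s\n' "$rules"
# Load on FreeBSD (not run here):
#   printf '%s\n' "$rules" | pfctl -a cube/sandbox-5 -f -
```

Membership mutations then stay on the fast path (`pfctl -t deny_sandbox-5 -T add/delete`) while the anchor body itself only reloads on policy-shape changes.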
SNAT / rdr (DNAT)
pf `nat` and `rdr` cover this directly: `nat on $iface from <src> to any -> $external` for SNAT, and `rdr pass on $iface proto tcp from any to ($iface) port $hp -> <tap_ip> port $sp` for host-port forwarding. Cube’s host-port mapping ports to rdr rules verbatim.
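A hedged sketch of how a ruleset builder might render that pair for one sandbox; the interface, addresses, and ports are invented for illustration:

```shell
# Sketch: render the SNAT + host-port-forward pair for one sandbox, the pf
# equivalents of Cube's SNAT map and HostPort mapping. Values illustrative.
gen_nat_rdr() {
  iface="$1" ext="$2" tap_ip="$3" host_port="$4" sbx_port="$5"
  printf 'nat on %s from %s to any -> %s\n' "$iface" "$tap_ip" "$ext"
  printf 'rdr pass on %s proto tcp from any to (%s) port %s -> %s port %s\n' \
    "$iface" "$iface" "$host_port" "$tap_ip" "$sbx_port"
}

pair="$(gen_nat_rdr re0 192.168.1.182 10.77.0.15 30001 80)"
printf '%s\n' "$pair"
# Load on FreeBSD: printf '%s\n' "$pair" | pfctl -f -
```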
Rate limiting / QoS
dummynet(4) pipes and queues. Create a pipe per sandbox with a bandwidth cap and assign traffic to it with `ipfw add N pipe M ip from <tap_ip> to any`. Not pf-native (ipfw does the assignment to the pipe); mixing pf and ipfw this way is common and supported.
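Sketched as a generator (pipe number, address, and rate are illustrative); the `out` qualifier anticipates the bridge double-evaluation gotcha measured later in the dummynet rig:

```shell
# Sketch: emit the two ipfw commands for a per-sandbox egress cap.
# Pipe/rule numbers, address, and rate are illustrative assumptions.
gen_pipe_cmds() {
  pipe_no="$1" tap_ip="$2" rate="$3"
  printf 'ipfw pipe %s config bw %s\n' "$pipe_no" "$rate"
  # `out` keeps the pipe from charging bridged packets twice under
  # net.link.bridge.pfil_member=1.
  printf 'ipfw add %s0 pipe %s ip from %s to any out\n' \
    "$pipe_no" "$pipe_no" "$tap_ip"
}

cmds="$(gen_pipe_cmds 1 10.77.0.15 100Mbit/s)"
printf '%s\n' "$cmds"
```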
Inter-sandbox fast path (localgw)
This one has no clean substitute. localgw.bpf.c short-circuits inter-sandbox traffic so it doesn’t traverse the full host network stack — packets move TAP-to-TAP through the BPF fast path. pf does not have a “bypass the stack and forward from interface A to interface B based on a packet match” hook.
Options:
- netgraph. An `ng_bridge` node between TAPs can forward in-kernel without pf involvement. Classifying which flows stay local and which go out is an `ng_tee` + `ng_bpf` arrangement (ng_bpf uses classic BPF, not eBPF — less expressive but in-tree).
- Accept that local traffic traverses pf. pf is in the hot path for inter-TAP traffic via a bridge. The overhead is measurable but not catastrophic for the kinds of flows agent workloads generate.
What doesn’t map
- BPF maps as a userspace-mutable policy store read on every packet. pf tables approximate this for membership checks; arbitrary per-packet lookups with complex key types (e.g., 5-tuple → policy-decision struct) need custom netgraph.
- Per-sandbox eBPF programs with distinct logic. pf is a shared ruleset; anchors partition it. Different structure per sandbox requires different anchors, each of which is loaded as static pf syntax.
- XDP. No equivalent ingress-before-stack path.
- BPF helper functions like `bpf_redirect_peer` for tail-call-style fast-path composition. netgraph’s `ng_xxx` message passing is the model; the ergonomics differ.
What you’d actually build
A cubevs-freebsd with:
- Same Go RPC surface (`EnsureNetwork`, `ReleaseNetwork`, etc.) as CubeNet/cubevs.
- TAP-per-sandbox into VNET jails.
- pf ruleset builder producing one anchor per sandbox + a shared nat/rdr section.
- pf table maintenance for dynamic allow/deny set membership.
- netgraph bridge + ng_tee + ng_bpf for the local fast path (or accept the stack traversal).
- dummynet pipes for QoS.
- pfctl orchestration — batched reloads, not per-rule changes, to keep the tax down when a burst of policy updates lands.
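A minimal sketch of the batching discipline the last bullet names: spool a burst of mutations to a file and apply them with one atomic table replace instead of one pfctl spawn per entry. Table name and addresses are illustrative.

```shell
# Sketch: coalesce a burst of allow-list mutations into a single file,
# then one atomic pfctl table replace. Names/addresses are illustrative.
batch="$(mktemp)"
for ip in 10.77.0.21 10.77.0.22 10.77.0.23; do
  echo "$ip" >> "$batch"
done
n="$(wc -l < "$batch" | tr -d ' ')"
echo "batched $n entries into one reload"
# One call on FreeBSD: pfctl -t cube_allow_5 -T replace -f "$batch"
rm -f "$batch"
```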
Is this “as good as” CubeNet? For most policies, yes. For dynamic L4/L7 decisions that change hundreds of times per second, no — and at that rate, the question is whether CubeNet actually does that in production or merely can.
Implementation plan: cube-network-agent-freebsd
A Rust daemon (same stack as e2b-compat) that reconciles desired
sandbox network state to FreeBSD primitives. Same shape as CubeSandbox’s
network-agent, different backend.
cube-network-agent-freebsd/
Cargo.toml
src/
main.rs # axum RPC server — EnsureNetwork / ReleaseNetwork
reconcile.rs # desired-vs-actual state loop
tap.rs # tap(4) ioctl wrappers
vnet.rs # jail VNET plumbing (jexec, jail_attach)
pf/
anchor.rs # per-sandbox anchor template generation
table.rs # pfctl -t ... -T add/delete wrappers
rdr.rs # host-port → sandbox-port rules
nat.rs # SNAT rule generation
reload.rs # batched pfctl -f orchestration
dummynet.rs # optional ipfw pipe per sandbox
ng/
bridge.rs # ng_bridge for local fast path (optional)
bpf.rs # ng_bpf classifier construction
reaper.rs # stale anchor / table / TAP GC
metrics.rs # prometheus exposition
tests/
integration/ # end-to-end inside VNET jails
Pessimistic LoC budget: ~4500 Rust + ~200 shell. FTE-weeks:
5-7 for an engineer who already knows pf and jails, ~10
without that background. The EnsureNetwork(sandbox_id, policy) RPC
stays shape-identical with Cube’s; only the reconciler changes. That
matters because our e2b-compat already consumes this RPC.
Per-program mapping (verdict per file)
| CubeNet program | FreeBSD primitive | Fidelity | Effort |
|---|---|---|---|
| nodenic.bpf.c (ingress classifier) | pf rdr + tables | full | ~0.5 FTE-wk |
| mvmtap.bpf.c (per-sandbox egress + spoof + SNAT) | pf anchor + nat | full for L3/L4 | ~1 FTE-wk |
| localgw.bpf.c (inter-sandbox fast path) | ng_bridge (or accept pf) | partial | 0-3 FTE-wk |
localgw is the one Tier-3 piece. Three options, in increasing order of complexity:
- Accept pf in the hot path. Zero new code. Measure; if mean inter-sandbox latency is in the hundreds of microseconds range, ship it — most agent workloads won’t notice.
- Netgraph fast path. ng_bridge + per-TAP ng_ether, ng_bpf for classifier. ~400 LoC of C if we have to write a custom ng_ node.
- VALE / netmap. Strongest programmable L2 fast path FreeBSD has. TAPs run in netmap mode (changes how bhyve’s virtio-net backend attaches). ~600 LoC. Highest performance ceiling and the highest integration cost.
Recommendation: ship with (1), measure, escalate if numbers warrant.
How we test parity
This is where “paper parity” dies and real benchmarks matter. Four
classes of test under benchmarks/rigs/net/:
Latency
Yardsticks: netperf TCP_RR / UDP_RR for
small-packet round-trip; sockperf ping-pong for tighter
percentile reporting. iperf3 for throughput only.
Scenarios:
- `s2s-same-host` — sandbox A → sandbox B on the same node. Exercises `localgw` on Linux, pf (or netgraph) on FreeBSD.
- `s2internet` — sandbox → external IP on the lab network.
- `inet2s` — external host → sandbox via rdr/DNAT.
Report p50/p95/p99 RTT at 1 Kpps / 10 Kpps / 100 Kpps offered load.
Throughput
iperf3 with 1 / 4 / 16 streams, TCP and UDP, MTU 1500 and
9000. Report Gbps + CPU% at both ends. Watch for pf state-table
insertions/sec (pfctl -si) as the bottleneck signal. The
expected failure mode on the FreeBSD side is 25+ Gbps, not 10 Gbps.
Policy-update latency (the killer test)
This is where eBPF vs. pf is most visible and most meaningful.
Harness: a daemon accepts policy mutations on a Unix socket, applies them, and reports wall-clock from “ingest” to “next packet sees new rule”. Measurement:
- `ping -i 0.001` from sandbox A to sandbox B.
- Block B’s IP in the policy at T0; record T1 = wall-clock of the first dropped ICMP.
- Unblock at T2; record T3 = wall-clock of the first restored reply.
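The probe's detection step (turning raw ping output into the first dropped sequence number, i.e. T1) can be sketched as a small parser. The function name and the synthetic sample are illustrative, not taken from policy-update-probe.sh:

```shell
# Sketch: scan ping output for the first missing icmp_seq, which marks the
# first packet the newly-installed rule dropped. Sample input is synthetic.
first_gap() {
  awk 'match($0, /icmp_seq=[0-9]+/) {
         seq = substr($0, RSTART + 9, RLENGTH - 9) + 0
         if (seq != prev + 1 && NR > 1) { print prev + 1; found = 1; exit }
         prev = seq
       } END { if (!found) print -1 }'
}

gap=$(printf 'icmp_seq=1 time=0.07\nicmp_seq=2 time=0.07\nicmp_seq=5 time=0.08\n' | first_gap)
echo "$gap"   # first dropped seq in the synthetic sample: 3
```

In the real harness this runs against the live ping stream and the gap's timestamp, not its sequence number, is what gets reported as T1.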
Expected per-platform:
| platform | operation | expected latency |
|---|---|---|
| Linux eBPF | bpf_map_update_elem | tens of µs |
| FreeBSD pf tables | pfctl -t ... -T add | hundreds of µs to low ms |
| FreeBSD pf anchor reload | pfctl -a cube/$id -f - | single to low-tens ms |
Then run 1000 concurrent mutations/sec and report steady-state tail.
Fault injection / blast radius
- Does an invalid pf ruleset take down the existing dataplane? (Expected: `pfctl -f` is atomic per-anchor; failure rolls back.)
- Does an invalid eBPF program load atomically? (Verifier rejects at load; the attached program is unaffected.)
- What happens to in-flight states during reload?
- Does a bad rule in sandbox X’s anchor affect sandbox Y?
Observability parity
“An outbound packet from sandbox X to 8.8.8.8:53 was
dropped. Why?” — on Linux, bpftool prog tracelog or a
counter map; on FreeBSD, pfctl -vvs rules +
tcpdump -i pflog0. Deliverable: a
diagnose.sh wrapper that prints all relevant state on both
platforms, measured side-by-side as mean-time-to-diagnosis across
ten canned failure modes.
Shipped 2026-04-22 at
tools/diagnose.sh
— one invocation dumps: pf status + state-count + anchor/table totals;
rule counters from root and every anchor under cube
(including cube_policy) sorted by packet count with hits
headered [HIT]; state-table slice (optionally filtered
with --ip 10.77.0.5 or
--sandbox sandbox-5 where the latter resolves via
coppice-pool-ctl list); pflog parse from
/var/log/pflog when pflogd is active (with
--since 30s windowing), else a 2 s live
tcpdump -i pflog0 fallback; cubenet0 + tap
packet/error columns from netstat -i; every
cube_deny-prefixed table’s contents across all anchors;
and a hints block that flags the three enforcement holes documented
above (pfil_member=0, missing root anchor reference, pf
disabled). Regression rig at
benchmarks/rigs/net/diagnose-demo.sh
installs a scoped cube_demo/diagnose anchor with a
cube_deny_demo table on 10.99.x.x scratch IPs (never
touches the real cube_policy or per-sandbox anchors),
runs diagnose.sh, and asserts the deny-table contents + rule-counter
lines surface. Sample output:
benchmarks/results/diagnose-demo-2026-04-22.txt.
The ten-failure-mode MTTD measurement is a separate follow-up; the
tool itself is the prerequisite and it’s in the tree.
Test-harness layout
benchmarks/rigs/net/
common.sh # shared probes (netperf wrappers, etc.)
lab-setup-linux.sh # bring up two Cilium sandboxes on Linux
lab-setup-freebsd.sh # bring up two bhyve+tap sandboxes on FreeBSD
latency-s2s.sh
latency-s2inet.sh
latency-inet2s.sh
throughput-tcp.sh
throughput-udp.sh
throughput-multistream.sh
policy-update-probe.sh # ping-based detection
policy-update-bench.rs # direct syscall/pfctl timing harness
policy-churn.sh # 1000 mutations/sec soak
fault-bad-pf-rule.sh
fault-bad-ebpf.sh
fault-reload-under-load.sh
observability/diagnose.sh
Run target: ideally a two-node lab with identical 10 GbE between them,
one node per OS, one sandbox pair per run. For now we exercise a
compressed version of this on a single FreeBSD host via two VNET
jails bridged through cubenet0 — see the measured
numbers below. Stubs for the per-OS Linux+Cilium side live in
benchmarks/rigs/net/; the FreeBSD side is stood up and
captured in lab-setup-freebsd.sh and
run-net-bench.sh.
Measured on honor (2026-04-22)
Lab up — two VNET jails (sbx-a at 10.77.0.2, sbx-b at 10.77.0.3) on
a single FreeBSD host, connected through a dedicated cubenet0 bridge
with a pf cube_policy anchor enforcing deny-by-default
plus explicit allows. Setup script:
benchmarks/rigs/net/lab-setup-freebsd.sh.
| metric | cubenet / pf (honor) | Cube / eBPF (claimed or typical) | notes |
|---|---|---|---|
| sandbox↔sandbox p50 RTT | 7 µs | ~5-10 µs | netperf TCP_RR, 1-byte req/resp. Single VNET jail hop through pf state-check + bridge. |
| sandbox↔sandbox p99 RTT | 8 µs | 10-15 µs | stddev 0.51 µs. No GC pauses, no tc qdisc lottery. |
| TCP throughput 1-stream | 14.6 Gbit/s | ~15-20 Gbit/s intra-host | iperf3, 5 s run. Not wire-limited (physical re0 is 1 Gbit/s) — this is intra-host memory bandwidth + TCP stack + pf state lookup. |
| TCP throughput 4 streams | 10.2 Gbit/s | linear scaling typical | CPU-bound on a single socket-side thread. Not unexpected for epair-through-bridge. |
| Policy update: single add | 4.2 ms wall | 1-5 ms (bpftool) | pfctl -a cube_policy -t cube_deny -T add <ip>. Dominated by process spawn, not kernel work. |
| Policy update: 1000 IPs batched | 4 ms total | Cilium bulk similar | One pfctl -T replace -f <file> call. Effective ~250 k ops/sec, bounded by pf’s atomic-swap table mechanics. |
| Policy update under ~14 Gbit/s load: per-op | p50 1.04 ms, p99 1.25 ms | tens of µs (eBPF map) | Idle p50 1.01 ms, p99 1.21 ms. Delta < 50 µs at p99. Mutation latency does not degrade with concurrent traffic. |
| Policy update under ~14 Gbit/s load: batched 1000 | p50 1.72 ms, p99 1.97 ms | eBPF bulk similar | Idle p50 1.74 ms. Batch replace is indistinguishable idle vs under-load. ~580 k entries/sec effective. |
| Throughput under 1 kops/sec churn | 13.8 Gbit/s | no degradation expected | Idle baseline 11.6–13.9 Gbit/s (noisy). Throughput CoV drops from 0.13 idle to 0.04 under churn — steadier, not jittier. |
| Enforcement latency | next packet | next packet | pf tables are O(1) per-packet lookups with no cached decisions. No stale-ruleset window. |
| Ingress rdr add-latency p50 | 0.24 µs | µs-scale DNAT | pf-rdr-translated TCP_RR 9.87 µs p50 vs bare sandbox↔sandbox 9.63 µs p50. Delta is smaller than run-to-run jitter. 300 k transactions/run via benchmarks/rigs/net/ext-to-sandbox.sh. End-to-end verified on the vm-public rule with curl http://192.168.1.182:30001/ from an LAN peer returning 200. |
| Ingress rdr add-latency p99 | 0.17 µs | µs-scale DNAT | 11.12 µs p99 via rdr vs 10.95 µs bare. No measurable tail inflation. |
| rdr rule add, N=10 / N=100 anchors | 1.21 ms / 1.20 ms | bpftool-equivalent | Adding one rdr rule into a fresh sibling anchor; flat with N. Cost is pfctl process spawn + kernel ioctl, not per-anchor table work. Warm-path reloads with an open /dev/pf handle would strip the ~1 ms spawn. |
| dummynet egress rate cap fidelity | >95% of configured rate | tc/htb similar | Per-sandbox egress pipe via ipfw add pipe 1 ip from <tap_ip> to any out. Caps 10 / 100 / 500 / 1000 Mbit/s achieved 9.63 / 96.8 / 482 / 954 Mbit/s respectively (baseline unshaped 6750 Mbit/s). Rule-load wall-time 1.56–2.20 ms, flat across caps. Isolation holds: with sbx-a piped at 100 Mbit/s, sbx-b → sbx-a hits 7271 Mbit/s (pipe rule only matches src=sbx-a). Two traps worth naming: (1) kldload ipfw attaches with a default-deny rule unless net.inet.ip.fw.default_to_accept=1 is set via kenv before load, and (2) net.link.bridge.pfil_member=1 — which the cube_policy anchor needs — makes ipfw evaluate bridge packets twice per direction, so unqualified pipe rules double-charge the budget and cut achieved throughput in half. Matching … out fixes it. Rig: benchmarks/rigs/net/rate-limit-dummynet.sh. |
| CubeProxy Host-header dispatch overhead | +109 µs p50 / +187 µs p99 | comparable L7 proxy | Direct HTTP host → sbx-a:80: 96 µs p50 / 128 µs p99. Via Go cubeproxy decoding <port>-<id>.coppice.lan: 205 µs p50 / 315 µs p99. Single userspace accept/parse/dial per request; same shape as Cube’s CubeProxy. pf alone can’t do this split because it doesn’t inspect L7 payloads. Source at tools/cubeproxy. |
All FreeBSD numbers measured on honor (Ryzen 9 5900HX, 32 GB, FreeBSD 15.0-RELEASE, SNAPSHOT kernel). Cube numbers are either published Tencent claims or typical Linux+Cilium values on comparable hardware — not head-to-head measurements.
IPv6 parity (2026-04-22)
The v4 story above ports to v6 with no architectural changes — same
cubenet0 bridge, same cube_policy anchor (extended with v6 rules +
a cube_deny_v6 table), same pf semantics. Setup:
lab-setup-freebsd-v6.sh
(additive over the v4 script — brings up fd77::/64 ULA on the bridge,
assigns fd77::2 / fd77::3 to sbx-a / sbx-b in place, merges v6
rules into the existing anchor). Rig:
run-net-bench-v6.sh.
| metric | v4 (same run) | v6 (fd77::/64) | notes |
|---|---|---|---|
| TCP_RR p50 RTT | 8 µs | 8 µs | No measurable penalty. netperf -6 TCP_RR, 1-byte req/resp, 5 s. p99 7 → 10 µs (still in the noise-floor band). |
| TCP_RR mean RTT | 8.27 µs | 8.46 µs | +2.3% v6 vs v4 — well under the ±10% parity target. |
| iperf3 TCP 1-stream | 7.34 Gbit/s | 6.19 Gbit/s | v6 lands at 84% of v4. Absolute v4 is below the 14.6 Gbit/s quiescent baseline because two sibling subagents shared the host during this run; the v6/v4 ratio is the relevant signal. The ~16% gap is dominated by the 40-byte v6 header vs 20-byte v4 and by the v6 path going through the additional inet6 rule-block in the merged anchor — both fixed per-packet costs. |
| cube_deny_v6 enforcement | PASS | PASS | pfctl -a cube_policy -t cube_deny_v6 -T add fd77::3 followed by pfctl -k fd77::3 (kill existing states): sbx-a → sbx-b ping6 goes from 0% loss to 100% loss. Delete restores reachability on the next packet. Same semantics as v4. |
| pfctl -T add/delete wall time | ~1.2 ms | 1.36 / 1.20 ms | No v6 penalty on table mutation. Dominated by pfctl process spawn, same as v4. |
| External egress (NAT66) | 18.7 ms | 23.8 ms | sbx-a → 2606:4700:4700::1111 via nat on re0 inet6 from fd77::/64 to any -> <host-global-v6>. Host-direct v6 RTT from honor is 23.5 ms; NAT66 adds less than the ICMP sample noise. Home ISP (Comcast) provides SLAAC /64 via router advertisements on re0, which the script auto-discovers. |
Honor, 2026-04-22. Same host + same lab jails as the v4 table above.
Receipts: benchmarks/results/net-v6-2026-04-22.txt.
Dual-stack gotcha, mirror of the v4 one. The v4 NAT rule correctly
sits on vm-public because that’s the interface pf sees the outbound
packet on first (re0 is a member of the bridge; pf-on-re0-NAT never
matches). The v6 rule has to go on re0 directly: on honor, vm-public
has no v6 addresses and the global SLAAC address lives on re0 itself.
Putting NAT66 on vm-public gets zero matches. The v6 setup script
auto-detects by reading the default v6 route’s egress interface rather
than hard-coding. If a future host moves the global v6 onto the bridge
(e.g. a different ISP with a routed /48), the same script picks that
up without modification.
The headline: parity or better on every axis that matters for
agent-sandbox workloads. The 4 ms single-add wall time looks
alarming next to “tens of µs” Linux bpf_map_update_elem, until you
notice two things. First, the 4 ms is almost entirely pfctl
process startup — libpfctl from C or a long-lived daemon that opens
/dev/pf once hits the kernel in tens of µs too. Second, nobody
actually calls bpf_map_update_elem 1 k times per second
individually on Linux either; you batch. And batching on pf is a
single ioctl that swaps an entire 1000-entry table atomically in
4 ms — which is the number the policy-update story actually
cares about.
The throughput number is more interesting. At 14.6 Gbit/s single-stream intra-host, the bridge + pf path is not far from a veth+tc stack on Linux — within a factor that’s easily explained by CPU frequency and cache effects, not an architectural gap. The multi-stream drop to ~10 Gbit/s at 4 streams hints that pf’s state table becomes the contended resource under parallel load — a known tradeoff of stateful packet filtering. For comparison, Cilium’s eBPF datapath is largely stateless per-packet (state lives in per-CPU bpf maps, merged on policy change), which is why it scales differently. For sandbox-to-sandbox on the same host, this is academic; for sandbox-to-internet egress at scale, it matters, and it’s where we’d reach for ipfw + dummynet rather than pf.
And dummynet itself delivers: per-sandbox egress pipes hold their
configured rate to >95% across three orders of magnitude (10 Mbit/s
up to 1 Gbit/s), with ~2 ms rule-install cost, and shaping one
sandbox’s src-IP leaves its neighbor completely untouched — a per-TAP
QoS story shaped the same way Cube’s is. The two FreeBSD-specific
gotchas are worth pinning down in code rather than tribal knowledge:
net.inet.ip.fw.default_to_accept=1 has to be set via
kenv before kldload ipfw (otherwise
the module attaches with a default-deny rule and ssh dies in seconds),
and pfil_member=1 makes ipfw see bridge traffic twice per
direction, so pipe rules must be qualified out or the
shaper double-charges and cuts achieved throughput in half.
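Pinned down as a command plan (a sketch; rule and pipe numbers and the address are illustrative), the ordering the two traps force looks like this:

```shell
# Sketch: the load order the dummynet gotchas imply. default_to_accept must
# reach the kernel environment BEFORE the module attaches, and the pipe rule
# is qualified `out` so pfil_member=1 doesn't double-charge bridge traffic.
# Emitted as a plan, not executed here; numbers/addresses are illustrative.
ipfw_load_plan() {
  cat <<'EOF'
kenv net.inet.ip.fw.default_to_accept=1
kldload ipfw
ipfw pipe 1 config bw 100Mbit/s
ipfw add 100 pipe 1 ip from 10.77.0.15 to any out
EOF
}

plan="$(ipfw_load_plan)"
printf '%s\n' "$plan"
```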
Anchor scale
Cube gives each sandbox its own BPF map partition. The FreeBSD analog is one pf anchor per sandbox — cube/sandbox-<id> or similar. The open question was whether pf could hold 1000+ sibling anchors without load latency degrading into the tens-of-ms range. Rig:
policy-anchor-churn.sh builds N sibling anchors under a dedicated parent (cube/scale-test/sandbox-<i>), then times loading and flushing a minimal ruleset into a probe anchor (cube/scale-test/sandbox-new), 5 repeats each, at N = 1, 10, 100, 500, 1000, 2000.
| N siblings | load median (ms) | load p95 (ms) | flush median (ms) | flush p95 (ms) |
|---|---|---|---|---|
| 1 | 1.39 | 1.56 | 1.08 | 1.10 |
| 10 | 1.41 | 1.46 | 0.94 | 1.11 |
| 100 | 1.50 | 1.59 | 1.00 | 1.13 |
| 500 | 1.45 | 1.60 | 1.05 | 1.14 |
| 1000 | 1.43 | 1.51 | 1.06 | 1.15 |
| 2000 | 1.51 | 1.52 | 1.08 | 1.12 |
`pfctl -a cube/scale-test/sandbox-new -f -` wall-clock for load; `-F rules` for flush. Flat. No cliff. Load latency stays at 1.4–1.5 ms through N=2000; flush holds at ~1 ms. Most of that 1.4 ms is pfctl fork/exec overhead — a long-lived daemon calling libpfctl directly would see tens of µs in kernel time.
There is a cliff at default settings — but it’s a config knob, not an architectural limit. FreeBSD pf ships with set limit anchors 512 (and the matching eth-anchors). First run hit “PF anchor limit reached” at N=468. Bumping both to 4096 via the root ruleset (set limit anchors 4096; set limit eth-anchors 4096) lifts the ceiling; the kernel allocator then handles 2000+ anchors without detectable slowdown. net.pf.request_maxcount (65535 by default) is unrelated — it caps individual request sizes, not anchor counts.
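In sketch form, assuming the raised limits are carried at the top of the root ruleset (the anchor reference mirrors the init verb's "cube/*"; the surrounding ruleset would be re-emitted verbatim around it):

```shell
# Sketch: root-ruleset fragment lifting the default anchor ceiling (512)
# to the value the sweep validated. Fragment is illustrative; the real
# controller re-emits the rest of the root ruleset unchanged around it.
gen_root_limits() {
  cat <<'EOF'
set limit anchors 4096
set limit eth-anchors 4096
anchor "cube/*"
EOF
}

root="$(gen_root_limits)"
printf '%s\n' "$root"
# On FreeBSD: gen_root_limits | pfctl -f -
```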
Cross-anchor isolation verified. Rig
policy-anchor-isolation.sh: a block quick proto tcp from 10.77.0.2 to 10.77.0.3 port 32055 loaded into cube_scale/sandbox-5 drops the flow it targets. An unrelated block … port 32066 loaded into cube_scale/sandbox-6 does not affect port 32055 and does correctly block its own declared port. Per-sandbox anchors partition policy as advertised.
Teardown is a two-step dance. pfctl -a cube_scale/sandbox-N -F rules clears the anchor ruleset in ~1 ms. But pf keep-state entries live in the global state table and survive rule flush — they age out on their normal TCP timers. For a sandbox-release path where the sandbox’s IP is being returned to the pool, the controller must also call pfctl -k <src> -k <dst> on the sandbox’s IP to reap those states immediately. Rig:
policy-anchor-teardown.sh.
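A sketch of the release sequence the rig implies; the anchor path and IP are illustrative, and the commands are emitted as a plan rather than executed:

```shell
# Sketch: the two-step teardown. Flushing the anchor clears rules in ~1 ms,
# but keep-state entries survive in the global state table, so the sandbox
# IP must also be reaped with pfctl -k before it returns to the pool.
release_steps() {
  id="$1" ip="$2"
  printf 'pfctl -a cube_scale/%s -F rules\n' "$id"
  printf 'pfctl -k %s\n' "$ip"               # states sourced from the sandbox
  printf 'pfctl -k 0.0.0.0/0 -k %s\n' "$ip"  # states destined to it
}

steps="$(release_steps sandbox-5 10.77.0.15)"
printf '%s\n' "$steps"
```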
One cost we paid along the way: the lab’s bridge-filter sysctl (net.link.bridge.pfil_member) ships at 0 on FreeBSD 15, meaning pf does not see inter-member bridge traffic by default. This affected any per-anchor enforcement claim on cubenet0 — before we flipped it to 1, rules loaded into any anchor had zero effect on sandbox-to-sandbox traffic regardless of configuration. The intra-sandbox latency and throughput numbers above still stand (they measure stack traversal cost, not enforcement); the enforcement leg of the lab simply needs this knob set. lab-setup-freebsd.sh will pick this up in the next revision, and the anchor-scale rig reports a warning when the limit is too low for the sweep.
Policy update under load — the contention question
The idle policy-update numbers (4 ms/add, 4 ms/1000-batched)
only matter if they hold while the bridge is carrying real traffic.
pf’s table lock is one globally-held kernel structure, not a per-CPU
map; the open question was whether a 1 kops/sec mutation stream
would collide with per-packet state lookups on the hot path and either
stall the dataplane (throughput craters) or starve the control plane
(mutation p99 spikes). Rig:
policy-churn-under-load.sh runs iperf3 single-stream
plus a tight pfctl -t cube_deny -T add/delete loop and
samples iperf3 bandwidth every 100 ms.
The answer: no visible contention. Per-op mutation p99 goes from
1.21 ms idle to 1.25 ms under ~14 Gbit/s — a delta well
under a single context switch. iperf3 throughput under churn
(13.8 Gbit/s, CoV 4.1%) is actually steadier than the idle
baseline (11.6-13.9 Gbit/s across reruns, CoV up to 14%), because
the churn loop keeps cores pinned in a high-frequency state rather
than letting them drift. Batched pfctl -T replace of
1000 entries is identical idle vs under-load (1.7 ms median both
sides). pf’s table-lock granularity is finer than the global
single-lock worst-case suggested. For sandbox workloads that rewrite
deny-lists thousands of times per minute while saturating the bridge,
the data says ship it — the tail doesn’t move.
(Per-op numbers here are a bit lower than the earlier 4 ms/add
figure because the earlier rig timed only the first invocation on an
empty table. Under a sustained 1 kops/sec stream with cache-warm
pfctl, the steady-state per-op cost is closer to
1 ms; the bottleneck remains process spawn, not kernel work.)
Multi-stream scaling — what’s actually contended
The single-stream 14.6 Gbit/s above is half the story. The other half: a 4-stream run dropped to ~10 Gbit/s and an early 16-stream run sagged to ~9 Gbit/s, which naively reads as “pf state-table contention eats the parallelism.” That hypothesis is wrong — it’s the first thing you’d try because stateful packet filters can have globally-contended state hashes, but the measurement doesn’t support it here. Rig: throughput-multistream.sh sweeps iperf3 -P stream counts, TCP and UDP at MTU 1500, and at each point samples (a) aggregate sum_sent bits/s, (b) host CPU%, (c) pf state-table insertion delta over the run, and (d) pf state-table high-water mark via pfctl -si before/after.
| streams | TCP Gbit/s | host CPU% | pf states (hw) | pf searches/sec | pf inserts/sec |
|---|---|---|---|---|---|
| 1 | 7.10 | 16.6 | 182 | 1.80 M | 1.0 |
| 2 | 6.24 | 17.0 | 138 | 1.58 M | 1.1 |
| 4 | 6.28 | 16.3 | 139 | 1.52 M | 1.7 |
| 8 | 5.67 | 15.9 | 147 | 1.44 M | 2.7 |
| 16 | 7.00 | 17.5 | 180 | 1.83 M | 4.6 |
| 32 | (drop) | 16.4 | 257 | 1.61 M | 10.1 |
| 64 | 6.34 | 16.5 | 382 | 1.65 M | 16.8 |
| 128 | 6.44 | 16.5 | 635 | 1.19 M | 23.1 |
Three findings settle the question. First, it’s not a cliff; it’s a flat noisy plateau. Sag from P=1 to P=128 is ≤15% — well within run-to-run variance on this (loaded) host. Second, pf state-table is not contended. High-water 635 states against a default 131 072-bucket hash (net.pf.states_hashsize) is a load factor of 0.005. There is no plausible contention story at that fill level — even if every bucket were a heavy linked list (it’s not), you’d need four orders of magnitude more state before the hash matters. net.pf.source_nodes_hashsize (32768) is similarly irrelevant because source-node tracking is only allocated for rules with sticky-address or max-src-nodes; the cube_policy anchor uses neither. Third, host CPU stays pinned at 16-17% regardless of stream count — which on a 16-thread host is exactly one core maxed. That’s the single-threaded TCP send path (iperf3 calls send() on each stream from a worker thread, but all streams funnel through one socket-to-epair transmit path in the kernel; TSO/LRO and the epair driver’s single-producer queue serialize them). The UDP sweep confirms: iperf3 -u -b 2G pushes 148 Gbit/s attempted at P=128 and drops 96% because the receiver’s single-thread hits 100% CPU at P=32. Same story, different direction.
The 14.6 → 9 Gbit/s drop from baseline is a CPU-pinning artifact, not an architectural ceiling. Quiescent single-stream runs on an idle host let the sender thread boost to full turbo and stay on one core’s cache-hot path; 16 streams from the same iperf3 process fight over the same thread pool and the same TX queue, losing ~a third of the cycles to scheduler overhead and TSO-batch underflow. Raising net.pf.states_hashsize cannot help because pf isn’t the bottleneck; raising it would be superstition.
Does this matter for Coppice? No. The Coppice workload is 100s of low-bitrate agent flows (~Mbit/s each from a Python SDK call, not 16×10-Gbit/s iperf3 streams). At 500 concurrent sandboxes with 5 Mbit/s each, that’s 500 pf states against a 131-k hash (load factor 0.004) and 2.5 Gbit/s aggregate — an order of magnitude inside the single-stream ceiling and two orders inside the state-table headroom. The “multi-stream scaling” gap is a synthetic-benchmark curiosity; it was a real open question worth answering, and the answer is the suspect was wrong. Receipts: benchmarks/results/throughput-multistream-2026-04-22.txt.
See the full eBPF on FreeBSD survey for the broader ecosystem picture — DTrace vs bpftrace, netmap vs XDP, where the FreeBSD networking stack has genuine gaps against Linux+eBPF vs where the comparison is a wash.
L7 policy — the sidecar story
pf is L3/L4; it does not inspect HTTP payloads. Cilium on Linux does per-HTTP-request policy (method-deny, path-prefix-deny, header-deny) by redirecting connections through an L7 proxy (Envoy, via in-kernel bpf_sk_assign hooks). The FreeBSD analog is a userspace proxy sidecar in-path between the sandbox and the outside world — same architectural idea, just without the in-kernel redirector. The cleanest placement for Coppice is one sidecar per sandbox, listening on 127.0.0.1:8080 inside the VNET jail, with the sandbox workload configured to speak HTTP to that address instead of the real upstream (equivalent to how Istio runs Envoy). A single-host-Envoy model with per-sandbox listeners would be closer to Cilium’s topology but heavier to set up on FreeBSD and doesn’t change the enforcement semantics.
Envoy itself is not packaged on FreeBSD 15.0. pkg search -x '^envoy' is empty; there is no port in net-proxy/ or www/ for it. Upstream Envoy’s Bazel build carries Linux-only sanitizer paths, a BoringSSL assumption that collides with LLVM-libunwind, and tcmalloc/kqueue shim gaps — the last serious attempt (envoyproxy/envoy#8792) stalled years ago. We therefore shipped the Envoy reference config at tools/envoy/coppice-sidecar.yaml (documents intent; drops in unmodified if/when a working port lands) and ran the rig with haproxy 3.2.15 as the substitute. haproxy expresses the three policy primitives we need via native ACLs — path_beg /admin/, method POST, fall-through to the upstream — with no Lua required. The running config is tools/envoy/coppice-sidecar-haproxy.cfg; the rig is benchmarks/rigs/net/l7-policy-envoy.sh.
Measured on sbx-a, N=5 end-to-end runs: sidecar startup 10 ms (fork-to-listening-on-127.0.0.1:8080, dead stable), per-request latency overhead ~80 µs (58-106 µs run-to-run) on a ~460 µs direct-loopback baseline, policy reload 12 ms (haproxy -sf graceful swap — old process drains, new process owns the listener), teardown 6-7 ms. 3/3 policy-matrix probes green on every run: GET /get → 200, POST /post → 403, GET /admin/secret → 403. Coppice-path example: GET /foo → 200 (allowed), POST /bar → 403 (blocked) — the direct analog of Cilium’s two-route L7 policy test. Receipt: l7-policy-envoy-2026-04-22.txt.
Gotcha worth flagging: fresh VNET jails often get only inet6 ::1 on lo0 — haproxy’s default bind is 127.0.0.1 and fails with Can’t assign requested address. The rig now auto-aliases 127.0.0.1/8 onto lo0 inside the jail before spawning the sidecar. For production, bake the alias into the jail start config (or have the controller do it on checkout). Separately: the rig uses reverse-proxy mode (sandbox calls http://127.0.0.1:8080/path), not forward-proxy (curl --proxy http://127.0.0.1:8080 http://upstream/…); haproxy supports forward-proxy via option http-use-proxy-header but adds no policy-coverage gain, and transparent pf-rdr-plus-SO_ORIGINAL_DST-equivalent (ipfw fwd on FreeBSD) is feasible but an extra hop we don’t need to take until a specific deployment asks for it.
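The three ACL primitives the rig exercises can be sketched as a minimal frontend. Bind address and backend name here are illustrative; the authoritative config is the one shipped in the tree:

```shell
# Sketch: the Cilium-style L7 primitives (path-prefix deny, method deny,
# fall-through allow) as haproxy ACLs. Emitted as text; names illustrative.
gen_l7_policy() {
  cat <<'EOF'
frontend sidecar
    bind 127.0.0.1:8080
    acl is_admin path_beg /admin/
    acl is_post  method POST
    http-request deny status 403 if is_admin
    http-request deny status 403 if is_post
    default_backend upstream
EOF
}

cfg="$(gen_l7_policy)"
printf '%s\n' "$cfg"
```

This maps one-to-one onto the measured policy matrix: GET /get passes to the backend, POST /post and GET /admin/secret both hit a 403 before leaving the jail.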
Bottom line. The L7-policy enforcement surface on FreeBSD matches what Cilium does — method, path-prefix, and header deny are all expressible and all measured working — and the cost is small enough (10 ms startup, ~80 µs per-request) to run one sidecar per sandbox at Coppice’s agent-density target. What’s still open is (a) an upstreamable Envoy port if we want Envoy specifically rather than a semantically-equivalent proxy, and (b) a single-host multi-listener model if we need a Cilium-proper topology. Neither blocks the thing Coppice actually needs out of L7. See also /appendix/l7-proxy-survey for the evaluation matrix.
Per-sandbox lifecycle (how the controller wires it all together)
A pf anchor and a table are nouns. The verb that ties them into a
sandbox’s lifetime lives in tools/coppice-pool-ctl.sh
— a small shell controller (~250 LoC) with four verbs:
coppice-pool-ctl init # one-time: add anchor "cube/*" to root pf
coppice-pool-ctl checkout <entry> # allocate IP, tap, anchor; return metadata
coppice-pool-ctl release <sbid> # tear down tap, flush anchor, kill states
coppice-pool-ctl list # show live sandboxes
Checkout allocates the next free octet from
10.77.0.10–10.77.0.200, creates tap<id>
and adds it to cubenet0, writes a per-sandbox anchor at
cube/sandbox-<id> containing a deny_<id>
persistent table plus a minimal allow stanza for intra-subnet and
egress traffic, and returns KEY=VAL metadata (id, ip, mac, anchor,
tap, pool-entry) on stdout for the caller to consume.
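The allocation step reduces to a linear scan over the octet range; a sketch in plain shell, assuming a state directory with one file per live sandbox named by its octet (that layout is illustrative, not necessarily the controller's):

```shell
# Sketch of checkout's IP allocation: next free host octet in 10.77.0.10-200.
# STATE_DIR (one file per live sandbox, named by its octet) is an assumption
# made for this sketch; the real controller's state layout may differ.
STATE_DIR=${STATE_DIR:-$(mktemp -d /tmp/coppice.XXXXXX)}
mkdir -p "$STATE_DIR"

next_free_octet() {
    o=10
    while [ "$o" -le 200 ]; do
        if [ ! -e "$STATE_DIR/$o" ]; then
            echo "$o"
            return 0
        fi
        o=$((o + 1))
    done
    return 1                          # pool exhausted
}

octet=$(next_free_octet) || { echo "pool exhausted" >&2; exit 1; }
: > "$STATE_DIR/$octet"               # claim the slot
echo "IP=10.77.0.$octet"
```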
Release is symmetric: flush the anchor, drop pf
states matching the IP with pfctl -k, destroy the tap,
delete the state file. It reports states_before and
states_after so leak-tests are a straight grep.
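Spelled out as commands, a sketch of the release sequence (id and octet are illustrative; the src- and dst-side state kills are both needed, and the dst side is the one that is easy to forget):

```
before=$(pfctl -ss | grep -c "10.77.0.$octet")    # states_before
pfctl -a "cube/sandbox-$id" -F all                # flush the per-sandbox anchor
pfctl -k "10.77.0.$octet"                         # kill states from the IP
pfctl -k 0.0.0.0/0 -k "10.77.0.$octet"            # ...and states to it
ifconfig "tap$id" destroy
after=$(pfctl -ss | grep -c "10.77.0.$octet")     # states_after (should be 0)
```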
Root pf is preserved additively. Init reads the
current ruleset, injects anchor "cube/*" iff missing, and
re-emits everything else verbatim — including sibling agents’
cube_rdr, cube_scale_wrap, or
cube_policy anchors. A 30-second
daemon(8)-launched dead-man timer disables pf if SSH wedges during
the reload, matching the safety model of
setup-pf.sh.
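The iff-missing injection is a one-line idempotence check; a sketch operating on a ruleset file for illustration (init does the equivalent against the live ruleset before re-emitting it):

```shell
# Sketch: append the cube anchor to a ruleset iff it is not already there.
# Running this twice must leave exactly one anchor line (idempotent).
RULES=${RULES:-$(mktemp /tmp/pf.rules.XXXXXX)}

if ! grep -q '^anchor "cube/\*"' "$RULES"; then
    echo 'anchor "cube/*"' >> "$RULES"
fi
```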
The bhyve-durable-prewarm-pool-cubenet.sh
rig composes the controller with the existing durable-prewarm pool:
its refill path calls coppice-pool-ctl checkout for each
entry, then adds -s 3:0,virtio-net,tap-<id>,mac=02:cf:…
to the bhyve -r command line. On release, the lifecycle
controller is called symmetrically and the pool entry is freed back to
SIGSTOP’d standby. (Caveat: checkpoints captured without virtio-net
need a pool rebuild with the device slot reserved at suspend-time;
the rig documents the NET_ON_RESUME=0 fallback.)
The e2e test for this integration is
benchmarks/rigs/net/pool-cubenet-e2e.sh
— spin up N sandboxes, verify each has tap+anchor+IP, mutate one
deny table and confirm the neighbor is untouched, release all, assert
zero pf states reference any freed IP.
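The zero-states assertion really is a straight grep; a self-contained sketch over a captured dump (the two state lines are illustrative stand-ins for `pfctl -ss` output, not real output):

```shell
# Sketch of the e2e rig's leak assertion over a states dump.
DUMP=$(mktemp)
cat > "$DUMP" <<'EOF'
all tcp 10.77.0.11:22 <- 10.77.0.1:50512   ESTABLISHED:ESTABLISHED
all udp 10.77.0.12:53 -> 10.77.0.1:53      SINGLE:NO_TRAFFIC
EOF

FREED_IP="10.77.0.42"                            # IP just released
leaks=$(grep -c "$FREED_IP" "$DUMP" || true)     # grep -c prints 0 on no match
[ "$leaks" -eq 0 ] && echo "no leaked states for $FREED_IP"
```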
Gotchas we hit (and how the lab now handles them)
Standing the cubenet lab up four different ways (four parallel subagents, one day) surfaced three correctness traps that the measurements themselves don’t complain about. Documenting them here so the next person (and the next CI run) doesn’t have to rediscover:
net.link.bridge.pfil_member defaults to 0
On FreeBSD 15 pf does NOT see inter-member bridge traffic by default.
Every block rule on a bridge is silently a no-op until you flip
sysctl net.link.bridge.pfil_member=1. The rule loads
cleanly, but pfctl -sr -vv shows 0 evaluations, 0 packets — it
never runs because pf was never handed the packets. The
failure mode is “your deny list appears to work, in that the
rule loaded cleanly, but nothing actually drops.” We measured
for several hours like this before one of the subagents happened to
check the counters.
lab-setup-freebsd.sh
now sets both pfil_member=1 and pfil_bridge=1
on every up. If you fork the rig, keep that.
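For the record, the fix plus the counter check:

```
# Make pf see bridged traffic: both knobs, set on every lab `up`.
sysctl net.link.bridge.pfil_member=1
sysctl net.link.bridge.pfil_bridge=1

# Sanity: rules should now show non-zero Evaluations as traffic flows.
pfctl -sr -vv
```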
pfctl -sr returns success on a disabled pf
pfctl -f foo.conf loads the ruleset into the kernel. It
does NOT enable pf. If pf was disabled (for any reason — another
script, a reboot without pf_enable="YES" in rc.conf,
someone running pfctl -d), loading the ruleset just
stages it; packet processing never happens. pfctl -sr
happily lists the staged rules, so any check of the form “if
pfctl -sr succeeds, pf is up” lies to you. Use
pfctl -s info | grep 'Status: Enabled' instead — that’s
what the lab now does.
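The corrected guard, as a two-liner (ruleset path illustrative):

```
pfctl -f /etc/pf.conf                                   # stages rules; does NOT enable pf
pfctl -s info | grep -q 'Status: Enabled' || pfctl -e   # enable only if disabled
```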
Bridge pfil + TCP port filter = zero matches
The most surprising find of the day. A pass proto tcp from X to
Y port = 5201 rule governing bridge traffic
(pfil_member=1 OR pfil_bridge=1) evaluates
but never matches packets whose dst port IS 5201 — verified with
pfctl -sr -vv counters: Evaluations climbs, Packets
stays zero. Block rules on the same port match cleanly; only pass
rules fail. Tried inet/no-state/flags-any/port-in-set/etc — same
result. Appears to be a bridge-pfil edge case in FreeBSD 15.0; ipfw
in the same shape has no trouble.
Architectural response: the cubenet anchor enforces at the
IP-pair level (“can sandbox A talk to sandbox B at all”),
and per-sandbox port policy lives inside each sandbox’s own VNET
pf instance. That matches Cube’s model anyway — per-sandbox eBPF
programs per-TAP, not one global port ruleset on the bridge. But
it’s a real reduction in expressiveness at the cubenet layer
relative to what a naïve reading of pf.conf(5)
promises.
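Concretely, under this split a cubenet-layer anchor stays at the IP-pair level; a sketch of the shape (sandbox id, addresses, and interface name illustrative):

```
# cube/sandbox-7: reachability only; port policy lives in the jail's own pf.
table <deny_7> persist
block in quick on tap7 from <deny_7> to any
pass  in  on tap7 from 10.77.0.0/24 to 10.77.0.17 keep state
pass  out on tap7 from 10.77.0.17 to any keep state
```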
All three of these issues cost us measurement credibility for part
of one afternoon. The site’s earlier “cubenet enforces
deny-by-default” claim was true in form (the ruleset was
correctly expressed) and false in fact (pf never saw the traffic).
Fixed and re-verified; lab-setup-freebsd.sh
now runs an enforcement smoke test after every up.
Unknowns and risks
- Status of generic-ebpf / gbpf / ng_ebpf in 2026. If Yutaro Hayakawa’s generic-ebpf kmod has matured into a shippable module, the calculus changes entirely — we could port CubeNet’s C programs with minor dialect changes and skip most of the pf-anchor machinery. This should be verified before committing to the pf path.
- pf anchor-reload under churn. Resolved 2026-04-22. See Anchor scale above — load latency stays flat at 1.4–1.5 ms through N=2000 siblings once set limit anchors is raised from the default 512 to 4096.
- VPP on FreeBSD. VPP (the graph-node dataplane from fd.io) runs on FreeBSD via netmap or similar, with a mature control plane. If we’re willing to operate VPP, we get programmable graph nodes out of the box. Worth a prototype week before committing to the pf plan in production.
- netmap + bhyve TAPs. Moving TAPs into netmap mode changes how bhyve’s virtio-net backend sees them.
- Rtnetlink-compat shim. The CubeNet Go glue uses vishvananda/netlink extensively. FreeBSD’s rtnetlink emulation covers basic link/addr/route but has gaps. We may need to abandon the shim and write native FreeBSD wrappers.
- “netgraph is the wrong tool.” Real risk. netgraph’s programming model is graph-assembly, not program-attach; its classic BPF nodes can’t express map lookups. If localgw truly needs a fast path, netmap/VALE is probably a better answer than netgraph.
- VNET jail memory cost at density. At 1000+ sandboxes, per-jail memory cost may dominate the density budget; eBPF on Linux uses netns, which is cheaper per-sandbox.
Bottom line
- Paper parity: yes. pf + VNET + dummynet + optional netgraph can implement the CubeNet policy API. 5-7 FTE-weeks for a single engineer with pf fluency.
- Performance parity: probably at 10 Gbps line rate and at human-timescale policy change rates. Not at the “thousands of mutations/sec, no dataplane hiccup” rate that eBPF maps give you for free.
- Verify first: generic-ebpf status. If it’s shippable in 2026, the pf plan becomes a fallback and the port collapses to dialect-porting C eBPF source. If not, the pf plan stands and the test harness above is what we run to prove it.