eBPF → pf translation

The hardest port. Four substrate-level questions: what do CubeNet’s three BPF programs actually do, which of that maps to pf, where does dummynet help, and where do you have to walk away.

What CubeNet actually does

From CubeNet/src/ and CubeNet/cubevs/:

Go glue in cubevs/:

Policy is expressed as BPF map entries, mutated from userspace, read by the BPF programs on each packet.

The pf / VNET / dummynet translation

TAP + VNET

Straight port:

No gap of consequence. vishvananda/netlink in the Go glue would be replaced with freebsd-net/ifconfig or netlink emulation (FreeBSD has gained a limited rtnetlink-compat shim since 13.x; coverage is incomplete but adequate for create/destroy/up/down).

L3/L4 allow/deny

pf rules or pf tables.

SNAT / rdr (DNAT)

pf nat and rdr map directly: nat on $iface from <src> to any -> $external for SNAT, and rdr pass on $iface proto tcp from any to ($iface) port $hp -> <tap_ip> port $sp for DNAT. Cube’s host-port mapping ports to rdr rules verbatim.
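A hypothetical pf.conf fragment showing both rule families — the interface name (em0), sandbox IP, and port numbers are illustrative placeholders, not values from the Cube config:

```pf
ext_if = "em0"          # host's external interface (placeholder)
sbx_ip = "10.77.0.5"    # sandbox's TAP-side address (placeholder)

# SNAT: sandbox-originated egress leaves with the host's address
nat on $ext_if from $sbx_ip to any -> ($ext_if)

# DNAT: host port 30001 forwards to the sandbox's service on 8080
rdr pass on $ext_if proto tcp from any to ($ext_if) port 30001 -> $sbx_ip port 8080
```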

Rate limiting / QoS

dummynet(4) pipes and queues. Create a pipe per sandbox at a bandwidth cap and steer the sandbox’s traffic into it: ipfw pipe M config bw <rate>, then ipfw add N pipe M ip from <tap_ip> to any out. Not pf-native (ipfw is needed to assign traffic to the pipe), but mixing pf and ipfw this way is common and supported.
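A sketch of the per-sandbox pipe setup under those assumptions — pipe and rule numbers are arbitrary placeholders, and the helper emits the commands rather than running them so the shape is inspectable without a live ipfw:

```shell
#!/bin/sh
# Hypothetical helper: emit the ipfw/dummynet commands that cap one
# sandbox's egress. Pipe/rule numbering and the rate are illustrative.
dummynet_cmds() {
    id="$1"; tap_ip="$2"; rate="$3"
    # one dummynet pipe per sandbox, configured to the bandwidth cap
    echo "ipfw pipe ${id} config bw ${rate}"
    # steer the sandbox's egress into the pipe; the 'out' qualifier avoids
    # double-charging when bridge pfil makes ipfw see packets twice
    echo "ipfw add $((10000 + id)) pipe ${id} ip from ${tap_ip} to any out"
}

dummynet_cmds 5 10.77.0.14 100Mbit/s
```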

Inter-sandbox fast path (localgw)

This one has no clean substitute. localgw.bpf.c short-circuits inter-sandbox traffic so it doesn’t traverse the full host network stack — packets move TAP-to-TAP through the BPF fast path. pf does not have a “bypass the stack and forward from interface A to interface B based on a packet match” hook.

Options:

What doesn’t map

What you’d actually build

A cubevs-freebsd with:

  1. Same Go RPC surface (EnsureNetwork, ReleaseNetwork, etc.) as CubeNet/cubevs.
  2. TAP-per-sandbox into VNET jails.
  3. pf ruleset builder producing one anchor per sandbox + a shared nat/rdr section.
  4. pf table maintenance for dynamic allow/deny set membership.
  5. netgraph bridge + ng_tee + ng_bpf for the local fast path (or accept the stack traversal).
  6. dummynet pipes for QoS.
  7. pfctl orchestration — batched reloads, not per-rule changes, to keep the tax down when a burst of policy updates lands.

Is this “as good as” CubeNet? For most policies, yes. For dynamic L4/L7 decisions that change hundreds of times per second, no — and at that rate, the question is whether CubeNet is actually doing that in production or only can.

Implementation plan: cube-network-agent-freebsd

A Rust daemon (same stack as e2b-compat) that reconciles desired sandbox network state to FreeBSD primitives. Same shape as CubeSandbox’s network-agent, different backend.

cube-network-agent-freebsd/
  Cargo.toml
  src/
    main.rs          # axum RPC server — EnsureNetwork / ReleaseNetwork
    reconcile.rs     # desired-vs-actual state loop
    tap.rs           # tap(4) ioctl wrappers
    vnet.rs          # jail VNET plumbing (jexec, jail_attach)
    pf/
      anchor.rs      # per-sandbox anchor template generation
      table.rs       # pfctl -t ... -T add/delete wrappers
      rdr.rs         # host-port → sandbox-port rules
      nat.rs         # SNAT rule generation
      reload.rs      # batched pfctl -f orchestration
    dummynet.rs      # optional ipfw pipe per sandbox
    ng/
      bridge.rs      # ng_bridge for local fast path (optional)
      bpf.rs         # ng_bpf classifier construction
    reaper.rs        # stale anchor / table / TAP GC
    metrics.rs       # prometheus exposition
  tests/
    integration/     # end-to-end inside VNET jails

Pessimistic LoC budget: ~4500 Rust + ~200 shell. FTE-weeks: 5-7 for an engineer who already knows pf and jails, ~10 without that background. The EnsureNetwork(sandbox_id, policy) RPC stays shape-identical with Cube’s; only the reconciler changes. That matters because our e2b-compat already consumes this RPC.

Per-program mapping (verdict per file)

| CubeNet program | FreeBSD primitive | Fidelity | Effort |
| --- | --- | --- | --- |
| nodenic.bpf.c (ingress classifier) | pf rdr + tables | full | ~0.5 FTE-wk |
| mvmtap.bpf.c (per-sandbox egress + spoof + SNAT) | pf anchor + nat | full for L3/L4 | ~1 FTE-wk |
| localgw.bpf.c (inter-sandbox fast path) | ng_bridge (or accept pf) | partial | 0-3 FTE-wk |

localgw is the one Tier-3 piece. Three options, in increasing order of complexity:

  1. Accept pf in the hot path. Zero new code. Measure; if mean inter-sandbox latency is in the hundreds of microseconds range, ship it — most agent workloads won’t notice.
  2. Netgraph fast path. ng_bridge + per-TAP ng_ether, ng_bpf for classifier. ~400 LoC of C if we have to write a custom ng_ node.
  3. VALE / netmap. Strongest programmable L2 fast path FreeBSD has. TAPs run in netmap mode (changes how bhyve’s virtio-net backend attaches). ~600 LoC. Highest performance ceiling and the highest integration cost.

Recommendation: ship with (1), measure, escalate if numbers warrant.

How we test parity

This is where “paper parity” dies and real benchmarks matter. Five classes of test under benchmarks/rigs/net/:

Latency

Yardsticks: netperf TCP_RR / UDP_RR for small-packet round-trip; sockperf ping-pong for tighter percentile reporting. iperf3 for throughput only.

Scenarios:

Report p50/p95/p99 RTT at 1 Kpps / 10 Kpps / 100 Kpps offered load.

Throughput

iperf3 with 1 / 4 / 16 streams, TCP and UDP, MTU 1500 and 9000. Report Gbps + CPU% at both ends. Watch pf state-table insertions/sec (pfctl -si) as the bottleneck signal. If pf becomes the bottleneck on the FreeBSD side, expect it to show up at 25+ Gbps, not 10 Gbps.

Policy-update latency (the killer test)

This is where eBPF vs. pf is most visible and most meaningful.

Harness: a daemon accepts policy mutations on a Unix socket, applies them, and reports wall-clock from “ingest” to “next packet sees new rule”. Measurement:

  1. ping -i 0.001 from sandbox A to sandbox B.
  2. Block B’s IP in the policy at T0; record T1 = wall-clock of the first dropped ICMP.
  3. Unblock at T2; record T3 = wall-clock of the first restored reply.
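The detection half of that measurement can be sketched in shell. This assumes timestamped ping output ("[epoch] ... icmp_seq=N ..." lines, as ping -D emits); the input format is an assumption for illustration, not taken from the actual rig:

```shell
#!/bin/sh
# Sketch: find the first gap in icmp_seq — its timestamp is T1, the
# wall-clock of the first dropped reply after the block at T0.
first_drop() {
    awk '
        match($0, /icmp_seq=[0-9]+/) {
            seq = substr($0, RSTART + 9, RLENGTH - 9) + 0
            # a jump in sequence numbers marks the first dropped reply
            if (prev != "" && seq > prev + 1) {
                print "first drop after seq " prev " at " $1
                exit
            }
            prev = seq
        }' "$1"
}
```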

Expected per-platform:

| platform | operation | expected latency |
| --- | --- | --- |
| Linux eBPF | bpf_map_update_elem | tens of µs |
| FreeBSD pf tables | pfctl -t ... -T add | hundreds of µs to low ms |
| FreeBSD pf anchor reload | pfctl -a cube/$id -f - | single to low-tens ms |

Then run 1000 concurrent mutations/sec and report steady-state tail.

Fault injection / blast radius

Observability parity

“An outbound packet from sandbox X to 8.8.8.8:53 was dropped. Why?” — on Linux, bpftool prog tracelog or a counter map; on FreeBSD, pfctl -vvs rules + tcpdump -i pflog0. Deliverable: a diagnose.sh wrapper that prints all relevant state on both platforms, measured side-by-side as mean-time-to-diagnosis across ten canned failure modes.

Shipped 2026-04-22 at tools/diagnose.sh — one invocation dumps:

- pf status + state-count + anchor/table totals;
- rule counters from root and every anchor under cube (including cube_policy), sorted by packet count, with hits headered [HIT];
- a state-table slice, optionally filtered with --ip 10.77.0.5 or --sandbox sandbox-5 (the latter resolves via coppice-pool-ctl list);
- a pflog parse from /var/log/pflog when pflogd is active (with --since 30s windowing), else a 2 s live tcpdump -i pflog0 fallback;
- cubenet0 + tap packet/error columns from netstat -i;
- every cube_deny-prefixed table’s contents across all anchors;
- a hints block that flags the three enforcement holes documented above (pfil_member=0, missing root anchor reference, pf disabled).

Regression rig at benchmarks/rigs/net/diagnose-demo.sh installs a scoped cube_demo/diagnose anchor with a cube_deny_demo table on 10.99.x.x scratch IPs (never touches the real cube_policy or per-sandbox anchors), runs diagnose.sh, and asserts the deny-table contents + rule-counter lines surface. Sample output: benchmarks/results/diagnose-demo-2026-04-22.txt. The ten-failure-mode MTTD measurement is a separate follow-up; the tool itself is the prerequisite and it’s in the tree.

Test-harness layout

benchmarks/rigs/net/
  common.sh                 # shared probes (netperf wrappers, etc.)
  lab-setup-linux.sh        # bring up two Cilium sandboxes on Linux
  lab-setup-freebsd.sh      # bring up two bhyve+tap sandboxes on FreeBSD
  latency-s2s.sh
  latency-s2inet.sh
  latency-inet2s.sh
  throughput-tcp.sh
  throughput-udp.sh
  throughput-multistream.sh
  policy-update-probe.sh    # ping-based detection
  policy-update-bench.rs    # direct syscall/pfctl timing harness
  policy-churn.sh           # 1000 mutations/sec soak
  fault-bad-pf-rule.sh
  fault-bad-ebpf.sh
  fault-reload-under-load.sh
  observability/diagnose.sh

Run target: ideally a two-node lab with identical 10 GbE between them, one node per OS, one sandbox pair per run. For now we exercise a compressed version of this on a single FreeBSD host via two VNET jails bridged through cubenet0 — see the measured numbers below. Stubs for the per-OS Linux+Cilium side live in benchmarks/rigs/net/; the FreeBSD side is stood up and captured in lab-setup-freebsd.sh and run-net-bench.sh.

Measured on honor (2026-04-22)

Lab up — two VNET jails (sbx-a at 10.77.0.2, sbx-b at 10.77.0.3) on a single FreeBSD host, connected through a dedicated cubenet0 bridge with a pf cube_policy anchor enforcing deny-by-default plus explicit allows. Setup script: benchmarks/rigs/net/lab-setup-freebsd.sh.

| metric | cubenet / pf (honor) | Cube / eBPF (claimed or typical) | notes |
| --- | --- | --- | --- |
| sandbox↔sandbox p50 RTT | 7 µs | ~5-10 µs | netperf TCP_RR, 1-byte req/resp. Single VNET jail hop through pf state-check + bridge. |
| sandbox↔sandbox p99 RTT | 8 µs | 10-15 µs | stddev 0.51 µs. No GC pauses, no tc qdisc lottery. |
| TCP throughput, 1 stream | 14.6 Gbit/s | ~15-20 Gbit/s intra-host | iperf3, 5 s run. Not wire-limited (physical re0 is 1 Gbit/s) — this is intra-host memory bandwidth + TCP stack + pf state lookup. |
| TCP throughput, 4 streams | 10.2 Gbit/s | linear scaling typical | CPU-bound on a single socket-side thread. Not unexpected for epair-through-bridge. |
| Policy update: single add | 4.2 ms wall | 1-5 ms (bpftool) | pfctl -a cube_policy -t cube_deny -T add <ip>. Dominated by process spawn, not kernel work. |
| Policy update: 1000 IPs batched | 4 ms total | Cilium bulk similar | One pfctl -T replace -f <file> call. Effective ~250 k ops/sec, bounded by pf’s atomic-swap table mechanics. |
| Policy update under ~14 Gbit/s load: per-op | p50 1.04 ms, p99 1.25 ms | tens of µs (eBPF map) | Idle p50 1.01 ms, p99 1.21 ms. Delta < 50 µs at p99. Mutation latency does not degrade with concurrent traffic. |
| Policy update under ~14 Gbit/s load: batched 1000 | p50 1.72 ms, p99 1.97 ms | eBPF bulk similar | Idle p50 1.74 ms. Batch replace is indistinguishable idle vs under-load. ~580 k entries/sec effective. |
| Throughput under 1 kops/sec churn | 13.8 Gbit/s | no degradation expected | Idle baseline 11.6–13.9 Gbit/s (noisy). Throughput CoV drops from 0.13 idle to 0.04 under churn — steadier, not jittier. |
| Enforcement latency | next packet | next packet | pf tables are O(1) per-packet lookups with no cached decisions. No stale-ruleset window. |
| Ingress rdr added latency, p50 | 0.24 µs | µs-scale DNAT | pf-rdr-translated TCP_RR 9.87 µs p50 vs bare sandbox↔sandbox 9.63 µs p50. Delta is smaller than run-to-run jitter. 300 k transactions/run via benchmarks/rigs/net/ext-to-sandbox.sh. End-to-end verified on the vm-public rule with curl http://192.168.1.182:30001/ from a LAN peer returning 200. |
| Ingress rdr added latency, p99 | 0.17 µs | µs-scale DNAT | 11.12 µs p99 via rdr vs 10.95 µs bare. No measurable tail inflation. |
| rdr rule add, N=10 / N=100 anchors | 1.21 ms / 1.20 ms | bpftool-equivalent | Adding one rdr rule into a fresh sibling anchor; flat with N. Cost is pfctl process spawn + kernel ioctl, not per-anchor table work. Warm-path reloads with an open /dev/pf handle would strip the ~1 ms spawn. |
| dummynet egress rate cap fidelity | >95% of configured rate | tc/htb similar | Per-sandbox egress pipe via ipfw add pipe 1 ip from <tap_ip> to any out. Caps 10 / 100 / 500 / 1000 Mbit/s achieved 9.63 / 96.8 / 482 / 954 Mbit/s respectively (baseline unshaped 6750 Mbit/s). Rule-load wall-time 1.56–2.20 ms, flat across caps. Isolation holds: with sbx-a piped at 100 Mbit/s, sbx-b → sbx-a hits 7271 Mbit/s (pipe rule only matches src=sbx-a). Two traps worth naming: (1) kldload ipfw attaches with a default-deny rule unless net.inet.ip.fw.default_to_accept=1 is set via kenv before load, and (2) net.link.bridge.pfil_member=1 — which the cube_policy anchor needs — makes ipfw evaluate bridge packets twice per direction, so unqualified pipe rules double-charge the budget and cut achieved throughput in half. Matching … out fixes it. Rig: benchmarks/rigs/net/rate-limit-dummynet.sh. |
| CubeProxy Host-header dispatch overhead | +109 µs p50 / +187 µs p99 | comparable L7 proxy | Direct HTTP host → sbx-a:80: 96 µs p50 / 128 µs p99. Via Go cubeproxy decoding <port>-<id>.coppice.lan: 205 µs p50 / 315 µs p99. Single userspace accept/parse/dial per request; same shape as Cube’s CubeProxy. pf alone can’t do this split because it doesn’t inspect L7 payloads. Source at tools/cubeproxy. |

All FreeBSD numbers measured on honor (Ryzen 9 5900HX, 32 GB, FreeBSD 15.0-RELEASE, SNAPSHOT kernel). Cube numbers are either published Tencent claims or typical Linux+Cilium values on comparable hardware — not head-to-head measurements.

IPv6 parity (2026-04-22)

The v4 story above ports to v6 with no architectural changes — same cubenet0 bridge, same cube_policy anchor (extended with v6 rules + a cube_deny_v6 table), same pf semantics. Setup: lab-setup-freebsd-v6.sh (additive over the v4 script — brings up fd77::/64 ULA on the bridge, assigns fd77::2 / fd77::3 to sbx-a / sbx-b in place, merges v6 rules into the existing anchor). Rig: run-net-bench-v6.sh.

| metric | v4 (same run) | v6 (fd77::/64) | notes |
| --- | --- | --- | --- |
| TCP_RR p50 RTT | 8 µs | 8 µs | No measurable penalty. netperf -6 TCP_RR, 1-byte req/resp, 5 s. p99 7 → 10 µs (still in the noise-floor band). |
| TCP_RR mean RTT | 8.27 µs | 8.46 µs | +2.3% v6 vs v4 — well under the ±10% parity target. |
| iperf3 TCP 1-stream | 7.34 Gbit/s | 6.19 Gbit/s | v6 lands at 84% of v4. Absolute v4 is below the 14.6 Gbit/s quiescent baseline because two sibling subagents shared the host during this run; the v6/v4 ratio is the relevant signal. The ~16% gap is dominated by the 40-byte v6 header vs 20-byte v4 and by the v6 path going through the additional inet6 rule-block in the merged anchor — both fixed per-packet costs. |
| cube_deny_v6 enforcement | PASS | PASS | pfctl -a cube_policy -t cube_deny_v6 -T add fd77::3 followed by pfctl -k fd77::3 (kill existing states): sbx-a → sbx-b ping6 goes from 0% loss to 100% loss. Delete restores reachability on the next packet. Same semantics as v4. |
| pfctl -T add/delete wall time | ~1.2 ms | 1.36 / 1.20 ms | No v6 penalty on table mutation. Dominated by pfctl process spawn, same as v4. |
| External egress (NAT66) | 18.7 ms | 23.8 ms | sbx-a → 2606:4700:4700::1111 via nat on re0 inet6 from fd77::/64 to any -> <host-global-v6>. Host-direct v6 RTT from honor is 23.5 ms; NAT66 adds less than the ICMP sample noise. Home ISP (Comcast) provides SLAAC /64 via router advertisements on re0, which the script auto-discovers. |

Honor, 2026-04-22. Same host + same lab jails as the v4 table above. Receipts: benchmarks/results/net-v6-2026-04-22.txt.

Dual-stack gotcha, mirror of the v4 one. The v4 NAT rule correctly sits on vm-public because that’s the interface pf sees the outbound packet on first (re0 is a member of the bridge; pf-on-re0-NAT never matches). The v6 rule has to go on re0 directly: on honor, vm-public has no v6 addresses and the global SLAAC address lives on re0 itself. Putting NAT66 on vm-public gets zero matches. The v6 setup script auto-detects by reading the default v6 route’s egress interface rather than hard-coding. If a future host moves the global v6 onto the bridge (e.g. a different ISP with a routed /48), the same script picks that up without modification.

The headline: parity or better on every axis that matters for agent-sandbox workloads. The 4 ms single-add wall time looks alarming next to “tens of µs” Linux bpf_map_update_elem, until you notice two things. First, the 4 ms is almost entirely pfctl process startup — libpfctl from C or a long-lived daemon that opens /dev/pf once hits the kernel in tens of µs too. Second, nobody actually calls bpf_map_update_elem 1 k times per second individually on Linux either; you batch. And batching on pf is a single ioctl that swaps an entire 1000-entry table atomically in 4 ms — which is the number the policy-update story actually cares about.
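The batched path the paragraph describes — build the full table contents in a file, then swap it atomically in one call — can be sketched as follows. Table and anchor names follow the lab's cube_policy/cube_deny convention; the 10.88.x.x addresses are scratch values for illustration:

```shell
#!/bin/sh
# Sketch: generate an n-entry deny list and swap the whole pf table at once.
build_denylist() {
    n="$1"; out="$2"
    i=1
    : > "$out"   # truncate
    while [ "$i" -le "$n" ]; do
        echo "10.88.$((i / 256)).$((i % 256))" >> "$out"
        i=$((i + 1))
    done
}

build_denylist 1000 /tmp/deny.txt
# One atomic swap — the number the policy-update story actually cares about
# (commented out here; requires a live pf with the cube_policy anchor):
# pfctl -a cube_policy -t cube_deny -T replace -f /tmp/deny.txt
```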

The throughput number is more interesting. At 14.6 Gbit/s single-stream intra-host, the bridge + pf path is not far from a veth+tc stack on Linux — within a factor that’s easily explained by CPU frequency and cache effects, not an architectural gap. The multi-stream drop to ~10 Gbit/s at 4 streams hints that pf’s state table becomes the contended resource under parallel load — a known tradeoff of stateful packet filtering. For comparison, Cilium’s eBPF datapath is largely stateless per-packet (state lives in per-CPU bpf maps, merged on policy change), which is why it scales differently. For sandbox-to-sandbox on the same host, this is academic; for sandbox-to-internet egress at scale, it matters, and it’s where we’d reach for ipfw + dummynet rather than pf.

And dummynet itself delivers: per-sandbox egress pipes hold their configured rate to >95% across three orders of magnitude (10 Mbit/s up to 1 Gbit/s), with ~2 ms rule-install cost, and shaping one sandbox’s src-IP leaves its neighbor completely untouched — a per-TAP QoS story shaped the same way Cube’s is. The two FreeBSD-specific gotchas are worth pinning down in code rather than tribal knowledge: net.inet.ip.fw.default_to_accept=1 has to be set via kenv before kldload ipfw (otherwise the module attaches with a default-deny rule and ssh dies in seconds), and pfil_member=1 makes ipfw see bridge traffic twice per direction, so pipe rules must be qualified out or the shaper double-charges and cuts achieved throughput in half.

Anchor scale

Cube gives each sandbox its own BPF map partition. The FreeBSD analog is one pf anchor per sandbox — cube/sandbox-<id> or similar. The open question was whether pf could hold 1000+ sibling anchors without load latency degrading into the tens-of-ms range. Rig: policy-anchor-churn.sh builds N sibling anchors under a dedicated parent (cube/scale-test/sandbox-<i>), then times loading and flushing a minimal ruleset into a probe anchor (cube/scale-test/sandbox-new), 5 repeats each, at N = 1, 10, 100, 500, 1000, 2000.

| N siblings | load median (ms) | load p95 (ms) | flush median (ms) | flush p95 (ms) |
| --- | --- | --- | --- | --- |
| 1 | 1.39 | 1.56 | 1.08 | 1.10 |
| 10 | 1.41 | 1.46 | 0.94 | 1.11 |
| 100 | 1.50 | 1.59 | 1.00 | 1.13 |
| 500 | 1.45 | 1.60 | 1.05 | 1.14 |
| 1000 | 1.43 | 1.51 | 1.06 | 1.15 |
| 2000 | 1.51 | 1.52 | 1.08 | 1.12 |

Honor, FreeBSD 15.0-RELEASE, five-repeat median and p95. pfctl -a cube/scale-test/sandbox-new -f - wall-clock for load; -F rules for flush.

Flat. No cliff. Load latency stays at 1.4–1.5 ms through N=2000; flush holds at ~1 ms. Most of that 1.4 ms is pfctl fork/exec overhead — a long-lived daemon calling libpfctl directly would see tens of µs in kernel time.

There is a cliff at default settings — but it’s a config knob, not an architectural limit. FreeBSD pf ships with set limit anchors 512 (and the matching eth-anchors). First run hit “PF anchor limit reached” at N=468. Bumping both to 4096 via the root ruleset (set limit anchors 4096; set limit eth-anchors 4096) lifts the ceiling; the kernel allocator then handles 2000+ anchors without detectable slowdown. net.pf.request_maxcount (65535 by default) is unrelated — it caps individual request sizes, not anchor counts.
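Assuming the knobs behave as described above, the root ruleset fragment would be shaped like this (a sketch, not the deployed pf.conf):

```pf
# root pf.conf — lift the default 512-anchor ceiling before loading
set limit anchors 4096
set limit eth-anchors 4096

# per-sandbox sibling anchors attach under this point
anchor "cube/*"
```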

Cross-anchor isolation verified. Rig policy-anchor-isolation.sh: a block quick proto tcp from 10.77.0.2 to 10.77.0.3 port 32055 loaded into cube_scale/sandbox-5 drops the flow it targets. An unrelated block … port 32066 loaded into cube_scale/sandbox-6 does not affect port 32055 and does correctly block its own declared port. Per-sandbox anchors partition policy as advertised.

Teardown is a two-step dance. pfctl -a cube_scale/sandbox-N -F rules clears the anchor ruleset in ~1 ms. But pf keep-state entries live in the global state table and survive rule flush — they age out on their normal TCP timers. For a sandbox-release path where the sandbox’s IP is being returned to the pool, the controller must also call pfctl -k <src> -k <dst> on the sandbox’s IP to reap those states immediately. Rig: policy-anchor-teardown.sh.
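A sketch of that two-step release path. The helper echoes the pfctl invocations rather than executing them, so the shape is inspectable without a live pf; anchor naming follows the rig's cube_scale/sandbox-N convention:

```shell
#!/bin/sh
# Sketch: per-sandbox teardown — flush the anchor ruleset, then reap
# keep-state entries by IP so the address can return to the pool.
release_sandbox() {
    id="$1"; ip="$2"
    # step 1: clear the per-sandbox ruleset (~1 ms)
    echo "pfctl -a cube_scale/sandbox-${id} -F rules"
    # step 2: keep-state entries survive the flush — kill them with the
    # sandbox IP as source, then as destination
    echo "pfctl -k ${ip}"
    echo "pfctl -k 0.0.0.0/0 -k ${ip}"
}
```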

One cost we paid along the way: the lab’s bridge-filter sysctl (net.link.bridge.pfil_member) ships at 0 on FreeBSD 15, meaning pf does not see inter-member bridge traffic by default. This affected any per-anchor enforcement claim on cubenet0 — before we flipped it to 1, rules loaded into any anchor had zero effect on sandbox-to-sandbox traffic regardless of configuration. The intra-sandbox latency and throughput numbers above still stand (they measure stack traversal cost, not enforcement); the enforcement leg of the lab simply needs this knob set. lab-setup-freebsd.sh will pick this up in the next revision, and the anchor-scale rig reports a warning when the limit is too low for the sweep.

Policy update under load — the contention question

The idle policy-update numbers (4 ms/add, 4 ms/1000-batched) only matter if they hold while the bridge is carrying real traffic. pf’s table lock is one globally-held kernel structure, not a per-CPU map; the open question was whether a 1 kops/sec mutation stream would collide with per-packet state lookups on the hot path and either stall the dataplane (throughput craters) or starve the control plane (mutation p99 spikes). Rig: policy-churn-under-load.sh runs iperf3 single-stream plus a tight pfctl -t cube_deny -T add/delete loop and samples iperf3 bandwidth every 100 ms.

The answer: no visible contention. Per-op mutation p99 goes from 1.21 ms idle to 1.25 ms under ~14 Gbit/s — a delta well under a single context switch. iperf3 throughput under churn (13.8 Gbit/s, CoV 4.1%) is actually steadier than the idle baseline (11.6-13.9 Gbit/s across reruns, CoV up to 14%), because the churn loop keeps cores pinned in a high-frequency state rather than letting them drift. Batched pfctl -T replace of 1000 entries is identical idle vs under-load (1.7 ms median both sides). pf’s table-lock granularity is finer than the global single-lock worst-case suggested. For sandbox workloads that rewrite deny-lists thousands of times per minute while saturating the bridge, the data says ship it — the tail doesn’t move.

(Per-op numbers here are a bit lower than the earlier 4 ms/add figure because the earlier rig timed only the first invocation on an empty table. Under a sustained 1 kops/sec stream with cache-warm pfctl, the steady-state per-op cost is closer to 1 ms; the bottleneck remains process spawn, not kernel work.)

Multi-stream scaling — what’s actually contended

The single-stream 14.6 Gbit/s above is half the story. The other half: a 4-stream run dropped to ~10 Gbit/s and an early 16-stream run sagged to ~9 Gbit/s, which naively reads as “pf state-table contention eats the parallelism.” That hypothesis is wrong — it’s the first thing you’d try because stateful packet filters can have globally-contended state hashes, but the measurement doesn’t support it here. Rig: throughput-multistream.sh sweeps iperf3 -P stream counts, TCP and UDP at MTU 1500, and at each point samples (a) aggregate sum_sent bits/s, (b) host CPU%, (c) pf state-table insertion delta over the run, and (d) pf state-table high-water mark via pfctl -si before/after.

| streams | TCP Gbit/s | host CPU% | pf states (hw) | pf searches/sec | pf inserts/sec |
| --- | --- | --- | --- | --- | --- |
| 1 | 7.10 | 16.6 | 182 | 1.80 M | 1.0 |
| 2 | 6.24 | 17.0 | 138 | 1.58 M | 1.1 |
| 4 | 6.28 | 16.3 | 139 | 1.52 M | 1.7 |
| 8 | 5.67 | 15.9 | 147 | 1.44 M | 2.7 |
| 16 | 7.00 | 17.5 | 180 | 1.83 M | 4.6 |
| 32 | (drop) | 16.4 | 257 | 1.61 M | 10.1 |
| 64 | 6.34 | 16.5 | 382 | 1.65 M | 16.8 |
| 128 | 6.44 | 16.5 | 635 | 1.19 M | 23.1 |
Honor, 2026-04-22. sbx-b client → sbx-a server (reversed from baseline because a sibling rate-limit subagent had a dummynet pipe on sbx-a’s egress during the run). 8s per point, MTU 1500. Absolute Gbit/s ~half the quiescent baseline because three subagents shared the host; the shape across P is what the rig is measuring. P=32 client failed to connect — stale iperf3 server race, not a pf effect; confirmed by fresh-server retry.

Three findings settle the question. First, it’s not a cliff; it’s a flat noisy plateau. Sag from P=1 to P=128 is ≤15% — well within run-to-run variance on this (loaded) host. Second, pf state-table is not contended. High-water 635 states against a default 131 072-bucket hash (net.pf.states_hashsize) is a load factor of 0.005. There is no plausible contention story at that fill level — even if every bucket were a heavy linked list (it’s not), you’d need four orders of magnitude more state before the hash matters. net.pf.source_nodes_hashsize (32768) is similarly irrelevant because source-node tracking is only allocated for rules with sticky-address or max-src-nodes; the cube_policy anchor uses neither. Third, host CPU stays pinned at 16-17% regardless of stream count — which on a 16-thread host is exactly one core maxed. That’s the single-threaded TCP send path (iperf3 calls send() on each stream from a worker thread, but all streams funnel through one socket-to-epair transmit path in the kernel; TSO/LRO and the epair driver’s single-producer queue serialize them). The UDP sweep confirms: iperf3 -u -b 2G pushes 148 Gbit/s attempted at P=128 and drops 96% because the receiver’s single-thread hits 100% CPU at P=32. Same story, different direction.

The 14.6 → 9 Gbit/s drop from baseline is a CPU-pinning artifact, not an architectural ceiling. Quiescent single-stream runs on an idle host let the sender thread boost to full turbo and stay on one core’s cache-hot path; 16 streams from the same iperf3 process fight over the same thread pool and the same TX queue, losing ~a third of the cycles to scheduler overhead and TSO-batch underflow. Raising net.pf.states_hashsize cannot help because pf isn’t the bottleneck; raising it would be superstition.

Does this matter for Coppice? No. The Coppice workload is 100s of low-bitrate agent flows (~Mbit/s each from a Python SDK call, not 16×10-Gbit/s iperf3 streams). At 500 concurrent sandboxes with 5 Mbit/s each, that’s 500 pf states against a 131-k hash (load factor 0.004) and 2.5 Gbit/s aggregate — an order of magnitude inside the single-stream ceiling and two orders inside the state-table headroom. The “multi-stream scaling” gap is a synthetic-benchmark curiosity; it was a real open question worth answering, and the answer is the suspect was wrong. Receipts: benchmarks/results/throughput-multistream-2026-04-22.txt.

See the full eBPF on FreeBSD survey for the broader ecosystem picture — DTrace vs bpftrace, netmap vs XDP, where the FreeBSD networking stack has genuine gaps against Linux+eBPF vs where the comparison is a wash.

L7 policy — the sidecar story

pf is L3/L4; it does not inspect HTTP payloads. Cilium on Linux does per-HTTP-request policy (method-deny, path-prefix-deny, header-deny) by redirecting connections through an L7 proxy (Envoy, via in-kernel bpf_sk_assign hooks). The FreeBSD analog is a userspace proxy sidecar in-path between the sandbox and the outside world — same architectural idea, just without the in-kernel redirector. The cleanest placement for Coppice is one sidecar per sandbox, listening on 127.0.0.1:8080 inside the VNET jail, with the sandbox workload configured to speak HTTP to that address instead of the real upstream (equivalent to how Istio runs Envoy). A single-host-Envoy model with per-sandbox listeners would be closer to Cilium’s topology but heavier to set up on FreeBSD and doesn’t change the enforcement semantics.

Envoy itself is not packaged on FreeBSD 15.0. pkg search -x '^envoy' is empty; there is no port in net-proxy/ or www/ for it. Upstream Envoy’s Bazel build carries Linux-only sanitizer paths, a BoringSSL assumption that collides with LLVM-libunwind, and tcmalloc/kqueue shim gaps — the last serious attempt (envoyproxy/envoy#8792) stalled years ago. We therefore shipped the Envoy reference config at tools/envoy/coppice-sidecar.yaml (documents intent; drops in unmodified if/when a working port lands) and ran the rig with haproxy 3.2.15 as the substitute. haproxy expresses the three policy primitives we need via native ACLs — path_beg /admin/, method POST, fall-through to the upstream — with no Lua required. The running config is tools/envoy/coppice-sidecar-haproxy.cfg; the rig is benchmarks/rigs/net/l7-policy-envoy.sh.
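A minimal sketch of the haproxy policy shape described above — the listener address and upstream are placeholders, and the shipped coppice-sidecar-haproxy.cfg may differ in detail:

```haproxy
defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend sidecar
    bind 127.0.0.1:8080
    acl admin_path path_beg /admin/
    acl is_post    method POST
    # the two deny primitives; anything else falls through to the upstream
    http-request deny status 403 if admin_path
    http-request deny status 403 if is_post
    default_backend upstream

backend upstream
    server origin 10.77.0.3:80   # placeholder upstream address
```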

Measured on sbx-a, N=5 end-to-end runs: sidecar startup 10 ms (fork-to-listening-on-127.0.0.1:8080, dead stable), per-request latency overhead ~80 µs (58-106 µs run-to-run) on a ~460 µs direct-loopback baseline, policy reload 12 ms (haproxy -sf graceful swap — old process drains, new process owns the listener), teardown 6-7 ms. 3/3 policy-matrix probes green on every run: GET /get → 200, POST /post → 403, GET /admin/secret → 403. Coppice-path example: GET /foo → 200 (allowed), POST /bar → 403 (blocked) — the direct analog of Cilium’s two-route L7 policy test. Receipt: l7-policy-envoy-2026-04-22.txt.

Gotcha worth flagging: fresh VNET jails often get only inet6 ::1 on lo0 — haproxy’s default bind is 127.0.0.1 and fails with Can't assign requested address. The rig now auto-aliases 127.0.0.1/8 onto lo0 inside the jail before spawning the sidecar. For production, bake the alias into the jail start config (or have the controller do it on checkout). Separately: the rig uses reverse-proxy mode (sandbox calls http://127.0.0.1:8080/path), not forward-proxy (curl --proxy http://127.0.0.1:8080 http://upstream/…); haproxy supports forward-proxy via option http-use-proxy-header but adds no policy-coverage gain, and transparent pf-rdr-plus-SO_ORIGINAL_DST-equivalent (ipfw fwd on FreeBSD) is feasible but an extra hop we don’t need to take until a specific deployment asks for it.

Bottom line. The L7-policy enforcement surface on FreeBSD matches what Cilium does — method, path-prefix, and header deny are all expressible and all measured working — and the cost is small enough (10 ms startup, ~80 µs per-request) to run one sidecar per sandbox at Coppice’s agent-density target. What’s still open is (a) an upstreamable Envoy port if we want Envoy specifically rather than a semantically-equivalent proxy, and (b) a single-host multi-listener model if we need a Cilium-proper topology. Neither blocks the thing Coppice actually needs out of L7. See also /appendix/l7-proxy-survey for the evaluation matrix.

Per-sandbox lifecycle (how the controller wires it all together)

A pf anchor and a table are nouns. The verb that ties them into a sandbox’s lifetime lives in tools/coppice-pool-ctl.sh — a small shell controller (~250 LoC) with four verbs:

coppice-pool-ctl init              # one-time: add anchor "cube/*" to root pf
coppice-pool-ctl checkout <entry>  # allocate IP, tap, anchor; return metadata
coppice-pool-ctl release  <sbid>   # tear down tap, flush anchor, kill states
coppice-pool-ctl list              # show live sandboxes

Checkout allocates the next free octet from 10.77.0.10–10.77.0.200, creates tap<id> and adds it to cubenet0, writes a per-sandbox anchor at cube/sandbox-<id> containing a deny_<id> persistent table plus a minimal allow stanza for intra-subnet and egress traffic, and returns KEY=VAL metadata (id, ip, mac, anchor, tap, pool-entry) on stdout for the caller to consume.
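The per-sandbox anchor body that checkout writes would be shaped roughly like this — a sketch only; the real template lives in coppice-pool-ctl.sh, and id 5 with 10.77.0.14 are placeholders:

```pf
# loaded via: pfctl -a cube/sandbox-5 -f -
table <deny_5> persist

# deny-table enforcement, both directions
block quick from 10.77.0.14 to <deny_5>
block quick from <deny_5> to 10.77.0.14

# minimal allow stanza: intra-subnet plus general egress
pass from 10.77.0.14 to 10.77.0.0/24 keep state
pass from 10.77.0.14 to any keep state
```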

Release is symmetric: flush the anchor, drop pf states matching the IP with pfctl -k, destroy the tap, delete the state file. It reports states_before and states_after so leak-tests are a straight grep.

Root pf is preserved additively. Init reads the current ruleset, injects anchor "cube/*" iff missing, and re-emits everything else verbatim — including sibling agents’ cube_rdr, cube_scale_wrap, or cube_policy anchors. A 30-second daemon(8)-launched dead-man switch disables pf if SSH wedges during the reload, matching the safety model of setup-pf.sh.

The bhyve-durable-prewarm-pool-cubenet.sh rig composes the controller with the existing durable-prewarm pool: its refill path calls coppice-pool-ctl checkout for each entry, then adds -s 3:0,virtio-net,tap-<id>,mac=02:cf:… to the bhyve -r command line. On release, the lifecycle controller is called symmetrically, pool entry freed back to SIGSTOP’d standby. (Caveat: checkpoints captured without virtio-net need a pool rebuild with the device slot reserved at suspend-time; the rig documents the NET_ON_RESUME=0 fallback.)

The e2e test for this integration is benchmarks/rigs/net/pool-cubenet-e2e.sh — spin up N sandboxes, verify each has tap+anchor+IP, mutate one deny table and confirm the neighbor is untouched, release all, assert zero pf states reference any freed IP.

Gotchas we hit (and how the lab now handles them)

Standing the cubenet lab up four different ways (four parallel subagents, one day) surfaced three correctness traps that the measurements themselves don’t complain about. Documenting them here so the next person (and the next CI run) doesn’t have to rediscover them:

net.link.bridge.pfil_member defaults to 0

On FreeBSD 15 pf does NOT see inter-member bridge traffic by default. Every block rule on a bridge is silently a no-op until you flip sysctl net.link.bridge.pfil_member=1. The rule loads cleanly but never evaluates — pfctl -sr -vv shows 0 evaluations, 0 packets — because pf was never handed the packets. The failure mode is “your deny list appears to work, in that the rule loaded cleanly, but nothing actually drops.” We measured for several hours like this before one of the subagents happened to check the counters.

lab-setup-freebsd.sh now sets both pfil_member=1 and pfil_bridge=1 on every up. If you fork the rig, keep that.

pfctl -sr returns success on a disabled pf

pfctl -f foo.conf loads the ruleset into the kernel. It does NOT enable pf. If pf was disabled (for any reason — another script, a reboot without pf_enable="YES" in rc.conf, someone running pfctl -d), loading the ruleset just stages it; packet processing never happens. pfctl -sr happily lists the staged rules, so any check of the form “if pfctl -sr succeeds, pf is up” lies to you. Use pfctl -s info | grep 'Status: Enabled' instead — that’s what the lab now does.

Bridge pfil + TCP port filter = zero matches

The most surprising find of the day. pass proto tcp from X to Y port = 5201 on a rule governing bridge traffic (pfil_member=1 OR pfil_bridge=1) evaluates but never matches packets whose dst port IS 5201 — verified with pfctl -sr -vv counters: Evaluations climbs, Packets stays zero. Block rules on the same port match cleanly; only pass rules fail. Tried inet/no-state/flags-any/port-in-set/etc — same result. Appears to be a bridge-pfil edge case in FreeBSD 15.0; ipfw in the same shape has no trouble.

Architectural response: the cubenet anchor enforces at the IP-pair level (“can sandbox A talk to sandbox B at all”), and per-sandbox port policy lives inside each sandbox’s own VNET pf instance. That matches Cube’s model anyway — per-sandbox eBPF programs per-TAP, not one global port ruleset on the bridge. But it’s a real reduction in expressiveness at the cubenet layer relative to what a naïve reading of pfctl.conf(5) promises.

All three of these issues cost us measurement credibility for part of one afternoon. The site’s earlier “cubenet enforces deny-by-default” claim was true in form (the ruleset was correctly expressed) and false in fact (pf never saw the traffic). Fixed and re-verified; enforcement smoke test in lab-setup-freebsd.sh after up.

Unknowns and risks

Bottom line