Cube’s control plane surfaces per-sandbox CPU, memory, and disk
numbers through GET /sandboxes/:id/metrics. Until
recently our version of that route returned []. FreeBSD
has everything a caller needs — rctl(8) tracks cpu-percent and
memoryuse per jail when kern.racct.enable=1, and ZFS
knows the used + quota bytes of every sandbox dataset. This page
describes how the gateway wires those two sources together and
exposes the result on both the JSON API and on Prometheus
/metrics.
The sampler
A single tokio task, spawned in main.rs next to the TTL
reaper, wakes every COPPICE_METRICS_SAMPLE_SEC seconds
(default 10) and walks state.sandboxes. For each id it
runs three cheap host-side commands:
rctl -h -u jail:e2b-<id> # pcpu, memoryuse, readbps, writebps, ...
rctl -h -l jail:e2b-<id> # the configured rules, to recover the caps
zfs get -Hpo value used,quota zroot/jails/e2b-<id>
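Sketched in the gateway's own language, the probe set for one sandbox looks roughly like this. The helper names (`probe_commands`, `run_probe`) are illustrative, not the gateway's actual code; only the command lines themselves come from the list above.

```rust
use std::process::Command;

/// Build the three host-side probes for one sandbox id, following the
/// jail/dataset naming above (jail:e2b-<id>, zroot/jails/e2b-<id>).
fn probe_commands(id: &str) -> [(String, Vec<String>); 3] {
    let jail = format!("jail:e2b-{id}");
    let dataset = format!("zroot/jails/e2b-{id}");
    [
        ("rctl".to_string(), vec!["-h".into(), "-u".into(), jail.clone()]),
        ("rctl".to_string(), vec!["-h".into(), "-l".into(), jail]),
        ("zfs".to_string(), vec![
            "get".into(), "-Hpo".into(), "value".into(), "used,quota".into(), dataset,
        ]),
    ]
}

/// Run one probe, returning stdout on success and None when the jail is
/// already gone or the command fails for any other reason.
#[allow(dead_code)]
fn run_probe(cmd: &str, args: &[String]) -> Option<String> {
    let out = Command::new(cmd).args(args).output().ok()?;
    out.status
        .success()
        .then(|| String::from_utf8_lossy(&out.stdout).into_owned())
}

fn main() {
    // On a non-FreeBSD host the probes would fail, so just print them.
    for (cmd, args) in probe_commands("a1b2c3d4e5f6") {
        println!("{cmd} {}", args.join(" "));
    }
}
```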
rctl’s -u output is one key=value pair per
line; the sampler grabs pcpu (whole-percent integer,
100 = one full core) and memoryuse (bytes
with -h suffixing K/M/G, so we parse both forms). The
-l output gives us whatever the operator configured at
create time — we extract memoryuse:deny=<n> so the
_limit_bytes gauges reflect the cap, not just the current
usage.
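A minimal parser for those key=value lines might look like the sketch below. The function name is mine, and the 1024-based suffix multipliers are an assumption about rctl's humanized output, not something the text specifies.

```rust
/// Parse one `rctl -h -u` line such as "pcpu=12" or "memoryuse=52M".
/// Handles both the bare-integer and the K/M/G-suffixed forms the text
/// mentions; fractional humanized values like "1.5G" also parse.
fn parse_rctl_value(line: &str) -> Option<(&str, u64)> {
    let (key, raw) = line.trim().split_once('=')?;
    let raw = raw.trim();
    let (num, mult) = match raw.chars().last()? {
        'K' | 'k' => (&raw[..raw.len() - 1], 1u64 << 10),
        'M' | 'm' => (&raw[..raw.len() - 1], 1u64 << 20),
        'G' | 'g' => (&raw[..raw.len() - 1], 1u64 << 30),
        _ => (raw, 1),
    };
    let v: f64 = num.parse().ok()?;
    Some((key, (v * mult as f64) as u64))
}

fn main() {
    assert_eq!(parse_rctl_value("pcpu=12"), Some(("pcpu", 12)));
    assert_eq!(parse_rctl_value("memoryuse=52M"), Some(("memoryuse", 52 << 20)));
    // A -l rule keeps its full key path, e.g. "memoryuse:deny=134217728";
    // split_once('=') leaves that path intact in the key.
    assert_eq!(
        parse_rctl_value("memoryuse:deny=134217728"),
        Some(("memoryuse:deny", 134_217_728))
    );
}
```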
The whole sweep is non-blocking: sandbox ids are snapshotted under a
short-lived read lock, then the lock is dropped before any
process-spawning host command runs. A sandbox that vanished
mid-sweep, or a jail whose racct entry is empty, logs one
warn! and the loop continues — one bad sandbox never
stalls the rest of the tick.
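The snapshot-then-drop pattern reads like this in miniature; the field and type names are illustrative, not the gateway's actual state shape.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

struct State {
    // id -> per-sandbox record; () stands in for the real entry type.
    sandboxes: RwLock<HashMap<String, ()>>,
}

/// Copy the live ids under a short-lived read lock. The guard is a
/// temporary inside the expression, so it is dropped before the caller
/// spawns any rctl/zfs process.
fn snapshot_ids(state: &State) -> Vec<String> {
    state.sandboxes.read().unwrap().keys().cloned().collect()
}

fn main() {
    let state = State { sandboxes: RwLock::new(HashMap::new()) };
    state
        .sandboxes
        .write()
        .unwrap()
        .insert("a1b2c3d4e5f6".to_string(), ());
    let ids = snapshot_ids(&state);
    // No lock held here: the slow probes run against the snapshot, and an
    // id that vanished mid-sweep just produces one failed probe + warn!.
    assert_eq!(ids, vec!["a1b2c3d4e5f6".to_string()]);
}
```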
What /metrics exposes
Six labeled gauges per live sandbox, plus one static gauge for the racct flag:
# TYPE coppice_sandbox_cpu_percent gauge
coppice_sandbox_cpu_percent{sandbox="a1b2c3d4e5f6",template="python"} 12.500
# TYPE coppice_sandbox_memory_bytes gauge
coppice_sandbox_memory_bytes{sandbox="a1b2c3d4e5f6",template="python"} 54919168
# TYPE coppice_sandbox_memory_limit_bytes gauge
coppice_sandbox_memory_limit_bytes{sandbox="a1b2c3d4e5f6",template="python"} 134217728
# TYPE coppice_sandbox_disk_used_bytes gauge
coppice_sandbox_disk_used_bytes{sandbox="a1b2c3d4e5f6",template="python"} 1048576
# TYPE coppice_sandbox_disk_limit_bytes gauge
coppice_sandbox_disk_limit_bytes{sandbox="a1b2c3d4e5f6",template="python"} 104857600
# TYPE coppice_sandbox_uptime_seconds gauge
coppice_sandbox_uptime_seconds{sandbox="a1b2c3d4e5f6",template="python"} 1800
# TYPE coppice_racct_enabled gauge
coppice_racct_enabled 1
The sandbox label is a 12-hex-character prefix of the
sandbox id — plenty to disambiguate concurrent sandboxes on one host
while keeping the Prometheus label-string size bounded. The
template label is whatever the caller sent as
templateID at create time.
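For concreteness, one sample line can be rendered like this; a sketch of the exposition format, not the gateway's actual exporter code.

```rust
/// Render one Prometheus gauge sample with the sandbox/template labels
/// described above.
fn gauge_line(metric: &str, sandbox_id: &str, template: &str, value: f64) -> String {
    // 12-hex-character prefix of the full id; shorter ids pass through.
    let short = &sandbox_id[..sandbox_id.len().min(12)];
    format!("{metric}{{sandbox=\"{short}\",template=\"{template}\"}} {value}")
}

fn main() {
    println!(
        "{}",
        gauge_line("coppice_sandbox_cpu_percent", "a1b2c3d4e5f6778899aabbcc", "python", 12.5)
    );
}
```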
Example PromQL sketches, all of which work with the current shape:
# top-5 sandboxes by CPU right now:
topk(5, coppice_sandbox_cpu_percent)
# sandboxes within 10% of their memory cap:
coppice_sandbox_memory_bytes / coppice_sandbox_memory_limit_bytes > 0.9
# total cores in use across the fleet:
sum(coppice_sandbox_cpu_percent) / 100
The JSON surface
GET /sandboxes/:id now embeds the same reading inline so
SDK clients can get it without a second round-trip:
{
  "sandboxID": "a1b2c3d4e5f6...",
  "templateID": "python",
  "state": "running",
  "startedAt": "2026-04-22T20:15:03Z",
  "usage": {
    "cpuPercent": 12.5,
    "memoryBytes": 54919168,
    "memoryLimitBytes": 134217728,
    "diskUsedBytes": 1048576,
    "diskLimitBytes": 104857600,
    "uptimeSeconds": 1800,
    "sampledAt": "2026-04-22T20:45:03Z"
  }
}
The usage field is omitted entirely for sandboxes the
sampler hasn’t reached yet (during the first
COPPICE_METRICS_SAMPLE_SEC seconds of a sandbox’s
life). GET /sandboxes/:id/metrics returns the same data
as a single-element array, which is the shape the E2B SDK’s
sandbox.metrics() iterator expects.
The racct caveat
Everything under coppice_sandbox_cpu_percent and
coppice_sandbox_memory_bytes depends on
kern.racct.enable=1 being set at boot. Without it rctl
reports that racct is disabled and the sampler emits zero for the
cpu/mem fields. Disk usage still works — ZFS quotas don’t depend on
racct. The startup log contains a one-shot warn! when
racct is off, and the static gauge
coppice_racct_enabled 0 lets alerting rules catch
the drift before someone notices blank panels at 3am.
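For reference, racct is a boot-time tunable: it has to go in the loader configuration and takes effect only after a reboot, since the sysctl cannot be flipped at runtime.

```
# /boot/loader.conf  (read at boot; a reboot is required)
kern.racct.enable=1
```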
The honor bench box is configured with racct enabled — see
results for the
boot/loader.conf line, and the
CPU / memory limits per sandbox
row in the
feature audit.
Cost
At the current 10 s cadence, N=100 live sandboxes means ≈30 process
spawns per second (two rctl invocations + one zfs per sandbox, so 300
spawns amortised across the 10 s tick). On honor that costs ~0.2% of
one core — well under the noise floor of the workloads the sandboxes
themselves run. If that ratio ever looks tight the sampler can
batch-query all jails at once (rctl -u jail: without a
name returns all of them) but the current shape is simpler and the
cost is invisible.
Rig
benchmarks/rigs/per-sandbox-metrics-smoke.sh creates a
sandbox, starts a python3 -c 'while True: pass' burner
inside it, polls /metrics until
coppice_sandbox_cpu_percent{sandbox=<id>}
crosses 10, and exits non-zero if it doesn’t within 30 s. The rig
skips (exit 0) on hosts without racct enabled, so it can stay wired
into CI without a reboot requirement.