Per-sandbox metrics

Cube’s control plane surfaces per-sandbox CPU, memory, and disk numbers through GET /sandboxes/:id/metrics. Until recently our version of that route returned []. FreeBSD has everything a caller needs — rctl(8) tracks cpu-percent and memoryuse per jail when kern.racct.enable=1, and ZFS knows the used + quota bytes of every sandbox dataset. This page describes how the gateway wires those two sources together and exposes the result on both the JSON API and on Prometheus /metrics.

The sampler

A single tokio task, spawned in main.rs next to the TTL reaper, wakes every COPPICE_METRICS_SAMPLE_SEC seconds (default 10) and walks state.sandboxes. For each id it runs three cheap host-side commands:

rctl -h -u jail:e2b-<id>        # pcpu, memoryuse, readbps, writebps, ...
rctl -h -l jail:e2b-<id>        # the configured rules, to recover the caps
zfs  get -Hpo value used,quota zroot/jails/e2b-<id>

rctl’s -u output is one key=value pair per line; the sampler grabs pcpu (a whole-percent integer, where 100 = one full core) and memoryuse (bytes, which -h renders with a K/M/G suffix, so the parser accepts both raw and humanized forms). The -l output gives us whatever the operator configured at create time — we extract memoryuse:deny=<n> so the _limit_bytes gauges reflect the cap, not just the current usage.
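The parsing step described above can be sketched like this (a minimal sketch; the function names are illustrative, not the gateway's actual symbols):

```rust
/// Parse one `rctl -h -u` value, accepting both raw integers ("54919168")
/// and humanized forms ("52M", "131072K", "1G").
fn parse_rctl_value(v: &str) -> Option<u64> {
    let v = v.trim();
    let (num, mult) = match v.chars().last()? {
        'K' | 'k' => (&v[..v.len() - 1], 1024u64),
        'M' | 'm' => (&v[..v.len() - 1], 1024 * 1024),
        'G' | 'g' => (&v[..v.len() - 1], 1024 * 1024 * 1024),
        _ => (v, 1),
    };
    num.parse::<f64>().ok().map(|n| (n * mult as f64) as u64)
}

/// Walk the key=value lines of `rctl -h -u jail:<name>` and pull out
/// the two fields the sampler cares about.
fn parse_usage(output: &str) -> (Option<u64>, Option<u64>) {
    let mut pcpu = None;
    let mut mem = None;
    for line in output.lines() {
        if let Some((key, val)) = line.trim().split_once('=') {
            match key {
                "pcpu" => pcpu = parse_rctl_value(val),
                "memoryuse" => mem = parse_rctl_value(val),
                _ => {}
            }
        }
    }
    (pcpu, mem)
}
```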

The whole sweep is non-blocking: sandbox ids are snapshotted under a short-lived read lock, then the lock is dropped before any process-spawning host command runs. A sandbox that vanished mid-sweep, or a jail whose racct entry is empty, logs one warn! and the loop continues — one bad sandbox never stalls the rest of the tick.
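The snapshot-then-release pattern looks roughly like this (a sketch using std::sync::RwLock for brevity; the real sampler is an async tokio task, but the locking discipline is the same):

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

struct Sandbox {
    template: String,
}

fn sweep(sandboxes: &Arc<RwLock<HashMap<String, Sandbox>>>) -> Vec<String> {
    // 1. Snapshot the ids under a short-lived read lock; the guard is a
    //    temporary, released at the end of this statement.
    let ids: Vec<String> = sandboxes.read().unwrap().keys().cloned().collect();
    // 2. The slow per-sandbox host commands (rctl / zfs) run here without
    //    the lock held; a failure for one id is logged and skipped rather
    //    than aborting the rest of the tick.
    for _id in &ids {
        // run_host_commands(_id) would go here (hypothetical helper)
    }
    ids
}
```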

What /metrics exposes

Five labeled gauges per live sandbox, plus one static gauge for the racct flag:

# TYPE coppice_sandbox_cpu_percent gauge
coppice_sandbox_cpu_percent{sandbox="a1b2c3d4e5f6",template="python"} 12.500
# TYPE coppice_sandbox_memory_bytes gauge
coppice_sandbox_memory_bytes{sandbox="a1b2c3d4e5f6",template="python"} 54919168
# TYPE coppice_sandbox_memory_limit_bytes gauge
coppice_sandbox_memory_limit_bytes{sandbox="a1b2c3d4e5f6",template="python"} 134217728
# TYPE coppice_sandbox_disk_used_bytes gauge
coppice_sandbox_disk_used_bytes{sandbox="a1b2c3d4e5f6",template="python"} 1048576
# TYPE coppice_sandbox_disk_limit_bytes gauge
coppice_sandbox_disk_limit_bytes{sandbox="a1b2c3d4e5f6",template="python"} 104857600
# TYPE coppice_sandbox_uptime_seconds gauge
coppice_sandbox_uptime_seconds{sandbox="a1b2c3d4e5f6",template="python"} 1800
# TYPE coppice_racct_enabled gauge
coppice_racct_enabled 1

The sandbox label is a 12-hex-character prefix of the sandbox id — plenty to disambiguate concurrent sandboxes on one host while keeping the Prometheus label-string size bounded. The template label is whatever the caller sent as templateID at create time.
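A hand-rolled rendering of the per-sandbox gauge lines above might look like the following (a sketch for illustration; the helper name is hypothetical, and a real handler could equally use a metrics crate):

```rust
/// Render the cpu/mem gauge lines for one sandbox in Prometheus
/// text-exposition format, matching the shape shown above.
fn render_sandbox(id_prefix: &str, template: &str, cpu: f64, mem: u64, mem_limit: u64) -> String {
    let labels = format!("{{sandbox=\"{}\",template=\"{}\"}}", id_prefix, template);
    let mut out = String::new();
    out.push_str(&format!("coppice_sandbox_cpu_percent{} {:.3}\n", labels, cpu));
    out.push_str(&format!("coppice_sandbox_memory_bytes{} {}\n", labels, mem));
    out.push_str(&format!("coppice_sandbox_memory_limit_bytes{} {}\n", labels, mem_limit));
    out
}
```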

Example PromQL sketches, all of which work with the current shape:

# top-5 sandboxes by CPU right now:
topk(5, coppice_sandbox_cpu_percent)

# sandboxes within 10% of their memory cap:
coppice_sandbox_memory_bytes / coppice_sandbox_memory_limit_bytes > 0.9

# total cores in use across the fleet:
sum(coppice_sandbox_cpu_percent) / 100

The JSON surface

GET /sandboxes/:id now embeds the same reading inline so SDK clients can get it without a second round-trip:

{
  "sandboxID": "a1b2c3d4e5f6...",
  "templateID": "python",
  "state": "running",
  "startedAt": "2026-04-22T20:15:03Z",
  "usage": {
    "cpuPercent": 12.5,
    "memoryBytes": 54919168,
    "memoryLimitBytes": 134217728,
    "diskUsedBytes": 1048576,
    "diskLimitBytes": 104857600,
    "uptimeSeconds": 1800,
    "sampledAt": "2026-04-22T20:45:03Z"
  }
}

The usage field is omitted entirely for sandboxes the sampler hasn’t reached yet (up to the first COPPICE_METRICS_SAMPLE_SEC seconds of a sandbox’s life). GET /sandboxes/:id/metrics returns the same data as a single-element array, which is the shape the E2B SDK’s sandbox.metrics() iterator expects.
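The omission rule can be sketched as follows (hand-rolled serialization for self-containment; in practice a serde struct with #[serde(skip_serializing_if = "Option::is_none")] on the usage field achieves the same thing, and the names here are illustrative):

```rust
struct Usage {
    cpu_percent: f64,
    memory_bytes: u64,
}

/// Emit the sandbox JSON, dropping the "usage" key entirely when no
/// sample exists yet rather than sending nulls or zeros.
fn sandbox_json(id: &str, usage: Option<&Usage>) -> String {
    let mut out = format!("{{\"sandboxID\":\"{}\"", id);
    if let Some(u) = usage {
        out.push_str(&format!(
            ",\"usage\":{{\"cpuPercent\":{},\"memoryBytes\":{}}}",
            u.cpu_percent, u.memory_bytes
        ));
    }
    out.push('}');
    out
}
```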

The racct caveat

The coppice_sandbox_cpu_percent and coppice_sandbox_memory_bytes gauges depend on kern.racct.enable=1 being set at boot. Without it rctl reports "racct is disabled" and the sampler emits zero for the cpu/mem fields. Disk usage still works — ZFS quotas don’t depend on racct. The startup log contains a one-shot warn! when racct is off, and the static gauge coppice_racct_enabled 0 lets alerting rules catch the drift before someone notices blank panels at 3am.
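The degradation path can be sketched like this (a sketch with illustrative names; the error-string match mirrors the "racct is disabled" message quoted above):

```rust
/// Map a rctl sample attempt to the gauge values: (cpu_percent,
/// memory_bytes, racct_enabled). A disabled-racct error zeroes the
/// cpu/mem readings and flips coppice_racct_enabled to 0.
fn racct_gauges(rctl_result: Result<(f64, u64), String>) -> (f64, u64, u8) {
    match rctl_result {
        Ok((pcpu, mem)) => (pcpu, mem, 1),
        Err(e) if e.contains("racct is disabled") => (0.0, 0, 0),
        // Any other failure: zeros for this tick, but racct itself is fine.
        Err(_) => (0.0, 0, 1),
    }
}
```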

The honor bench box is configured with racct enabled — see results for the boot/loader.conf line, and the CPU / memory limits per sandbox row in the feature audit.

Cost

At the current 10 s cadence, N=100 live sandboxes means ≈30 process spawns per second (three host commands per sandbox: two rctl invocations plus one zfs, amortised across the tick interval). On honor that costs ~0.2% of one core — well under the noise floor of the workloads the sandboxes themselves run. If that ratio ever looks tight the sampler can batch-query all jails at once (rctl -u jail: without a name returns all of them) but the current shape is simpler and the cost is invisible.

Rig

benchmarks/rigs/per-sandbox-metrics-smoke.sh creates a sandbox, starts a python3 -c 'while True: pass' burner inside it, polls /metrics until coppice_sandbox_cpu_percent{sandbox=<id>} crosses 10, and exits non-zero if it doesn’t within 30 s. The rig skips (exit 0) on hosts without racct enabled, so it can stay wired into CI without a reboot requirement.