The Numbers

What Tencent's benchmarks measure, and what they don't.

Six advertised claims, taken one at a time. Each is either: accepted (the source substantiates it), contested (the source is more qualified than the headline), unverifiable (we don’t have the data), or pending (our own measurements aren’t in yet). Our per-claim notes come from notes/claims-provenance.md against the pinned commit.

Cold start < 60ms

What the README actually says. “Average end-to-end cold start time for a fully serviceable sandbox is < 60ms” — footnoted “benchmarked on bare-metal. 60ms at single concurrency; under 50 concurrent creations, avg 67ms, P95 90ms, P99 137ms — consistently sub-150ms.”

What the benchmark code actually times. CubeAPI/benchmark/runner.go:25-88 is the in-tree benchmark. It clocks a single HTTP POST /sandboxes round trip — the clock starts immediately before client.Do(req) and stops when the response headers return. That means the published 60ms is create-request latency from an HTTP client, not guest userspace readiness. There is no in-guest /ready probe and no exec check. Our jail rigs measure a stricter clock — jail -c through exec.start="/bin/echo ready" returning inside the jail. Two different definitions of “cold start”, dressed up with the same word.

What the code does. The path that makes this possible is in Cubelet/pkg/allocator/: a pre-warmed pool of already-booted VMs sitting paused. “Create sandbox” becomes “pick one from the pool, resume its memory snapshot, reconfigure network, return.” The snapshot primitive is Cloud Hypervisor’s PUT /api/v1/vm.restore, which maps an existing VMM memory-file and hands control back to vCPUs in roughly the time it takes to mmap a page-aligned range plus one IPI.

What’s excluded from the measurement. Pool warm-up (one-off on host start; invisible in the headline number). Cold-boot on pool-miss. Scheduling from CubeMaster onto a Cubelet. Any network attachment not pre-wired. Guest-side readiness past the HTTP response. Host reboot recovery. None of these are wrong ways to benchmark, but they’re the qualifiers the headline number needs and doesn’t carry.

What the host actually is. “Bare-metal” is the only host fact the repo discloses. CPU model, RAM, storage, host kernel, guest vmlinux, and guest rootfs byte size are unpublished. See /appendix/bench-rig#host-comparison for the full comparison. Our host, honor, is a Ryzen 9 5900HX laptop APU — nobody’s idea of bare-metal parity.

What we compare against. The production-shape analog is bhyve-durable-prewarm-pool — on-disk bhyvectl --suspend checkpoints (requiring the BHYVE_SNAPSHOT kernel option, which we now build and run on honor) behind a hot tier of already-restored bhyve -r processes held paused with SIGSTOP. Durable across host reboot; 17 ms resume. We also keep the process-only proxy bhyve-prewarm-pool in the rig suite as the in-memory-only lower bound.
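The tier choice reduces to picking a resume command per request. A sketch with our own entry shape, not the rig's actual structs:

```go
package main

import "fmt"

// poolEntry is one prewarmed sandbox. Exactly one of pid / checkpoint
// is meaningful: hot entries are live bhyve processes paused with
// SIGSTOP; durable entries are on-disk bhyvectl --suspend checkpoints.
type poolEntry struct {
	hot        bool
	pid        int    // hot tier: process to resume with SIGCONT
	checkpoint string // durable tier: checkpoint file for bhyve -r
	vmName     string
}

// resumeArgv returns the command that turns a pool entry into a running
// sandbox. Hot tier: deliver SIGCONT. Durable tier: restore from the
// checkpoint with bhyve -r (needs the BHYVE_SNAPSHOT kernel option).
func resumeArgv(e poolEntry) []string {
	if e.hot {
		return []string{"kill", "-CONT", fmt.Sprint(e.pid)}
	}
	return []string{"bhyve", "-r", e.checkpoint, e.vmName}
}

func main() {
	fmt.Println(resumeArgv(poolEntry{hot: true, pid: 4242}))
	fmt.Println(resumeArgv(poolEntry{
		checkpoint: "/pool/ckp/agent0.ckp", vmName: "agent0"}))
}
```

The hot path is a single signal delivery, which is where the 17 ms resume comes from; the durable path pays a checkpoint load but survives a host reboot.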

The numbers have landed — full tables at /essays/freebsd-jails and /essays/freebsd-bhyve, and the tier chart on the homepage. The scoring table below reflects the current state against each of Tencent’s advertised claims.

Memory overhead < 5MB per instance

What the README says. “Per-instance memory overhead below 5MB — run thousands of Agents on a single machine. Extreme memory reuse via CoW technology combined with a Rust-rebuilt, aggressively trimmed runtime.”

What we actually see in the code. The “Rust-rebuilt” phrase refers to cube-hypervisor (a fork of Cloud Hypervisor, itself Rust) and cube-agent (a fork of Kata, Rust). Neither is meaningfully smaller than its upstream; there is no visible “aggressive trimming” beyond what Cloud Hypervisor and Kata already are. The <5MB claim is an incremental-overhead-per-instance property driven by host-level page sharing: many identical guest images share text and rodata pages in the host’s page cache, and live-migration-style memory deduplication is not enabled (we see no KSM-equivalent wiring in the host-side config).

What the number actually measures. Resident memory attributable to a single new sandbox given that N others already exist sharing the same template. Ensemble property, not per-process property.
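A toy model makes the ensemble-versus-per-process distinction concrete. The split into shared and unique bytes, and the numbers in main, are illustrative assumptions, not measurements:

```go
package main

import "fmt"

const MiB = 1 << 20

// marginalCost is what the README's "<5MB per instance" measures: the
// extra host-resident bytes when one more sandbox joins a fleet that
// already shares the template's text/rodata pages.
func marginalCost(sharedBytes, uniqueBytes, fleetSize int) int {
	if fleetSize == 0 {
		// The first instance has nobody to share with: it pays for
		// the shared pages as well as its own unique ones.
		return sharedBytes + uniqueBytes
	}
	return uniqueBytes
}

// naivePerProcessRSS is what top(1) shows for one VMM: shared pages are
// charged to every process that maps them.
func naivePerProcessRSS(sharedBytes, uniqueBytes int) int {
	return sharedBytes + uniqueBytes
}

func main() {
	shared, unique := 120*MiB, 4*MiB // illustrative split, not measured
	fmt.Printf("marginal, fleet of 1000: %d MiB\n", marginalCost(shared, unique, 1000)/MiB)
	fmt.Printf("marginal, empty host:    %d MiB\n", marginalCost(shared, unique, 0)/MiB)
	fmt.Printf("naive per-process RSS:   %d MiB\n", naivePerProcessRSS(shared, unique)/MiB)
}
```

With these assumed numbers the marginal cost clears the 5 MB bar while the naive per-process view does not, which is exactly the small-fleet caveat below.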

Verdict: contested. The mechanism is plausible and the claim technically true, but the headline is misleading for small-fleet or heterogeneous deployments.

p95 90ms, p99 137ms at concurrency 50

Context. These tails come from the same pool-resume path as the cold-start number. The relevant question is how the pool handles a burst: if you have 50 paused VMs and 50 concurrent requests arrive, does any serialization creep in?

Likely sources of tail. Network-attachment reconfigure (bringing the TAP into the right VNET, updating eBPF maps); the Cubelet allocator’s locking around pool checkout; scheduling latency on the host as 50 VMMs get CPU time simultaneously. The README numbers suggest these tails sum to under 150ms, which is consistent with well-tuned lock-free allocator paths plus a handful of syscalls.

What we can’t verify from source alone. Pool size during the benchmark, host CPU count, and whether the clock starts with the HTTP request or with an allocator call. Tencent’s published bar charts show aggregate percentiles without methodology callouts.

Hardware-level isolation (dedicated guest kernel per agent)

What’s true. Each sandbox runs in its own cube-hypervisor process, creating its own KVM microVM with its own guest kernel via KVM_CREATE_VM. This is genuinely stronger than namespace-based isolation and matches what Firecracker, Kata, and bare Cloud Hypervisor already offer.

What the marketing obscures. The isolation model is not novel to CubeSandbox — it is a property CubeSandbox inherits from being built on Cloud Hypervisor. The “No more unsafe Docker shared-kernel hacks” framing is correct; the implicit “only CubeSandbox offers this” framing is not.

Verdict: accepted.

E2B SDK drop-in

What’s true. The common-path E2B flow (Sandbox.create → run_code → close) works against CubeAPI without modifying SDK code. Switching is a URL change.

What’s misleading. Nine of the seventeen E2B endpoints in the CubeAPI/README.md support matrix are ✅; the other eight — logs v1, logs v2, timeout, refreshes, snapshots (create + list), metrics, per-sandbox network updates — return 501 or aren’t registered. For agents that use only the lifecycle APIs: genuinely drop-in. For agents that use logs, metrics, or persistent snapshots: not drop-in.

Verdict: contested.


Scoring against our measured FreeBSD numbers

With the jail and bhyve-durable-prewarm benchmarks in — measured on honor (Ryzen 9 5900HX, laptop-class), not Tencent’s unspecified bare-metal — here is the full score.

| claim | Tencent | jail-zfs-clone | bhyve-durable-prewarm | adjustment |
| --- | --- | --- | --- | --- |
| cold start cc=1 | 60 ms | 122 ms | 17 ms | Our clock includes in-jail exec / vCPU-runtime advances; theirs is HTTP round-trip. bhyve-durable-prewarm is faster even under our stricter clock. |
| p95 cc=10 | — | 239 ms | 54 ms | Hot-tier SIGCONT scales well; jail cc=10 is ZFS txg-bound. |
| p95 cc=50 | 90 ms | 361 ms | — | Honor has 16 threads against Tencent’s ~96-core host. cc=50 saturates honor. |
| RSS per instance, unpatched | <5 MB | 2 MB | 275 MB | Jails beat Cube because shared kernel. Stock bhyve loses to Cube because FreeBSD doesn’t ship KSM. |
| RSS per instance, with vmm-vnode patch | <5 MB | 2 MB | ~9 MB | N=1000 × 256 MiB microVMs from one ckp fit in 9.1 GiB total → ~9 MiB unique host cost per VM. Matches Cube’s ensemble claim. See /appendix/vmm-vnode-patch. |
| Durability across host reboot | yes | yes | yes (58 ms refill/entry) | bhyvectl --suspend + bhyve -r against SNAPSHOT kernel; 50-entry pool rebuilt in ~3 s. |
| E2B drop-in | 9/17 endpoints | 10/10 SDK-verified | same path | Rust/Axum gateway in e2b-compat; Python SDK lifecycle round-trip passes. |

Tencent column: advertised / claimed. FreeBSD columns: measured on honor with the rigs in benchmarks/rigs/. The accent color marks the best result per row for the shared metric; dashes = not measured.
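The vmm-vnode row's per-VM figure is plain division over the fleet's measured unique total; checking the arithmetic:

```go
package main

import "fmt"

func main() {
	const (
		GiB = 1 << 30
		MiB = 1 << 20
	)
	vms := 1000
	totalUnique := 9.1 * GiB // measured host-resident total for the fleet
	perVM := totalUnique / float64(vms) / MiB

	// 9.1 GiB / 1000 VMs lands a little over 9 MiB of unique host
	// cost per VM, which is the table's "~9 MB".
	fmt.Printf("unique host cost per VM: %.1f MiB\n", perVM)

	// For contrast, the nominal guest size the claim is NOT about:
	fmt.Printf("nominal guest RAM per VM: %d MiB\n", 256)
}
```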

The honest read: The claims aren’t wrong. They’re under-specified. On matching hardware and a matching clock definition, the FreeBSD jail-zfs-clone path is likely at parity or faster; on honor’s laptop APU with a stricter clock, we’re a 2-4× multiple. Neither number tells you what CubeSandbox would do on honor’s hardware (impossible to measure — guest vmlinux and benchmark host aren’t disclosed); neither tells you what jail-zfs-clone would do on Tencent’s hardware. This site’s numbers are an existence proof that an idiomatic FreeBSD stack reaches the same neighborhood, not a bakeoff victory.


Summary

| Claim | Verdict | Why |
| --- | --- | --- |
| Cold start < 60 ms (cc=1) | beaten | bhyve-durable-prewarm-pool lands at 17 ms under our stricter clock; jail-zfs-clone at 122 ms. |
| p95 < 90 ms, p99 < 150 ms (cc≥10) | matched at cc=10 | Hot-tier SIGCONT p95 = 54 ms. cc=50 pending on bhyve; jail hits 361 ms on 16-thread honor. |
| Memory < 5 MB / instance | met on jails only | Jails: 2 MB. bhyve: 275 MB without page sharing; see /appendix/ksm-equivalent. |
| Dedicated guest kernel isolation | accepted | Inherited from Cloud Hypervisor + KVM; bhyve gives the same property on FreeBSD. |
| E2B drop-in | verified (10/10) | Rust/Axum gateway in e2b-compat; Python SDK round-trip passes every call. |
| Durability across host reboot | yes | bhyvectl --suspend + bhyve -r under SNAPSHOT kernel; 50-entry pool rebuilds in ~3 s. |

Verdicts after honor’s first measurement pass. The memory row is the only claim still open against us — closing it is what patches/vmm-memseg-vnode.diff is for.