Agent demo receipts

The CubeSandbox examples include agent-shaped demos rather than new substrate primitives: create a sandbox, write workspace files, run commands, inspect artifacts, and iterate. Coppice now has reproducible receipts for that exact path, plus a subscription-backed Codex CLI smoke and a Modal-style Python decorator demo that prove developer tooling can drive Coppice directly.

OpenAI Agents SDK

The receipt is benchmarks/rigs/openai-agents-coppice-smoke.sh. It installs openai-agents with uv, creates a real Agent with two function_tool tools, and drives the official SDK runner with a deterministic local model so no OpenAI API key is needed.

The tools are backed by a live Coppice sandbox.

The run writes /tmp/openai-agents/input.txt, executes Python inside the sandbox to produce /tmp/openai-agents/out.txt, reads the artifact back, and asserts the final output.

The session target is not hardcoded. By default the rig creates a fresh sandbox from E2B_TEMPLATE; set COPPICE_SANDBOX_ID=<id> to attach the same Agents SDK tool session to an existing sandbox instead. Set COPPICE_KEEP_SANDBOX=1 when you want a newly created sandbox to survive the receipt for manual inspection.
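The selection rule above can be sketched in a few lines of Python. This is an illustrative sketch, not the rig's code: the SessionTarget type and resolve_session_target function are hypothetical names, and the choice never to delete an attached sandbox is an assumption. Only the three environment variable names come from the text.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class SessionTarget:
    """Hypothetical description of what the rig should do for one run."""
    mode: str                  # "attach" or "create"
    sandbox_id: Optional[str]  # set only when attaching
    template: Optional[str]    # set only when creating
    delete_on_exit: bool       # whether to clean up after the receipt


def resolve_session_target(env: Mapping[str, str]) -> SessionTarget:
    """Pick the session target from environment variables."""
    sandbox_id = env.get("COPPICE_SANDBOX_ID", "").strip()
    if sandbox_id:
        # Assumption: an attached sandbox belongs to the caller, so it is kept.
        return SessionTarget("attach", sandbox_id, None, delete_on_exit=False)
    template = env.get("E2B_TEMPLATE", "").strip() or None
    keep = env.get("COPPICE_KEEP_SANDBOX", "") == "1"
    # A freshly created sandbox is removed unless the caller asked to keep it.
    return SessionTarget("create", None, template, delete_on_exit=not keep)


if __name__ == "__main__":
    print(resolve_session_target(os.environ))
```

Keeping the decision in one pure function makes the attach and keep paths easy to test without creating a sandbox.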

Latest transcript: benchmarks/results/agent-demos/latest-openai-agents.txt.

OpenAI Agents + E2B Client Shape

The competitor example does not use MCP. It wires the Agents SDK sandbox abstraction directly to E2BSandboxClient: SandboxAgent, SandboxRunConfig, E2BSandboxClientOptions, and the bundled Shell capability.

The receipt is benchmarks/rigs/openai-agents-e2b-client-smoke.sh. In the default client-smoke mode it uses the official openai-agents[e2b] stack to create or attach to a sandbox, start the sandbox session, write coppice-agent.txt, run uname -s plus cat through the E2B command client, and shut the sandbox down. This mode does not need an OPENAI_API_KEY; it proves the exact E2B client transport.

Run the keyless transport receipt:

mise run agents:openai-e2b-client-smoke

Run the real OpenAI agent path from a desktop with an API key:

OPENAI_API_KEY=... \
OPENAI_AGENTS_E2B_MODE=agent \
OPENAI_MODEL=gpt-4.1-mini \
mise run agents:openai-e2b-client-smoke

The rig opens or reuses the normal desktop tunnels 127.0.0.1:3001 -> honor:3000 and 127.0.0.1:49999 -> honor:49999. It also starts a tiny local envd proxy that injects the per-sandbox Host header, so E2B SDK traffic routes to the intended sandbox instead of relying on the gateway’s most-recent-sandbox fallback.
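The header rewrite that such a proxy performs can be sketched as a pure function. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the "<port>-<sandbox id>.<domain>" host pattern and the sandbox.example domain are placeholders, since the text does not say which Host value the gateway expects.

```python
from typing import Dict, Mapping


def sandbox_host(sandbox_id: str, port: int = 49999,
                 domain: str = "sandbox.example") -> str:
    """Build a per-sandbox host name (hypothetical naming scheme)."""
    return f"{port}-{sandbox_id}.{domain}"


def inject_sandbox_host(headers: Mapping[str, str], host: str) -> Dict[str, str]:
    """Return a copy of the request headers with Host pinned to one sandbox.

    Header names are case-insensitive, so any existing Host variant is dropped
    before the per-sandbox value is added.
    """
    rewritten = {k: v for k, v in headers.items() if k.lower() != "host"}
    rewritten["Host"] = host
    return rewritten


if __name__ == "__main__":
    incoming = {"host": "127.0.0.1:49999", "Accept": "*/*"}
    print(inject_sandbox_host(incoming, sandbox_host("abc123")))
```

Pinning the header per request is what removes the dependence on a most-recent-sandbox fallback: each forwarded request names its target explicitly.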

Useful knobs:

Latest transcript: benchmarks/results/agent-demos/latest-openai-agents-e2b-client.txt.

OpenAI Agents Code Interpreter Shape

The second Cube example layers custom sandbox capabilities on top of the same E2B client: a shell-inspection tool, a Python runner tool, and a Manifest that seeds workspace data before the agent starts. Coppice ports that shape in benchmarks/rigs/openai-agents-code-interpreter-smoke.sh.

The default deterministic mode does not require an OPENAI_API_KEY. It still uses a real SandboxAgent, SandboxRunConfig, custom Capability classes, and the official E2BSandboxClient transport. The local deterministic model invokes run_python against a manifest-seeded sales.csv, writes generated artifacts under output/, then invokes shell to inspect them.

Run the keyless receipt:

mise run agents:openai-code-interpreter-smoke

Run the real OpenAI version:

OPENAI_API_KEY=... \
OPENAI_AGENTS_CODE_MODE=agent \
OPENAI_MODEL=gpt-4.1 \
mise run agents:openai-code-interpreter-smoke

Both modes assert that every Python/shell tool call exits 0 and that all of these sandbox artifacts exist: output/monthly_revenue.csv, output/monthly_revenue.svg, output/top_products.md, and output/summary.txt. The data schema is pinned in the prompt and manifest as date,product,units,unit_price. The top-products check reads output/top_products.md directly and requires exactly three product rows sorted by revenue, rather than asking the model for JSON or a specific markdown-table style. The harness forces headless plotting (MPLBACKEND=Agg), and the prompt tells the real model to generate the SVG directly, so a GUI backend cannot leak into the receipt.
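A style-agnostic top-products check of this kind could look like the sketch below. It is not the harness's implementation: the function names are hypothetical, and it assumes a "product row" is any line that names exactly one known product and that product names do not contain one another. Only the column names come from the pinned schema.

```python
import csv
import io
from collections import defaultdict
from typing import List


def top_products(sales_csv: str, n: int = 3) -> List[str]:
    """Rank products by total revenue (units * unit_price), highest first."""
    revenue = defaultdict(float)
    for row in csv.DictReader(io.StringIO(sales_csv)):
        revenue[row["product"]] += float(row["units"]) * float(row["unit_price"])
    # Ties are broken by product name so the ranking is deterministic.
    return sorted(revenue, key=lambda p: (-revenue[p], p))[:n]


def check_top_products(markdown: str, sales_csv: str, n: int = 3) -> bool:
    """True if the markdown lists exactly the top-n products in revenue order.

    Works for tables, bullets, or numbered lists because it only looks at
    which known product each line mentions, not at the markup around it.
    """
    known = {row["product"] for row in csv.DictReader(io.StringIO(sales_csv))}
    mentioned = []
    for line in markdown.splitlines():
        hits = [p for p in known if p in line]
        if len(hits) == 1:
            mentioned.append(hits[0])
    return mentioned == top_products(sales_csv, n)


if __name__ == "__main__":
    sample = (
        "date,product,units,unit_price\n"
        "2024-01-05,Widget,10,2.50\n"
        "2024-01-09,Gadget,3,20.00\n"
        "2024-02-11,Sprocket,4,10.00\n"
    )
    print(check_top_products("1. Gadget\n2. Sprocket\n3. Widget\n", sample))
```

Recomputing the ranking from the seeded CSV keeps the assertion independent of how the model chose to format its report.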

Latest transcript: benchmarks/results/agent-demos/latest-openai-agents-code-interpreter.txt.

Codex CLI subscription smoke

The receipt is benchmarks/rigs/codex-cli-coppice-smoke.sh. It runs the local codex exec binary non-interactively, so it uses the desktop Codex login/subscription state rather than an OPENAI_API_KEY. The nested Codex agent receives a constrained prompt plus a JSON output schema, then drives the live Coppice gateway: health check, create a sandbox, execute a command in it, and delete it.

This is deliberately separate from the OpenAI Agents SDK receipt above. The SDK receipt proves our tool/session shape against the SDK APIs; the Codex CLI receipt proves subscription-backed agent tooling can operate Coppice end to end through the public gateway surface.

Run it:

mise run agents:codex-cli-smoke

Useful knobs:

Latest transcript: benchmarks/results/agent-demos/latest-codex-cli.txt. The raw nested Codex event stream is preserved beside it as benchmarks/results/agent-demos/latest-codex-cli.events.jsonl, and the schema-validated final object is benchmarks/results/agent-demos/latest-codex-cli.final.json.

Python decorator ergonomics

The receipt is benchmarks/rigs/python-decorator-smoke.sh. It proves the Modal-shaped ergonomics row without adding a gateway primitive: examples/16-python-decorator.py defines a @sandboxed helper whose .remote(…) method creates or attaches to a Coppice sandbox, ships the Python function source plus JSON-native arguments through /sandboxes/:id/exec, parses a JSON result marker, and tears down newly created sandboxes.
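The ship-source-and-parse-marker pattern can be sketched with the standard library alone. This is not the code in examples/16-python-decorator.py: the marker string and helper names are hypothetical, sandbox creation and teardown are omitted, and a child Python interpreter stands in for the /sandboxes/:id/exec call so the sketch runs without a gateway.

```python
import inspect
import json
import subprocess
import sys
import textwrap

RESULT_MARKER = "__RESULT__"  # hypothetical marker; the real one is not given here


def build_script(source: str, func_name: str, args: list, kwargs: dict) -> str:
    """Compose a standalone script: function source, JSON args, marked JSON result."""
    payload = json.dumps({"args": args, "kwargs": kwargs})
    return "\n".join([
        "import json",
        source,
        f"_p = json.loads({payload!r})",
        f"_r = {func_name}(*_p['args'], **_p['kwargs'])",
        f"print({RESULT_MARKER!r} + json.dumps(_r))",
    ])


def parse_result(stdout: str):
    """Return the JSON value from the last marker line, ignoring other output."""
    for line in reversed(stdout.splitlines()):
        if line.startswith(RESULT_MARKER):
            return json.loads(line[len(RESULT_MARKER):])
    raise RuntimeError("no result marker in command output")


def local_exec(script: str) -> str:
    """Local stand-in for the sandbox exec call: run in a child interpreter."""
    proc = subprocess.run([sys.executable, "-c", script],
                          capture_output=True, text=True, check=True)
    return proc.stdout


def function_source(func) -> str:
    """Dedent the function source and drop decorator lines above the def."""
    lines = textwrap.dedent(inspect.getsource(func)).splitlines()
    start = next(i for i, ln in enumerate(lines) if ln.startswith("def "))
    return "\n".join(lines[start:])


def sandboxed(func):
    """Attach a .remote(...) method that runs func out of process."""
    def remote(*args, **kwargs):
        script = build_script(function_source(func), func.__name__,
                              list(args), kwargs)
        return parse_result(local_exec(script))
    func.remote = remote
    return func
```

Decorating a top-level function in a file with @sandboxed and calling its .remote(2, 3) would run it in the child process. Because only source text and JSON cross the boundary, the function must be self-contained, and both arguments and return value must be JSON-serializable.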

Run it:

mise run agents:python-decorator-smoke

Useful knobs:

Latest transcript: benchmarks/results/python-decorator/latest.txt.

Mini-RL / SWE-style loop

The receipt is benchmarks/rigs/mini-rl-training-smoke.sh. It creates a sandbox workspace with a deliberately broken policy.py and a tiny trainer/test harness. The first run of the command fails with a reward below the threshold. The rig then patches the policy, reruns the same command stream, and verifies the checkpoint: best_arm=1, score=1.000.

This is intentionally small, but it exercises the same control loop as larger SWE-bench or RL demos: write workspace, run tests, inspect failure, patch, rerun, and preserve the receipt.
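The write, run, patch, rerun loop can be reproduced locally in miniature. The sketch below is an illustration, not the rig: a temporary directory stands in for the sandbox workspace, and the file contents, the two-armed reward table, and the 0.9 threshold are all hypothetical. Only the policy.py name and the final checkpoint line follow the text.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical workspace files for a two-armed toy bandit.
BROKEN_POLICY = "def choose_arm():\n    return 0  # deliberately wrong arm\n"
FIXED_POLICY = "def choose_arm():\n    return 1\n"
TRAINER = '''
import sys
from policy import choose_arm

REWARDS = [0.0, 1.0]   # arm 1 always pays out in this toy setup
THRESHOLD = 0.9        # hypothetical pass mark

arm = choose_arm()
score = REWARDS[arm]
print(f"best_arm={arm}, score={score:.3f}")
sys.exit(0 if score >= THRESHOLD else 1)
'''


def run_trainer(workspace: Path):
    """Run the trainer in the workspace; return (exit code, checkpoint line)."""
    # -B skips bytecode caching so the patched policy is always re-read.
    proc = subprocess.run([sys.executable, "-B", "trainer.py"],
                          cwd=workspace, capture_output=True, text=True)
    return proc.returncode, proc.stdout.strip()


def repair_loop():
    """Write the workspace, run, patch the policy, and rerun the same command."""
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        (workspace / "policy.py").write_text(BROKEN_POLICY)
        (workspace / "trainer.py").write_text(TRAINER)
        first = run_trainer(workspace)                      # expected to fail
        (workspace / "policy.py").write_text(FIXED_POLICY)  # the patch step
        second = run_trainer(workspace)                     # same command, rerun
        return first, second


if __name__ == "__main__":
    for code, line in repair_loop():
        print(code, line)
```

Running the identical command before and after the patch is the point of the loop: the only variable between the failing and passing runs is the policy file.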

Latest transcript: benchmarks/results/agent-demos/latest-mini-rl.txt.

Run the keyless agent receipts, including the decorator and mini-RL receipts:

mise run agents:demos

Combined latest transcript: benchmarks/results/agent-demos/latest-summary.txt.

What this closes

These receipts close the old “agent demos are blocked on commands” audit rows. The underlying gateway pieces were already closed separately: file operations, command streaming, logs, code execution, and sandbox lifecycle. This page ties them together in example-shaped flows that can be rerun against honor. The Codex CLI receipt is not a CubeSandbox feature row by itself; it is operational evidence that subscription-authenticated desktop agents can use the same public gateway surface without API-key plumbing.