Run any integrated benchmark (or all benchmarks), store normalized results in SQLite/JSON, and inspect history in the browser viewer.
Use the workspace Python (`/Users/shawwalters/eliza-workspace/.venv/bin/python`) for consistent dependency versions across benchmark subprocesses.
- Results DB: `benchmarks/benchmark_results/orchestrator.sqlite`
- Viewer dataset: `benchmarks/benchmark_results/viewer_data.json`
- Static viewer UI: `benchmarks/viewer/index.html`
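These are plain files on disk, so you can inspect them directly; for example, with the stock `sqlite3` shell (a quick sanity check, assuming `sqlite3` is installed):

```bash
# List the tables in the results database (table names vary by orchestrator version).
sqlite3 benchmarks/benchmark_results/orchestrator.sqlite '.tables'
```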
```bash
/opt/miniconda3/bin/python -m benchmarks.orchestrator list-benchmarks
```

This verifies adapter coverage for all benchmark directories under `benchmarks/`.
Run one benchmark:
```bash
/opt/miniconda3/bin/python -m benchmarks.orchestrator run \
  --benchmarks solana \
  --provider groq \
  --model qwen/qwen3-32b
```

Run all benchmarks:
```bash
/opt/miniconda3/bin/python -m benchmarks.orchestrator run \
  --all \
  --provider groq \
  --model qwen/qwen3-32b
```

Idempotent behavior:
- Existing successful signatures are skipped automatically.
- `--rerun-failed` reruns only signatures whose latest run failed.
- `--force` always creates a fresh run.
Examples:
```bash
# rerun only failed signatures
/opt/miniconda3/bin/python -m benchmarks.orchestrator run --all --rerun-failed --provider groq --model qwen/qwen3-32b

# force fresh runs
/opt/miniconda3/bin/python -m benchmarks.orchestrator run --all --force --provider groq --model qwen/qwen3-32b
```

Use `--extra` with a JSON object for benchmark-specific knobs.
Adapter defaults are applied first, then `--extra` overrides are merged on top. This keeps `run --all` idempotent with stable per-benchmark baseline settings while still letting you override knobs when needed.
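One way to picture the merge (the adapter default below is illustrative, and whether nested objects merge recursively is an assumption; `jq`'s `*` operator shown here does merge recursively):

```bash
# Hypothetical adapter default merged with an --extra override, key by key.
echo '{"max_tasks":50,"headless":true} {"max_tasks":1}' | jq -s '.[0] * .[1]'
# => {"max_tasks": 1, "headless": true}
```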
```bash
/opt/miniconda3/bin/python -m benchmarks.orchestrator run \
  --benchmarks osworld \
  --provider groq \
  --model qwen/qwen3-32b \
  --rerun-failed \
  --extra '{"max_tasks":1,"headless":true,"vm_ready_timeout_seconds":21600}'
```

`--extra` also supports a `per_benchmark` object for benchmark-specific overrides in one `--all` run:
```bash
/Users/shawwalters/eliza-workspace/.venv/bin/python -m benchmarks.orchestrator run \
  --all \
  --agent eliza \
  --provider groq \
  --model qwen/qwen3-32b \
  --extra "$(cat benchmarks/orchestrator/profiles/sample10.json)"
```

Profiles included in the repo:

- `benchmarks/orchestrator/profiles/sample10.json` - roughly 10% sampled run settings (where the benchmark supports sampling).
- `benchmarks/orchestrator/profiles/orchestrator_subagents.json` - orchestrator matrix profile for `swe_bench_orchestrated`, `gaia_orchestrated`, and `orchestrator_lifecycle`.
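If you want to hand-roll a profile instead of using the ones in the repo, its shape is the same JSON object `--extra` accepts. A minimal sketch (keys borrowed from the examples below, not the actual file contents):

```bash
# Write a tiny per_benchmark profile and pass it as --extra.
cat > /tmp/my_profile.json <<'EOF'
{
  "per_benchmark": {
    "gaia_orchestrated": {"dataset": "sample", "max_questions": 10}
  }
}
EOF
/opt/miniconda3/bin/python -m benchmarks.orchestrator run --all \
  --provider groq --model qwen/qwen3-32b \
  --extra "$(cat /tmp/my_profile.json)"
```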
New orchestrator-centric benchmark IDs:
- `swe_bench_orchestrated`
- `gaia_orchestrated`
- `orchestrator_lifecycle`
- `eliza_replay`
Code matrix example:
```bash
/opt/miniconda3/bin/python -m benchmarks.orchestrator run \
  --benchmarks swe_bench_orchestrated \
  --provider anthropic \
  --model claude-sonnet-4-6 \
  --extra '{"per_benchmark":{"swe_bench_orchestrated":{"matrix":true,"max_instances":3,"no_docker":true,"strict_capabilities":true}}}'
```

Research matrix example:
```bash
/opt/miniconda3/bin/python -m benchmarks.orchestrator run \
  --benchmarks gaia_orchestrated \
  --provider groq \
  --model qwen/qwen3-32b \
  --extra '{"per_benchmark":{"gaia_orchestrated":{"matrix":true,"dataset":"sample","max_questions":10,"strict_capabilities":true}}}'
```

Lifecycle suite example:
```bash
/opt/miniconda3/bin/python -m benchmarks.orchestrator run \
  --benchmarks orchestrator_lifecycle \
  --provider openai \
  --model gpt-4o \
  --extra '{"per_benchmark":{"orchestrator_lifecycle":{"max_scenarios":12,"strict":true}}}'
```

Replay scoring example (from normalized Eliza capture artifacts):
```bash
/opt/miniconda3/bin/python -m benchmarks.orchestrator run \
  --benchmarks eliza_replay \
  --provider openai \
  --model gpt-4o-mini \
  --extra '{"per_benchmark":{"eliza_replay":{"capture_path":"/path/to/replays","capture_glob":"*.replay.json"}}}'
```

`capture_path` is required and must point to a file or directory of normalized `*.replay.json` artifacts.
Serve live viewer API + UI:
```bash
/opt/miniconda3/bin/python -m benchmarks.orchestrator serve-viewer --host 127.0.0.1 --port 8877
```

Open: http://127.0.0.1:8877/
Viewer supports:
- Historical runs across all benchmarks.
- Sorting by `agent`, `run_id`, and other columns.
- High-score comparison columns (`high_score`, `delta`).
- Filtering by benchmark/status and text search.
Export the viewer dataset:

```bash
/opt/miniconda3/bin/python -m benchmarks.orchestrator export-viewer-data
```

If an orchestrator process is interrupted, rows can remain in the `running` state. Recover them immediately and regenerate the viewer dataset:

```bash
/opt/miniconda3/bin/python -m benchmarks.orchestrator recover-stale-runs --stale-seconds 0
```

The default behavior only recovers runs older than 300 seconds:

```bash
/opt/miniconda3/bin/python -m benchmarks.orchestrator recover-stale-runs
```

Audit runs from the CLI:

```bash
/opt/miniconda3/bin/python -m benchmarks.orchestrator show-runs --desc --limit 200
```

`show-runs` is sorted by (agent, run_id) and is useful for quick auditing.
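After a crash, the recovery and refresh steps can be chained in one shot (same commands as above, just sequenced):

```bash
# Recover interrupted rows immediately, then rebuild the viewer dataset.
/opt/miniconda3/bin/python -m benchmarks.orchestrator recover-stale-runs --stale-seconds 0
/opt/miniconda3/bin/python -m benchmarks.orchestrator export-viewer-data
```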
Run any benchmark suite against two models and print a side-by-side delta table. Each side is a separate run group in SQLite, but both runs share a `comparison_id` so the comparison can be re-rendered later.
Spec format for `--a` / `--b`: `<provider>:<model>[@<base_url>]`. The optional `@<base_url>` is forwarded to the provider as an OpenAI-compatible base URL; for the `vllm` provider this points the orchestrator at a self-hosted vLLM endpoint started via `vllm serve`.
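For instance, the two endpoints used in the comparison below could be started like this (a sketch; `vllm serve` flags vary by vLLM version, and the model IDs are taken from the example):

```bash
# Serve model A on port 8001 and model B on port 8002 (OpenAI-compatible APIs).
vllm serve elizaos/eliza-1-2b --port 8001 &
vllm serve Qwen/Qwen3.5-2B --port 8002 &
```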
```bash
/opt/miniconda3/bin/python -m benchmarks.orchestrator compare \
  --a "vllm:elizaos/eliza-1-2b@http://127.0.0.1:8001/v1" \
  --b "vllm:Qwen/Qwen3.5-2B@http://127.0.0.1:8002/v1" \
  --benchmarks eliza-format,bfcl,realm,context-bench
```

Optional flags:
- `--max-examples N` caps work per benchmark (forwarded as `max_examples`/`max_tasks`/`sample` so individual adapters pick it up however they natively wire sampling).
- `--temperature 0.0` (default).
- `--out <dir>` - directory for `compare-<comparison_id>.json`. Defaults to `benchmarks/benchmark_results/comparisons/`.
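A `compare` invocation exercising these flags (the example cap and output directory are illustrative):

```bash
/opt/miniconda3/bin/python -m benchmarks.orchestrator compare \
  --a "vllm:elizaos/eliza-1-2b@http://127.0.0.1:8001/v1" \
  --b "vllm:Qwen/Qwen3.5-2B@http://127.0.0.1:8002/v1" \
  --benchmarks eliza-format,bfcl,realm,context-bench \
  --max-examples 25 \
  --temperature 0.0 \
  --out /tmp/comparisons
```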
Output:
```
Comparison ID: cmp_20260504T120000Z_a1b2c3d4
A: vllm:elizaos/eliza-1-2b @ http://127.0.0.1:8001/v1
B: vllm:Qwen/Qwen3.5-2B @ http://127.0.0.1:8002/v1
Benchmarks: eliza-format, bfcl, realm, context-bench

benchmark      | A: vllm:elizaos/eliza-1-2b | B: vllm:Qwen/Qwen3.5-2B | delta (B-A) | winner
---------------+----------------------------+-------------------------+-------------+-------
eliza-format   | 0.9120                     | 0.7430                  | -0.1690     | A
bfcl           | 0.6840                     | 0.6920                  | +0.0080     | B
realm          | 0.5510                     | 0.5310                  | -0.0200     | A
context-bench  | 0.7400                     | 0.7250                  | -0.0150     | A

Wrote benchmarks/benchmark_results/comparisons/compare-cmp_20260504T120000Z_a1b2c3d4.json
```
Re-render a stored comparison:
```bash
/opt/miniconda3/bin/python -m benchmarks.orchestrator view-comparison \
  cmp_20260504T120000Z_a1b2c3d4
```

The `vllm` provider name is registered alongside `openai` / `groq` / `anthropic`: every benchmark CLI that already accepts `--provider` accepts `--provider vllm`, and the orchestrator forwards `OPENAI_BASE_URL` to the per-benchmark subprocess so OpenAI-compatible clients hit the vLLM endpoint without code changes. Override the default `http://127.0.0.1:8001/v1` via `@<base_url>` in the spec, the `VLLM_BASE_URL` env var, or the per-run `vllm_base_url` extra config.
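For example, the env-var route (the endpoint URL is illustrative):

```bash
# Point OpenAI-compatible clients at a custom vLLM endpoint for this run.
VLLM_BASE_URL=http://127.0.0.1:9000/v1 \
  /opt/miniconda3/bin/python -m benchmarks.orchestrator run \
  --benchmarks solana --provider vllm --model elizaos/eliza-1-2b
```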
Each run stores:
- benchmark ID + directory
- run ID + run group ID + signature + attempt
- status, duration, score, metrics, artifacts
- provider, model, agent label
- extra config used for the run
- benchmark and Eliza commit/version metadata
- high-score reference and delta