voicebench benchmarks end-to-end voice latency with Eliza using the TypeScript runtime.
For each runtime and each mode (`simple`, `non-simple`), the benchmark reports:

- transcription time (`TRANSCRIPTION` model)
- transcription accuracy against labels (when a dataset manifest includes expected text)
- response TTFT (time to first response token/chunk; falls back to response completion when streaming is unavailable)
- response total time
- speech-to-response-start (`transcriptionMs + responseTtftMs`)
- speech-to-voice-start (`transcriptionMs + responseTotalMs + firstSentenceTtsMs`) for cached and uncached first-sentence paths
- voice generation time (`TEXT_TO_SPEECH` model)
- voice first-token proxy (first-sentence synthesis) in two paths:
  - uncached first sentence
  - cached first sentence while synthesizing the remainder in parallel
- end-to-end time
- p95/p99 latency tails (transcription, response TTFT/total, TTS, voice TTFT, cached pipeline, end-to-end)
- in-context and out-context excerpts
- model input/output excerpts from trajectory logs (raw vs cleaned)
- thinking/XML tag detection counts on model raw output
- trajectory counts (provider accesses + LLM calls)
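The derived metrics above are simple sums of the per-stage timings, and the tails are plain percentiles over iterations. A minimal sketch (not the runner's actual code; the field names follow the metric list, the nearest-rank percentile choice is an assumption):

```typescript
// Per-iteration timings, named after the fields in the metric list above.
interface IterationTiming {
  transcriptionMs: number;
  responseTtftMs: number;
  responseTotalMs: number;
  firstSentenceTtsMs: number;
}

// speech-to-response-start = transcriptionMs + responseTtftMs
function speechToResponseStart(t: IterationTiming): number {
  return t.transcriptionMs + t.responseTtftMs;
}

// speech-to-voice-start = transcriptionMs + responseTotalMs + firstSentenceTtsMs
function speechToVoiceStart(t: IterationTiming): number {
  return t.transcriptionMs + t.responseTotalMs + t.firstSentenceTtsMs;
}

// Nearest-rank percentile, used here to illustrate the p95/p99 tails.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
```

For example, a 100 ms transcription plus a 50 ms response TTFT yields a 150 ms speech-to-response-start.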
Modes:

- `simple`: normal path, no benchmark context injected
- `non-simple`: injects `benchmarkContext` metadata so `CONTEXT_BENCH` forces the non-simple action loop
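The only difference between the two modes is whether the benchmark attaches `benchmarkContext` metadata to the message it sends. A hypothetical sketch (the message shape here is an assumption for illustration, not the runner's real type):

```typescript
// Assumed message shape -- the real runner's type may differ.
type BenchMessage = {
  text: string;
  metadata?: { benchmarkContext?: boolean };
};

// In non-simple mode, benchmarkContext metadata is attached so that
// CONTEXT_BENCH forces the non-simple action loop; simple mode sends
// the message unmodified.
function buildMessage(text: string, mode: "simple" | "non-simple"): BenchMessage {
  return mode === "non-simple"
    ? { text, metadata: { benchmarkContext: true } }
    : { text };
}
```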
Profiles:

- `groq`: Groq for transcription + response models + voice generation
- `elevenlabs`: Groq for response models, ElevenLabs for transcription + voice generation
- `mock`: deterministic in-process model, transcription, and TTS handlers for smoke tests without external credentials
Common:
- `VOICEBENCH_AUDIO_PATH` (optional; if unset, `run.sh` will try these defaults in order):
  - `benchmarks/voicebench/shared/audio/default.wav`
  - `examples/town/public/assets/background.mp3`
  - `agent-town/public/assets/background.mp3`
- `run.sh` resolves the selected path to an absolute path before invoking the TypeScript runner
- the `mock` profile additionally falls back to `benchmarks/voicebench/shared/mock-audio.txt`
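The fallback order above can be sketched as follows (the real logic lives in `run.sh`; this TypeScript version is illustrative, with the candidate paths copied from the list):

```typescript
import { existsSync } from "node:fs";
import { resolve } from "node:path";

// Default candidates, in the order run.sh tries them.
const DEFAULT_AUDIO_CANDIDATES = [
  "benchmarks/voicebench/shared/audio/default.wav",
  "examples/town/public/assets/background.mp3",
  "agent-town/public/assets/background.mp3",
];

function resolveAudioPath(env: Record<string, string | undefined>): string | undefined {
  // An explicit VOICEBENCH_AUDIO_PATH wins; otherwise try the defaults in order.
  const candidates = env.VOICEBENCH_AUDIO_PATH
    ? [env.VOICEBENCH_AUDIO_PATH]
    : DEFAULT_AUDIO_CANDIDATES;
  const found = candidates.find((p) => existsSync(p));
  // run.sh hands the runner an absolute path.
  return found ? resolve(found) : undefined;
}
```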
Mock profile:
- no external credentials
Groq profile:
- `GROQ_API_KEY`
- `GROQ_LARGE_MODEL` (optional; default: `openai/gpt-oss-120b`)
- `GROQ_SMALL_MODEL` (optional; default: `openai/gpt-oss-120b`)
- `GROQ_TRANSCRIPTION_MODEL` (optional; default: `whisper-large-v3-turbo`)
- `GROQ_TTS_MODEL` (optional; default: `canopylabs/orpheus-v1-english`)
- `GROQ_TTS_VOICE` (optional; default: `troy`)
- `GROQ_TTS_RESPONSE_FORMAT` (optional; default: `wav`)
ElevenLabs profile:
- `GROQ_API_KEY`
- `ELEVENLABS_API_KEY`
- `ELEVENLABS_MODEL_ID` (optional; default in `run.sh`: `eleven_flash_v2_5`)
- `ELEVENLABS_VOICE_ID` (optional; default in `run.sh`: `EXAVITQu4vr4xnSDxMaL`)
- `ELEVENLABS_OPTIMIZE_STREAMING_LATENCY` (optional; default in `run.sh`: `4`)
- `ELEVENLABS_OUTPUT_FORMAT` (optional; default in `run.sh`: `mp3_22050_32`)
```sh
cd benchmarks/voicebench
./run.sh --profile=mock --iterations=1 --dataset=fixtures/manifest-mock.json
./run.sh --profile=groq
./run.sh --profile=elevenlabs
```

Run the benchmark against a labeled dataset:
```sh
cd benchmarks/voicebench
./run.sh --profile=groq --dataset=fixtures/manifest-groq.json
./run.sh --profile=elevenlabs --dataset=fixtures/manifest-elevenlabs.json
```

Optional flags:
- `--iterations=N` (default from `shared/config.json`)
- `--ts-only` (no-op; only TypeScript runs); `--py-only` / `--rs-only` exit with an error
- `--output-dir=/absolute/or/relative/path`
- `--dataset=/path/to/manifest.json` (uses fixture samples instead of a single `VOICEBENCH_AUDIO_PATH`)
Results are written as JSON in `benchmarks/voicebench/results/`.
- Fixture prompts live in `benchmarks/voicebench/shared/fixture_prompts.jsonl`.
- Response verbosity is hard-capped via `responseMaxChars` in `benchmarks/voicebench/shared/config.json`.
- Fixture manifests include `samples[].id`, `samples[].text`, and `samples[].audioPath`.
- The TypeScript runner dynamically imports plugin packages from:
  - `plugins/plugin-groq`
  - `plugins/plugin-elevenlabs`
- If Bun reports missing plugin dependencies, install those plugin dependencies first.
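A fixture manifest using the fields listed above might look like this (the id, text, and path values are illustrative, not real fixtures from the repo):

```json
{
  "samples": [
    {
      "id": "sample-001",
      "text": "What's the weather like today?",
      "audioPath": "fixtures/audio/sample-001.wav"
    },
    {
      "id": "sample-002",
      "text": "Tell me a short story.",
      "audioPath": "fixtures/audio/sample-002.wav"
    }
  ]
}
```

When a manifest is passed via `--dataset`, each sample's `text` serves as the expected transcription label for the accuracy metric.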