Real-time voice transcription with intelligent turn detection using Speechmatics Voice SDK presets.
Learn how to use optimized preset configurations for different conversational AI use cases including voice assistants, note-taking, live captions, and multi-party conversations.
- How to use official Voice SDK presets
- Different turn detection modes (FIXED, ADAPTIVE, EXTERNAL) and Smart Turn ML
- How silence thresholds affect turn endings
- Sentence-based vs turn-based segmentation
- Real-time event handling with the Voice Agent client
- When to use each preset for optimal results
- Speechmatics API Key: Get one from portal.speechmatics.com
- Python 3.9+ (Voice SDK requires 3.9+)
- Microphone: Built-in or external microphone
- PyAudio: For microphone access (installation instructions below)
Step 1: Create and activate a virtual environment

On Windows:

```bash
cd python
python -m venv .venv
.venv\Scripts\activate
```

On Mac/Linux:

```bash
cd python
python3 -m venv .venv
source .venv/bin/activate
```

Step 2: Install dependencies

```bash
pip install -r requirements.txt
```

Important
The requirements include ML dependencies for Smart Turn detection (certifi, onnxruntime, transformers).

Step 3: Configure API key

```bash
cp ../.env.example .env
# Edit .env and add your SPEECHMATICS_API_KEY
```

Step 4: Run the example

```bash
python main.py
```

Select a preset from the menu (or press Enter for default), then speak into your microphone!
Note
This example demonstrates intelligent turn detection by:
- Loading preset configurations - Uses official SDK presets optimized for different use cases
- Setting up Voice Agent Client - Creates a client with the selected preset configuration
- Registering event handlers - Listens for partial segments, final segments, and turn endings
- Streaming microphone audio - Captures and sends audio in real-time
- Displaying results - Shows live transcription with speaker identification
- Detecting turn endings - Automatically identifies when the speaker has finished
The Voice SDK includes 7 optimized presets:
| Preset | Mode | Use Case | Silence Trigger | Key Feature |
|---|---|---|---|---|
| fast | FIXED | Real-time captions | 0.25s | Quick finalization |
| fixed | FIXED | General conversation | 0.5s | Fixed silence threshold |
| adaptive | ADAPTIVE | Voice assistants | 0.7s | Adapts to speech patterns |
| smart_turn | ADAPTIVE | Interviews | 0.8s | ML-based prediction (Smart Turn enabled) |
| scribe | ADAPTIVE | Note-taking | 1.0s | Sentence-level segments (Smart Turn enabled) |
| captions | FIXED | Live captioning | 0.5s | Consistent formatting |
| external | EXTERNAL | Push-to-talk | Manual | Custom control |
1. Loading Presets
```python
from speechmatics.voice import VoiceAgentConfigPreset

# Load preset from SDK (includes all optimized settings)
config = VoiceAgentConfigPreset.load("adaptive")

# Preset includes:
# - end_of_utterance_mode (ADAPTIVE)
# - silence_trigger (0.7s)
# - max_delay (0.7s)
# - operating_point (ENHANCED)
# - and more...
```

2. Creating the Client

```python
import os

from speechmatics.voice import VoiceAgentClient

client = VoiceAgentClient(
    api_key=os.getenv("SPEECHMATICS_API_KEY"),
    config=config,
)
```

3. Event Handlers
The example registers three event handlers:
Partial Segments (real-time updates):

```python
@client.on(AgentServerMessageType.ADD_PARTIAL_SEGMENT)
def on_partial(message):
    for segment in message.get("segments", []):
        print(f"\r> {segment['text']}", end="", flush=True)
```

Final Segments (complete transcription):

```python
@client.on(AgentServerMessageType.ADD_SEGMENT)
def on_final(message):
    for segment in message.get("segments", []):
        speaker = segment.get("speaker_id", "S1")
        text = segment["text"]
        print(f"\n[{speaker}]: {text}")
```

Turn Endings (speaker finished):

```python
@client.on(AgentServerMessageType.END_OF_TURN)
def on_turn_end(message):
    print("[END OF TURN]\n")
```

4. Streaming Audio

```python
from speechmatics.rt import Microphone

mic = Microphone(sample_rate=16000, chunk_size=320)
mic.start()

await client.connect()
while True:
    audio_chunk = await mic.read(320)
    await client.send_audio(audio_chunk)
```

5. Error Handling

```python
from speechmatics.rt import AuthenticationError

try:
    segments = await run_preset(preset_name)
except (AuthenticationError, ValueError) as e:
    print(f"\nAuthentication Error: {e}")

if not os.getenv("SPEECHMATICS_API_KEY"):
    print("Error: SPEECHMATICS_API_KEY not set")
    print("Please set it in your .env file")
```

Available Presets:
```text
======================================================================
1. fast - Quick finalization, best for real-time captions
2. fixed - Fixed silence threshold, general conversational use
3. adaptive - Adapts to speech patterns, best for voice assistants
4. smart_turn - ML-based turn detection for conversations
5. scribe - Optimized for note-taking and dictation
6. captions - Consistent formatting for live captioning
7. external - Manual turn control (Press ENTER to trigger)
======================================================================
Select preset number (or press Enter for adaptive): 3
======================================================================
PRESET: ADAPTIVE
======================================================================
Mode: adaptive
Operating Point: enhanced
Silence Trigger: 0.7s
Max Delay: 0.7s
Speak into your microphone. Press Ctrl+C to stop.
======================================================================
> Hello, I need help with my account
[S1]: Hello, I need help with my account.
[END OF TURN]
> I'm having trouble logging in and resetting my password
[S1]: I'm having trouble logging in and resetting my password.
[END OF TURN]
> Um, I tried the forgot password link but it's not sending me an email
[S1]: Um, I tried the forgot password link but it's not sending me an email.
[END OF TURN]
^C
Stopped. Captured 3 segments.
======================================================================
SUMMARY
======================================================================
1. [S1]: Hello, I need help with my account.
2. [S1]: I'm having trouble logging in and resetting my password.
3. [S1]: Um, I tried the forgot password link but it's not sending me an email.
======================================================================
```
```text
PRESET: SCRIBE
======================================================================
Mode: fixed
Operating Point: enhanced
Silence Trigger: 1.2s
Max Delay: 1.0s
> Meeting notes for January 15th
[S1]: Meeting notes for January 15th.
> First agenda item is quarterly review
[S1]: First agenda item is quarterly review.
[END OF TURN]
> Revenue increased by 23%
[S1]: Revenue increased by 23%.
> Customer satisfaction scores improved
[S1]: Customer satisfaction scores improved.
[END OF TURN]
```
Note
SCRIBE emits each sentence as a separate segment before END_OF_TURN. This is designed for note-taking where you want sentence-level granularity.
```text
======================================================================
PRESET: EXTERNAL
======================================================================
Mode: fixed
Operating Point: enhanced
Silence Trigger: 2.0s
Max Delay: 1.0s
Press ENTER to trigger end-of-utterance manually.
Press Ctrl+C to stop.
======================================================================
> Please transfer me to technical support.
[ENTER KEY DETECTED - Triggering End of Utterance]
[S1]: Please transfer me to technical support.
[END OF TURN]
> My internet connection keeps dropping every few minutes.
[ENTER KEY DETECTED - Triggering End of Utterance]
[S1]: My internet connection keeps dropping every few minutes.
[END OF TURN]
> I've already tried restarting the router twice.
[ENTER KEY DETECTED - Triggering End of Utterance]
[S1]: I've already tried restarting the router twice.
[END OF TURN]
^C
Stopped. Captured 3 segments.
```
Tip
EXTERNAL mode gives you full control over when turns end: in this demo, pressing Enter after each utterance triggers finalization immediately. This is ideal for push-to-talk interfaces or custom turn detection logic.
Turn Detection Modes:
- FIXED: Uniform silence threshold for all speakers - always waits the exact configured duration
- ADAPTIVE: Responds to speech characteristics including pace, pauses, and filler words - may finalize faster or slower than the reference value
- ADAPTIVE + Smart Turn: Uses ADAPTIVE mode with ML model to predict semantic turn completions - builds on an open-source turn detection model
- EXTERNAL: Manual control via Enter key (calls client.force_end_of_utterance()) - suitable for framework integrations like Pipecat/LiveKit
Segmentation Strategies:
- Turn-based: Single segment per complete utterance (most presets)
- Sentence-based: Multiple segments per utterance (SCRIBE, CAPTIONS)
Real-time Events:
- ADD_PARTIAL_SEGMENT: Live updates as you speak
- ADD_SEGMENT: Finalized transcription segments
- END_OF_TURN: Turn completion detection
Speaker Identification:
- Automatic speaker labeling (S1, S2, etc.)
- Diarization enabled by default in presets
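The speaker labels make it straightforward to build per-speaker transcripts downstream. A minimal sketch, assuming the message shape used by the event handlers in this example (a `segments` list whose items carry `speaker_id` and `text`); `group_by_speaker` is illustrative, not an SDK helper:

```python
from collections import defaultdict

def group_by_speaker(messages):
    """Bucket finalized segment text by diarization label (S1, S2, ...)."""
    transcript = defaultdict(list)
    for message in messages:
        for segment in message.get("segments", []):
            transcript[segment.get("speaker_id", "S1")].append(segment["text"])
    return dict(transcript)

# Message shape mirrors the ADD_SEGMENT handler above
messages = [
    {"segments": [{"speaker_id": "S1", "text": "Hello."}]},
    {"segments": [{"speaker_id": "S2", "text": "Hi there."},
                  {"speaker_id": "S1", "text": "How are you?"}]},
]
print(group_by_speaker(messages))
```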
```python
# Quick finalization for real-time captions
config = VoiceAgentConfigPreset.load("fast")

# Note-taking with sentence-level segments
config = VoiceAgentConfigPreset.load("scribe")

# ML-based turn detection for interviews
config = VoiceAgentConfigPreset.load("smart_turn")
```

```python
# Start with a preset and customize
base_config = VoiceAgentConfigPreset.load("adaptive")
custom_config = VoiceAgentConfig(
    language="es",             # Change to Spanish
    enable_diarization=False,  # Disable speaker labels
)

# Merge custom settings with preset
config = VoiceAgentConfigPreset._merge_configs(base_config, custom_config)
```

The EXTERNAL preset allows manual control over turn endings. This example uses FIXED mode with the maximum silence trigger (2s) to minimize auto-triggering, then calls force_end_of_utterance() when Enter is pressed:
```python
import asyncio

import keyboard  # Cross-platform keyboard input

from speechmatics.voice import (
    EndOfUtteranceMode,
    VoiceAgentClient,
    VoiceAgentConfig,
    VoiceAgentConfigPreset,
)

# Configure for manual control: FIXED mode with max silence trigger
# Server allows 0-2 seconds for silence trigger
config = VoiceAgentConfigPreset.load(
    "external",
    overlay_json=VoiceAgentConfig(
        end_of_utterance_mode=EndOfUtteranceMode.FIXED,
        end_of_utterance_silence_trigger=2.0,  # Max allowed by server
    ).model_dump_json(exclude_unset=True),
)

async def check_for_enter_key(client: VoiceAgentClient):
    """Background task to detect Enter key press."""
    while True:
        await asyncio.sleep(0.05)
        if keyboard.is_pressed("enter"):
            await client.force_end_of_utterance()  # Triggers immediate finalization
            await asyncio.sleep(0.3)  # Debounce

# Run as background task alongside audio streaming
enter_task = asyncio.create_task(check_for_enter_key(client))
```

Important
The end_of_utterance_silence_trigger setting only applies to FIXED mode. Using FIXED mode ensures the SDK properly handles the server's END_OF_UTTERANCE response. The server allows values between 0 and 2 seconds. Setting to 0 disables automatic detection entirely.
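Taken together, if you want turn endings driven purely by `force_end_of_utterance()`, a variant of the overlay above sets the trigger to 0 to switch automatic detection off entirely. This is a config sketch under the same API assumptions as the example above, not a separately verified recipe:

```python
from speechmatics.voice import VoiceAgentConfig, VoiceAgentConfigPreset, EndOfUtteranceMode

# 0 disables the automatic silence countdown entirely; turn endings
# then come only from explicit force_end_of_utterance() calls.
config = VoiceAgentConfigPreset.load(
    "external",
    overlay_json=VoiceAgentConfig(
        end_of_utterance_mode=EndOfUtteranceMode.FIXED,
        end_of_utterance_silence_trigger=0.0,
    ).model_dump_json(exclude_unset=True),
)
```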
Smart Turn dependencies are already included in requirements.txt. If installing manually:
```bash
# Install ML dependencies individually (quote specifiers so the shell
# doesn't interpret > and ,)
pip install "certifi>=2025.10.5"
pip install "onnxruntime>=1.19.0,<2"
pip install "transformers>=4.57.0,<5"

# Or use the Voice SDK bundle
pip install "speechmatics-voice[smart]"
```

The Voice SDK employs a multi-stage process to identify when a speaker has completed their turn:
- Audio input - VAD (Voice Activity Detection) continuously monitors for speech activity
- Silence identification - When speech stops, a timer begins counting down from the silence_trigger value
- Multiplier adjustments - The countdown duration is scaled based on detected speech patterns
- Smart Turn evaluation (if enabled) - An ML model predicts whether the utterance is semantically complete
- Turn completion - Once the adjusted countdown expires, the turn ends and segments are finalized
Tip
Consider silence_trigger as your starting reference:
- FIXED mode: Waits exactly this specified duration every time
- ADAPTIVE mode: Treats this as an initial value, then scales it up or down based on conversational context
The system analyzes speech patterns and applies scaling factors to the reference silence duration:
| Condition | Typical Multiplier | Effect |
|---|---|---|
| Very slow speaking pace | 3.0x | Extends wait significantly |
| Moderately slow pace | 2.0x | Doubles the wait time |
| Ends with filler words (um, uh) | 2.5x | Allows more time (speaker likely formulating thoughts) |
| Missing sentence-ending punctuation | 1.5x+ | Extends wait for completion |
| Smart Turn: high confidence complete | 0.1x | Rapid finalization |
| Smart Turn: likely incomplete | 1.7x | Extended wait |
Note
This multiplier system enables ADAPTIVE mode's intelligent behavior - it naturally extends wait times for hesitant speakers or those using filler words, while quickly finalizing clearly completed sentences.
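To make the arithmetic concrete, here is a toy calculator using the illustrative multipliers from the table. How the SDK actually combines multiple conditions is not documented here; the sketch assumes they multiply:

```python
# Illustrative values from the table above, not the SDK's internals
MULTIPLIERS = {
    "very_slow_pace": 3.0,
    "slow_pace": 2.0,
    "trailing_filler": 2.5,
    "no_end_punctuation": 1.5,
    "smart_turn_complete": 0.1,
    "smart_turn_incomplete": 1.7,
}

def effective_wait(silence_trigger, conditions):
    """Scale the reference silence trigger by each detected condition."""
    wait = silence_trigger
    for condition in conditions:
        wait *= MULTIPLIERS[condition]
    return round(wait, 3)

# With the ADAPTIVE reference of 0.7s:
print(effective_wait(0.7, []))                       # clean, punctuated sentence
print(effective_wait(0.7, ["trailing_filler"]))      # trailing "um" extends the wait
print(effective_wait(0.7, ["smart_turn_complete"]))  # confident completion finalizes fast
```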
Standard Presets (Turn-Based):
Input: "Hello. How are you today?"

Output:

```text
[S1]: Hello. How are you today.
[END OF TURN]
```

Result: 1 segment per turn
SCRIBE/CAPTIONS Presets (Sentence-Based):
Input: "Hello. How are you today?"

Output:

```text
[S1]: Hello.
[S1]: How are you today.
[END OF TURN]
```

Result: 2 segments (one per sentence)
Sentence-based segmentation is perfect for:
- Structured note-taking
- Creating bullet-point lists
- Generating captions with line breaks
- Separating distinct thoughts
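For instance, a note-taking consumer can map each sentence-level segment straight to a bullet point. A minimal sketch, assuming the segment payload shape used earlier in this guide; `segments_to_bullets` is an illustrative helper, not part of the SDK:

```python
def segments_to_bullets(segments):
    """Render sentence-level segments as a markdown bullet list."""
    return "\n".join(f"- {seg['text']}" for seg in segments if seg["text"].strip())

segments = [
    {"speaker_id": "S1", "text": "Revenue increased by 23%."},
    {"speaker_id": "S1", "text": "Customer satisfaction scores improved."},
]
print(segments_to_bullets(segments))
```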
| Preset | Threshold | Effect | Best For |
|---|---|---|---|
| FAST | 0.25s | May split mid-sentence | Fast speakers, captions |
| FIXED | 0.5s | Consistent timing | General conversation |
| ADAPTIVE | 0.7s | Balances speed and accuracy | Voice assistants |
| SCRIBE | 1.0s | Waits for complete thoughts | Dictation, notes |
The Voice SDK incorporates Silero VAD for detecting speech presence:
- Threshold: Default 0.35 (increasing this value makes detection more strict, potentially missing quieter speech)
- Min Duration: 150ms of sustained speech or silence required before confirming a state transition
Tip
In environments with background noise (traffic, air conditioning, etc.), consider adjusting the VAD threshold. Higher thresholds reduce false positives from noise but may occasionally miss softer speech.
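The threshold and minimum-duration rules can be illustrated with a toy smoother (this is not Silero VAD itself, just the debouncing idea): a raw per-frame decision only becomes the reported state after it has held for 150 ms:

```python
def vad_states(frame_probs, frame_ms=10, threshold=0.35, min_duration_ms=150):
    """Toy smoother: flip the reported state only after the raw per-frame
    decision has been stable for min_duration_ms (mirrors the 150ms rule)."""
    state = "silence"                 # reported state
    pending, pending_ms = state, 0    # candidate state and how long it has held
    states = []
    for p in frame_probs:
        raw = "speech" if p >= threshold else "silence"
        if raw == pending:
            pending_ms += frame_ms
        else:
            pending, pending_ms = raw, frame_ms
        if pending != state and pending_ms >= min_duration_ms:
            state = pending
        states.append(state)
    return states

# A 30ms blip above threshold never flips the state to "speech"
print(vad_states([0.9, 0.9, 0.9] + [0.1] * 5))
```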
| Mode | Best For | How It Works |
|---|---|---|
| FIXED | Predictable timing, captions | Consistently waits the exact silence trigger duration |
| ADAPTIVE | Voice assistants, conversations | Dynamically modifies wait time based on speech patterns (pace, filler words) |
| ADAPTIVE + Smart Turn | Interviews, complex conversations | ADAPTIVE mode enhanced with ML model to identify semantic turn boundaries |
| EXTERNAL | Pipecat, LiveKit, custom VAD | Application code controls turn endings via force_end_of_utterance() |
Note
EXTERNAL mode is intended for integration with frameworks such as Pipecat and LiveKit that implement their own voice activity detection. The SDK handles transcription while your application logic determines turn boundaries.
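The glue code for such an integration is small: whenever the external VAD (or framework callback) reports end of speech, call `force_end_of_utterance()`. A sketch with a stub standing in for `VoiceAgentClient` so the wiring runs offline; `StubClient` and `drive_turns` are illustrative names, not SDK APIs:

```python
import asyncio

class StubClient:
    """Offline stand-in for VoiceAgentClient, counting finalizations."""
    def __init__(self):
        self.turns_ended = 0

    async def force_end_of_utterance(self):
        self.turns_ended += 1

async def drive_turns(client, vad_events):
    """Finalize the current turn whenever the external VAD says speech ended."""
    for event in vad_events:
        if event == "speech_end":
            await client.force_end_of_utterance()

client = StubClient()
asyncio.run(drive_turns(client, ["speech_start", "speech_end",
                                 "speech_start", "speech_end"]))
print(client.turns_ended)  # one finalization per detected end of speech
```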
- Pipecat Integration - Build voice agents with Pipecat framework
- Real-time Translation - Add multilingual support
- Audio Intelligence - Sentiment and topic detection
- Turn Detection - Basic RT SDK turn detection
PyAudio Installation Issues
Windows:

```bash
# If pip install pyaudio fails, try:
pip install pipwin
pipwin install pyaudio

# Or download pre-built wheel from:
# https://www.lfd.uci.edu/~gohlke/pythonlibs/#pyaudio
```

Mac:

```bash
# Install portaudio first
brew install portaudio
pip install pyaudio
```

Linux (Ubuntu/Debian):

```bash
sudo apt-get install portaudio19-dev
pip install pyaudio
```

"Smart Turn not working"
```bash
# Dependencies should be installed from requirements.txt
# If you skipped them, install manually:
pip install "certifi>=2025.10.5" "onnxruntime>=1.19.0,<2" "transformers>=4.57.0,<5"

# Or use the Voice SDK bundle
pip install "speechmatics-voice[smart]"
```

"Microphone not available" message
- Check that PyAudio is installed: pip list | grep PyAudio
- Verify microphone permissions in system settings
- Test microphone with another application
"Too many short segments with FIXED mode"
- Speaker may have slow speech or frequent pauses
- Try ADAPTIVE instead of FAST
- Or use SCRIBE for longer silence threshold (1.0s)
"Not detecting turn endings"
- Ensure you're pausing for the silence threshold duration
- Check silence threshold for your selected preset
- EXTERNAL mode requires pressing Enter key to trigger turn endings
"Keyboard library permission error" (EXTERNAL mode)
The keyboard library requires elevated permissions on some operating systems:
macOS:

```bash
# Run with sudo
sudo python main.py

# Or grant Terminal/IDE accessibility permissions:
# System Preferences > Security & Privacy > Privacy > Accessibility
```

Linux:

```bash
# Run with sudo
sudo python main.py

# Or add user to input group (logout required):
sudo usermod -aG input $USER
```

Note
The keyboard library is only used for the EXTERNAL preset demo (Enter key detection). Other presets work without elevated permissions.
"Authentication failed" error
- Verify API key in .env file
- Check your key at portal.speechmatics.com
- Ensure no extra spaces in .env file
Help us improve this guide:
- Found an issue? Report it
- Have suggestions? Open a discussion
Time to Complete: 15 minutes
Difficulty: Intermediate
API Mode: Voice Agent (Real-time)
Languages: Python