Name	Name	Last commit message	Last commit date
parent directory ..
python	python
.env.example	.env.example
README.md	README.md

Voice Agent Turn Detection

Real-time voice transcription with intelligent turn detection using Speechmatics Voice SDK presets.

Learn how to use optimized preset configurations for different conversational AI use cases including voice assistants, note-taking, live captions, and multi-party conversations.

What You'll Learn

How to use official Voice SDK presets
Different turn detection modes (FIXED, ADAPTIVE, EXTERNAL) and Smart Turn ML
How silence thresholds affect turn endings
Sentence-based vs turn-based segmentation
Real-time event handling with the Voice Agent client
When to use each preset for optimal results

Prerequisites

Speechmatics API Key: Get one from portal.speechmatics.com
Python 3.9+ (Voice SDK requires 3.9+)
Microphone: Built-in or external microphone
PyAudio: For microphone access (installation instructions below)

Quick Start

Python

Step 1: Create and activate a virtual environment

On Windows:

cd python
python -m venv .venv
.venv\Scripts\activate

On Mac/Linux:

cd python
python3 -m venv .venv
source .venv/bin/activate

Step 2: Install dependencies

pip install -r requirements.txt

Important

The requirements include ML dependencies for Smart Turn detection (certifi, onnxruntime, transformers).

Step 3: Configure API key

cp ../.env.example .env
# Edit .env and add your SPEECHMATICS_API_KEY

Step 4: Run the example

python main.py

Select a preset from the menu (or press Enter for default), then speak into your microphone!

How It Works

Note

This example demonstrates intelligent turn detection by:

Loading preset configurations - Uses official SDK presets optimized for different use cases
Setting up Voice Agent Client - Creates a client with the selected preset configuration
Registering event handlers - Listens for partial segments, final segments, and turn endings
Streaming microphone audio - Captures and sends audio in real-time
Displaying results - Shows live transcription with speaker identification
Detecting turn endings - Automatically identifies when the speaker has finished

Available Presets

The Voice SDK includes 7 optimized presets:

Preset	Mode	Use Case	Silence Trigger	Key Feature
fast	FIXED	Real-time captions	0.25s	Quick finalization
fixed	FIXED	General conversation	0.5s	Fixed silence threshold
adaptive	ADAPTIVE	Voice assistants	0.7s	Adapts to speech patterns
smart_turn	ADAPTIVE	Interviews	0.8s	ML-based prediction (Smart Turn enabled)
scribe	ADAPTIVE	Note-taking	1.0s	Sentence-level segments (Smart Turn enabled)
captions	FIXED	Live captioning	0.5s	Consistent formatting
external	EXTERNAL	Push-to-talk	Manual	Custom control

Code Walkthrough

1. Loading Presets

from speechmatics.voice import VoiceAgentConfigPreset

# Load preset from SDK (includes all optimized settings)
config = VoiceAgentConfigPreset.load("adaptive")

# Preset includes:
# - end_of_utterance_mode (ADAPTIVE)
# - silence_trigger (1.0s)
# - max_delay (0.7s)
# - operating_point (ENHANCED)
# - and more...

2. Creating the Client

from speechmatics.voice import VoiceAgentClient

client = VoiceAgentClient(
    api_key=os.getenv("SPEECHMATICS_API_KEY"),
    config=config
)

3. Event Handlers

The example registers three event handlers:

Partial Segments (real-time updates):

@client.on(AgentServerMessageType.ADD_PARTIAL_SEGMENT)
def on_partial(message):
    for segment in message.get("segments", []):
        print(f"\r> {segment['text']}", end="", flush=True)

Final Segments (complete transcription):

@client.on(AgentServerMessageType.ADD_SEGMENT)
def on_final(message):
    for segment in message.get("segments", []):
        speaker = segment.get("speaker_id", "S1")
        text = segment["text"]
        print(f"\n[{speaker}]: {text}")

Turn Endings (speaker finished):

@client.on(AgentServerMessageType.END_OF_TURN)
def on_turn_end(message):
    print("[END OF TURN]\n")

4. Streaming Audio

from speechmatics.rt import Microphone

mic = Microphone(sample_rate=16000, chunk_size=320)
mic.start()

await client.connect()

while True:
    audio_chunk = await mic.read(320)
    await client.send_audio(audio_chunk)

5. Error Handling

from speechmatics.rt import AuthenticationError

try:
    segments = await run_preset(preset_name)
except (AuthenticationError, ValueError) as e:
    print(f"\nAuthentication Error: {e}")
    if not os.getenv("SPEECHMATICS_API_KEY"):
        print("Error: SPEECHMATICS_API_KEY not set")
        print("Please set it in your .env file")

Expected Output

Startup

Available Presets:
======================================================================
1. fast                     - Quick finalization, best for real-time captions
2. fixed                    - Fixed silence threshold, general conversational use
3. adaptive                 - Adapts to speech patterns, best for voice assistants
4. smart_turn               - ML-based turn detection for conversations
5. scribe                   - Optimized for note-taking and dictation
6. captions                 - Consistent formatting for live captioning
7. external                 - Manual turn control (Press ENTER to trigger)
======================================================================

Select preset number (or press Enter for adaptive): 3

Running - ADAPTIVE

======================================================================
PRESET: ADAPTIVE
======================================================================
Mode: adaptive
Operating Point: enhanced
Silence Trigger: 0.7s
Max Delay: 0.7s

Speak into your microphone. Press Ctrl+C to stop.
======================================================================

> Hello, I need help with my account
[S1]: Hello, I need help with my account.
[END OF TURN]

> I'm having trouble logging in and resetting my password
[S1]: I'm having trouble logging in and resetting my password.
[END OF TURN]

> Um, I tried the forgot password link but it's not sending me an email
[S1]: Um, I tried the forgot password link but it's not sending me an email.
[END OF TURN]

^C

Stopped. Captured 3 segments.

======================================================================
SUMMARY
======================================================================
1. [S1]: Hello, I need help with my account.
2. [S1]: I'm having trouble logging in and resetting my password.
3. [S1]: Um, I tried the forgot password link but it's not sending me an email.

Running - SCRIBE (Sentence-Based)

======================================================================
PRESET: SCRIBE
======================================================================
Mode: fixed
Operating Point: enhanced
Silence Trigger: 1.2s
Max Delay: 1.0s

> Meeting notes for January 15th
[S1]: Meeting notes for January 15th.
> First agenda item is quarterly review
[S1]: First agenda item is quarterly review.
[END OF TURN]

> Revenue increased by 23%
[S1]: Revenue increased by 23%.
> Customer satisfaction scores improved
[S1]: Customer satisfaction scores improved.
[END OF TURN]

Note

SCRIBE emits each sentence as a separate segment before END_OF_TURN. This is designed for note-taking where you want sentence-level granularity.

Running - EXTERNAL (Manual Control)

======================================================================
PRESET: EXTERNAL
======================================================================
Mode: fixed
Operating Point: enhanced
Silence Trigger: 2.0s
Max Delay: 1.0s

Press ENTER to trigger end-of-utterance manually.
Press Ctrl+C to stop.
======================================================================

> Please transfer me to technical support.
[ENTER KEY DETECTED - Triggering End of Utterance]

[S1]: Please transfer me to technical support.
[END OF TURN]

> My internet connection keeps dropping every few minutes.
[ENTER KEY DETECTED - Triggering End of Utterance]

[S1]: My internet connection keeps dropping every few minutes.
[END OF TURN]

> I've already tried restarting the router twice.
[ENTER KEY DETECTED - Triggering End of Utterance]

[S1]: I've already tried restarting the router twice.
[END OF TURN]

^C

Stopped. Captured 3 segments.

Tip

EXTERNAL mode gives you full control over when turns end. Uses manual mode (Enter Key for demo) after each utterance to trigger finalization immediately. This is ideal for push-to-talk interfaces or custom turn detection logic.

Key Features Demonstrated

Turn Detection Modes:

FIXED: Uniform silence threshold for all speakers - always waits the exact configured duration
ADAPTIVE: Responds to speech characteristics including pace, pauses, and filler words - may finalize faster or slower than the reference value
ADAPTIVE + Smart Turn: Uses ADAPTIVE mode with ML model to predict semantic turn completions - builds on an open-source turn detection model
EXTERNAL: Manual control via Enter key (calls client.force_end_of_utterance()) - suitable for framework integrations like Pipecat/LiveKit

Segmentation Strategies:

Turn-based: Single segment per complete utterance (most presets)
Sentence-based: Multiple segments per utterance (SCRIBE, CAPTIONS)

Real-time Events:

ADD_PARTIAL_SEGMENT: Live updates as you speak
ADD_SEGMENT: Finalized transcription segments
END_OF_TURN: Turn completion detection

Speaker Identification:

Automatic speaker labeling (S1, S2, etc.)
Diarization enabled by default in presets

Configuration Options

Using Different Presets

# Quick finalization for real-time captions
config = VoiceAgentConfigPreset.load("fast")

# Note-taking with sentence-level segments
config = VoiceAgentConfigPreset.load("scribe")

# ML-based turn detection for interviews
config = VoiceAgentConfigPreset.load("smart_turn")

Custom Overlay on Presets

# Start with a preset and customize
base_config = VoiceAgentConfigPreset.load("adaptive")

custom_config = VoiceAgentConfig(
    language="es",  # Change to Spanish
    enable_diarization=False,  # Disable speaker labels
)

# Merge custom settings with preset
config = VoiceAgentConfigPreset._merge_configs(base_config, custom_config)

Using EXTERNAL Mode (Manual Turn Control)

The EXTERNAL preset allows manual control over turn endings. This example uses FIXED mode with the maximum silence trigger (2s) to minimize auto-triggering, then uses force_end_of_utterance() when Enter is pressed:

import keyboard  # Cross-platform keyboard input
from speechmatics.voice import VoiceAgentConfig, EndOfUtteranceMode

# Configure for manual control: FIXED mode with max silence trigger
# Server allows 0-2 seconds for silence trigger
config = VoiceAgentConfigPreset.load(
    "external",
    overlay_json=VoiceAgentConfig(
        end_of_utterance_mode=EndOfUtteranceMode.FIXED,
        end_of_utterance_silence_trigger=2.0,  # Max allowed by server
    ).model_dump_json(exclude_unset=True)
)

async def check_for_enter_key(client: VoiceAgentClient):
    """Background task to detect Enter key press."""
    while True:
        await asyncio.sleep(0.05)
        if keyboard.is_pressed("enter"):
            await client.force_end_of_utterance()  # Triggers immediate finalization
            await asyncio.sleep(0.3)  # Debounce

# Run as background task alongside audio streaming
enter_task = asyncio.create_task(check_for_enter_key(client))

Important

The end_of_utterance_silence_trigger setting only applies to FIXED mode. Using FIXED mode ensures the SDK properly handles the server's END_OF_UTTERANCE response. The server allows values between 0 and 2 seconds. Setting to 0 disables automatic detection entirely.

Enable Smart Turn Detection

Smart Turn dependencies are already included in requirements.txt. If installing manually:

# Install ML dependencies individually
pip install certifi>=2025.10.5
pip install onnxruntime>=1.19.0,<2
pip install transformers>=4.57.0,<5

# Or use the Voice SDK bundle
pip install speechmatics-voice[smart]

Understanding Turn Detection Behavior

How Turn Detection Works Under the Hood

The Voice SDK employs a multi-stage process to identify when a speaker has completed their turn:

Audio input - VAD (Voice Activity Detection) continuously monitors for speech activity
Silence identification - When speech stops, a timer begins counting down from the silence_trigger value
Multiplier adjustments - The countdown duration is scaled based on detected speech patterns
Smart Turn evaluation (if enabled) - An ML model predicts whether the utterance is semantically complete
Turn completion - Once the adjusted countdown expires, the turn ends and segments are finalized

Tip

Consider silence_trigger as your starting reference:

FIXED mode: Waits exactly this specified duration every time
ADAPTIVE mode: Treats this as an initial value, then scales it up or down based on conversational context

How Multipliers Adjust Wait Time

The system analyzes speech patterns and applies scaling factors to the reference silence duration:

Condition	Typical Multiplier	Effect
Very slow speaking pace	3.0x	Extends wait significantly
Moderately slow pace	2.0x	Doubles the wait time
Ends with filler words (um, uh)	2.5x	Allows more time (speaker likely formulating thoughts)
Missing sentence-ending punctuation	1.5x+	Extends wait for completion
Smart Turn: high confidence complete	0.1x	Rapid finalization
Smart Turn: likely incomplete	1.7x	Extended wait

Note

This multiplier system enables ADAPTIVE mode's intelligent behavior - it naturally extends wait times for hesitant speakers or those using filler words, while quickly finalizing clearly completed sentences.

Comparison: Sentence vs Turn Segmentation

Standard Presets (Turn-Based):

Input: "Hello. How are you today?"

Output:
[S1]: Hello. How are you today.
[END OF TURN]

Result: 1 segment per turn

SCRIBE/CAPTIONS Presets (Sentence-Based):

Input: "Hello. How are you today?"

Output:
[S1]: Hello.
[S1]: How are you today.
[END OF TURN]

Result: 2 segments (one per sentence)

Sentence-based segmentation is perfect for:

Structured note-taking
Creating bullet-point lists
Generating captions with line breaks
Separating distinct thoughts

Silence Threshold Impact

Preset	Threshold	Effect	Best For
FAST	0.25s	May split mid-sentence	Fast speakers, captions
FIXED	0.5s	Consistent timing	General conversation
ADAPTIVE	0.7s	Balances speed and accuracy	Voice assistants
SCRIBE	1.0s	Waits for complete thoughts	Dictation, notes

VAD (Voice Activity Detection)

The Voice SDK incorporates Silero VAD for detecting speech presence:

Threshold: Default 0.35 (increasing this value makes detection more strict, potentially missing quieter speech)
Min Duration: 150ms of sustained speech or silence required before confirming a state transition

Tip

In environments with background noise (traffic, air conditioning, etc.), consider adjusting the VAD threshold. Higher thresholds reduce false positives from noise but may occasionally miss softer speech.

When to Use Each Mode

Mode	Best For	How It Works
FIXED	Predictable timing, captions	Consistently waits the exact silence trigger duration
ADAPTIVE	Voice assistants, conversations	Dynamically modifies wait time based on speech patterns (pace, filler words)
ADAPTIVE + Smart Turn	Interviews, complex conversations	ADAPTIVE mode enhanced with ML model to identify semantic turn boundaries
EXTERNAL	Pipecat, LiveKit, custom VAD	Application code controls turn endings via `force_end_of_utterance()`

Note

EXTERNAL mode is intended for integration with frameworks such as Pipecat and LiveKit that implement their own voice activity detection. The SDK handles transcription while your application logic determines turn boundaries.

Next Steps

Pipecat Integration - Build voice agents with Pipecat framework
Real-time Translation - Add multilingual support
Audio Intelligence - Sentiment and topic detection
Turn Detection - Basic RT SDK turn detection

Troubleshooting

PyAudio Installation Issues

Windows:

# If pip install pyaudio fails, try:
pip install pipwin
pipwin install pyaudio

# Or download pre-built wheel from:
# https://www.lfd.uci.edu/~gohlke/pythonlibs/#pyaudio

Mac:

# Install portaudio first
brew install portaudio
pip install pyaudio

Linux (Ubuntu/Debian):

sudo apt-get install portaudio19-dev
pip install pyaudio

"Smart Turn not working"

# Dependencies should be installed from requirements.txt
# If you skipped them, install manually:
pip install certifi>=2025.10.5 onnxruntime>=1.19.0,<2 transformers>=4.57.0,<5

# Or use the Voice SDK bundle
pip install speechmatics-voice[smart]

"Microphone not available" message

Check that PyAudio is installed: pip list | grep PyAudio
Verify microphone permissions in system settings
Test microphone with another application

"Too many short segments with FIXED mode"

Speaker may have slow speech or frequent pauses
Try ADAPTIVE instead of FAST
Or use SCRIBE for longer silence threshold (1.0s)

"Not detecting turn endings"

Ensure you're pausing for the silence threshold duration
Check silence threshold for your selected preset
EXTERNAL mode requires pressing Enter key to trigger turn endings

"Keyboard library permission error" (EXTERNAL mode)

The keyboard library requires elevated permissions on some operating systems:

macOS:

# Run with sudo
sudo python main.py

# Or grant Terminal/IDE accessibility permissions:
# System Preferences > Security & Privacy > Privacy > Accessibility

Linux:

# Run with sudo
sudo python main.py

# Or add user to input group (logout required):
sudo usermod -aG input $USER

Note

The keyboard library is only used for the EXTERNAL preset demo (Enter key detection). Other presets work without elevated permissions.

"Authentication failed" error

Verify API key in .env file
Check your key at portal.speechmatics.com
Ensure no extra spaces in .env file

Resources

Feedback

Help us improve this guide:

Found an issue? Report it
Have suggestions? Open a discussion

Time to Complete: 15 minutes Difficulty: Intermediate API Mode: Voice Agent (Real-time) Languages: Python

Back to Basics | Back to Academy

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

Voice Agent Turn Detection

What You'll Learn

Prerequisites

Quick Start

Python

How It Works

Available Presets

Code Walkthrough

Expected Output

Startup

Running - ADAPTIVE

Running - SCRIBE (Sentence-Based)

Running - EXTERNAL (Manual Control)

Key Features Demonstrated

Configuration Options

Using Different Presets

Custom Overlay on Presets

Using EXTERNAL Mode (Manual Turn Control)

Enable Smart Turn Detection

Understanding Turn Detection Behavior

How Turn Detection Works Under the Hood

How Multipliers Adjust Wait Time

Comparison: Sentence vs Turn Segmentation

Silence Threshold Impact

VAD (Voice Activity Detection)

When to Use Each Mode

Next Steps

Troubleshooting

Resources

Feedback

FilesExpand file tree

08-voice-agent-turn-detection

Directory actions

More options

Directory actions

More options

Latest commit

History

08-voice-agent-turn-detection

Folders and files

parent directory

README.md

Voice Agent Turn Detection

What You'll Learn

Prerequisites

Quick Start

Python

How It Works

Available Presets

Code Walkthrough

Expected Output

Startup

Running - ADAPTIVE

Running - SCRIBE (Sentence-Based)

Running - EXTERNAL (Manual Control)

Key Features Demonstrated

Configuration Options

Using Different Presets

Custom Overlay on Presets

Using EXTERNAL Mode (Manual Turn Control)

Enable Smart Turn Detection

Understanding Turn Detection Behavior

How Turn Detection Works Under the Hood

How Multipliers Adjust Wait Time

Comparison: Sentence vs Turn Segmentation

Silence Threshold Impact

VAD (Voice Activity Detection)

When to Use Each Mode

Next Steps

Troubleshooting

Resources

Feedback