Kyutai STT-1B: Real Time Speech-to-Text Demo and Analysis

Why We Discarded Whisper and Explored Kyutai STT-1B: A Perspective on Real Time Speech Recognition

This repository demonstrates the use of Kyutai Labs' STT-1B model for real time speech recognition, highlighting its advantages over traditional models like OpenAI's Whisper. The included Jupyter notebook (STT_kyutai.ipynb) showcases installation, transcription, and real time Voice Activity Detection (VAD) simulation on sample audio files.

Why Kyutai STT-1B? A Shift to Streaming-First Speech Recognition

While developing an proprietary AI Bot, We encountered significant limitations with OpenAI's Whisper model. Despite its strong benchmark accuracy, Whisper's design optimized for complete audio sequences led to practical issues in production:

Latency Issues: Requires full audio segments for processing, introducing 3-5 second delays that hinder real time applications.
Memory Consumption: Spikes with longer recordings, making it inefficient for continuous streams.
Lack of Speech Boundaries: No built in detection for speech start/end, requiring additional tools for segmenting live calls.
User Feedback: Delays made the system "hard to use for live calls," reducing overall viability.

Kyutai Labs' STT-1B addresses these with a streaming-native architecture based on Delayed Streams Modeling (DSM). This model is built for low-latency, production-scale voice applications, offering:

True Streaming Inference: Processes audio in real time with configurable delays (as low as 0.5 seconds).
Semantic Voice Activity Detection (VAD): Integrated into the model for automatic speech boundary detection no extra post processing needed.
Word-Level Timestamps: Native extraction without forced alignment.
Efficiency and Scalability: Supports batching for 400+ concurrent streams, with 40% lower peak memory usage compared to Whisper.
Multilingual Capabilities: Handles English and French with automatic language detection.
Performance Gains: Reduced response latency significantly, improved user satisfaction, and 400x better scalability for concurrent processing.

This isn't merely an optimization; it's a paradigm shift toward models designed for real time constraints rather than batch processing.

Performance Impact

Response latency: 5 seconds → 0.5 seconds
Memory efficiency: 40% reduction in peak usage
User experience scores: Improved
Concurrent processing: 1x → 400x scalability

Model Architecture: Kyutai STT-1B Explained

Kyutai STT-1B leverages the Moshi audio tokenization framework combined with a Transformer decoder architecture, optimized for streaming:

Core Components:
- Moshi Tokenizer: Converts raw audio into discrete tokens, enabling efficient handling of streaming input. This framework supports multimodal audio processing and is key to the model's low-latency design.
- Transformer Decoder with DSM: Unlike Whisper's encoder-decoder setup (which needs full context), STT-1B uses Delayed Streams Modeling to process audio incrementally. It predicts tokens with a configurable delay, balancing accuracy and speed.
- Semantic VAD Integration: The model inherently detects voice activity by analyzing semantic context in the audio stream, eliminating the need for separate VAD models like Silero or WebRTC.
- Output Features: Generates transcriptions with word level timestamps and confidence scores, directly usable in applications.
Comparison to Whisper:
- Whisper: Encoder-decoder Transformer, excels in batch processing of pre-recorded audio (50+ languages), requires full sequences for best results; timestamps via post-hoc alignment.
- Kyutai STT: Decoder-only with streaming focus; optimized for live audio (English/French), native timestamps and VAD, ideal for low-latency apps but with a narrower language scope.

When to Choose:

Whisper: For offline, high accuracy transcription of diverse languages in batch workflows.
Kyutai STT: For real time voice apps needing streaming, built in VAD, and scalability.

The model is hosted on Hugging Face as kyutai/stt-1b-en_fr and requires the moshi library for inference.

What I Did in the Notebook (`STT_kyutai.ipynb`)

The notebook provides a hands on demo of installing and using Kyutai STT-1B via the Moshi library. It processes sample audio files, performs transcription, and simulates real time VAD. Here's a breakdown:

1. Installation and Setup

Install the moshi library: !pip install moshi.
Dependencies include numpy,huggingface-hub, bitsandbytes, sentencepiece, librosa, sphn, torch, aiohttp.
Environment: Python 3.12 with CUDA support for GPU acceleration.

2. Model Loading and Transcription

Load the model: kyutai/stt-1b-en_fr.
Process audio files (e.g., sample_fr_hibiki_crepes.mp3):
- Transcription output: Full text of the audio.
- Metrics calculated:
  - Total words
  - Estimated speaking time
  - Pauses
  - Speech density
  - Words per second
  - Voice activity
VAD Processing: Analyzes audio duration, speech/silence ratio, and silence percentage.

3. Real Time VAD Simulation

Function: real_time_vad_simulation(audio_file, chunk_duration=2.0).
Simulates streaming VAD by chunking audio into 2-second segments.
For sample_fr_hibiki_crepes.mp3:
- 29 chunks analyzed.
- Detects speech/silence based on energy thresholds (e.g., Speech if energy > 0.01).
- Output: Timestamped chunks with classifications (e.g., [1/29] 0.00s - 2.00s: Silence (Energy: 0.0045)).
Additional simulation on bria.mp3 (44.85s duration):
- 23 chunks, all classified as speech with varying energy levels.

This simulation demonstrates how STT-1B's streaming capabilities enable real time processing, mimicking live call scenarios.

Getting Started

Clone the repo: git clone https://github.com/imprasukjain/Kyutai-STT.
Install dependencies directly in the notebook.
Run the notebook: Use Google Colab (with GPU) or a local Jupyter environment.
Test with your audio: Replace sample files and run the transcription/VAD functions.

Requirements

Python 3.12+
CUDA-enabled GPU for optimal performance
Libraries: As listed in the notebook (no additional pip installs beyond moshi).

Limitations and Future Work

Currently supports English/French; expand to more languages as Kyutai evolves.
VAD simulation is energy based; integrate full semantic VAD for production.
Explore integration with web sockets for live streaming apps.

Our Perspective

From a model evaluation standpoint, accuracy metrics tell only part of the story. For production voice applications, system latency, memory efficiency, and scalability often matter more than marginal accuracy improvements.

Kyutai STT represents an important evolution in speech recognition optimizing for the constraints of real time applications rather than just benchmark performance.

The key insight: Sometimes the "best" model isn't the one with the highest accuracy, but the one that best fits your deployment constraints.

References

Kyutai Labs: Official Model
Moshi Framework: GitHub
Whisper: OpenAI

Contributions welcome! If you have experiences with real time STT, open an issue or PR.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
LICENSE		LICENSE
README.md		README.md
STT_kyutai.ipynb		STT_kyutai.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Kyutai STT-1B: Real Time Speech-to-Text Demo and Analysis

Why We Discarded Whisper and Explored Kyutai STT-1B: A Perspective on Real Time Speech Recognition

Why Kyutai STT-1B? A Shift to Streaming-First Speech Recognition

Performance Impact

Model Architecture: Kyutai STT-1B Explained

What I Did in the Notebook (`STT_kyutai.ipynb`)

1. Installation and Setup

2. Model Loading and Transcription

3. Real Time VAD Simulation

Getting Started

Requirements

Limitations and Future Work

Our Perspective

References

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Kyutai STT-1B: Real Time Speech-to-Text Demo and Analysis

Why We Discarded Whisper and Explored Kyutai STT-1B: A Perspective on Real Time Speech Recognition

Why Kyutai STT-1B? A Shift to Streaming-First Speech Recognition

Performance Impact

Model Architecture: Kyutai STT-1B Explained

What I Did in the Notebook (STT_kyutai.ipynb)

1. Installation and Setup

2. Model Loading and Transcription

3. Real Time VAD Simulation

Getting Started

Requirements

Limitations and Future Work

Our Perspective

References

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

What I Did in the Notebook (`STT_kyutai.ipynb`)

Packages