This repository demonstrates the use of Kyutai Labs' STT-1B model for real time speech recognition, highlighting its advantages over traditional models like OpenAI's Whisper. The included Jupyter notebook (STT_kyutai.ipynb) showcases installation, transcription, and real time Voice Activity Detection (VAD) simulation on sample audio files.
While developing an proprietary AI Bot, We encountered significant limitations with OpenAI's Whisper model. Despite its strong benchmark accuracy, Whisper's design optimized for complete audio sequences led to practical issues in production:
- Latency Issues: Requires full audio segments for processing, introducing 3-5 second delays that hinder real time applications.
- Memory Consumption: Spikes with longer recordings, making it inefficient for continuous streams.
- Lack of Speech Boundaries: No built in detection for speech start/end, requiring additional tools for segmenting live calls.
- User Feedback: Delays made the system "hard to use for live calls," reducing overall viability.
Kyutai Labs' STT-1B addresses these with a streaming-native architecture based on Delayed Streams Modeling (DSM). This model is built for low-latency, production-scale voice applications, offering:
- True Streaming Inference: Processes audio in real time with configurable delays (as low as 0.5 seconds).
- Semantic Voice Activity Detection (VAD): Integrated into the model for automatic speech boundary detection no extra post processing needed.
- Word-Level Timestamps: Native extraction without forced alignment.
- Efficiency and Scalability: Supports batching for 400+ concurrent streams, with 40% lower peak memory usage compared to Whisper.
- Multilingual Capabilities: Handles English and French with automatic language detection.
- Performance Gains: Reduced response latency significantly, improved user satisfaction, and 400x better scalability for concurrent processing.
This isn't merely an optimization; it's a paradigm shift toward models designed for real time constraints rather than batch processing.
- Response latency: 5 seconds → 0.5 seconds
- Memory efficiency: 40% reduction in peak usage
- User experience scores: Improved
- Concurrent processing: 1x → 400x scalability
Kyutai STT-1B leverages the Moshi audio tokenization framework combined with a Transformer decoder architecture, optimized for streaming:
-
Core Components:
- Moshi Tokenizer: Converts raw audio into discrete tokens, enabling efficient handling of streaming input. This framework supports multimodal audio processing and is key to the model's low-latency design.
- Transformer Decoder with DSM: Unlike Whisper's encoder-decoder setup (which needs full context), STT-1B uses Delayed Streams Modeling to process audio incrementally. It predicts tokens with a configurable delay, balancing accuracy and speed.
- Semantic VAD Integration: The model inherently detects voice activity by analyzing semantic context in the audio stream, eliminating the need for separate VAD models like Silero or WebRTC.
- Output Features: Generates transcriptions with word level timestamps and confidence scores, directly usable in applications.
-
Comparison to Whisper:
- Whisper: Encoder-decoder Transformer, excels in batch processing of pre-recorded audio (50+ languages), requires full sequences for best results; timestamps via post-hoc alignment.
- Kyutai STT: Decoder-only with streaming focus; optimized for live audio (English/French), native timestamps and VAD, ideal for low-latency apps but with a narrower language scope.
When to Choose:
- Whisper: For offline, high accuracy transcription of diverse languages in batch workflows.
- Kyutai STT: For real time voice apps needing streaming, built in VAD, and scalability.
The model is hosted on Hugging Face as kyutai/stt-1b-en_fr and requires the moshi library for inference.
The notebook provides a hands on demo of installing and using Kyutai STT-1B via the Moshi library. It processes sample audio files, performs transcription, and simulates real time VAD. Here's a breakdown:
- Install the
moshilibrary:!pip install moshi. - Dependencies include
numpy,huggingface-hub,bitsandbytes,sentencepiece,librosa,sphn,torch,aiohttp. - Environment: Python 3.12 with CUDA support for GPU acceleration.
- Load the model:
kyutai/stt-1b-en_fr. - Process audio files (e.g.,
sample_fr_hibiki_crepes.mp3):- Transcription output: Full text of the audio.
- Metrics calculated:
- Total words
- Estimated speaking time
- Pauses
- Speech density
- Words per second
- Voice activity
- VAD Processing: Analyzes audio duration, speech/silence ratio, and silence percentage.
- Function:
real_time_vad_simulation(audio_file, chunk_duration=2.0). - Simulates streaming VAD by chunking audio into 2-second segments.
- For
sample_fr_hibiki_crepes.mp3:- 29 chunks analyzed.
- Detects speech/silence based on energy thresholds (e.g., Speech if energy > 0.01).
- Output: Timestamped chunks with classifications (e.g., [1/29] 0.00s - 2.00s: Silence (Energy: 0.0045)).
- Additional simulation on
bria.mp3(44.85s duration):- 23 chunks, all classified as speech with varying energy levels.
This simulation demonstrates how STT-1B's streaming capabilities enable real time processing, mimicking live call scenarios.
- Clone the repo:
git clone https://github.com/imprasukjain/Kyutai-STT. - Install dependencies directly in the notebook.
- Run the notebook: Use Google Colab (with GPU) or a local Jupyter environment.
- Test with your audio: Replace sample files and run the transcription/VAD functions.
- Python 3.12+
- CUDA-enabled GPU for optimal performance
- Libraries: As listed in the notebook (no additional pip installs beyond
moshi).
- Currently supports English/French; expand to more languages as Kyutai evolves.
- VAD simulation is energy based; integrate full semantic VAD for production.
- Explore integration with web sockets for live streaming apps.
From a model evaluation standpoint, accuracy metrics tell only part of the story. For production voice applications, system latency, memory efficiency, and scalability often matter more than marginal accuracy improvements.
Kyutai STT represents an important evolution in speech recognition optimizing for the constraints of real time applications rather than just benchmark performance.
The key insight: Sometimes the "best" model isn't the one with the highest accuracy, but the one that best fits your deployment constraints.
- Kyutai Labs: Official Model
- Moshi Framework: GitHub
- Whisper: OpenAI
Contributions welcome! If you have experiences with real time STT, open an issue or PR.