rana-rohit/speech-non-speech-classifier

Speech & Non-Speech Audio Recognizer

A modular audio analysis pipeline that separates, transcribes, and classifies audio content using machine learning.

Features

  • Source Separation - Isolate vocals from background using Demucs
  • Speech Recognition - Transcribe speech (99+ languages) using OpenAI Whisper
  • Sound Classification - Identify 521 sound categories using YAMNet
  • Noise Reduction - Spectral gating for cleaner audio

Project Structure

├── main.py                     # Pipeline orchestrator
├── config.py                   # Configuration settings
├── requirements.txt            # Dependencies
├── src/
│   ├── separator.py            # Audio source separation (Demucs)
│   ├── speech_analyser.py      # Speech transcription (Whisper)
│   ├── non_speech_analyser.py  # Sound classification (YAMNet)
│   └── denoise.py              # Optional noise reduction
├── samples/                    # Input audio files
└── output/
    ├── separated/              # Separated audio stems
    ├── transcriptions/         # Text output
    └── reports/                # Final analysis

Installation

git clone https://github.com/Harshita20052809/Speech_nonspeech_recognizer.git
cd Speech_nonspeech_recognizer
pip install -r requirements.txt

Usage

  1. Place your audio file in samples/ or update config.py:

    INPUT_AUDIO = os.path.join(BASE_DIR, "samples", "your_file.wav")
  2. Run the pipeline:

    python main.py
  3. Check output/reports/final_report.txt for results.

Pipeline

Input Audio (.wav)
    ↓
[1] Separation (Demucs) → vocals.wav, other.wav
    ↓
[2] Transcription (Whisper) → transcription.txt
    ↓
[3] Classification (YAMNet) → nonspeech_report.txt
    ↓
[4] Report Generation → final_report.txt
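The four stages above can be sketched as a small orchestrator. The function names and signatures below are illustrative stand-ins, not main.py's actual API; each stage is passed in as a callable so the flow mirrors the diagram without tying it to specific models:

```python
from pathlib import Path

def run_pipeline(input_audio, out_dir, separate, transcribe, classify):
    """Illustrative orchestrator for the four-stage pipeline.

    separate(audio, dest)  -> (vocals_path, other_path)
    transcribe(vocals)     -> transcript text
    classify(other)        -> list of (label, score) pairs
    """
    out = Path(out_dir)
    for sub in ("separated", "transcriptions", "reports"):
        (out / sub).mkdir(parents=True, exist_ok=True)

    # [1] Separation -> vocal stem + accompaniment
    vocals, other = separate(input_audio, out / "separated")
    # [2] Transcription of the vocal stem
    text = transcribe(vocals)
    (out / "transcriptions" / "transcription.txt").write_text(text)
    # [3] Classification of the non-vocal stem
    sounds = classify(other)
    # [4] Report generation
    report = "Transcript:\n" + text + "\n\nTop sounds:\n" + "\n".join(
        f"{label}: {score:.2f}" for label, score in sounds)
    (out / "reports" / "final_report.txt").write_text(report)
    return report
```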

Models

| Component      | Model             | Description                              |
|----------------|-------------------|------------------------------------------|
| Separation     | Demucs (htdemucs) | Hybrid transformer for source separation |
| Transcription  | Whisper (large)   | Multilingual speech recognition          |
| Classification | YAMNet            | Audio event classification (521 classes) |
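YAMNet emits one score vector per audio frame across its 521 classes, so turning its output into the top-N list for the report reduces to averaging over time and sorting. A minimal sketch (the (frames, 521) score shape is YAMNet's documented output; the helper name is illustrative, not this project's actual code):

```python
import numpy as np

def top_sounds(scores, class_names, top_n=10):
    """Average per-frame class scores over time and return the
    top-N (label, mean_score) pairs, strongest first."""
    mean_scores = scores.mean(axis=0)            # one score per class
    order = np.argsort(mean_scores)[::-1][:top_n]  # descending by score
    return [(class_names[i], float(mean_scores[i])) for i in order]
```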

Configuration

Edit config.py to customize:

INPUT_AUDIO = "samples/1.wav"  # Input file
WHISPER_MODEL = "large"        # tiny, base, small, medium, large
TOP_N_SOUNDS = 10              # Number of sounds to report
NOISE_REDUCTION_STRENGTH = 0.9 # 0.0 to 1.0
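The NOISE_REDUCTION_STRENGTH setting drives the spectral-gating step. As a rough illustration of the idea only (not the actual denoise.py implementation), a minimal NumPy spectral gate estimates a noise floor from the first few frames and attenuates spectral bins that fall below it:

```python
import numpy as np

def spectral_gate(audio, strength=0.9, frame=1024, hop=256, noise_frames=10):
    """Toy spectral gate: bins below the estimated noise floor are
    scaled by (1 - strength); bins above it pass through unchanged."""
    window = np.hanning(frame)
    n = 1 + (len(audio) - frame) // hop
    # Windowed short-time frames, then FFT each frame.
    frames = np.stack([audio[i*hop:i*hop+frame] * window for i in range(n)])
    spec = np.fft.rfft(frames, axis=1)
    mag = np.abs(spec)
    # Noise floor: mean magnitude of the first few frames.
    noise_floor = mag[:noise_frames].mean(axis=0, keepdims=True)
    # Gate: attenuate bins at or below the floor.
    gain = np.where(mag > noise_floor, 1.0, 1.0 - strength)
    # Overlap-add resynthesis back to a waveform.
    out = np.zeros(len(audio))
    for i, f in enumerate(np.fft.irfft(spec * gain, n=frame, axis=1)):
        out[i*hop:i*hop+frame] += f
    return out
```

At strength=0.0 the gate passes everything; at 1.0 it zeroes every bin under the noise floor, which is why values near 0.9 trade some artifacts for a much cleaner signal.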

Requirements

  • Python 3.8+
  • ~3GB disk space for models (downloaded on first run)
  • CUDA optional for GPU acceleration

License

MIT License - see LICENSE

Credits

Author: Rohit

Built with: Demucs, OpenAI Whisper, and YAMNet.