Memory leak: per-request voice tensor reload causes unbounded RSS growth #453

@christauff

Summary

The generate() method in kokoro_v1.py performs a full voice tensor round-trip on every TTS request: load from .pt file → deserialize to torch.Tensor → serialize back → write to new temp file. This creates ~20-30MB of transient allocations per request that fragment the Python heap, causing RSS to grow monotonically and never shrink.

Reproduction

  1. Run the CPU container, naming it so the next step can refer to it: docker run -d --name kokoro-tts -p 8880:8880 ghcr.io/remsky/kokoro-fastapi-cpu:v0.2.0post4
  2. Monitor RSS: docker stats kokoro-tts --no-stream
  3. Send ~50-100 TTS requests over a few hours
  4. Observe RSS climbing from ~500MB baseline toward multi-GB without returning

In our case, RSS reached 7.6GB after 2.5 days of moderate use (~50-100 requests/day), triggering the Linux OOM killer on the host.

Root Cause

In api/src/inference/kokoro_v1.py, both generate() and generate_from_tokens():

  1. Call paths.load_voice_tensor(voice_path, device) — reads entire .pt file into BytesIO, deserializes
  2. Call paths.save_voice_tensor(voice_tensor, temp_path) — serializes back, writes to NEW temp file
  3. This happens every request, even when the same voice is used repeatedly

The temp files (temp_voice_*) are written to Python's tempfile.gettempdir() (system /tmp), NOT the app's configured temp_file_dir, so the app's cleanup_temp_files() never finds or cleans them.
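The directory mismatch can be shown with the standard library alone. This is a minimal sketch, not the project's code: `write_temp_voice` and `cleanup_dir` are hypothetical stand-ins for the save and cleanup paths. A cleanup routine that scans one directory never sees files created in the default `tempfile` location unless `dir=` is passed.

```python
import os
import tempfile
from typing import Optional


def write_temp_voice(payload: bytes, temp_dir: Optional[str] = None) -> str:
    """Write payload to a temp_voice_* file and return its path.

    With temp_dir=None the file lands in tempfile.gettempdir(),
    which matches the behaviour described in this issue.
    """
    fd, path = tempfile.mkstemp(prefix="temp_voice_", suffix=".pt", dir=temp_dir)
    with os.fdopen(fd, "wb") as fh:
        fh.write(payload)
    return path


def cleanup_dir(temp_dir: str) -> int:
    """Hypothetical cleanup: delete temp_voice_* files found in temp_dir only."""
    removed = 0
    for name in os.listdir(temp_dir):
        if name.startswith("temp_voice_"):
            os.remove(os.path.join(temp_dir, name))
            removed += 1
    return removed
```

Passing the app's configured directory as `temp_dir` keeps every file where the cleanup routine looks. Files written with the default are left behind.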

Additionally, AudioService._writers in api/src/services/audio.py is a class-level dict, so it lives for the lifetime of the process. A writer key is removed only when is_last_chunk is reached. On client disconnect or error that point is never reached, and the StreamingAudioWriter object stays in the dict.
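The accumulation pattern and one possible cleanup can be sketched as follows. `FakeWriter` and `AudioServiceSketch` are hypothetical stand-ins, and the method signature here is an assumption rather than the real `convert_audio()` signature. The only point is that the error path removes the entry too, not just the last-chunk path.

```python
from typing import Dict


class FakeWriter:
    """Stand-in for StreamingAudioWriter; only tracks whether it was closed."""

    def __init__(self) -> None:
        self.closed = False

    def write_chunk(self, chunk: bytes, fail: bool = False) -> bytes:
        if fail:
            raise RuntimeError("simulated failure mid-stream")
        return chunk

    def close(self) -> None:
        self.closed = True


class AudioServiceSketch:
    # Class-level dict: shared by all requests, lives as long as the process.
    _writers: Dict[str, FakeWriter] = {}

    @classmethod
    def convert_audio(cls, key: str, chunk: bytes,
                      is_last_chunk: bool = False, fail: bool = False) -> bytes:
        writer = cls._writers.get(key)
        if writer is None:
            writer = FakeWriter()
            cls._writers[key] = writer
        try:
            data = writer.write_chunk(chunk, fail=fail)
        except Exception:
            # Error path: drop the entry so it cannot accumulate, then re-raise.
            cls._writers.pop(key, None)
            writer.close()
            raise
        if is_last_chunk:
            # Normal path: the stream finished, so release the writer.
            cls._writers.pop(key, None)
            writer.close()
        return data
```

This covers exceptions raised inside the call. A client that disconnects and simply stops sending chunks would still need cleanup from the caller's side, for example when the request handler exits.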

Suggested Fixes

  1. Cache the voice tensor and temp file path in KokoroV1 — skip the load/save cycle when the same voice is used again
  2. Use settings.temp_file_dir for all temp files so the cleanup routine can find them
  3. Add a finally block in AudioService.convert_audio() to remove the writer key on exception
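Fix 1 could look roughly like the sketch below. `VoiceCache` is a hypothetical helper and the loader is injected, so nothing here depends on torch. In the real code the loader would presumably wrap paths.load_voice_tensor(voice_path, device), which may be async, whereas this sketch is synchronous. The cache is bounded so that the fix does not itself become a source of growth.

```python
from collections import OrderedDict
from typing import Any, Callable, Tuple


class VoiceCache:
    """Small LRU cache keyed by (voice_path, device)."""

    def __init__(self, loader: Callable[[str, str], Any], max_entries: int = 8) -> None:
        self._loader = loader
        self._max_entries = max_entries
        self._entries: "OrderedDict[Tuple[str, str], Any]" = OrderedDict()

    def __len__(self) -> int:
        return len(self._entries)

    def get(self, voice_path: str, device: str) -> Any:
        key = (voice_path, device)
        if key in self._entries:
            # Cache hit: mark as most recently used and skip the reload.
            self._entries.move_to_end(key)
            return self._entries[key]
        # Cache miss: load once, then keep the result for later requests.
        value = self._loader(voice_path, device)
        self._entries[key] = value
        if len(self._entries) > self._max_entries:
            # Evict the least recently used entry to bound memory.
            self._entries.popitem(last=False)
        return value
```

With this shape, repeated requests for the same voice reuse one loaded object instead of performing the load/save cycle each time. The entry limit of 8 is an arbitrary illustrative value.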

Environment

  • Image: ghcr.io/remsky/kokoro-fastapi-cpu:v0.2.0post4
  • Host: 31GB RAM, Linux 6.17.0
  • Usage pattern: ~50-100 TTS requests/day via local API calls
