fix: remove api_key_alias label from high-cardinality prometheus metrics #12099


Open · colesmcintosh wants to merge 3 commits into main
Conversation

colesmcintosh (Collaborator)

Title

Remove api_key_alias label from high-cardinality prometheus metrics

Relevant issues

Fixes LIT-48

Pre-Submission checklist

Please complete all items before asking a LiteLLM maintainer to review your PR

  • I have added testing in the tests/litellm/ directory (adding at least 1 test is a hard requirement; see details)
  • I have added a screenshot of my new test passing locally
  • My PR passes all unit tests on make test-unit
  • My PR's scope is as isolated as possible, it only solves 1 specific problem

Type

🐛 Bug Fix

Changes

  • Removed api_key_alias label from 4 high-cardinality Prometheus metrics to prevent Datadog throttling (a sketch of the label change follows this list):
    • litellm_deployment_failure_responses
    • litellm_deployment_total_requests
    • litellm_proxy_total_requests
    • litellm_proxy_failed_requests
  • Updated corresponding unit tests to reflect the label removal
  • This improves reliability of system health dashboards by reducing metric cardinality
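
A minimal sketch of the change pattern, using prometheus_client Counters; the label sets shown here are illustrative and not the exact label lists from the LiteLLM source:

```python
from prometheus_client import Counter

# Before (illustrative label set): including api_key_alias makes every distinct
# alias a separate time series on a counter that is bumped on every request.
# litellm_proxy_total_requests = Counter(
#     "litellm_proxy_total_requests",
#     "Total number of requests made to the proxy server",
#     labelnames=["hashed_api_key", "api_key_alias", "requested_model", "team"],
# )

# After: api_key_alias dropped; cardinality is bounded by the remaining labels.
litellm_proxy_total_requests = Counter(
    "litellm_proxy_total_requests",
    "Total number of requests made to the proxy server",
    labelnames=["hashed_api_key", "requested_model", "team"],
)
```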

Test Results

  • All prometheus integration tests pass: poetry run pytest tests/test_litellm/integrations/test_prometheus.py -v (14 passed)
  • Updated unit tests pass: fixed 4 of the 5 failing tests related to the label removal

Remove api_key_alias label from the following metrics to reduce cardinality
and prevent Datadog throttling:
- litellm_deployment_failure_responses
- litellm_deployment_total_requests
- litellm_proxy_total_requests
- litellm_proxy_failed_requests

This change helps create more reliable system health dashboards by reducing
metric cardinality caused by unique API key aliases.
vercel bot commented Jun 27, 2025

litellm ✅ Ready (Preview updated Jun 27, 2025 4:13pm UTC)

colesmcintosh marked this pull request as ready for review June 27, 2025 11:44
krrishdholakia (Contributor)

Why do this?

colesmcintosh (Collaborator, Author)

@krrishdholakia This fixes LIT-48. The api_key_alias label is causing high cardinality issues with these 4 metrics:

  1. litellm_proxy_total_requests_metric - Tracks every client request
  2. litellm_proxy_failed_requests_metric - Tracks every failed request
  3. litellm_deployment_failure_responses - Tracks every LLM API failure
  4. litellm_deployment_total_requests - Tracks every LLM API call

Since these metrics fire on every single request, and each unique api_key_alias value creates a new time series, the cardinality explodes quickly (a rough calculation is sketched after the list below):

  • 100 unique API key aliases × 4 metrics = 400 base time series
  • With other label combinations (model, team, etc.), this multiplies further
  • Result: Potentially millions of unique time series
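
A back-of-envelope calculation, with hypothetical per-label counts chosen only to show the multiplicative effect:

```python
# Hypothetical per-label counts; real numbers depend on the deployment.
api_key_aliases = 100    # unique api_key_alias values
models = 20              # unique model values
teams = 10               # unique team values
high_volume_metrics = 4  # the four request-level metrics listed above

with_alias = api_key_aliases * models * teams * high_volume_metrics
without_alias = models * teams * high_volume_metrics

print(f"series with api_key_alias:    {with_alias:,}")     # 80,000
print(f"series without api_key_alias: {without_alias:,}")  # 800
```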

This causes:

  • Datadog throttling - Metrics get dropped when cardinality limits are exceeded
  • Incomplete dashboards - System health monitoring becomes unreliable
  • Increased costs - More time series = higher monitoring costs

The api_key_alias is still available on lower-volume metrics like spend tracking and budget metrics, so you can still track per-key usage where it matters most.
