[metrics]: enable native histograms for Tempo's promauto-registered metrics #6910
Conversation
Add NativeHistogramBucketFactor=1.1 to all promauto-registered histograms that were missing it, and set MetricsNativeHistogramFactor on the dskit server config to enable native histograms for tempo_request_duration_seconds. Histograms now emit both classic (fixed-bucket) and native (exponential) formats simultaneously. Classic format continues to work unchanged for existing scrapers; native format is available to scrapers that request it via the OpenMetrics or protobuf Accept header. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
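For scrapers that do ingest the native format, the usual latency queries no longer need the `_bucket`/`le` machinery. An illustrative before/after in PromQL, using the server request-duration metric named above (the 0.99 quantile and 5m window are arbitrary choices for the example):

```promql
# Classic histogram: quantile computed from the fixed _bucket series.
histogram_quantile(0.99, sum by (le) (rate(tempo_request_duration_seconds_bucket[5m])))

# Native histogram: histogram_quantile applies directly to the base series.
histogram_quantile(0.99, sum(rate(tempo_request_duration_seconds[5m])))
```

Existing classic-format queries keep working unchanged, since both representations are emitted.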
Pull request overview
Enables Prometheus native (exponential) histogram emission across Tempo’s promauto-registered histogram metrics, and turns on native histograms for the dskit server request-duration metric.
Changes:
- Adds `NativeHistogramBucketFactor`, `NativeHistogramMaxBucketNumber`, and `NativeHistogramMinResetDuration` to multiple histogram registrations across modules/packages.
- Sets `Server.MetricsNativeHistogramFactor = 1.1` in `initServer()` to enable native histograms for the server request duration metric.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| pkg/drain/metrics.go | Adds native histogram options to the drain tokens_per_line histogram. |
| pkg/dataquality/warnings.go | Adds native histogram options to span “distance in past/future” histograms. |
| modules/livestore/instance.go | Adds native histogram options to completion size + backpressure duration histograms. |
| modules/ingester/flush.go | Adds native histogram options to ingester flush size histogram. |
| modules/frontend/v1/frontend.go | Adds native histogram options to frontend v1 batch histograms. |
| modules/frontend/frontend.go | Adds native histogram options to jobs-per-query histogram. |
| modules/distributor/distributor.go | Adds native histogram options to Kafka request/latency histograms. |
| cmd/tempo/app/modules.go | Enables native histograms for dskit server request duration metric via server config. |
… config The Prometheus client only applies `DefBuckets` as a default when `NativeHistogramBucketFactor` is not set. For histograms that had no explicit `Buckets` defined, adding `NativeHistogramBucketFactor` alone would suppress the classic `_bucket` series. Make the intent explicit by setting `Buckets: prometheus.DefBuckets` on the four affected histograms. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
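A minimal sketch of what one of these registrations looks like after the change. The metric name here is hypothetical (not from the PR); the native-histogram field values follow the PR description:

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var exampleDuration = promauto.NewHistogram(prometheus.HistogramOpts{
	Namespace: "tempo",
	Name:      "example_duration_seconds", // illustrative name
	Help:      "Illustrative histogram emitting classic and native formats.",

	// Explicit classic buckets: the client only applies DefBuckets as a
	// default when no native options are set, so without this line the
	// classic _bucket series would be suppressed.
	Buckets: prometheus.DefBuckets,

	// Native (exponential) histogram options.
	NativeHistogramBucketFactor:     1.1,
	NativeHistogramMaxBucketNumber:  100,
	NativeHistogramMinResetDuration: 1 * time.Hour,
})
```

With both `Buckets` and the native options set, the client maintains the two representations side by side and serves whichever one the scraper's Accept header asks for.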
```go
func (t *App) initServer() (services.Service, error) {
	t.cfg.Server.MetricsNamespace = metricsNamespace
	t.cfg.Server.ExcludeRequestInLog = true
	t.cfg.Server.MetricsNativeHistogramFactor = 1.1
```
The PR description/checklist says CHANGELOG.md was updated, but this PR’s diff doesn’t include a changelog entry. Since enabling native histograms for core metrics is a user-visible operational change (additional native histogram emission for OpenMetrics/protobuf scrapes), could you add a CHANGELOG.md entry in the project’s required format (with PR number + link)?
```go
	Namespace: "tempo_live_store",
	Name:      "completion_size_bytes",
	Help:      "Size in bytes of blocks completed.",
	Buckets:   prometheus.ExponentialBuckets(1024*1024, 2, 10), // from 1MB up to 1GB
```
Is this true, i.e. from 1MB to 1GB? Or is this more like 500MB?
I.e. the first bucket is 1 MiB, so the last one would be 512 MiB? Other than that looks good.
Good catch. This is pre-existing, but I've pushed a commit to update the comment.
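The arithmetic behind the corrected comment can be checked with a small pure-Go helper that mirrors the semantics of `prometheus.ExponentialBuckets(start, factor, count)` (the helper name is ours, not client_golang's): with start 1 MiB, factor 2, and count 10, the last upper bound is 1 MiB × 2⁹ = 512 MiB, not 1 GiB.

```go
package main

import "fmt"

// expBuckets mirrors prometheus.ExponentialBuckets: it returns count upper
// bounds, the first at start, each subsequent one multiplied by factor.
func expBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	// The livestore completion_size_bytes buckets: ExponentialBuckets(1MiB, 2, 10).
	b := expBuckets(1024*1024, 2, 10)
	fmt.Printf("first bucket: %.0f bytes (1 MiB)\n", b[0])
	fmt.Printf("last bucket:  %.0f bytes (%.0f MiB)\n", b[len(b)-1], b[len(b)-1]/(1024*1024))
}
```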
`ExponentialBuckets(1MB, 2, 10)` produces 10 buckets ending at 512MB. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
```go
func (t *App) initServer() (services.Service, error) {
	t.cfg.Server.MetricsNamespace = metricsNamespace
	t.cfg.Server.ExcludeRequestInLog = true
	t.cfg.Server.MetricsNativeHistogramFactor = 1.1
```
Setting `t.cfg.Server.MetricsNativeHistogramFactor = 1.1` enables native histograms for server request/throughput metrics by default, which adds per-request CPU/memory overhead and (because the upstream field is `yaml:"-"`) can't be tuned/disabled via Tempo config.
Could we consider plumbing a Tempo config/flag for this (even if defaulting to 1.1), so operators can turn it off or adjust the factor if they hit performance or cardinality/memory issues?
```diff
-	t.cfg.Server.MetricsNativeHistogramFactor = 1.1
+	// Default to native histograms for server metrics, but do not overwrite
+	// an explicit value that may have been provided via flags or code.
+	if t.cfg.Server.MetricsNativeHistogramFactor <= 0 {
+		t.cfg.Server.MetricsNativeHistogramFactor = 1.1
+	}
```
I don't believe this is necessary.
What this PR does:
Enables native (exponential) histogram emission alongside classic (fixed-bucket) histograms for all of Tempo's promauto-registered metrics.
Two changes:
1. Adds `NativeHistogramBucketFactor: 1.1`, `NativeHistogramMaxBucketNumber: 100`, and `NativeHistogramMinResetDuration: 1h` to histogram registrations that were missing these fields across `modules/frontend`, `modules/ingester`, `modules/livestore`, `modules/distributor`, `pkg/dataquality`, and `pkg/drain`.
2. Sets `MetricsNativeHistogramFactor: 1.1` on the dskit server config in `initServer()`, which enables native histograms for `tempo_request_duration_seconds`, the primary HTTP/gRPC latency metric used in dashboards and latency SLOs.

When both classic buckets and `NativeHistogramBucketFactor` are set, the Prometheus client maintains both representations simultaneously. Scrapers using the classic text format continue to receive classic histograms unchanged. Scrapers that request OpenMetrics or protobuf format receive native histograms. This is fully backward-compatible: no existing queries or SLOs are affected by this change alone.

Note for future follow-up: if/when the Prometheus scrape config is updated to ingest native histograms (protobuf format), the `_bucket`-based expressions in the mixin dashboards and `histogramRules` in `rules.libsonnet` will need updating. That should be a separate PR tied to the scrape config change.

Which issue(s) this PR fixes:
N/A

Checklist
- `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`