To activate the `local-blocks` processor for all users, add it to the list of processors:
```yaml
# Global overrides configuration.
overrides:
  metrics_generator_processors: ["local-blocks"]
```

To configure the processor per tenant, use the `metrics_generator_processors` override.

Example of setting this for a single tenant in the per-tenant overrides:

```yaml
overrides:
  "tenantID":
    metrics_generator_processors:
      - local-blocks
```
```yaml
query_frontend:
  metrics:
    concurrent_jobs: 8
    target_bytes_per_job: 1.25e+09 # ~1.25GB
```

## Sampling and performance optimization

TraceQL metrics queries support sampling hints to improve performance on large datasets.

### Sampling configuration considerations

When using sampling in your TraceQL metrics queries, consider:

- **Timeout settings:** Sampled queries run faster but may still benefit from adequate timeouts
- **Concurrent jobs:** Sampling reduces per-job processing time, allowing higher concurrency
- **Job sizing:** With sampling, smaller job sizes may be more efficient

Example configuration optimized for sampling:

```yaml
query_frontend:
  metrics:
    concurrent_jobs: 1500 # Higher concurrency with sampling
    target_bytes_per_job: 1.5e+08 # Smaller jobs with sampling
```

### Sampling best practices

- Use `sample=true` for dashboard queries requiring fast refresh
- Apply fixed sampling rates for consistent approximation levels
- Avoid sampling for alerts or precise measurements
- Test sampling accuracy against your specific data patterns
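The last point can be sanity-checked with arithmetic before touching production: for a count estimated from a uniform sample of rate `p` over `n` matching spans, the relative standard error is roughly `sqrt((1 - p) / (p * n))`. A minimal Python sketch of this back-of-envelope check (an approximation on our part, not a Tempo feature):

```python
import math

def relative_std_error(matching_spans: int, sample_rate: float) -> float:
    """Approximate relative error of a count estimated from a uniform
    sample: scaling the sampled count by 1/p has standard error
    sqrt((1 - p) / (p * n)) relative to the true count."""
    n, p = matching_spans, sample_rate
    return math.sqrt((1 - p) / (p * n))

# A 1% sample over one million matching spans stays well under 2% error.
err = relative_std_error(1_000_000, 0.01)
```

The formula suggests why low rates are safe for broad queries but risky for rare events, matching the guidance above.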
This example means the attribute `resource.cluster` had too many values.
```
{ __meta_error="__too_many_values__", resource.cluster=<nil> }
```

## Query hints and sampling

TraceQL metrics queries support query hints, expressed with the `with()` syntax, to optimize performance and control sampling behavior. Sampling hints improve performance by processing only a subset of the data.

Sampling is particularly effective for:

- Aggregation queries over large datasets
- Dashboard queries requiring fast refresh
- Exploratory data analysis

{{< admonition type="note" >}}
Sampling hints only work with TraceQL metrics queries, that is, queries that use functions such as `rate()` or `count_over_time()`.
{{< /admonition >}}

### Adaptive sampling: `with(sample=true)`

Adaptive sampling automatically determines the optimal sampling strategy based on query selectivity and data volume.

```traceql
{ resource.service.name="frontend" } | rate() with(sample=true)
```

- **Use case:** Heavy queries with large result sets
- **Performance:** 2-4x improvement on queries like `{ } | rate()`
- **Accuracy:** Maintains high accuracy by adapting sampling rate
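The accuracy claim rests on a standard statistical trick: count only a fraction `p` of spans and scale the count by `1/p`, which yields an unbiased estimate. A Python sketch of that estimator (our illustration, not Tempo's internals):

```python
import random

def estimate_rate(spans, window_seconds, sample_rate, seed=0):
    """Count only a `sample_rate` fraction of spans, then scale the
    count by 1/p; the scaled count gives an unbiased rate estimate."""
    rng = random.Random(seed)
    kept = sum(1 for _ in spans if rng.random() < sample_rate)
    return kept / sample_rate / window_seconds

spans = list(range(60_000))             # 60k spans in a 60 s window
exact = len(spans) / 60                 # true rate: 1000 spans/s
approx = estimate_rate(spans, 60, 0.1)  # inspects only ~10% of the spans
```

For broad matchers like `{ } | rate()` the sampled estimate lands within a few percent of the exact rate while doing a fraction of the work, which is where the speedup comes from.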

### Fixed span sampling: `with(span_sample=0.xx)`

Fixed span sampling processes a fixed percentage of spans for span-level aggregations.

```traceql
{ status=error } | count_over_time() with(span_sample=0.1)
```

### Fixed trace sampling: `with(trace_sample=0.xx)`

Fixed trace sampling processes a fixed percentage of traces for trace-level aggregations.

```traceql
{ } | count() by (resource.service.name) with(trace_sample=0.05)
```
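The difference from span sampling is that the keep/drop decision is made once per trace, so surviving traces stay complete. A Python sketch of this behavior (an illustration, not Tempo's implementation):

```python
import random
from collections import defaultdict

def trace_sample(spans, rate, seed=0):
    """Decide keep/drop once per trace ID; every span of a kept
    trace is retained, preserving intra-trace relationships."""
    rng = random.Random(seed)
    decisions, kept = {}, []
    for span in spans:
        tid = span["trace_id"]
        if tid not in decisions:
            decisions[tid] = rng.random() < rate
        if decisions[tid]:
            kept.append(span)
    return kept

# 1000 synthetic traces of 5 spans each, sampled at 5%.
spans = [{"trace_id": t, "span_id": s} for t in range(1000) for s in range(5)]
kept = trace_sample(spans, 0.05)
sizes = defaultdict(int)
for span in kept:
    sizes[span["trace_id"]] += 1
# Every surviving trace still contains all 5 of its spans.
```

Because whole traces survive intact, trace-level aggregations such as `count()` per service remain meaningful on the sampled subset.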

### When to use sampling

- **Heavy aggregation queries** with large datasets
- **Exploratory analysis** where approximate results are acceptable
- **Dashboard queries** that need faster refresh times
- **Avoid sampling** for precise metrics or rare event detection
---
title: TraceQL metrics sampling guide
menuTitle: Sampling guide
description: Optimize TraceQL metrics query performance using sampling hints
weight: 500
keywords:
- TraceQL metrics
- sampling
- performance optimization
- query optimization
---

# TraceQL metrics sampling guide

{{< docs/shared source="tempo" lookup="traceql-metrics-admonition.md" version="<TEMPO_VERSION>" >}}

TraceQL metrics sampling is a performance optimization feature that enables faster query execution by processing a subset of trace data while maintaining acceptable accuracy. Sampling delivers 2-4x performance improvements for heavy aggregation queries.

## Overview

TraceQL metrics sampling addresses the challenge of balancing query performance with data accuracy when working with large-scale trace datasets. Sampling intelligently selects a representative subset of data for processing, making it particularly valuable for:

- Real-time dashboards requiring fast refresh rates
- Exploratory data analysis where approximate results accelerate insights
- Resource-constrained environments with limited compute capacity
- Large-scale deployments processing terabytes of trace data daily

## Prerequisites

TraceQL metrics sampling requires:

- Tempo 2.8+ with TraceQL metrics enabled
- `local-blocks` processor configured in metrics-generator
- Grafana 10.4+ or Grafana Cloud for UI integration

## Choose a sampling method

### Adaptive sampling: `with(sample=true)`

Adaptive sampling automatically determines the optimal sampling strategy based on query characteristics. It switches between span-level and trace-level sampling as needed and adjusts sampling rates dynamically.

```traceql
{ resource.service.name="checkout-service" } | rate() with(sample=true)
{ status=error } | count_over_time() by (resource.service.name) with(sample=true)
```

**Best for:** Heavy aggregation queries, dashboard queries, and multi-service analysis with unpredictable data volumes.

**Limitations:** May over-sample rare events; results can vary across blocks as new data arrives.

### Fixed span sampling: `with(span_sample=0.xx)`

Fixed span sampling selects a specified percentage of spans using consistent hashing of span IDs, providing predictable performance improvements and deterministic results.

```traceql
{ status=error } | rate() by (resource.service.name) with(span_sample=0.1)
```
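The deterministic, hash-based selection described above can be sketched in Python (an illustration on our part; Tempo's actual hash function is not shown here): map each span ID to a number in [0, 1) and keep the span if the number falls below the rate, so repeated queries select the identical subset.

```python
import hashlib

def keep_span(span_id: bytes, rate: float) -> bool:
    """Consistent sampling: hash the span ID to a value in [0, 1);
    the same ID always yields the same keep/drop decision."""
    h = int.from_bytes(hashlib.sha256(span_id).digest()[:8], "big")
    return h / 2**64 < rate

ids = [f"span-{i}".encode() for i in range(10_000)]
kept = [i for i in ids if keep_span(i, 0.1)]
# Re-running the selection yields the same ~10% subset every time.
```

This determinism is why fixed sampling gives consistent approximations across repeated dashboard refreshes.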

**Best for:** Consistent approximation, large-scale monitoring, and cost optimization scenarios.

**Limitations:** May miss important events during low-volume periods and is not optimal for naturally selective queries.

### Fixed trace sampling: `with(trace_sample=0.xx)`

Fixed trace sampling selects complete traces for analysis, preserving trace context and relationships between spans within the same request flow.

```traceql
{ } | count() by (resource.service.name) with(trace_sample=0.1)
```

**Best for:** Trace-level aggregations, service dependency mapping, and error correlation analysis.

**Limitations:** May provide poor accuracy for span-level metrics and can introduce bias if trace volumes vary significantly across services.

## Implement sampling

### Get started

1. **Verify prerequisites:** Check the Tempo version and ensure the `local-blocks` processor is enabled
2. **Start with adaptive sampling:** Apply `with(sample=true)` to non-critical queries first
3. **Measure performance:** Compare execution times before and after sampling
4. **Validate accuracy:** Test sampled results against exact results for critical queries
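Step 4 can be automated offline. The sketch below is a hypothetical validation harness (not a Tempo tool): it compares a scaled sampled count against the exact count for each candidate rate and keeps the rates that stay within tolerance.

```python
import random

def validate_rates(values, rates, tolerance=0.05, seed=0):
    """Return the candidate sampling rates whose scaled estimate of
    len(values) lands within `tolerance` of the exact count."""
    exact = len(values)
    ok = []
    for p in rates:
        rng = random.Random(seed)
        estimate = sum(1 for _ in values if rng.random() < p) / p
        if abs(estimate - exact) / exact <= tolerance:
            ok.append(p)
    return ok

# Which candidate rates keep the error under 5% for this dataset?
rates_ok = validate_rates(range(100_000), [0.001, 0.01, 0.1])
```

Run the same comparison against a representative slice of your own trace data before committing a rate to dashboards.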

### Grafana integration

Use sampling in dashboard panels:

```json
{
  "expr": "{ resource.service.name=\"frontend\" } | rate() with(sample=true)"
}
```

Avoid sampling for critical alerts that trigger operational responses; adaptive sampling is acceptable for warning alerts and trend monitoring.

### Configuration optimization

Increase query concurrency, since sampling reduces per-job processing time:

```yaml
query_frontend:
  metrics:
    concurrent_jobs: 1500
    target_bytes_per_job: 1.5e+08
```

## Best practices

### Query design

- **Use broad queries:** Sampling works best with queries that match many spans
- **Align sampling with aggregation scope:** Use span sampling for span-level aggregations, trace sampling for trace-level aggregations
- **Consider temporal patterns:** Adjust sampling rates based on data age and query frequency

### Select sampling rates by use case

- **Real-time monitoring (0-1h):** Adaptive sampling or 10%+ fixed rates
- **Recent analysis (1h-1d):** 5-10% sampling
- **Historical trends (1d+):** 1-5% sampling
- **Long-term analysis (30d+):** 0.1-1% sampling
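The ladder above can be captured in a small helper for query-building code; the thresholds are the ones suggested in this guide, not values read from any Tempo API:

```python
def suggested_sample_rate(data_age_hours: float) -> str:
    """Map the age of the queried data to the sampling guidance above."""
    if data_age_hours <= 1:
        return "adaptive or >=10% fixed"
    if data_age_hours <= 24:
        return "5-10% fixed"
    if data_age_hours <= 24 * 30:
        return "1-5% fixed"
    return "0.1-1% fixed"

rate_hint = suggested_sample_rate(72)  # a 3-day-old window
```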

### Decision framework

1. **Critical measurement needed?** → No sampling
2. **Dashboard or trend analysis?** → Adaptive sampling
3. **Historical analysis or capacity planning?** → Fixed sampling (1-5%)
4. **Cost optimization or exploration?** → Low fixed sampling (0.1-1%)

### Migration approach

1. Test all sampling configurations in development first
2. Migrate dashboard queries before alerting queries
3. Document sampling rationale and accuracy requirements
4. Configure monitoring for sampling effectiveness
5. Plan rollback procedures for accuracy issues

Following these practices lets you integrate TraceQL metrics sampling into your observability workflows, gaining significant performance improvements while preserving enough accuracy for effective monitoring and analysis.