[DOC] Create TraceQL metrics sampling docs #5595
Merged
knylander-grafana
merged 18 commits into
main
from
5469-traceql-sampling-doc-ai-experiment
Oct 3, 2025
Changes from 3 commits
Commits
a48d96e AI experiment: Create TraceQL metrics sampling docs (knylander-grafana)
e59a5be Take 2 on sampling guide using updated prompt (knylander-grafana)
0c4ceb7 Updates to sampling guide using better prompts (knylander-grafana)
c69ec00 Apply suggestions from code review (knylander-grafana)
c375dbf Apply suggestions from code review (knylander-grafana)
5cf58c4 Update docs/sources/tempo/metrics-from-traces/metrics-queries/configu… (knylander-grafana)
f0f738c Update docs/sources/tempo/metrics-from-traces/metrics-queries/configu… (knylander-grafana)
3a1a179 Apply suggestions from code review (knylander-grafana)
b6601c0 Apply suggestions from code review (knylander-grafana)
5ae101f Update content based on PR description (knylander-grafana)
a59c3f1 Remove the word guide (knylander-grafana)
9f1a10d Add link to community call video (knylander-grafana)
86305a3 Fix link to video: (knylander-grafana)
931488a Update the Config doc to remove unnecessary info (knylander-grafana)
453a70b Update the Functions doc (knylander-grafana)
9546e1a Update docs/sources/tempo/metrics-from-traces/metrics-queries/functio… (knylander-grafana)
1be3b64 Apply suggestions from code review (knylander-grafana)
2f56936 Apply suggestions from code review (knylander-grafana)
137 changes: 137 additions & 0 deletions
docs/sources/tempo/metrics-from-traces/metrics-queries/sampling-guide.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,137 @@ | ||
| --- | ||
| title: TraceQL metrics sampling guide | ||
| menuTitle: Sampling guide | ||
| description: Optimize TraceQL metrics query performance using sampling hints | ||
| weight: 500 | ||
| keywords: | ||
| - TraceQL metrics | ||
| - sampling | ||
| - performance optimization | ||
| - query optimization | ||
| --- | ||
|
|
||
| # TraceQL metrics sampling guide | ||
|
|
||
| {{< docs/shared source="tempo" lookup="traceql-metrics-admonition.md" version="<TEMPO_VERSION>" >}} | ||
|
|
||
| TraceQL metrics sampling is a performance optimization feature that speeds up query execution by processing a subset of trace data while maintaining acceptable accuracy. Sampling can deliver 2-4x performance improvements for heavy aggregation queries. | ||
|
|
||
|
|
||
| ## Overview | ||
|
|
||
| TraceQL metrics sampling addresses the challenge of balancing query performance with data accuracy when working with large-scale trace datasets. Sampling selects a representative subset of data for processing, which makes it particularly valuable for: | ||
|
|
||
|
|
||
| - Real-time dashboards requiring fast refresh rates | ||
| - Exploratory data analysis where approximate results accelerate insights | ||
| - Resource-constrained environments with limited compute capacity | ||
| - Large-scale deployments processing terabytes of trace data daily | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| TraceQL metrics sampling requires: | ||
|
|
||
|
|
||
| - Tempo 2.8+ with TraceQL metrics enabled | ||
| - `local-blocks` processor configured in metrics-generator | ||
|
|
||
| - Grafana 10.4+ or Grafana Cloud for UI integration | ||
|
|
||
| ## Choose a sampling method | ||
|
|
||
|
|
||
| ### Adaptive sampling: `with(sample=true)` | ||
|
|
||
| Adaptive sampling automatically determines the optimal sampling strategy based on query characteristics. It switches between span-level and trace-level sampling as needed and adjusts sampling rates dynamically. | ||
|
|
||
| ```traceql | ||
| { resource.service.name="checkout-service" } | rate() with(sample=true) | ||
| { status=error } | count_over_time() by (resource.service.name) with(sample=true) | ||
| ``` | ||
|
|
||
|
|
||
| **Best for:** Heavy aggregation queries, dashboard queries, and multi-service analysis with unpredictable data volumes. | ||
|
|
||
|
|
||
| **Limitations:** May over-sample rare events, and results can vary across blocks as new data arrives. | ||
|
|
||
|
|
||
| ### Fixed span sampling: `with(span_sample=0.xx)` | ||
|
|
||
| Fixed span sampling selects a specified percentage of spans using consistent hashing of span IDs. It provides predictable performance improvements and deterministic results. | ||
|
|
||
|
|
||
| ```traceql | ||
| { status=error } | rate() by (resource.service.name) with(span_sample=0.1) | ||
| ``` | ||
|
|
||
| **Best for:** Consistent approximation, large-scale monitoring, and cost optimization scenarios. | ||
|
|
||
|
|
||
| **Limitations:** May miss important events during low-volume periods and is not optimal for naturally selective queries. | ||
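The hashing idea behind fixed span sampling can be illustrated with a small Python sketch. This is not Tempo's implementation: the hash function (SHA-256), the `keep_span` and `estimate_count` helpers, and the step of scaling the sampled count by `1 / rate` are all assumptions chosen to show why a hash of the span ID gives the same keep-or-drop decision on every run.

```python
import hashlib


def keep_span(span_id: bytes, rate: float) -> bool:
    """Deterministically decide whether a span falls inside the sample.

    Hypothetical helper: hashes the span ID and compares the first 8 bytes
    of the digest against an integer threshold derived from the rate.
    """
    if not 0 < rate <= 1:
        raise ValueError("rate must be in (0, 1]")
    digest = hashlib.sha256(span_id).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(rate * 2**64)


def estimate_count(span_ids, rate: float) -> float:
    """Count the sampled spans and scale by 1/rate to estimate the total."""
    kept = sum(1 for span_id in span_ids if keep_span(span_id, rate))
    return kept / rate


# Synthetic span IDs, used only for illustration.
spans = [i.to_bytes(8, "big") for i in range(100_000)]
estimate = estimate_count(spans, 0.1)
```

Because the decision depends only on the span ID and the rate, the same span is kept or dropped each time the query runs, which is what makes the results repeatable. The estimate is still approximate, and its relative error grows as the number of matching spans shrinks.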
|
|
||
| ### Fixed trace sampling: `with(trace_sample=0.xx)` | ||
|
|
||
| Fixed trace sampling selects complete traces for analysis, preserving trace context and relationships between spans within the same request flow. | ||
|
|
||
| ```traceql | ||
| { } | count_over_time() by (resource.service.name) with(trace_sample=0.1) | ||
|
|
||
| ``` | ||
|
|
||
| **Best for:** Trace-level aggregations, service dependency mapping, and error correlation analysis. | ||
|
|
||
|
|
||
| **Limitations:** May provide poor accuracy for span-level metrics and can introduce bias if trace volumes vary significantly across services. | ||
|
|
||
|
|
||
| ## Implement sampling | ||
|
|
||
|
|
||
| ### Get started | ||
|
|
||
| 1. **Verify prerequisites:** Check the Tempo version and ensure the `local-blocks` processor is enabled | ||
| 2. **Start with adaptive sampling:** Apply `with(sample=true)` to non-critical queries first | ||
| 3. **Measure performance:** Compare execution times before and after sampling | ||
| 4. **Validate accuracy:** Test sampled results against exact results for critical queries | ||
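Steps 3 and 4 above can be sketched as a small Python check. The helper names, the 5% tolerance, and the idea of comparing series point by point are assumptions for illustration; the inputs would be values you collect yourself by running the same query with and without a sampling hint.

```python
def relative_errors(exact, sampled):
    """Return the per-point relative error between two equally long series."""
    if len(exact) != len(sampled):
        raise ValueError("series must have the same number of points")
    errors = []
    for exact_value, sampled_value in zip(exact, sampled):
        if exact_value == 0:
            # Relative error is undefined for a zero baseline, so only an
            # exact match counts as zero error.
            errors.append(0.0 if sampled_value == 0 else float("inf"))
        else:
            errors.append(abs(sampled_value - exact_value) / abs(exact_value))
    return errors


def within_tolerance(exact, sampled, tolerance=0.05):
    """True if every point of the sampled series is within the tolerance."""
    return max(relative_errors(exact, sampled), default=0.0) <= tolerance


def speedup(exact_seconds, sampled_seconds):
    """Ratio of exact to sampled execution time (higher is better)."""
    return exact_seconds / sampled_seconds
```

A check like this is most useful when it runs over several time windows, since sampled results for low-volume periods tend to deviate more than results for busy ones.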
|
|
||
| ### Grafana integration | ||
|
|
||
|
|
||
| Use sampling in dashboard panels: | ||
|
|
||
| ```json | ||
| { | ||
| "expr": "{ resource.service.name=\"frontend\" } | rate() with(sample=true)" | ||
| } | ||
| ``` | ||
|
|
||
| For alerts, avoid sampling for critical alerts that trigger operational responses. Adaptive sampling is acceptable for warning alerts and trend monitoring. | ||
|
|
||
| ### Configuration optimization | ||
|
|
||
| Increase query concurrency since sampling reduces per-job processing: | ||
|
|
||
| ```yaml | ||
| query_frontend: | ||
|   metrics: | ||
|     concurrent_jobs: 1500 | ||
|     target_bytes_per_job: 1.5e+08 | ||
| ``` | ||
|
|
||
|
|
||
| ## Best practices | ||
|
|
||
| ### Query design | ||
|
|
||
| - **Use broad queries:** Sampling works best with queries that match many spans | ||
| - **Align sampling with aggregation scope:** Use span sampling for span-level aggregations, trace sampling for trace-level aggregations | ||
| - **Consider temporal patterns:** Adjust sampling rates based on data age and query frequency | ||
|
|
||
| ### Select sampling rates by use case | ||
|
|
||
| - **Real-time monitoring (0-1h):** Adaptive sampling or 10%+ fixed rates | ||
| - **Recent analysis (1h-1d):** 5-10% sampling | ||
| - **Historical trends (1d+):** 1-5% sampling | ||
| - **Long-term analysis (30d+):** 0.1-1% sampling | ||
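One way to apply the ranges above consistently is to derive the hint from the query's time range. The sketch below assumes the ranges refer to the length of the query window, picks the upper bound of each suggested range, and uses span sampling for the fixed rates. All three choices, along with the `sampling_hint` and `with_hint` names, are illustrative rather than part of Tempo or Grafana.

```python
from datetime import timedelta


def sampling_hint(query_range: timedelta) -> str:
    """Map a query time range to a TraceQL sampling hint (illustrative rates)."""
    if query_range <= timedelta(hours=1):
        return "with(sample=true)"       # real-time monitoring: adaptive
    if query_range <= timedelta(days=1):
        return "with(span_sample=0.1)"   # recent analysis: 5-10%
    if query_range < timedelta(days=30):
        return "with(span_sample=0.05)"  # historical trends: 1-5%
    return "with(span_sample=0.01)"      # long-term analysis: 0.1-1%


def with_hint(query: str, query_range: timedelta) -> str:
    """Append the hint for the given range to a TraceQL metrics query."""
    return f"{query} {sampling_hint(query_range)}"


dashboard_query = with_hint("{ status=error } | rate()", timedelta(minutes=30))
```

Keeping the mapping in one place makes it easier to document why a given panel uses a particular rate and to adjust the rates later if accuracy checks show they are too aggressive.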
|
|
||
| ### Decision framework | ||
|
|
||
| 1. **Critical measurement needed?** → No sampling | ||
| 2. **Dashboard or trend analysis?** → Adaptive sampling | ||
| 3. **Historical analysis or capacity planning?** → Fixed sampling (1-5%) | ||
| 4. **Cost optimization or exploration?** → Low fixed sampling (0.1-1%) | ||
|
|
||
| ### Migration approach | ||
|
|
||
| 1. Test all sampling configurations in development first | ||
| 2. Migrate dashboard queries before alerting queries | ||
| 3. Document sampling rationale and accuracy requirements | ||
| 4. Configure monitoring for sampling effectiveness | ||
| 5. Plan rollback procedures for accuracy issues | ||
|
|
||
| By following these practices, you can successfully integrate TraceQL metrics sampling into your observability workflows, achieving significant performance improvements while maintaining data quality for effective monitoring and analysis. | ||
|
|
||