TraceQL Metrics: Fix incorrect optimization for some queries#4887

Merged
mdisibio merged 3 commits into grafana:main from
mdisibio:fix-metrics-multiple-generic-attrs
Mar 25, 2025

Conversation

mdisibio (Contributor) commented Mar 21, 2025

What this PR does:
A bug was identified in TraceQL metrics where some queries return incorrect data; see the screenshot:

[screenshot: query results including extra http.flavor values]

In this example, the query returns other values for http.flavor even though it is clearly restricted to a single value.

The bug is that the TraceQL metrics engine has performance optimizations that assume behavior from the storage layer that wasn't happening. Specifically, the parquet layer does not honor (and never has honored) AllConditions=true when multiple attributes of the same type are filtered in the generic attribute columns. In that case it relies on the engine callback to finalize the result set, which the TraceQL metrics optimization was skipping.
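To make the failure mode concrete, here is a minimal, hypothetical Go sketch (the types and names are not Tempo's) of the kind of check the fix implies: the AllConditions shortcut is only safe when at most one condition per scope hits the shared generic-attribute columns.

```go
package main

import "fmt"

// Scope identifies which column family a condition reads from. In a
// parquet block, attributes without dedicated columns share generic
// attribute columns per scope.
type Scope int

const (
	ScopeResource Scope = iota
	ScopeSpan
)

// Condition is a simplified filter on one attribute within a scope.
type Condition struct {
	Scope Scope
	Attr  string
}

// allConditionsSafe reports whether it is safe to assume the storage
// layer returns only rows matching every condition. It is NOT safe when
// two or more conditions hit generic-attribute columns in the same
// scope: the column iterator can only match one attribute at a time, so
// the engine callback must finalize results. (Hypothetical
// simplification of the real check.)
func allConditionsSafe(conds []Condition) bool {
	perScope := map[Scope]int{}
	for _, c := range conds {
		perScope[c.Scope]++
		if perScope[c.Scope] > 1 {
			return false
		}
	}
	return true
}

func main() {
	// { span.http.host != `` && span.http.flavor = `2` }: two span-scoped
	// generic attributes, so the shortcut must be disabled.
	q1 := []Condition{{ScopeSpan, "http.host"}, {ScopeSpan, "http.flavor"}}
	fmt.Println(allConditionsSafe(q1)) // false

	// One generic attribute per scope: the shortcut remains safe.
	q2 := []Condition{{ScopeResource, "service.name"}, {ScopeSpan, "http.url"}}
	fmt.Println(allConditionsSafe(q2)) // true
}
```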

This fix excludes these queries from the optimization and corrects their output. Since this is not the first bug in this area (see #4409), it deliberately over-excludes to err on the side of correctness, while still supporting many queries, including all of those in the benchmark.
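For excluded queries, the engine callback has to do the finalization the optimization used to skip. A minimal Go sketch of that second pass, under assumed (non-Tempo) types, is re-testing every candidate row against every condition before it is counted:

```go
package main

import "fmt"

// Span is a minimal stand-in for a storage-layer result row.
type Span struct {
	Attrs map[string]string
}

// Predicate re-checks one query condition against a fetched span.
type Predicate func(Span) bool

// finalize sketches the second pass the engine callback performs when
// the storage layer cannot guarantee AllConditions: each candidate span
// is re-tested against every predicate before it contributes to the
// metric. (Hypothetical simplification; names are not Tempo's.)
func finalize(candidates []Span, preds []Predicate) []Span {
	var out []Span
	for _, s := range candidates {
		matchesAll := true
		for _, p := range preds {
			if !p(s) {
				matchesAll = false
				break
			}
		}
		if matchesAll {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	// Storage returned both spans, but only the first satisfies
	// { span.http.host != `` && span.http.flavor = `2` }.
	spans := []Span{
		{Attrs: map[string]string{"http.host": "a", "http.flavor": "2"}},
		{Attrs: map[string]string{"http.host": "b", "http.flavor": "1"}},
	}
	preds := []Predicate{
		func(s Span) bool { return s.Attrs["http.host"] != "" },
		func(s Span) bool { return s.Attrs["http.flavor"] == "2" },
	}
	fmt.Println(len(finalize(spans, preds))) // 1
}
```

This per-span re-check is why the affected queries get slower, and also why they process fewer spans downstream, as the benchmarks below show.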

Affected queries will definitely be slower, but the impact may be modest: because the engine also processes fewer spans, the cost can mostly even out. A benchmark was added for an affected query, and it is only 1.6% slower overall.

It's tempting to disable this optimization altogether, but it still has a significant benefit: ~50% faster for simple queries like `{} | rate() by (resource.service.name)`.

                                                                                                          │ before.txt  │             after.txt             │
                                                                                                          │   sec/op    │   sec/op     vs base              │
BackendBlockQueryRange/{}_|_rate()/5-14                                                                     575.4m ± 9%   569.1m ± 6%       ~ (p=0.818 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(name)/5-14                                                            1.105 ± 1%    1.116 ± 2%       ~ (p=0.093 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(resource.service.name)/5-14                                          835.1m ± 9%   834.2m ± 7%       ~ (p=0.937 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(span.http.url)/5-14                                                   1.336 ± 4%    1.352 ± 1%       ~ (p=0.240 n=6)
BackendBlockQueryRange/{resource.service.name=`loki-ingester`}_|_rate()/5-14                                95.80m ± 1%   96.32m ± 1%       ~ (p=0.180 n=6)
BackendBlockQueryRange/{span.http.host_!=_``_&&_span.http.flavor=`2`}_|_rate()_by_(span.http.flavor)/5-14   209.7m ± 0%   213.1m ± 1%  +1.60% (p=0.002 n=6)
BackendBlockQueryRange/{status=error}_|_rate()/5-14                                                         75.89m ± 0%   76.54m ± 0%  +0.86% (p=0.002 n=6)
geomean                                                                                                     376.9m        379.1m       +0.57%

                                                                                                          │ before.txt │             after.txt              │
                                                                                                          │  MB_IO/op  │  MB_IO/op   vs base                │
BackendBlockQueryRange/{}_|_rate()/5-14                                                                     19.33 ± 0%   19.33 ± 0%       ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_rate()_by_(name)/5-14                                                           24.50 ± 0%   24.50 ± 0%       ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_rate()_by_(resource.service.name)/5-14                                          19.81 ± 0%   19.81 ± 0%       ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_rate()_by_(span.http.url)/5-14                                                  22.05 ± 0%   22.05 ± 0%       ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{resource.service.name=`loki-ingester`}_|_rate()/5-14                                19.79 ± 0%   19.79 ± 0%       ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{span.http.host_!=_``_&&_span.http.flavor=`2`}_|_rate()_by_(span.http.flavor)/5-14   39.58 ± 0%   39.71 ± 0%  +0.33% (p=0.002 n=6)
BackendBlockQueryRange/{status=error}_|_rate()/5-14                                                         20.48 ± 0%   20.48 ± 0%       ~ (p=1.000 n=6) ¹
geomean                                                                                                     22.92        22.93       +0.05%
¹ all samples are equal

                                                                                                          │  before.txt  │              after.txt               │
                                                                                                          │   spans/op   │  spans/op    vs base                 │
BackendBlockQueryRange/{}_|_rate()/5-14                                                                      5.243M ± 0%   5.243M ± 0%        ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_rate()_by_(name)/5-14                                                            5.243M ± 0%   5.243M ± 0%        ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_rate()_by_(resource.service.name)/5-14                                           5.243M ± 0%   5.243M ± 0%        ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_rate()_by_(span.http.url)/5-14                                                   5.243M ± 0%   5.243M ± 0%        ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{resource.service.name=`loki-ingester`}_|_rate()/5-14                                 278.3k ± 0%   278.3k ± 0%        ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{span.http.host_!=_``_&&_span.http.flavor=`2`}_|_rate()_by_(span.http.flavor)/5-14   13.311k ± 0%   4.420k ± 0%  -66.79% (p=0.002 n=6)
BackendBlockQueryRange/{status=error}_|_rate()/5-14                                                          50.24k ± 0%   50.24k ± 0%        ~ (p=1.000 n=6) ¹
geomean                                                                                                      755.6k        645.5k       -14.57%
¹ all samples are equal

                                                                                                          │  before.txt  │             after.txt              │
                                                                                                          │   spans/s    │   spans/s    vs base               │
BackendBlockQueryRange/{}_|_rate()/5-14                                                                     9.113M ±  8%   9.216M ± 7%        ~ (p=0.818 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(name)/5-14                                                           4.745M ±  1%   4.700M ± 2%        ~ (p=0.093 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(resource.service.name)/5-14                                          6.280M ± 10%   6.287M ± 8%        ~ (p=0.937 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(span.http.url)/5-14                                                  3.926M ±  4%   3.877M ± 1%        ~ (p=0.240 n=6)
BackendBlockQueryRange/{resource.service.name=`loki-ingester`}_|_rate()/5-14                                2.905M ±  1%   2.889M ± 1%        ~ (p=0.180 n=6)
BackendBlockQueryRange/{span.http.host_!=_``_&&_span.http.flavor=`2`}_|_rate()_by_(span.http.flavor)/5-14   63.47k ±  0%   20.74k ± 1%  -67.32% (p=0.002 n=6)
BackendBlockQueryRange/{status=error}_|_rate()/5-14                                                         662.1k ±  0%   656.4k ± 0%   -0.85% (p=0.002 n=6)
geomean                                                                                                     2.005M         1.703M       -15.06%

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@mdisibio mdisibio changed the title Fix incorrect optimization for some queries, add benchmark TraceQL Metrics: Fix incorrect optimization for some queries Mar 24, 2025
@mdisibio mdisibio marked this pull request as ready for review March 24, 2025 23:37
@mdisibio mdisibio merged commit e8d5915 into grafana:main Mar 25, 2025
15 checks passed