TraceQL Metrics: Fix incorrect optimization for some queries#4887

Merged
mdisibio merged 3 commits into grafana:main from
mdisibio:fix-metrics-multiple-generic-attrs
Mar 25, 2025

Conversation

mdisibio (Contributor) commented Mar 21, 2025

What this PR does:
A bug was identified in TraceQL metrics where some queries return incorrect data; see the screenshot:

[screenshot: query results including extra http.flavor values]

In this example, the query returns other values for http.flavor even though it is clearly restricted to a single value.

The bug is that the TraceQL metrics engine has performance optimizations that assume behavior from the storage layer that wasn't happening. Specifically, the parquet layer does not honor (and never has honored) AllConditions=true when multiple attributes of the same type are filtered in the generic attribute columns. In that case it relies on the engine callback to finalize the result set, which the TraceQL metrics optimization was skipping.
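To make the failure mode concrete, here is a minimal, hypothetical Go sketch (the types and names are not Tempo's) of the kind of check the fix implies: the AllConditions shortcut is only safe when at most one condition per scope hits the shared generic-attribute columns.

```go
package main

import "fmt"

// Scope identifies which column family a condition reads from. In a
// parquet block, attributes without dedicated columns share generic
// attribute columns per scope.
type Scope int

const (
	ScopeResource Scope = iota
	ScopeSpan
)

// Condition is a simplified filter on one attribute within a scope.
type Condition struct {
	Scope Scope
	Attr  string
}

// allConditionsSafe reports whether it is safe to assume the storage
// layer returns only rows matching every condition. It is NOT safe when
// two or more conditions hit generic-attribute columns in the same
// scope: the column iterator can only match one attribute at a time, so
// the engine callback must finalize results. (Hypothetical
// simplification of the real check.)
func allConditionsSafe(conds []Condition) bool {
	perScope := map[Scope]int{}
	for _, c := range conds {
		perScope[c.Scope]++
		if perScope[c.Scope] > 1 {
			return false
		}
	}
	return true
}

func main() {
	// { span.http.host != `` && span.http.flavor = `2` }: two span-scoped
	// generic attributes, so the shortcut must be disabled.
	q1 := []Condition{{ScopeSpan, "http.host"}, {ScopeSpan, "http.flavor"}}
	fmt.Println(allConditionsSafe(q1)) // false

	// One generic attribute per scope: the shortcut remains safe.
	q2 := []Condition{{ScopeResource, "service.name"}, {ScopeSpan, "http.url"}}
	fmt.Println(allConditionsSafe(q2)) // true
}
```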

This fix excludes these queries from the optimization and corrects their output. Since this is not the first bug in this area (see #4409), it deliberately over-excludes to err on the side of correctness, while still supporting many queries, including all of those in the benchmark.
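For excluded queries, the engine callback has to do the finalization the optimization used to skip. A minimal Go sketch of that second pass, under assumed (non-Tempo) types, is re-testing every candidate row against every condition before it is counted:

```go
package main

import "fmt"

// Span is a minimal stand-in for a storage-layer result row.
type Span struct {
	Attrs map[string]string
}

// Predicate re-checks one query condition against a fetched span.
type Predicate func(Span) bool

// finalize sketches the second pass the engine callback performs when
// the storage layer cannot guarantee AllConditions: each candidate span
// is re-tested against every predicate before it contributes to the
// metric. (Hypothetical simplification; names are not Tempo's.)
func finalize(candidates []Span, preds []Predicate) []Span {
	var out []Span
	for _, s := range candidates {
		matchesAll := true
		for _, p := range preds {
			if !p(s) {
				matchesAll = false
				break
			}
		}
		if matchesAll {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	// Storage returned both spans, but only the first satisfies
	// { span.http.host != `` && span.http.flavor = `2` }.
	spans := []Span{
		{Attrs: map[string]string{"http.host": "a", "http.flavor": "2"}},
		{Attrs: map[string]string{"http.host": "b", "http.flavor": "1"}},
	}
	preds := []Predicate{
		func(s Span) bool { return s.Attrs["http.host"] != "" },
		func(s Span) bool { return s.Attrs["http.flavor"] == "2" },
	}
	fmt.Println(len(finalize(spans, preds))) // 1
}
```

This per-span re-check is why the affected queries get slower, and also why they process fewer spans downstream, as the benchmarks below show.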

Affected queries will definitely be slower, but the impact may be modest: because the engine also processes fewer spans, the cost can mostly even out. A benchmark was added for an affected query, and it is only 1.6% slower overall.

It's tempting to disable this optimization altogether, but it still has a significant benefit: ~50% faster for simple queries like `{} | rate() by (resource.service.name)`.

                                                                                                          │ before.txt  │             after.txt             │
                                                                                                          │   sec/op    │   sec/op     vs base              │
BackendBlockQueryRange/{}_|_rate()/5-14                                                                     575.4m ± 9%   569.1m ± 6%       ~ (p=0.818 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(name)/5-14                                                            1.105 ± 1%    1.116 ± 2%       ~ (p=0.093 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(resource.service.name)/5-14                                          835.1m ± 9%   834.2m ± 7%       ~ (p=0.937 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(span.http.url)/5-14                                                   1.336 ± 4%    1.352 ± 1%       ~ (p=0.240 n=6)
BackendBlockQueryRange/{resource.service.name=`loki-ingester`}_|_rate()/5-14                                95.80m ± 1%   96.32m ± 1%       ~ (p=0.180 n=6)
BackendBlockQueryRange/{span.http.host_!=_``_&&_span.http.flavor=`2`}_|_rate()_by_(span.http.flavor)/5-14   209.7m ± 0%   213.1m ± 1%  +1.60% (p=0.002 n=6)
BackendBlockQueryRange/{status=error}_|_rate()/5-14                                                         75.89m ± 0%   76.54m ± 0%  +0.86% (p=0.002 n=6)
geomean                                                                                                     376.9m        379.1m       +0.57%

                                                                                                          │ before.txt │             after.txt              │
                                                                                                          │  MB_IO/op  │  MB_IO/op   vs base                │
BackendBlockQueryRange/{}_|_rate()/5-14                                                                     19.33 ± 0%   19.33 ± 0%       ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_rate()_by_(name)/5-14                                                           24.50 ± 0%   24.50 ± 0%       ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_rate()_by_(resource.service.name)/5-14                                          19.81 ± 0%   19.81 ± 0%       ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_rate()_by_(span.http.url)/5-14                                                  22.05 ± 0%   22.05 ± 0%       ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{resource.service.name=`loki-ingester`}_|_rate()/5-14                                19.79 ± 0%   19.79 ± 0%       ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{span.http.host_!=_``_&&_span.http.flavor=`2`}_|_rate()_by_(span.http.flavor)/5-14   39.58 ± 0%   39.71 ± 0%  +0.33% (p=0.002 n=6)
BackendBlockQueryRange/{status=error}_|_rate()/5-14                                                         20.48 ± 0%   20.48 ± 0%       ~ (p=1.000 n=6) ¹
geomean                                                                                                     22.92        22.93       +0.05%
¹ all samples are equal

                                                                                                          │  before.txt  │              after.txt               │
                                                                                                          │   spans/op   │  spans/op    vs base                 │
BackendBlockQueryRange/{}_|_rate()/5-14                                                                      5.243M ± 0%   5.243M ± 0%        ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_rate()_by_(name)/5-14                                                            5.243M ± 0%   5.243M ± 0%        ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_rate()_by_(resource.service.name)/5-14                                           5.243M ± 0%   5.243M ± 0%        ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_rate()_by_(span.http.url)/5-14                                                   5.243M ± 0%   5.243M ± 0%        ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{resource.service.name=`loki-ingester`}_|_rate()/5-14                                 278.3k ± 0%   278.3k ± 0%        ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{span.http.host_!=_``_&&_span.http.flavor=`2`}_|_rate()_by_(span.http.flavor)/5-14   13.311k ± 0%   4.420k ± 0%  -66.79% (p=0.002 n=6)
BackendBlockQueryRange/{status=error}_|_rate()/5-14                                                          50.24k ± 0%   50.24k ± 0%        ~ (p=1.000 n=6) ¹
geomean                                                                                                      755.6k        645.5k       -14.57%
¹ all samples are equal

                                                                                                          │  before.txt  │             after.txt              │
                                                                                                          │   spans/s    │   spans/s    vs base               │
BackendBlockQueryRange/{}_|_rate()/5-14                                                                     9.113M ±  8%   9.216M ± 7%        ~ (p=0.818 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(name)/5-14                                                           4.745M ±  1%   4.700M ± 2%        ~ (p=0.093 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(resource.service.name)/5-14                                          6.280M ± 10%   6.287M ± 8%        ~ (p=0.937 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(span.http.url)/5-14                                                  3.926M ±  4%   3.877M ± 1%        ~ (p=0.240 n=6)
BackendBlockQueryRange/{resource.service.name=`loki-ingester`}_|_rate()/5-14                                2.905M ±  1%   2.889M ± 1%        ~ (p=0.180 n=6)
BackendBlockQueryRange/{span.http.host_!=_``_&&_span.http.flavor=`2`}_|_rate()_by_(span.http.flavor)/5-14   63.47k ±  0%   20.74k ± 1%  -67.32% (p=0.002 n=6)
BackendBlockQueryRange/{status=error}_|_rate()/5-14                                                         662.1k ±  0%   656.4k ± 0%   -0.85% (p=0.002 n=6)
geomean                                                                                                     2.005M         1.703M       -15.06%

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@mdisibio mdisibio changed the title Fix incorrect optimization for some queries, add benchmark TraceQL Metrics: Fix incorrect optimization for some queries Mar 24, 2025
@mdisibio mdisibio marked this pull request as ready for review March 24, 2025 23:37
@mdisibio mdisibio merged commit e8d5915 into grafana:main Mar 25, 2025
15 checks passed