[vParquet5] Faster fetch layer for metrics queries #6359

Merged

mdisibio merged 42 commits into grafana:main from mdisibio:fetch-spanonly on Mar 16, 2026

Conversation

@mdisibio
Contributor

@mdisibio mdisibio commented Jan 30, 2026

What this PR does:
This PR introduces a new fetch layer in vParquet5 to speed up metrics queries.

Core Concept
The current fetch layer is a one-size-fits-all approach. It's complex and designed to fulfill queries that require knowledge of the full trace, like structural analysis { foo } &>> { bar } !<< { baz }, or within-trace aggregation like { foo } | avg(duration) > 10s. For searches this is not really a problem because search results are limited and queries often exit early. But for metrics queries like { foo } | rate() by (bar), which are exhaustive and may process millions (or billions) of spans, the overhead is significant. After adding the new compressed timestamp columns in vP5, this aspect of the fetch layer is now the bottleneck.

The key insight is that the query { foo } | rate() by (bar) doesn't need spansets, it only needs spans. However, making a new fetch layer that goes from Iterator[Spanset] to Iterator[Span] is not the whole story.
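To make the shape of the change concrete, here is a minimal sketch of the two fetch shapes. The type names are illustrative stand-ins, not the actual Tempo interfaces:

package fetchsketch

// Span and Spanset are simplified stand-ins for Tempo's TraceQL result types.
type Span struct {
	StartTimeUnixNanos uint64
	DurationNanos      uint64
	Attrs              map[string]string
}

type Spanset struct {
	TraceID []byte
	Spans   []*Span
}

// The existing fetch layer yields whole spansets, which requires grouping
// spans by trace before anything can be returned.
type SpansetIterator interface {
	Next() (*Spanset, error)
	Close()
}

// The new metrics-only path can yield bare spans as soon as they are read,
// with no spanset assembly in between.
type SpanIterator interface {
	Next() (*Span, error)
	Close()
}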

Core Concept 2
This also introduces a new, more efficient join interface in parquetquery. Join iterators are used to find and collect matches between different columns, and squash results across definition levels. This is how we bubble up event-level values with their matching span-level values, for example.

The current interface of GroupPredicate.Keep(result) had to go. It required building up slices internally: the JoinIterator builds up the slice and delivers the entire payload to the GroupPredicate. Instead, we have a new Collector interface which receives each column's value immediately, with no buffer in between.

We go from:

Join foo, bar
for each thing
    append(result, thing)
Keep(result)
return from Next()

to:

Join foo, bar
collector.Reset()
for each thing
    collector.Collect(thing)
return collector.Result() in Next()

The Collector can have its own internal representation for optimal storage of the data. In this case it writes directly to the span and there is no intermediate slice.
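A rough sketch of the interface change, using simplified stand-ins for the parquetquery types (the real ones carry parquet values, row numbers, and definition levels):

package joinsketch

// Value stands in for a parquet column value plus its row number.
type Value struct {
	RowNumber int64
	Bytes     []byte
}

// Old style: the join iterator accumulates every matched value into a slice
// and hands the whole payload to the predicate once the group is complete.
type GroupPredicate interface {
	Keep(result []Value) bool
}

// New style: each matched value is pushed into the collector as soon as it
// is read. The collector owns its own representation (here it could write
// straight into a span), so no intermediate slice is built.
type Collector interface {
	Reset()
	Collect(columnIndex int, v Value)
	Result() any // returned from the join iterator's Next()
}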

This also lets us flatten the entire tree of iterators, which offers better opportunities to improve performance even further by allowing reordering of any columns, even across levels, and is of course much more readable. Previously we could only reorder within scopes, and reorder the scopes themselves.

Example of old iterator chain for the query { resource.service.name="foo" } | quantile_over_time(duration, 0.5) by (span.bar):

        rebatchIterator: 
        	LeftJoinIterator: oldStyleCollector: traceCollector()
        	required: 
        		LeftJoinIterator: oldStyleCollector: batchCollector(false, 0)
        		required: 
        			LeftJoinIterator: oldStyleCollector: instrumentationCollector(0)
        			required: 
        				LeftJoinIterator: oldStyleCollector: spanCollector(0)
        				required: 
        					bridgeIterator: 
        						rebatchIterator: 
        							LeftJoinIterator: oldStyleCollector: traceCollector()
        							required: 
        								LeftJoinIterator: oldStyleCollector: batchCollector(true, 1)
        								required: 
        									SyncIterator: rs.Resource.ServiceName : StringEqualPredicate{foo}
        									LeftJoinIterator: oldStyleCollector: instrumentationCollector(0)
        									required: 
        										LeftJoinIterator: oldStyleCollector: spanCollector(0)
        										required: 
        											virtualRowNumberIterator()
        										optional: 
        									optional: 
        								optional: 
        							optional: 
        				optional: 
        					SyncIterator: rs.ss.Spans.DurationNano : nil
        					SyncIterator: rs.ss.Spans.StartTimeUnixNano : nil
        					LeftJoinIterator: oldStyleCollector: attributeCollector{}
        					required: 
        						SyncIterator: rs.ss.Spans.Attrs.Key : StringInPredicate{bar}
        					optional: 
        						SyncIterator: rs.ss.Spans.Attrs.Value : nil
        						SyncIterator: rs.ss.Spans.Attrs.ValueInt : nil
        						SyncIterator: rs.ss.Spans.Attrs.ValueDouble : nil
        						SyncIterator: rs.ss.Spans.Attrs.ValueBool : nil
        			optional: 
        		optional: 
        	optional: 
        		SyncIterator: TraceID : CallbackPredicate{}

New iterator chain:

        LeftJoinIterator: spanCollector(0)
        required: 
        	LeftJoinIterator: spanCollector(1)
        	required: 
        		SyncIterator: rs.ss.Spans.StatusCode : nil
        		SyncIterator: rs.Resource.ServiceName : StringEqualPredicate{foo}
        	optional: 
        optional: 
        	SyncIterator: TraceID : CallbackPredicate{}
        	SyncIterator: rs.ss.Spans.DurationNano : nil
        	SyncIterator: rs.ss.Spans.StartTimeUnixNano : nil
        	LeftJoinIterator: scopedAttributeCollector(span)
        	required: 
        		SyncIterator: rs.ss.Spans.Attrs.Key : StringInPredicate{bar}
        	optional: 
        		SyncIterator: rs.ss.Spans.Attrs.Value : nil
        		SyncIterator: rs.ss.Spans.Attrs.ValueInt : nil
        		SyncIterator: rs.ss.Spans.Attrs.ValueDouble : nil
        		SyncIterator: rs.ss.Spans.Attrs.ValueBool : nil

How to use it
The new fetch layer is added to vP5 and accessible for metrics queries. It only applies to queries which don't need the full trace, i.e. those identified by the NeedsFullTrace() classification we added previously. It's also behind a query hint with(new=true).

Examples:
{ } | rate() <-- old fetch
{ } | rate() with(new=true) <-- new fetch
{ } >> { } with(new=true) <-- old fetch, hint is ignored because this query requires the full trace
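A minimal sketch of how the two paths are selected, using stand-in types; the real check lives in MetricsEvaluator.Do (visible in the review diff further down):

package metricsketch

import "context"

// evaluator is a stand-in for traceql.MetricsEvaluator; only the two fields
// that drive path selection are shown.
type evaluator struct {
	needsFullTrace bool // from the query's NeedsFullTrace() classification
	newFetch       bool // from the with(new=true) query hint
}

func (e *evaluator) do(ctx context.Context) error {
	// The span-only fetch is used only when the query can be answered
	// without reassembling full traces AND the hint opts in.
	if !e.needsFullTrace && e.newFetch {
		return e.doSpansOnly(ctx)
	}
	// Structural queries, within-trace aggregates, or no hint: fall back to
	// the existing spanset-based fetch layer.
	return e.doSpansets(ctx)
}

// Placeholders for the two real code paths.
func (e *evaluator) doSpansOnly(_ context.Context) error { return nil }
func (e *evaluator) doSpansets(_ context.Context) error  { return nil }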

Key Performance Changes
Summarizing the key performance-related changes:

  1. No longer building spansets internally
  2. Join iterators no longer building slices internally
  3. No more span pooling
  4. No more iterator result pooling

Limitations and Considerations
First, it would be ideal to use this for search, but I haven't yet solved how to do that. Searches have queries like { } | by(resource.service.name) | count() > 10 | select(foo), which internally re-group / re-order spans, and we handle this via the rebatch and bridge iterators. Those are based on spansets, and I'm not sure how to integrate them yet.

Second, there is the complexity of an additional fetch layer that must be maintained: it has to be updated in parallel and is also a potential source of discrepancies. There are tests that verify all of the same queries and selectAll behavior on both fetch layers. I think that will catch most things. Based on the benchmarks, I think this is worth it, or else I wouldn't have submitted the PR.

Benchmarks

QueryRange

goos: darwin
goarch: arm64
pkg: github.com/grafana/tempo/tempodb/encoding/vparquet5
cpu: Apple M4 Pro

│ before.txt │ after.txt │
│ sec/op │ sec/op vs base │
BackendBlockQueryRange/{}_|_rate() 2350.1m ± 13% 816.2m ± 1% -65.27% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_rate()_with(sample=true) 504.4m ± 2% 364.6m ± 3% -27.71% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(span.http.status_code) 8.550 ± 4% 5.658 ± 2% -33.83% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(resource.service.name) 3.878 ± 13% 1.534 ± 1% -60.45% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(span.http.url) 10.427m ± 15% 7.305m ± 2% -29.94% (p=0.002 n=6)
BackendBlockQueryRange/{resource.service.name=`loki-ingester`}_|_rate() 1122.2m ± 3% 456.8m ± 3% -59.29% (p=0.002 n=6)
BackendBlockQueryRange/{span.http.host_!=_``_&&_span.http.flavor=`2`}_|_rate()_by_(span.http.flavor) 2.060 ± 3% 1.699 ± 0% -17.53% (p=0.002 n=6)
BackendBlockQueryRange/{status=error}_|_rate() 876.4m ± 3% 639.6m ± 0% -27.03% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_quantile_over_time(duration,_.99,_.9,_.5) 4.474 ± 13% 2.601 ± 2% -41.87% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_quantile_over_time(duration,_.99)_by_(span.http.status_code) 11.145 ± 8% 7.453 ± 1% -33.12% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_histogram_over_time(duration) 4.797 ± 4% 2.599 ± 1% -45.82% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_avg_over_time(duration)_by_(span.http.status_code) 10.368 ± 6% 6.612 ± 1% -36.23% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_max_over_time(duration)_by_(span.http.status_code) 10.273 ± 4% 6.734 ± 1% -34.45% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_min_over_time(duration)_by_(span.http.status_code) 10.332 ± 1% 6.739 ± 0% -34.78% (p=0.002 n=6)
BackendBlockQueryRange/{_name_!=_nil_}_|_compare({status=error}) 54.15m ± 58% 55.71m ± 22% ~ (p=0.240 n=6)
BackendBlockQueryRange/{}_>_{}_|_rate()_by_(name) 3.149 ± 6% 3.083 ± 3% ~ (p=0.485 n=6)
geomean 1.931 1.221 -36.77%

│ before.txt │ after.txt │
│ MB_io/op │ MB_io/op vs base │
BackendBlockQueryRange/{}_|_rate() 93.56 ± 0% 93.54 ± 0% -0.02% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_rate()_with(sample=true) 93.54 ± 0% 93.54 ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_rate()_by_(span.http.status_code) 412.1 ± 0% 412.1 ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_rate()_by_(resource.service.name) 94.86 ± 0% 94.86 ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_rate()_by_(span.http.url) 6.393 ± 0% 6.377 ± 0% -0.27% (p=0.002 n=6)
BackendBlockQueryRange/{resource.service.name=`loki-ingester`}_|_rate() 94.86 ± 0% 94.84 ± 0% -0.02% (p=0.002 n=6)
BackendBlockQueryRange/{span.http.host_!=_``_&&_span.http.flavor=`2`}_|_rate()_by_(span.http.flavor) 286.7 ± 0% 286.6 ± 0% -0.03% (p=0.002 n=6)
BackendBlockQueryRange/{status=error}_|_rate() 98.44 ± 0% 98.44 ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_quantile_over_time(duration,_.99,_.9,_.5) 179.1 ± 0% 179.1 ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_quantile_over_time(duration,_.99)_by_(span.http.status_code) 497.6 ± 0% 497.6 ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_histogram_over_time(duration) 179.1 ± 0% 179.1 ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_avg_over_time(duration)_by_(span.http.status_code) 497.6 ± 0% 497.6 ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_max_over_time(duration)_by_(span.http.status_code) 497.6 ± 0% 497.6 ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_min_over_time(duration)_by_(span.http.status_code) 497.6 ± 0% 497.6 ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{_name_!=_nil_}_|_compare({status=error}) 15.13 ± 1% 17.40 ± 0% +15.00% (p=0.002 n=6)
BackendBlockQueryRange/{}_>_{}_|_rate()_by_(name) 50.07 ± 0% 50.07 ± 0% ~ (p=1.000 n=6) ¹
geomean 132.2 133.4 +0.86%
¹ all samples are equal

│ before.txt │ after.txt │
│ spans/op │ spans/op vs base │
BackendBlockQueryRange/{}_|_rate() 26.07M ± 0% 26.07M ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_rate()_with(sample=true) 2.607M ± 0% 2.607M ± 0% -0.01% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(span.http.status_code) 26.07M ± 0% 26.07M ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_rate()_by_(resource.service.name) 26.07M ± 0% 26.07M ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_rate()_by_(span.http.url) 7.020k ± 26% 4.959k ± 3% -29.36% (p=0.002 n=6)
BackendBlockQueryRange/{resource.service.name=`loki-ingester`}_|_rate() 4.854M ± 0% 4.854M ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{span.http.host_!=_``_&&_span.http.flavor=`2`}_|_rate()_by_(span.http.flavor) 20.58k ± 0% 20.58k ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{status=error}_|_rate() 899.8k ± 0% 899.8k ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_quantile_over_time(duration,_.99,_.9,_.5) 26.07M ± 0% 26.07M ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_quantile_over_time(duration,_.99)_by_(span.http.status_code) 26.07M ± 0% 26.07M ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_histogram_over_time(duration) 26.07M ± 0% 26.07M ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_avg_over_time(duration)_by_(span.http.status_code) 26.07M ± 0% 26.07M ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_max_over_time(duration)_by_(span.http.status_code) 26.07M ± 0% 26.07M ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_min_over_time(duration)_by_(span.http.status_code) 26.07M ± 0% 26.07M ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{_name_!=_nil_}_|_compare({status=error}) 6.438k ± 58% 6.432k ± 27% ~ (p=0.892 n=6)
BackendBlockQueryRange/{}_>_{}_|_rate()_by_(name) 4.117M ± 0% 4.117M ± 0% ~ (p=1.000 n=6) ¹
geomean 3.342M 3.270M -2.15%
¹ all samples are equal

│ before.txt │ after.txt │
│ spans/s │ spans/s vs base │
BackendBlockQueryRange/{}_|_rate() 11.10M ± 14% 31.94M ± 1% +187.86% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_rate()_with(sample=true) 5.168M ± 2% 7.149M ± 3% +38.33% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(span.http.status_code) 3.049M ± 4% 4.607M ± 2% +51.12% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(resource.service.name) 6.724M ± 15% 16.997M ± 1% +152.79% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(span.http.url) 676.1k ± 9% 679.6k ± 2% ~ (p=1.000 n=6)
BackendBlockQueryRange/{resource.service.name=`loki-ingester`}_|_rate() 4.326M ± 3% 10.625M ± 3% +145.62% (p=0.002 n=6)
BackendBlockQueryRange/{span.http.host_!=_``_&&_span.http.flavor=`2`}_|_rate()_by_(span.http.flavor) 9.990k ± 3% 12.114k ± 0% +21.26% (p=0.002 n=6)
BackendBlockQueryRange/{status=error}_|_rate() 1.027M ± 3% 1.407M ± 0% +37.03% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_quantile_over_time(duration,_.99,_.9,_.5) 5.833M ± 14% 10.023M ± 2% +71.84% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_quantile_over_time(duration,_.99)_by_(span.http.status_code) 2.339M ± 9% 3.497M ± 1% +49.52% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_histogram_over_time(duration) 5.435M ± 4% 10.031M ± 1% +84.57% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_avg_over_time(duration)_by_(span.http.status_code) 2.514M ± 6% 3.942M ± 1% +56.80% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_max_over_time(duration)_by_(span.http.status_code) 2.537M ± 4% 3.871M ± 1% +52.55% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_min_over_time(duration)_by_(span.http.status_code) 2.523M ± 1% 3.868M ± 0% +53.32% (p=0.002 n=6)
BackendBlockQueryRange/{_name_!=_nil_}_|_compare({status=error}) 119.0k ± 4% 116.4k ± 3% ~ (p=0.180 n=6)
BackendBlockQueryRange/{}_>_{}_|_rate()_by_(name) 1.307M ± 6% 1.336M ± 3% ~ (p=0.485 n=6)
geomean 1.732M 2.680M +54.77%

│ before.txt │ after.txt │
│ B/op │ B/op vs base │
BackendBlockQueryRange/{}_|_rate() 13.442Mi ± 1242% 5.967Mi ± 74% -55.61% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_rate()_with(sample=true) 5.934Mi ± 3% 5.787Mi ± 0% ~ (p=0.394 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(span.http.status_code) 1069.5Mi ± 17% 996.9Mi ± 2% -6.79% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(resource.service.name) 15.170Mi ± 132% 7.227Mi ± 40% -52.36% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(span.http.url) 7.676Mi ± 4% 6.955Mi ± 0% -9.39% (p=0.002 n=6)
BackendBlockQueryRange/{resource.service.name=`loki-ingester`}_|_rate() 653.3Mi ± 6% 681.1Mi ± 0% ~ (p=0.065 n=6)
BackendBlockQueryRange/{span.http.host_!=_``_&&_span.http.flavor=`2`}_|_rate()_by_(span.http.flavor) 631.8Mi ± 2% 671.6Mi ± 6% +6.31% (p=0.002 n=6)
BackendBlockQueryRange/{status=error}_|_rate() 472.3Mi ± 0% 521.5Mi ± 1% +10.42% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_quantile_over_time(duration,_.99,_.9,_.5) 14.135Mi ± 597% 6.130Mi ± 306% -56.64% (p=0.037 n=6)
BackendBlockQueryRange/{}_|_quantile_over_time(duration,_.99)_by_(span.http.status_code) 1047.4Mi ± 17% 996.0Mi ± 2% -4.91% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_histogram_over_time(duration) 14.135Mi ± 96% 6.129Mi ± 125% -56.64% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_avg_over_time(duration)_by_(span.http.status_code) 1055.4Mi ± 18% 1009.2Mi ± 1% -4.38% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_max_over_time(duration)_by_(span.http.status_code) 1041.1Mi ± 1% 1001.5Mi ± 2% -3.81% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_min_over_time(duration)_by_(span.http.status_code) 1044.5Mi ± 1% 992.2Mi ± 3% -5.01% (p=0.002 n=6)
BackendBlockQueryRange/{_name_!=_nil_}_|_compare({status=error}) 17.43Mi ± 72% 11.21Mi ± 63% -35.72% (p=0.026 n=6)
BackendBlockQueryRange/{}_>_{}_|_rate()_by_(name) 506.8Mi ± 24% 506.1Mi ± 27% ~ (p=0.937 n=6)
geomean 126.1Mi 99.19Mi -21.36%

│ before.txt │ after.txt │
│ allocs/op │ allocs/op vs base │
BackendBlockQueryRange/{}_|_rate() 71.71k ± 303% 71.64k ± 0% -0.09% (p=0.035 n=6)
BackendBlockQueryRange/{}_|_rate()_with(sample=true) 71.63k ± 0% 71.62k ± 0% ~ (p=0.366 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(span.http.status_code) 4276.5k ± 5% 235.7k ± 0% -94.49% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(resource.service.name) 78.84k ± 274% 78.81k ± 0% -0.04% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(span.http.url) 61.41k ± 2% 59.83k ± 0% -2.58% (p=0.002 n=6)
BackendBlockQueryRange/{resource.service.name=`loki-ingester`}_|_rate() 335.4k ± 58% 103.5k ± 0% -69.15% (p=0.002 n=6)
BackendBlockQueryRange/{span.http.host_!=_``_&&_span.http.flavor=`2`}_|_rate()_by_(span.http.flavor) 705.5k ± 0% 167.4k ± 0% -76.27% (p=0.002 n=6)
BackendBlockQueryRange/{status=error}_|_rate() 130.7k ± 8% 106.9k ± 0% -18.23% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_quantile_over_time(duration,_.99,_.9,_.5) 84.61k ± 465% 84.57k ± 0% -0.05% (p=0.037 n=6)
BackendBlockQueryRange/{}_|_quantile_over_time(duration,_.99)_by_(span.http.status_code) 4314.9k ± 9% 254.2k ± 0% -94.11% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_histogram_over_time(duration) 84.61k ± 82% 84.57k ± 0% -0.05% (p=0.028 n=6)
BackendBlockQueryRange/{}_|_avg_over_time(duration)_by_(span.http.status_code) 4325.2k ± 3% 247.5k ± 0% -94.28% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_max_over_time(duration)_by_(span.http.status_code) 4308.9k ± 1% 247.9k ± 0% -94.25% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_min_over_time(duration)_by_(span.http.status_code) 4300.2k ± 1% 248.0k ± 0% -94.23% (p=0.002 n=6)
BackendBlockQueryRange/{_name_!=_nil_}_|_compare({status=error}) 103.30k ± 20% 81.73k ± 5% -20.89% (p=0.002 n=6)
BackendBlockQueryRange/{}_>_{}_|_rate()_by_(name) 598.3k ± 45% 529.5k ± 51% -11.51% (p=0.026 n=6)
geomean 404.0k 135.3k -66.50%

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

…ble with new query path behind query hint. Fix issue with bad collection of string array attributes and unnecessary mapkey computations
…ebug name strings on iter and collector objects
…ator is exhausted, just remove it from the list. This is about 10% performance improvement when dealing with exemplars, which only read the first few values and then exit
@mdisibio mdisibio marked this pull request as ready for review February 6, 2026 18:43
@stoewer
Contributor

stoewer commented Feb 11, 2026

Code review

Found 2 issues:

  1. Off-by-one error in fork function causing data loss (Bug in SpansetFilter.Evaluate)

When forking at index i, the code copies input[:i-1] instead of input[:i], which loses element i-1. This causes spansets that passed the filter to be silently dropped.

Example: If processing spansets [A, B, C] and C fails at index 2, current code copies only [A] instead of [A, B], losing spanset B.

tempo/pkg/traceql/ast.go

Lines 439 to 446 in cd4b3ee

	if !forked {
		// First we are working the output, so copy all previously identical spansets into the output.
		if i > 0 {
			outputBuffer = append(outputBuffer, input[:i-1]...)
		}
	}
	forked = true
}
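A self-contained sketch of the suggested correction, using plain ints as stand-ins for spansets: every element before the fork index has already passed the filter, so the copy bound should be i, not i-1.

package forksketch

// forkCopy copies all elements that precede the fork point at index i.
func forkCopy(input []int, i int) []int {
	outputBuffer := make([]int, 0, len(input))
	if i > 0 {
		// input[:i] keeps elements 0..i-1; input[:i-1] would silently drop
		// element i-1, which is the data loss described above.
		outputBuffer = append(outputBuffer, input[:i]...)
	}
	return outputBuffer
}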

  2. Race condition from missing mutex locks in DoSpansOnly() (Bug due to commented-out locks)

The DoSpansOnly() method has commented-out mutex locks (lines 1383, 1412) while accessing shared state (e.metricsPipeline, e.spansTotal, e.exemplarCount). The existing Do() method properly locks these operations. The MetricsEvaluator explicitly states it "isn't thread-safe" but "parallelization is required", so external locking is needed. Without locks, concurrent access from multiple goroutines causes race conditions and data corruption.

// e.mtx.Lock()
if e.checkTime {
	st := s.StartTimeUnixNanos()
	if st <= e.start || st > e.end {
		// e.mtx.Unlock()
		continue
	}
}
if e.storageReq.SpanSampler != nil {
	e.storageReq.SpanSampler.Measured()
}
e.metricsPipeline.observe(s)
e.spansTotal++
if e.maxExemplars > 0 && e.exemplarCount < e.maxExemplars {
	traceID, ok := s.AttributeFor(IntrinsicTraceIDAttribute)
	if ok {
		if e.sampleExemplar(traceID.valBytes) {
			e.metricsPipeline.observeExemplar(s)
		}
	}
}
seriesCount = e.metricsPipeline.length()
// e.mtx.Unlock()
if maxSeries > 0 && seriesCount >= maxSeries {
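For reference, a minimal sketch of the locking pattern being asked for here; the field names are simplified stand-ins for the MetricsEvaluator state listed above, not the actual fix.

package evalsketch

import "sync"

// evaluatorState stands in for the shared MetricsEvaluator fields
// (metricsPipeline, spansTotal, exemplarCount) touched inside the loop.
type evaluatorState struct {
	mtx        sync.Mutex
	spansTotal int
	series     map[string]int
}

// observe takes the lock before touching shared state and releases it when
// done, mirroring what the existing Do() method already does.
func (e *evaluatorState) observe(seriesKey string) (seriesCount int) {
	e.mtx.Lock()
	defer e.mtx.Unlock()
	e.series[seriesKey]++
	e.spansTotal++
	return len(e.series) // checked against maxSeries by the caller
}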

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

Contributor

@stoewer stoewer left a comment


Great PR. I think adding a second fetch function that retrieves only single spans is a valuable addition to Tempo’s fetch layer. It definitely introduces a fair amount of additional code and complexity. On the other hand, if increasing the complexity of the fetch layer gets us the extra performance we need to run metrics queries over larger time ranges, I’d say it is still worth it.

That said, it is probably still worth thinking about ways to make the added complexity more manageable and to make sure we get the best benefit from it. A few thoughts:

  • Unit and e2e tests that run a broad set of metrics queries in both modes and compare the results, to catch cases where a change is applied to one path but not the other
  • Consider using the span-only fetch function in other areas as well, for example tag search
  • Refactor the spanset fetch path so that the code becomes more similar to the span-only path, reducing cognitive load. Mainly thinking about the createXXXIterator() functions
  • Use the new capabilities of LeftJoinIterator in the spanset fetch path, maybe by flattening the iterator tree using child iterators on different definition levels

A lot of this can of course happen in follow-up PRs.

Also: when merging the latest changes from main, it may be necessary to adjust the span-only fetch path to properly support OpIn, OpNotIn, etc.

// uses the known time range of the data for last-minute optimizations. Time range is unix nanos
func (e *MetricsEvaluator) Do(ctx context.Context, f SpansetFetcher, fetcherStart, fetcherEnd uint64, maxSeries int) error {
	if !e.needsFullTrace && e.newFetch {
Contributor


If I read this correctly, e.newFetch is set by the query hint new and is false by default. Once the default is changed to true, do you think we need a config / overrides parameter to control this for all queries?

Contributor Author


I think long-term no. Once we feel sufficiently confident in the new fetch layer, we should remove the hint and use it for all applicable queries. An argument could be made to leave the hint to allow reverting a query to the old path, but I think long term this is treated like any other improvement - tested/expected to be correct.

Comment thread pkg/traceql/storage.go Outdated
Comment thread tempodb/encoding/vparquet4/block_traceql_test.go Outdated
Comment thread tempodb/encoding/vparquet5/block_traceql_fetch.go Outdated
Comment thread pkg/parquetquery/iters.go Outdated

got := []IteratorResult{}

j, err := NewLeftJoinIterator(1, nil, nil, nil,
Contributor


The NewLeftJoinIterator now basically has two ways to pass in iterators: 1) via the parameters required and optional, 2) using the WithIterator option, which adds some complexity to the constructor. Maybe this could be a bit streamlined?

Contributor Author


Hmm, I was thinking the new signature was streamlined because it doesn't require churn in the existing code, and ensures that this new implementation has exactly the correct behavior. The new options pattern is required in order to set the new options like the callback parameter and the iter-specific definition level. Eventually we can adjust the existing code to it.

We could make a separate initialization method NewLeftJoinIteratorWithOptions? Or rename the old one to NewLeftJoinIteratorOld and make the new one the proper name? Need to see what the convention is.
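For illustration, the two construction styles under discussion could look roughly like this. The types and the constructor signature here are simplified stand-ins; the real parquetquery constructor takes more parameters.

package itersketch

// Iterator is a simplified stand-in for the parquetquery iterator interface.
type Iterator interface {
	Next() (any, error)
}

type LeftJoinIterator struct {
	definitionLevel    int
	required, optional []Iterator
}

type Option func(*LeftJoinIterator)

// WithIterator attaches one child iterator via the options pattern; the real
// option can also carry per-iterator settings such as its definition level.
func WithIterator(it Iterator, required bool) Option {
	return func(j *LeftJoinIterator) {
		if required {
			j.required = append(j.required, it)
		} else {
			j.optional = append(j.optional, it)
		}
	}
}

// NewLeftJoinIterator accepts children both as plain slice parameters (the
// existing call sites) and as options (the new call sites), which is the
// duplication the comment points out.
func NewLeftJoinIterator(defLevel int, required, optional []Iterator, opts ...Option) (*LeftJoinIterator, error) {
	j := &LeftJoinIterator{definitionLevel: defLevel, required: required, optional: optional}
	for _, o := range opts {
		o(j)
	}
	return j, nil
}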

Contributor


In the long run it's probably best to remove the parameters required and optional and move to the WithIterator option everywhere.

However, as I understand it now, the goal of the PR is to purposefully not change / refactor other parts of the code. So leaving the constructor as is is a good option for now.

Comment thread tempodb/encoding/vparquet5/block_traceql_fetch.go Outdated
@mdisibio
Contributor Author

On the other hand, if increasing the complexity of the fetch layer gets us the extra performance we need to run metrics queries over larger time ranges, I’d say it is still worth it.

Right, I think that is the core question, and agree that it is worth it. For this PR I'd like to distinguish between the high-level questions (is this worth it, testing methodology, what do we accomplish in PR 1 versus followup), and the low-level (updating with main, bugs/lint, etc).

  1. I think we are agreed on the first, it is worth it, and we will take steps to reduce/manage the complexity, i.e. by consolidating/refactoring other parts.
  2. +1 to more tests to check results between the 2 paths, I think this is an area we need to improve already. Will work on that in this PR.
  3. Is there anything else high-level we want in this PR? I think it is worth purposefully not refactoring the existing fetch layer in this PR yet.
  4. Yep I will take care of updating with latest main and all the small stuff.

@stoewer
Contributor

stoewer commented Feb 19, 2026

I completely agree. It is completely fine to leave the scope of this PR as is and do the refactoring, tests, etc. in follow-up PRs.
For reviewing this feature for the first time, it was actually really helpful that changes to the existing parts were minimal.

Contributor

@stoewer stoewer left a comment


Looks good. Just a couple of minor remarks

Comment thread pkg/parquetquery/iters.go Outdated
Comment thread tempodb/encoding/vparquet5/wal_block.go Outdated

Contributor

@stoewer stoewer left a comment


LGTM

@mdisibio mdisibio merged commit af64238 into grafana:main Mar 16, 2026
24 checks passed