[vParquet5] Faster fetch layer for metrics queries #6359

Merged

mdisibio merged 42 commits into grafana:main from mdisibio:fetch-spanonly on Mar 16, 2026

Conversation

@mdisibio
Contributor

@mdisibio mdisibio commented Jan 30, 2026

What this PR does:
This PR introduces a new fetch layer in vParquet5 to speed up metrics queries.

Core Concept
The current fetch layer is a one-size-fits-all approach. It's complex and designed to fulfill queries that require knowledge of the full trace, like structural analysis { foo } &>> { bar } !<< { baz }, or within-trace aggregation like { foo } | avg(duration) > 10s. For searches this is not really a problem because search results are limited and queries often exit early. But for metrics queries like { foo } | rate() by (bar), which are exhaustive and may process millions (or billions) of spans, the overhead is significant. After adding the new compressed timestamp columns in vP5, this aspect of the fetch layer is now the bottleneck.

The key insight is that the query { foo } | rate() by (bar) doesn't need spansets, it only needs spans. However, making a new fetch layer that goes from Iterator[Spanset] to Iterator[Span] is not the whole story.
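To make the shape of the change concrete, here is a minimal sketch of the two fetch shapes. The type names are illustrative stand-ins, not the actual Tempo interfaces:

package fetchsketch

// Span and Spanset are simplified stand-ins for Tempo's TraceQL result types.
type Span struct {
	StartTimeUnixNanos uint64
	DurationNanos      uint64
	Attrs              map[string]string
}

type Spanset struct {
	TraceID []byte
	Spans   []*Span
}

// The existing fetch layer yields whole spansets, which requires grouping
// spans by trace before anything can be returned.
type SpansetIterator interface {
	Next() (*Spanset, error)
	Close()
}

// The new metrics-only path can yield bare spans as soon as they are read,
// with no spanset assembly in between.
type SpanIterator interface {
	Next() (*Span, error)
	Close()
}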

Core Concept 2
This also introduces a new, more efficient join interface in parquetquery. Join iterators are used to find and collect matches between different columns, and squash results across definition levels. This is how we bubble up event-level values with their matching span-level values, for example.

The current interface of GroupPredicate.Keep(result) had to go. It required building up slices internally: the JoinIterator builds up the slice and delivers the entire payload to the GroupPredicate. Instead, we have a new Collector interface which receives each column's value immediately, with no buffer in between.

We go from:

Join foo, bar
for each thing
    append(result, thing)
Keep(result)
return from Next()

to:

Join foo, bar
collector.Reset()
for each thing
    collector.Collect(thing)
return collector.Result() in Next()

The Collector can have its own internal representation for optimal storage of the data. In this case it writes directly to the span and there is no intermediate slice.
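A rough sketch of the interface change, using simplified stand-ins for the parquetquery types (the real ones carry parquet values, row numbers, and definition levels):

package joinsketch

// Value stands in for a parquet column value plus its row number.
type Value struct {
	RowNumber int64
	Bytes     []byte
}

// Old style: the join iterator accumulates every matched value into a slice
// and hands the whole payload to the predicate once the group is complete.
type GroupPredicate interface {
	Keep(result []Value) bool
}

// New style: each matched value is pushed into the collector as soon as it
// is read. The collector owns its own representation (here it could write
// straight into a span), so no intermediate slice is built.
type Collector interface {
	Reset()
	Collect(columnIndex int, v Value)
	Result() any // returned from the join iterator's Next()
}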

This also lets us flatten the entire tree of iterators, which offers better opportunities to improve performance even further by allowing reordering of any columns, even across levels, and is of course much more readable. Previously we could only reorder within scopes, and reorder the scopes themselves.

Example of old iterator chain for the query { resource.service.name="foo" } | quantile_over_time(duration, 0.5) by (span.bar):

        rebatchIterator: 
        	LeftJoinIterator: oldStyleCollector: traceCollector()
        	required: 
        		LeftJoinIterator: oldStyleCollector: batchCollector(false, 0)
        		required: 
        			LeftJoinIterator: oldStyleCollector: instrumentationCollector(0)
        			required: 
        				LeftJoinIterator: oldStyleCollector: spanCollector(0)
        				required: 
        					bridgeIterator: 
        						rebatchIterator: 
        							LeftJoinIterator: oldStyleCollector: traceCollector()
        							required: 
        								LeftJoinIterator: oldStyleCollector: batchCollector(true, 1)
        								required: 
        									SyncIterator: rs.Resource.ServiceName : StringEqualPredicate{foo}
        									LeftJoinIterator: oldStyleCollector: instrumentationCollector(0)
        									required: 
        										LeftJoinIterator: oldStyleCollector: spanCollector(0)
        										required: 
        											virtualRowNumberIterator()
        										optional: 
        									optional: 
        								optional: 
        							optional: 
        				optional: 
        					SyncIterator: rs.ss.Spans.DurationNano : nil
        					SyncIterator: rs.ss.Spans.StartTimeUnixNano : nil
        					LeftJoinIterator: oldStyleCollector: attributeCollector{}
        					required: 
        						SyncIterator: rs.ss.Spans.Attrs.Key : StringInPredicate{bar}
        					optional: 
        						SyncIterator: rs.ss.Spans.Attrs.Value : nil
        						SyncIterator: rs.ss.Spans.Attrs.ValueInt : nil
        						SyncIterator: rs.ss.Spans.Attrs.ValueDouble : nil
        						SyncIterator: rs.ss.Spans.Attrs.ValueBool : nil
        			optional: 
        		optional: 
        	optional: 
        		SyncIterator: TraceID : CallbackPredicate{}

New iterator chain:

        LeftJoinIterator: spanCollector(0)
        required: 
        	LeftJoinIterator: spanCollector(1)
        	required: 
        		SyncIterator: rs.ss.Spans.StatusCode : nil
        		SyncIterator: rs.Resource.ServiceName : StringEqualPredicate{foo}
        	optional: 
        optional: 
        	SyncIterator: TraceID : CallbackPredicate{}
        	SyncIterator: rs.ss.Spans.DurationNano : nil
        	SyncIterator: rs.ss.Spans.StartTimeUnixNano : nil
        	LeftJoinIterator: scopedAttributeCollector(span)
        	required: 
        		SyncIterator: rs.ss.Spans.Attrs.Key : StringInPredicate{bar}
        	optional: 
        		SyncIterator: rs.ss.Spans.Attrs.Value : nil
        		SyncIterator: rs.ss.Spans.Attrs.ValueInt : nil
        		SyncIterator: rs.ss.Spans.Attrs.ValueDouble : nil
        		SyncIterator: rs.ss.Spans.Attrs.ValueBool : nil

How to use it
The new fetch layer is added to vP5 and accessible for metrics queries. It only applies to queries which don't need the full trace, i.e. those identified by the NeedsFullTrace() classification we added previously. It's also behind a query hint with(new=true).

Examples:
{ } | rate() <-- old fetch
{ } | rate() with(new=true) <-- new fetch
{ } >> { } with(new=true) <-- old fetch, hint is ignored because this query requires the full trace
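A minimal sketch of how the two paths are selected, using stand-in types; the real check lives in MetricsEvaluator.Do (visible in the review diff further down):

package metricsketch

import "context"

// evaluator is a stand-in for traceql.MetricsEvaluator; only the two fields
// that drive path selection are shown.
type evaluator struct {
	needsFullTrace bool // from the query's NeedsFullTrace() classification
	newFetch       bool // from the with(new=true) query hint
}

func (e *evaluator) do(ctx context.Context) error {
	// The span-only fetch is used only when the query can be answered
	// without reassembling full traces AND the hint opts in.
	if !e.needsFullTrace && e.newFetch {
		return e.doSpansOnly(ctx)
	}
	// Structural queries, within-trace aggregates, or no hint: fall back to
	// the existing spanset-based fetch layer.
	return e.doSpansets(ctx)
}

// Placeholders for the two real code paths.
func (e *evaluator) doSpansOnly(_ context.Context) error { return nil }
func (e *evaluator) doSpansets(_ context.Context) error  { return nil }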

Key Performance Changes
Summarizing the key performance-related changes:

  1. No longer building spansets internally
  2. Join iterators no longer building slices internally
  3. No more span pooling
  4. No more iterator result pooling

Limitations and Considerations
First, it would be ideal to use this for search, but I haven't yet solved how to do that. Searches have queries like { } | by(resource.service.name) | count() > 10 | select(foo), which internally re-group / re-order spans, and we handle this via the rebatch and bridge iterators. Those are based on spansets, and I'm not sure how to integrate them yet.

Second, there is the complexity of an additional fetch layer that must be maintained: it has to be updated in parallel and is also a potential source of discrepancies. There are tests that verify all of the same queries and selectAll behavior on both fetch layers. I think that will catch most things. Based on the benchmarks, I think this is worth it, or else I wouldn't have submitted the PR.

Benchmarks

QueryRange

goos: darwin
goarch: arm64
pkg: github.com/grafana/tempo/tempodb/encoding/vparquet5
cpu: Apple M4 Pro

│ before.txt │ after.txt │
│ sec/op │ sec/op vs base │
BackendBlockQueryRange/{}_|_rate() 2350.1m ± 13% 816.2m ± 1% -65.27% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_rate()_with(sample=true) 504.4m ± 2% 364.6m ± 3% -27.71% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(span.http.status_code) 8.550 ± 4% 5.658 ± 2% -33.83% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(resource.service.name) 3.878 ± 13% 1.534 ± 1% -60.45% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(span.http.url) 10.427m ± 15% 7.305m ± 2% -29.94% (p=0.002 n=6)
BackendBlockQueryRange/{resource.service.name=`loki-ingester`}_|_rate() 1122.2m ± 3% 456.8m ± 3% -59.29% (p=0.002 n=6)
BackendBlockQueryRange/{span.http.host_!=_``_&&_span.http.flavor=`2`}_|_rate()_by_(span.http.flavor) 2.060 ± 3% 1.699 ± 0% -17.53% (p=0.002 n=6)
BackendBlockQueryRange/{status=error}_|_rate() 876.4m ± 3% 639.6m ± 0% -27.03% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_quantile_over_time(duration,_.99,_.9,_.5) 4.474 ± 13% 2.601 ± 2% -41.87% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_quantile_over_time(duration,_.99)_by_(span.http.status_code) 11.145 ± 8% 7.453 ± 1% -33.12% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_histogram_over_time(duration) 4.797 ± 4% 2.599 ± 1% -45.82% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_avg_over_time(duration)_by_(span.http.status_code) 10.368 ± 6% 6.612 ± 1% -36.23% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_max_over_time(duration)_by_(span.http.status_code) 10.273 ± 4% 6.734 ± 1% -34.45% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_min_over_time(duration)_by_(span.http.status_code) 10.332 ± 1% 6.739 ± 0% -34.78% (p=0.002 n=6)
BackendBlockQueryRange/{_name_!=_nil_}_|_compare({status=error}) 54.15m ± 58% 55.71m ± 22% ~ (p=0.240 n=6)
BackendBlockQueryRange/{}_>_{}_|_rate()_by_(name) 3.149 ± 6% 3.083 ± 3% ~ (p=0.485 n=6)
geomean 1.931 1.221 -36.77%

│ before.txt │ after.txt │
│ MB_io/op │ MB_io/op vs base │
BackendBlockQueryRange/{}_|_rate() 93.56 ± 0% 93.54 ± 0% -0.02% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_rate()_with(sample=true) 93.54 ± 0% 93.54 ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_rate()_by_(span.http.status_code) 412.1 ± 0% 412.1 ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_rate()_by_(resource.service.name) 94.86 ± 0% 94.86 ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_rate()_by_(span.http.url) 6.393 ± 0% 6.377 ± 0% -0.27% (p=0.002 n=6)
BackendBlockQueryRange/{resource.service.name=`loki-ingester`}_|_rate() 94.86 ± 0% 94.84 ± 0% -0.02% (p=0.002 n=6)
BackendBlockQueryRange/{span.http.host_!=_``_&&_span.http.flavor=`2`}_|_rate()_by_(span.http.flavor) 286.7 ± 0% 286.6 ± 0% -0.03% (p=0.002 n=6)
BackendBlockQueryRange/{status=error}_|_rate() 98.44 ± 0% 98.44 ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_quantile_over_time(duration,_.99,_.9,_.5) 179.1 ± 0% 179.1 ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_quantile_over_time(duration,_.99)_by_(span.http.status_code) 497.6 ± 0% 497.6 ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_histogram_over_time(duration) 179.1 ± 0% 179.1 ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_avg_over_time(duration)_by_(span.http.status_code) 497.6 ± 0% 497.6 ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_max_over_time(duration)_by_(span.http.status_code) 497.6 ± 0% 497.6 ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_min_over_time(duration)_by_(span.http.status_code) 497.6 ± 0% 497.6 ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{_name_!=_nil_}_|_compare({status=error}) 15.13 ± 1% 17.40 ± 0% +15.00% (p=0.002 n=6)
BackendBlockQueryRange/{}_>_{}_|_rate()_by_(name) 50.07 ± 0% 50.07 ± 0% ~ (p=1.000 n=6) ¹
geomean 132.2 133.4 +0.86%
¹ all samples are equal

│ before.txt │ after.txt │
│ spans/op │ spans/op vs base │
BackendBlockQueryRange/{}_|_rate() 26.07M ± 0% 26.07M ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_rate()_with(sample=true) 2.607M ± 0% 2.607M ± 0% -0.01% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(span.http.status_code) 26.07M ± 0% 26.07M ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_rate()_by_(resource.service.name) 26.07M ± 0% 26.07M ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_rate()_by_(span.http.url) 7.020k ± 26% 4.959k ± 3% -29.36% (p=0.002 n=6)
BackendBlockQueryRange/{resource.service.name=`loki-ingester`}_|_rate() 4.854M ± 0% 4.854M ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{span.http.host_!=_``_&&_span.http.flavor=`2`}_|_rate()_by_(span.http.flavor) 20.58k ± 0% 20.58k ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{status=error}_|_rate() 899.8k ± 0% 899.8k ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_quantile_over_time(duration,_.99,_.9,_.5) 26.07M ± 0% 26.07M ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_quantile_over_time(duration,_.99)_by_(span.http.status_code) 26.07M ± 0% 26.07M ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_histogram_over_time(duration) 26.07M ± 0% 26.07M ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_avg_over_time(duration)_by_(span.http.status_code) 26.07M ± 0% 26.07M ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_max_over_time(duration)_by_(span.http.status_code) 26.07M ± 0% 26.07M ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{}_|_min_over_time(duration)_by_(span.http.status_code) 26.07M ± 0% 26.07M ± 0% ~ (p=1.000 n=6) ¹
BackendBlockQueryRange/{_name_!=_nil_}_|_compare({status=error}) 6.438k ± 58% 6.432k ± 27% ~ (p=0.892 n=6)
BackendBlockQueryRange/{}_>_{}_|_rate()_by_(name) 4.117M ± 0% 4.117M ± 0% ~ (p=1.000 n=6) ¹
geomean 3.342M 3.270M -2.15%
¹ all samples are equal

│ before.txt │ after.txt │
│ spans/s │ spans/s vs base │
BackendBlockQueryRange/{}_|_rate() 11.10M ± 14% 31.94M ± 1% +187.86% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_rate()_with(sample=true) 5.168M ± 2% 7.149M ± 3% +38.33% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(span.http.status_code) 3.049M ± 4% 4.607M ± 2% +51.12% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(resource.service.name) 6.724M ± 15% 16.997M ± 1% +152.79% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(span.http.url) 676.1k ± 9% 679.6k ± 2% ~ (p=1.000 n=6)
BackendBlockQueryRange/{resource.service.name=`loki-ingester`}_|_rate() 4.326M ± 3% 10.625M ± 3% +145.62% (p=0.002 n=6)
BackendBlockQueryRange/{span.http.host_!=_``_&&_span.http.flavor=`2`}_|_rate()_by_(span.http.flavor) 9.990k ± 3% 12.114k ± 0% +21.26% (p=0.002 n=6)
BackendBlockQueryRange/{status=error}_|_rate() 1.027M ± 3% 1.407M ± 0% +37.03% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_quantile_over_time(duration,_.99,_.9,_.5) 5.833M ± 14% 10.023M ± 2% +71.84% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_quantile_over_time(duration,_.99)_by_(span.http.status_code) 2.339M ± 9% 3.497M ± 1% +49.52% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_histogram_over_time(duration) 5.435M ± 4% 10.031M ± 1% +84.57% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_avg_over_time(duration)_by_(span.http.status_code) 2.514M ± 6% 3.942M ± 1% +56.80% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_max_over_time(duration)_by_(span.http.status_code) 2.537M ± 4% 3.871M ± 1% +52.55% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_min_over_time(duration)_by_(span.http.status_code) 2.523M ± 1% 3.868M ± 0% +53.32% (p=0.002 n=6)
BackendBlockQueryRange/{_name_!=_nil_}_|_compare({status=error}) 119.0k ± 4% 116.4k ± 3% ~ (p=0.180 n=6)
BackendBlockQueryRange/{}_>_{}_|_rate()_by_(name) 1.307M ± 6% 1.336M ± 3% ~ (p=0.485 n=6)
geomean 1.732M 2.680M +54.77%

│ before.txt │ after.txt │
│ B/op │ B/op vs base │
BackendBlockQueryRange/{}_|_rate() 13.442Mi ± 1242% 5.967Mi ± 74% -55.61% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_rate()_with(sample=true) 5.934Mi ± 3% 5.787Mi ± 0% ~ (p=0.394 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(span.http.status_code) 1069.5Mi ± 17% 996.9Mi ± 2% -6.79% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(resource.service.name) 15.170Mi ± 132% 7.227Mi ± 40% -52.36% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(span.http.url) 7.676Mi ± 4% 6.955Mi ± 0% -9.39% (p=0.002 n=6)
BackendBlockQueryRange/{resource.service.name=`loki-ingester`}_|_rate() 653.3Mi ± 6% 681.1Mi ± 0% ~ (p=0.065 n=6)
BackendBlockQueryRange/{span.http.host_!=_``_&&_span.http.flavor=`2`}_|_rate()_by_(span.http.flavor) 631.8Mi ± 2% 671.6Mi ± 6% +6.31% (p=0.002 n=6)
BackendBlockQueryRange/{status=error}_|_rate() 472.3Mi ± 0% 521.5Mi ± 1% +10.42% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_quantile_over_time(duration,_.99,_.9,_.5) 14.135Mi ± 597% 6.130Mi ± 306% -56.64% (p=0.037 n=6)
BackendBlockQueryRange/{}_|_quantile_over_time(duration,_.99)_by_(span.http.status_code) 1047.4Mi ± 17% 996.0Mi ± 2% -4.91% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_histogram_over_time(duration) 14.135Mi ± 96% 6.129Mi ± 125% -56.64% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_avg_over_time(duration)_by_(span.http.status_code) 1055.4Mi ± 18% 1009.2Mi ± 1% -4.38% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_max_over_time(duration)_by_(span.http.status_code) 1041.1Mi ± 1% 1001.5Mi ± 2% -3.81% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_min_over_time(duration)_by_(span.http.status_code) 1044.5Mi ± 1% 992.2Mi ± 3% -5.01% (p=0.002 n=6)
BackendBlockQueryRange/{_name_!=_nil_}_|_compare({status=error}) 17.43Mi ± 72% 11.21Mi ± 63% -35.72% (p=0.026 n=6)
BackendBlockQueryRange/{}_>_{}_|_rate()_by_(name) 506.8Mi ± 24% 506.1Mi ± 27% ~ (p=0.937 n=6)
geomean 126.1Mi 99.19Mi -21.36%

│ before.txt │ after.txt │
│ allocs/op │ allocs/op vs base │
BackendBlockQueryRange/{}_|_rate() 71.71k ± 303% 71.64k ± 0% -0.09% (p=0.035 n=6)
BackendBlockQueryRange/{}_|_rate()_with(sample=true) 71.63k ± 0% 71.62k ± 0% ~ (p=0.366 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(span.http.status_code) 4276.5k ± 5% 235.7k ± 0% -94.49% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(resource.service.name) 78.84k ± 274% 78.81k ± 0% -0.04% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_rate()_by_(span.http.url) 61.41k ± 2% 59.83k ± 0% -2.58% (p=0.002 n=6)
BackendBlockQueryRange/{resource.service.name=`loki-ingester`}_|_rate() 335.4k ± 58% 103.5k ± 0% -69.15% (p=0.002 n=6)
BackendBlockQueryRange/{span.http.host_!=_``_&&_span.http.flavor=`2`}_|_rate()_by_(span.http.flavor) 705.5k ± 0% 167.4k ± 0% -76.27% (p=0.002 n=6)
BackendBlockQueryRange/{status=error}_|_rate() 130.7k ± 8% 106.9k ± 0% -18.23% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_quantile_over_time(duration,_.99,_.9,_.5) 84.61k ± 465% 84.57k ± 0% -0.05% (p=0.037 n=6)
BackendBlockQueryRange/{}_|_quantile_over_time(duration,_.99)_by_(span.http.status_code) 4314.9k ± 9% 254.2k ± 0% -94.11% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_histogram_over_time(duration) 84.61k ± 82% 84.57k ± 0% -0.05% (p=0.028 n=6)
BackendBlockQueryRange/{}_|_avg_over_time(duration)_by_(span.http.status_code) 4325.2k ± 3% 247.5k ± 0% -94.28% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_max_over_time(duration)_by_(span.http.status_code) 4308.9k ± 1% 247.9k ± 0% -94.25% (p=0.002 n=6)
BackendBlockQueryRange/{}_|_min_over_time(duration)_by_(span.http.status_code) 4300.2k ± 1% 248.0k ± 0% -94.23% (p=0.002 n=6)
BackendBlockQueryRange/{_name_!=_nil_}_|_compare({status=error}) 103.30k ± 20% 81.73k ± 5% -20.89% (p=0.002 n=6)
BackendBlockQueryRange/{}_>_{}_|_rate()_by_(name) 598.3k ± 45% 529.5k ± 51% -11.51% (p=0.026 n=6)
geomean 404.0k 135.3k -66.50%

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

…ble with new query path behind query hint. Fix issue with bad collection of string array attributes and unnecessary mapkey computations
…ebug name strings on iter and collector objects
…ator is exhausted, just remove it from the list. This is about 10% performance improvement when dealing with exemplars, which only read the first few values and then exit
@mdisibio mdisibio marked this pull request as ready for review February 6, 2026 18:43
@stoewer
Contributor

stoewer commented Feb 11, 2026

Code review

Found 2 issues:

  1. Off-by-one error in fork function causing data loss (Bug in SpansetFilter.Evaluate)

When forking at index i, the code copies input[:i-1] instead of input[:i], which loses element i-1. This causes spansets that passed the filter to be silently dropped.

Example: If processing spansets [A, B, C] and C fails at index 2, current code copies only [A] instead of [A, B], losing spanset B.

tempo/pkg/traceql/ast.go

Lines 439 to 446 in cd4b3ee

	if !forked {
		// First we are working the output, so copy all previously identical spansets into the output.
		if i > 0 {
			outputBuffer = append(outputBuffer, input[:i-1]...)
		}
	}
	forked = true
}
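A self-contained sketch of the suggested correction, using plain ints as stand-ins for spansets: every element before the fork index has already passed the filter, so the copy bound should be i, not i-1.

package forksketch

// forkCopy copies all elements that precede the fork point at index i.
func forkCopy(input []int, i int) []int {
	outputBuffer := make([]int, 0, len(input))
	if i > 0 {
		// input[:i] keeps elements 0..i-1; input[:i-1] would silently drop
		// element i-1, which is the data loss described above.
		outputBuffer = append(outputBuffer, input[:i]...)
	}
	return outputBuffer
}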

  2. Race condition from missing mutex locks in DoSpansOnly() (Bug due to commented-out locks)

The DoSpansOnly() method has commented-out mutex locks (lines 1383, 1412) while accessing shared state (e.metricsPipeline, e.spansTotal, e.exemplarCount). The existing Do() method properly locks these operations. The MetricsEvaluator explicitly states it "isn't thread-safe" but "parallelization is required", so external locking is needed. Without locks, concurrent access from multiple goroutines causes race conditions and data corruption.

// e.mtx.Lock()
if e.checkTime {
	st := s.StartTimeUnixNanos()
	if st <= e.start || st > e.end {
		// e.mtx.Unlock()
		continue
	}
}
if e.storageReq.SpanSampler != nil {
	e.storageReq.SpanSampler.Measured()
}
e.metricsPipeline.observe(s)
e.spansTotal++
if e.maxExemplars > 0 && e.exemplarCount < e.maxExemplars {
	traceID, ok := s.AttributeFor(IntrinsicTraceIDAttribute)
	if ok {
		if e.sampleExemplar(traceID.valBytes) {
			e.metricsPipeline.observeExemplar(s)
		}
	}
}
seriesCount = e.metricsPipeline.length()
// e.mtx.Unlock()
if maxSeries > 0 && seriesCount >= maxSeries {
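For reference, a minimal sketch of the locking pattern being asked for here; the field names are simplified stand-ins for the MetricsEvaluator state listed above, not the actual fix.

package evalsketch

import "sync"

// evaluatorState stands in for the shared MetricsEvaluator fields
// (metricsPipeline, spansTotal, exemplarCount) touched inside the loop.
type evaluatorState struct {
	mtx        sync.Mutex
	spansTotal int
	series     map[string]int
}

// observe takes the lock before touching shared state and releases it when
// done, mirroring what the existing Do() method already does.
func (e *evaluatorState) observe(seriesKey string) (seriesCount int) {
	e.mtx.Lock()
	defer e.mtx.Unlock()
	e.series[seriesKey]++
	e.spansTotal++
	return len(e.series) // checked against maxSeries by the caller
}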

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

Contributor

@stoewer stoewer left a comment


Great PR. I think adding a second fetch function that retrieves only single spans is a valuable addition to Tempo’s fetch layer. It definitely introduces a fair amount of additional code and complexity. On the other hand, if increasing the complexity of the fetch layer gets us the extra performance we need to run metrics queries over larger time ranges, I’d say it is still worth it.

That said, it is probably still worth thinking about ways to make the added complexity more manageable and to make sure we get the best benefit from it. A few thoughts:

  • Unit and e2e tests that run a broad set of metrics queries in both modes and compare the results, to catch cases where a change is applied to one path but not the other
  • Consider using the span-only fetch function in other areas as well, for example tag search
  • Refactor the spanset fetch path so that the code becomes more similar to the span-only path, reducing cognitive load. Mainly thinking about the createXXXIterator() functions
  • Use the new capabilities of LeftJoinIterator in the spanset fetch path, maybe by flattening the iterator tree using child iterators on different definition levels

A lot of this can of course happen in follow-up PRs.

Also: when merging the latest changes from main, it may be necessary to adjust the span-only fetch path to properly support OpIn, OpNotIn, etc.

// uses the known time range of the data for last-minute optimizations. Time range is unix nanos
func (e *MetricsEvaluator) Do(ctx context.Context, f SpansetFetcher, fetcherStart, fetcherEnd uint64, maxSeries int) error {
	if !e.needsFullTrace && e.newFetch {
Contributor


If I read this correctly, e.newFetch is set by the query hint new and is false by default. Once the default is changed to true, do you think we need a config / overrides parameter to control this for all queries?

Contributor Author


I think long-term no. Once we feel sufficiently confident in the new fetch layer, we should remove the hint and use it for all applicable queries. An argument could be made to leave the hint to allow reverting a query to the old path, but I think long term this is treated like any other improvement - tested/expected to be correct.

Comment thread pkg/traceql/storage.go Outdated
Comment thread tempodb/encoding/vparquet4/block_traceql_test.go Outdated
Comment thread tempodb/encoding/vparquet5/block_traceql_fetch.go Outdated
Comment thread pkg/parquetquery/iters.go Outdated

got := []IteratorResult{}

j, err := NewLeftJoinIterator(1, nil, nil, nil,
Contributor


The NewLeftJoinIterator now basically has two ways to pass in iterators: 1) via the parameters required and optional, 2) using the WithIterator option, which adds some complexity to the constructor. Maybe this could be a bit streamlined?

Contributor Author


Hmm, I was thinking the new signature was streamlined because it doesn't require churn in the existing code, and ensures that this new implementation has exactly the correct behavior. The new options pattern is required in order to set the new options like the callback parameter and the iter-specific definition level. Eventually we can adjust the existing code to it.

We could make a separate initialization method NewLeftJoinIteratorWithOptions? Or rename the old one to NewLeftJoinIteratorOld and make the new one the proper name? Need to see what the convention is.
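For illustration, the two construction styles under discussion could look roughly like this. The types and the constructor signature here are simplified stand-ins; the real parquetquery constructor takes more parameters.

package itersketch

// Iterator is a simplified stand-in for the parquetquery iterator interface.
type Iterator interface {
	Next() (any, error)
}

type LeftJoinIterator struct {
	definitionLevel    int
	required, optional []Iterator
}

type Option func(*LeftJoinIterator)

// WithIterator attaches one child iterator via the options pattern; the real
// option can also carry per-iterator settings such as its definition level.
func WithIterator(it Iterator, required bool) Option {
	return func(j *LeftJoinIterator) {
		if required {
			j.required = append(j.required, it)
		} else {
			j.optional = append(j.optional, it)
		}
	}
}

// NewLeftJoinIterator accepts children both as plain slice parameters (the
// existing call sites) and as options (the new call sites), which is the
// duplication the comment points out.
func NewLeftJoinIterator(defLevel int, required, optional []Iterator, opts ...Option) (*LeftJoinIterator, error) {
	j := &LeftJoinIterator{definitionLevel: defLevel, required: required, optional: optional}
	for _, o := range opts {
		o(j)
	}
	return j, nil
}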

Contributor


In the long run it's probably best to remove the parameters required and optional and move to the WithIterator option everywhere.

However, as I understand it now, the goal of the PR is to purposefully not change / refactor other parts of the code. So leaving the constructor as is is a good option for now.

Comment thread tempodb/encoding/vparquet5/block_traceql_fetch.go Outdated
@mdisibio
Contributor Author

On the other hand, if increasing the complexity of the fetch layer gets us the extra performance we need to run metrics queries over larger time ranges, I’d say it is still worth it.

Right, I think that is the core question, and agree that it is worth it. For this PR I'd like to distinguish between the high-level questions (is this worth it, testing methodology, what do we accomplish in PR 1 versus followup), and the low-level (updating with main, bugs/lint, etc).

  1. I think we are agreed on the first, it is worth it, and we will take steps to reduce/manage the complexity, i.e. by consolidating/refactoring other parts.
  2. +1 to more tests to check results between the 2 paths, I think this is an area we need to improve already. Will work on that in this PR.
  3. Is there anything else high-level we want in this PR? I think it is worth purposefully not refactoring the existing fetch layer in this PR yet.
  4. Yep I will take care of updating with latest main and all the small stuff.

@stoewer
Contributor

stoewer commented Feb 19, 2026

I completely agree. It is completely fine to leave the scope of this PR as is and do the refactoring, tests, etc. in follow-up PRs.
For reviewing this feature for the first time, it was actually really helpful that changes to the existing parts were minimal.

Contributor

@stoewer stoewer left a comment


Looks good. Just a couple of minor remarks

Comment thread pkg/parquetquery/iters.go Outdated
Comment thread tempodb/encoding/vparquet5/wal_block.go Outdated

Contributor

@stoewer stoewer left a comment


LGTM

@mdisibio mdisibio merged commit af64238 into grafana:main Mar 16, 2026
24 checks passed