Fix slice backing-array race in live-store FindTraceByID#6968
Fix slice backing-array race in live-store FindTraceByID#6968MukundaKatta wants to merge 4 commits intografana:mainfrom
Conversation
|
|
||
| tr := &tempopb.Trace{ | ||
| ResourceSpans: liveTrace.Batches, | ||
| // Cap the slice to its length so appends below (and in any downstream |
There was a problem hiding this comment.
Feels a bit verbose comment here, can we tone it down a bit.
There was a problem hiding this comment.
Applies to other comments as well.
| @@ -1,5 +1,6 @@ | |||
| ## main / unreleased | |||
|
|
|||
| * [BUGFIX] Fix a panic in live-store `FindTraceByID` caused by a slice backing-array race between the trace combiner's append and concurrent proto.Marshal. The live trace's `Batches` slice is now handed to the combiner with its capacity capped to its length so downstream appends cannot mutate the backing array still referenced by the live trace. [#6958](https://github.com/grafana/tempo/issues/6958) | |||
There was a problem hiding this comment.
changelog entry not in order.
|
recheck |
FindTraceByID previously assigned the live trace's Batches slice header
directly to the temporary tempopb.Trace handed to the combiner:
tempTrace.ResourceSpans = liveTrace.Batches
Because the slice kept its original capacity, a later append by the
combiner (or by the push path that holds writeHeadBlock's tr.ResourceSpans)
would reuse the same backing array still referenced by liveTrace.Batches.
Concurrent proto.Marshal calls could then observe a slot being rewritten
between Size() and MarshalToSizedBuffer() and panic with a negative-index
`slice bounds out of range` error. The same pattern existed in
writeHeadBlock.
Cap the capacity to the length at both call sites so every consumer gets
its own backing array on append:
tempTrace.ResourceSpans = liveTrace.Batches[:len:len]
Add a regression test that pushes a trace, keeps it in liveTraces
(no cut), calls FindByTraceID, and asserts the returned ResourceSpans
slice has cap == len — the exact invariant the fix guarantees.
Fixes grafana#6958
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
0f94966 to
f2589bc
Compare
|
Hello, are you able to sign the CLA? |
|
Pushed 198a1ac addressing the review:
Also: on the CLA — I'll sign / chase the status so this can proceed. Thanks for the review! |
Align with the CHANGELOG convention used by the other entries in this section (PR link instead of issue link + trailing @author).
|
@mattdurham @electron0zero — comment tone-down + CHANGELOG ordering were already addressed in 198a1ac (4-line comment in both |
Summary
FindTraceByIDin the live store handed the combiner the live trace'sBatchesslice header directly. The slice kept its original capacity, so a later append by the combiner (or by the push path inwriteHeadBlock) reused the same backing array still referenced byliveTrace.Batches. Concurrentproto.Marshalcould then see a slot being rewritten betweenSize()andMarshalToSizedBuffer()and panic withslice bounds out of range [-N:].instance_search.goandinstance.go) so every consumer gets its own backing array on append.FindByTraceID, and asserts the returnedResourceSpansslice hascap == len— the invariant the fix guarantees.[BUGFIX]CHANGELOG entry.Fixes #6958. Re-submits the fix proposed in the now-closed #6964 (closed for CLA reasons) with the test and changelog the reviewer requested.
Test plan
TestInstanceFindByTraceIDLiveTraceSliceIsolationassertscap(resp.Trace.ResourceSpans) == len(resp.Trace.ResourceSpans)when the trace is served fromliveTraces.make test, lint, CLA)