
Fix slice backing-array race in live-store FindTraceByID#6968

Open
MukundaKatta wants to merge 4 commits into grafana:main from MukundaKatta:fix/livestore-findtracebyid-slice-race



@MukundaKatta MukundaKatta commented Apr 15, 2026

Summary

  • FindTraceByID in the live store handed the combiner the live trace's Batches slice header directly. Because that slice could still have spare capacity (cap > len), a later append by the combiner (or by the push path in writeHeadBlock) wrote into the same backing array still referenced by liveTrace.Batches. A concurrent proto.Marshal could then see a slot being rewritten between Size() and MarshalToSizedBuffer() and panic with slice bounds out of range [-N:].
  • Fix by capping the slice to its length at both call sites (instance_search.go and instance.go) so every consumer gets its own backing array on append.
  • Add a regression test that pushes a trace, keeps it in liveTraces (no cut), calls FindByTraceID, and asserts the returned ResourceSpans slice has cap == len — the invariant the fix guarantees.
  • Add a [BUGFIX] CHANGELOG entry.

Fixes #6958. Re-submits the fix proposed in the now-closed #6964 (closed for CLA reasons) with the test and changelog the reviewer requested.

Test plan

  • TestInstanceFindByTraceIDLiveTraceSliceIsolation asserts cap(resp.Trace.ResourceSpans) == len(resp.Trace.ResourceSpans) when the trace is served from liveTraces.
  • CI (make test, lint, CLA)
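The invariant that TestInstanceFindByTraceIDLiveTraceSliceIsolation checks can be shown in isolation. This is a minimal sketch, not the test from this PR: the types `resourceSpans`, `trace`, `liveTrace` and the helper `traceFromLive` are hypothetical stand-ins for the Tempo types, and it runs as a plain program rather than under `go test`.

```go
package main

import "fmt"

// Hypothetical stand-ins for tempopb.ResourceSpans, tempopb.Trace and the
// live store's live trace; only the slice fields matter here.
type resourceSpans struct{ name string }

type trace struct{ ResourceSpans []*resourceSpans }

type liveTrace struct{ Batches []*resourceSpans }

// traceFromLive (hypothetical) mirrors the fixed assignment: cap the slice.
func traceFromLive(lt *liveTrace) *trace {
	return &trace{ResourceSpans: lt.Batches[:len(lt.Batches):len(lt.Batches)]}
}

func main() {
	// Give the live slice spare capacity so the check can tell a fixed
	// assignment from an unfixed one (an unfixed one would keep cap 8).
	lt := &liveTrace{Batches: make([]*resourceSpans, 0, 8)}
	lt.Batches = append(lt.Batches, &resourceSpans{name: "batch-0"})

	tr := traceFromLive(lt)
	fmt.Println("spare capacity in live trace:", cap(lt.Batches) > len(lt.Batches))
	fmt.Println("cap == len after lookup:", cap(tr.ResourceSpans) == len(tr.ResourceSpans))

	// Appending to the returned slice must not write into the live array.
	tr.ResourceSpans = append(tr.ResourceSpans, &resourceSpans{name: "extra"})
	fmt.Println("live trace unchanged:", len(lt.Batches) == 1 && lt.Batches[:2][1] == nil)
}
```

All three lines print `true`. A `cap == len` assertion only distinguishes fixed from unfixed code when the live slice has spare capacity to begin with, which is why the sketch preallocates capacity 8.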


cla-assistant Bot commented Apr 15, 2026

CLA assistant check
All committers have signed the CLA.

Review thread on modules/livestore/instance.go (outdated)

    tr := &tempopb.Trace{
        ResourceSpans: liveTrace.Batches,
        // Cap the slice to its length so appends below (and in any downstream
Contributor

This comment feels a bit verbose; can we tone it down?

Contributor

Applies to other comments as well.

Review thread on CHANGELOG.md (outdated)
    @@ -1,5 +1,6 @@
    ## main / unreleased

    * [BUGFIX] Fix a panic in live-store `FindTraceByID` caused by a slice backing-array race between the trace combiner's append and concurrent proto.Marshal. The live trace's `Batches` slice is now handed to the combiner with its capacity capped to its length so downstream appends cannot mutate the backing array still referenced by the live trace. [#6958](https://github.com/grafana/tempo/issues/6958)
Member

The changelog entry is not in order.

@MukundaKatta (Author)
recheck

FindTraceByID previously assigned the live trace's Batches slice header
directly to the temporary tempopb.Trace handed to the combiner:

    tempTrace.ResourceSpans = liveTrace.Batches

Because the slice could still have spare capacity (cap > len), a later
append by the combiner (or by the push path that holds writeHeadBlock's
tr.ResourceSpans) would write into the same backing array still
referenced by liveTrace.Batches.
Concurrent proto.Marshal calls could then observe a slot being rewritten
between Size() and MarshalToSizedBuffer() and panic with a negative-index
`slice bounds out of range` error. The same pattern existed in
writeHeadBlock.

Cap the capacity to the length at both call sites so every consumer gets
its own backing array on append:

    tempTrace.ResourceSpans = liveTrace.Batches[:len(liveTrace.Batches):len(liveTrace.Batches)]

Add a regression test that pushes a trace, keeps it in liveTraces
(no cut), calls FindByTraceID, and asserts the returned ResourceSpans
slice has cap == len — the exact invariant the fix guarantees.

Fixes grafana#6958

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
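The Size()/MarshalToSizedBuffer() failure mode described in the commit message can be replayed deterministically. This is a simplified sketch under stated assumptions: `msg` is a hypothetical one-field message, and `marshalToSizedBuffer` only imitates the back-to-front buffer filling of gogo-style generated code; it is not the tempopb implementation. The `mutate` callback stands in for a concurrent writer and runs sequentially between the two calls, so no goroutines or race detector are involved.

```go
package main

import "fmt"

// msg is a hypothetical one-field message standing in for a generated type.
type msg struct{ payload []byte }

// Size reports the encoded length at the moment it is called.
func (m *msg) Size() int { return len(m.payload) }

// marshalToSizedBuffer imitates gogo-style generated code: it fills dAtA
// from the end toward the start, trusting that len(dAtA) came from Size().
func (m *msg) marshalToSizedBuffer(dAtA []byte) (int, error) {
	i := len(dAtA)
	i -= len(m.payload)
	copy(dAtA[i:], m.payload) // panics when i < 0, i.e. the message grew
	return len(dAtA) - i, nil
}

// marshal sizes the buffer, lets mutate run (standing in for a concurrent
// writer), then fills the buffer; a runtime panic is turned into an error.
func marshal(m *msg, mutate func()) (out []byte, err error) {
	defer func() {
		if r := recover(); r != nil {
			out, err = nil, fmt.Errorf("recovered: %v", r)
		}
	}()
	buf := make([]byte, m.Size())
	mutate()
	n, err := m.marshalToSizedBuffer(buf)
	if err != nil {
		return nil, err
	}
	return buf[len(buf)-n:], nil
}

func main() {
	stable := &msg{payload: []byte("abc")}
	out, err := marshal(stable, func() {})
	fmt.Printf("unchanged: %q, err=%v\n", out, err)

	racy := &msg{payload: []byte("abc")}
	_, err = marshal(racy, func() { racy.payload = []byte("abcdef") })
	fmt.Println("grown between calls:", err)
}
```

The first call should succeed; the second should report a recovered `slice bounds out of range` panic with a negative low bound, the same shape as the `[-N:]` error in the issue.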
@MukundaKatta MukundaKatta force-pushed the fix/livestore-findtracebyid-slice-race branch from 0f94966 to f2589bc on April 18, 2026 09:13
mapno (Contributor) commented Apr 21, 2026

Hello, are you able to sign the CLA?

@MukundaKatta (Author)
Pushed 198a1ac addressing the review:

Also, on the CLA: I'll sign it or chase the status so this can proceed. Thanks for the review!

Align with the CHANGELOG convention used by the other entries in this
section (PR link instead of issue link + trailing @author).
@MukundaKatta (Author)
@mattdurham @electron0zero — comment tone-down + CHANGELOG ordering were already addressed in 198a1ac (4-line comment in both instance.go and instance_search.go collapsed to 1 line each, test docstring tightened, BUGFIX entry moved up next to the other BUGFIX). Pushed one more changelog polish in acd2aa7 to match repo convention: PR link (/pull/6968) instead of issue link, and (@MukundaKatta) attribution. Ready for re-review.



Development

Successfully merging this pull request may close these issues.

Panic in live-store FindTraceByID: slice backing-array race

4 participants