enhancement(livestore): skip WAL complete op during shutdown #6839
Conversation
Pull request overview
This PR improves shutdown/cancellation behavior in LiveStore and parquet block creation so work is abandoned promptly when the service is shutting down, avoiding unnecessary draining/retries.
Changes:
- Add early `ctx.Err()` checks in the parquet v4/v5 `CreateBlock` loops to stop iterating when canceled (see the sketch after this list).
- Update LiveStore completion processing to abandon WAL completion ops when the service context is canceled.
- Add/extend tests for the new cancellation behavior in vparquet4/vparquet5 and LiveStore.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `tempodb/encoding/vparquet5/create.go` | Exits the block creation loop early on a canceled context. |
| `tempodb/encoding/vparquet5/create_test.go` | Adds a test ensuring `CreateBlock` returns `context.Canceled` when the context is pre-canceled (see the sketch after this table). |
| `tempodb/encoding/vparquet4/create.go` | Same early-exit cancellation handling as vparquet5. |
| `tempodb/encoding/vparquet4/create_test.go` | Adds a matching canceled-context test for vparquet4. |
| `modules/livestore/live_store_background.go` | Abandons WAL completion retries when shutdown cancels the service context. |
| `modules/livestore/live_store_background_test.go` | Adds a test intended to validate "abandon on shutdown" behavior for completion ops. |
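A minimal sketch of the pre-canceled-context test shape, assuming a stand-in `createBlock` that honors cancellation; the real tests call the package's actual `CreateBlock` with its full arguments and fixtures:

```go
package sketch

import (
	"context"
	"errors"
	"testing"
)

// createBlock stands in for the real vparquet CreateBlock; for this sketch it
// only needs to honor context cancellation.
func createBlock(ctx context.Context) error {
	return ctx.Err()
}

// TestCreateBlockCanceledContext: a pre-canceled context should make block
// creation fail immediately with context.Canceled.
func TestCreateBlockCanceledContext(t *testing.T) {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // cancel before the call

	if err := createBlock(ctx); !errors.Is(err, context.Canceled) {
		t.Fatalf("expected context.Canceled, got %v", err)
	}
}
```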
```go
// If the context is cancelled (shutdown), abandon the completion. The WAL block remains on
// disk and will be re-enqueued by reloadBlocks() on next startup.
if ctx.Err() != nil {
	level.Info(s.logger).Log("msg", "abandoning WAL block completion on shutdown, will replay on restart", "tenant", op.tenantID, "block", op.blockID)
	return nil
}
```
In the ctx-cancelled branch, the dequeued op's key is never cleared from `completeQueues` (the `ExclusiveQueues` contract expects `Clear` or `Requeue` after dequeue). During shutdown this can leave `activeKeys` populated until process exit, and it can block future enqueues if this code path is ever hit outside a full shutdown. Consider calling `s.completeQueues.Clear(op)` before returning, and (optionally) short-circuiting earlier by checking `s.ctx.Err()` before invoking `inst.completeBlock` to avoid doing work after shutdown has begun; a sketch of this follows.
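A minimal sketch of that suggestion, using the identifiers mentioned in this comment (`completeQueues`, `Clear`, `Requeue`, `inst.completeBlock`); all types below are hypothetical stand-ins, not the actual LiveStore code:

```go
package sketch

import "context"

// Hypothetical stand-ins for the LiveStore types referenced in this thread;
// only the control flow matters here.
type completeOp struct{ tenantID, blockID string }

type exclusiveQueues struct{}

func (q *exclusiveQueues) Clear(op *completeOp)   {} // releases the op's key
func (q *exclusiveQueues) Requeue(op *completeOp) {}

type instance struct{}

func (i *instance) completeBlock(ctx context.Context, blockID string) error { return ctx.Err() }

type liveStore struct {
	ctx            context.Context
	completeQueues *exclusiveQueues
}

// processCompleteOp short-circuits before completeBlock once shutdown has
// begun, and Clears the dequeued op so its key is released from the exclusive
// queue even on the abandonment path.
func (s *liveStore) processCompleteOp(inst *instance, op *completeOp) error {
	if err := s.ctx.Err(); err != nil {
		s.completeQueues.Clear(op) // release the key per the ExclusiveQueues contract
		return nil
	}

	if err := inst.completeBlock(s.ctx, op.blockID); err != nil {
		s.completeQueues.Requeue(op)
		return err
	}

	s.completeQueues.Clear(op)
	return nil
}
```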
I don't feel that's necessary; when `ctx.Err() != nil`, the livestore is already on the shutdown path.
```go
// If the context is cancelled (shutdown), abandon the completion. The WAL block remains on
// disk and will be re-enqueued by reloadBlocks() on next startup.
if ctx.Err() != nil {
	level.Info(s.logger).Log("msg", "abandoning WAL block completion on shutdown, will replay on restart", "tenant", op.tenantID, "block", op.blockID)
```
Should we also call the `span.RecordError` I see above?
LiveStore stops processing WALs proactively during shutdown, so we probably shouldn't record it as an error?
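For reference, one way to annotate the span without flagging an error, sketched against the OpenTelemetry Go API; whether the LiveStore code obtains the span via `trace.SpanFromContext` is an assumption:

```go
package sketch

import (
	"context"

	"go.opentelemetry.io/otel/trace"
)

// annotateAbandonment notes the expected shutdown as a span event rather than
// recording it as an error, since abandoning the completion is intentional.
func annotateAbandonment(ctx context.Context) {
	span := trace.SpanFromContext(ctx)
	if ctx.Err() != nil {
		// Intentional abandonment during shutdown: an event, not span.RecordError.
		span.AddEvent("abandoning WAL block completion on shutdown")
	}
}
```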
We discussed abandoning WALs before. The problem with this approach is that completion then happens on startup, as pointed out in the comments in your PR. This means the overall downtime window does not change: the completion queue is shared between Kafka processing and abandoned-WAL processing, and the abandoned-WAL sections cause backpressure. So unfortunately this change would not solve the core issue in isolation. To reformulate: this change might make sense, but due to the way WAL replay on startup works, it might not produce the effect you expect. You may want to look at the startup process as well.
This should make rollouts cleaner/faster though. We are generally also under more pressure on shutdown with the grace period.
I think there are trade-offs in where we put the WAL completion: the shutdown vs. the startup process. For example, if it remains in the shutdown process, there is no WarpStream consuming during the WAL completion. If it is put in the startup process, it's possible to parallelize the completions; to do that, we can change the WAL backpressure limit from 1 to N (see the sketch below).
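A rough sketch of the "limit 1 to N" idea: bound startup WAL completion with a semaphore so replays run in parallel. All names here are hypothetical, not the actual LiveStore startup code:

```go
package sketch

import (
	"context"
	"sync"
)

// completeWALsOnStartup replays outstanding WAL completions with at most
// maxInflight running concurrently, so startup completion can overlap with
// Kafka/WarpStream consumption instead of being serialized (limit 1).
func completeWALsOnStartup(ctx context.Context, blockIDs []string, maxInflight int,
	complete func(ctx context.Context, blockID string) error) {

	sem := make(chan struct{}, maxInflight) // bounded concurrency
	var wg sync.WaitGroup

	for _, blockID := range blockIDs {
		if ctx.Err() != nil {
			break // shutting down again: abandon the remaining replays
		}
		sem <- struct{}{}
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			defer func() { <-sem }()
			_ = complete(ctx, id) // real code would log/requeue failures
		}(blockID)
	}
	wg.Wait()
}
```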
What this PR does:
As a result, when the livestore moves to the shutdown phase, it will flush WALs but not complete them. The outstanding WALs will be picked up during the next livestore startup.
Which issue(s) this PR fixes:
Fixes N/A
Checklist
`CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`