enhancement(livestore): skipped wal complete op during shutdown #6839

Merged
mdisibio merged 8 commits into grafana:main from zhxiaogg:live-store-skip-complete-op-during-shutdown
Apr 10, 2026
@zhxiaogg
Contributor

@zhxiaogg zhxiaogg commented Mar 31, 2026

What this PR does:

  • For parquet v4/v5, stop draining the source iterators when the context is cancelled.
  • For the livestore:
    • Stop retrying completion ops when the context is cancelled during shutdown.
    • Relax the WAL backpressure limit to 4, as a best effort to parallelize WAL completion and message consumption.

As a result, when the livestore moves into its shutdown phase, it flushes WALs but does not complete them. The outstanding WALs will be picked up during the next livestore startup.

Which issue(s) this PR fixes:
Fixes N/A

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Copilot AI review requested due to automatic review settings March 31, 2026 20:42
Contributor

Copilot AI left a comment

Pull request overview

This PR improves shutdown/cancellation behavior in LiveStore and parquet block creation so work is abandoned promptly when the service is shutting down, avoiding unnecessary draining/retries.

Changes:

  • Add early ctx.Err() checks in parquet v4/v5 CreateBlock loops to stop iterating when canceled.
  • Update LiveStore completion processing to abandon WAL completion ops when the service context is canceled.
  • Add/extend tests for the new cancellation behavior in vparquet4/vparquet5 and LiveStore.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tempodb/encoding/vparquet5/create.go | Exits block creation loop early on canceled context. |
| tempodb/encoding/vparquet5/create_test.go | Adds a test ensuring CreateBlock returns context.Canceled when ctx is pre-canceled. |
| tempodb/encoding/vparquet4/create.go | Same early-exit cancellation handling as vparquet5. |
| tempodb/encoding/vparquet4/create_test.go | Adds a matching canceled-context test for vparquet4. |
| modules/livestore/live_store_background.go | Abandons WAL completion retries when shutdown cancels the service context. |
| modules/livestore/live_store_background_test.go | Adds a test intended to validate “abandon on shutdown” behavior for completion ops. |

Comment thread modules/livestore/live_store_background.go Outdated
Comment thread modules/livestore/live_store_background_test.go
Comment thread modules/livestore/live_store_background.go Outdated
Comment thread modules/livestore/live_store_background_test.go
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Comment on lines +126 to +131

    // If the context is cancelled (shutdown), abandon the completion. The WAL block remains on
    // disk and will be re-enqueued by reloadBlocks() on next startup.
    if ctx.Err() != nil {
        level.Info(s.logger).Log("msg", "abandoning WAL block completion on shutdown, will replay on restart", "tenant", op.tenantID, "block", op.blockID)
        return nil
    }

Copilot AI Mar 31, 2026

In the ctx-cancelled branch, the dequeued op’s key is never cleared from completeQueues (the ExclusiveQueues contract expects Clear or Requeue after dequeue). During shutdown this can leave activeKeys populated until process exit and can block future enqueues if this code path is ever hit outside full shutdown. Consider calling s.completeQueues.Clear(op) before returning, and (optionally) short-circuiting earlier by checking s.ctx.Err() before invoking inst.completeBlock to avoid doing work after shutdown has begun.
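
To make the concern above concrete, here is a toy model of a queue that tracks active keys. It is a sketch under assumptions: `keyedQueue` and its methods are hypothetical, it is not safe for concurrent use, and it is not the actual ExclusiveQueues implementation. It only illustrates why a dequeued key that is never cleared would reject later enqueues for the same key.

```go
package main

import "fmt"

// keyedQueue is a hypothetical, single-goroutine model of a queue that
// allows at most one in-flight entry per key.
type keyedQueue struct {
	active map[string]struct{}
	items  []string
}

func newKeyedQueue() *keyedQueue {
	return &keyedQueue{active: map[string]struct{}{}}
}

// Enqueue adds key unless it is still marked active.
func (q *keyedQueue) Enqueue(key string) bool {
	if _, ok := q.active[key]; ok {
		return false
	}
	q.active[key] = struct{}{}
	q.items = append(q.items, key)
	return true
}

// Dequeue removes the oldest key from the queue. The key stays active
// until Clear is called for it.
func (q *keyedQueue) Dequeue() (string, bool) {
	if len(q.items) == 0 {
		return "", false
	}
	key := q.items[0]
	q.items = q.items[1:]
	return key, true
}

// Clear marks key as no longer active so it can be enqueued again.
func (q *keyedQueue) Clear(key string) {
	delete(q.active, key)
}

func main() {
	q := newKeyedQueue()
	q.Enqueue("tenant-a/block-1")

	key, _ := q.Dequeue()
	fmt.Println(q.Enqueue(key)) // false: dequeued but never cleared

	q.Clear(key)
	fmt.Println(q.Enqueue(key)) // true: cleared, so the key is accepted again
}
```

Whether this matters in practice depends on whether the cancelled branch can be reached outside a full process shutdown, which is the point debated in the replies.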

Contributor Author

I don't feel that's necessary: when ctx.Err() != nil, the livestore is already on the shutdown path.

Comment thread modules/livestore/live_store_background_test.go
    // If the context is cancelled (shutdown), abandon the completion. The WAL block remains on
    // disk and will be re-enqueued by reloadBlocks() on next startup.
    if ctx.Err() != nil {
        level.Info(s.logger).Log("msg", "abandoning WAL block completion on shutdown, will replay on restart", "tenant", op.tenantID, "block", op.blockID)
Contributor

Should we also call the span.RecordError I see above?

Contributor Author

The livestore proactively stops processing WALs during shutdown, so this probably shouldn't be recorded as an error?

Comment thread tempodb/encoding/vparquet5/create.go Outdated
@zhxiaogg zhxiaogg requested a review from zalegrala March 31, 2026 21:54
@oleg-kozlyuk-grafana
Contributor

oleg-kozlyuk-grafana commented Apr 1, 2026

We discussed abandoning WALs before. The problem with this approach is that completion would then happen on startup, as pointed out in the comments on your PR. This means that the overall downtime window does not change: the completion queue is shared between Kafka processing and abandoned-WAL processing, and abandoned WAL sections cause backpressure. So unfortunately this change would not solve the core issue in isolation.

To reformulate: this change might make sense, but because of the way WAL replay on startup works, it might not produce the effect you expect. You may want to look at the startup process as well.

@mattdurham
Contributor

This should make rollouts cleaner/faster though. We are generally also under more pressure on shutdown with the grace period.

@zhxiaogg
Contributor Author

zhxiaogg commented Apr 2, 2026

> To reformulate: this change might make sense, but because of the way WAL replay on startup works, it might not produce the effect you expect. You may want to look at the startup process as well.

I think there are trade-offs in where we put the WAL completion: the shutdown process vs. the startup process. For example, if it remains in the shutdown process, there is no Warpstream consumption during WAL completion. If it is put in the startup process, it's possible to parallelize the two. To do that, we can change the WAL backpressure limit from 1 to N.
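
One way to picture a backpressure limit of N is a counting limiter built on a buffered channel. This is only a sketch: `walLimiter`, `TryAcquire`, and `Release` are hypothetical names, and how the livestore actually enforces its limit is not shown in this conversation. With N = 1 the consumer would stall as soon as a single WAL is outstanding. With a larger N, consumption can continue while up to N completions are in flight.

```go
package main

import "fmt"

// walLimiter is a hypothetical limiter that caps the number of WALs
// waiting for completion. The buffered channel acts as a counting
// semaphore with capacity N.
type walLimiter struct {
	slots chan struct{}
}

func newWALLimiter(n int) *walLimiter {
	return &walLimiter{slots: make(chan struct{}, n)}
}

// TryAcquire reserves a slot for one outstanding WAL. It returns false
// when the limit is reached, which is the signal to apply backpressure
// to message consumption.
func (l *walLimiter) TryAcquire() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// Release frees a slot once a WAL completion finishes. It is a no-op if
// no slot is currently held.
func (l *walLimiter) Release() {
	select {
	case <-l.slots:
	default:
	}
}

func main() {
	limiter := newWALLimiter(4) // the limit of 4 comes from the PR description

	for i := 1; i <= 5; i++ {
		fmt.Printf("WAL %d admitted: %v\n", i, limiter.TryAcquire())
	}

	limiter.Release() // one completion finished
	fmt.Println("after release:", limiter.TryAcquire())
}
```

In this run the first four WALs are admitted, the fifth is refused until a slot is released, and the final attempt succeeds.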

Copilot AI review requested due to automatic review settings April 9, 2026 19:08
@zhxiaogg zhxiaogg force-pushed the live-store-skip-complete-op-during-shutdown branch from f0e1cd6 to 9463acd Compare April 9, 2026 19:10
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Comment thread modules/livestore/live_store_background.go
Comment thread modules/livestore/live_store_background_test.go
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 9, 2026 19:57
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Comment thread modules/livestore/instance_test.go
Comment thread CHANGELOG.md
Comment thread tempodb/encoding/vparquet4/create.go Outdated
Copilot AI review requested due to automatic review settings April 10, 2026 05:18
@zhxiaogg zhxiaogg force-pushed the live-store-skip-complete-op-during-shutdown branch from 8103842 to e3b8f2a Compare April 10, 2026 05:19
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

@mdisibio mdisibio merged commit dc7b913 into grafana:main Apr 10, 2026
27 checks passed
@zhxiaogg zhxiaogg deleted the live-store-skip-complete-op-during-shutdown branch April 10, 2026 13:22
6 participants