enhancement(livestore): skip WAL complete op during shutdown #6839
Conversation
Pull request overview
This PR improves shutdown/cancellation behavior in LiveStore and parquet block creation so work is abandoned promptly when the service is shutting down, avoiding unnecessary draining/retries.
Changes:
- Add early `ctx.Err()` checks in the parquet v4/v5 `CreateBlock` loops to stop iterating when canceled (see the sketch after this list).
- Update LiveStore completion processing to abandon WAL completion ops when the service context is canceled.
- Add/extend tests for the new cancellation behavior in vparquet4/vparquet5 and LiveStore.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `tempodb/encoding/vparquet5/create.go` | Exits the block creation loop early on a canceled context. |
| `tempodb/encoding/vparquet5/create_test.go` | Adds a test ensuring `CreateBlock` returns `context.Canceled` when the context is pre-canceled (see the sketch after this table). |
| `tempodb/encoding/vparquet4/create.go` | Same early-exit cancellation handling as vparquet5. |
| `tempodb/encoding/vparquet4/create_test.go` | Adds a matching canceled-context test for vparquet4. |
| `modules/livestore/live_store_background.go` | Abandons WAL completion retries when shutdown cancels the service context. |
| `modules/livestore/live_store_background_test.go` | Adds a test intended to validate "abandon on shutdown" behavior for completion ops. |
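A minimal sketch of the pre-canceled-context test shape, assuming a stand-in `createBlock` that honors cancellation; the real tests call the package's actual `CreateBlock` with its full arguments and fixtures:

```go
package sketch

import (
	"context"
	"errors"
	"testing"
)

// createBlock stands in for the real vparquet CreateBlock; for this sketch it
// only needs to honor context cancellation.
func createBlock(ctx context.Context) error {
	return ctx.Err()
}

// TestCreateBlockCanceledContext: a pre-canceled context should make block
// creation fail immediately with context.Canceled.
func TestCreateBlockCanceledContext(t *testing.T) {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // cancel before the call

	if err := createBlock(ctx); !errors.Is(err, context.Canceled) {
		t.Fatalf("expected context.Canceled, got %v", err)
	}
}
```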
```go
// If the context is cancelled (shutdown), abandon the completion. The WAL block remains on
// disk and will be re-enqueued by reloadBlocks() on next startup.
if ctx.Err() != nil {
	level.Info(s.logger).Log("msg", "abandoning WAL block completion on shutdown, will replay on restart", "tenant", op.tenantID, "block", op.blockID)
	return nil
}
```
In the ctx-cancelled branch, the dequeued op's key is never cleared from `completeQueues` (the `ExclusiveQueues` contract expects `Clear` or `Requeue` after dequeue). During shutdown this can leave `activeKeys` populated until process exit, and it can block future enqueues if this code path is ever hit outside a full shutdown. Consider calling `s.completeQueues.Clear(op)` before returning, and (optionally) short-circuiting earlier by checking `s.ctx.Err()` before invoking `inst.completeBlock` to avoid doing work after shutdown has begun; a sketch of this follows.
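A minimal sketch of that suggestion, using the identifiers mentioned in this comment (`completeQueues`, `Clear`, `Requeue`, `inst.completeBlock`); all types below are hypothetical stand-ins, not the actual LiveStore code:

```go
package sketch

import "context"

// Hypothetical stand-ins for the LiveStore types referenced in this thread;
// only the control flow matters here.
type completeOp struct{ tenantID, blockID string }

type exclusiveQueues struct{}

func (q *exclusiveQueues) Clear(op *completeOp)   {} // releases the op's key
func (q *exclusiveQueues) Requeue(op *completeOp) {}

type instance struct{}

func (i *instance) completeBlock(ctx context.Context, blockID string) error { return ctx.Err() }

type liveStore struct {
	ctx            context.Context
	completeQueues *exclusiveQueues
}

// processCompleteOp short-circuits before completeBlock once shutdown has
// begun, and Clears the dequeued op so its key is released from the exclusive
// queue even on the abandonment path.
func (s *liveStore) processCompleteOp(inst *instance, op *completeOp) error {
	if err := s.ctx.Err(); err != nil {
		s.completeQueues.Clear(op) // release the key per the ExclusiveQueues contract
		return nil
	}

	if err := inst.completeBlock(s.ctx, op.blockID); err != nil {
		s.completeQueues.Requeue(op)
		return err
	}

	s.completeQueues.Clear(op)
	return nil
}
```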
I don't feel that's necessary; when `ctx.Err() != nil`, the livestore is already on the shutdown path.
```go
// If the context is cancelled (shutdown), abandon the completion. The WAL block remains on
// disk and will be re-enqueued by reloadBlocks() on next startup.
if ctx.Err() != nil {
	level.Info(s.logger).Log("msg", "abandoning WAL block completion on shutdown, will replay on restart", "tenant", op.tenantID, "block", op.blockID)
```
Should we also call the `span.RecordError` I see above?
LiveStore stops processing WALs proactively during shutdown, so we probably shouldn't record it as an error?
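For reference, one way to annotate the span without flagging an error, sketched against the OpenTelemetry Go API; whether the LiveStore code obtains the span via `trace.SpanFromContext` is an assumption:

```go
package sketch

import (
	"context"

	"go.opentelemetry.io/otel/trace"
)

// annotateAbandonment notes the expected shutdown as a span event rather than
// recording it as an error, since abandoning the completion is intentional.
func annotateAbandonment(ctx context.Context) {
	span := trace.SpanFromContext(ctx)
	if ctx.Err() != nil {
		// Intentional abandonment during shutdown: an event, not span.RecordError.
		span.AddEvent("abandoning WAL block completion on shutdown")
	}
}
```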
We discussed abandoning WALs before. The problem with this approach is that completion then happens on startup, as pointed out in the comments in your PR. This means the overall downtime window does not change: the completion queue is shared between Kafka processing and abandoned-WAL processing, and the abandoned-WAL sections cause backpressure. So unfortunately this change would not solve the core issue in isolation. To reformulate: this change might make sense, but due to the way WAL replay on startup works, it might not produce the effect you expect. You may want to look at the startup process as well.
This should make rollouts cleaner/faster though. We are generally also under more pressure on shutdown with the grace period.
I think there are trade-offs in where we put the WAL completion: the shutdown vs. the startup process. For example, if it remains in the shutdown process, there is no WarpStream consuming during the WAL completion. If it is put in the startup process, it's possible to parallelize the completions; to do that, we can change the WAL backpressure limit from 1 to N (see the sketch below).
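A rough sketch of the "limit 1 to N" idea: bound startup WAL completion with a semaphore so replays run in parallel. All names here are hypothetical, not the actual LiveStore startup code:

```go
package sketch

import (
	"context"
	"sync"
)

// completeWALsOnStartup replays outstanding WAL completions with at most
// maxInflight running concurrently, so startup completion can overlap with
// Kafka/WarpStream consumption instead of being serialized (limit 1).
func completeWALsOnStartup(ctx context.Context, blockIDs []string, maxInflight int,
	complete func(ctx context.Context, blockID string) error) {

	sem := make(chan struct{}, maxInflight) // bounded concurrency
	var wg sync.WaitGroup

	for _, blockID := range blockIDs {
		if ctx.Err() != nil {
			break // shutting down again: abandon the remaining replays
		}
		sem <- struct{}{}
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			defer func() { <-sem }()
			_ = complete(ctx, id) // real code would log/requeue failures
		}(blockID)
	}
	wg.Wait()
}
```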
What this PR does:
As a result, when the livestore moves to the shutdown phase, it will flush WALs but not complete them. The outstanding WALs will be picked up during the next livestore startup.
Which issue(s) this PR fixes:
Fixes N/A
Checklist
`CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`