bugfix(livestore): skip lookback replay when partition is Inactive #7101

Merged
zhxiaogg merged 3 commits into grafana:main from zhxiaogg:livestore-skip-lookback-inactive-partition
May 5, 2026

Conversation

Contributor

@zhxiaogg zhxiaogg commented May 4, 2026

What this PR does:

  • If a Livestore partition has been marked Inactive by /prepare-partition-downscale, the Livestore instance should not replay from the beginning of the partition on any intermediate restart.

Why:
This avoids partition lag when a scale-down overlaps with rolling restarts:

  1. When downscaling starts, /prepare-partition-downscale sets the target partitions to Inactive.
  2. The corresponding livestore pod becomes read-only until all of its WAL blocks have expired.
  3. If a rolling restart happens after all WAL blocks have expired, the livestore might replay from the beginning of the partition. This is unexpected.
  4. Subsequent downscaling steps can still succeed, but the unexpected lookback replay can cause partition lag. This PR avoids that replay on any intermediate rolling restart.

Which issue(s) this PR fixes:
Fixes N/A

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Contributor

Copilot AI left a comment


Pull request overview

This PR adjusts live-store Kafka startup behavior so that when a live-store’s ingest partition is already marked Inactive (typical during downscaling/draining), the process does not force a lookback replay solely because local WAL-derived state is empty after restart. This prevents expensive/unnecessary replay during intermediate restarts while a scale-down is in progress.

Changes:

  • Refactors the “force lookback replay when no local instances exist” decision into shouldForceFromLookback(ctx).
  • Skips forced lookback replay when the partition ring reports the partition state as PartitionInactive.
  • Adds unit tests covering the three key branches (instances exist, no instances + inactive, no instances + non-inactive) and updates the changelog.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| modules/livestore/live_store.go | Introduces shouldForceFromLookback() and uses partition state to skip lookback replay when Inactive. |
| modules/livestore/live_store_test.go | Adds tests validating the new decision logic for lookback replay. |
| CHANGELOG.md | Adds a [BUGFIX] entry documenting the behavior change. |

Contributor

@mattdurham mattdurham left a comment


lgtm

@zhxiaogg zhxiaogg merged commit 82242b9 into grafana:main May 5, 2026
31 checks passed
@zhxiaogg zhxiaogg deleted the livestore-skip-lookback-inactive-partition branch May 5, 2026 17:14