
[backendscheduler]: fix three post-ship redaction bugs #6992

Draft
zalegrala wants to merge 2 commits into grafana:main from zalegrala:backendscheduler_redaction_postship_fixes

Conversation

@zalegrala
Contributor

What this PR does:

Fixes three bugs observed in the redaction implementation after initial ship.

Bug 1 — Batch not cleaned up after dead-job timeout

Prune() calls j.Fail() directly on the job struct to avoid re-acquiring the shard lock. This bypasses the UpdateJob code path that normally calls cleanupBatchIfDone, leaving the tenant's RedactionBatch in batchStore forever. A tenant whose redaction worker was killed (e.g. during scale-down) would be permanently blocked from new SubmitRedaction calls (AlreadyExists) and from all compaction until the scheduler restarted.

Fix: collect timed-out jobs per-shard (zero-contention, one slice per shard goroutine), clean up runningBlocks and workerJobs after the WaitGroup, and add cleanupOrphanedBatches called on every maintenance tick after Prune.
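A minimal sketch of the fixed flow, in Go. The names workerJobs, runningBlocks, UpdateJob, and cleanupOrphanedBatches come from this PR; the Job/shard/Work layouts, field names, and helpers below are illustrative stand-ins, not Tempo's actual types.

```go
package work

import (
	"sync"
	"time"
)

// Illustrative stand-ins for the real types in modules/backendscheduler/work;
// the field names are assumptions made for this sketch.
type Job struct {
	ID        string
	WorkerID  string
	running   bool
	startTime time.Time
}

func (j *Job) Fail() { j.running = false }

type shard struct {
	mtx  sync.Mutex
	jobs map[string]*Job
}

type Work struct {
	shards     [256]shard
	pendingMtx sync.Mutex
	workerJobs map[string]string // workerID -> jobID
}

// Prune fails jobs that exceed the dead-job timeout, then cleans up the
// cross-shard state that calling j.Fail() directly would otherwise leave behind.
func (w *Work) Prune(deadJobTimeout time.Duration) {
	timedOut := make([][]*Job, len(w.shards)) // one slice per shard goroutine: zero contention

	var wg sync.WaitGroup
	for i := range w.shards {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s := &w.shards[i]
			s.mtx.Lock()
			defer s.mtx.Unlock()
			for _, j := range s.jobs {
				if j.running && time.Since(j.startTime) > deadJobTimeout {
					j.Fail() // shard lock already held; skips the UpdateJob path
					timedOut[i] = append(timedOut[i], j)
				}
			}
		}(i)
	}
	wg.Wait()

	// Post-pass after the WaitGroup: clear the worker index (and, in the real
	// code, runningBlocks) for every job the direct Fail() bypassed.
	w.pendingMtx.Lock()
	for _, jobs := range timedOut {
		for _, j := range jobs {
			delete(w.workerJobs, j.WorkerID)
		}
	}
	w.pendingMtx.Unlock()
}
```

The batch side is handled separately: the maintenance tick calls cleanupOrphanedBatches after Prune, sweeping any RedactionBatch in batchStore whose jobs are all terminal, so a missed per-job cleanup cannot block the tenant past the next tick.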

Bug 2 — Outstanding-blocks metric suppressed during redaction

measureTenants() called newBlockSelector(), which returns an empty selector when TenantPending is true. The outstanding-blocks metric dropped to zero for the affected tenant during a redaction, causing the autoscaler to see no work and scale workers down mid-redaction — slowing completion and potentially triggering more dead-job timeouts.

Fix: add newBlockSelectorForMeasurement that skips the TenantPending guard; measureTenants uses it. The guard is preserved in newBlockSelector (used for job creation and tenant prioritization).
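A sketch of the selector split, assuming hypothetical tenantInfo/blockSelector shapes; only the placement of the TenantPending guard is taken from the PR, the rest is illustrative.

```go
// Illustrative types; the real selector lives in
// modules/backendscheduler/provider/compaction.go.
type blockMeta struct{ /* elided */ }

type tenantInfo struct {
	TenantPending bool // a redaction batch is in flight for this tenant
	blocks        []blockMeta
}

type blockSelector struct{ blocks []blockMeta }

// newBlockSelector gates job creation and tenant prioritization: while a
// redaction is pending it returns an empty selector, so no compaction jobs
// are created for the tenant.
func newBlockSelector(t *tenantInfo) blockSelector {
	if t.TenantPending {
		return blockSelector{} // empty: compaction stays gated during redaction
	}
	return blockSelector{blocks: t.blocks}
}

// newBlockSelectorForMeasurement skips the TenantPending guard so that
// measureTenants keeps reporting the tenant's real outstanding-blocks count
// during a redaction and the autoscaler doesn't scale workers to zero.
func newBlockSelectorForMeasurement(t *tenantInfo) blockSelector {
	return blockSelector{blocks: t.blocks}
}
```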

Bug 3 — O(N×ShardCount) GetJobForWorker lock contention

GetJobForWorker acquired all 256 shard locks sequentially and scanned every job in each shard to find the one owned by the requesting worker. With hundreds of workers calling Next() concurrently after a large redaction submission, this became a heavily contended hot path.

Fix: add a workerJobs map[string]string index (workerID → jobID) under pendingMtx. GetJobForWorker is now a single pendingMtx lock + O(1) map lookup. Index is populated in AddJob (caller must SetWorkerID before AddJob, matching the existing production path), cleared by CompleteJob, FailJob, and the Prune timeout path, and rebuilt in rebuildPendingIndexes/Unmarshal.
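A sketch of the index wiring, reusing the illustrative Work and Job types from the Prune sketch above; getJobByID and shardFor are hypothetical helpers standing in for the real single-shard lookup.

```go
import "hash/fnv"

// AddJob records the worker -> job mapping up front. The caller must
// SetWorkerID before AddJob (the existing production path), so the index
// entry can be written here.
func (w *Work) AddJob(j *Job) error {
	w.pendingMtx.Lock()
	w.workerJobs[j.WorkerID] = j.ID
	w.pendingMtx.Unlock()
	// ... insert j into its owning shard, as before ...
	return nil
}

// GetJobForWorker is now one pendingMtx lock plus an O(1) map lookup,
// instead of a sequential scan under all 256 shard locks.
func (w *Work) GetJobForWorker(workerID string) *Job {
	w.pendingMtx.Lock()
	jobID, ok := w.workerJobs[workerID]
	w.pendingMtx.Unlock()
	if !ok {
		return nil
	}
	return w.getJobByID(jobID) // touches exactly one shard lock
}

func (w *Work) getJobByID(id string) *Job {
	s := &w.shards[shardFor(id)]
	s.mtx.Lock()
	defer s.mtx.Unlock()
	return s.jobs[id]
}

func shardFor(id string) int {
	h := fnv.New32a()
	h.Write([]byte(id))
	return int(h.Sum32() % 256)
}
```

Since the index lives under the existing pendingMtx, the fast path adds no new lock to the ordering; the shardFor hash above is purely illustrative.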

Which issue(s) this PR fixes:

Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

1. Batch not cleaned up after dead-job timeout
Prune() calls j.Fail() directly to avoid re-acquiring the shard lock,
which bypassed the UpdateJob path that normally calls cleanupBatchIfDone.
Result: a timed-out running job left the tenant's RedactionBatch in
batchStore forever, permanently blocking new SubmitRedaction calls and
all compaction for that tenant. Fix: collect timed-out jobs per shard,
clean up runningBlocks + workerJobs after the WaitGroup, and call
cleanupOrphanedBatches on every maintenance tick.

2. Outstanding-blocks metric suppressed during redaction
measureTenants() called newBlockSelector() which returns an empty
selector when TenantPending is true. The outstanding-blocks metric
dropped to zero for the tenant, causing autoscaling to see no work and
scale workers down mid-redaction. Fix: add newBlockSelectorForMeasurement
that bypasses the TenantPending guard; use it in measureTenants only.

3. O(N) GetJobForWorker lock contention after redaction
GetJobForWorker acquired all 256 shard locks sequentially and scanned each shard's jobs.
With hundreds of workers polling concurrently this became a hot path.
Fix: add a workerJobs map[string]string index (workerID->jobID) under
pendingMtx; GetJobForWorker is now a single lock + O(1) map lookup.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 16, 2026 20:47

Copilot AI left a comment


Pull request overview

Fixes three post-ship issues in the backend-scheduler redaction flow by ensuring timed-out jobs clean up associated in-memory indexes and batch manifests, preventing metrics suppression during active redaction, and eliminating shard-wide lock scans when re-associating jobs with workers.

Changes:

  • Add an O(1) workerID -> jobID index to avoid scanning all shards in GetJobForWorker, and ensure it’s maintained on add/complete/fail/prune.
  • Ensure the dead-job timeout path in Prune() cleans runningBlocks and workerJobs, and add a maintenance sweep to remove orphaned redaction batches.
  • Add a measurement-only block selector for outstanding-blocks metrics that ignores the TenantPending guard used to gate compaction job creation.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Summary per file:

modules/backendscheduler/work/work.go: Adds the workerJobs index, updates Prune() to collect timed-out jobs for post-pass index cleanup, and changes GetJobForWorker() to an O(1) lookup.
modules/backendscheduler/work/work_sharded_test.go: Updates worker-job tests to match the new indexing behavior and adds coverage for prune-timeout cleanup of the indexes.
modules/backendscheduler/provider/compaction.go: Uses a new measurement-only selector in measureTenants() to avoid suppressing outstanding-blocks metrics during redaction.
modules/backendscheduler/provider/compaction_test.go: Adds a test verifying that measurement ignores TenantPending while compaction gating remains in effect.
modules/backendscheduler/backendscheduler.go: Adds cleanupOrphanedBatches() to the maintenance tick to ensure batches are removed even when Prune() fails jobs directly.

