
[backendscheduler]: fix three post-ship redaction bugs #6992

Draft
zalegrala wants to merge 2 commits into grafana:main from zalegrala:backendscheduler_redaction_postship_fixes

Conversation

@zalegrala
Contributor

What this PR does:

Fixes three bugs observed in the redaction implementation after initial ship.

Bug 1 — Batch not cleaned up after dead-job timeout

Prune() calls j.Fail() directly on the job struct to avoid re-acquiring the shard lock. This bypasses the UpdateJob code path that normally calls cleanupBatchIfDone, leaving the tenant's RedactionBatch in batchStore forever. A tenant whose redaction worker was killed (e.g. during scale-down) would be permanently blocked from new SubmitRedaction calls (AlreadyExists) and from all compaction until the scheduler restarted.

Fix: collect timed-out jobs per-shard (zero-contention, one slice per shard goroutine), clean up runningBlocks and workerJobs after the WaitGroup, and add cleanupOrphanedBatches called on every maintenance tick after Prune.
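A minimal sketch of the fixed flow, in Go. The names workerJobs, runningBlocks, UpdateJob, and cleanupOrphanedBatches come from this PR; the Job/shard/Work layouts, field names, and helpers below are illustrative stand-ins, not Tempo's actual types.

```go
package work

import (
	"sync"
	"time"
)

// Illustrative stand-ins for the real types in modules/backendscheduler/work;
// the field names are assumptions made for this sketch.
type Job struct {
	ID        string
	WorkerID  string
	running   bool
	startTime time.Time
}

func (j *Job) Fail() { j.running = false }

type shard struct {
	mtx  sync.Mutex
	jobs map[string]*Job
}

type Work struct {
	shards     [256]shard
	pendingMtx sync.Mutex
	workerJobs map[string]string // workerID -> jobID
}

// Prune fails jobs that exceed the dead-job timeout, then cleans up the
// cross-shard state that calling j.Fail() directly would otherwise leave behind.
func (w *Work) Prune(deadJobTimeout time.Duration) {
	timedOut := make([][]*Job, len(w.shards)) // one slice per shard goroutine: zero contention

	var wg sync.WaitGroup
	for i := range w.shards {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s := &w.shards[i]
			s.mtx.Lock()
			defer s.mtx.Unlock()
			for _, j := range s.jobs {
				if j.running && time.Since(j.startTime) > deadJobTimeout {
					j.Fail() // shard lock already held; skips the UpdateJob path
					timedOut[i] = append(timedOut[i], j)
				}
			}
		}(i)
	}
	wg.Wait()

	// Post-pass after the WaitGroup: clear the worker index (and, in the real
	// code, runningBlocks) for every job the direct Fail() bypassed.
	w.pendingMtx.Lock()
	for _, jobs := range timedOut {
		for _, j := range jobs {
			delete(w.workerJobs, j.WorkerID)
		}
	}
	w.pendingMtx.Unlock()
}
```

The batch side is handled separately: the maintenance tick calls cleanupOrphanedBatches after Prune, sweeping any RedactionBatch in batchStore whose jobs are all terminal, so a missed per-job cleanup cannot block the tenant past the next tick.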

Bug 2 — Outstanding-blocks metric suppressed during redaction

measureTenants() called newBlockSelector(), which returns an empty selector when TenantPending is true. The outstanding-blocks metric dropped to zero for the affected tenant during a redaction, causing the autoscaler to see no work and scale workers down mid-redaction — slowing completion and potentially triggering more dead-job timeouts.

Fix: add newBlockSelectorForMeasurement that skips the TenantPending guard; measureTenants uses it. The guard is preserved in newBlockSelector (used for job creation and tenant prioritization).
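A sketch of the selector split, assuming hypothetical tenantInfo/blockSelector shapes; only the placement of the TenantPending guard is taken from the PR, the rest is illustrative.

```go
// Illustrative types; the real selector lives in
// modules/backendscheduler/provider/compaction.go.
type blockMeta struct{ /* elided */ }

type tenantInfo struct {
	TenantPending bool // a redaction batch is in flight for this tenant
	blocks        []blockMeta
}

type blockSelector struct{ blocks []blockMeta }

// newBlockSelector gates job creation and tenant prioritization: while a
// redaction is pending it returns an empty selector, so no compaction jobs
// are created for the tenant.
func newBlockSelector(t *tenantInfo) blockSelector {
	if t.TenantPending {
		return blockSelector{} // empty: compaction stays gated during redaction
	}
	return blockSelector{blocks: t.blocks}
}

// newBlockSelectorForMeasurement skips the TenantPending guard so that
// measureTenants keeps reporting the tenant's real outstanding-blocks count
// during a redaction and the autoscaler doesn't scale workers to zero.
func newBlockSelectorForMeasurement(t *tenantInfo) blockSelector {
	return blockSelector{blocks: t.blocks}
}
```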

Bug 3 — O(N×ShardCount) GetJobForWorker lock contention

GetJobForWorker acquired all 256 shard locks sequentially and scanned every job in each shard to find the one owned by the requesting worker. With hundreds of workers calling Next() concurrently after a large redaction submission, this became a heavily contended hot path.

Fix: add a workerJobs map[string]string index (workerID → jobID) under pendingMtx. GetJobForWorker is now a single pendingMtx lock + O(1) map lookup. Index is populated in AddJob (caller must SetWorkerID before AddJob, matching the existing production path), cleared by CompleteJob, FailJob, and the Prune timeout path, and rebuilt in rebuildPendingIndexes/Unmarshal.
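A sketch of the index wiring, reusing the illustrative Work and Job types from the Prune sketch above; getJobByID and shardFor are hypothetical helpers standing in for the real single-shard lookup.

```go
import "hash/fnv"

// AddJob records the worker -> job mapping up front. The caller must
// SetWorkerID before AddJob (the existing production path), so the index
// entry can be written here.
func (w *Work) AddJob(j *Job) error {
	w.pendingMtx.Lock()
	w.workerJobs[j.WorkerID] = j.ID
	w.pendingMtx.Unlock()
	// ... insert j into its owning shard, as before ...
	return nil
}

// GetJobForWorker is now one pendingMtx lock plus an O(1) map lookup,
// instead of a sequential scan under all 256 shard locks.
func (w *Work) GetJobForWorker(workerID string) *Job {
	w.pendingMtx.Lock()
	jobID, ok := w.workerJobs[workerID]
	w.pendingMtx.Unlock()
	if !ok {
		return nil
	}
	return w.getJobByID(jobID) // touches exactly one shard lock
}

func (w *Work) getJobByID(id string) *Job {
	s := &w.shards[shardFor(id)]
	s.mtx.Lock()
	defer s.mtx.Unlock()
	return s.jobs[id]
}

func shardFor(id string) int {
	h := fnv.New32a()
	h.Write([]byte(id))
	return int(h.Sum32() % 256)
}
```

Since the index lives under the existing pendingMtx, the fast path adds no new lock to the ordering; the shardFor hash above is purely illustrative.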

Which issue(s) this PR fixes:

Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

1. Batch not cleaned up after dead-job timeout
Prune() calls j.Fail() directly to avoid re-acquiring the shard lock,
which bypassed the UpdateJob path that normally calls cleanupBatchIfDone.
Result: a timed-out running job left the tenant's RedactionBatch in
batchStore forever, permanently blocking new SubmitRedaction calls and
all compaction for that tenant. Fix: collect timed-out jobs per shard,
clean up runningBlocks + workerJobs after the WaitGroup, and call
cleanupOrphanedBatches on every maintenance tick.

2. Outstanding-blocks metric suppressed during redaction
measureTenants() called newBlockSelector() which returns an empty
selector when TenantPending is true. The outstanding-blocks metric
dropped to zero for the tenant, causing autoscaling to see no work and
scale workers down mid-redaction. Fix: add newBlockSelectorForMeasurement
that bypasses the TenantPending guard; use it in measureTenants only.

3. O(N) GetJobForWorker lock contention after redaction
GetJobForWorker acquired all 256 shard locks sequentially and scanned each shard's jobs.
With hundreds of workers polling concurrently this became a hot path.
Fix: add a workerJobs map[string]string index (workerID->jobID) under
pendingMtx; GetJobForWorker is now a single lock + O(1) map lookup.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 16, 2026 20:47

Copilot AI left a comment


Pull request overview

Fixes three post-ship issues in the backend-scheduler redaction flow by ensuring timed-out jobs clean up associated in-memory indexes and batch manifests, preventing metrics suppression during active redaction, and eliminating shard-wide lock scans when re-associating jobs with workers.

Changes:

  • Add an O(1) workerID -> jobID index to avoid scanning all shards in GetJobForWorker, and ensure it’s maintained on add/complete/fail/prune.
  • Ensure the dead-job timeout path in Prune() cleans runningBlocks and workerJobs, and add a maintenance sweep to remove orphaned redaction batches.
  • Add a measurement-only block selector for outstanding-blocks metrics that ignores the TenantPending guard used to gate compaction job creation.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Summary per file:

modules/backendscheduler/work/work.go: Adds the workerJobs index, updates Prune() to collect timed-out jobs for post-pass index cleanup, and changes GetJobForWorker() to an O(1) lookup.
modules/backendscheduler/work/work_sharded_test.go: Updates worker-job tests to match the new indexing behavior and adds coverage for prune-timeout cleanup of the indexes.
modules/backendscheduler/provider/compaction.go: Uses a new measurement-only selector in measureTenants() to avoid suppressing outstanding-blocks metrics during redaction.
modules/backendscheduler/provider/compaction_test.go: Adds a test verifying that measurement ignores TenantPending while compaction gating remains in effect.
modules/backendscheduler/backendscheduler.go: Adds cleanupOrphanedBatches() to the maintenance tick to ensure batches are removed even when Prune() fails jobs directly.

