[backendscheduler]: fix three post-ship redaction bugs#6992
Draft
zalegrala wants to merge 2 commits into grafana:main from
Conversation
1. Batch not cleaned up after dead-job timeout

Prune() calls j.Fail() directly to avoid re-acquiring the shard lock, which bypassed the UpdateJob path that normally calls cleanupBatchIfDone. Result: a timed-out running job left the tenant's RedactionBatch in batchStore forever, permanently blocking new SubmitRedaction calls and all compaction for that tenant.

Fix: collect timed-out jobs per shard, clean up runningBlocks + workerJobs after the WaitGroup, and call cleanupOrphanedBatches on every maintenance tick.

2. Outstanding-blocks metric suppressed during redaction

measureTenants() called newBlockSelector(), which returns an empty selector when TenantPending is true. The outstanding-blocks metric dropped to zero for the tenant, causing autoscaling to see no work and scale workers down mid-redaction.

Fix: add newBlockSelectorForMeasurement that bypasses the TenantPending guard; use it in measureTenants only.

3. O(N) GetJobForWorker lock contention after redaction

GetJobForWorker acquired and scanned all 256 shard locks sequentially. With hundreds of workers polling concurrently this became a hot path.

Fix: add a workerJobs map[string]string index (workerID -> jobID) under pendingMtx; GetJobForWorker is now a single lock + O(1) map lookup.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor
Pull request overview
Fixes three post-ship issues in the backend-scheduler redaction flow by ensuring timed-out jobs clean up associated in-memory indexes and batch manifests, preventing metrics suppression during active redaction, and eliminating shard-wide lock scans when re-associating jobs with workers.
Changes:
- Add an O(1) `workerID -> jobID` index to avoid scanning all shards in `GetJobForWorker`, and ensure it's maintained on add/complete/fail/prune.
- Ensure the dead-job timeout path in `Prune()` cleans `runningBlocks` and `workerJobs`, and add a maintenance sweep to remove orphaned redaction batches.
- Add a measurement-only block selector for outstanding-blocks metrics that ignores the `TenantPending` guard used to gate compaction job creation.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| modules/backendscheduler/work/work.go | Adds workerJobs index, updates Prune() to collect timed-out jobs for post-pass index cleanup, and changes GetJobForWorker() to O(1) lookup. |
| modules/backendscheduler/work/work_sharded_test.go | Updates worker-job tests to match new indexing behavior and adds coverage for prune timeout cleanup of indexes. |
| modules/backendscheduler/provider/compaction.go | Uses a new measurement-only selector in measureTenants() to avoid suppressing outstanding-blocks metrics during redaction. |
| modules/backendscheduler/provider/compaction_test.go | Adds a test to verify measurement ignores TenantPending while compaction gating remains in effect. |
| modules/backendscheduler/backendscheduler.go | Adds cleanupOrphanedBatches() to the maintenance tick to ensure batches are removed even when Prune() fails jobs directly. |
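For illustration, the orphaned-batch sweep described in the last row might look roughly like the following. This is a hedged sketch, not the actual implementation: `batchStore.All`, `batch.ID`, `HasLiveJobsForBatch`, and the `BackendScheduler` fields shown are assumed names; only `cleanupOrphanedBatches` and its placement on the maintenance tick after `Prune()` are confirmed by the PR.

```go
// Hypothetical sketch of the maintenance-tick sweep. batchStore.All,
// batch.ID, and HasLiveJobsForBatch are invented stand-ins for whatever
// the real batch store exposes.
func (s *BackendScheduler) cleanupOrphanedBatches() {
	for tenant, batch := range s.batchStore.All() {
		// A batch is orphaned when every job it spawned has finished or
		// been failed directly by Prune(), so the UpdateJob path that
		// normally runs cleanupBatchIfDone never fired for it.
		if !s.work.HasLiveJobsForBatch(batch.ID) {
			s.batchStore.Delete(tenant)
		}
	}
}

// On each maintenance tick, the sweep runs after the prune pass:
//
//	s.work.Prune(ctx)
//	s.cleanupOrphanedBatches()
```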
What this PR does:
Fixes three bugs observed in the redaction implementation after initial ship.
Bug 1 — Batch not cleaned up after dead-job timeout
`Prune()` calls `j.Fail()` directly on the job struct to avoid re-acquiring the shard lock. This bypassed the `UpdateJob` code path that normally calls `cleanupBatchIfDone`, leaving the tenant's `RedactionBatch` in `batchStore` forever. A tenant whose redaction worker was killed (e.g. scale-down) would be permanently blocked from new `SubmitRedaction` calls (`AlreadyExists`) and all compaction until the scheduler restarted.

Fix: collect timed-out jobs per shard (zero contention, one slice per shard goroutine), clean up `runningBlocks` and `workerJobs` after the `WaitGroup`, and add `cleanupOrphanedBatches`, called on every maintenance tick after `Prune`.
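As a rough sketch of that shape (simplified types throughout: `shards`, `IsRunning`, and `StartedAt` are illustrative stand-ins, while the per-shard slices, the post-`WaitGroup` cleanup, and the `runningBlocks`/`workerJobs` indexes come from the description above):

```go
// Sketch of the revised Prune(): each shard goroutine fails its own
// timed-out jobs under its own lock and records them in a slice that no
// other goroutine touches; the shared indexes are cleaned only after the
// WaitGroup completes.
func (w *Work) Prune(deadJobTimeout time.Duration) {
	timedOut := make([][]*Job, len(w.shards)) // one slice per shard goroutine

	var wg sync.WaitGroup
	for i := range w.shards {
		wg.Add(1)
		go func(i int, s *shard) {
			defer wg.Done()
			s.mtx.Lock()
			defer s.mtx.Unlock()
			for _, j := range s.jobs {
				if j.IsRunning() && time.Since(j.StartedAt) > deadJobTimeout {
					j.Fail() // direct call; avoids re-acquiring the shard lock
					timedOut[i] = append(timedOut[i], j)
				}
			}
		}(i, w.shards[i])
	}
	wg.Wait()

	// Post-pass: maintain the shared indexes that the bypassed UpdateJob
	// path would normally have cleaned.
	w.pendingMtx.Lock()
	defer w.pendingMtx.Unlock()
	for _, jobs := range timedOut {
		for _, j := range jobs {
			delete(w.runningBlocks, j.ID)
			delete(w.workerJobs, j.WorkerID)
		}
	}
}
```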
Bug 2 — Outstanding-blocks metric suppressed during redaction

`measureTenants()` called `newBlockSelector()`, which returns an empty selector when `TenantPending` is true. The outstanding-blocks metric dropped to zero for the affected tenant during a redaction, causing the autoscaler to see no work and scale workers down mid-redaction — slowing completion and potentially triggering more dead-job timeouts.

Fix: add `newBlockSelectorForMeasurement`, which skips the `TenantPending` guard; `measureTenants` uses it. The guard is preserved in `newBlockSelector` (used for job creation and tenant prioritization).
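A minimal sketch of the split (the `tenant`, `blockSelector`, `emptyBlockSelector`, and `newSelectorOverBlocks` names are assumed helpers; only the two function names and the `TenantPending` guard come from the PR):

```go
// newBlockSelector gates job creation: while a redaction batch is pending
// for the tenant, no compaction jobs may be created, so it selects nothing.
// (Helper names here are illustrative, not the real provider API.)
func (p *Provider) newBlockSelector(t *tenant) blockSelector {
	if t.TenantPending {
		return emptyBlockSelector{}
	}
	return newSelectorOverBlocks(t.blocks)
}

// newBlockSelectorForMeasurement skips the TenantPending guard so that
// measureTenants() keeps reporting outstanding blocks during a redaction,
// and the autoscaler continues to see the remaining work.
func (p *Provider) newBlockSelectorForMeasurement(t *tenant) blockSelector {
	return newSelectorOverBlocks(t.blocks)
}
```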
Bug 3 — O(N×ShardCount) `GetJobForWorker` lock contention

`GetJobForWorker` acquired all 256 shard locks sequentially and scanned every job in each shard to find the one owned by the requesting worker. With hundreds of workers calling `Next()` concurrently after a large redaction submission, this became a heavily contended hot path.

Fix: add a `workerJobs map[string]string` index (workerID → jobID) under `pendingMtx`. `GetJobForWorker` is now a single `pendingMtx` lock + O(1) map lookup. The index is populated in `AddJob` (the caller must `SetWorkerID` before `AddJob`, matching the existing production path), cleared by `CompleteJob`, `FailJob`, and the `Prune` timeout path, and rebuilt in `rebuildPendingIndexes`/`Unmarshal`.
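A condensed sketch of the index, with the sharded job stores collapsed into a single map so the indexing logic stands alone; only `workerJobs`, `pendingMtx`, and the add/complete/lookup semantics are from the PR, everything else is a stand-in:

```go
package work

import "sync"

type Job struct {
	ID       string
	WorkerID string
}

type Work struct {
	pendingMtx sync.Mutex
	workerJobs map[string]string // workerID -> jobID
	jobs       map[string]*Job   // stand-in for the sharded stores
}

// AddJob assumes SetWorkerID has already run (the existing production
// path), so the index can be written at insert time.
func (w *Work) AddJob(j *Job) {
	w.pendingMtx.Lock()
	defer w.pendingMtx.Unlock()
	w.jobs[j.ID] = j
	if j.WorkerID != "" {
		w.workerJobs[j.WorkerID] = j.ID
	}
}

// GetJobForWorker: one lock, one map lookup, no shard scan.
func (w *Work) GetJobForWorker(workerID string) *Job {
	w.pendingMtx.Lock()
	defer w.pendingMtx.Unlock()
	if jobID, ok := w.workerJobs[workerID]; ok {
		return w.jobs[jobID]
	}
	return nil
}

// CompleteJob clears the index entry; FailJob and the Prune timeout path
// do the same, and rebuildPendingIndexes/Unmarshal repopulate it on load.
func (w *Work) CompleteJob(j *Job) {
	w.pendingMtx.Lock()
	defer w.pendingMtx.Unlock()
	delete(w.workerJobs, j.WorkerID)
	delete(w.jobs, j.ID)
}
```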
Which issue(s) this PR fixes:
Fixes #
Checklist
`CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`