proportion plugin double-counts resources for jobs with tasks stuck in Binding, starves queue #5099
Description
So I've been debugging a scheduling stall that's been driving me crazy and finally nailed it down to the proportion plugin.
What's happening
When job-a gets allocated and its tasks move to Binding (pods are being created, not yet Running), the plugin counts those resources in both attr.allocated AND attr.inqueue. So a queue that's actually at 75% looks full at 150% to the scheduler, and any new job trying to get in gets rejected.
The window is usually short, 20ms to a few seconds, but under any real load it's long enough to matter. And with a steady stream of jobs binding, it basically never clears.
Why it happens
AllocatedStatus includes Binding. ScheduledStatus doesn't. The PodGroup phase check uses ScheduledStatus, so a job with all tasks in Binding still reads as PodGroupInqueue, meaning it gets counted in attr.inqueue on top of attr.allocated.
```
Binding → in AllocatedStatus ✅ but NOT in ScheduledStatus
  → tasks counted in attr.allocated ✅
  → PodGroup phase stays Inqueue → also counted in attr.inqueue ❌
```
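To make the mismatch concrete, here is a minimal standalone sketch of the two status checks. The real helpers live in Volcano's `pkg/scheduler/api` and cover more statuses; the sets here are reduced (an assumption) to just the ones relevant to this bug.

```go
package main

import "fmt"

// TaskStatus is a simplified stand-in for Volcano's task status enum.
type TaskStatus int

const (
	Pending TaskStatus = iota
	Binding
	Running
)

// AllocatedStatus: a Binding task already holds resources, so it is
// counted in attr.allocated.
func AllocatedStatus(s TaskStatus) bool {
	return s == Binding || s == Running
}

// ScheduledStatus (simplified): Binding is NOT included, so a job whose
// tasks are all Binding does not look scheduled yet.
func ScheduledStatus(s TaskStatus) bool {
	return s == Running
}

func main() {
	s := Binding
	// Allocated but not scheduled: the PodGroup phase check therefore
	// still reports PodGroupInqueue, and the same resources are added
	// to attr.inqueue on top of attr.allocated.
	fmt.Println(AllocatedStatus(s), ScheduledStatus(s)) // true false
}
```

The gap between these two predicates is exactly the double-counting window.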
Steps to reproduce the issue
4-CPU queue. Submit job-a with 3×1CPU tasks. While its pods are still in ContainerCreating, submit job-b with 1×1CPU. It'll sit Pending even though there's a free CPU. Clears up the moment job-a's pods hit Running.
Scheduler logs during that window will show something like:
```
queue overused, used=6CPU > capacity=4CPU
```
...on a queue that physically has 4 CPUs in use at most.
Describe the results you received and expected
The bug: job-b times out waiting to be scheduled. There is capacity (queue: 4 CPU, job-a uses 3 CPU, job-b needs 1 CPU), but the scheduler incorrectly reports the queue as over-capacity (6 CPU used = 3 allocated + 3 double-counted inqueue), so job-b stays stuck in Pending for as long as the window persists.
After the fix: job-b is scheduled correctly because the already-allocated 3 CPU is deducted from the inqueue calculation, giving 0 inqueue + 3 allocated = 3 CPU ≤ 4 CPU capacity.
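The arithmetic above can be checked with a toy model of the accounting. Function names and the scalar-CPU simplification are illustrative, not the plugin's real fields:

```go
package main

import "fmt"

// buggyUsed models the current behavior: the job's full min resources
// are added to inqueue even though they are already in allocated.
func buggyUsed(allocated, minResources int) int {
	return allocated + minResources
}

// fixedUsed models the fix: deduct what is already allocated from the
// inqueue contribution, clamped at zero.
func fixedUsed(allocated, minResources int) int {
	inqueue := minResources - allocated
	if inqueue < 0 {
		inqueue = 0
	}
	return allocated + inqueue
}

func main() {
	const capacity, jobA = 4, 3 // 4-CPU queue, job-a holds 3 CPU
	fmt.Println(buggyUsed(jobA, jobA) > capacity)  // true: 6 > 4, job-b blocked
	fmt.Println(fixedUsed(jobA, jobA) <= capacity) // true: 3 <= 4, job-b fits
}
```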
What version of Volcano are you using?
I reproduced this on v1.14.1 (the latest release), against the master branch at commit 62ebc87. The bug exists in the current master as well.
Any other relevant information
Unit test
The race window is small on a real cluster so I wrote a test that reproduces it deterministically:
```
go test ./pkg/scheduler/plugins/proportion/... \
  -run TestNoDoubleCountingForInqueueJobWithBindingTasks -v -count=1 -timeout 30s
```

Cycle 1 allocates job-a (3×1CPU) so its tasks enter Binding while the PodGroup stays Inqueue. Cycle 2 tries to enqueue job-b (1×1CPU) on the same 4-CPU queue. With the bug, attr.allocated=3 + attr.inqueue=3 = 6 > 4 and job-b gets blocked. With the fix, attr.inqueue=0 and job-b binds fine.
The fix
The PodGroupRunning branch already handles this correctly using util.GetInqueueResource which does max(0, minResources - already_allocated). The PodGroupInqueue branch just needs the same treatment:
```go
// currently
attr.inqueue.Add(job.DeductSchGatedResources(job.GetMinResources()))

// should be
inqueued := util.GetInqueueResource(job, job.Allocated)
attr.inqueue.Add(job.DeductSchGatedResources(inqueued))
```

Opening a PR for the fix! 😁
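For reference, the max(0, minResources - allocated) semantics described above can be sketched per resource dimension. This is a standalone illustration, not the real `util.GetInqueueResource` (which operates on `*api.Resource` in `pkg/scheduler/plugins/util`):

```go
package main

import "fmt"

// Resource is a toy stand-in for a multi-dimensional resource vector.
type Resource map[string]float64

// getInqueue returns max(0, minRes - allocated) for each dimension,
// so resources already held by Binding tasks are not counted twice.
func getInqueue(minRes, allocated Resource) Resource {
	out := Resource{}
	for name, want := range minRes {
		left := want - allocated[name]
		if left < 0 {
			left = 0 // clamp: never report negative inqueue demand
		}
		out[name] = left
	}
	return out
}

func main() {
	minRes := Resource{"cpu": 3, "memory": 6}
	alloc := Resource{"cpu": 3, "memory": 2}
	// job-a's 3 CPU is already allocated, so CPU contributes nothing
	// to inqueue; only the remaining memory demand is counted.
	fmt.Println(getInqueue(minRes, alloc)) // map[cpu:0 memory:4]
}
```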

