
proportion plugin double-counts resources for jobs with tasks stuck in Binding, starves queue #5099

@Aman-Cool

Description


So I've been debugging a scheduling stall that's been driving me crazy, and I finally tracked it down to the proportion plugin.

What's happening

When job-a gets allocated and its tasks move to Binding (pods are being created, not yet Running), the plugin counts those resources in both attr.allocated AND attr.inqueue. So a queue that's actually at 75% looks full at 150% to the scheduler, and any new job trying to get in gets rejected.

The window is usually short (20 ms to a few seconds), but under any real load it's long enough to matter. And if you have a steady stream of jobs binding, it basically never clears.

Why it happens

AllocatedStatus includes Binding. ScheduledStatus doesn't. The PodGroup phase check uses ScheduledStatus, so a job with all tasks in Binding still reads as PodGroupInqueue, meaning it gets counted in attr.inqueue on top of attr.allocated.

Binding → in AllocatedStatus ✅  but NOT in ScheduledStatus
→ tasks counted in attr.allocated ✅
→ PodGroup phase stays Inqueue → also counted in attr.inqueue ❌

Steps to reproduce the issue

1. Create a queue with 4 CPUs of capacity.
2. Submit job-a with 3×1-CPU tasks.
3. While job-a's pods are still in ContainerCreating, submit job-b with a single 1-CPU task.

job-b will sit Pending even though there's a free CPU, and clears up the moment job-a's pods hit Running.

Scheduler logs during that window will show something like:

queue overused, used=6CPU > capacity=4CPU

...on a queue that physically has 4 CPUs in use at most.

Describe the results you received and expected

Without fix:

The bug: job-b times out waiting to be scheduled. There is capacity (queue: 4 CPU; job-a uses 3 CPU; job-b needs 1 CPU), but the scheduler incorrectly reports the queue as over capacity (6 CPU used = 3 allocated + 3 double-counted inqueue), so job-b is stuck in Pending.

With fix:

After the fix: job-b is scheduled correctly because the already-allocated 3 CPU is deducted from the inqueue calculation, giving 0 inqueue + 3 allocated = 3 CPU ≤ 4 CPU capacity.

What version of Volcano are you using?

I reproduced this on v1.14.1 (the latest release), against the master branch at commit 62ebc87. The bug exists in the current master as well.

Any other relevant information

Unit test

The race window is small on a real cluster so I wrote a test that reproduces it deterministically:

go test ./pkg/scheduler/plugins/proportion/... \
  -run TestNoDoubleCountingForInqueueJobWithBindingTasks -v -count=1 -timeout 30s

Cycle 1 allocates job-a (3×1CPU) so its tasks enter Binding while the PodGroup stays Inqueue. Cycle 2 tries to enqueue job-b (1×1CPU) on the same 4-CPU queue. With the bug, attr.allocated=3 + attr.inqueue=3 = 6 > 4 and job-b gets blocked. With the fix, attr.inqueue=0 and job-b binds fine.

The fix

The PodGroupRunning branch already handles this correctly via util.GetInqueueResource, which computes max(0, minResources - already allocated). The PodGroupInqueue branch just needs the same treatment:

// currently
attr.inqueue.Add(job.DeductSchGatedResources(job.GetMinResources()))

// should be
inqueued := util.GetInqueueResource(job, job.Allocated)
attr.inqueue.Add(job.DeductSchGatedResources(inqueued))

Opening a PR for the fix!😁

Labels: kind/bug