proportion plugin double-counts resources for jobs with tasks stuck in Binding, starves queue #5099
Description
So I've been debugging a scheduling stall that's been driving me crazy and finally nailed it down to the proportion plugin.
What's happening
When job-a gets allocated and its tasks move to Binding (pods are being created, not yet Running), the plugin counts those resources in both attr.allocated AND attr.inqueue. So a queue that's actually at 75% looks full at 150% to the scheduler, and any new job trying to get in gets rejected.
The window is usually short, 20ms to a few seconds, but under any real load it's long enough to matter. And with a steady stream of jobs binding, it basically never clears.
Why it happens
AllocatedStatus includes Binding. ScheduledStatus doesn't. The PodGroup phase check uses ScheduledStatus, so a job with all tasks in Binding still reads as PodGroupInqueue, meaning it gets counted in attr.inqueue on top of attr.allocated.
```
Binding → in AllocatedStatus ✅ but NOT in ScheduledStatus
  → tasks counted in attr.allocated ✅
  → PodGroup phase stays Inqueue → also counted in attr.inqueue ❌
```
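To make the mismatch concrete, here is a minimal standalone sketch of the two status checks. The real helpers live in Volcano's `pkg/scheduler/api` and cover more statuses; the sets here are reduced (an assumption) to just the ones relevant to this bug.

```go
package main

import "fmt"

// TaskStatus is a simplified stand-in for Volcano's task status enum.
type TaskStatus int

const (
	Pending TaskStatus = iota
	Binding
	Running
)

// AllocatedStatus: a Binding task already holds resources, so it is
// counted in attr.allocated.
func AllocatedStatus(s TaskStatus) bool {
	return s == Binding || s == Running
}

// ScheduledStatus (simplified): Binding is NOT included, so a job whose
// tasks are all Binding does not look scheduled yet.
func ScheduledStatus(s TaskStatus) bool {
	return s == Running
}

func main() {
	s := Binding
	// Allocated but not scheduled: the PodGroup phase check therefore
	// still reports PodGroupInqueue, and the same resources are added
	// to attr.inqueue on top of attr.allocated.
	fmt.Println(AllocatedStatus(s), ScheduledStatus(s)) // true false
}
```

The gap between these two predicates is exactly the double-counting window.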
Steps to reproduce the issue
4-CPU queue. Submit job-a with 3×1CPU tasks. While its pods are still in ContainerCreating, submit job-b with 1×1CPU. It'll sit Pending even though there's a free CPU. Clears up the moment job-a's pods hit Running.
Scheduler logs during that window will show something like:
```
queue overused, used=6CPU > capacity=4CPU
```
...on a queue that physically has 4 CPUs in use at most.
Describe the results you received and expected
The bug: job-b times out waiting to be scheduled. There is capacity (queue: 4 CPU, job-a uses 3 CPU, job-b needs 1 CPU), but the scheduler incorrectly reports the queue as over-capacity (6 CPU used = 3 allocated + 3 double-counted inqueue), so job-b stays stuck in Pending for as long as the window persists.
After the fix: job-b is scheduled correctly because the already-allocated 3 CPU is deducted from the inqueue calculation, giving 0 inqueue + 3 allocated = 3 CPU ≤ 4 CPU capacity.
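The arithmetic above can be checked with a toy model of the accounting. Function names and the scalar-CPU simplification are illustrative, not the plugin's real fields:

```go
package main

import "fmt"

// buggyUsed models the current behavior: the job's full min resources
// are added to inqueue even though they are already in allocated.
func buggyUsed(allocated, minResources int) int {
	return allocated + minResources
}

// fixedUsed models the fix: deduct what is already allocated from the
// inqueue contribution, clamped at zero.
func fixedUsed(allocated, minResources int) int {
	inqueue := minResources - allocated
	if inqueue < 0 {
		inqueue = 0
	}
	return allocated + inqueue
}

func main() {
	const capacity, jobA = 4, 3 // 4-CPU queue, job-a holds 3 CPU
	fmt.Println(buggyUsed(jobA, jobA) > capacity)  // true: 6 > 4, job-b blocked
	fmt.Println(fixedUsed(jobA, jobA) <= capacity) // true: 3 <= 4, job-b fits
}
```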
What version of Volcano are you using?
I reproduced this on v1.14.1 (the latest release), against the master branch at commit 62ebc87. The bug exists in the current master as well.
Any other relevant information
Unit test
The race window is small on a real cluster so I wrote a test that reproduces it deterministically:
```
go test ./pkg/scheduler/plugins/proportion/... \
  -run TestNoDoubleCountingForInqueueJobWithBindingTasks -v -count=1 -timeout 30s
```

Cycle 1 allocates job-a (3×1CPU) so its tasks enter Binding while the PodGroup stays Inqueue. Cycle 2 tries to enqueue job-b (1×1CPU) on the same 4-CPU queue. With the bug, attr.allocated=3 + attr.inqueue=3 = 6 > 4 and job-b gets blocked. With the fix, attr.inqueue=0 and job-b binds fine.
The fix
The PodGroupRunning branch already handles this correctly using util.GetInqueueResource which does max(0, minResources - already_allocated). The PodGroupInqueue branch just needs the same treatment:
```go
// currently
attr.inqueue.Add(job.DeductSchGatedResources(job.GetMinResources()))

// should be
inqueued := util.GetInqueueResource(job, job.Allocated)
attr.inqueue.Add(job.DeductSchGatedResources(inqueued))
```

Opening a PR for the fix! 😁
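For reference, the max(0, minResources - allocated) semantics described above can be sketched per resource dimension. This is a standalone illustration, not the real `util.GetInqueueResource` (which operates on `*api.Resource` in `pkg/scheduler/plugins/util`):

```go
package main

import "fmt"

// Resource is a toy stand-in for a multi-dimensional resource vector.
type Resource map[string]float64

// getInqueue returns max(0, minRes - allocated) for each dimension,
// so resources already held by Binding tasks are not counted twice.
func getInqueue(minRes, allocated Resource) Resource {
	out := Resource{}
	for name, want := range minRes {
		left := want - allocated[name]
		if left < 0 {
			left = 0 // clamp: never report negative inqueue demand
		}
		out[name] = left
	}
	return out
}

func main() {
	minRes := Resource{"cpu": 3, "memory": 6}
	alloc := Resource{"cpu": 3, "memory": 2}
	// job-a's 3 CPU is already allocated, so CPU contributes nothing
	// to inqueue; only the remaining memory demand is counted.
	fmt.Println(getInqueue(minRes, alloc)) // map[cpu:0 memory:4]
}
```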

