Skip to content

enhancement(scheduler): honor QueueOrderFn in preempt action#5142

Open
hajnalmt wants to merge 4 commits intovolcano-sh:masterfrom
hajnalmt:fix/preempt-queue-order
Open

enhancement(scheduler): honor QueueOrderFn in preempt action#5142
hajnalmt wants to merge 4 commits intovolcano-sh:masterfrom
hajnalmt:fix/preempt-queue-order

Conversation

@hajnalmt
Copy link
Copy Markdown
Member

@hajnalmt hajnalmt commented Mar 30, 2026

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR makes the preempt action honor QueueOrderFn via util.NewPriorityQueue, aligning preempt queue processing with allocate/reclaim and removing non-deterministic map-order queue traversal.

The PR fixes the underRequest array traversal too to honor queue ordering by switching to a mapped structure with underRequestByQueue. (and not looping trough it at every processed queue, which was a bug I think)

Which issue(s) this PR fixes:

Fixes #5139

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Improve scheduler preemption fairness and determinism by honoring queue order functions when selecting queues for preemption. @Osykov and @hajnalmt 

hajnalmt and others added 2 commits March 30, 2026 12:23
…ption

The "Preemption between Task within Job" loop overwrites
preemptorTasks[job.UID] for every starving job in underRequest, which
is shared across all queues. In multi-queue scenarios this mutates the
same preemptor state used later by the between-jobs preemption phase.

Because queue traversal previously depended on Go map iteration order,
the behavior becomes non-deterministic. If queue Q1 (with no relevant
preemptors) is visited first, its intra-job pass still iterates shared
underRequest entries and can drain/replace Q2's preemptorTasks entry.
When Q2 is visited later, the between-jobs loop sees empty preemptor
state and skips valid preemption, so starvation can persist.

Use a scoped local queue (intraJobPreemptors) for the intra-job pass
instead of reusing preemptorTasks[job.UID]. This preserves the original
preemptorTasks map populated during job discovery for between-jobs
preemption while keeping intra-job behavior isolated.

Add and document a multi-queue regression test that models this cross-
queue interference path and verifies that intra-job processing in one
queue does not invalidate between-jobs preemption in another queue.
The test also captures the prior flaky characteristic (majority pass,
intermittent fail) caused by non-deterministic queue iteration.

Signed-off-by: Hajnal Máté <hajnalmt@gmail.com>
Use util.NewPriorityQueue(ssn.QueueOrderFn) in preempt so queue processing
order is deterministic and aligned with allocate/reclaim behavior.

While preserving Osykov's original queue-order implementation intent,
keep the preemptorTasks overwrite fix behavior from the stacked base commit
by using scoped intra-job preemptor queues during intra-job preemption.

Also include the topology-aware multi-queue test coverage from the original
change to validate queue-ordered preemption behavior.

Signed-off-by: Vitalii Osykov <vitaliyosykov@gmail.com>
Signed-off-by: Hajnal Máté <hajnalmt@gmail.com>
@volcano-sh-bot volcano-sh-bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. labels Mar 30, 2026
@volcano-sh-bot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign wangyang0616 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 30, 2026
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request fixes a bug in the preemption action where preemptorTasks were being overwritten during intra-job preemption, which led to lost preemptors in multi-queue scenarios. It also introduces ordered queue processing using a priority queue and adds comprehensive regression tests. A review comment highlights that the intra-job preemption logic is currently redundant because it resides within the queue iteration loop, suggesting it be moved outside to improve performance.

Build underRequest entries per queue and process only the current queue's
starving jobs in the intra-job preemption loop. This avoids cross-queue
iteration in each queue pass and keeps intra-job processing aligned with
the active queue context.

Signed-off-by: Hajnal Máté <hajnalmt@gmail.com>
@hajnalmt hajnalmt marked this pull request as ready for review March 30, 2026 12:12
Copilot AI review requested due to automatic review settings March 30, 2026 12:12
@volcano-sh-bot volcano-sh-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 30, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the scheduler’s preempt action to process queues deterministically using QueueOrderFn (via util.NewPriorityQueue), aligning behavior with other actions and addressing multi-queue preemption correctness.

Changes:

  • Replace Go map iteration over queues in preempt with a PriorityQueue ordered by ssn.QueueOrderFn.
  • Fix multi-queue intra-job preemption so per-queue processing doesn’t overwrite/drain preemptorTasks for jobs in other queues.
  • Extend preempt action tests with multi-queue regression coverage and add a queue-order-focused topology-aware scenario.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
pkg/scheduler/actions/preempt/preempt.go Switch queue traversal to a PriorityQueue (honoring QueueOrderFn) and scope intra-job preemptor queues to avoid cross-queue state corruption.
pkg/scheduler/actions/preempt/preempt_test.go Add regression test for multi-queue preemptor task overwrite and add a topology-aware scenario with queue order enabled.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Iterate only relevant starving-job queues when building the preempt queue
priority structure to avoid unnecessary queue scans and extra no-preemptor
iterations. Also fix the topology-aware test case name typo.

Signed-off-by: Hajnal Máté <hajnalmt@gmail.com>
@hajnalmt
Copy link
Copy Markdown
Member Author

/area scheduling

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/scheduling kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

enhancement(scheduler): Honor QueueOrderFn in preempt action for deterministic queue processing

4 participants