Skip to content

Atomic Decoupling and Physical Isolation for AI Scheduling Primitives#5136

Open
wangyang0616 wants to merge 1 commit intovolcano-sh:masterfrom
wangyang0616:feature_controller_independent_proposal
Open

Atomic Decoupling and Physical Isolation for AI Scheduling Primitives#5136
wangyang0616 wants to merge 1 commit intovolcano-sh:masterfrom
wangyang0616:feature_controller_independent_proposal

Conversation

@wangyang0616
Copy link
Copy Markdown
Member

@wangyang0616 wangyang0616 commented Mar 30, 2026

What type of PR is this?

Feature proposal

What this PR does / why we need it:

This PR implements the physical decoupling of HyperNode and PodGroup controllers as proposed in #5133. It transitions these core AI scheduling primitives into a Staging Mode, enabling independent evolution and atomic delivery without fragmenting the unified API library.

Key Changes:

  • Physical Relocation: Moved HyperNode and PodGroup logic from pkg/controllers/ to staging/src/volcano.sh/.
  • Independent Modules: Introduced standalone go.mod for each primitive in staging/, supporting independent compilation and lightweight image builds.
  • Dependency Cleanup: Decoupled primitives from vc-job and other business logic to ensure a clean, hardware-vendor-friendly contribution path.
  • Dual-Delivery Support:
    • Monolithic: vc-controller-manager continues to import and manage all controllers (Backward Compatible).
    • Atomic: Enabled building standalone vc-hypernode and vc-podgroup binaries/images.

Which issue(s) this PR fixes:

Refs #5133

Special notes for your reviewer:

  • Backward Compatibility: Confirmed vc-controller-manager --controllers=* functions identically to the previous version.
  • Build Validation: Verified that both the monolithic controller-manager and independent staging modules pass compilation.
  • Unit Tests: All migrated tests for PodGroup and HyperNode passed within the new staging structure.

… AI Scheduling Primitives

Signed-off-by: wangyang0616 <wangyang8126@gmail.com>
Copilot AI review requested due to automatic review settings March 30, 2026 07:20
@volcano-sh-bot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from wangyang0616. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 30, 2026
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a design proposal for the atomic decoupling and physical isolation of HyperNode and PodGroup controllers in Volcano to support independent compilation and atomic delivery. The review feedback highlights a significant inconsistency between the design document's claims and the actual file changes, noting that the described code movements to the staging directory are missing from the PR. Additionally, it is suggested to define the acronym 'TTFT' (Time To First Token) to improve the clarity of the documentation.

Comment on lines +9 to +10
## 1. Summary
This proposal suggests a structural refactoring of the Volcano controller architecture to achieve physical decoupling of **HyperNode** and **PodGroup**. By leveraging a **Staging Mode** and independent Go modules—while maintaining a **Unified Volcano API Library**—we aim to transform core scheduling units into atomic primitives. This evolution empowers hardware ecosystem co-construction and supports complex AI scenarios such as **PD (Prefill & Decoding) separation** and **multi-dimensional Gang scheduling**.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There are inconsistencies between the PR description, this design document, and the apparent file changes.

  • The PR description and design doc state that HyperNode and PodGroup controllers are moved to the staging directory, but these changes are not present in the PR.
  • The volcano.sh/apis package seems to have been moved to staging (based on the go.mod file), but this is not mentioned in the PR description, and the design doc states the API library will not be split.

To avoid confusion, please ensure the PR's content, description, and design document are all aligned. If this PR is only for the design document, the description should be updated to reflect that.

* **Multi-dimensional Gang Scheduling**: In **PD Separation** scenarios, AI workloads consist of multiple logical groups. PodGroup must evolve to provide **multi-layered resource coordination**.
* **Primitive-level Joint Scheduling**: Enabling dual-axis concurrency of "Topology Awareness + Gang Coordination."
* **Training**: Improves resource alignment efficiency in large-scale clusters.
* **Inference**: Ensures shards are placed on optimal paths (NVLink/RDMA) to reduce **TTFT** and increase throughput.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The acronym TTFT is used here but not defined in the document. For improved clarity, please consider defining it on its first use. For example: '...to reduce Time To First Token (TTFT) and increase throughput.'

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a design proposal document describing how to physically decouple the HyperNode and PodGroup controllers into independent staging modules while preserving a unified Volcano API contract.

Changes:

  • Introduces a new design doc outlining staging-based physical isolation and “atomic delivery” goals for HyperNode/PodGroup.
  • Documents a proposed directory layout and dual-delivery (monolithic + standalone) build approach.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +44 to +58
```text
volcano/
├── apis/ # Unified API Library (Contains all CRDs)
├── pkg/
│ └── controllers/ # Monolithic business logic (Job, Queue, etc.)
├── staging/src/volcano.sh/
│ ├── hypernode/ # HyperNode Independent Module
│ │ ├── go.mod # Independent Module Definition
│ │ ├── cmd/ # Entry point for vc-hypernode
│ │ └── pkg/controller/ # Core Topology Primitive Logic
│ └── podgroup/ # PodGroup Independent Module
│ ├── go.mod # Independent Module Definition
│ ├── cmd/ # Entry point for vc-podgroup
│ └── pkg/controller/ # Core Gang Scheduling Logic
├── go.mod # Main Repo Entry (replace directives for staging)
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The repository layout shown here lists a top-level apis/ directory as the unified API library, but in this repo the canonical APIs module is maintained under staging/src/volcano.sh/apis/ (per existing docs like docs/design/adapt-k8s-todo.md). Updating this tree to reflect the current staging-based API layout would prevent readers from looking for (or creating) a non-existent/incorrect top-level apis/ directory.

Copilot uses AI. Check for mistakes.
Comment on lines +3 to +6
# Volcano AI Scheduling Primitives Atomic Decoupling and Physical Isolation

**Authors:** wangyang0616

Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The filename includes an & character (hypernode&podgroup-controller-independent.md). This commonly causes quoting/URL-encoding issues in shells, static-site generators, and links. Consider renaming the file to use a hyphen (e.g., hypernode-podgroup-controller-independent.md) and updating any references accordingly.

Copilot uses AI. Check for mistakes.
Comment on lines +9 to +18
## 1. Summary
This proposal suggests a structural refactoring of the Volcano controller architecture to achieve physical decoupling of **HyperNode** and **PodGroup**. By leveraging a **Staging Mode** and independent Go modules—while maintaining a **Unified Volcano API Library**—we aim to transform core scheduling units into atomic primitives. This evolution empowers hardware ecosystem co-construction and supports complex AI scenarios such as **PD (Prefill & Decoding) separation** and **multi-dimensional Gang scheduling**.

## 2. Goals & Non-Goals

### 2.1 Goals
* **Physical Isolation**: Move HyperNode and PodGroup logic into `staging/` with independent `go.mod` files to support standalone compilation.
* **Atomic Delivery**: Provide the capability to build and deploy lightweight, independent images (`vc-hypernode`, `vc-podgroup`) for inference and edge scenarios.
* **Ecosystem Empowerment**: Simplify the contribution path for hardware vendors to implement topology discovery within a decoupled HyperNode module.
* **Maintain Monolithic Compatibility**: Ensure the existing `vc-controller-manager` remains the primary delivery vehicle with zero changes to its default behavior.
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR description says the physical relocation/standalone modules are already implemented and validated (build + unit tests), but the changes in this PR appear to only add a design doc (no staging hypernode/podgroup modules or controller-manager import changes are present in the repo). Either adjust the PR description to reflect that this is a proposal-only change, or include the actual implementation changes in the PR so the description matches the diff.

Copilot uses AI. Check for mistakes.
Comment on lines +1 to +4


# Volcano AI Scheduling Primitives Atomic Decoupling and Physical Isolation

Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new design doc starts with leading blank lines (and has trailing whitespace after the title line). Most other files under docs/design/ start directly with the top-level # header; trimming the leading blank lines/extra whitespace will keep formatting consistent and avoid markdown-lint noise.

Suggested change
# Volcano AI Scheduling Primitives Atomic Decoupling and Physical Isolation
# Volcano AI Scheduling Primitives Atomic Decoupling and Physical Isolation

Copilot uses AI. Check for mistakes.
Comment on lines +61 to +63
### 4.3 Engineering Implementation
* **Independent Modules**: Create `go.mod` files within staging. Use `replace` directives to point to `../../../../apis` locally.
* **Dependency Cleanup**: Remove strong type references from `HyperNode/PodGroup` to business packages like `pkg/controllers/job`.
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The document’s proposed module wiring says staging modules should replace the APIs module via a relative path like ../../../../apis, but in this repo the APIs module lives at staging/src/volcano.sh/apis (and main go.mod already uses replace volcano.sh/apis => ./staging/src/volcano.sh/apis). For new staging modules under staging/src/volcano.sh/, the correct relative path to the APIs module would be adjacent (e.g., ../apis) or you should explicitly reference the existing staging location to avoid confusing/incorrect instructions.

Copilot uses AI. Check for mistakes.
@wangyang0616
Copy link
Copy Markdown
Member Author

@hzxuzhonghu @JesseStutler @hajnalmt Please help review this as well. Thanks.

@hzxuzhonghu
Copy link
Copy Markdown
Member

One scope/traceability concern: this PR currently says Fixes #5133, but the diff is proposal-only (design doc) and #5133 is functioning as the umbrella tracking issue for multiple follow-up items. If this merges as-is, it would incorrectly close the umbrella issue before the implementation work lands.

I suggest changing the linkage to something like Part of #5133 or Refs #5133 instead of Fixes #5133.

@wangyang0616
Copy link
Copy Markdown
Member Author

One scope/traceability concern: this PR currently says Fixes #5133, but the diff is proposal-only (design doc) and #5133 is functioning as the umbrella tracking issue for multiple follow-up items. If this merges as-is, it would incorrectly close the umbrella issue before the implementation work lands.

I suggest changing the linkage to something like Part of #5133 or Refs #5133 instead of Fixes #5133.

Yes, it has been updated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants