
[oep] add proposal for accelerator aware runtime selection #129


Open · slin1237 wants to merge 1 commit into main from accelerator-aware-runtime-selection/proposal

Conversation

@slin1237 slin1237 (Collaborator) commented Jul 4, 2025

What type of PR is this?

/kind feature
/kind design

What this PR does / why we need it:

This PR proposes OEP-0003: Accelerator-Aware Runtime Selection for Heterogeneous GPU Environments.

Problem

Currently, OME's runtime selection mechanism lacks awareness of underlying hardware accelerators. In clusters with mixed GPU types (H100, A100, B200, H200), operators
must create separate runtime configurations for each GPU model, leading to:

  • Runtime proliferation (e.g., sglang-h100, sglang-a100, sglang-b200, sglang-h200)
  • Operational complexity and configuration drift
  • Suboptimal GPU placement and resource utilization
  • Difficulty managing PD-disaggregated deployments with different GPU types

Solution

This PR introduces AcceleratorClass, a vendor-agnostic abstraction layer that:

  • Defines GPU capabilities and requirements declaratively
  • Enables automatic runtime selection based on model needs and GPU availability
  • Supports cost optimization and performance policies
  • Integrates seamlessly with existing Kueue ResourceFlavors
  • Works with OME's engine/decoder/router architecture

Key Components

  1. AcceleratorClass CRD: Cluster-scoped resource defining GPU types and capabilities
  2. Runtime Extensions: AcceleratorRequirements field for GPU compatibility
  3. InferenceService Extensions: AcceleratorSelector for GPU selection preferences
  4. Smart Selection Logic: Matches models to optimal GPUs based on memory, compute, and features
  5. Kueue Integration: Optional synchronization with ResourceFlavors

Example Usage

# Define GPU type
apiVersion: ome.io/v1beta1
kind: AcceleratorClass
metadata:
  name: nvidia-h100
spec:
  deviceSelector:
    deviceResourceName: "nvidia.com/gpu"
    nodeLabelSelector:
      nvidia.com/gpu.product: "NVIDIA-H100-80GB-HBM3"
  capabilities:
    memoryCapacity: 80
    computeCapability: 1979

---
# Deploy with automatic GPU selection
apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: llama-70b
spec:
  model:
    name: llama-70b
  acceleratorSelector:
    policy: BestFit  # Automatically selects appropriate GPU
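
For completeness, here is a sketch of the runtime side of the proposal: a ServingRuntime declaring acceleratorRequirements and per-class acceleratorConfigurations. Field names such as preferenceOrder, class, score, selector, and env are taken from the OEP excerpts quoted in the review below; the exact shape is an assumption about the proposed schema, not a finalized API.

# Declare which GPU classes a runtime supports (sketch of the proposed schema)
apiVersion: ome.io/v1beta1
kind: ServingRuntime
metadata:
  name: sglang-universal
spec:
  acceleratorRequirements:
    preferenceOrder:
      - class: nvidia-h100          # preferred
        score: 100
      - class: nvidia-a100-40gb     # acceptable fallback
        score: 80
  acceleratorConfigurations:
    - selector:
        acceleratorClass: nvidia-a100-40gb
      env:
        - name: TENSOR_PARALLEL_SIZE   # per-class tuning, e.g. larger TP on smaller GPUs
          value: "2"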

Special notes for your reviewer:

Architecture Decisions

  1. Cluster-scoped AcceleratorClass: GPUs are cluster resources, making cluster scope appropriate
  2. Kueue Optional: Integration is optional to avoid forcing Kueue adoption
  3. Backward Compatible: All new fields are optional, existing InferenceServices work unchanged
  4. Component-level Overrides: Engine and Decoder can have different accelerators for PD-disaggregated deployments (see the sketch after this list)
  5. Router Exclusion: Router components remain CPU-only as they don't require GPU resources
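
To illustrate decision 4, here is a hedged sketch of component-level accelerator overrides in an InferenceService for a PD-disaggregated deployment. The placement of acceleratorSelector under engine and decoder, and the preferredClasses field, follow the proposal excerpts quoted later in this PR; treat the exact field layout as an assumption.

apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: llama-70b-pd
spec:
  model:
    name: llama-70b
  engine:
    # Prefill-heavy component: prefer the larger GPU (assumed field placement)
    acceleratorSelector:
      preferredClasses: ["nvidia-h100"]
  decoder:
    # Decode component: a smaller GPU class may be sufficient (assumed field placement)
    acceleratorSelector:
      preferredClasses: ["nvidia-a100-40gb"]
  # Router remains CPU-only; no acceleratorSelector applies to it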

Does this PR introduce a user-facing change?

Added AcceleratorClass resource for GPU-aware runtime selection in heterogeneous environments.

  • New AcceleratorClass CRD to define GPU types and capabilities
  • Extended InferenceService with acceleratorSelector for GPU selection
  • Added acceleratorRequirements to ServingRuntime for GPU compatibility
  • Automatic GPU discovery and runtime matching
  • Support for constraint-based and policy-driven GPU selection
  • Optional Kueue integration for ResourceFlavor synchronization
  • Component-level accelerator overrides for PD-disaggregated deployments
  • Comprehensive examples and migration guide

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @slin1237, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request proposes a significant enhancement to OME's runtime selection capabilities by introducing a robust, accelerator-aware mechanism. The core idea is to abstract GPU hardware differences through a new AcceleratorClass resource, allowing OME to intelligently match inference workloads with the most suitable GPU types. This will drastically reduce the need for numerous, hardware-specific runtime configurations, streamline operations in heterogeneous GPU environments, and provide advanced optimization opportunities for model serving.

Highlights

  • New OEP (OEP-0003) Proposed: This pull request introduces a new OME Enhancement Proposal (OEP-0003) titled 'Accelerator-Aware Runtime Selection for Heterogeneous GPU Environments'. This OEP aims to address the current limitations in OME's runtime selection mechanism, which lacks awareness of underlying hardware accelerators, leading to operational complexities in mixed GPU clusters.
  • Introduction of AcceleratorClass CRD: The core of the proposal is the introduction of a new cluster-scoped Custom Resource Definition (CRD) called AcceleratorClass. This CRD provides a vendor-agnostic abstraction layer to define GPU capabilities, discovery patterns, resources, and cost information, enabling OME to understand and utilize diverse GPU types more effectively.
  • Extensions to InferenceService and ServingRuntime: To facilitate accelerator-aware selection, the InferenceService API is extended with an acceleratorSelector field, allowing users to specify preferred GPU classes or required capabilities. Similarly, the ServingRuntime API is extended with acceleratorRequirements to declare GPU compatibility and acceleratorConfigurations for GPU-specific optimizations (e.g., environment variables, runner arguments, resource limits).
  • Intelligent Runtime Selection Logic: The proposed system will automatically discover available accelerators, match them with runtime requirements, and select the optimal configuration based on model characteristics, user preferences, and defined policies (e.g., performance, cost, balanced). A clear override hierarchy is established for configuration precedence.
  • Kueue Integration and Component-Level Overrides: The design includes seamless, optional integration with Kueue ResourceFlavors, allowing OME to leverage existing Kubernetes resource management infrastructure. It also supports component-level accelerator overrides, enabling different GPU selections for Engine and Decoder components, while ensuring Router components remain CPU-only for cost optimization.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces OEP-0003, a design proposal for accelerator-aware runtime selection in OME. The review identifies areas for improvement in the design document to enhance clarity and correctness, including fixing an error in an API struct definition.

Comment on lines +1057 to +1079
type AcceleratorSelector struct {
// PreferredClasses in order of preference
// +optional
PreferredClasses []string `json:"preferredClasses,omitempty"`

// RequiredCapabilities that must be met
// +optional
RequiredCapabilities *AcceleratorCapabilities `json:"requiredCapabilities,omitempty"`

// Strategy for selection (performance, cost, balanced)
// +kubebuilder:default="balanced"
// +optional
Strategy AcceleratorSelectionStrategy `json:"strategy,omitempty"`

// NodeSelector for specific node targeting
// +optional
NodeSelector map[string]string `json:"nodeSelector,omitempty"`

// Strategy for selection (performance, cost, balanced)
// +kubebuilder:default="balanced"
// +optional
Strategy AcceleratorSelectionStrategy `json:"strategy,omitempty"`
}


Severity: high

The AcceleratorSelector struct definition has a duplicate Strategy field. This will cause a compilation error. Remove the duplicate field.

Suggested change (before, with the duplicate Strategy field):
type AcceleratorSelector struct {
// PreferredClasses in order of preference
// +optional
PreferredClasses []string `json:"preferredClasses,omitempty"`
// RequiredCapabilities that must be met
// +optional
RequiredCapabilities *AcceleratorCapabilities `json:"requiredCapabilities,omitempty"`
// Strategy for selection (performance, cost, balanced)
// +kubebuilder:default="balanced"
// +optional
Strategy AcceleratorSelectionStrategy `json:"strategy,omitempty"`
// NodeSelector for specific node targeting
// +optional
NodeSelector map[string]string `json:"nodeSelector,omitempty"`
// Strategy for selection (performance, cost, balanced)
// +kubebuilder:default="balanced"
// +optional
Strategy AcceleratorSelectionStrategy `json:"strategy,omitempty"`
}

Suggested change (after, duplicate removed):
type AcceleratorSelector struct {
// PreferredClasses in order of preference
// +optional
PreferredClasses []string `json:"preferredClasses,omitempty"`
// RequiredCapabilities that must be met
// +optional
RequiredCapabilities *AcceleratorCapabilities `json:"requiredCapabilities,omitempty"`
// Strategy for selection (performance, cost, balanced)
// +kubebuilder:default="balanced"
// +optional
Strategy AcceleratorSelectionStrategy `json:"strategy,omitempty"`
// NodeSelector for specific node targeting
// +optional
NodeSelector map[string]string `json:"nodeSelector,omitempty"`
}

Comment on lines +568 to +572
6. **Router Component Handling:** The router component is CPU-only and doesn't require GPUs:
- AcceleratorClass constraints are **only applied to Engine and Decoder** components
- Router pods maintain their own independent nodeSelector/affinity settings
- This prevents routers from being unnecessarily scheduled on expensive GPU nodes
- Router can be explicitly scheduled on CPU-only nodes for cost optimization


Severity: medium

There is a numbering error in the "Notes/Constraints/Caveats" section. There are two points numbered '6' (lines 553 and 568) and two points numbered '7' (lines 558 and 574). Renumber the list items sequentially to improve readability.

// +kubebuilder:printcolumn:name="Vendor",type=string,JSONPath=`.spec.vendor`
// +kubebuilder:printcolumn:name="Family",type=string,JSONPath=`.spec.family`
// +kubebuilder:printcolumn:name="Memory",type=string,JSONPath=`.spec.capabilities.memoryGB`
// +kubebuilder:printcolumn:name="Nodes",type=integer,JSONPath=`.status.availableNodes`


Severity: medium

The printcolumn for "Nodes" uses JSONPath=.status.availableNodes, but the AcceleratorClassStatus struct (lines 1017-1037) does not have an availableNodes field. Update the JSONPath to point to a valid field, such as .status.availableAccelerators.

}
}

// Rule 4: Merge resources (take maximum)


Severity: medium

The pseudo-code for mergeContainerConfig under "Rule 4: Merge resources (take maximum)" only handles merging Resources.Limits. Clarify the merging strategy for Resources.Requests as well (e.g., should it also take the maximum, or follow a different rule?).

@YouNeedCryDear YouNeedCryDear (Collaborator) left a comment

From a high-level perspective, I agree that the explosive combination of model and GPU type is not manageable. I wonder, instead of introducing the accelerator suite, which seems like a pretty big change to both cluster management and orchestration, would it be possible to have one runtime represent one type of GPU and leave the heavy lifting, such as resource management, to the inference service? Basically, just get rid of the model from the current runtime.

acceleratorClass: nvidia-a100-40gb
env:
  - name: TENSOR_PARALLEL_SIZE
    value: "2"
Collaborator

If this sglang-universal runtime is model-agnostic, how does it know the expected tensor parallel size before the model is specified?

Collaborator Author

It's still one runtime per model.

- "--enable-prefix-caching"

- selector:
    acceleratorClass: nvidia-h100-80gb
Collaborator

Is it possible to have multiple acceleratorClass entries with the same name, say nvidia-h100-80gb, but different env and runner specs?

Collaborator Author

no

acceleratorSelector:
  strategy: cost  # Prefer cost over performance
  annotations:
    ome.ai/cost-threshold: "0.5"  # $/hour threshold
Collaborator

Assuming this cost-threshold will be defined at the node level? Then it will be divided down to the per-accelerator/GPU level?

# Bob creates one universal runtime
kind: ServingRuntime
metadata:
  name: sglang-universal
Collaborator

For a single runtime like sglang-universal, those different acceleratorRequirements underneath might not necessarily produce the same level of performance for a given model. Would it be a problem if the model randomly got allocated to different accelerator settings and the performance is not consistent?

@slin1237 slin1237 force-pushed the accelerator-aware-runtime-selection/proposal branch from 0abc698 to e5fceda on July 7, 2025 05:16
@slin1237 slin1237 force-pushed the accelerator-aware-runtime-selection/proposal branch from e5fceda to b31defe on July 7, 2025 05:18
@slin1237 slin1237 (Collaborator Author) commented Jul 7, 2025

From a high-level perspective, I agree that the explosive combination of model and GPU type is not manageable. I wonder, instead of introducing the accelerator suite, which seems like a pretty big change to both cluster management and orchestration, would it be possible to have one runtime represent one type of GPU and leave the heavy lifting, such as resource management, to the inference service? Basically, just get rid of the model from the current runtime.

I'm not entirely following on getting rid of the model from the runtime. Do you mean getting rid of the supported model from the runtime?

@slin1237 slin1237 requested a review from YouNeedCryDear July 8, 2025 00:11

capabilities:
  memoryGB: "128Gi"
  levelZeroVersion: "1.3"
Collaborator

What is this? I don't see this attribute defined in the AcceleratorCapabilities struct.

DeviceIDs []string `json:"deviceIDs,omitempty"`
}

type AcceleratorCapabilities struct {
Collaborator

Do we want this to be a pre-defined struct or a free-form list?


// Performance metrics
// +optional
Performance *AcceleratorPerformance `json:"performance,omitempty"`
Collaborator

I think AcceleratorPerformance is missing a definition.


// Conditions when this preference applies
// +optional
Conditions []PreferenceCondition `json:"conditions,omitempty"`
Collaborator

Missing definition for PreferenceCondition?

acceleratorRequirements:
  preferenceOrder:
    - class: nvidia-a100-40gb
      score: 100
Collaborator

How are the score and conditions applied? Do conditions take effect first, and then the higher score is picked?

acceleratorSelector:
  strategy: cost  # Prefer cost over performance
  annotations:
    ome.io/cost-threshold: "0.5"  # $/hour threshold
Collaborator

Is this the total cost for the inference service? I still don't have a clear idea of how this is calculated or applied.


Introduce new API resources and extensions to enable accelerator-aware runtime selection:

1. **AcceleratorClass** (Cluster-scoped) - Defines accelerator capabilities and discovery patterns
Collaborator

We probably need some RBAC or a mutating webhook to ensure this AcceleratorClass doesn't get modified by unexpected parties.
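
One lightweight option (a sketch only; it assumes the CRD plural is acceleratorclasses in the ome.io API group) is to grant most users read-only access via RBAC and reserve writes for cluster administrators:

# Read-only ClusterRole for AcceleratorClass (illustrative)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: acceleratorclass-viewer
rules:
  - apiGroups: ["ome.io"]
    resources: ["acceleratorclasses"]
    verbs: ["get", "list", "watch"]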

AcceleratorRequirements *AcceleratorRequirements `json:"acceleratorRequirements,omitempty"`
}

type AcceleratorRequirements struct {
Collaborator

Normally LLM tensor parallelism depends on aggregated memory across multiple shards. Would it be a good idea to add something like minTotalMemoryGB?
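
For example, something along these lines (minTotalMemoryGB is the hypothetical field suggested here, and minMemoryGB is assumed as the per-GPU field; neither name is confirmed by the proposal):

acceleratorRequirements:
  minMemoryGB: 40          # per-GPU memory (assumed field)
  minTotalMemoryGB: 640    # aggregate memory across tensor-parallel shards (hypothetical)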
