
[oep] add proposal for accelerator aware runtime selection #129


Open · slin1237 wants to merge 1 commit into main from accelerator-aware-runtime-selection/proposal

Conversation

@slin1237 slin1237 (Collaborator) commented Jul 4, 2025

What type of PR is this?

/kind feature
/kind design

What this PR does / why we need it:

This PR proposes OEP-0003: Accelerator-Aware Runtime Selection for Heterogeneous GPU Environments.

Problem

Currently, OME's runtime selection mechanism lacks awareness of underlying hardware accelerators. In clusters with mixed GPU types (H100, A100, B200, H200), operators
must create separate runtime configurations for each GPU model, leading to:

  • Runtime proliferation (e.g., sglang-h100, sglang-a100, sglang-b200, sglang-h200)
  • Operational complexity and configuration drift
  • Suboptimal GPU placement and resource utilization
  • Difficulty managing PD-disaggregated deployments with different GPU types

Solution

This PR introduces AcceleratorClass, a vendor-agnostic abstraction layer that:

  • Defines GPU capabilities and requirements declaratively
  • Enables automatic runtime selection based on model needs and GPU availability
  • Supports cost optimization and performance policies
  • Integrates seamlessly with existing Kueue ResourceFlavors
  • Works with OME's engine/decoder/router architecture

Key Components

  1. AcceleratorClass CRD: Cluster-scoped resource defining GPU types and capabilities
  2. Runtime Extensions: AcceleratorRequirements field for GPU compatibility
  3. InferenceService Extensions: AcceleratorSelector for GPU selection preferences
  4. Smart Selection Logic: Matches models to optimal GPUs based on memory, compute, and features
  5. Kueue Integration: Optional synchronization with ResourceFlavors

Example Usage

# Define GPU type
apiVersion: ome.io/v1beta1
kind: AcceleratorClass
metadata:
  name: nvidia-h100
spec:
  deviceSelector:
    deviceResourceName: "nvidia.com/gpu"
    nodeLabelSelector:
      nvidia.com/gpu.product: "NVIDIA-H100-80GB-HBM3"
  capabilities:
    memoryCapacity: 80
    computeCapability: 1979

---
# Deploy with automatic GPU selection
apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: llama-70b
spec:
  model:
    name: llama-70b
  acceleratorSelector:
    policy: BestFit  # Automatically selects appropriate GPU
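
For completeness, here is a sketch of the runtime side of the proposal: a ServingRuntime declaring acceleratorRequirements and per-class acceleratorConfigurations. Field names such as preferenceOrder, class, score, selector, and env are taken from the OEP excerpts quoted in the review below; the exact shape is an assumption about the proposed schema, not a finalized API.

# Declare which GPU classes a runtime supports (sketch of the proposed schema)
apiVersion: ome.io/v1beta1
kind: ServingRuntime
metadata:
  name: sglang-universal
spec:
  acceleratorRequirements:
    preferenceOrder:
      - class: nvidia-h100          # preferred
        score: 100
      - class: nvidia-a100-40gb     # acceptable fallback
        score: 80
  acceleratorConfigurations:
    - selector:
        acceleratorClass: nvidia-a100-40gb
      env:
        - name: TENSOR_PARALLEL_SIZE   # per-class tuning, e.g. larger TP on smaller GPUs
          value: "2"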

Special notes for your reviewer:

Architecture Decisions

  1. Cluster-scoped AcceleratorClass: GPUs are cluster resources, making cluster scope appropriate
  2. Kueue Optional: Integration is optional to avoid forcing Kueue adoption
  3. Backward Compatible: All new fields are optional, existing InferenceServices work unchanged
  4. Component-level Overrides: Engine and Decoder can have different accelerators for PD-disaggregated deployments (see the sketch after this list)
  5. Router Exclusion: Router components remain CPU-only as they don't require GPU resources
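
To illustrate decision 4, here is a hedged sketch of component-level accelerator overrides in an InferenceService for a PD-disaggregated deployment. The placement of acceleratorSelector under engine and decoder, and the preferredClasses field, follow the proposal excerpts quoted later in this PR; treat the exact field layout as an assumption.

apiVersion: ome.io/v1beta1
kind: InferenceService
metadata:
  name: llama-70b-pd
spec:
  model:
    name: llama-70b
  engine:
    # Prefill-heavy component: prefer the larger GPU (assumed field placement)
    acceleratorSelector:
      preferredClasses: ["nvidia-h100"]
  decoder:
    # Decode component: a smaller GPU class may be sufficient (assumed field placement)
    acceleratorSelector:
      preferredClasses: ["nvidia-a100-40gb"]
  # Router remains CPU-only; no acceleratorSelector applies to it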

Does this PR introduce a user-facing change?

Added AcceleratorClass resource for GPU-aware runtime selection in heterogeneous environments.

  • New AcceleratorClass CRD to define GPU types and capabilities
  • Extended InferenceService with acceleratorSelector for GPU selection
  • Added acceleratorRequirements to ServingRuntime for GPU compatibility
  • Automatic GPU discovery and runtime matching
  • Support for constraint-based and policy-driven GPU selection
  • Optional Kueue integration for ResourceFlavor synchronization
  • Component-level accelerator overrides for PD-disaggregated deployments
  • Comprehensive examples and migration guide

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @slin1237, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request proposes a significant enhancement to OME's runtime selection capabilities by introducing a robust, accelerator-aware mechanism. The core idea is to abstract GPU hardware differences through a new AcceleratorClass resource, allowing OME to intelligently match inference workloads with the most suitable GPU types. This will drastically reduce the need for numerous, hardware-specific runtime configurations, streamline operations in heterogeneous GPU environments, and provide advanced optimization opportunities for model serving.

Highlights

  • New OEP (OEP-0003) Proposed: This pull request introduces a new OME Enhancement Proposal (OEP-0003) titled 'Accelerator-Aware Runtime Selection for Heterogeneous GPU Environments'. This OEP aims to address the current limitations in OME's runtime selection mechanism, which lacks awareness of underlying hardware accelerators, leading to operational complexities in mixed GPU clusters.
  • Introduction of AcceleratorClass CRD: The core of the proposal is the introduction of a new cluster-scoped Custom Resource Definition (CRD) called AcceleratorClass. This CRD provides a vendor-agnostic abstraction layer to define GPU capabilities, discovery patterns, resources, and cost information, enabling OME to understand and utilize diverse GPU types more effectively.
  • Extensions to InferenceService and ServingRuntime: To facilitate accelerator-aware selection, the InferenceService API is extended with an acceleratorSelector field, allowing users to specify preferred GPU classes or required capabilities. Similarly, the ServingRuntime API is extended with acceleratorRequirements to declare GPU compatibility and acceleratorConfigurations for GPU-specific optimizations (e.g., environment variables, runner arguments, resource limits).
  • Intelligent Runtime Selection Logic: The proposed system will automatically discover available accelerators, match them with runtime requirements, and select the optimal configuration based on model characteristics, user preferences, and defined policies (e.g., performance, cost, balanced). A clear override hierarchy is established for configuration precedence.
  • Kueue Integration and Component-Level Overrides: The design includes seamless, optional integration with Kueue ResourceFlavors, allowing OME to leverage existing Kubernetes resource management infrastructure. It also supports component-level accelerator overrides, enabling different GPU selections for Engine and Decoder components, while ensuring Router components remain CPU-only for cost optimization.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces OEP-0003, a design proposal for accelerator-aware runtime selection in OME. The review identifies areas for improvement in the design document to enhance clarity and correctness, including fixing an error in an API struct definition.

Comment on lines +1057 to +1079
type AcceleratorSelector struct {
// PreferredClasses in order of preference
// +optional
PreferredClasses []string `json:"preferredClasses,omitempty"`

// RequiredCapabilities that must be met
// +optional
RequiredCapabilities *AcceleratorCapabilities `json:"requiredCapabilities,omitempty"`

// Strategy for selection (performance, cost, balanced)
// +kubebuilder:default="balanced"
// +optional
Strategy AcceleratorSelectionStrategy `json:"strategy,omitempty"`

// NodeSelector for specific node targeting
// +optional
NodeSelector map[string]string `json:"nodeSelector,omitempty"`

// Strategy for selection (performance, cost, balanced)
// +kubebuilder:default="balanced"
// +optional
Strategy AcceleratorSelectionStrategy `json:"strategy,omitempty"`
}


Severity: high

The AcceleratorSelector struct definition has a duplicate Strategy field. This will cause a compilation error. Remove the duplicate field.

Suggested change (before, with the duplicate Strategy field):
type AcceleratorSelector struct {
// PreferredClasses in order of preference
// +optional
PreferredClasses []string `json:"preferredClasses,omitempty"`
// RequiredCapabilities that must be met
// +optional
RequiredCapabilities *AcceleratorCapabilities `json:"requiredCapabilities,omitempty"`
// Strategy for selection (performance, cost, balanced)
// +kubebuilder:default="balanced"
// +optional
Strategy AcceleratorSelectionStrategy `json:"strategy,omitempty"`
// NodeSelector for specific node targeting
// +optional
NodeSelector map[string]string `json:"nodeSelector,omitempty"`
// Strategy for selection (performance, cost, balanced)
// +kubebuilder:default="balanced"
// +optional
Strategy AcceleratorSelectionStrategy `json:"strategy,omitempty"`
}

Suggested change (after, duplicate removed):
type AcceleratorSelector struct {
// PreferredClasses in order of preference
// +optional
PreferredClasses []string `json:"preferredClasses,omitempty"`
// RequiredCapabilities that must be met
// +optional
RequiredCapabilities *AcceleratorCapabilities `json:"requiredCapabilities,omitempty"`
// Strategy for selection (performance, cost, balanced)
// +kubebuilder:default="balanced"
// +optional
Strategy AcceleratorSelectionStrategy `json:"strategy,omitempty"`
// NodeSelector for specific node targeting
// +optional
NodeSelector map[string]string `json:"nodeSelector,omitempty"`
}

Comment on lines +568 to +572
6. **Router Component Handling:** The router component is CPU-only and doesn't require GPUs:
- AcceleratorClass constraints are **only applied to Engine and Decoder** components
- Router pods maintain their own independent nodeSelector/affinity settings
- This prevents routers from being unnecessarily scheduled on expensive GPU nodes
- Router can be explicitly scheduled on CPU-only nodes for cost optimization


Severity: medium

There is a numbering error in the "Notes/Constraints/Caveats" section. There are two points numbered '6' (lines 553 and 568) and two points numbered '7' (lines 558 and 574). Renumber the list items sequentially to improve readability.

// +kubebuilder:printcolumn:name="Vendor",type=string,JSONPath=`.spec.vendor`
// +kubebuilder:printcolumn:name="Family",type=string,JSONPath=`.spec.family`
// +kubebuilder:printcolumn:name="Memory",type=string,JSONPath=`.spec.capabilities.memoryGB`
// +kubebuilder:printcolumn:name="Nodes",type=integer,JSONPath=`.status.availableNodes`


Severity: medium

The printcolumn for "Nodes" uses JSONPath=.status.availableNodes, but the AcceleratorClassStatus struct (lines 1017-1037) does not have an availableNodes field. Update the JSONPath to point to a valid field, such as .status.availableAccelerators.

}
}

// Rule 4: Merge resources (take maximum)


Severity: medium

The pseudo-code for mergeContainerConfig under "Rule 4: Merge resources (take maximum)" only handles merging Resources.Limits. Clarify the merging strategy for Resources.Requests as well (e.g., should it also take the maximum, or follow a different rule?).

@YouNeedCryDear YouNeedCryDear (Collaborator) left a comment

From a high-level perspective, I agree that the explosive combination of model and GPU type is not manageable. I wonder, instead of introducing the accelerator suite, which seems like a pretty big change to both cluster management and orchestration, would it be possible to have one runtime represent one type of GPU and leave the heavy lifting, such as resource management, to the inference service? Basically, just get rid of the model from the current runtime.

acceleratorClass: nvidia-a100-40gb
env:
  - name: TENSOR_PARALLEL_SIZE
    value: "2"
Collaborator

If this sglang-universal runtime is model-agnostic, how does it know the expected tensor parallel size before the model is specified?

Collaborator Author

It's still one runtime per model.

- "--enable-prefix-caching"

- selector:
    acceleratorClass: nvidia-h100-80gb
Collaborator

Is it possible to have multiple acceleratorClass entries with the same name, say nvidia-h100-80gb, but different env and runner specs?

Collaborator Author

no

acceleratorSelector:
  strategy: cost  # Prefer cost over performance
  annotations:
    ome.ai/cost-threshold: "0.5"  # $/hour threshold
Collaborator

Assuming this cost-threshold will be defined at the node level? Then it will be divided down to the per-accelerator/GPU level?

# Bob creates one universal runtime
kind: ServingRuntime
metadata:
  name: sglang-universal
Collaborator

For a single runtime like sglang-universal, those different acceleratorRequirements underneath might not necessarily produce the same level of performance for a given model. Would it be a problem if the model randomly got allocated to different accelerator settings and the performance is not consistent?

@slin1237 slin1237 force-pushed the accelerator-aware-runtime-selection/proposal branch from 0abc698 to e5fceda on July 7, 2025 05:16
@slin1237 slin1237 force-pushed the accelerator-aware-runtime-selection/proposal branch from e5fceda to b31defe on July 7, 2025 05:18
@slin1237 slin1237 (Collaborator Author) commented Jul 7, 2025

From a high-level perspective, I agree that the explosive combination of model and GPU type is not manageable. I wonder, instead of introducing the accelerator suite, which seems like a pretty big change to both cluster management and orchestration, would it be possible to have one runtime represent one type of GPU and leave the heavy lifting, such as resource management, to the inference service? Basically, just get rid of the model from the current runtime.

I'm not entirely following on getting rid of the model from the runtime. Do you mean getting rid of the supported model from the runtime?

@slin1237 slin1237 requested a review from YouNeedCryDear July 8, 2025 00:11

capabilities:
  memoryGB: "128Gi"
  levelZeroVersion: "1.3"
Collaborator

What is this? I don't see this attribute defined in the AcceleratorCapabilities struct.

DeviceIDs []string `json:"deviceIDs,omitempty"`
}

type AcceleratorCapabilities struct {
Collaborator

Do we want this to be a pre-defined struct or a free-form list?


// Performance metrics
// +optional
Performance *AcceleratorPerformance `json:"performance,omitempty"`
Collaborator

I think AcceleratorPerformance is missing a definition.


// Conditions when this preference applies
// +optional
Conditions []PreferenceCondition `json:"conditions,omitempty"`
Collaborator

Missing definition for PreferenceCondition?

acceleratorRequirements:
  preferenceOrder:
    - class: nvidia-a100-40gb
      score: 100
Collaborator

How are the score and conditions applied? Do conditions take effect first, and then the higher score is picked?

acceleratorSelector:
  strategy: cost  # Prefer cost over performance
  annotations:
    ome.io/cost-threshold: "0.5"  # $/hour threshold
Collaborator

Is this the total cost for the inference service? I still don't have a clear idea of how this is calculated or applied.


Introduce new API resources and extensions to enable accelerator-aware runtime selection:

1. **AcceleratorClass** (Cluster-scoped) - Defines accelerator capabilities and discovery patterns
Collaborator

We probably need some RBAC or a mutating webhook to ensure this AcceleratorClass doesn't get modified by unexpected parties.
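
One lightweight option (a sketch only; it assumes the CRD plural is acceleratorclasses in the ome.io API group) is to grant most users read-only access via RBAC and reserve writes for cluster administrators:

# Read-only ClusterRole for AcceleratorClass (illustrative)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: acceleratorclass-viewer
rules:
  - apiGroups: ["ome.io"]
    resources: ["acceleratorclasses"]
    verbs: ["get", "list", "watch"]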

AcceleratorRequirements *AcceleratorRequirements `json:"acceleratorRequirements,omitempty"`
}

type AcceleratorRequirements struct {
Collaborator

Normally LLM tensor parallelism depends on aggregated memory across multiple shards. Would it be a good idea to add something like minTotalMemoryGB?
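
For example, something along these lines (minTotalMemoryGB is the hypothetical field suggested here, and minMemoryGB is assumed as the per-GPU field; neither name is confirmed by the proposal):

acceleratorRequirements:
  minMemoryGB: 40          # per-GPU memory (assumed field)
  minTotalMemoryGB: 640    # aggregate memory across tensor-parallel shards (hypothetical)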
