[oep] add proposal for accelerator aware runtime selection #129
base: main
Conversation
Summary of Changes
Hello @slin1237, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request proposes a significant enhancement to OME's runtime selection capabilities by introducing a robust, accelerator-aware mechanism. The core idea is to abstract GPU hardware differences through a new AcceleratorClass resource, allowing OME to intelligently match inference workloads with the most suitable GPU types. This will drastically reduce the need for numerous hardware-specific runtime configurations, streamline operations in heterogeneous GPU environments, and provide advanced optimization opportunities for model serving.
Highlights
- New OEP (OEP-0003) Proposed: This pull request introduces a new OME Enhancement Proposal (OEP-0003) titled 'Accelerator-Aware Runtime Selection for Heterogeneous GPU Environments'. This OEP aims to address the current limitations in OME's runtime selection mechanism, which lacks awareness of underlying hardware accelerators, leading to operational complexities in mixed GPU clusters.
- Introduction of AcceleratorClass CRD: The core of the proposal is a new cluster-scoped Custom Resource Definition (CRD) called AcceleratorClass. This CRD provides a vendor-agnostic abstraction layer to define GPU capabilities, discovery patterns, resources, and cost information, enabling OME to understand and utilize diverse GPU types more effectively (a minimal sketch follows this list).
- Extensions to InferenceService and ServingRuntime: To facilitate accelerator-aware selection, the InferenceService API is extended with an acceleratorSelector field, allowing users to specify preferred GPU classes or required capabilities. Similarly, the ServingRuntime API is extended with acceleratorRequirements to declare GPU compatibility and acceleratorConfigurations for GPU-specific optimizations (e.g., environment variables, runner arguments, resource limits).
- Intelligent Runtime Selection Logic: The proposed system will automatically discover available accelerators, match them with runtime requirements, and select the optimal configuration based on model characteristics, user preferences, and defined policies (e.g., performance, cost, balanced). A clear override hierarchy is established for configuration precedence.
- Kueue Integration and Component-Level Overrides: The design includes seamless, optional integration with Kueue ResourceFlavors, allowing OME to leverage existing Kubernetes resource management infrastructure. It also supports component-level accelerator overrides, enabling different GPU selections for Engine and Decoder components, while ensuring Router components remain CPU-only for cost optimization.
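To ground the highlights above, here is a minimal sketch of what an AcceleratorClass object might look like. The field names follow the API excerpts quoted later in this review (vendor, family, capabilities.memoryGB); the apiVersion, the discovery block, and all values are illustrative assumptions rather than the OEP's authoritative schema.

# Illustrative sketch only; apiVersion, discovery shape, and values are assumed.
apiVersion: ome.io/v1beta1
kind: AcceleratorClass
metadata:
  name: nvidia-a100-40gb
spec:
  vendor: nvidia
  family: ampere
  capabilities:
    memoryGB: "40"
  discovery:                    # hypothetical node-matching pattern
    nodeSelector:
      nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB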
Code Review
This pull request introduces OEP-0003, a design proposal for accelerator-aware runtime selection in OME. The review identifies areas for improvement in the design document to enhance clarity and correctness, including fixing an error in an API struct definition.
type AcceleratorSelector struct {
	// PreferredClasses in order of preference
	// +optional
	PreferredClasses []string `json:"preferredClasses,omitempty"`

	// RequiredCapabilities that must be met
	// +optional
	RequiredCapabilities *AcceleratorCapabilities `json:"requiredCapabilities,omitempty"`

	// Strategy for selection (performance, cost, balanced)
	// +kubebuilder:default="balanced"
	// +optional
	Strategy AcceleratorSelectionStrategy `json:"strategy,omitempty"`

	// NodeSelector for specific node targeting
	// +optional
	NodeSelector map[string]string `json:"nodeSelector,omitempty"`

	// Strategy for selection (performance, cost, balanced)
	// +kubebuilder:default="balanced"
	// +optional
	Strategy AcceleratorSelectionStrategy `json:"strategy,omitempty"`
}
The AcceleratorSelector struct definition has a duplicate Strategy field. This will cause a compilation error. Remove the duplicate field.
Suggested change:

type AcceleratorSelector struct {
	// PreferredClasses in order of preference
	// +optional
	PreferredClasses []string `json:"preferredClasses,omitempty"`

	// RequiredCapabilities that must be met
	// +optional
	RequiredCapabilities *AcceleratorCapabilities `json:"requiredCapabilities,omitempty"`

	// Strategy for selection (performance, cost, balanced)
	// +kubebuilder:default="balanced"
	// +optional
	Strategy AcceleratorSelectionStrategy `json:"strategy,omitempty"`

	// NodeSelector for specific node targeting
	// +optional
	NodeSelector map[string]string `json:"nodeSelector,omitempty"`
}
6. **Router Component Handling:** The router component is CPU-only and doesn't require GPUs:
   - AcceleratorClass constraints are **only applied to Engine and Decoder** components
   - Router pods maintain their own independent nodeSelector/affinity settings
   - This prevents routers from being unnecessarily scheduled on expensive GPU nodes
   - Router can be explicitly scheduled on CPU-only nodes for cost optimization
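As an illustration of the last point above, the router could keep its own CPU-only scheduling while the accelerator selection applies only to engine and decoder. The component layout and node label below are assumptions for illustration, not the OEP's schema.

# Hypothetical sketch: router pinned to CPU nodes, independent of the
# AcceleratorClass resolved for the engine component.
spec:
  engine:
    acceleratorSelector:
      preferredClasses:
        - nvidia-h100-80gb
  router:
    nodeSelector:
      node-role.example.com/cpu-only: "true"   # assumed label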
// +kubebuilder:printcolumn:name="Vendor",type=string,JSONPath=`.spec.vendor`
// +kubebuilder:printcolumn:name="Family",type=string,JSONPath=`.spec.family`
// +kubebuilder:printcolumn:name="Memory",type=string,JSONPath=`.spec.capabilities.memoryGB`
// +kubebuilder:printcolumn:name="Nodes",type=integer,JSONPath=`.status.availableNodes`
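For context, these printcolumn markers would make kubectl get acceleratorclass render roughly the table below; the rows are invented for illustration.

NAME               VENDOR   FAMILY   MEMORY   NODES
nvidia-a100-40gb   nvidia   ampere   40       4
nvidia-h100-80gb   nvidia   hopper   80       2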
From a high-level perspective, I agree that the explosive combination of model and GPU type is not manageable. Instead of introducing the accelerator suite, which seems like a pretty big change to both cluster management and orchestration, I wonder whether it would be possible to have one runtime represent one type of GPU, and leave the heavy lifting, such as resource management, to the inference service? Basically, just get rid of the model from the current runtime.
acceleratorClass: nvidia-a100-40gb
env:
  - name: TENSOR_PARALLEL_SIZE
    value: "2"
If this sglang-universal runtime is model agnostic, how does it know the expected tensor parallel size before the model is specified?
It's still one runtime per model.
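A sketch of what that implies, using the acceleratorConfigurations shape quoted above: because the runtime is still scoped to a single model, the tensor parallel size for each GPU class can be precomputed for that model. Names and values below are illustrative.

# Hypothetical per-model runtime; TP sizes chosen for this one model.
kind: ServingRuntime
metadata:
  name: sglang-llama-3-70b
spec:
  acceleratorConfigurations:
    - selector:
        acceleratorClass: nvidia-h100-80gb
      env:
        - name: TENSOR_PARALLEL_SIZE
          value: "2"   # assumed: this model fits on 2x80GB
    - selector:
        acceleratorClass: nvidia-a100-40gb
      env:
        - name: TENSOR_PARALLEL_SIZE
          value: "4"   # assumed: needs 4x40GB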
- "--enable-prefix-caching" | ||
|
||
- selector: | ||
acceleratorClass: nvidia-h100-80gb |
Is it possible to have multiple acceleratorClass with the same name, say nvidia-h100-80gb, but different env and runner specs?
no
acceleratorSelector:
  strategy: cost  # Prefer cost over performance
annotations:
  ome.ai/cost-threshold: "0.5"  # $/hour threshold
Assuming this cost-threshold is defined at the node level? Then would it be divided down to a per-accelerator/GPU level?
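One possible reading, purely for illustration and not confirmed by the OEP: if a node is billed at $4.00/hour and exposes 8 GPUs, the per-GPU cost would be $4.00 / 8 = $0.50/hour, which would just meet a "0.5" threshold.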
# Bob creates one universal runtime
kind: ServingRuntime
metadata:
  name: sglang-universal
For a single runtime like sglang-universal, the different acceleratorRequirements underneath might not produce the same level of performance for a given model. Would it be a problem if a model randomly got allocated to different accelerator settings and the performance were inconsistent?
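For reference, the situation described looks roughly like the sketch below, using the preferenceOrder shape quoted later in this review; classes and scores are illustrative.

# Illustrative only: one runtime compatible with several GPU classes
# at different preference scores.
kind: ServingRuntime
metadata:
  name: sglang-universal
spec:
  acceleratorRequirements:
    preferenceOrder:
      - class: nvidia-h100-80gb
        score: 100
      - class: nvidia-a100-40gb
        score: 80   # compatible, but likely slower for the same model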
Branch force-pushed from 0abc698 to e5fceda, then from e5fceda to b31defe.
I'm not entirely following on getting rid of the model from the runtime. Do you mean getting rid of the supported model from the runtime?
capabilities:
  memoryGB: "128Gi"
  levelZeroVersion: "1.3"
What is this? I don't see this attribute defined in the AcceleratorCapabilities struct.
	DeviceIDs []string `json:"deviceIDs,omitempty"`
}

type AcceleratorCapabilities struct {
Do we want this to be a pre-defined struct or a free form list?
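To make the trade-off concrete, here is a sketch of the two shapes being weighed; all field names are invented for illustration.

# Option A: pre-defined struct with typed fields
capabilities:
  memoryGB: "80"
  computeCapability: "9.0"

# Option B: free-form key/value list
capabilities:
  memory-gb: "80"
  vendor.example.com/nvlink: "true"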
	// Performance metrics
	// +optional
	Performance *AcceleratorPerformance `json:"performance,omitempty"`
I think AcceleratorPerformance is missing a definition.
	// Conditions when this preference applies
	// +optional
	Conditions []PreferenceCondition `json:"conditions,omitempty"`
Missing definition for PreferenceCondition?
acceleratorRequirements:
  preferenceOrder:
    - class: nvidia-a100-40gb
      score: 100
How are the score and conditions applied? Do conditions take effect first, and then the higher score is picked?
acceleratorSelector:
  strategy: cost  # Prefer cost over performance
annotations:
  ome.io/cost-threshold: "0.5"  # $/hour threshold
Is this the total cost for the inference service? I still don't have a clear idea how this is calculated or applied.
Introduce new API resources and extensions to enable accelerator-aware runtime selection:

1. **AcceleratorClass** (Cluster-scoped) - Defines accelerator capabilities and discovery patterns
We probably need some RBAC or a mutating webhook to ensure this AcceleratorClass doesn't get modified by unexpected parties.
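A minimal sketch of that suggestion using plain Kubernetes RBAC, assuming the resource lives in the ome.io API group; a mutating or validating webhook would be the stricter alternative.

# Restrict AcceleratorClass writes to a dedicated cluster role (sketch).
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: acceleratorclass-admin
rules:
  - apiGroups: ["ome.io"]          # assumed API group
    resources: ["acceleratorclasses"]
    verbs: ["create", "update", "patch", "delete"]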
	AcceleratorRequirements *AcceleratorRequirements `json:"acceleratorRequirements,omitempty"`
}

type AcceleratorRequirements struct {
Normally LLM tensor parallelism depends on aggregated memory across multiple shards. Would it be a good idea to add something like minTotalMemoryGB?
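A sketch of how the suggested field might sit next to a per-device requirement; both field names are hypothetical, following the reviewer's suggestion.

acceleratorRequirements:
  minMemoryGB: 40        # assumed existing per-device requirement
  minTotalMemoryGB: 160  # suggested: aggregate memory across all TP shards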
What type of PR is this?
/kind feature
/kind design
What this PR does / why we need it:
This PR proposes OEP-0003: Accelerator-Aware Runtime Selection for Heterogeneous GPU Environments.
Problem
Currently, OME's runtime selection mechanism lacks awareness of underlying hardware accelerators. In clusters with mixed GPU types (H100, A100, B200, H200), operators must create separate runtime configurations for each GPU model, leading to configuration sprawl and operational complexity in heterogeneous clusters.
Solution
This PR introduces AcceleratorClass, a vendor-agnostic abstraction layer that defines GPU capabilities, discovery patterns, resources, and cost information, letting OME match workloads to suitable GPU types.
Key Components
- AcceleratorClass CRD (cluster-scoped) for vendor-agnostic GPU abstraction
- acceleratorSelector extension on InferenceService
- acceleratorRequirements and acceleratorConfigurations extensions on ServingRuntime
- Accelerator-aware selection logic with performance/cost/balanced strategies
- Optional Kueue ResourceFlavor integration
Example Usage
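Below is one possible usage sketch combining the shapes quoted throughout this review; the apiVersion and exact field layout are assumptions, not the OEP's authoritative example.

apiVersion: ome.io/v1beta1   # assumed group/version
kind: InferenceService
metadata:
  name: llama-3-70b
spec:
  acceleratorSelector:
    preferredClasses:
      - nvidia-h100-80gb
      - nvidia-a100-40gb
    strategy: balanced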
Special notes for your reviewer:
Architecture Decisions
Does this PR introduce a user-facing change?
Added AcceleratorClass resource for GPU-aware runtime selection in heterogeneous environments.