-
Notifications
You must be signed in to change notification settings - Fork 501
Support P/D Pooling autoscaling in StormService #1625
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support P/D Pooling autoscaling in StormService #1625
Conversation
Summary of ChangesHello @Jeffwan, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the autoscaling capabilities within the system by introducing support for fine-grained, role-level autoscaling for StormService resources. Previously, individual roles within a StormService could not be scaled independently using standard Kubernetes scaling mechanisms. This change addresses that limitation by extending the PodAutoscaler API with a subTargetSelector field, allowing users to define distinct autoscaling policies for specific roles like 'prefill' or 'decode'. The core autoscaler controller has been refactored to accommodate this new functionality, abstracting scaling logic into a dedicated WorkloadScale component that can interact with both standard Kubernetes /scale subresources and directly manipulate StormService objects for role-specific replica adjustments. This provides greater flexibility and efficiency for managing resource allocation in complex, multi-role services. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces support for role-level autoscaling in StormService, which is a great enhancement for fine-grained resource management. The implementation introduces a WorkloadScale interface to abstract scaling logic, which is a solid design choice that improves modularity and maintainability. The changes to the PodAutoscaler API, CRD, and controller logic are well-aligned with the goal. I've found one issue related to an inconsistent implementation in the new WorkloadScale abstraction that should be addressed.
41b71f7 to
b13b5f6
Compare
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces support for role-level autoscaling in StormService, which is a valuable feature. The implementation is well-structured, particularly with the introduction of the WorkloadScale interface that cleanly abstracts the scaling mechanism for different resource types. This makes the controller more modular and extensible. The changes to the API, CRD, and documentation are clear and consistent. I've identified a few potential issues regarding the handling of pod selectors when they are missing, which could lead to incorrect behavior. Addressing these points will enhance the robustness of the autoscaler.
b13b5f6 to
9697151
Compare
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces support for role-level autoscaling for StormService by adding a subTargetSelector to the PodAutoscaler API. The implementation involves a significant and well-designed refactoring of the scaling logic, introducing a WorkloadScale interface to abstract away the details of scaling different resource types. This is a great improvement for maintainability and extensibility. The changes also include necessary updates to CRDs, RBAC, and documentation. My review focuses on further improving the new WorkloadScale abstraction by centralizing logic and removing dead code.
9697151 to
a0f6faa
Compare
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a significant enhancement to the PodAutoscaler by adding support for role-level autoscaling within a StormService. This is achieved through a new subTargetSelector field in the API, allowing users to target specific roles like 'prefill' or 'decode' for independent scaling. The implementation correctly avoids the /scale subresource and instead patches the StormService object directly. A new WorkloadScale abstraction has been introduced to cleanly separate the scaling logic for different workload types, which is a great refactoring. The changes are well-supported with new documentation and sample YAML files.
My review focuses on a few key areas:
- A potential performance regression due to changes in autoscaling window durations.
- The verbosity of new log messages.
- A minor issue in the documentation.
- A small code improvement for logging.
Overall, this is a solid contribution that adds valuable functionality. Addressing the feedback will help ensure the changes are robust and maintainable.
| stableWindowDuration = 180 * time.Second | ||
| panicWindowDuration = 60 * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The stableWindowDuration and panicWindowDuration have been significantly increased (from 60s to 180s and 6s to 60s, respectively). This will make the autoscaler react much more slowly to metric changes. This is a major behavioral change that isn't mentioned in the pull request description. If this is intentional, please provide a justification in the PR description. If not, it should be reverted to the previous values to avoid unexpected performance degradation.
|
|
||
| **Complete example:** | ||
|
|
||
| .. literalinclude:: ../../../../samples/autoscaler/stormservice-role-autoscaler.yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The literalinclude directive points to stormservice-role-autoscaler.yaml, but this file does not appear to be part of the pull request. Based on the added sample files, it seems you intended to include stormservice-pool.yaml. Please correct the path to ensure the documentation renders correctly.
| .. literalinclude:: ../../../../samples/autoscaler/stormservice-role-autoscaler.yaml | |
| .. literalinclude:: ../../../../samples/autoscaling/stormservice-pool.yaml |
| klog.InfoS("Metrics window aggregation", "metricKey", metricKeyStr, | ||
| "stableAvg", stableValue, "panicAvg", panicValue, "stableWindowValues", stableWindow.Values(), "panicWindowValues", panicWindow.Values()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This log message has been promoted from a verbose level (V(4)) to the default info level. Additionally, it now logs the entire contents of stableWindow.Values() and panicWindow.Values(), which can be very large arrays. This will likely lead to excessive log spam in a production environment, making it difficult to monitor the system and potentially incurring high logging costs. It's recommended to move this log back to a verbose level (e.g., V(4)) to avoid this.
| klog.InfoS("Metrics window aggregation", "metricKey", metricKeyStr, | |
| "stableAvg", stableValue, "panicAvg", panicValue, "stableWindowValues", stableWindow.Values(), "panicWindowValues", panicWindow.Values()) | |
| klog.V(4).InfoS("Metrics window aggregation", "metricKey", metricKeyStr, | |
| "stableAvg", stableValue, "panicAvg", panicValue, "stableWindowValues", stableWindow.Values(), "panicWindowValues", panicWindow.Values()) |
| if err != nil { | ||
| klog.ErrorS(err, "Failed to add ray.io/node-type requirement to label selector") | ||
| return nil, fmt.Errorf("failed to add ray requirement: %w", err) | ||
| return nil, fmt.Errorf("failed to add ray.io/node-type requirement to label selector: %w", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error log klog.ErrorS was removed here. While the error is still wrapped and returned, logging the error at the point of failure provides valuable context for debugging without needing to trace the error back up the call stack. Please consider re-adding the error log before this line.
klog.ErrorS(err, "Failed to add ray.io/node-type requirement to label selector")
return nil, fmt.Errorf("failed to add ray.io/node-type requirement to label selector: %w", err)Signed-off-by: Jiaxin Shan <[email protected]>
a0f6faa to
5ed2c33
Compare
Pull Request Description
At this moment, user should be able to create two scaling rules. 1 for prefill role and 1 for decode role
The tricky part is stormservice role is not a CR level resource, which can not support /scale interface now. We can not use scale to update it. Instead, we adjust directly update the stormservice object
In future, we will support single autoscaler object to update entire stormservice. proportional scaling will be supported as well.
Related Issues
Resolves: part of #1260
Important: Before submitting, please complete the description above and review the checklist below.
Contribution Guidelines (Expand for Details)
We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:
Pull Request Title Format
Your PR title should start with one of these prefixes to indicate the nature of the change:
[Bug]: Corrections to existing functionality[CI]: Changes to build process or CI pipeline[Docs]: Updates or additions to documentation[API]: Modifications to aibrix's API or interface[CLI]: Changes or additions to the Command Line Interface[Misc]: For changes not covered above (use sparingly)Note: For changes spanning multiple categories, use multiple prefixes in order of importance.
Submission Checklist
By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.