Conversation

@googs1025
Collaborator

@googs1025 googs1025 commented Oct 13, 2025

Pull Request Description

This PR adds validation so that a scaling target is controlled by at most one PodAutoscaler. When a second PodAutoscaler is created for a workload that is already managed, it is marked with a conflict condition and does not take effect, as the output below shows:

➜  ~ kubectl get podautoscalers -A
NAMESPACE   NAME                                 MINPODS   MAXPODS   REPLICAS   STRATEGY   AGE
default     deepseek-r1-distill-llama-8b-hpa     1         10                   HPA        13s
default     deepseek-r1-distill-llama-8b-hpa-1   1         10                   HPA        3s
➜  ~ kubectl get podautoscalers -A -oyaml
apiVersion: v1
items:
- apiVersion: autoscaling.aibrix.ai/v1alpha1
  kind: PodAutoscaler
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"autoscaling.aibrix.ai/v1alpha1","kind":"PodAutoscaler","metadata":{"annotations":{},"labels":{"app.kubernetes.io/managed-by":"kustomize","app.kubernetes.io/name":"aibrix"},"name":"deepseek-r1-distill-llama-8b-hpa","namespace":"default"},"spec":{"maxReplicas":10,"metricsSources":[{"metricSourceType":"pod","path":"/metrics","port":"8000","protocolType":"http","targetMetric":"gpu_cache_usage_perc","targetValue":"50"}],"minReplicas":1,"scaleTargetRef":{"apiVersion":"apps/v1","kind":"Deployment","name":"deepseek-r1-distill-llama-8b"},"scalingStrategy":"HPA"}}
    creationTimestamp: "2025-10-13T13:14:40Z"
    generation: 1
    labels:
      app.kubernetes.io/managed-by: kustomize
      app.kubernetes.io/name: aibrix
    name: deepseek-r1-distill-llama-8b-hpa
    namespace: default
    resourceVersion: "9526115"
    uid: b6e089bc-322f-408e-b7b1-66a389ffed99
  spec:
    maxReplicas: 10
    metricsSources:
    - metricSourceType: pod
      path: /metrics
      port: "8000"
      protocolType: http
      targetMetric: gpu_cache_usage_perc
      targetValue: "50"
    minReplicas: 1
    scaleTargetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: deepseek-r1-distill-llama-8b
    scalingStrategy: HPA
  status: {}
- apiVersion: autoscaling.aibrix.ai/v1alpha1
  kind: PodAutoscaler
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"autoscaling.aibrix.ai/v1alpha1","kind":"PodAutoscaler","metadata":{"annotations":{},"labels":{"app.kubernetes.io/managed-by":"kustomize","app.kubernetes.io/name":"aibrix"},"name":"deepseek-r1-distill-llama-8b-hpa-1","namespace":"default"},"spec":{"maxReplicas":10,"metricsSources":[{"metricSourceType":"pod","path":"/metrics","port":"8000","protocolType":"http","targetMetric":"gpu_cache_usage_perc","targetValue":"50"}],"minReplicas":1,"scaleTargetRef":{"apiVersion":"apps/v1","kind":"Deployment","name":"deepseek-r1-distill-llama-8b"},"scalingStrategy":"HPA"}}
    creationTimestamp: "2025-10-13T13:14:50Z"
    generation: 1
    labels:
      app.kubernetes.io/managed-by: kustomize
      app.kubernetes.io/name: aibrix
    name: deepseek-r1-distill-llama-8b-hpa-1
    namespace: default
    resourceVersion: "9526138"
    uid: 127bd894-59b7-40ca-9dee-53a6a5311e2b
  spec:
    maxReplicas: 10
    metricsSources:
    - metricSourceType: pod
      path: /metrics
      port: "8000"
      protocolType: http
      targetMetric: gpu_cache_usage_perc
      targetValue: "50"
    minReplicas: 1
    scaleTargetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: deepseek-r1-distill-llama-8b
    scalingStrategy: HPA
  status:
    conditions:
    - lastTransitionTime: "2025-10-13T13:14:50Z"
      message: ""
      reason: AsExpected
      status: "True"
      type: ValidSpec
    - lastTransitionTime: "2025-10-13T13:14:50Z"
      message: Scaling target apps/v1.Deployment/default/deepseek-r1-distill-llama-8b
        is already controlled by PodAutoscaler default/deepseek-r1-distill-llama-8b-hpa,
        it will not take effect
      reason: MutilPodAutoscalerConflict
      status: "False"
      type: MutilPodAutoscalerConflict
    - lastTransitionTime: "2025-10-13T13:14:50Z"
      message: desired=0, actual=0
      reason: Stable
      status: "False"
      type: ScalingActive
    - lastTransitionTime: "2025-10-13T13:14:50Z"
      message: ""
      reason: InvalidSpec
      status: "False"
      type: AbleToScale
    - lastTransitionTime: "2025-10-13T13:14:50Z"
      message: ""
      reason: ReconcilingScaleDiff
      status: "False"
      type: Ready
kind: List
metadata:
  resourceVersion: ""
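
For context, here is a minimal Go sketch of how the conditions shown above could be derived from the two validation results; all function and variable names are illustrative assumptions, not the controller's actual code:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setConflictConditions sketches how the status above could be produced. The
// condition type string matches the output; everything else is illustrative.
func setConflictConditions(conds *[]metav1.Condition, specOK, noConflict bool, conflictMsg string) {
	conflict := metav1.Condition{Type: "MutilPodAutoscalerConflict", Status: metav1.ConditionTrue, Reason: "AsExpected"}
	if !noConflict {
		// Mirrors the output above: the losing PodAutoscaler reports the
		// conflict and will not take effect.
		conflict.Status = metav1.ConditionFalse
		conflict.Reason = "MutilPodAutoscalerConflict"
		conflict.Message = conflictMsg
	}
	meta.SetStatusCondition(conds, conflict)

	// AbleToScale is gated on both spec validity and the absence of a conflict.
	able := metav1.Condition{Type: "AbleToScale", Status: metav1.ConditionTrue, Reason: "AsExpected"}
	if !specOK || !noConflict {
		able.Status = metav1.ConditionFalse
		able.Reason = "InvalidSpec"
	}
	meta.SetStatusCondition(conds, able)
}

func main() {
	var conds []metav1.Condition
	setConflictConditions(&conds, true, false,
		"Scaling target is already controlled by another PodAutoscaler, it will not take effect")
	for _, c := range conds {
		fmt.Printf("%s=%s (%s)\n", c.Type, c.Status, c.Reason)
	}
}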

Related Issues

Resolves: #1661

Important: Before submitting, please complete the description above and review the checklist below.


Contribution Guidelines

We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:

Pull Request Title Format

Your PR title should start with one of these prefixes to indicate the nature of the change:

  • [Bug]: Corrections to existing functionality
  • [CI]: Changes to build process or CI pipeline
  • [Docs]: Updates or additions to documentation
  • [API]: Modifications to aibrix's API or interface
  • [CLI]: Changes or additions to the Command Line Interface
  • [Misc]: For changes not covered above (use sparingly)

Note: For changes spanning multiple categories, use multiple prefixes in order of importance.

Submission Checklist

  • PR title includes appropriate prefix(es)
  • Changes are clearly explained in the PR description
  • New and existing tests pass successfully
  • Code adheres to project style and best practices
  • Documentation updated to reflect changes (if applicable)
  • Thorough testing completed, no regressions introduced

By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.

@gemini-code-assist
Contributor

Summary of Changes

Hello @googs1025, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical bug where the system could allow multiple "PodAutoscaler" resources to attempt to control the same underlying workload, leading to unpredictable and potentially erroneous scaling actions. The changes introduce robust validation to ensure that each scaling target is managed by only one "PodAutoscaler" at a time. This significantly improves the stability and reliability of the autoscaling functionality by preventing conflicting instructions and providing clear status feedback when such conflicts arise.

Highlights

  • Conflict Prevention for PodAutoscalers: Introduced a new validation mechanism to prevent multiple "PodAutoscaler" resources from simultaneously targeting the same workload, ensuring predictable autoscaling behavior.
  • Internal State Management: Implemented internal maps ("scalingTargetToPA", "paToScalingKey") within the "PodAutoscalerReconciler" to track and manage which "PodAutoscaler" controls which scaling target, along with a mutex for thread safety (see the sketch after this list).
  • Enhanced Status Conditions: Added a new "MutilPodAutoscalerConflict" status condition to clearly indicate when a "PodAutoscaler" is in a conflicting state, and updated the "AbleToScale" and "Ready" conditions to reflect this conflict status.
  • Resource Deletion Cleanup: Added logic to automatically clean up the internal conflict tracking state when a "PodAutoscaler" resource is deleted, preventing stale entries.
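
A minimal sketch of what this internal state could look like, assuming hypothetical field and method names (the real PodAutoscalerReconciler may differ):

package main

import (
	"fmt"
	"sync"
)

// conflictTracker illustrates the internal state described above:
// two maps plus a mutex guarding them.
type conflictTracker struct {
	mu                sync.Mutex
	scalingTargetToPA map[string]string // target key -> owning PodAutoscaler key
	paToScalingKey    map[string]string // PodAutoscaler key -> target key
}

// tryClaim registers paKey as the owner of targetKey. The check and the write
// happen under one lock, so two reconciles cannot both claim the same target.
func (t *conflictTracker) tryClaim(targetKey, paKey string) (owner string, ok bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if existing, found := t.scalingTargetToPA[targetKey]; found && existing != paKey {
		return existing, false // conflict: another PodAutoscaler already controls this target
	}
	t.scalingTargetToPA[targetKey] = paKey
	t.paToScalingKey[paKey] = targetKey
	return paKey, true
}

// release cleans up both maps when a PodAutoscaler is deleted, preventing
// stale entries (the "Resource Deletion Cleanup" highlight).
func (t *conflictTracker) release(paKey string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if targetKey, found := t.paToScalingKey[paKey]; found {
		delete(t.scalingTargetToPA, targetKey)
		delete(t.paToScalingKey, paKey)
	}
}

func main() {
	t := &conflictTracker{
		scalingTargetToPA: map[string]string{},
		paToScalingKey:    map[string]string{},
	}
	t.tryClaim("apps/v1.Deployment/default/deepseek-r1-distill-llama-8b", "default/deepseek-r1-distill-llama-8b-hpa")
	owner, ok := t.tryClaim("apps/v1.Deployment/default/deepseek-r1-distill-llama-8b", "default/deepseek-r1-distill-llama-8b-hpa-1")
	fmt.Println(ok, owner) // false default/deepseek-r1-distill-llama-8b-hpa
}

Doing the lookup and the write under a single lock is what makes the claim atomic, and releasing on deletion keeps the maps from accumulating stale entries.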
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature               Command              Description
Code Review           /gemini review       Performs a code review for the current pull request in its current state.
Pull Request Summary  /gemini summary      Provides a summary of the current pull request in its current state.
Comment               @gemini-code-assist  Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                  /gemini help         Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@googs1025 googs1025 force-pushed the podautoscaler_conflict branch from 902ed28 to 1f782be on October 13, 2025 13:17
@googs1025 googs1025 requested a review from Jeffwan October 13, 2025 13:17

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces validation to prevent multiple PodAutoscalers from targeting the same workload, enhancing the stability and predictability of autoscaling behavior. The changes include adding a new condition type, implementing conflict checks, and updating the status computation to reflect potential conflicts. I have identified a critical issue regarding potential race conditions in the conflict checking logic that needs to be addressed.
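
The race in question is the classic check-then-act window: two concurrent reconciles could both observe a target as unclaimed and both register themselves. A hedged alternative sketch that closes the window with an atomic claim (illustrative only, not this PR's code):

package main

import (
	"fmt"
	"sync"
)

// scalingTargetToPA maps target key -> owning PodAutoscaler key. LoadOrStore
// performs the lookup and the insert as one atomic operation, so there is no
// gap between the conflict check and the registration.
var scalingTargetToPA sync.Map

func claim(targetKey, paKey string) (owner string, ok bool) {
	actual, loaded := scalingTargetToPA.LoadOrStore(targetKey, paKey)
	owner = actual.(string)
	return owner, !loaded || owner == paKey // ok if we claimed it, or already own it
}

func main() {
	claim("default/deepseek-r1-distill-llama-8b", "default/hpa-a")
	owner, ok := claim("default/deepseek-r1-distill-llama-8b", "default/hpa-b")
	fmt.Println(ok, owner) // false default/hpa-a
}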

Comment on lines +611 to +612
specOK := specValidationResult.Valid
noConflict := conflictValidationResult.Valid

medium

Consider combining these two lines into a single line for better readability.

Suggested change
-specOK := specValidationResult.Valid
-noConflict := conflictValidationResult.Valid
+specOK, noConflict := specValidationResult.Valid, conflictValidationResult.Valid

@googs1025 googs1025 force-pushed the podautoscaler_conflict branch from 1f782be to 9bf7258 on October 13, 2025 13:23
@Jeffwan
Copy link
Collaborator

Jeffwan commented Oct 13, 2025

For Deployment targets, this is good. For StormService, however, we expect multiple HPA rules to be created for the same workload: https://aibrix.readthedocs.io/latest/features/autoscaling/metric-based-autoscaling.html#stormservice-role-level-autoscaling

In that case, we do not want the validation. Does this PR go against the pool autoscaling pattern?
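
If the check does need to be scoped as suggested here, one illustrative option is to key the validation on the target kind; the kind strings below are assumptions, not the merged behavior:

package main

import "fmt"

// Hypothetical scoping: enforce single ownership only where multiple
// autoscalers are a misconfiguration, and exempt StormService, whose
// role-level autoscaling expects several PodAutoscalers per workload.
func conflictCheckApplies(targetKind string) bool {
	switch targetKind {
	case "StormService":
		return false
	default:
		return true
	}
}

func main() {
	fmt.Println(conflictCheckApplies("Deployment"))   // true
	fmt.Println(conflictCheckApplies("StormService")) // false
}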

@Jeffwan Jeffwan force-pushed the podautoscaler_conflict branch from 9bf7258 to 2e231bd on October 14, 2025 17:15
@Jeffwan Jeffwan merged commit f9c1a74 into vllm-project:main Oct 14, 2025
14 checks passed