Skip to content

Conversation

@googs1025
Copy link
Collaborator

@googs1025 googs1025 commented Dec 11, 2025

Pull Request Description

[Please provide a clear and concise description of your changes here]

Related Issues

Resolves: #1842

Important: Before submitting, please complete the description above and review the checklist below.


Contribution Guidelines (Expand for Details)

We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:

Pull Request Title Format

Your PR title should start with one of these prefixes to indicate the nature of the change:

  • [Bug]: Corrections to existing functionality
  • [CI]: Changes to build process or CI pipeline
  • [Docs]: Updates or additions to documentation
  • [API]: Modifications to aibrix's API or interface
  • [CLI]: Changes or additions to the Command Line Interface
  • [Misc]: For changes not covered above (use sparingly)

Note: For changes spanning multiple categories, use multiple prefixes in order of importance.

Submission Checklist

  • PR title includes appropriate prefix(es)
  • Changes are clearly explained in the PR description
  • New and existing tests pass successfully
  • Code adheres to project style and best practices
  • Documentation updated to reflect changes (if applicable)
  • Thorough testing completed, no regressions introduced

By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @googs1025, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the aibrix orchestration system by introducing a TopologyPolicy feature to the RoleSetSpec. This new policy enables fine-grained control over pod placement, allowing users to specify how pods should be co-located across different topology domains (e.g., hostnames, zones) at various scopes (StormService, RoleSet, or individual Role). The changes include API definition, automatic code generation for deepcopy and apply configurations, updates to CRD schemas, and the necessary controller logic to dynamically inject pod affinity rules, thereby improving scheduling flexibility and adherence to infrastructure topology.

Highlights

  • New API Field: TopologyPolicy: Introduced a new TopologyPolicy field within the RoleSetSpec to allow users to define co-location constraints for pods based on Kubernetes topology keys.
  • Topology Scope Definition: Defined TopologyScope enum with values StormService, RoleSet, and Role to specify the granularity at which co-location policies apply.
  • Automated Code Generation Updates: Generated deepcopy methods and apply configuration structures for the new TopologyPolicy type, ensuring proper handling within the Kubernetes API ecosystem.
  • CRD and Helm Chart Updates: Updated the Custom Resource Definition (CRD) schemas for RoleSet and StormService, as well as their corresponding Helm chart CRDs, to incorporate the new topologyPolicy field.
  • Controller Logic for Pod Affinity: Implemented controller logic to inject Kubernetes PodAffinity rules into generated Pods and PodTemplates based on the specified TopologyPolicy, enforcing co-location requirements during scheduling.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a TopologyPolicy to control pod co-location for StormServices. The changes include adding the new API types, updating CRDs, and implementing the logic to inject pod affinity rules based on the policy.

The implementation is mostly solid, but I've identified a few areas for improvement:

  • There is a typo in a new API constant that should be corrected.
  • The validation for the new TopologyPolicy could be strengthened to provide better feedback to users on misconfiguration.
  • There is some code duplication in the new utility functions that can be refactored for better maintainability.

Overall, this is a good addition. Please see my detailed comments for suggestions.

Comment on lines 152 to 159
// Key is the Kubernetes topology label key to enforce co-location on.
// Common values include:
// - "kubernetes.io/hostname" (node-level)
// - "topology.kubernetes.io/zone" (zone-level)
// Required when Scope is set.
// +optional
Key string `json:"key,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The comment for Key on line 156 states that it's "Required when Scope is set", but this is not enforced by any validation, and the field is marked as +optional. While the controller logic handles this by ignoring the policy if either Key or Scope is empty, it would be better for user experience to enforce this relationship via validation. Consider adding a validating webhook to enforce that Key is present if Scope is set. Without it, a user might specify a Scope without a Key and the policy would be silently ignored, which can be hard to debug.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add TODO to solve in next pr

@googs1025 googs1025 force-pushed the exclusive_topology branch 2 times, most recently from 2183d34 to 0681d7b Compare December 12, 2025 01:50
@googs1025
Copy link
Collaborator Author

googs1025 commented Dec 12, 2025

local test:

case 1:

root@iZ6we78ovpjfbxlipufqliZ:~# kubectl get node -owide --show-labels
NAME                     STATUS   ROLES           AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                         KERNEL-VERSION     CONTAINER-RUNTIME    LABELS
cluster1-control-plane   Ready    control-plane   22h   v1.32.2   172.18.0.3    <none>        Debian GNU/Linux 12 (bookworm)   6.8.0-88-generic   containerd://2.0.3   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=cluster1-control-plane,kubernetes.io/os=linux,node-role.kubernetes.io/control-plane=,node.kubernetes.io/exclude-from-external-load-balancers=
cluster1-worker          Ready    <none>          22h   v1.32.2   172.18.0.5    <none>        Debian GNU/Linux 12 (bookworm)   6.8.0-88-generic   containerd://2.0.3   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,gateway=true,kubernetes.io/arch=amd64,kubernetes.io/hostname=cluster1-worker,kubernetes.io/os=linux,topology.kubernetes.io/zone=zone-a
cluster1-worker2         Ready    <none>          22h   v1.32.2   172.18.0.4    <none>        Debian GNU/Linux 12 (bookworm)   6.8.0-88-generic   containerd://2.0.3   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=cluster1-worker2,kubernetes.io/os=linux,topology.kubernetes.io/zone=zone-a
cluster1-worker3         Ready    <none>          22h   v1.32.2   172.18.0.2    <none>        Debian GNU/Linux 12 (bookworm)   6.8.0-88-generic   containerd://2.0.3   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=cluster1-worker3,kubernetes.io/os=linux,topology.kubernetes.io/zone=zone-b
cluster1-worker4         Ready    <none>          22h   v1.32.2   172.18.0.6    <none>        Debian GNU/Linux 12 (bookworm)   6.8.0-88-generic   containerd://2.0.3   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=cluster1-worker4,kubernetes.io/os=linux,topology.kubernetes.io/zone=zone-b
root@iZ6we78ovpjfbxlipufqliZ:~# vi aa.yaml
root@iZ6we78ovpjfbxlipufqliZ:~# vi aa.yaml
root@iZ6we78ovpjfbxlipufqliZ:~# kubectl apply -f aa.yaml
stormservice.orchestration.aibrix.ai/pd-inference created
root@iZ6we78ovpjfbxlipufqliZ:~# kubectl get pods -owide
NAME                                            READY   STATUS    RESTARTS   AGE   IP          NODE               NOMINATED NODE   READINESS GATES
pd-inference-roleset-7zrrb-decode-6794f4-0      1/1     Running   0          8s    10.6.2.48   cluster1-worker    <none>           <none>
pd-inference-roleset-7zrrb-decode-6794f4-1      1/1     Running   0          8s    10.6.4.84   cluster1-worker2   <none>           <none>
pd-inference-roleset-7zrrb-decode-6794f4-2      1/1     Running   0          8s    10.6.4.85   cluster1-worker2   <none>           <none>
pd-inference-roleset-7zrrb-decode-6794f4-3      1/1     Running   0          8s    10.6.2.47   cluster1-worker    <none>           <none>
pd-inference-roleset-7zrrb-prefill-648798-0-0   1/1     Running   0          8s    10.6.3.52   cluster1-worker3   <none>           <none>
pd-inference-roleset-7zrrb-prefill-648798-0-1   1/1     Running   0          7s    10.6.1.29   cluster1-worker4   <none>           <none>
pd-inference-roleset-7zrrb-prefill-648798-1-0   1/1     Running   0          7s    10.6.3.53   cluster1-worker3   <none>           <none>
pd-inference-roleset-7zrrb-prefill-648798-1-1   1/1     Running   0          7s    10.6.1.30   cluster1-worker4   <none>           <none>
pd-inference-roleset-fjf5b-decode-6794f4-0      1/1     Running   0          8s    10.6.2.49   cluster1-worker    <none>    
...       
apiVersion: orchestration.aibrix.ai/v1alpha1
kind: StormService
metadata:
  name: pd-inference
spec:
  replicas: 3
  stateful: true
  selector:
    matchLabels:
      app: pd-inference
  template:
    metadata:
      labels:
        app: pd-inference
    spec:
      topologyPolicy:
        scope: Role
        key: topology.kubernetes.io/zone
      roles:
        - name: prefill
          replicas: 2
          podGroupSize: 2
          stateful: true
          template:
            metadata:
              labels:
                role: prefill
            spec:
              containers:
                - name: prefill
                  image: nginx:alpine
        - name: decode
          replicas: 4
          stateful: true
          template:
            metadata:
              labels:
                role: decode
            spec:
              containers:
                - name: decode
                  image: nginx:alpine

case 2:

root@iZ6we78ovpjfbxlipufqliZ:~# cat aa.yaml
apiVersion: orchestration.aibrix.ai/v1alpha1
kind: StormService
metadata:
  name: pd-inference
spec:
  replicas: 3
  stateful: true
  selector:
    matchLabels:
      app: pd-inference
  template:
    metadata:
      labels:
        app: pd-inference
    spec:
      topologyPolicy:
        scope: Role
        key: topology.kubernetes.io/zone
      roles:
        - name: prefill
          replicas: 2
          podGroupSize: 2
          stateful: true
          template:
            metadata:
              labels:
                role: prefill
            spec:
              containers:
                - name: prefill
                  image: nginx:alpine
        - name: decode
          replicas: 4
          stateful: true
          template:
            metadata:
              labels:
                role: decode
            spec:
              containers:
                - name: decode
                  image: nginx:alpine
root@iZ6we78ovpjfbxlipufqliZ:~#
root@iZ6we78ovpjfbxlipufqliZ:~# kubectl apply -f aa.yaml
stormservice.orchestration.aibrix.ai/pd-inference created
root@iZ6we78ovpjfbxlipufqliZ:~# kubectl get pods -owide
NAME                                            READY   STATUS    RESTARTS   AGE   IP          NODE               NOMINATED NODE   READINESS GATES
pd-inference-roleset-8qw9t-decode-6794f4-0      1/1     Running   0          8s    10.6.3.56   cluster1-worker3   <none>           <none>
pd-inference-roleset-8qw9t-decode-6794f4-1      1/1     Running   0          8s    10.6.3.59   cluster1-worker3   <none>           <none>
pd-inference-roleset-8qw9t-decode-6794f4-2      1/1     Running   0          8s    10.6.3.57   cluster1-worker3   <none>           <none>
pd-inference-roleset-8qw9t-decode-6794f4-3      1/1     Running   0          8s    10.6.3.58   cluster1-worker3   <none>           <none>
pd-inference-roleset-8qw9t-prefill-648798-0-0   1/1     Running   0          8s    10.6.3.63   cluster1-worker3   <none>           <none>
pd-inference-roleset-8qw9t-prefill-648798-0-1   1/1     Running   0          8s    10.6.3.60   cluster1-worker3   <none>           <none>
pd-inference-roleset-8qw9t-prefill-648798-1-0   1/1     Running   0          8s    10.6.3.61   cluster1-worker3   <none>           <none>
pd-inference-roleset-8qw9t-prefill-648798-1-1   1/1     Running   0          8s    10.6.3.62   cluster1-worker3   <none>           <none>
pd-inference-roleset-bgqhg-decode-6794f4-0      1/1     Running   0          9s    10.6.2.54   cluster1-worker    <none>           <none>
pd-inference-roleset-bgqhg-decode-6794f4-1      1/1     Running   0          9s    10.6.2.51   cluster1-worker    <none>           <none>
pd-inference-roleset-bgqhg-decode-6794f4-2      1/1     Running   0          9s    10.6.2.53   cluster1-worker    <none>           <none>
pd-inference-roleset-bgqhg-decode-6794f4-3      1/1     Running   0          9s    10.6.2.52   cluster1-worker    <none>           <none>
pd-inference-roleset-bgqhg-prefill-648798-0-0   1/1     Running   0          9s    10.6.2.55   cluster1-worker    <none>           <none>
pd-inference-roleset-bgqhg-prefill-648798-0-1   1/1     Running   0          9s    10.6.2.56   cluster1-worker    <none>           <none>
pd-inference-roleset-bgqhg-prefill-648798-1-0   1/1     Running   0          8s    10.6.2.57   cluster1-worker    <none>           <none>
pd-inference-roleset-bgqhg-prefill-648798-1-1   1/1     Running   0          8s    10.6.2.58   cluster1-worker    <none>           <none>
pd-inference-roleset-dfwfs-decode-6794f4-0      1/1     Running   0          8s    10.6.1.36   cluster1-worker4   <none>           <none>
pd-inference-roleset-dfwfs-decode-6794f4-1      1/1     Running   0          8s    10.6.1.34   cluster1-worker4   <none>           <none>
pd-inference-roleset-dfwfs-decode-6794f4-2      1/1     Running   0          8s    10.6.1.33   cluster1-worker4   <none>           <none>
pd-inference-roleset-dfwfs-decode-6794f4-3      1/1     Running   0          8s    10.6.1.35   cluster1-worker4   <none>           <none>
pd-inference-roleset-dfwfs-prefill-648798-0-0   1/1     Running   0          8s    10.6.1.37   cluster1-worker4   <none>           <none>
pd-inference-roleset-dfwfs-prefill-648798-0-1   1/1     Running   0          8s    10.6.1.39   cluster1-worker4   <none>           <none>
pd-inference-roleset-dfwfs-prefill-648798-1-0   1/1     Running   0          8s    10.6.1.38   cluster1-worker4   <none>           <none>
pd-inference-roleset-dfwfs-prefill-648798-1-1   1/1     Running   0          8s    10.6.1.40   cluster1-worker4   <none>           <none>

@googs1025 googs1025 marked this pull request as ready for review December 12, 2025 15:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[RFC]: Support Exclusive Topology in StormService for Pod Colocation

1 participant