Skip to content

Conversation

@bigerous
Copy link
Contributor

@bigerous bigerous commented Aug 13, 2025

Pull Request Description

StormServiceController will Create a headless service assocaited with stormservice. The headless service's publishNotReadyAddresses option is false (by default), only ready Pods get listed in DNS. This will cause pod of the same stormservice can not reach each other by dns name like <pod-name>.<service-name>.<namespace>.svc.cluster.local before target pod is ready. Reaching another pod by dns name before pod is ready is needed in some cases like sglang's multi-node deployment:

# replace 172.16.4.52:20000 with your own node ip address and port of the first node

python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --dist-init-addr {ROLESET_NAME}-${ROLE_NAME}-${ROLE_TEMPLATE_HASH}-${ROLE_REPLICA_INDEX}-0.${STORM_SERVICE_NAME}:5000 --nnodes 2 --node-rank 0

python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --dist-init-addr {ROLESET_NAME}-${ROLE_NAME}-${ROLE_TEMPLATE_HASH}-${ROLE_REPLICA_INDEX}-0.${STORM_SERVICE_NAME}:5000  --nnodes 2 --node-rank 1

When publishNotReadyAddresses is true: Pods that aren’t ready yet (per readiness probes) will also be included in the DNS entries for the service.

Related Issues

Resolves: #[Insert issue number(s)]

Important: Before submitting, please complete the description above and review the checklist below.


Contribution Guidelines (Expand for Details)

We appreciate your contribution to aibrix! To ensure a smooth review process and maintain high code quality, please adhere to the following guidelines:

Pull Request Title Format

Your PR title should start with one of these prefixes to indicate the nature of the change:

  • [Bug]: Corrections to existing functionality
  • [CI]: Changes to build process or CI pipeline
  • [Docs]: Updates or additions to documentation
  • [API]: Modifications to aibrix's API or interface
  • [CLI]: Changes or additions to the Command Line Interface
  • [Misc]: For changes not covered above (use sparingly)

Note: For changes spanning multiple categories, use multiple prefixes in order of importance.

Submission Checklist

  • PR title includes appropriate prefix(es)
  • Changes are clearly explained in the PR description
  • New and existing tests pass successfully
  • Code adheres to project style and best practices
  • Documentation updated to reflect changes (if applicable)
  • Thorough testing completed, no regressions introduced

By submitting this PR, you confirm that you've read these guidelines and your changes align with the project's contribution standards.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary of Changes

Hello @bigerous, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a bug where pods within the same stormservice could not communicate via DNS before all target pods were fully ready. The change enables the publishNotReadyAddresses option on the headless service associated with stormservice instances. This ensures that DNS entries for the service include unready pods, facilitating scenarios like multi-node deployments (e.g., sglang) that require pods to resolve each other's DNS names even if they are not yet in a 'ready' state.

Highlights

  • Headless Service Configuration: The PublishNotReadyAddresses field for the StormService's headless service is now set to true. This ensures that even unready pods are included in the DNS entries, allowing for inter-pod communication via DNS before all pods are fully ready.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request aims to fix a bug where pods of a stormservice cannot discover each other via DNS before they are ready. The proposed change correctly sets PublishNotReadyAddresses: true on the headless service spec. However, the review identified a significant issue: the change will not be applied to existing services because the equality check function (isServiceEqual) does not compare this new field. This makes the fix incomplete. I've provided a detailed comment on how to correct this, which involves updating isServiceEqual and the associated unit tests.

Comment on lines +81 to +87
Type: corev1.ServiceTypeClusterIP,
ClusterIP: corev1.ClusterIPNone,
Selector: map[string]string{constants.StormServiceNameLabelKey: service.Name},
PublishNotReadyAddresses: true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

While adding PublishNotReadyAddresses: true is correct for the intended purpose, the current implementation will not apply this change to existing headless services.

The isServiceEqual function, which is used to check if an update is needed, does not compare the PublishNotReadyAddresses field. As a result, the controller will not detect the difference for existing services and will not trigger an update. This makes the bug fix incomplete as it won't work for existing StormService resources.

To fix this, isServiceEqual in pkg/controller/stormservice/utils.go needs to be updated to include this check:

// isServiceEqual compares two Kubernetes Service objects for equality
func isServiceEqual(a, b *corev1.Service) bool {
	return a.Spec.Type == b.Spec.Type &&
		apiequality.Semantic.DeepEqual(a.Spec.Selector, b.Spec.Selector) &&
		a.Spec.ClusterIP == b.Spec.ClusterIP &&
		a.Spec.PublishNotReadyAddresses == b.Spec.PublishNotReadyAddresses
}

Additionally, the test TestSyncHeadlessService in pkg/controller/stormservice/sync_test.go should be updated to assert that PublishNotReadyAddresses is set to true on the created/updated service to prevent future regressions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bigerous could you check the utils. for existing services, once user finish the upgrade, the service can be updated.

Copy link
Contributor Author

@bigerous bigerous Aug 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. I have updated the utils and add ut test for this case.

@bigerous bigerous force-pushed the cdj/fix/headless_svc_readiness branch from 466ed55 to 5490575 Compare August 13, 2025 13:29
@Jeffwan Jeffwan force-pushed the cdj/fix/headless_svc_readiness branch from 5490575 to 3a1a847 Compare August 14, 2025 01:24
@Jeffwan Jeffwan force-pushed the cdj/fix/headless_svc_readiness branch from 86ea787 to 1376be0 Compare August 14, 2025 10:54
@Jeffwan Jeffwan merged commit 3c94399 into vllm-project:main Aug 14, 2025
14 checks passed
@bigerous bigerous deleted the cdj/fix/headless_svc_readiness branch August 14, 2025 11:27
Jeffwan pushed a commit to Jeffwan/aibrix that referenced this pull request Aug 18, 2025
…es (vllm-project#1441)

* fix: stormservice's headless service need set PublishNotReadyAddresses

Signed-off-by: dajun.cui <[email protected]>

* fix: isServiceEqual check PublishNotReadyAddresses

Signed-off-by: dajun.cui <[email protected]>

---------

Signed-off-by: dajun.cui <[email protected]>
Jeffwan added a commit that referenced this pull request Aug 18, 2025
…se-0.4 branch (#1468)

* Select PD workers in same roleset (#1409)

* Select PD workers in same roleset
* nit
* update ut
---------

Signed-off-by: Varun Gupta <[email protected]>

* [Bug] fix webhook config output when using make manifests (#1412)

fix webhook config output when using make manifests

Signed-off-by: googs1025 <[email protected]>

* [Fix] Fix vLLM NIXL-based P/D samples (#1425)

Signed-off-by: Haiyang Shi <[email protected]>
Co-authored-by: Haiyang Shi <[email protected]>

* [Fix] Disable GGA in NIXL samples (#1436)

[Fix] Fix NIXL samples

Explicitly set UCX_TLS to let UCX not use GGA (GPU Direct) transport

Signed-off-by: Haiyang Shi <[email protected]>
Co-authored-by: Haiyang Shi <[email protected]>

* Fix P/D disaggregation router to follow Nixl kv_transfer_params (#1429)

- Add kv_transfer_params configuration to prefill requests and decode requests

Signed-off-by: Jiaxin Shan <[email protected]>

* [Bug] Corrected naming convention for AIBRIX_MODEL_GPU_PROFILE_CACHING_FLAG (#1427)

Corrected naming convention for AIBRIX_MODEL_GPU_PROFILE_CACHING_FLAG

Signed-off-by: Jonathon Shea <[email protected]>

* [Bug] stormservice's headless service not set ownerRef (#1442)

* fix: stormservice's headless service not set ownerRef

Signed-off-by: dajun.cui <[email protected]>

* fix: patch ut test for service sync

Signed-off-by: dajun.cui <[email protected]>

---------

Signed-off-by: dajun.cui <[email protected]>

* [Bug] stormservice's headless service need set PublishNotReadyAddresses (#1441)

* fix: stormservice's headless service need set PublishNotReadyAddresses

Signed-off-by: dajun.cui <[email protected]>

* fix: isServiceEqual check PublishNotReadyAddresses

Signed-off-by: dajun.cui <[email protected]>

---------

Signed-off-by: dajun.cui <[email protected]>

---------

Signed-off-by: Varun Gupta <[email protected]>
Signed-off-by: googs1025 <[email protected]>
Signed-off-by: Haiyang Shi <[email protected]>
Signed-off-by: Jiaxin Shan <[email protected]>
Signed-off-by: Jonathon Shea <[email protected]>
Signed-off-by: dajun.cui <[email protected]>
Co-authored-by: Varun Gupta <[email protected]>
Co-authored-by: CYJiang <[email protected]>
Co-authored-by: Haiyang Shi <[email protected]>
Co-authored-by: Haiyang Shi <[email protected]>
Co-authored-by: Jonathon Shea <[email protected]>
Co-authored-by: cuidajun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants