fix: complete RayClusterFleet example for multi-node vLLM inference #954
Summary
This PR updates the existing RayClusterFleet example for qwen-coder-7b-instruct to make it fully functional for multi-node inference with vLLM and Ray.
What's fixed or added:
Corrected the command and args for the Ray head pod so that vllm serve starts correctly (see the pod spec sketch below)
Explicitly set tensor-parallel-size and distributed-executor-backend so vLLM shards the model across Ray workers
Added the aibrix-runtime sidecar container
Added the missing Service exposing the vLLM port (8000) (see the Service sketch below)
Added an HTTPRoute to route OpenAI-compatible requests through the Gateway API (see the HTTPRoute sketch below)
Added the environment variable PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to mitigate GPU memory fragmentation
Ensured Ray workers are correctly bootstrapped and terminated
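For orientation, here is a minimal sketch of the head-pod container spec after this change, assuming the head container starts Ray and then launches vllm serve in the same shell. The image tags, model name, and GPU counts are illustrative, not copied verbatim from the example:

```yaml
# Sketch of the Ray head pod template (fragment; values illustrative).
containers:
  - name: ray-head
    image: vllm/vllm-openai:latest            # assumption: any vLLM image with Ray support
    command: ["/bin/bash", "-lc"]
    args:
      # Start the Ray head, then launch the OpenAI-compatible server.
      # --tensor-parallel-size should match the number of GPUs taking part
      # in serving; the ray backend lets vLLM place workers on other pods.
      - >
        ray start --head --port=6379 --dashboard-host=0.0.0.0 &&
        vllm serve Qwen/Qwen2.5-Coder-7B-Instruct
        --served-model-name qwen-coder-7b-instruct
        --tensor-parallel-size 2
        --distributed-executor-backend ray
        --port 8000
    env:
      # Keeps the CUDA caching allocator from fragmenting on long runs.
      - name: PYTORCH_CUDA_ALLOC_CONF
        value: "expandable_segments:True"
    ports:
      - containerPort: 8000
    resources:
      limits:
        nvidia.com/gpu: 1
  # Sidecar added by this PR; provides AIBrix runtime functionality
  # (metrics/model management) alongside the engine.
  - name: aibrix-runtime
    image: aibrix/runtime:latest              # assumption: actual tag may differ
```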
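The Service is a plain ClusterIP service on port 8000. The name and selector below follow AIBrix's model.aibrix.ai/name label convention but are assumptions rather than the exact values in the diff:

```yaml
# Sketch of the Service exposing vLLM's OpenAI-compatible port.
apiVersion: v1
kind: Service
metadata:
  name: qwen-coder-7b-instruct
  labels:
    model.aibrix.ai/name: qwen-coder-7b-instruct
spec:
  selector:
    model.aibrix.ai/name: qwen-coder-7b-instruct   # assumption: pods carry this label
  ports:
    - name: serve
      protocol: TCP
      port: 8000
      targetPort: 8000
```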
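The HTTPRoute then attaches that Service to the cluster's Gateway so OpenAI-compatible requests reach vLLM. The gateway name aibrix-eg and the /v1 path match are assumptions for illustration:

```yaml
# Sketch of the HTTPRoute wiring the Service into the Gateway API.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: qwen-coder-7b-instruct-router      # illustrative name
spec:
  parentRefs:
    - name: aibrix-eg                      # assumption: AIBrix's shared Envoy gateway
      namespace: aibrix-system
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1                     # OpenAI-compatible endpoints
      backendRefs:
        - name: qwen-coder-7b-instruct
          port: 8000
```

With the route in place, a request such as POST /v1/chat/completions against the gateway address should be forwarded to the vLLM server on port 8000.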
Why this matters
The original example was missing necessary components, namely the Service and the routing layer, and lacked Ray-aware vLLM initialization, so it could not be used as-is. This update makes it a solid reference for users running distributed inference across multiple GPU nodes.
Tested and validated on an EKS cluster with 3x g4dn.2xlarge nodes.