[rollout] fix: sglang async fail with Multi-stage Awake feature #2365
Conversation
Code Review
The pull request addresses a critical regression causing CUDA memory access errors in the SGLang async rollout process by correctly re-introducing memory resumption steps during the `wake_up` phase. The changes ensure that model weights and KV cache are properly loaded into GPU memory, aligning the `wake_up` logic with existing memory management patterns. While the fix is effective, there is an opportunity to refactor the duplicated memory resumption logic for improved maintainability.
```python
if self.device_mesh["infer_tp"].get_local_rank() == 0 and self.rollout_config.free_cache_engine:
    if self.multi_stage_wake_up:
        await self.inference_engine.resume_memory_occupation(tags=["weights"])
        log_gpu_memory_usage("Before resume SGLang weights in sharding manager", logger=logger)
    else:
        await self.inference_engine.resume_memory_occupation()
        log_gpu_memory_usage("Before resume SGLang weights + kv_cache in sharding manager", logger=logger)
```
This block correctly re-introduces the memory occupation resumption for model weights. This is crucial for preventing the reported "illegal memory access" CUDA errors, as it ensures that the SGLang inference engine has its necessary weights loaded into GPU memory before further operations like `update_weights` are performed. This change aligns the `wake_up` method's behavior with the memory management logic already present in the `__enter__` method.
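The branch on `multi_stage_wake_up` above also appears in `__enter__`, which is the duplication the review calls out. A minimal sketch of how the shared logic could be extracted into one helper — using a hypothetical `_MockEngine` stand-in for the SGLang inference engine, since the real engine is not reproduced here:

```python
import asyncio

class _MockEngine:
    """Illustrative stand-in for the SGLang inference engine; records resume calls."""
    def __init__(self):
        self.calls = []

    async def resume_memory_occupation(self, tags=None):
        # The real engine reloads the tagged memory regions onto the GPU;
        # here we only record what was requested.
        self.calls.append(tags)

async def resume_weights(engine, multi_stage_wake_up):
    """Hypothetical shared helper: resume only weights under multi-stage
    wake-up, otherwise resume weights and KV cache in a single call."""
    if multi_stage_wake_up:
        await engine.resume_memory_occupation(tags=["weights"])
    else:
        await engine.resume_memory_occupation()

async def main():
    engine = _MockEngine()
    await resume_weights(engine, multi_stage_wake_up=True)   # staged path
    await resume_weights(engine, multi_stage_wake_up=False)  # single-shot path
    return engine.calls

calls = asyncio.run(main())
print(calls)  # [['weights'], None]
```

With such a helper, both `__enter__` and `wake_up` would call `resume_weights` instead of repeating the branch, which is the deduplication suggested in this review.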
```python
if self.multi_stage_wake_up and self.rollout_config.free_cache_engine:
    await self.inference_engine.resume_memory_occupation(tags=["kv_cache"])
    log_gpu_memory_usage("After resume SGLang kv_cache in sharding manager", logger=logger)
```
This addition ensures that the KV cache is also properly resumed into GPU memory when `multi_stage_wake_up` is enabled. This granular control over memory resumption is vital for the stability and performance of the SGLang engine, especially in scenarios where the KV cache might be offloaded. This change is a necessary part of the overall fix for the memory access regression.
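The ordering implied by the two snippets — weights resumed first, weights updated, KV cache resumed last — can be sketched with a toy engine that records its operations. All class and method names below are illustrative stand-ins, not the actual SGLang API surface:

```python
import asyncio

class _MockEngine:
    """Toy engine that records the order of memory operations."""
    def __init__(self):
        self.ops = []

    async def resume_memory_occupation(self, tags=None):
        self.ops.append(("resume", tuple(tags) if tags else None))

    async def update_weights(self):
        self.ops.append(("update_weights", None))

async def wake_up(engine):
    # Stage 1: bring weights back onto the GPU before any weight update,
    # otherwise the update touches unmapped memory (the reported CUDA error).
    await engine.resume_memory_occupation(tags=["weights"])
    await engine.update_weights()
    # Stage 2: resume the KV cache only after the weights are in place.
    await engine.resume_memory_occupation(tags=["kv_cache"])

engine = _MockEngine()
asyncio.run(wake_up(engine))
print(engine.ops)
```

Deferring the KV cache to the last stage keeps peak GPU memory lower during the weight update, which is the point of the multi-stage wake-up feature.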
@zhaochenyang20 we may need a discussion about how to merge SGLang rollout features with the agent loop, to keep new features in the same branch. cc @wuxibin89

We should also fix the Megatron backend.
Thanks so much. We are also working on a fix on our side: sgl-project/sglang#7419. Does this PR solve the issue, or should we still make changes on our side? @hebiao064 @nanjiangwill

I was not aware of this code path, to be honest; this fix makes sense to me, thanks for fixing it! I think #2365 (comment) is a valid point — we can reuse the enter and exit logic to reduce duplicated code, like the vLLM server does.

Fixed. Please review the code again.

Resolved.
What does this PR do?
Fix a regression from #1911, which did not change the sglang async branch.

CI did not catch this error because it only runs 1 step, while the error occurs on the second step, so I updated the test cases to run 2 steps.
To reproduce the bug, run the test:

```shell
TOTAL_TRAIN_STEPS=2 ENGINE=sglang ROLLOUT_MODE=async bash tests/special_e2e/ppo_trainer/run_function_reward.sh
```

It fails with an "illegal memory access" CUDA error.
Checklist Before Starting

- Format the PR title as `[{modules}] {type}: {description}` (this will be checked by the CI).
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`.
  - If this PR involves multiple modules, separate them with `,`, like `[megatron, fsdp, doc]`.
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`.
  - If this PR breaks any API, prepend `[BREAKING]` to the beginning of the title, like `[BREAKING][fsdp, megatron] feat: dynamic batching`.
Test
API and Usage Example
# Add code snippet or script demonstrating how to use this
High-Level Design
Specific Changes
Checklist Before Submitting
Important: please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- Apply pre-commit checks: `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- Request CI in the `ci-request` channel in the `verl` Slack workspace.