[rollout] fix: sglang async fail with Multi-stage Awake feature #2365


Merged

chenhaiq
Collaborator

@chenhaiq chenhaiq commented Jul 4, 2025

What does this PR do?

Fixes a regression introduced by #1911: that PR did not update the sglang async branch.

CI did not catch this error because it only ran 1 step, whereas the error occurs in the second step. I have therefore updated the test cases to run 2 steps.

To reproduce the bug, run the test:
TOTAL_TRAIN_STEPS=2 ENGINE=sglang ROLLOUT_MODE=async bash tests/special_e2e/ppo_trainer/run_function_reward.sh

It fails with:

(WorkerDict pid=1257286) Total steps: 2, num_warmup_steps: 0
(WorkerDict pid=1257286) Actor use_remove_padding=True
(WorkerDict pid=1257286) Actor use_fused_kernels=False
(AsyncSglangServer pid=1260392) FastAPI listen on 192.168.111.48:40451
(WorkerDict pid=1257286) terminate called after throwing an instance of 'c10::Error'
(WorkerDict pid=1257286)   what():  CUDA error: an illegal memory access was encountered
(WorkerDict pid=1257286) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(WorkerDict pid=1257286) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(WorkerDict pid=1257286) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(WorkerDict pid=1257286)
(WorkerDict pid=1257286) Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
(WorkerDict pid=1257286) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fbf6036c1b6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(WorkerDict pid=1257286) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fbf60315a76 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(WorkerDict pid=1257286) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fbf6080d918 in

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: https://github.com/volcengine/verl/issues?q=is%3Aissue%20state%3Aopen%20an%20illegal%20memory%20access%20was%20encountered
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

(TaskRunner pid=1647269) step:2 - global_seqlen/min:13075 - global_seqlen/max:14837 - global_seqlen/minmax_diff:1762 - global_seqlen/balanced_min:14231 - global_seqlen/balanced_max:14232 - global_seqlen/mean:14231.5 - actor/entropy:2.0606913566589355 - critic/vf_loss:8.7157882153

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

High-Level Design

Demonstrate the high-level design if this PR is complex.

Specific Changes

List the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request addresses a critical regression causing CUDA memory access errors in the SGLang async rollout process by correctly re-introducing memory resumption steps during the wake_up phase. The changes ensure that model weights and KV cache are properly loaded into GPU memory, aligning the wake_up logic with existing memory management patterns. While the fix is effective, there is an opportunity to refactor duplicated memory resumption logic for improved maintainability.

Comment on lines +211 to +217
if self.device_mesh["infer_tp"].get_local_rank() == 0 and self.rollout_config.free_cache_engine:
    if self.multi_stage_wake_up:
        await self.inference_engine.resume_memory_occupation(tags=["weights"])
        log_gpu_memory_usage("Before resume SGLang weights in sharding manager", logger=logger)
    else:
        await self.inference_engine.resume_memory_occupation()
        log_gpu_memory_usage("Before resume SGLang weights + kv_cache in sharding manager", logger=logger)


critical

This block correctly re-introduces the memory occupation resumption for model weights. This is crucial for preventing the reported "illegal memory access" CUDA errors, as it ensures that the SGLang inference engine has its necessary weights loaded into GPU memory before further operations like update_weights are performed. This change aligns the wake_up method's behavior with the memory management logic already present in the __enter__ method.

Comment on lines +229 to +231
if self.multi_stage_wake_up and self.rollout_config.free_cache_engine:
    await self.inference_engine.resume_memory_occupation(tags=["kv_cache"])
    log_gpu_memory_usage("After resume SGLang kv_cache in sharding manager", logger=logger)


critical

This addition ensures that the KV cache is also properly resumed into GPU memory when multi_stage_wake_up is enabled. This granular control over memory resumption is vital for the stability and performance of the SGLang engine, especially in scenarios where KV cache might be offloaded. This change is a necessary part of the overall fix for the memory access regression.
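The ordering described in the two review comments above can be sketched as follows. This is only an illustration under stated assumptions: `FakeEngine`, `sleep`, `run_steps`, and the `release_memory_occupation` stand-in are hypothetical and are not verl or SGLang code; only the `resume_memory_occupation(tags=...)` call shape and the `multi_stage_wake_up` / `free_cache_engine` flags come from the diff. The sketch also assumes GPU memory is released after every step, which would be consistent with the failure only appearing on the second step, but the thread does not confirm that.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for the inference engine; it only records calls."""

    def __init__(self):
        # A freshly started engine is assumed to have everything on the GPU.
        self.resident = {"weights", "kv_cache"}
        self.calls = []

    async def resume_memory_occupation(self, tags=None):
        tags = tags or ["weights", "kv_cache"]
        self.resident.update(tags)
        self.calls.append(("resume", tuple(tags)))

    async def release_memory_occupation(self):
        self.resident.clear()
        self.calls.append(("release",))

    async def update_weights(self):
        # Stand-in for the point where touching released memory would fail.
        if "weights" not in self.resident:
            raise RuntimeError("weights are not resident on the GPU")
        self.calls.append(("update_weights",))


async def wake_up(engine, multi_stage_wake_up, free_cache_engine=True):
    # Stage 1: bring back the weights (or everything, in single-stage mode).
    if free_cache_engine:
        if multi_stage_wake_up:
            await engine.resume_memory_occupation(tags=["weights"])
        else:
            await engine.resume_memory_occupation()
    await engine.update_weights()
    # Stage 2: bring back the KV cache only after the weights are synced.
    if multi_stage_wake_up and free_cache_engine:
        await engine.resume_memory_occupation(tags=["kv_cache"])


async def sleep(engine, free_cache_engine=True):
    if free_cache_engine:
        await engine.release_memory_occupation()


async def run_steps(num_steps, multi_stage_wake_up):
    engine = FakeEngine()
    for _ in range(num_steps):
        await wake_up(engine, multi_stage_wake_up)
        await sleep(engine)
    return engine.calls


if __name__ == "__main__":
    for call in asyncio.run(run_steps(2, multi_stage_wake_up=True)):
        print(call)
```

In this toy model, calling `update_weights` after a `sleep` without a preceding resume raises on the second step but not the first, which mirrors why a one-step CI run would not surface the problem.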

@chenhaiq
Collaborator Author

chenhaiq commented Jul 4, 2025

@zhaochenyang20 We may need to discuss how to merge the sglang rollout features with the agent loop so that new features stay in the same branch. cc @wuxibin89

@wuxibin89
Collaborator

We should also fix the megatron backend.

@zhaochenyang20
Collaborator

Thanks so much. We are also working on a fix for this on our side: sgl-project/sglang#7419

Does this PR solve the illegal memory access in verl's main and SGLang 0.4.8.post1?

Should we still make changes on our side? @hebiao064 @nanjiangwill

@hebiao064
Collaborator

I was not aware of this code path, to be honest. This fix makes sense to me, thanks for fixing it!

I think the review comment at #2365 (comment) is valid: we can reuse the enter and exit methods to reduce duplicate code, as the vllm server does.
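A minimal sketch of what such reuse could look like, assuming a hypothetical `ShardingManagerSketch` class; the class, its `events` list, and the method bodies are illustrative only and are not taken from the PR or from the vllm server code. The idea is that the context-manager entry and exit delegate to the same `wake_up` / `sleep` coroutines, so the resume and release logic lives in one place.

```python
import asyncio


class ShardingManagerSketch:
    """Hypothetical manager: one copy of the logic, two ways to call it."""

    def __init__(self):
        self.events = []

    async def wake_up(self):
        # Single home for "resume memory, sync weights" logic.
        self.events.append("resume+sync weights")

    async def sleep(self):
        # Single home for "release memory" logic.
        self.events.append("release memory")

    def __enter__(self):
        # Sync path reuses the async method instead of duplicating it.
        # asyncio.run only works when no event loop is already running
        # in this thread; a real implementation would need to handle that.
        asyncio.run(self.wake_up())
        return self

    def __exit__(self, exc_type, exc, tb):
        asyncio.run(self.sleep())
        return False  # do not swallow exceptions


if __name__ == "__main__":
    manager = ShardingManagerSketch()
    with manager:
        manager.events.append("rollout")
    print(manager.events)
```

Whether the sync methods wrap the async ones or the other way around is a design choice; either direction keeps the two entry points from drifting apart, which is the kind of drift that this regression illustrates.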

@chenhaiq
Collaborator Author

chenhaiq commented Jul 7, 2025

We should also fix the megatron backend.

Fixed. Please review the code again.

@chenhaiq
Collaborator Author

chenhaiq commented Jul 7, 2025

I was not aware of this code path, to be honest. This fix makes sense to me, thanks for fixing it!

I think the review comment at #2365 (comment) is valid: we can reuse the enter and exit methods to reduce duplicate code, as the vllm server does.

Resolved.

@wuxibin89 wuxibin89 merged commit 4c37c97 into volcengine:main Jul 7, 2025
45 checks passed
4 participants