[rollout] fix: sglang async fail with Multi-stage Awake feature #2365

chenhaiq · 2025-07-04T10:29:42Z

What does this PR do?

Fixes a regression introduced by #1911, which did not update the sglang async branch.

CI did not catch this error because it ran only 1 step, while the error occurs in the second step. This PR updates the test cases to run 2 steps.

To reproduce the bug, run this test:
TOTAL_TRAIN_STEPS=2 ENGINE=sglang ROLLOUT_MODE=async bash tests/special_e2e/ppo_trainer/run_function_reward.sh

It fails with:

(WorkerDict pid=1257286) Total steps: 2, num_warmup_steps: 0
(WorkerDict pid=1257286) Actor use_remove_padding=True
(WorkerDict pid=1257286) Actor use_fused_kernels=False
(AsyncSglangServer pid=1260392) FastAPI listen on 192.168.111.48:40451
(WorkerDict pid=1257286) terminate called after throwing an instance of 'c10::Error'
(WorkerDict pid=1257286)   what():  CUDA error: an illegal memory access was encountered
(WorkerDict pid=1257286) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(WorkerDict pid=1257286) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(WorkerDict pid=1257286) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(WorkerDict pid=1257286)
(WorkerDict pid=1257286) Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
(WorkerDict pid=1257286) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fbf6036c1b6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(WorkerDict pid=1257286) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fbf60315a76 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
(WorkerDict pid=1257286) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fbf6080d918 in
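
A toy model of why a one-step run can pass while a two-step run fails. It assumes the engine starts with its memory resident and releases it when it goes to sleep at the end of each step. `ToyEngine`, `run_steps`, and their methods are hypothetical names for illustration, and the real failure is a CUDA illegal memory access rather than a Python exception.

```python
class ToyEngine:
    """Hypothetical stand-in for an inference engine whose GPU memory can be released."""

    def __init__(self):
        # Assumption: a freshly started engine holds its memory.
        self.resident = True

    def release_memory(self):
        self.resident = False

    def resume_memory(self):
        self.resident = True

    def update_weights(self):
        if not self.resident:
            # Stands in for the CUDA illegal memory access seen in the log.
            raise RuntimeError("weights touched while memory is released")


def run_steps(total_steps, resume_on_wake_up):
    """Run wake-up / sleep cycles, optionally resuming memory on wake-up."""
    engine = ToyEngine()
    for _ in range(total_steps):
        if resume_on_wake_up:
            engine.resume_memory()
        engine.update_weights()  # wake-up: sync trainer weights into the engine
        engine.release_memory()  # sleep: free memory for the training phase
    return "ok"
```

With `resume_on_wake_up=False`, a single step succeeds because the memory has not been released yet, and the second step raises. This matches the observation that a one-step CI run cannot expose the bug.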

Checklist Before Starting

Search for similar PRs. Paste at least one query link here: https://github.com/volcengine/verl/issues?q=is%3Aissue%20state%3Aopen%20an%20illegal%20memory%20access%20was%20encountered
Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
- {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
- If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
- {type} is in feat, fix, refactor, chore, test
- If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
- Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

(TaskRunner pid=1647269) step:2 - global_seqlen/min:13075 - global_seqlen/max:14837 - global_seqlen/minmax_diff:1762 - global_seqlen/balanced_min:14231 - global_seqlen/balanced_max:14232 - global_seqlen/mean:14231.5 - actor/entropy:2.0606913566589355 - critic/vf_loss:8.7157882153

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

High-Level Design

Demonstrate the high-level design if this PR is complex.

Specific Changes

List the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Read the Contribute Guide.
[x] Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
Add / Update the documentation.
Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
Once your PR is ready for CI, send a message in the ci-request channel in the verl Slack workspace.

gemini-code-assist

Code Review

The pull request addresses a critical regression causing CUDA memory access errors in the SGLang async rollout process by correctly re-introducing memory resumption steps during the wake_up phase. The changes ensure that model weights and KV cache are properly loaded into GPU memory, aligning the wake_up logic with existing memory management patterns. While the fix is effective, there is an opportunity to refactor duplicated memory resumption logic for improved maintainability.

gemini-code-assist · 2025-07-04T10:30:23Z

verl/workers/sharding_manager/fsdp_sglang.py

+        if self.device_mesh["infer_tp"].get_local_rank() == 0 and self.rollout_config.free_cache_engine:
+            if self.multi_stage_wake_up:
+                await self.inference_engine.resume_memory_occupation(tags=["weights"])
+                log_gpu_memory_usage("Before resume SGLang weights in sharding manager", logger=logger)
+            else:
+                await self.inference_engine.resume_memory_occupation()
+                log_gpu_memory_usage("Before resume SGLang weights + kv_cache in sharding manager", logger=logger)


This block correctly re-introduces the memory occupation resumption for model weights. This is crucial for preventing the reported "illegal memory access" CUDA errors, as it ensures that the SGLang inference engine has its necessary weights loaded into GPU memory before further operations like update_weights are performed. This change aligns the wake_up method's behavior with the memory management logic already present in the __enter__ method.

gemini-code-assist · 2025-07-04T10:30:23Z

verl/workers/sharding_manager/fsdp_sglang.py

+        if self.multi_stage_wake_up and self.rollout_config.free_cache_engine:
+            await self.inference_engine.resume_memory_occupation(tags=["kv_cache"])
+            log_gpu_memory_usage("After resume SGLang kv_cache in sharding manager", logger=logger)


This addition ensures that the KV cache is also properly resumed into GPU memory when multi_stage_wake_up is enabled. This granular control over memory resumption is vital for the stability and performance of the SGLang engine, especially in scenarios where KV cache might be offloaded. This change is a necessary part of the overall fix for the memory access regression.
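
The ordering described in the two comments above can be sketched with a mock engine. This is a minimal illustration, not the verl or SGLang implementation: `MockEngine`, its `update_weights` method, and the standalone `wake_up` function are hypothetical, only the `resume_memory_occupation(tags=...)` call shape is taken from the diff, and the rank-0 check is omitted.

```python
import asyncio


class MockEngine:
    """Hypothetical stand-in that only mirrors the call names seen in the diff."""

    def __init__(self):
        self.resident = set()  # memory regions currently on the GPU
        self.calls = []        # recorded call order, for inspection

    async def resume_memory_occupation(self, tags=None):
        # Assumption: no tags means "resume everything".
        tags = tags or ["weights", "kv_cache"]
        self.resident.update(tags)
        self.calls.append(("resume", tuple(tags)))

    async def update_weights(self):
        if "weights" not in self.resident:
            raise RuntimeError("weights are not resident")
        self.calls.append(("update_weights", ()))


async def wake_up(engine, multi_stage_wake_up, free_cache_engine=True):
    # Stage 1: bring the weights back (or everything, if single-stage).
    if free_cache_engine:
        if multi_stage_wake_up:
            await engine.resume_memory_occupation(tags=["weights"])
        else:
            await engine.resume_memory_occupation()
    # Weights must be resident before they are overwritten.
    await engine.update_weights()
    # Stage 2: the KV cache is resumed only after the weight update.
    if multi_stage_wake_up and free_cache_engine:
        await engine.resume_memory_occupation(tags=["kv_cache"])
    return engine.calls


multi = asyncio.run(wake_up(MockEngine(), multi_stage_wake_up=True))
single = asyncio.run(wake_up(MockEngine(), multi_stage_wake_up=False))
```

In the multi-stage case the recorded order is resume weights, update weights, resume KV cache. In the single-stage case one resume call precedes the update.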

verl/workers/sharding_manager/fsdp_sglang.py

chenhaiq · 2025-07-04T10:32:05Z

@zhaochenyang20 we may need to discuss how to merge the sglang rollout features with the agent loop so that new features stay in the same branch. cc @wuxibin89

verl/workers/sharding_manager/fsdp_sglang.py

wuxibin89 · 2025-07-04T13:14:42Z

We should also fix the megatron backend.

zhaochenyang20 · 2025-07-04T17:08:21Z

Thanks so much. We are also working on our side to fix this. sgl-project/sglang#7419

Does this PR solve the illegal memory access with verl's main and SGLang 0.4.8.post1?

Should we still make changes on our side? @hebiao064 @nanjiangwill

hebiao064 · 2025-07-04T18:03:36Z

I was not aware of this code path, to be honest. This fix makes sense to me; thanks for fixing it!

I think this comment, #2365 (comment), is valid: we can reuse __enter__ and __exit__ to reduce duplicate code, as the vllm server did.
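
One way to share the logic is to let the synchronous context-manager methods delegate to the async ones. This is only a sketch under assumptions: `ShardingManagerSketch` and the event strings are hypothetical, and `asyncio.run` works here only because no event loop is already running in the calling thread.

```python
import asyncio


class ShardingManagerSketch:
    """Hypothetical manager whose sync and async entry points share one code path."""

    def __init__(self):
        self.events = []

    async def wake_up(self):
        # Single place for memory resumption and weight sync.
        self.events.append("resume memory + sync weights")

    async def sleep(self):
        # Single place for releasing memory.
        self.events.append("release memory")

    def __enter__(self):
        asyncio.run(self.wake_up())
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        asyncio.run(self.sleep())
        return False  # do not swallow exceptions


with ShardingManagerSketch() as manager:
    manager.events.append("generate")
```

The async server path can await `wake_up()` and `sleep()` directly, while the synchronous `with` path goes through the same two methods, so a fix in one place covers both.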

chenhaiq · 2025-07-07T03:18:41Z

> We should also fix the megatron backend.

Fixed. Please review the code again.

chenhaiq · 2025-07-07T03:19:02Z

> I was not aware of this code path, to be honest. This fix makes sense to me; thanks for fixing it!
>
> I think this comment, #2365 (comment), is valid: we can reuse __enter__ and __exit__ to reduce duplicate code, as the vllm server did.

Resolved.

fix sglang async with Multi-stage Awake

f769172

chenhaiq requested review from hebiao064 and zhaochenyang20 July 4, 2025 10:30

gemini-code-assist bot reviewed Jul 4, 2025

wuxibin89 reviewed Jul 4, 2025

chenhaiq added 3 commits July 7, 2025 10:28

resolve duplicated code in __enter__ and __exit__

a49c300

Merge branch 'main' into fix_sglang_async_multi_stage_awake

67e7b44

fix megatron with sglang async

b4e7ddd

wuxibin89 approved these changes Jul 7, 2025

wuxibin89 merged commit 4c37c97 into volcengine:main Jul 7, 2025
45 checks passed

This was referenced Jul 7, 2025

[rollout] fix: sglang megatron backend missing generate function #2367

Closed

Error about the Async sglang #2366

Closed
