
Commit baacfb2

minleminzui authored and ocss884 committed
Doc: add an environment variable to fix unbalanced memory capacity (volcengine#1105)
If we use SGLang as the rollout engine, we should export SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK to avoid errors when GPU memory capacity is unbalanced across devices; please refer to [#5426 in sglang](sgl-project/sglang#5426).

# Why should we export SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK when using SGLang as the rollout engine in verl?

1. verl initializes an `SGLangRollout` module during rollout, which is used to evaluate/generate samples.
2. `SGLangRollout` initializes `VerlEngine`, which in turn initializes a `torch.distributed.DeviceMesh` to support TP.
3. `DeviceMesh.init()` internally checks the free GPU memory of all participating devices, and if the difference is too large (more than about 10%), it directly reports an error, preventing initialization failures or communication deadlocks.

# Why might there be inconsistent GPU memory?

## Ray distributed actors load the model at different times

verl uses Ray-based multi-process, multi-GPU concurrent training, and each `WorkerDict` may be called at a different time:

`self.rollout = SGLangRollout(...)`

Different workers initialize the model at different times → different memory usage.

## Delayed initialization causes memory imbalance

Some workers enter the model loading/inference path earlier than others, e.g. via `generate_sequences()` or `compute_log_prob()`. The memory of early-loaded workers has already been consumed by the model, while the memory of late-loaded workers is still mostly free → a large memory gap.

## verl+SGLang's TP initialization uses "all-device broadcast", but there is no uniform release timing

`SGLangRollout` only needs to involve the GPUs used by the rollout workers, but its `VerlEngine` initialization calls `torch.distributed.init_process_group()` and broadcasts a set of weights. As a result, non-rollout GPUs also participate in the communication; then, when `DeviceMesh` is initialized, the "inconsistent memory" error is reported.

## Different loading modes of FSDP/TP models also cause deviations

If the following parameters are set:

```
actor.fsdp_config.param_offload=True
ref.fsdp_config.param_offload=True
```

Some workers keep their parameters on CPU, while others have already sharded their parameters to GPU. This also creates an asymmetric memory distribution.

---------

Co-authored-by: ocss884 <[email protected]>
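
For reference, the same switch can also be set from Python before the rollout engine is constructed. This is only a sketch of an equivalent to the shell `export` documented below, not a required step:

```python
import os

# Equivalent to `export SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=True` in the shell.
# It must be set before the SGLang engine (and thus the memory check) is initialized.
os.environ["SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK"] = "True"
```
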
1 parent 42ce936 commit baacfb2

File tree

1 file changed: +46 -0


docs/workers/sglang_worker.rst

Lines changed: 46 additions & 0 deletions
@@ -37,6 +37,7 @@ We use Qwen/Qwen2-7B-Instruct on the gsm8k dataset for a simple test.

.. code-block:: bash

    export SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK=True
    PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
     data.train_files=$HOME/data/gsm8k/train.parquet \
     data.val_files=$HOME/data/gsm8k/test.parquet \
@@ -70,6 +71,51 @@ We use Qwen/Qwen2-7B-Instruct on the gsm8k dataset for a simple test.
     trainer.test_freq=10 \
     trainer.total_epochs=15 2>&1 | tee verl_demo.log

Why export SGL_DISABLE_TP_MEMORY_INBALANCE_CHECK?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

1. ``verl`` initializes an ``SGLangRollout`` module during rollout, which is used to evaluate/generate samples.

2. ``SGLangRollout`` initializes ``VerlEngine``, which in turn creates a ``torch.distributed.DeviceMesh`` to support Tensor Parallel (TP).

3. ``DeviceMesh.init()`` internally checks the free GPU memory of all participating devices. If the difference is too large (more than ~10%), it raises an error immediately to prevent initialization failures or communication deadlocks; see the sketch below.
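
The check that this variable disables is conceptually similar to the following sketch (a simplified illustration, not the actual ``DeviceMesh``/SGLang code; it assumes ``torch.distributed`` is already initialized and all ranks have identical GPUs):

.. code-block:: python

    import torch
    import torch.distributed as dist

    def check_memory_balance(tolerance: float = 0.1) -> None:
        # Free/total bytes on this rank's current GPU.
        free, total = torch.cuda.mem_get_info()
        # Collect every rank's free-memory value.
        gathered = [None] * dist.get_world_size()
        dist.all_gather_object(gathered, free)
        # Fail if the spread across ranks exceeds ~10% of total memory.
        if (max(gathered) - min(gathered)) / total > tolerance:
            raise RuntimeError(
                f"Unbalanced GPU memory across ranks: "
                f"min={min(gathered)}, max={max(gathered)}"
            )
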
Why might there be inconsistent GPU memory?
"""""""""""""""""""""""""""""""""""""""""""

**1. Ray distributed actors load the model at different times**

``verl`` uses Ray-based multi-process, multi-GPU concurrent training, and each ``WorkerDict`` may be called at a different time:

.. code-block:: python

    self.rollout = SGLangRollout(...)

Different workers therefore initialize the model at different times → different memory usage.

**2. Delayed initialization causes memory imbalance**

Some workers start model loading/inference (e.g., ``generate_sequences()``, ``compute_log_prob()``) earlier than others.
Early workers have already filled their GPU memory with model weights, while late workers' memory is still mostly free → a large memory gap appears; the sketch below shows how to inspect it.
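
A quick way to observe this (a diagnostic sketch, not part of ``verl``) is to log free GPU memory on each worker, for example right before ``SGLangRollout`` is constructed:

.. code-block:: python

    import torch

    def log_free_memory(tag: str = "") -> None:
        # torch.cuda.mem_get_info() returns (free_bytes, total_bytes) for the current device.
        free, total = torch.cuda.mem_get_info()
        print(f"[{tag}] free {free / 1e9:.1f} GB / total {total / 1e9:.1f} GB")
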
**3. SGLang's TP init uses "all-device broadcast", but there is no uniform release timing**

Although ``SGLangRollout`` may only involve a subset of GPUs, its ``VerlEngine`` initialization calls ``torch.distributed.init_process_group()`` and broadcasts weights, so (as the sketch after this list illustrates):

- Non-rollout GPUs also join the communication.
- Later on, the ``DeviceMesh`` init fails with the "inconsistent memory" error.
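
A standalone sketch of the effect (illustrative only, not ``VerlEngine``'s actual code; launch it with ``torchrun`` on a single multi-GPU node):

.. code-block:: python

    import torch
    import torch.distributed as dist

    def broadcast_dummy_weights() -> None:
        dist.init_process_group(backend="nccl")
        # Map each rank to a local GPU (single-node assumption).
        torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
        # Every rank, including ranks that never serve rollout, must allocate a
        # receive buffer, so the broadcast consumes GPU memory on all of them.
        weights = torch.empty(1024 * 1024, device="cuda")
        dist.broadcast(weights, src=0)
        dist.destroy_process_group()

    if __name__ == "__main__":
        broadcast_dummy_weights()
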
**4. Different FSDP/TP loading behaviors also lead to mismatch**

If using:

.. code-block:: bash

    actor.fsdp_config.param_offload=True
    ref.fsdp_config.param_offload=True

Then some workers keep their parameters on CPU while others have already sharded theirs to GPU → an asymmetric memory layout.

Using SGLang as the Inference Backend for PPO Training Across Multiple Machines
------------------------------------------------------------------------------
SGLang also supports running verl's Ray-based cross-machine inference in IPv4 and IPv6 scenarios. In the script below, we use TP=16 for cross-machine inference. Suppose we have two interconnected machines: node0 with IP 10.94.16.4 and node1 with IP 10.94.16.5.
