
Conversation

@stargazerZJ

partial rollout PR draft

Checklist Before Starting

  • Search for similar PR(s).

What does this PR do?

Support the partial rollout feature: set vLLM's maximum output length to an integer fraction of config.response_length. If a model response is not completed in the current round, its generation continues in the next iteration with updated weights. This trick unlocks a significant rollout time reduction without sacrificing model performance.
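
A minimal sketch of the core idea (illustrative Python; SamplingParams is the real vLLM API, everything else is an assumption about how this PR wires it up):

    # Illustrative sketch: per-round token budget for partial rollout.
    from vllm import SamplingParams

    response_length = 4096             # config.response_length
    max_split = 2                      # algorithm.partial_rollout_max_split
    assert response_length % max_split == 0

    per_round_max_tokens = response_length // max_split
    sampling_params = SamplingParams(max_tokens=per_round_max_tokens)

    # A response that reaches per_round_max_tokens without emitting EOS is kept
    # in a buffer; in the next iteration its tokens are appended to the prompt
    # and generation resumes under the updated weights.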

partial rollout with 4K max response length, compared to baseline:
(three comparison plots)
Note on the 3rd plot: we didn't enable testing on the 2-GPU baseline runs because they were already so slow. We will update the 4-GPU baseline run after it finishes in a few hours.

High-Level Design

(high-level design diagram)

Specific Changes

  • Add a buffer pool and filter logic in the fit function (see the sketch after this list)
  • Concatenate partial responses to the prompt inputs in the vLLMRollout class
  • Add a split method to DataProto
  • Add a configuration entry
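
A toy, self-contained sketch of the buffer-pool/filter step (plain Python lists stand in for DataProto; all names are illustrative, not the PR's actual API):

    # Toy sketch of the buffer pool in the fit loop (illustrative names only).
    EOS = 0

    def split_finished(responses):
        """Split responses into finished (EOS emitted) and unfinished ones."""
        finished = [r for r in responses if EOS in r]
        unfinished = [r for r in responses if EOS not in r]
        return finished, unfinished

    buffer_pool = []  # partially generated samples carried over between iterations

    def fit_step(new_prompts, generate):
        global buffer_pool
        batch = buffer_pool + new_prompts   # merge carried-over partials with the fresh batch
        responses = generate(batch)         # one generation round under the reduced token budget
        finished, unfinished = split_finished(responses)
        buffer_pool = unfinished            # unfinished samples wait for the next round
        return finished                     # only finished samples enter training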

API

There are mainly two areas where compatibility is potentially tricky:

  1. In the original architecture, the main loop sends only one copy of each prompt to vLLM, and vLLM uses the SamplingParams.n parameter to generate multiple responses before returning them. In our implementation, we must duplicate the prompts in the main loop and send multiple copies to vLLM (see the sketch after this list).
  2. The fit function in verl takes a batch from the dataloader in each iteration and continuously adds keys to it. In our implementation, there is a step where the partial_batch (partially generated by vLLM) needs to be merged with the initial batch taken from the dataloader.
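
A minimal sketch of point 1 (the vLLM calls are real API; the surrounding code and model path are illustrative):

    # Sketch: duplicate prompts in the main loop instead of relying on SamplingParams.n.
    from vllm import LLM, SamplingParams

    llm = LLM(model="Qwen/Qwen3-0.6B-Base")   # any model path works; illustrative only
    prompts = ["1 + 1 =", "The capital of France is"]
    n = 5                                     # rollout n

    # Original architecture: one copy per prompt, vLLM fans out internally via n.
    outputs = llm.generate(prompts, SamplingParams(n=n, max_tokens=64))

    # This PR: duplicate each prompt n times and sample once per copy, so every
    # (prompt, partial response) pair can be resumed independently in later rounds.
    duplicated = [p for p in prompts for _ in range(n)]
    outputs = llm.generate(duplicated, SamplingParams(n=1, max_tokens=64))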

Usage Example

    algorithm.partial_rollout_max_split=2
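
In context (the other keys shown are standard verl config names used only to show where the new entry sits; exact names may differ across verl versions):

    data.max_response_length=4096
    actor_rollout_ref.rollout.n=5
    algorithm.partial_rollout_max_split=2

With these values, vLLM generates at most 4096 / 2 = 2048 tokens per round; an unfinished response is carried over and continued in the next iteration.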

Test

Setup: Qwen3 0.6B base; MATH dataset; batch size 1024; rollout n = 5; max response length 4K; 2 H100 GPUs; Megatron trainer

We run:

  • Partial rollout 4K 2 split
  • Partial rollout 4K 4 split
  • Baseline 4K
  • Baseline 2K: max response length reduced to 2k, for reference
  • Baseline 1K
  • Baseline 4K 4GPU

Results, partial rollout 4K 2 split compared to baseline 4K:

  • Overall experiment speed (16h vs. 48h) and generation speed (estimated at 20%) are significantly improved; both are even faster than baseline 2K and baseline 4K 4GPU.
  • Reward curves are basically consistent after around 15 steps (out of 105 steps in total). (No discernible difference among baseline 4K, 2K, 1K, and partial 4K, etc.)
  • In the early stages of training, the reward increase of partial 4K 2 split lags behind the baseline by about 1.5 steps, and partial 4K 4 split lags even more.
  • In the middle and late stages of training, the mean response length of partial 4K and baseline_4096_4GPU show an upward trend, while other baselines tend to stabilize.

(results plots)

We also tested these settings: Qwen3 0.6B base; GSM8K dataset; batch size 1024; rollout n = 5; max prompt length 512; 2 GPUs; FSDP trainer.

3 experiments:

  • baseline_512
  • partial_512
  • baseline_256: max response length reduced to 256

In these experiments, the speed gain is about 30% overall and 50% in generation. The two training-dynamics issues mentioned above are less visible.
(plots for the GSM8K runs)

Additional Info.

  • Issue Number: Fixes issue # or discussion # if any.
  • Training: none
  • Inference: vLLM

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks.
  • Add [BREAKING] to the PR title if it breaks any API.
  • Update the documentation about your changes in the docs.
  • Add CI test(s) if necessary.

@feifeibear
Collaborator

Hi @stargazerZJ , thank you for your design diagrams and draft code.

We originally planned to develop the partial rollout feature after completing the server-based rollout refactoring. #1636
The server-based rollout will enable more efficient inter-request asynchrony and load balancing between inference dp ranks, which is a prerequisite for partial rollout.

Would you mind joining us in the refactoring of the rollout module? We'd like to hear your thoughts first.

@lilei199908
Contributor

Great work! I'm wondering if there have been any tests on the effectiveness of larger models (due to multi-step asynchrony from partial rollouts) and the efficiency on larger clusters.

@eric-haibin-lin
Collaborator

Are you on slack? Looking forward to more discussions.

@eric-haibin-lin left a comment
Collaborator

A CI test is needed

@eric-haibin-lin
Collaborator

please also resolve conflicts with main, thx

@stargazerZJ
Author

I'm back! I have resolved conflicts at
https://github.com/stargazerZJ/verl/tree/partial-2
and will test it tomorrow

for sample_id in range(len(output.outputs)):
    response_ids = output.outputs[sample_id].token_ids
    response.append(response_ids)
    filtered_response = [id if id < 151669 else 0 for id in response_ids]
Collaborator

what is this magic number? 151669


Looks like the vocab size of a certain model, but why do you need the filter here? Did you encounter the model generating token IDs larger than what's in the tokenizer?

Author

Yes. Qwen3 will do that, and it causes vLLM to raise an exception when Qwen's output is fed back to its input. I'm sorry I forgot to mention this when writing the PR.

Author

The same happens to Qwen 2.5 Math too.

@eric-haibin-lin
Collaborator

Splitting based on the vLLM max gen tokens does not seem general, as the actual rollout time may be affected by tool call/env interaction time. This needs careful consideration.

Comment on lines +783 to +784
first_proto = data_proto.select_idxs(filter_mask)
second_proto = data_proto.select_idxs(inverse_mask)


You will run into issues here if filter_mask is either all True or all False:
RuntimeError: batch dimension mismatch, got self.batch_size=torch.Size([32]) and value.shape=torch.Size([0, 4096]).

Basically select_idxs won't support empty selection
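
(A possible guard, purely as a sketch; it assumes torch boolean masks and the select_idxs call from the excerpt above:)

    # Sketch of a guard against empty selections (illustrative only).
    if filter_mask.all():
        first_proto, second_proto = data_proto, None       # nothing to carry over
    elif not filter_mask.any():
        first_proto, second_proto = None, data_proto       # nothing finished yet
    else:
        first_proto = data_proto.select_idxs(filter_mask)
        second_proto = data_proto.select_idxs(inverse_mask)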

Author

Thank you. I'll test further as soon as the machines are vacant. I remember encountering and fixing similar issues during early development. I think during later training runs there were cases when all/no prompts were finished, and there was no RuntimeError.


nvm, turned out I was using an older version of verl (0.3). I think this issue has been fixed. thanks!

@chenjiaoAngel
Contributor

Excuse me, I used this MR in my repo, but it crashes when partial_rollout_max_split > 1. The error info is:
model: Qwen2.5-VL-7B-Instruct
datasets: geo3k
max_response_length=2k
partial_rollout_max_split=2
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 3072 but got size 1024 for tensor number 1 in the list.
(error screenshot)

@chenxia-han

Can we specify any maximum response length, or must the original maximum response length be divisible by it?

@stargazerZJ
Author

Excuse me, I used this MR in my repo, but it crashes when partial_rollout_max_split > 1. The error info is: model: Qwen2.5-VL-7B-Instruct, datasets: geo3k, max_response_length=2k, partial_rollout_max_split=2, RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 3072 but got size 1024 for tensor number 1 in the list.

Sorry, multi-modal input support is not implemented in this draft PR. Supporting it requires manually adding the related fields to partial_batch and the other batches. It should be trivial, but I am just not familiar with multimodal training at the moment.

BTW, sglang is not supported either.

@stargazerZJ
Author

Can we specify any maximum response length, or must the original maximum response length be divisible by it?

The original maximum response length must be divisible by it; otherwise an assertion error is raised.

@chenxia-han

Can we specify any maximum response length, or must the original maximum response length be divisible by it?

The original maximum response length must be divisible by it; otherwise an assertion error is raised.

I’m a bit confused. Instead of relying on age == max_split, we could simply check the response length to determine whether it has finished. That would allow us to use any maximum response length.

@stargazerZJ
Author

Can we specify any maximum response length, or must the original maximum response length be divisible by it?

The original maximum response length must be divisible by it; otherwise an assertion error is raised.

I’m a bit confused. Instead of relying on age == max_split, we could simply check the response length to determine whether it has finished. That would allow us to use any maximum response length.

You are correct that any maximum response length could work this way, but that requires iterating over raw_input_ids to get the length of each response. Also, the vLLM sampling params would need to be copied per request to allow a different max length for each request. It can work, just slightly more complex.
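
(A sketch of what that could look like; vLLM does accept one SamplingParams per prompt, everything else here is illustrative:)

    # Illustrative: per-request token budgets via a list of SamplingParams.
    from vllm import SamplingParams

    max_response_length = 4096          # fixed overall budget
    per_round_budget = 1500             # need not divide the overall budget evenly
    generated_so_far = [0, 512, 1200]   # tokens already produced per request

    sampling_params = [
        SamplingParams(max_tokens=min(per_round_budget, max_response_length - done))
        for done in generated_so_far
    ]
    # outputs = llm.generate(prompts, sampling_params)  # one SamplingParams per prompt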

@stargazerZJ
Author

Moreover, an arbitrary output length limit is difficult to achieve because we need a preconfigured maximum response length to determine the size of the tensors, such as the input_ids fed into the training engine and the rollout engine throughout the fit function. FSDPTrainer also needs fixed-size tensors when SP is not turned on (I may misremember). Moreover, even if fixed-size input_ids are never needed in a specific training setting, they are transferred here and there and need a size.

@chenxia-han

Moreover, an arbitrary output length limit is difficult to achieve because we need a preconfigured maximum response length to determine the size of the tensors, such as the input_ids fed into the training engine and the rollout engine throughout the fit function. FSDPTrainer also needs fixed-size tensors when SP is not turned on (I may misremember). Moreover, even if fixed-size input_ids are never needed in a specific training setting, they are transferred here and there and need a size.

By “any maximum response length,” I mean a fixed (not dynamic) value that doesn’t necessarily divide evenly into the original maximum response length. For instance, if the original maximum is 40K, the current implementation doesn’t support 7K because 40K % 7K ≠ 0.

@stargazerZJ
Author

Thanks for clarification. As I previously mentioned, it's possible, just slightly more complex.

@chenxia-han

Thanks for clarification. As I previously mentioned, it's possible, just slightly more complex.

I see. Thanks for the reply.

@chenjiaoAngel
Contributor

Excuse me, I used this MR in my repo, but it crashes when partial_rollout_max_split > 1. The error info is: model: Qwen2.5-VL-7B-Instruct, datasets: geo3k, max_response_length=2k, partial_rollout_max_split=2, RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 3072 but got size 1024 for tensor number 1 in the list.

Sorry, multi-modal input support is not implemented in this draft PR. Supporting it requires manually adding the related fields to partial_batch and the other batches. It should be trivial, but I am just not familiar with multimodal training at the moment.

BTW, sglang is not supported either.

But I tested on another model, Qwen2.5-32B, and it hits this problem (I only ran successfully on the Qwen2.5-7B model, and the advantage there is not obvious).
The first problem is in concat: the shapes don't match. I used a padding method to work around it (self.padding(chunks - len(self) % chunks, "first")).
(screenshots)

The second problem is:
(error screenshot)

@stargazerZJ
Author

But I tested on another model, Qwen2.5-32B, and it hits this problem (I only ran successfully on the Qwen2.5-7B model, and the advantage there is not obvious). The first problem is in concat: the shapes don't match. I used a padding method to work around it (self.padding(chunks - len(self) % chunks, "first")).

The second problem is: (error screenshot)

Ah, did you export VERL_AUTO_PADDING=1? This is required.

Sorry for the lack of documentation.

@stargazerZJ
Author

During later testing and communication with the verl team, we discovered that this trick does not scale well to more GPUs, larger models, and the DAPO algorithm with the overlong buffer.

When tested with 8 H100 GPUs on a single node and Qwen2.5-Math-7B, using https://github.com/volcengine/verl/blob/main/recipe/dapo/test_dapo_7b_math.sh, partial rollout with 2, 4, or 8 splits yields at most a 1~2% speed gain (the fastest is the 4-split setting).

We can conclude that this trick works best with smaller models, fewer GPUs, and no overlong buffer.

I decided to close the PR for this reason. Thank you to everyone that helped me during my SJTU CS 2916 final project!

@chenxia-han

During later testing and communication with the verl team, we discovered that this trick does not scale well to more GPUs, larger models, and the DAPO algorithm with the overlong buffer.

When tested with 8 H100 GPUs on a single node and Qwen2.5-Math-7B, using https://github.com/volcengine/verl/blob/main/recipe/dapo/test_dapo_7b_math.sh, partial rollout with 2, 4, or 8 splits yields at most a 1~2% speed gain (the fastest is the 4-split setting).

We can conclude that this trick works best with smaller models, fewer GPUs, and no overlong buffer.

I decided to close the PR for this reason. Thank you to everyone that helped me during my SJTU CS 2916 final project!

Could you elaborate on why it doesn’t scale? I’m genuinely curious about the bottleneck.

@chenjiaoAngel
Contributor

export VERL_AUTO_PADDING=1

OK, I will try this. But will using export VERL_AUTO_PADDING=1 influence the profiler?

