
Conversation

@stargazerZJ

partial rollout PR draft

Checklist Before Starting

  • Search for similar PR(s).

What does this PR do?

Support the partial rollout feature: set vLLM's maximum output length to an integer fraction of config.response_length. If a model response is not completed in the current round, its generation continues in the next iteration with updated weights. This trick unlocks a significant rollout time reduction without sacrificing model performance.
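
A minimal sketch of the core idea (illustrative Python; SamplingParams is the real vLLM API, everything else is an assumption about how this PR wires it up):

    # Illustrative sketch: per-round token budget for partial rollout.
    from vllm import SamplingParams

    response_length = 4096             # config.response_length
    max_split = 2                      # algorithm.partial_rollout_max_split
    assert response_length % max_split == 0

    per_round_max_tokens = response_length // max_split
    sampling_params = SamplingParams(max_tokens=per_round_max_tokens)

    # A response that reaches per_round_max_tokens without emitting EOS is kept
    # in a buffer; in the next iteration its tokens are appended to the prompt
    # and generation resumes under the updated weights.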

partial rollout with 4K max response length, compared to baseline:
(three comparison plots)
Note on the 3rd plot: we didn't enable testing on the 2-GPU baseline runs because they were already so slow. We will update the 4-GPU baseline run after it finishes in a few hours.

High-Level Design

(high-level design diagram)

Specific Changes

  • Add a buffer pool and filter logic in the fit function (see the sketch after this list)
  • Concatenate partial responses to the prompt inputs in the vLLMRollout class
  • Add a split method to DataProto
  • Add a configuration entry
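
A toy, self-contained sketch of the buffer-pool/filter step (plain Python lists stand in for DataProto; all names are illustrative, not the PR's actual API):

    # Toy sketch of the buffer pool in the fit loop (illustrative names only).
    EOS = 0

    def split_finished(responses):
        """Split responses into finished (EOS emitted) and unfinished ones."""
        finished = [r for r in responses if EOS in r]
        unfinished = [r for r in responses if EOS not in r]
        return finished, unfinished

    buffer_pool = []  # partially generated samples carried over between iterations

    def fit_step(new_prompts, generate):
        global buffer_pool
        batch = buffer_pool + new_prompts   # merge carried-over partials with the fresh batch
        responses = generate(batch)         # one generation round under the reduced token budget
        finished, unfinished = split_finished(responses)
        buffer_pool = unfinished            # unfinished samples wait for the next round
        return finished                     # only finished samples enter training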

API

There are mainly two areas where compatibility is potentially tricky:

  1. In the original architecture, the main loop sends only one copy of each prompt to vLLM, and vLLM uses the SamplingParams.n parameter to generate multiple responses before returning them. In our implementation, we must duplicate the prompts in the main loop and send multiple copies to vLLM (see the sketch after this list).
  2. The fit function in verl takes a batch from the dataloader in each iteration and continuously adds keys to it. In our implementation, there is a step where the partial_batch (partially generated by vLLM) needs to be merged with the initial batch taken from the dataloader.
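
A minimal sketch of point 1 (the vLLM calls are real API; the surrounding code and model path are illustrative):

    # Sketch: duplicate prompts in the main loop instead of relying on SamplingParams.n.
    from vllm import LLM, SamplingParams

    llm = LLM(model="Qwen/Qwen3-0.6B-Base")   # any model path works; illustrative only
    prompts = ["1 + 1 =", "The capital of France is"]
    n = 5                                     # rollout n

    # Original architecture: one copy per prompt, vLLM fans out internally via n.
    outputs = llm.generate(prompts, SamplingParams(n=n, max_tokens=64))

    # This PR: duplicate each prompt n times and sample once per copy, so every
    # (prompt, partial response) pair can be resumed independently in later rounds.
    duplicated = [p for p in prompts for _ in range(n)]
    outputs = llm.generate(duplicated, SamplingParams(n=1, max_tokens=64))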

Usage Example

    algorithm.partial_rollout_max_split=2
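
In context (the other keys shown are standard verl config names used only to show where the new entry sits; exact names may differ across verl versions):

    data.max_response_length=4096
    actor_rollout_ref.rollout.n=5
    algorithm.partial_rollout_max_split=2

With these values, vLLM generates at most 4096 / 2 = 2048 tokens per round; an unfinished response is carried over and continued in the next iteration.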

Test

Setup: Qwen3 0.6B base; MATH dataset; batch size 1024; rollout n = 5; max response length 4K; 2 H100 GPUs; Megatron trainer

We run:

  • Partial rollout 4K 2 split
  • Partial rollout 4K 4 split
  • Baseline 4K
  • Baseline 2K: max response length reduced to 2k, for reference
  • Baseline 1K
  • Baseline 4K 4GPU

Results, partial rollout 4K 2 split compared to baseline 4K:

  • Overall experiment speed (16h vs. 48h) and generation speed (estimated at 20%) are significantly improved; both are even faster than baseline 2K and baseline 4K 4GPU.
  • Reward curves are basically consistent after around 15 steps (out of 105 steps in total). (No discernible difference among baseline 4K, 2K, 1K, and partial 4K, etc.)
  • In the early stages of training, the reward increase of partial 4K 2 split lags behind the baseline by about 1.5 steps, and partial 4K 4 split lags even more.
  • In the middle and late stages of training, the mean response length of partial 4K and baseline_4096_4GPU show an upward trend, while other baselines tend to stabilize.

(results plots)

We also tested these settings: Qwen3 0.6B base; GSM8K dataset; batch size 1024; rollout n = 5; max prompt length 512; 2 GPUs; FSDP trainer.

3 experiments:

  • baseline_512
  • partial_512
  • baseline_256: max response length reduced to 256

In these experiments, the speed gain is about 30% overall and 50% in generation. The two training-dynamics issues mentioned above are less visible.
(plots for the GSM8K runs)

Additional Info.

  • Issue Number: Fixes issue # or discussion # if any.
  • Training: none
  • Inference: vLLM

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks.
  • Add [BREAKING] to the PR title if it breaks any API.
  • Update the documentation about your changes in the docs.
  • Add CI test(s) if necessary.

@feifeibear
Collaborator

Hi @stargazerZJ , thank you for your design diagrams and draft code.

We originally planned to develop the partial rollout feature after completing the server-based rollout refactoring. #1636
The server-based rollout will enable more efficient inter-request asynchrony and load balancing between inference dp ranks, which is a prerequisite for partial rollout.

Would you mind joining us in the refactoring of the rollout module? We'd like to hear your thoughts first.

@lilei199908
Contributor

Great work! I'm wondering if there have been any tests on the effectiveness of larger models (due to multi-step asynchrony from partial rollouts) and the efficiency on larger clusters.

@eric-haibin-lin
Collaborator

Are you on slack? Looking forward to more discussions.

@eric-haibin-lin left a comment
Collaborator

A CI test is needed

@eric-haibin-lin
Collaborator

please also resolve conflicts with main, thx

@stargazerZJ
Author

I'm back! I have resolved conflicts at
https://github.com/stargazerZJ/verl/tree/partial-2
and will test it tomorrow

for sample_id in range(len(output.outputs)):
    response_ids = output.outputs[sample_id].token_ids
    response.append(response_ids)
    filtered_response = [id if id < 151669 else 0 for id in response_ids]
Collaborator

what is this magic number? 151669


Looks like the vocab size of a certain model, but why do you need the filter here? Did you encounter the model generating token IDs larger than what's in the tokenizer?

Author

Yes. Qwen3 will do that, and it causes vLLM to raise an exception when Qwen's output is fed back to its input. I'm sorry I forgot to mention this when writing the PR.

Author

The same happens to Qwen 2.5 Math too.

@eric-haibin-lin
Collaborator

Splitting based on the vLLM max gen tokens does not seem general, as the actual rollout time may be affected by tool call/env interaction time. This needs careful consideration.

Comment on lines +783 to +784
first_proto = data_proto.select_idxs(filter_mask)
second_proto = data_proto.select_idxs(inverse_mask)


You will run into issues here if filter_mask is either all True or all False:
RuntimeError: batch dimension mismatch, got self.batch_size=torch.Size([32]) and value.shape=torch.Size([0, 4096]).

Basically select_idxs won't support empty selection
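
(A possible guard, purely as a sketch; it assumes torch boolean masks and the select_idxs call from the excerpt above:)

    # Sketch of a guard against empty selections (illustrative only).
    if filter_mask.all():
        first_proto, second_proto = data_proto, None       # nothing to carry over
    elif not filter_mask.any():
        first_proto, second_proto = None, data_proto       # nothing finished yet
    else:
        first_proto = data_proto.select_idxs(filter_mask)
        second_proto = data_proto.select_idxs(inverse_mask)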

Author

Thank you. I'll test further as soon as the machines are vacant. I remember encountering and fixing similar issues during early development. I think during later training runs there were cases when all/no prompts were finished, and there was no RuntimeError.


nvm, turned out I was using an older version of verl (0.3). I think this issue has been fixed. thanks!

@chenjiaoAngel
Contributor

Excuse me, I used this MR in my repo, but it crashes when partial_rollout_max_split > 1. The error info is:
model: Qwen2.5-VL-7B-Instruct
datasets: geo3k
max_response_length=2k
partial_rollout_max_split=2
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 3072 but got size 1024 for tensor number 1 in the list.
(error screenshot)

@chenxia-han

Can we specify any maximum response length, or must the original maximum response length be divisible by it?

@stargazerZJ
Author

Excuse me, I used this MR in my repo, but it crashes when partial_rollout_max_split > 1. The error info is: model: Qwen2.5-VL-7B-Instruct, datasets: geo3k, max_response_length=2k, partial_rollout_max_split=2, RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 3072 but got size 1024 for tensor number 1 in the list.

Sorry, multi-modal input support is not implemented in this draft PR. Supporting it requires manually adding the related fields to partial_batch and the other batches. It should be trivial, but I am just not familiar with multimodal training at the moment.

BTW, sglang is not supported either.

@stargazerZJ
Author

Can we specify any maximum response length, or must the original maximum response length be divisible by it?

The original maximum response length must be divisible by it; otherwise an assertion error is raised.

@chenxia-han

Can we specify any maximum response length, or must the original maximum response length be divisible by it?

The original maximum response length must be divisible by it; otherwise an assertion error is raised.

I’m a bit confused. Instead of relying on age == max_split, we could simply check the response length to determine whether it has finished. That would allow us to use any maximum response length.

@stargazerZJ
Author

Can we specify any maximum response length, or must the original maximum response length be divisible by it?

The original maximum response length must be divisible by it; otherwise an assertion error is raised.

I’m a bit confused. Instead of relying on age == max_split, we could simply check the response length to determine whether it has finished. That would allow us to use any maximum response length.

You are correct that any maximum response length could work this way, but that requires iterating over raw_input_ids to get the length of each response. Also, the vLLM sampling params would need to be copied per request to allow a different max length for each request. It can work, just slightly more complex.
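
(A sketch of what that could look like; vLLM does accept one SamplingParams per prompt, everything else here is illustrative:)

    # Illustrative: per-request token budgets via a list of SamplingParams.
    from vllm import SamplingParams

    max_response_length = 4096          # fixed overall budget
    per_round_budget = 1500             # need not divide the overall budget evenly
    generated_so_far = [0, 512, 1200]   # tokens already produced per request

    sampling_params = [
        SamplingParams(max_tokens=min(per_round_budget, max_response_length - done))
        for done in generated_so_far
    ]
    # outputs = llm.generate(prompts, sampling_params)  # one SamplingParams per prompt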

@stargazerZJ
Author

Moreover, an arbitrary output length limit is difficult to achieve because we need a preconfigured maximum response length to determine the size of the tensors, such as the input_ids fed into the training engine and the rollout engine throughout the fit function. FSDPTrainer also needs fixed-size tensors when SP is not turned on (I may misremember). Moreover, even if fixed-size input_ids are never needed in a specific training setting, they are transferred here and there and need a size.

@chenxia-han

Moreover, an arbitrary output length limit is difficult to achieve because we need a preconfigured maximum response length to determine the size of the tensors, such as the input_ids fed into the training engine and the rollout engine throughout the fit function. FSDPTrainer also needs fixed-size tensors when SP is not turned on (I may misremember). Moreover, even if fixed-size input_ids are never needed in a specific training setting, they are transferred here and there and need a size.

By “any maximum response length,” I mean a fixed (not dynamic) value that doesn’t necessarily divide evenly into the original maximum response length. For instance, if the original maximum is 40K, the current implementation doesn’t support 7K because 40K % 7K ≠ 0.

@stargazerZJ
Author

Thanks for clarification. As I previously mentioned, it's possible, just slightly more complex.

@chenxia-han

Thanks for clarification. As I previously mentioned, it's possible, just slightly more complex.

I see. Thanks for the reply.

@chenjiaoAngel
Contributor

Excuse me, I used this MR in my repo, but it crashes when partial_rollout_max_split > 1. The error info is: model: Qwen2.5-VL-7B-Instruct, datasets: geo3k, max_response_length=2k, partial_rollout_max_split=2, RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 3072 but got size 1024 for tensor number 1 in the list.

Sorry, multi-modal input support is not implemented in this draft PR. Supporting it requires manually adding the related fields to partial_batch and the other batches. It should be trivial, but I am just not familiar with multimodal training at the moment.

BTW, sglang is not supported either.

But I tested on another model, Qwen2.5-32B, and it hits this problem (I only ran successfully on the Qwen2.5-7B model, and the advantage there is not obvious).
The first problem is in concat: the shapes don't match. I used a padding method to work around it (self.padding(chunks - len(self) % chunks, "first")).
(screenshots)

The second problem is:
(error screenshot)

@stargazerZJ
Author

But I tested on another model, Qwen2.5-32B, and it hits this problem (I only ran successfully on the Qwen2.5-7B model, and the advantage there is not obvious). The first problem is in concat: the shapes don't match. I used a padding method to work around it (self.padding(chunks - len(self) % chunks, "first")).

The second problem is: (error screenshot)

Ah, did you export VERL_AUTO_PADDING=1? This is required.

Sorry for the lack of documentation.

@stargazerZJ
Author

During later testing and communication with the verl team, we discovered that this trick does not scale well to more GPUs, larger models, and the DAPO algorithm with the overlong buffer.

When tested with 8 H100 GPUs on a single node and Qwen2.5-Math-7B, using https://github.com/volcengine/verl/blob/main/recipe/dapo/test_dapo_7b_math.sh, partial rollout with 2, 4, or 8 splits yields at most a 1~2% speed gain (the fastest is the 4-split setting).

We can conclude that this trick works best with smaller models, fewer GPUs, and no overlong buffer.

I decided to close the PR for this reason. Thank you to everyone that helped me during my SJTU CS 2916 final project!

@chenxia-han

During later testing and communication with the verl team, we discovered that this trick does not scale well to more GPUs, larger models, and the DAPO algorithm with the overlong buffer.

When tested with 8 H100 GPUs on a single node and Qwen2.5-Math-7B, using https://github.com/volcengine/verl/blob/main/recipe/dapo/test_dapo_7b_math.sh, partial rollout with 2, 4, or 8 splits yields at most a 1~2% speed gain (the fastest is the 4-split setting).

We can conclude that this trick works best with smaller models, fewer GPUs, and no overlong buffer.

I decided to close the PR for this reason. Thank you to everyone that helped me during my SJTU CS 2916 final project!

Could you elaborate on why it doesn’t scale? I’m genuinely curious about the bottleneck.

@chenjiaoAngel
Contributor

export VERL_AUTO_PADDING=1

OK, I will try this. But will using export VERL_AUTO_PADDING=1 influence the profiler?

