[Feat]: Implement partial rollout #1826
Conversation
Hi @stargazerZJ, thank you for your design diagrams and draft code. We originally planned to develop the partial rollout feature after completing the server-based rollout refactoring (#1636). Would you mind joining us in the refactoring of the rollout module? We'd like to hear your thoughts first.
Great work! I'm wondering whether there have been any tests of the effectiveness on larger models (given the multi-step asynchrony introduced by partial rollouts) and of the efficiency on larger clusters.
Are you on Slack? Looking forward to more discussions.
eric-haibin-lin left a comment:
A CI test is needed
please also resolve conflicts with main, thx
I'm back! I have resolved conflicts at
```python
for sample_id in range(len(output.outputs)):
    response_ids = output.outputs[sample_id].token_ids
    response.append(response_ids)
    filtered_response = [id if id < 151669 else 0 for id in response_ids]
```
What is this magic number, 151669?
Looks like the vocab size of a certain model, but why do you need the filter here? Did you encounter the model generating token IDs larger than what's in the tokenizer?
Yes. Qwen3 will do that, and it causes vLLM to raise an exception when Qwen's output is fed back as its input. I'm sorry I forgot to mention this when writing the PR.
The same happens to Qwen 2.5 Math too.
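For reference, here is a sketch of a more general version of that filter, clamping against the tokenizer's vocabulary size instead of a hard-coded constant. This is illustrative only, not the PR's actual code; `tokenizer` is assumed to be a Hugging Face tokenizer.

```python
def filter_out_of_vocab(response_ids, tokenizer, fallback_id=0):
    """Replace out-of-vocabulary token ids (which some models occasionally emit)
    before the response is fed back to vLLM as a prompt.

    Illustrative sketch; not the PR's actual implementation.
    """
    vocab_size = len(tokenizer)  # instead of the hard-coded 151669 above
    return [tok if tok < vocab_size else fallback_id for tok in response_ids]
```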
Splitting based on vLLM's max generation tokens does not seem general, as the actual rollout time may also be affected by tool-call/environment interaction time. This needs careful consideration.
```python
first_proto = data_proto.select_idxs(filter_mask)
second_proto = data_proto.select_idxs(inverse_mask)
```
You will run into issues here if filter_mask is either all True or all False:
`RuntimeError: batch dimension mismatch, got self.batch_size=torch.Size([32]) and value.shape=torch.Size([0, 4096])`
Basically, select_idxs doesn't support empty selection.
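A minimal sketch of one way to guard against this degenerate-mask case (illustrative only; it assumes `filter_mask` is a boolean torch tensor and that `data_proto` supports boolean `select_idxs` as in the snippet above):

```python
import torch

def split_by_mask(data_proto, filter_mask: torch.Tensor):
    """Split into (selected, rest), avoiding empty selections.

    Illustrative workaround, not the PR's actual code.
    """
    if filter_mask.all():        # everything selected -> second half would be empty
        return data_proto, None
    if not filter_mask.any():    # nothing selected -> first half would be empty
        return None, data_proto
    return data_proto.select_idxs(filter_mask), data_proto.select_idxs(~filter_mask)
```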
Thank you. I'll test further as soon as the machines are free. I remember encountering and fixing similar issues during early development. I think during later training runs there were cases where all or none of the prompts were finished, and there was no RuntimeError.
nvm, turned out I was using an older version of verl (0.3). I think this issue has been fixed. thanks!
Can we specify any maximum response length, or must the original maximum response length be divisible by it?
The original maximum response length must be divisible by it; otherwise an assertion error is raised.
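A small illustration of that constraint (variable names are illustrative, not the PR's actual ones):

```python
# The per-round vLLM output limit must divide the overall response budget exactly.
max_response_length = 40 * 1024     # overall budget ("40K")
per_round_max_tokens = 4 * 1024     # vLLM max output length per rollout round

assert max_response_length % per_round_max_tokens == 0, (
    "per-round limit must divide max_response_length; "
    "otherwise an assertion error is raised"
)
num_rounds = max_response_length // per_round_max_tokens  # -> 10
```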
I'm a bit confused. Instead of relying on
You are correct that
Moreover, an arbitrarily long output limit is difficult to achieve because we need a preconfigured
By "any maximum response length," I mean a fixed (not dynamic) value that doesn't necessarily divide evenly into the original maximum response length. For instance, if the original maximum is 40K, the current implementation doesn't support 7K because 40K % 7K ≠ 0.
Thanks for the clarification. As I mentioned previously, it's possible, just slightly more complex.
I see. Thanks for the reply.
During later testing and communication with the verl team, we discovered that this trick does not scale well to more GPUs, larger models, and the DAPO algorithm with the overlong buffer. Based on tests with 8 H100s on a single node, we conclude that this trick works best with smaller models, fewer GPUs, and no overlong buffer. I decided to close the PR for this reason. Thank you to everyone who helped me during my SJTU CS 2916 final project!
Could you elaborate on why it doesn't scale? I'm genuinely curious about the bottleneck.
OK, I will try this. But if I use `export VERL_AUTO_PADDING=1`, will it influence the profiler?








partial rollout PR draft
Checklist Before Starting
What does this PR do?
Supporting the partial rollout feature: set vLLM's maximum output length to an integer fraction of
`config.response_length`. If a model response is not completed in the current round, its generation can continue in the next iteration with updated weights. This trick unlocked a significant rollout time reduction without sacrificing model performance.
Partial rollout with 4K max response length, compared to baseline:



Note on the 3rd plot: We didn't enable testing on the 2-GPU baseline runs because they were already so slow. We will update the 4-GPU baseline runs after they finish in a few hours.
High-Level Design
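The design details themselves were not preserved here; the sketch below only illustrates the core idea described above (a per-round generation budget, with unfinished responses carried into the next training iteration under updated weights). All helper names (`merge`, `is_finished`, `train_step`) and the rollout/batch interfaces are stand-ins, not the PR's actual APIs.

```python
def fit_with_partial_rollout(dataloader, rollout, response_length, num_splits=2):
    """Illustrative pseudocode of the partial-rollout idea (not the PR's code)."""
    per_round_max_tokens = response_length // num_splits
    partial_batch = None  # unfinished samples carried over from the previous step

    for batch in dataloader:
        if partial_batch is not None:
            # Resume unfinished prompts; generation continues with the *updated* weights.
            batch = merge(batch, partial_batch)

        # Each round generates at most per_round_max_tokens new tokens per sample.
        gen_out = rollout.generate(batch, max_tokens=per_round_max_tokens)

        finished_mask = is_finished(gen_out)           # hit EOS or the overall budget
        finished = gen_out.select_idxs(finished_mask)  # used for this step's update
        partial_batch = gen_out.select_idxs(~finished_mask)

        train_step(finished)
```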
Specific Changes
- the `fit` function
- the `vLLMRollout` class
- a new `split` method in the `DataProto` API
There are mainly two areas where compatibility is potentially tricky:
- vLLM normally relies on the `SamplingParams.n` parameter to generate multiple responses before returning them. In our implementation, we must duplicate the prompts in the main loop and send multiple copies to vLLM.
- The `fit` function in verl takes a batch from the dataloader in each iteration and continuously adds keys to it. In our implementation, there's a step where the `partial_batch` (partially generated by vLLM) needs to be merged with the initial `batch` taken from the dataloader.

Usage Example
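The original usage example was not preserved here. Purely as a hypothetical illustration (these names are not the PR's real configuration keys), enabling the feature amounts to choosing an integer number of rollout rounds per response budget:

```python
# Hypothetical illustration only -- the PR's real option names are not shown here.
response_length = 4096            # config.response_length (overall budget)
num_rollout_splits = 2            # number of rounds a response may span

assert response_length % num_rollout_splits == 0
vllm_max_tokens = response_length // num_rollout_splits  # 2048 tokens per round
```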
Test
Setup: qwen3 0.6b base; MATH dataset; batch size 1024; rollout n = 5; max response length 4K; 2 H100 GPUs; Megatron trainer
We run:
Results, partial rollout 4K 2 split compared to baseline 4K:
- `baseline 2K` and `baseline 4K 4GPU`.
- `partial 4K 2 split` lags behind the baseline by about 1.5 steps, and `partial 4K 4 split` lags even more.
- `partial 4K` and `baseline_4096_4GPU` show an upward trend, while other baselines tend to stabilize.

We also tested these settings:




qwen3 0.6b base; GSM8K dataset; batch size 1024; rollout n = 5; max prompt length 512; 2 GPUs; FSDP trainer
3 experiments:
- `baseline_512`
- `partial_512`
- `baseline_256`: max response length 256

In these experiments, the speed gain is about 30% overall and 50% in generation. The aforementioned two problems in training dynamics are less visible.
Additional Info.
Checklist Before Submitting
- Add `[BREAKING]` to the PR title if it breaks any API.