
Support Megatron 0.11.0 and vLLM 0.8.2, update images to use latest vllm and Megatron #851


Closed
wants to merge 32 commits

Conversation

@ETOgaosion (Collaborator) commented Mar 31, 2025

The code is ready; it has been divided into 3 PRs.

This PR includes:

1. Support of Megatron 0.11.0 and vLLM 0.8.2

Per-tensor weight loading to reduce peak memory cost

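A minimal sketch of the per-tensor idea (assumptions: a hypothetical `load_weight_fn` callback on the rollout engine and dim-0 sharding; verl's actual code differs): gather the Megatron weights one tensor at a time over the tensor-parallel group, hand each full tensor to the inference engine, then free the buffers, so the temporary memory is bounded by the largest single tensor rather than a full state dict.

```python
import torch
import torch.distributed as dist

def sync_weights_per_tensor(named_params, tp_group, load_weight_fn):
    """named_params: iterable of (name, shard) pairs on this TP rank.
    load_weight_fn: hypothetical callback that loads one (name, full_tensor)
    into the rollout engine."""
    tp_size = dist.get_world_size(group=tp_group)
    for name, shard in named_params:
        # All-gather this tensor's shards across the tensor-parallel group.
        buffers = [torch.empty_like(shard) for _ in range(tp_size)]
        dist.all_gather(buffers, shard.data.contiguous(), group=tp_group)
        full_tensor = torch.cat(buffers, dim=0)  # assume dim-0 sharding for illustration
        load_weight_fn(name, full_tensor)
        # Free the temporaries before gathering the next tensor, so the peak
        # extra memory stays bounded by the largest single tensor.
        del buffers, full_tensor
```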

Align GPTModel names with the new vLLM 0.8.2 model interfaces

Remove the concept of micro-DP

There are many cases that micro-DP cannot handle without complicated logic, so we now use the tensor-parallel groups of Megatron and vLLM directly. Since the per-tensor all-gather has minimal peak memory cost (the temporary buffer is bounded by the largest single tensor rather than the whole model), the choice of communication group matters little.

Import Megatron modules only on GPU nodes

Importing megatron.core now ends up calling Triton, which probes the current GPU driver and raises an error when the import happens on a CPU-only node, so a lot of code was refactored to defer these imports (see the sketch below).
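A minimal sketch of the deferral pattern, not the PR's exact refactor: keep megatron.core out of module scope so CPU-only processes (e.g. the Ray driver) never trigger Triton's driver probe, and import it inside functions that only run on GPU workers.

```python
def init_megatron_parallel_state(tp_size: int = 2, pp_size: int = 2):
    # Deferred import: only GPU worker processes execute this function, so
    # CPU-only nodes never import megatron.core (and thus never import Triton).
    # Assumes torch.distributed has already been initialized on the worker.
    from megatron.core import parallel_state

    parallel_state.initialize_model_parallel(
        tensor_model_parallel_size=tp_size,      # hypothetical sizes for illustration
        pipeline_model_parallel_size=pp_size,
    )
```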

2. Update the Docker image and docs

  • TODO: the vLLM test test_vllm_hf_loader.py still uses the old vLLM API, so it is left unchanged for now.

3. Fix rng checkpoints

Only one rank per machine saves rng_states, to avoid write conflicts (see the sketch below).
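A minimal sketch of the idea (assumed, not verl's exact code; LOCAL_RANK as the local-rank indicator is also an assumption): only the first process on each machine writes the RNG state file, so concurrent writers do not clobber each other.

```python
import os
import torch
import torch.distributed as dist

def save_rng_states(checkpoint_dir: str):
    rng_state = {
        "cpu_rng_state": torch.get_rng_state(),
        "cuda_rng_state": torch.cuda.get_rng_state_all(),
    }
    # Only the first process on each machine writes the file.
    if int(os.environ.get("LOCAL_RANK", "0")) == 0:
        torch.save(rng_state, os.path.join(checkpoint_dir, "rng_states.pt"))
    # Make sure the write has landed before other ranks move on.
    dist.barrier()
```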

TODO

  • Context parallel is not supported yet

@ETOgaosion ETOgaosion force-pushed the update_image branch 3 times, most recently from 7f11e76 to 83f7497 Compare April 7, 2025 17:25
@ETOgaosion ETOgaosion changed the title update images to use latest vllm and Megatron Support Megatron 0.11.0 and vLLM 0.8.2, update images to use latest vllm and Megatron Apr 8, 2025
@ETOgaosion ETOgaosion force-pushed the update_image branch 2 times, most recently from 18b9568 to f7bd414 Compare April 10, 2025 17:06
@ETOgaosion (Collaborator, Author) commented Apr 12, 2025

PP=2, TP=2, CP=2, VPP=2

In the first training task (whether actor or critic), context parallel gets stuck at:

  • Thread 1
Thread 34147 (active): "MainThread"
    _apply_rotary_pos_emb_thd (core/models/common/embeddings/rope_utils.py:162)
    apply_rotary_pos_emb (core/models/common/embeddings/rope_utils.py:219)
    forward (core/transformer/attention.py:436)
    _call_impl (torch/nn/modules/module.py:1750)
    _wrapped_call_impl (torch/nn/modules/module.py:1739)
    forward (core/transformer/transformer_layer.py:390)
    _call_impl (torch/nn/modules/module.py:1750)
    _wrapped_call_impl (torch/nn/modules/module.py:1739)
    __call__ (core/transformer/transformer_layer.py:502)
    forward (core/transformer/transformer_block.py:549)
    _call_impl (torch/nn/modules/module.py:1750)
    _wrapped_call_impl (torch/nn/modules/module.py:1739)
    forward (core/models/gpt/gpt_model.py:264)
    _call_impl (torch/nn/modules/module.py:1750)
    _wrapped_call_impl (torch/nn/modules/module.py:1739)
    forward (core/transformer/module.py:178)
    _call_impl (torch/nn/modules/module.py:1750)
    _wrapped_call_impl (torch/nn/modules/module.py:1739)
    forward (core/distributed/data_parallel_base.py:22)
    _call_impl (torch/nn/modules/module.py:1750)
    _wrapped_call_impl (torch/nn/modules/module.py:1739)
    gptmodel_forward (verl/models/mcore/gpt_model.py:37)
    forward_step (verl/workers/critic/megatron_critic.py:171)
    forward_step (core/pipeline_parallel/schedules.py:275)
    forward_step_helper (core/pipeline_parallel/schedules.py:899)
    forward_backward_pipelining_with_interleaving (core/pipeline_parallel/schedules.py:1049)
    forward_backward_batch (verl/workers/critic/megatron_critic.py:186)
    update_critic (verl/workers/critic/megatron_critic.py:218)
    update_critic (verl/workers/megatron_workers.py:652)
    inner (verl/single_controller/base/decorator.py:409)
    func (verl/single_controller/ray/base.py:439)
    _resume_span (ray/util/tracing/tracing_helper.py:467)
    actor_method_executor (ray/_private/function_manager.py:722)
    main_loop (ray/_private/worker.py:892)
    <module> (ray/_private/workers/default_worker.py:327)
  • Thread 2
Thread 34149 (active): "MainThread"
    synchronize (torch/cuda/__init__.py:985)
    _communicate_shapes (core/pipeline_parallel/p2p_communication.py:106)
    _communicate (core/pipeline_parallel/p2p_communication.py:280)
    recv_forward (core/pipeline_parallel/p2p_communication.py:423)
    forward_backward_pipelining_with_interleaving (core/pipeline_parallel/schedules.py:977)
    forward_backward_batch (verl/workers/critic/megatron_critic.py:186)
    update_critic (verl/workers/critic/megatron_critic.py:218)
    update_critic (verl/workers/megatron_workers.py:652)
    inner (verl/single_controller/base/decorator.py:409)
    func (verl/single_controller/ray/base.py:439)
    _resume_span (ray/util/tracing/tracing_helper.py:467)
    actor_method_executor (ray/_private/function_manager.py:722)
    main_loop (ray/_private/worker.py:892)
    <module> (ray/_private/workers/default_worker.py:327)

In _apply_rotary_pos_emb_thd (core/models/common/embeddings/rope_utils.py:162):

seqlens = (cu_seqlens[1:] - cu_seqlens[:-1]).tolist()
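For context on why this line can block while the peer in thread 2 waits in recv_forward: `.tolist()` on a CUDA tensor copies device to host and implicitly synchronizes the stream. A minimal illustration, assuming cu_seqlens lives on the GPU; this is not the eventual fix:

```python
import torch

# Assumption: cu_seqlens is a CUDA tensor in the THD rotary-embedding path.
cu_seqlens = torch.tensor([0, 128, 256, 512], device="cuda", dtype=torch.int32)

# .tolist() forces a device-to-host copy, which synchronizes the GPU: the call
# does not return until all queued kernels and collectives on this rank finish,
# so it can interleave badly with pipeline p2p communication on other ranks.
seqlens = (cu_seqlens[1:] - cu_seqlens[:-1]).tolist()
```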

@ETOgaosion ETOgaosion closed this Apr 12, 2025
@eric-haibin-lin (Collaborator) commented:
why was this PR closed?

@ETOgaosion (Collaborator, Author) commented Apr 13, 2025

/models/common/embeddings/rope_utils.py

This PR is too heavy to debug and to build on later: it bundles too many atomic features.

So I split it into several PRs, each making minimal changes for one feature.

vermouth1992 pushed a commit that referenced this pull request Apr 14, 2025
Parts of #851 

Including a minimal set of upgrades:

1. vLLM 0.8.2 with Megatron
2. part of the per-tensor all-gather and weight loading
3. fix for context-parallel bugs caused by the dataloader random seed; the behavior seems to have changed in torch 2.6.0
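The commit attributes the context-parallel bug to the dataloader random seed and a behavior change in torch 2.6.0. A hedged sketch of one way to pin the sampling order (the actual fix in the follow-up PR may differ): give the DataLoader an explicitly seeded generator so every rank draws the same shuffle, independent of the global RNG state.

```python
import torch
from torch.utils.data import DataLoader

def build_dataloader(dataset, batch_size: int, seed: int = 42) -> DataLoader:
    # Explicitly seeded generator: identical shuffle order on every rank.
    generator = torch.Generator()
    generator.manual_seed(seed)
    return DataLoader(dataset, batch_size=batch_size, shuffle=True, generator=generator)
```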
zhaochenyang20 pushed a commit that referenced this pull request May 7, 2025
Based on the ongoing alignment between mcore and vLLM (#851), I believe
we can simultaneously advance the alignment between mcore and sglang, as
their interfaces are similar. In the end, we will only need to obtain a
generator parameter.
[link](sgl-project/sglang#5345)