
Support Megatron 0.11.0 and vLLM 0.8.2, update images to use latest vllm and Megatron #851


Closed
wants to merge 32 commits

Conversation

@ETOgaosion (Collaborator) commented Mar 31, 2025

The code is ready; it has been divided into 3 PRs.

This PR includes:

1. Support of Megatron 0.11.0 and vLLM 0.8.2

Per-tensor weight loading to reduce peak memory cost

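A minimal sketch of the per-tensor idea (assumptions: a hypothetical `load_weight_fn` callback on the rollout engine and dim-0 sharding; verl's actual code differs): gather the Megatron weights one tensor at a time over the tensor-parallel group, hand each full tensor to the inference engine, then free the buffers, so the temporary memory is bounded by the largest single tensor rather than a full state dict.

```python
import torch
import torch.distributed as dist

def sync_weights_per_tensor(named_params, tp_group, load_weight_fn):
    """named_params: iterable of (name, shard) pairs on this TP rank.
    load_weight_fn: hypothetical callback that loads one (name, full_tensor)
    into the rollout engine."""
    tp_size = dist.get_world_size(group=tp_group)
    for name, shard in named_params:
        # All-gather this tensor's shards across the tensor-parallel group.
        buffers = [torch.empty_like(shard) for _ in range(tp_size)]
        dist.all_gather(buffers, shard.data.contiguous(), group=tp_group)
        full_tensor = torch.cat(buffers, dim=0)  # assume dim-0 sharding for illustration
        load_weight_fn(name, full_tensor)
        # Free the temporaries before gathering the next tensor, so the peak
        # extra memory stays bounded by the largest single tensor.
        del buffers, full_tensor
```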

Align GPTModel names with the new vLLM 0.8.2 model interfaces

Remove the concept of micro-DP

There are many cases that micro-DP cannot handle without complicated logic, so we now use the tensor-parallel groups of Megatron and vLLM directly. Since the per-tensor all-gather has minimal peak memory cost (the temporary buffer is bounded by the largest single tensor rather than the whole model), the choice of communication group matters little.

Import Megatron modules only on GPU nodes

Importing megatron.core now ends up calling Triton, which probes the current GPU driver and raises an error when the import happens on a CPU-only node, so a lot of code was refactored to defer these imports (see the sketch below).
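A minimal sketch of the deferral pattern, not the PR's exact refactor: keep megatron.core out of module scope so CPU-only processes (e.g. the Ray driver) never trigger Triton's driver probe, and import it inside functions that only run on GPU workers.

```python
def init_megatron_parallel_state(tp_size: int = 2, pp_size: int = 2):
    # Deferred import: only GPU worker processes execute this function, so
    # CPU-only nodes never import megatron.core (and thus never import Triton).
    # Assumes torch.distributed has already been initialized on the worker.
    from megatron.core import parallel_state

    parallel_state.initialize_model_parallel(
        tensor_model_parallel_size=tp_size,      # hypothetical sizes for illustration
        pipeline_model_parallel_size=pp_size,
    )
```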

2. Update the Docker image and docs

  • TODO: the vLLM test test_vllm_hf_loader.py still uses the old vLLM API, so it is left unchanged for now.

3. Fix rng checkpoints

Only one rank per machine saves rng_states, to avoid write conflicts (see the sketch below).
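A minimal sketch of the idea (assumed, not verl's exact code; LOCAL_RANK as the local-rank indicator is also an assumption): only the first process on each machine writes the RNG state file, so concurrent writers do not clobber each other.

```python
import os
import torch
import torch.distributed as dist

def save_rng_states(checkpoint_dir: str):
    rng_state = {
        "cpu_rng_state": torch.get_rng_state(),
        "cuda_rng_state": torch.cuda.get_rng_state_all(),
    }
    # Only the first process on each machine writes the file.
    if int(os.environ.get("LOCAL_RANK", "0")) == 0:
        torch.save(rng_state, os.path.join(checkpoint_dir, "rng_states.pt"))
    # Make sure the write has landed before other ranks move on.
    dist.barrier()
```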

TODO

  • Context parallel is not supported yet

@ETOgaosion ETOgaosion force-pushed the update_image branch 3 times, most recently from 7f11e76 to 83f7497 Compare April 7, 2025 17:25
@ETOgaosion ETOgaosion changed the title update images to use latest vllm and Megatron Support Megatron 0.11.0 and vLLM 0.8.2, update images to use latest vllm and Megatron Apr 8, 2025
@ETOgaosion ETOgaosion force-pushed the update_image branch 2 times, most recently from 18b9568 to f7bd414 Compare April 10, 2025 17:06
@ETOgaosion (Collaborator, Author) commented Apr 12, 2025

PP=2, TP=2, CP=2, VPP=2

In the first training task (whether actor or critic), context parallel gets stuck at:

  • Thread 1
Thread 34147 (active): "MainThread"
    _apply_rotary_pos_emb_thd (core/models/common/embeddings/rope_utils.py:162)
    apply_rotary_pos_emb (core/models/common/embeddings/rope_utils.py:219)
    forward (core/transformer/attention.py:436)
    _call_impl (torch/nn/modules/module.py:1750)
    _wrapped_call_impl (torch/nn/modules/module.py:1739)
    forward (core/transformer/transformer_layer.py:390)
    _call_impl (torch/nn/modules/module.py:1750)
    _wrapped_call_impl (torch/nn/modules/module.py:1739)
    __call__ (core/transformer/transformer_layer.py:502)
    forward (core/transformer/transformer_block.py:549)
    _call_impl (torch/nn/modules/module.py:1750)
    _wrapped_call_impl (torch/nn/modules/module.py:1739)
    forward (core/models/gpt/gpt_model.py:264)
    _call_impl (torch/nn/modules/module.py:1750)
    _wrapped_call_impl (torch/nn/modules/module.py:1739)
    forward (core/transformer/module.py:178)
    _call_impl (torch/nn/modules/module.py:1750)
    _wrapped_call_impl (torch/nn/modules/module.py:1739)
    forward (core/distributed/data_parallel_base.py:22)
    _call_impl (torch/nn/modules/module.py:1750)
    _wrapped_call_impl (torch/nn/modules/module.py:1739)
    gptmodel_forward (verl/models/mcore/gpt_model.py:37)
    forward_step (verl/workers/critic/megatron_critic.py:171)
    forward_step (core/pipeline_parallel/schedules.py:275)
    forward_step_helper (core/pipeline_parallel/schedules.py:899)
    forward_backward_pipelining_with_interleaving (core/pipeline_parallel/schedules.py:1049)
    forward_backward_batch (verl/workers/critic/megatron_critic.py:186)
    update_critic (verl/workers/critic/megatron_critic.py:218)
    update_critic (verl/workers/megatron_workers.py:652)
    inner (verl/single_controller/base/decorator.py:409)
    func (verl/single_controller/ray/base.py:439)
    _resume_span (ray/util/tracing/tracing_helper.py:467)
    actor_method_executor (ray/_private/function_manager.py:722)
    main_loop (ray/_private/worker.py:892)
    <module> (ray/_private/workers/default_worker.py:327)
  • Thread 2
Thread 34149 (active): "MainThread"
    synchronize (torch/cuda/__init__.py:985)
    _communicate_shapes (core/pipeline_parallel/p2p_communication.py:106)
    _communicate (core/pipeline_parallel/p2p_communication.py:280)
    recv_forward (core/pipeline_parallel/p2p_communication.py:423)
    forward_backward_pipelining_with_interleaving (core/pipeline_parallel/schedules.py:977)
    forward_backward_batch (verl/workers/critic/megatron_critic.py:186)
    update_critic (verl/workers/critic/megatron_critic.py:218)
    update_critic (verl/workers/megatron_workers.py:652)
    inner (verl/single_controller/base/decorator.py:409)
    func (verl/single_controller/ray/base.py:439)
    _resume_span (ray/util/tracing/tracing_helper.py:467)
    actor_method_executor (ray/_private/function_manager.py:722)
    main_loop (ray/_private/worker.py:892)
    <module> (ray/_private/workers/default_worker.py:327)

In _apply_rotary_pos_emb_thd (core/models/common/embeddings/rope_utils.py:162):

seqlens = (cu_seqlens[1:] - cu_seqlens[:-1]).tolist()
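For context on why this line can block while the peer in thread 2 waits in recv_forward: `.tolist()` on a CUDA tensor copies device to host and implicitly synchronizes the stream. A minimal illustration, assuming cu_seqlens lives on the GPU; this is not the eventual fix:

```python
import torch

# Assumption: cu_seqlens is a CUDA tensor in the THD rotary-embedding path.
cu_seqlens = torch.tensor([0, 128, 256, 512], device="cuda", dtype=torch.int32)

# .tolist() forces a device-to-host copy, which synchronizes the GPU: the call
# does not return until all queued kernels and collectives on this rank finish,
# so it can interleave badly with pipeline p2p communication on other ranks.
seqlens = (cu_seqlens[1:] - cu_seqlens[:-1]).tolist()
```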

@ETOgaosion ETOgaosion closed this Apr 12, 2025
@eric-haibin-lin (Collaborator) commented:
why was this PR closed?

@ETOgaosion (Collaborator, Author) commented Apr 13, 2025

/models/common/embeddings/rope_utils.py

This PR is too heavy to debug and to build on later: it bundles too many atomic features.

So I split it into several PRs, each making minimal changes for one feature.

vermouth1992 pushed a commit that referenced this pull request Apr 14, 2025
Parts of #851 

Including a minimal set of upgrades:

1. vLLM 0.8.2 with Megatron
2. part of the per-tensor all-gather and weight loading
3. fix for context-parallel bugs caused by the dataloader random seed; the behavior seems to have changed in torch 2.6.0
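The commit attributes the context-parallel bug to the dataloader random seed and a behavior change in torch 2.6.0. A hedged sketch of one way to pin the sampling order (the actual fix in the follow-up PR may differ): give the DataLoader an explicitly seeded generator so every rank draws the same shuffle, independent of the global RNG state.

```python
import torch
from torch.utils.data import DataLoader

def build_dataloader(dataset, batch_size: int, seed: int = 42) -> DataLoader:
    # Explicitly seeded generator: identical shuffle order on every rank.
    generator = torch.Generator()
    generator.manual_seed(seed)
    return DataLoader(dataset, batch_size=batch_size, shuffle=True, generator=generator)
```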
zhaochenyang20 pushed a commit that referenced this pull request May 7, 2025
Based on the ongoing alignment between mcore and vLLM (#851), I believe
we can simultaneously advance the alignment between mcore and sglang, as
their interfaces are similar. In the end, we will only need to obtain a
generator parameter.
[link](sgl-project/sglang#5345)