Support Megatron 0.11.0 and vLLM 0.8.2, update images to use latest vllm and Megatron #851
Conversation
With PP=2, TP=2, CP=2, VPP=2, the first training task (whether actor or critic) gets stuck under context parallel at:
In `_apply_rotary_pos_emb_thd` (core/models/common/embeddings/rope_utils.py:162): `seqlens = (cu_seqlens[1:] - cu_seqlens[:-1]).tolist()`
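For context, a minimal standalone reproduction of that line (with hypothetical sequence lengths; this is not verl or Megatron code, just the same arithmetic). The `.tolist()` call is worth noting: it copies the tensor to the host, and on a GPU tensor that is a synchronization point, so a rank blocked on a collective elsewhere can surface as a hang here.

```python
import torch

# Hypothetical packed (THD-layout) batch: three sequences of lengths
# 3, 5, 2. cu_seqlens stores the cumulative offsets.
cu_seqlens = torch.tensor([0, 3, 8, 10], dtype=torch.int32)

# Adjacent differences recover the per-sequence lengths. On a CUDA
# tensor, .tolist() forces a device-to-host copy and thus a stream
# sync, which is where the stall above would become visible.
seqlens = (cu_seqlens[1:] - cu_seqlens[:-1]).tolist()
print(seqlens)  # [3, 5, 2]
```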
Why was this PR closed?
This PR is too heavy to debug and to build on later, since it combines too many atomic features. So I split it into several PRs, keeping the changes in each one minimal.
Parts of #851. Includes a minimal set of the upgrade: 1. vLLM 0.8.2 with Megatron 2. part of per-tensor all-gather and weight loading 3. fixes for context-parallel bugs caused by the dataloader random seed; the behavior seems to have changed in torch 2.6.0
Based on the ongoing alignment between mcore and vLLM in #851, I believe we can advance the alignment between mcore and SGLang in parallel, since their interfaces are similar. In the end we will only need to obtain a generator parameter. [link](sgl-project/sglang#5345)
Code is ready, divided into 3 PRs.
This PR includes:
1. Support of Megatron 0.11.0 and vLLM 0.8.2
Per-tensor weight loading to reduce peak memory cost
Align GPTModel names with new vLLM 0.8.2 model interfaces
Mitigate the concept of Micro-DP
There are many cases that micro-DP cannot handle well without complicated logic, so we now directly use the tensor-parallel groups of Megatron and vLLM. Since per-tensor all-gather has minimal memory cost, the choice of communication group matters little.
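A rough sketch of the per-tensor loading idea (names like `per_tensor_allgather` and `load_weight` are illustrative, not verl's actual API, and the all-gather is simulated with a concatenation): gather one tensor across the TP group, hand it to the inference engine, free it, then move to the next, so the extra peak memory is one full tensor rather than the whole state dict.

```python
import numpy as np

def per_tensor_allgather(shards):
    # Stand-in for a distributed all-gather: concatenate TP shards
    # along the sharded dimension (dim 0 here for simplicity).
    return np.concatenate(shards, axis=0)

def load_weights_per_tensor(sharded_state, load_weight):
    peak_bytes = 0
    for name, shards in sharded_state.items():
        full = per_tensor_allgather(shards)  # one full tensor at a time
        peak_bytes = max(peak_bytes, full.nbytes)
        load_weight(name, full)              # hand off to the engine
        del full                             # release before the next one
    return peak_bytes

# Usage: two parameters, each split into 2 TP shards.
state = {
    "w1": [np.zeros((2, 4), np.float32), np.zeros((2, 4), np.float32)],
    "w2": [np.zeros((1, 4), np.float32), np.zeros((1, 4), np.float32)],
}
loaded = []
peak = load_weights_per_tensor(state, lambda n, t: loaded.append((n, t.shape)))
print(loaded, peak)  # [('w1', (4, 4)), ('w2', (2, 4))] 64
```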
Import Megatron modules on GPU nodes
This is because `import megatron.core` now ends up calling Triton, which detects the current GPU driver and raises an error when initialized on a CPU-only node. A lot of code was refactored to defer these imports to GPU nodes.
2. Update Docker image and docs
`test_vllm_hf_loader.py` still uses the old vLLM API, so it stays unchanged for now.
3. Fix RNG checkpoints
Only one rank per machine saves rng_states, to avoid write conflicts.
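The fix could be sketched as below (hedged: `save_rng_states` and the JSON file are illustrative, not verl's actual checkpoint code, which would also capture torch/CUDA RNG states): since all ranks on a machine share a filesystem, only local rank 0 writes the file, so ranks do not clobber each other.

```python
import json
import os
import random
import tempfile

def save_rng_states(checkpoint_dir):
    """Write rng_states from local rank 0 only; other ranks no-op."""
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    if local_rank != 0:
        return None  # another rank on this node owns the write
    path = os.path.join(checkpoint_dir, "rng_states.json")
    # Real code would also save torch.get_rng_state() / CUDA states.
    state = {"python": list(random.getstate()[1])}
    with open(path, "w") as f:
        json.dump(state, f)
    return path

# Usage: local rank 0 writes, local rank 1 skips.
with tempfile.TemporaryDirectory() as d:
    os.environ["LOCAL_RANK"] = "0"
    print(save_rng_states(d) is not None)  # True
    os.environ["LOCAL_RANK"] = "1"
    print(save_rng_states(d))  # None
```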
TODO