
Commit ea6df45

Update vllm 0.8.2 with megatron 0.11.0 (volcengine#1054)
Parts of volcengine#851, covering a minimal set of upgrades: 1. vLLM 0.8.2 with Megatron; 2. part of the per-tensor allgather and load-weights support; 3. a fix for context-parallel bugs caused by the dataloader random seed, whose behavior appears to have changed in torch 2.6.0.
1 parent b17bce7 commit ea6df45

31 files changed, +390 -155 lines
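Item 2 of the commit message refers to syncing Megatron weights into the vLLM 0.8.2 engine one tensor at a time: each parameter's tensor-parallel shards are all-gathered and the reassembled tensor is handed to vLLM. The sketch below is only an illustration of that pattern; the function name, the `tp_group` handle, and the dim-0 sharding assumption are hypothetical, not verl's actual implementation.

    import torch
    import torch.distributed as dist

    def gather_and_load(megatron_named_params, vllm_model, tp_group):
        """Illustrative per-tensor allgather: reassemble one parameter at a time
        from its tensor-parallel shards and hand it to vLLM's load_weights()."""
        tp_size = dist.get_world_size(group=tp_group)
        for name, shard in megatron_named_params:
            # Gather this parameter's shards from every tensor-parallel rank.
            buffers = [torch.empty_like(shard) for _ in range(tp_size)]
            dist.all_gather(buffers, shard.detach(), group=tp_group)
            # Assume sharding along dim 0; real layers differ (column- vs. row-parallel).
            full_weight = torch.cat(buffers, dim=0)
            # vLLM model classes accept an iterable of (name, tensor) pairs.
            vllm_model.load_weights([(name, full_weight)])
            del buffers, full_weight  # keep peak memory to one full tensor at a time

Gathering per tensor rather than materializing a whole state dict keeps peak memory to a single full parameter, which is the point of the partial change described here.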

.github/workflows/checkpoints.yml

Lines changed: 2 additions & 3 deletions
@@ -22,7 +22,7 @@ permissions:
   contents: read
 
 jobs:
-  e2e_gsm8k_megatron:
+  checkpoints:
     runs-on: [self-hosted, l20-0]
     timeout-minutes: 40 # Increase this timeout value as needed
     env:
@@ -31,7 +31,7 @@ jobs:
       NO_PROXY: "localhost,127.0.0.1"
       HF_HUB_ENABLE_HF_TRANSFER: 1
     container:
-      image: whatcanyousee/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te2.0-megatron0.11.0-v0.0.6
+      image: whatcanyousee/verl:ngc-th2.6.0-cu124-vllm0.8.2-mcore0.11.0-te2.0
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
@@ -47,7 +47,6 @@ jobs:
       - name: Running Checkpoint Integration Test (Qwen Megatron)
         run: |
           ray stop --force
-          export PYTHONPATH=$PYTHONPATH:/opt/nvidia/Megatron-LM
           bash tests/checkpoint/run_qwen_megatron_ckpt.sh
       - name: Running Checkpoint Integration Test (Deepseek Megatron)
         run: |

.github/workflows/dataset.yml

Lines changed: 1 addition & 1 deletion
@@ -32,7 +32,7 @@ jobs:
       NO_PROXY: "localhost,127.0.0.1"
       HF_HUB_ENABLE_HF_TRANSFER: 1
     container:
-      image: verlai/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3
+      image: whatcanyousee/verl:ngc-th2.6.0-cu124-vllm0.8.2-mcore0.11.0-te2.0
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

.github/workflows/e2e_eval_aime24.yml

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@ jobs:
       NO_PROXY: "localhost,127.0.0.1"
       HF_HUB_ENABLE_HF_TRANSFER: 1
     container:
-      image: hiyouga/verl:ngc-th2.6.0-cu120-vllm0.8.2
+      image: whatcanyousee/verl:ngc-th2.6.0-cu124-vllm0.8.2-mcore0.11.0-te2.0
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

.github/workflows/e2e_grpo.yml

Lines changed: 4 additions & 4 deletions
@@ -25,7 +25,7 @@ permissions:
   contents: read
 
 jobs:
-  e2e_gsm8k_megatron-l20-0:
+  e2e_grpo-l20-0:
     runs-on: [self-hosted, l20-0]
     timeout-minutes: 40 # Increase this timeout value as needed
     env:
@@ -34,7 +34,7 @@ jobs:
       NO_PROXY: "localhost,127.0.0.1"
       HF_HUB_ENABLE_HF_TRANSFER: 1
     container:
-      image: whatcanyousee/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te2.0-megatron0.11.0-v0.0.6
+      image: whatcanyousee/verl:ngc-th2.6.0-cu124-vllm0.8.2-mcore0.11.0-te2.0
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
@@ -55,7 +55,7 @@ jobs:
         run: |
           ray stop --force
           bash tests/e2e/run_qwen_grpo_megatron.sh
-  e2e_gsm8k_megatron-l20-1:
+  e2e_grpo-l20-1:
     runs-on: [self-hosted, l20-1]
     timeout-minutes: 40 # Increase this timeout value as needed
     env:
@@ -64,7 +64,7 @@ jobs:
       NO_PROXY: "localhost,127.0.0.1"
       HF_HUB_ENABLE_HF_TRANSFER: 1
     container:
-      image: whatcanyousee/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te2.0-megatron0.11.0-v0.0.6
+      image: whatcanyousee/verl:ngc-th2.6.0-cu124-vllm0.8.2-mcore0.11.0-te2.0
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

.github/workflows/e2e_gsm8k.yml

Lines changed: 1 addition & 1 deletion
@@ -34,7 +34,7 @@ jobs:
       NO_PROXY: "localhost,127.0.0.1"
       HF_HUB_ENABLE_HF_TRANSFER: 1
     container:
-      image: hiyouga/verl:ngc-th2.6.0-cu120-vllm0.8.2
+      image: whatcanyousee/verl:ngc-th2.6.0-cu124-vllm0.8.2-mcore0.11.0-te2.0
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

.github/workflows/e2e_gsm8k_megatron.yml

Lines changed: 25 additions & 3 deletions
@@ -1,5 +1,5 @@
 name: e2e_gsm8k_megatron
-# latest version: Megatron-LM core_r0.11.0 https://github.com/NVIDIA/Megatron-LM/tree/core_r0.11.0
+# latest version: Megatron-LM v0.11.0 https://github.com/NVIDIA/Megatron-LM/tree/v0.11.0
 
 on:
   # Trigger the workflow on push or pull request,
@@ -27,7 +27,7 @@ permissions:
   contents: read
 
 jobs:
-  e2e_gsm8k_megatron:
+  e2e_gsm8k_megatron-l20-0:
     runs-on: [self-hosted, l20-0]
     timeout-minutes: 40 # Increase this timeout value as needed
     env:
@@ -36,7 +36,7 @@ jobs:
       NO_PROXY: "localhost,127.0.0.1"
       HF_HUB_ENABLE_HF_TRANSFER: 1
     container:
-      image: whatcanyousee/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te2.0-megatron0.11.0-v0.0.6
+      image: whatcanyousee/verl:ngc-th2.6.0-cu124-vllm0.8.2-mcore0.11.0-te2.0
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
@@ -53,6 +53,28 @@ jobs:
         run: |
           ray stop --force
           bash tests/e2e/run_deepseek_megatron_parallelism.sh
+  e2e_gsm8k_megatron-l20-1:
+    runs-on: [self-hosted, l20-1]
+    timeout-minutes: 40 # Increase this timeout value as needed
+    env:
+      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
+      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
+      NO_PROXY: "localhost,127.0.0.1"
+      HF_HUB_ENABLE_HF_TRANSFER: 1
+    container:
+      image: whatcanyousee/verl:ngc-th2.6.0-cu124-vllm0.8.2-mcore0.11.0-te2.0
+      options: --gpus all --shm-size=10g
+    steps:
+      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
+        with:
+          fetch-depth: 0
+      - name: Install the current repository
+        run: |
+          pip3 install hf_transfer
+          pip3 install -e .[test]
+      - name: Prepare gsm8k dataset
+        run: |
+          python3 examples/data_preprocess/gsm8k.py
       - name: Running gsm8k e2e training tests with 3D parallelism on 8 L20 GPUs with Megatron (Qwen)
         run: |
           ray stop --force

.github/workflows/e2e_gsm8k_prime.yml

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@ jobs:
       NO_PROXY: "localhost,127.0.0.1"
       HF_HUB_ENABLE_HF_TRANSFER: 1
     container:
-      image: hiyouga/verl:ngc-th2.6.0-cu120-vllm0.8.2
+      image: whatcanyousee/verl:ngc-th2.6.0-cu124-vllm0.8.2-mcore0.11.0-te2.0
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

.github/workflows/e2e_lora.yml

Lines changed: 1 addition & 1 deletion
@@ -33,7 +33,7 @@ jobs:
       NO_PROXY: "localhost,127.0.0.1"
       HF_HUB_ENABLE_HF_TRANSFER: 1
     container:
-      image: verlai/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3
+      image: whatcanyousee/verl:ngc-th2.6.0-cu124-vllm0.8.2-mcore0.11.0-te2.0
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

.github/workflows/e2e_sft.yml

Lines changed: 1 addition & 1 deletion
@@ -34,7 +34,7 @@ jobs:
       NO_PROXY: "localhost,127.0.0.1"
       HF_HUB_ENABLE_HF_TRANSFER: 1
     container:
-      image: verlai/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3
+      image: whatcanyousee/verl:ngc-th2.6.0-cu124-vllm0.8.2-mcore0.11.0-te2.0
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

.github/workflows/e2e_vlm_geo3k.yml

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@ jobs:
       NO_PROXY: "localhost,127.0.0.1"
       HF_HUB_ENABLE_HF_TRANSFER: 1
     container:
-      image: hiyouga/verl:ngc-th2.6.0-cu120-vllm0.8.2
+      image: whatcanyousee/verl:ngc-th2.6.0-cu124-vllm0.8.2-mcore0.11.0-te2.0
       options: --gpus all --shm-size=40g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

.github/workflows/model.yml

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@ jobs:
       NO_PROXY: "localhost,127.0.0.1"
       HF_HUB_ENABLE_HF_TRANSFER: 1
     container:
-      image: whatcanyousee/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te2.0-megatron0.11.0-v0.0.6
+      image: whatcanyousee/verl:ngc-th2.6.0-cu124-vllm0.8.2-mcore0.11.0-te2.0
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

.github/workflows/ray_test.yml

Lines changed: 2 additions & 2 deletions
@@ -31,14 +31,14 @@ permissions:
 jobs:
   ray:
     runs-on: [self-hosted, l20-0]
-    timeout-minutes: 5 # Increase this timeout value as needed
+    timeout-minutes: 10 # Increase this timeout value as needed
     env:
       HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
       HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
       NO_PROXY: "localhost,127.0.0.1"
       HF_HUB_ENABLE_HF_TRANSFER: 1
     container:
-      image: verlai/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3
+      image: whatcanyousee/verl:ngc-th2.6.0-cu124-vllm0.8.2-mcore0.11.0-te2.0
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

.github/workflows/sandbox.yml

Lines changed: 2 additions & 2 deletions
@@ -23,14 +23,14 @@ permissions:
 jobs:
   sandbox:
     runs-on: [self-hosted, l20-0]
-    timeout-minutes: 3 # Increase this timeout value as needed
+    timeout-minutes: 10 # Increase this timeout value as needed
    env:
      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
      NO_PROXY: "localhost,127.0.0.1"
      HF_HUB_ENABLE_HF_TRANSFER: 1
    container:
-      image: verlai/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3
+      image: whatcanyousee/verl:ngc-th2.6.0-cu124-vllm0.8.2-mcore0.11.0-te2.0
      options: --gpus all --shm-size=10g
    steps:
      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

docker/Dockerfile.megatron

Lines changed: 39 additions & 6 deletions
@@ -1,9 +1,42 @@
-FROM verlai/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3
+FROM hiyouga/verl:ngc-th2.6.0-cu120-vllm0.8.2
 
-RUN pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable
+# Define environments
+ENV MAX_JOBS=64
 
-RUN cd /opt/nvidia && git clone --single-branch --branch core_r0.11.0 https://github.com/NVIDIA/Megatron-LM.git Megatron-LM
+RUN apt-get update && \
+    apt-get install -y aria2
 
-# only config pip index with https://pypi.tuna.tsinghua.edu.cn/simple if needed
-# unset for now
-RUN cd /opt/nvidia/Megatron-LM && pip3 install --no-deps -e .
+# 1. Reinstall CUDA 12.4
+RUN aria2c https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin && \
+    mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
+
+RUN aria2c --always-resume=true --max-tries=99999 https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda-repo-ubuntu2204-12-4-local_12.4.1-550.54.15-1_amd64.deb
+
+RUN dpkg -i cuda-repo-ubuntu2204-12-4-local_12.4.1-550.54.15-1_amd64.deb
+
+RUN cp /var/cuda-repo-ubuntu2204-12-4-local/cuda-*-keyring.gpg /usr/share/keyrings/
+
+RUN apt-get update
+
+RUN apt-get -y install cuda-toolkit-12-4
+
+RUN rm cuda-repo-ubuntu2204-12-4-local_12.4.1-550.54.15-1_amd64.deb
+
+RUN update-alternatives --set cuda /usr/local/cuda-12.4
+
+# 2. Reinstall Flash attn 2.7.3
+RUN pip uninstall -y flash-attn && \
+    wget -nv https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl && \
+    pip install --no-cache-dir flash_attn-2.7.3+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl && \
+    rm flash_attn-2.7.3+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
+
+# 3. Install Apex
+RUN git clone https://github.com/NVIDIA/apex.git && \
+    cd apex && \
+    pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
+
+# 4. Install TransformerEngine
+RUN export NVTE_FRAMEWORK=pytorch && pip3 install --no-deps git+https://github.com/NVIDIA/TransformerEngine.git@v2.0
+
+# 5. Install Megatron-LM
+RUN pip3 install git+https://github.com/NVIDIA/Megatron-LM.git@core_v0.11.0

docs/advance/checkpoint.rst

Lines changed: 5 additions & 4 deletions
@@ -84,10 +84,11 @@ So example use of Megatron model merger is:
 
 .. code:: bash
 
-    python3 scripts/model_merger.py --backend megatron \
-        --is-value-model \
-        --hf_model_path Qwen/Qwen2-7B \
-        --local_dir checkpoints/verl_megatron_gsm8k_examples/deepseek_megatron_checkpoint_saveload/global_step_1/actor/model
+    python scripts/model_merger.py \
+        --backend megatron \
+        --tie-word-embedding \
+        --hf_model_path Qwen/Qwen2.5-0.5B \
+        --local_dir checkpoints/verl_megatron_gsm8k_examples/qwen2_5_0b5_megatron_saveload/global_step_1/actor
 
 Megatron Merger details
 -----------------------

docs/examples/config.rst

Lines changed: 5 additions & 0 deletions
@@ -105,6 +105,7 @@ Actor/Rollout/Reference Policy
     kl_loss_coef: 0.001 # for grpo
     kl_loss_type: low_var_kl # for grpo
     ppo_epochs: 1
+    data_loader_seed: null
     shuffle: False
     ulysses_sequence_parallel_size: 1 # sp size
     optim:
@@ -206,6 +207,10 @@ Actor/Rollout/Reference Policy
 - ``actor_rollout_ref.actor.ppo_epochs``: Number of epochs for PPO
   updates on one set of sampled data
 
+- ``actor_rollout_ref.actor.data_loader_seed``: From torch 2.6.0 Megatron backend can get wrong seed generated by pytorch
+  between cp ranks and cause misalignment between data on these ranks, so we shall manually set the seed to avoid hanging
+  issue. if ``actor_rollout_ref.actor.shuffle`` is not null, this must be set.
+
 - ``actor_rollout_ref.actor.shuffle``: Whether to shuffle data when
   there are multiple epochs
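The failure mode the new ``data_loader_seed`` option addresses comes down to the sampler's RNG: under torch 2.6.0, the dataloader on different context-parallel ranks can end up seeded differently, so the ranks draw different shuffles and their batches no longer line up. A minimal sketch of the idea with a plain PyTorch DataLoader follows; the helper name and batch size are illustrative, not verl's actual dataloader code.

    import torch
    from torch.utils.data import DataLoader, RandomSampler, TensorDataset

    def build_loader(dataset, data_loader_seed=None):
        # With shuffling, the sampler draws its permutation from this generator.
        # Leaving it unseeded lets CP ranks produce different permutations,
        # which is the mismatch data_loader_seed is meant to prevent.
        generator = torch.Generator()
        if data_loader_seed is not None:
            generator.manual_seed(data_loader_seed)  # identical order on every rank
        sampler = RandomSampler(dataset, generator=generator)
        return DataLoader(dataset, batch_size=4, sampler=sampler)

    # Every context-parallel rank that builds its loader this way sees the same batches.
    loader = build_loader(TensorDataset(torch.arange(32, dtype=torch.float32)), data_loader_seed=1)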

docs/start/install.rst

Lines changed: 14 additions & 14 deletions
@@ -19,7 +19,7 @@ Choices of Backend Engines
 
 We recommend using **FSDP** backend to investigate, research and prototype different models, datasets and RL algorithms. The guide for using FSDP backend can be found in :doc:`FSDP Workers<../workers/fsdp_workers>`.
 
-For users who pursue better scalability, we recommend using **Megatron-LM** backend. Currently, we support Megatron-LM v0.11 [1]_. The guide for using Megatron-LM backend can be found in :doc:`Megatron-LM Workers<../workers/megatron_workers>`.
+For users who pursue better scalability, we recommend using **Megatron-LM** backend. Currently, we support `Megatron-LM v0.11<https://github.com/NVIDIA/Megatron-LM/tree/v0.11.0>`_. The guide for using Megatron-LM backend can be found in :doc:`Megatron-LM Workers<../workers/megatron_workers>`.
 
 .. note::
 
@@ -39,19 +39,19 @@ Install from docker image
 
 We provide pre-built Docker images for quick setup.
 
-For latest vllm, please use ``hiyouga/verl:ngc-th2.6.0-cu120-vllm0.8.2-verl0.3.0.post1`` with vllm v0.8.2 with FSDP.
-
-For users who need latest Megatron, please use ``whatcanyousee/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te2.0-megatron0.11.0-v0.0.6`` for vllm v0.6.3 with Megatron/FSDP.
+For latest vllm and Megatron or FSDP, please use ``whatcanyousee/verl:ngc-th2.6.0-cu124-vllm0.8.2-mcore0.11.0-te2.0``.
 
 For SGLang with FSDP, please use ``ocss884/verl-sglang:ngc-th2.5.1-cu126-sglang0.4.4.post4`` which is provided SGLang RL Group.
 
 See files under ``docker/`` for NGC-based image or if you want to build your own.
 
-1. Launch the desired Docker image:
+1. Launch the desired Docker image and attach into it:
 
 .. code:: bash
 
-    docker run --runtime=nvidia -it --rm --shm-size="10g" --cap-add=SYS_ADMIN -v <image:tag>
+    docker create --runtime=nvidia --gpus all --net=host --shm-size="10g" --cap-add=SYS_ADMIN -v .:/workspace/verl --name verl <image:tag>
+    docker start verl
+    docker exec -it verl bash
 
 
 2. Inside the container, install latest verl:
@@ -65,16 +65,16 @@ See files under ``docker/`` for NGC-based image or if you want to build your own
 
 .. note::
 
-    The Docker image ``whatcanyousee/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te2.0-megatron0.11.0-v0.0.6`` is built with the following configurations:
+    The Docker image ``whatcanyousee/verl:ngc-th2.6.0-cu124-vllm0.8.2-mcore0.11.0-te2.0`` is built with the following configurations:
 
-    - **PyTorch**: 2.4.0+cu124
+    - **PyTorch**: 2.6.0+cu124
     - **CUDA**: 12.4
-    - **Megatron-LM**: core_r0.11.0
-    - **vLLM**: 0.6.3
-    - **Ray**: 2.10.0
-    - **TransformerEngine**: 2.0.0+754d2a0
+    - **Megatron-LM**: v0.11.0
+    - **vLLM**: 0.8.2
+    - **Ray**: 2.44.0
+    - **TransformerEngine**: 2.0.0
 
-    Now verl has been **compatible to Megatron-LM core_r0.11.0**, and there is **no need to apply patches** to Megatron-LM. Also, the image has integrated **Megatron-LM core_r0.11.0**, located at ``/opt/nvidia/Meagtron-LM``. One more thing, because verl only use ``megatron.core`` module for now, there is **no need to modify** ``PATH`` if you have installed Megatron-LM with this docker image.
+    Now verl has been **compatible to Megatron-LM v0.11.0**, and there is **no need to apply patches** to Megatron-LM. Also, the image has integrated **Megatron-LM v0.11.0**, located at ``/opt/nvidia/Meagtron-LM``. One more thing, because verl only use ``megatron.core`` module for now, there is **no need to modify** ``PATH`` if you have installed Megatron-LM with this docker image.
 
 
 Install from custom environment
@@ -94,7 +94,7 @@ own post-training jobs.
 .. code:: bash
 
     # install verl together with some lightweight dependencies in setup.py
-    pip3 install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
+    pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu126
    pip3 install flash-attn --no-build-isolation
    git clone https://github.com/volcengine/verl.git
    cd verl

scripts/model_merger.py

Lines changed: 2 additions & 1 deletion
@@ -86,7 +86,8 @@ def convert_fsdp_checkpoints_to_hfmodels():
         assert world_size, "No model file with the proper format"
 
         state_dict = torch.load(os.path.join(local_dir, f'model_world_size_{world_size}_rank_{rank}.pt'),
-                                map_location='cpu')
+                                map_location='cpu',
+                                weights_only=False)
         pivot_key = sorted(list(state_dict.keys()))[0]
         weight = state_dict[pivot_key]
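For context on the ``weights_only=False`` addition: PyTorch 2.6.0 flipped the default of ``torch.load`` to ``weights_only=True``, which refuses to unpickle the non-tensor objects stored in these FSDP checkpoint files. A minimal illustration of the same call (the checkpoint filename here is hypothetical):

    import torch

    ckpt = "model_world_size_8_rank_0.pt"  # hypothetical checkpoint path

    # Under torch >= 2.6 the new default weights_only=True can raise an
    # UnpicklingError for checkpoints that contain arbitrary Python objects,
    # so the merger now opts out explicitly for this trusted, locally produced file.
    state_dict = torch.load(ckpt, map_location="cpu", weights_only=False)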
