Support Megatron 0.11.0 and vLLM 0.8.2, update images to use latest vllm and Megatron #851

Status: Closed (wants to merge 32 commits)

Commits (all by ETOgaosion):
- b20a656  update images (Mar 31, 2025)
- a842759  try to fix triton and flash_attn version errors (Mar 31, 2025)
- 6313cb9  try to fix triton and flash_attn version errors (Mar 31, 2025)
- 6ea4647  training almost fix, vllm to fix (Apr 1, 2025)
- c302b74  fall back a test config (Apr 1, 2025)
- c12669d  fall back a test config (Apr 1, 2025)
- ee63638  seems able to run (Apr 5, 2025)
- b1c9be1  format (Apr 5, 2025)
- 0f260fe  test back in merlin (Apr 7, 2025)
- 9cb3249  format (Apr 7, 2025)
- cf97437  able to run (Apr 7, 2025)
- 6555aa7  able to run (Apr 7, 2025)
- a1a4493  format (Apr 7, 2025)
- 03143c8  not related file (Apr 7, 2025)
- 72f1d87  fix errors (Apr 8, 2025)
- 6e51799  fix torch load (Apr 8, 2025)
- d01c268  test loss megatron (Apr 8, 2025)
- 111975e  dataset error (Apr 8, 2025)
- 16191fa  per tensor (Apr 8, 2025)
- 27ee9a4  hot fix convert weight (Apr 8, 2025)
- b397590  fix final_layernorm (Apr 8, 2025)
- 1a92c07  fix vLLM (Apr 8, 2025)
- 605b868  deepseek ckpt error (Apr 9, 2025)
- 0953176  release ray test (Apr 9, 2025)
- e1e4401  release sandbox test (Apr 9, 2025)
- fa26943  requirements pyarrow too low (Apr 9, 2025)
- 5ec10b3  unrelated file (Apr 9, 2025)
- c448089  fix preprocess and postprocess logic (Apr 10, 2025)
- de932fd  fix numpy import (Apr 10, 2025)
- 944d4bf  not compatible with cp (Apr 11, 2025)
- f2380ae  format (Apr 11, 2025)
- 9b026ab  fix checkpoint rng_states confliction (Apr 12, 2025)

Changes from all commits

5 changes: 2 additions & 3 deletions .github/workflows/checkpoints.yml
@@ -22,7 +22,7 @@ permissions:
   contents: read
 
 jobs:
-  e2e_gsm8k_megatron:
+  checkpoints:
     runs-on: [self-hosted, l20-0]
     timeout-minutes: 40 # Increase this timeout value as needed
     env:
@@ -31,7 +31,7 @@ jobs:
       NO_PROXY: "localhost,127.0.0.1"
       HF_HUB_ENABLE_HF_TRANSFER: 1
     container:
-      image: whatcanyousee/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te2.0-megatron0.11.0-v0.0.6
+      image: whatcanyousee/verl:ngc-th2.6.0-cu124-vllm0.8.2-mcore0.11.0
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
@@ -47,7 +47,6 @@ jobs:
       - name: Running Checkpoint Integration Test (Qwen Megatron)
         run: |
           ray stop --force
-          export PYTHONPATH=$PYTHONPATH:/opt/nvidia/Megatron-LM
           bash tests/checkpoint/run_qwen_megatron_ckpt.sh
       - name: Running Checkpoint Integration Test (Deepseek Megatron)
         run: |

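The deleted PYTHONPATH export matches the new image layout: Megatron-LM is now installed as a package rather than vendored under /opt/nvidia/Megatron-LM. A rough sketch for reproducing this job locally, assuming the same container image and a GPU host; the mount path and flags are illustrative, not the exact CI setup:

    # Start the CI image with the repo mounted, then run the Qwen Megatron checkpoint test (sketch)
    docker run --runtime=nvidia --gpus all --shm-size=10g -v "$PWD":/workspace/verl -w /workspace/verl -it \
        whatcanyousee/verl:ngc-th2.6.0-cu124-vllm0.8.2-mcore0.11.0 bash
    # inside the container:
    pip3 install hf_transfer
    pip3 install -e .[test]
    ray stop --force
    bash tests/checkpoint/run_qwen_megatron_ckpt.sh
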
2 changes: 1 addition & 1 deletion .github/workflows/dataset.yml
@@ -32,7 +32,7 @@ jobs:
       NO_PROXY: "localhost,127.0.0.1"
       HF_HUB_ENABLE_HF_TRANSFER: 1
     container:
-      image: verlai/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3
+      image: whatcanyousee/verl:ngc-th2.6.0-cu124-vllm0.8.2-mcore0.11.0
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

2 changes: 1 addition & 1 deletion .github/workflows/e2e_eval_aime24.yml
@@ -28,7 +28,7 @@ jobs:
       NO_PROXY: "localhost,127.0.0.1"
       HF_HUB_ENABLE_HF_TRANSFER: 1
     container:
-      image: hiyouga/verl:ngc-th2.6.0-cu120-vllm0.8.2
+      image: whatcanyousee/verl:ngc-th2.6.0-cu126-vllm0.8.2-mcore0.11.0
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

8 changes: 4 additions & 4 deletions .github/workflows/e2e_grpo.yml
@@ -24,7 +24,7 @@ permissions:
   contents: read
 
 jobs:
-  e2e_gsm8k_megatron-l20-0:
+  e2e_grpo-l20-0:
     runs-on: [self-hosted, l20-0]
     timeout-minutes: 40 # Increase this timeout value as needed
     env:
@@ -33,7 +33,7 @@ jobs:
       NO_PROXY: "localhost,127.0.0.1"
       HF_HUB_ENABLE_HF_TRANSFER: 1
     container:
-      image: whatcanyousee/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te2.0-megatron0.11.0-v0.0.6
+      image: whatcanyousee/verl:ngc-th2.6.0-cu124-vllm0.8.2-mcore0.11.0
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
@@ -54,7 +54,7 @@ jobs:
         run: |
           ray stop --force
           bash tests/e2e/run_qwen_grpo_megatron.sh
-  e2e_gsm8k_megatron-l20-1:
+  e2e_grpo-l20-1:
     runs-on: [self-hosted, l20-1]
     timeout-minutes: 40 # Increase this timeout value as needed
     env:
@@ -63,7 +63,7 @@ jobs:
       NO_PROXY: "localhost,127.0.0.1"
      HF_HUB_ENABLE_HF_TRANSFER: 1
     container:
-      image: whatcanyousee/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te2.0-megatron0.11.0-v0.0.6
+      image: whatcanyousee/verl:ngc-th2.6.0-cu124-vllm0.8.2-mcore0.11.0
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

2 changes: 1 addition & 1 deletion .github/workflows/e2e_gsm8k.yml
@@ -33,7 +33,7 @@ jobs:
       NO_PROXY: "localhost,127.0.0.1"
       HF_HUB_ENABLE_HF_TRANSFER: 1
     container:
-      image: hiyouga/verl:ngc-th2.6.0-cu120-vllm0.8.2
+      image: whatcanyousee/verl:ngc-th2.6.0-cu124-vllm0.8.2-mcore0.11.0
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

28 changes: 25 additions & 3 deletions .github/workflows/e2e_gsm8k_megatron.yml
@@ -1,5 +1,5 @@
 name: e2e_gsm8k_megatron
-# latest version: Megatron-LM core_r0.11.0 https://github.com/NVIDIA/Megatron-LM/tree/core_r0.11.0
+# latest version: Megatron-LM v0.11.0 https://github.com/NVIDIA/Megatron-LM/tree/v0.11.0
 
 on:
   # Trigger the workflow on push or pull request,
@@ -26,7 +26,7 @@ permissions:
   contents: read
 
 jobs:
-  e2e_gsm8k_megatron:
+  e2e_gsm8k_megatron-l20-0:
     runs-on: [self-hosted, l20-0]
     timeout-minutes: 40 # Increase this timeout value as needed
     env:
@@ -35,7 +35,7 @@ jobs:
       NO_PROXY: "localhost,127.0.0.1"
       HF_HUB_ENABLE_HF_TRANSFER: 1
     container:
-      image: whatcanyousee/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te2.0-megatron0.11.0-v0.0.6
+      image: whatcanyousee/verl:ngc-th2.6.0-cu124-vllm0.8.2-mcore0.11.0
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
@@ -52,6 +52,28 @@ jobs:
         run: |
           ray stop --force
           bash tests/e2e/run_deepseek_megatron_parallelism.sh
+  e2e_gsm8k_megatron-l20-1:
+    runs-on: [self-hosted, l20-1]
+    timeout-minutes: 40 # Increase this timeout value as needed
+    env:
+      HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
+      HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
+      NO_PROXY: "localhost,127.0.0.1"
+      HF_HUB_ENABLE_HF_TRANSFER: 1
+    container:
+      image: whatcanyousee/verl:ngc-th2.6.0-cu124-vllm0.8.2-mcore0.11.0
+      options: --gpus all --shm-size=10g
+    steps:
+      - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
+        with:
+          fetch-depth: 0
+      - name: Install the current repository
+        run: |
+          pip3 install hf_transfer
+          pip3 install -e .[test]
+      - name: Prepare gsm8k dataset
+        run: |
+          python3 examples/data_preprocess/gsm8k.py
+      - name: Running gsm8k e2e training tests with 3D parallelism on 8 L20 GPUs with Megatron (Qwen)
+        run: |
+          ray stop --force

2 changes: 1 addition & 1 deletion .github/workflows/e2e_gsm8k_prime.yml
@@ -30,7 +30,7 @@ jobs:
       NO_PROXY: "localhost,127.0.0.1"
       HF_HUB_ENABLE_HF_TRANSFER: 1
     container:
-      image: hiyouga/verl:ngc-th2.6.0-cu120-vllm0.8.2
+      image: whatcanyousee/verl:ngc-th2.6.0-cu124-vllm0.8.2-mcore0.11.0
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

2 changes: 1 addition & 1 deletion .github/workflows/e2e_lora.yml
@@ -33,7 +33,7 @@ jobs:
       NO_PROXY: "localhost,127.0.0.1"
       HF_HUB_ENABLE_HF_TRANSFER: 1
     container:
-      image: verlai/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3
+      image: whatcanyousee/verl:ngc-th2.6.0-cu124-vllm0.8.2-mcore0.11.0
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

2 changes: 1 addition & 1 deletion .github/workflows/e2e_sft.yml
@@ -33,7 +33,7 @@ jobs:
       NO_PROXY: "localhost,127.0.0.1"
       HF_HUB_ENABLE_HF_TRANSFER: 1
     container:
-      image: verlai/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3
+      image: whatcanyousee/verl:ngc-th2.6.0-cu124-vllm0.8.2-mcore0.11.0
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

2 changes: 1 addition & 1 deletion .github/workflows/e2e_vlm_geo3k.yml
@@ -27,7 +27,7 @@ jobs:
       NO_PROXY: "localhost,127.0.0.1"
       HF_HUB_ENABLE_HF_TRANSFER: 1
     container:
-      image: hiyouga/verl:ngc-th2.6.0-cu120-vllm0.8.2
+      image: whatcanyousee/verl:ngc-th2.6.0-cu124-vllm0.8.2-mcore0.11.0
       options: --gpus all --shm-size=40g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

2 changes: 1 addition & 1 deletion .github/workflows/model.yml
@@ -27,7 +27,7 @@ jobs:
       NO_PROXY: "localhost,127.0.0.1"
       HF_HUB_ENABLE_HF_TRANSFER: 1
     container:
-      image: whatcanyousee/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te2.0-megatron0.11.0-v0.0.6
+      image: whatcanyousee/verl:ngc-th2.6.0-cu124-vllm0.8.2-mcore0.11.0
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

4 changes: 2 additions & 2 deletions .github/workflows/ray_test.yml
@@ -31,14 +31,14 @@ permissions:
 jobs:
   ray:
     runs-on: [self-hosted, l20-0]
-    timeout-minutes: 5 # Increase this timeout value as needed
+    timeout-minutes: 10 # Increase this timeout value as needed
     env:
       HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
       HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
       NO_PROXY: "localhost,127.0.0.1"
       HF_HUB_ENABLE_HF_TRANSFER: 1
     container:
-      image: verlai/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3
+      image: whatcanyousee/verl:ngc-th2.6.0-cu124-vllm0.8.2-mcore0.11.0
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

4 changes: 2 additions & 2 deletions .github/workflows/sandbox.yml
@@ -23,14 +23,14 @@ permissions:
 jobs:
   sandbox:
     runs-on: [self-hosted, l20-0]
-    timeout-minutes: 3 # Increase this timeout value as needed
+    timeout-minutes: 10 # Increase this timeout value as needed
     env:
       HTTP_PROXY: ${{ secrets.PROXY_HTTP }}
       HTTPS_PROXY: ${{ secrets.PROXY_HTTPS }}
       NO_PROXY: "localhost,127.0.0.1"
       HF_HUB_ENABLE_HF_TRANSFER: 1
     container:
-      image: verlai/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3
+      image: whatcanyousee/verl:ngc-th2.6.0-cu124-vllm0.8.2-mcore0.11.0
       options: --gpus all --shm-size=10g
     steps:
       - uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

2 changes: 2 additions & 0 deletions .github/workflows/vllm.yml
@@ -43,13 +43,15 @@ jobs:
           pip3 install hf_transfer
           pip3 install -e .[test]
           pip3 install vllm==0.5.4
+          pip3 install flash_attn
       - name: Running vllm tests on 8 L20 GPUs
         run: |
           cd tests/rollout
           torchrun --standalone --nnodes=1 --nproc_per_node=8 $(which pytest) -s test_vllm_hf_loader.py
       - name: Test the latest vLLM
         run: |
           pip3 install --upgrade vllm==0.7.3
+          pip3 install flash_attn
           cd tests/rollout
           torchrun --standalone --nnodes=1 --nproc_per_node=4 $(which pytest) -s test_vllm_spmd.py
       - name: Run Qwen 0.5B generation test

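To debug the rollout tests outside CI, a minimal local sketch; it assumes the updated image (so vLLM 0.8.2 is already present) and a 4-GPU host, neither of which is spelled out by this workflow:

    # Run the SPMD rollout test from a verl checkout inside the container (sketch)
    pip3 install -e .[test]
    pip3 install flash_attn
    cd tests/rollout
    torchrun --standalone --nnodes=1 --nproc_per_node=4 $(which pytest) -s test_vllm_spmd.py
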
39 changes: 33 additions & 6 deletions docker/Dockerfile.megatron
@@ -1,9 +1,36 @@
-FROM verlai/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3
-
-RUN pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable
-
-RUN cd /opt/nvidia && git clone --single-branch --branch core_r0.11.0 https://github.com/NVIDIA/Megatron-LM.git Megatron-LM
-
-# only config pip index with https://pypi.tuna.tsinghua.edu.cn/simple if needed
-# unset for now
-RUN cd /opt/nvidia/Megatron-LM && pip3 install --no-deps -e .
+FROM hiyouga/verl:ngc-th2.6.0-cu120-vllm0.8.2
+
+# Define environments
+ENV MAX_JOBS=64
+
+RUN apt-get update && \
+    apt-get install -y aria2
+
+# 1. Reinstall CUDA 12.4
+RUN aria2c https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin && \
+    mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
+
+RUN aria2c --always-resume=true --max-tries=99999 https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda-repo-ubuntu2204-12-4-local_12.4.1-550.54.15-1_amd64.deb
+
+RUN dpkg -i cuda-repo-ubuntu2204-12-4-local_12.4.1-550.54.15-1_amd64.deb
+
+RUN cp /var/cuda-repo-ubuntu2204-12-4-local/cuda-*-keyring.gpg /usr/share/keyrings/
+
+RUN apt-get update
+
+RUN apt-get -y install cuda-toolkit-12-4
+
+RUN rm cuda-repo-ubuntu2204-12-4-local_12.4.1-550.54.15-1_amd64.deb
+
+RUN update-alternatives --set cuda /usr/local/cuda-12.4
+
+# 2. Install Apex
+RUN git clone https://github.com/NVIDIA/apex.git && \
+    cd apex && \
+    pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
+
+# 3. Install TransformerEngine
+RUN export NVTE_FRAMEWORK=pytorch && pip3 install --no-deps git+https://github.com/NVIDIA/[email protected]
+
+# 4. Install Megatron-LM
+RUN pip3 install git+https://github.com/NVIDIA/[email protected]

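A build sketch for this Dockerfile; the tag mirrors the image name used by the workflows above and is an assumption here, since the PR does not show how the CI image is actually built and pushed:

    # Build the Megatron + vLLM image from the repo root (tag is illustrative)
    docker build -f docker/Dockerfile.megatron -t whatcanyousee/verl:ngc-th2.6.0-cu124-vllm0.8.2-mcore0.11.0 .
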
9 changes: 5 additions & 4 deletions docs/advance/checkpoint.rst
@@ -84,10 +84,11 @@ So example use of Megatron model merger is:
 
 .. code:: bash
 
-    python3 scripts/model_merger.py --backend megatron \
-        --is-value-model \
-        --hf_model_path Qwen/Qwen2-7B \
-        --local_dir checkpoints/verl_megatron_gsm8k_examples/deepseek_megatron_checkpoint_saveload/global_step_1/actor/model
+    python scripts/model_merger.py \
+        --backend megatron \
+        --tie-word-embedding \
+        --hf_model_path Qwen/Qwen2.5-0.5B \
+        --local_dir checkpoints/verl_megatron_gsm8k_examples/qwen2_5_0b5_megatron_saveload/global_step_1/actor
 
 Megatron Merger details
 -----------------------

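As a follow-up sanity check, the merged output can be loaded like any Hugging Face checkpoint. A sketch, assuming the merger wrote HF-format weights into a directory named merged_hf_model (a placeholder; the real output location depends on the merger's arguments):

    # Confirm the merged checkpoint loads with transformers (sketch)
    python3 -c "
    from transformers import AutoModelForCausalLM, AutoTokenizer
    model = AutoModelForCausalLM.from_pretrained('merged_hf_model')   # placeholder path
    tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B')    # tokenizer from the base HF model
    print(model.config)
    "
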
30 changes: 14 additions & 16 deletions docs/start/install.rst
@@ -19,7 +19,7 @@ Choices of Backend Engines
 
 We recommend using **FSDP** backend to investigate, research and prototype different models, datasets and RL algorithms. The guide for using FSDP backend can be found in :doc:`FSDP Workers<../workers/fsdp_workers>`.
 
-For users who pursue better scalability, we recommend using **Megatron-LM** backend. Currently, we support Megatron-LM v0.11 [1]_. The guide for using Megatron-LM backend can be found in :doc:`Megatron-LM Workers<../workers/megatron_workers>`.
+For users who pursue better scalability, we recommend using **Megatron-LM** backend. Currently, we support `Megatron-LM v0.11<https://github.com/NVIDIA/Megatron-LM/tree/v0.11.0>`_. The guide for using Megatron-LM backend can be found in :doc:`Megatron-LM Workers<../workers/megatron_workers>`.
 
 .. note::
 
@@ -40,17 +40,15 @@ Install from docker image
 
 We provide pre-built Docker images for quick setup. For SGLang usage, please follow the later sections in this doc.
 
-Image and tag: ``whatcanyousee/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te2.0-megatron0.11.0-v0.0.6`` if you need both FSDP and Megatron support.
+Image and tag: ``whatcanyousee/verl:ngc-th2.6.0-cu124-vllm0.8.2-mcore0.11.0``. Check files under ``docker/`` for NGC-based image or if you want to build your own.
 
-We highly recommend ``hiyouga/verl:ngc-th2.6.0-cu120-vllm0.8.2-verl0.3.0.post1`` with vllm v0.8.2 for fastest rollout performance with FSDP.
-
-See files under ``docker/`` for NGC-based image or if you want to build your own.
-
-1. Launch the desired Docker image:
+1. Launch the desired Docker image and attach into it:
 
 .. code:: bash
 
-    docker run --runtime=nvidia -it --rm --shm-size="10g" --cap-add=SYS_ADMIN -v <image:tag>
+    docker create --runtime=nvidia --gpus all --net=host --shm-size="10g" --cap-add=SYS_ADMIN -v .:/workspace/verl --name verl <image:tag>
+    docker start verl
+    docker exec -it verl bash
 
 2. Inside the container, install latest verl:
@@ -65,14 +63,14 @@
 
 The Docker image ``whatcanyousee/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te2.0-megatron0.11.0-v0.0.6`` is built with the following configurations:
 
-- **PyTorch**: 2.4.0+cu124
-- **CUDA**: 12.4
-- **Megatron-LM**: core_r0.11.0
-- **vLLM**: 0.6.3
-- **Ray**: 2.10.0
-- **TransformerEngine**: 2.0.0+754d2a0
+- **PyTorch**: 2.6.0+cu124
+- **CUDA**: 12.6
+- **Megatron-LM**: v0.11.0
+- **vLLM**: 0.8.2
+- **Ray**: 2.44.0
+- **TransformerEngine**: 2.1.0+8eb1712
 
-Now verl has been **compatible to Megatron-LM core_r0.11.0**, and there is **no need to apply patches** to Megatron-LM. Also, the image has integrated **Megatron-LM core_r0.11.0**, located at ``/opt/nvidia/Meagtron-LM``. One more thing, because verl only use ``megatron.core`` module for now, there is **no need to modify** ``PATH`` if you have installed Megatron-LM with this docker image.
+Now verl has been **compatible to Megatron-LM v0.11.0**, and there is **no need to apply patches** to Megatron-LM. Also, the image has integrated **Megatron-LM v0.11.0**, located at ``/opt/nvidia/Meagtron-LM``. One more thing, because verl only use ``megatron.core`` module for now, there is **no need to modify** ``PATH`` if you have installed Megatron-LM with this docker image.
 
@@ -127,7 +125,7 @@ own post-training jobs.
 
 .. code:: bash
 
     # install verl together with some lightweight dependencies in setup.py
-    pip3 install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
+    pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
     pip3 install flash-attn --no-build-isolation
     git clone https://github.com/volcengine/verl.git
     cd verl

Expand Down
2 changes: 1 addition & 1 deletion examples/grpo_trainer/run_deepseek7b_llm_math_megatron.sh
Original file line number Diff line number Diff line change
Expand Up @@ -40,7 +40,7 @@ python3 -m verl.trainer.main_ppo --config-path=config \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger=['console','wandb'] \
trainer.project_name='verl_grpo_example_gsm8k' \
trainer.project_name='try_fix_megatron_loss_calc' \
trainer.experiment_name='deepseek_llm_7b_function_rm_math_megatron' \
trainer.n_gpus_per_node=16 \
trainer.nnodes=1 \
Expand Down
2 changes: 1 addition & 1 deletion examples/grpo_trainer/run_qwen2-7b_math_megatron.sh
@@ -41,7 +41,7 @@ python3 -m verl.trainer.main_ppo --config-path=config \
     algorithm.use_kl_in_reward=False \
     trainer.critic_warmup=0 \
     trainer.logger=['console','wandb'] \
-    trainer.project_name='verl_grpo_example_gsm8k' \
+    trainer.project_name='try_fix_megatron_loss_calc' \
     trainer.experiment_name='qwen2_7b_function_rm_megatron' \
     trainer.n_gpus_per_node=16 \
     trainer.nnodes=1 \

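To exercise one of these example scripts against the new stack, a launch sketch; it assumes the required dataset has already been preprocessed (see examples/data_preprocess/) and that the node provides the 16 GPUs requested by trainer.n_gpus_per_node=16:

    # Launch the Qwen2-7B GRPO Megatron example from the repo root (sketch)
    ray stop --force
    bash examples/grpo_trainer/run_qwen2-7b_math_megatron.sh
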