Nemo2.0 on sagemaker hyperpod #570

Merged
mhuguesaws merged 34 commits into awslabs:main from olaoyea4:nemo2.0
Mar 12, 2025

Conversation

@olaoyea4
Contributor

Description of changes: This change includes code on how to run pretraining job on a sagemaker hyperpod cluster using NeMo 2.0 framework

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@olaoyea4 olaoyea4 changed the title Nemo2.0 Nemo2.0 on sagemaker hyperpod Feb 26, 2025
Collaborator

@KeitaW KeitaW left a comment

Thank you @olaoyea4 for contributing!
We are migrating to a new directory structure for each test case.
Could you please update this test case to follow the structure below?

  • testcase
    • model
      • slurm
      • kubernetes

Also, HyperPod cluster deployment guide is not necessary, as we already have it in https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod and https://catalog.workshops.aws/sagemaker-hyperpod/en-US.

return parser


def slurm_executor(
Collaborator

@KeitaW KeitaW Feb 26, 2025

I like how the training job is launched. It's more straightforward than the existing way. Can we also integrate a feature adding --auto-resume=1 when (and only when) we are on HyperPod? cf. https://github.com/aws-samples/awsome-distributed-training/blob/27531abd134a7db726b5babb00c19e97d451d8dd/3.test_cases/16.pytorch-cpu-ddp/slurm/1.conda-train.sbatch#L18-L21
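The conditional flag from the linked sbatch file could be sketched as a small helper; a minimal sketch, assuming (as in the 16.pytorch-cpu-ddp example) that HyperPod nodes can be detected via a marker directory such as /opt/sagemaker_cluster — the function name auto_resume_flag is illustrative:

```shell
#!/bin/bash
# Sketch: emit --auto-resume=1 only when running on a HyperPod cluster.
# Assumption: HyperPod nodes are detected via a marker directory
# (defaulting to /opt/sagemaker_cluster, as in the linked example);
# the helper name auto_resume_flag is illustrative.
auto_resume_flag() {
    local marker_dir="${1:-/opt/sagemaker_cluster}"
    if [ -d "$marker_dir" ]; then
        echo "--auto-resume=1"
    fi
}

# The flag could then be spliced into the srun line of the generated script:
#   srun $(auto_resume_flag) --output ... python -m nemo_run.core.runners.fdl_runner ...
```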

Contributor

@KeitaW - thank you for the suggestion. Currently the sbatch file is built automatically by nemo_run, and it does not support an auto-resume argument just yet. We are working on adding that support; however, it will take some time to build and test with SageMaker HyperPod. For this iteration, adding this support would be out of scope.

Collaborator

Understood, thanks. @aroraakshit please let us know when nemo_run supports custom flags.

@olaoyea4
Contributor Author

olaoyea4 commented Feb 27, 2025

Thank you @olaoyea4 for contributing! We are migrating to a new directory structure for each test case. Could you please update this test case to follow the structure below?

  • testcase

    • model

      • slurm
      • kubernetes

Also, HyperPod cluster deployment guide is not necessary, as we already have it in https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod and https://catalog.workshops.aws/sagemaker-hyperpod/en-US.

Thanks @KeitaW. Modified the README to point to the workshop link for HyperPod deployment.

For the folder structure, I'm not sure I understand; should I have a slurm folder, a kubernetes folder, or both? In this case, this example does not cover a specific model, and it's using a HyperPod cluster based on Slurm, not EKS. What would the folder structure look like?

&& rm -rf /opt/hpcx/nccl_rdma_sharp_plugin \
&& ldconfig

RUN DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated \
Contributor

Please list the packages in alphabetical order. Easier to read.

DEBIAN_FRONTEND=noninteractive apt autoremove -y

# EFA
RUN apt-get update && \
Contributor

Please remove the update; it is already done on the first line.

FROM nvcr.io/nvidia/nemo:24.12
ENV DEBIAN_FRONTEND=noninteractive

ENV GDRCOPY_VERSION=v2.4.1
Contributor

@mhuguesaws mhuguesaws left a comment

Left comments

@KeitaW
Collaborator

KeitaW commented Mar 5, 2025

Thank you @olaoyea4 for contributing! We are migrating to a new directory structure for each test case. Could you please update this test case to follow the structure below?

  • testcase

    • model

      • slurm
      • kubernetes

Also, HyperPod cluster deployment guide is not necessary, as we already have it in https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod and https://catalog.workshops.aws/sagemaker-hyperpod/en-US.

Thanks @KeitaW. Modified the README to point to the workshop link for HyperPod deployment.

For the folder structure, I'm not sure I understand; should I have a slurm folder, a kubernetes folder, or both? In this case, this example does not cover a specific model, and it's using a HyperPod cluster based on Slurm, not EKS. What would the folder structure look like?

Kindly use https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/16.pytorch-cpu-ddp as a template. You can leave the kubernetes folder empty with a .gitkeep file.
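The suggested layout could be created with a few shell commands; a minimal sketch, where the top-level test-case name nemo-run is illustrative:

```shell
# Sketch: create the suggested test-case layout. Git does not track
# empty directories, so the kubernetes folder is kept with a .gitkeep
# placeholder. The top-level name "nemo-run" is illustrative.
mkdir -p nemo-run/slurm nemo-run/kubernetes
touch nemo-run/kubernetes/.gitkeep
ls -R nemo-run
```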

@olaoyea4
Contributor Author

olaoyea4 commented Mar 5, 2025

Thank you @olaoyea4 for contributing! We are migrating to a new directory structure for each test case. Could you please update this test case to follow the structure below?

  • testcase

    • model

      • slurm
      • kubernetes

Also, HyperPod cluster deployment guide is not necessary, as we already have it in https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod and https://catalog.workshops.aws/sagemaker-hyperpod/en-US.

Thanks @KeitaW. Modified the README to point to the workshop link for HyperPod deployment.
For the folder structure, I'm not sure I understand; should I have a slurm folder, a kubernetes folder, or both? In this case, this example does not cover a specific model, and it's using a HyperPod cluster based on Slurm, not EKS. What would the folder structure look like?

Kindly use https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/16.pytorch-cpu-ddp as a template. You can leave the kubernetes folder empty with a .gitkeep file.

Resolved! I have reorganized the directory structure as mentioned.

@KeitaW
Collaborator

KeitaW commented Mar 8, 2025

#!/bin/bash
#
# Generated by NeMo Run
# Run with: sbatch --requeue --parsable
#

# Parameters
#SBATCH --account=ubuntu
#SBATCH --exclusive
#SBATCH --gpus-per-node=8
#SBATCH --job-name=ubuntu-ubuntu.training
#SBATCH --mem=0
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --open-mode=append
#SBATCH --output=/fsx/ubuntu/.nemo_run/experiments/aws-nemo2T9LV7/aws-nemo2T9LV7_1741441363/training/sbatch_ubuntu-ubuntu.training_%j.out
#SBATCH --partition=dev
#SBATCH --time=01:00:00

set -evx

export PYTHONUNBUFFERED=1
export SLURM_UNBUFFEREDIO=1
export TORCHX_MAX_RETRIES=0

set +e

# setup

nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

export TORCH_NCCL_AVOID_RECORD_STREAMS=1
export NCCL_NVLS_ENABLE=0
export NVTE_DP_AMAX_REDUCE_INTERVAL=0
export NVTE_ASYNC_AMAX_REDUCTION=1
export NVTE_FUSED_ATTN=0
export FI_EFA_USE_HUGE_PAGE=0


# Command 1

srun --output /fsx/ubuntu/.nemo_run/experiments/aws-nemo2T9LV7/aws-nemo2T9LV7_1741441363/training/log-ubuntu-ubuntu.training_%j_${SLURM_RESTART_COUNT:-0}.out --container-image /fsx/ubuntu/aws-nemo-24-12.sqsh --container-mounts /fsx/ubuntu/megatron:/root/.cache/torch/megatron,/fsx/ubuntu/.nemo_run/experiments/aws-nemo2T9LV7/aws-nemo2T9LV7_1741441363/training:/nemo_run --container-workdir /nemo_run/code --wait=60 --kill-on-bad-exit=1 python -m nemo_run.core.runners.fdl_runner -n training /nemo_run/configs/training_fn_or_script

exitcode=$?

set -e

echo "job exited with code $exitcode"
if [ $exitcode -ne 0 ]; then
    if [ "$TORCHX_MAX_RETRIES" -gt "${SLURM_RESTART_COUNT:-0}" ]; then
        scontrol requeue "$SLURM_JOB_ID"
    fi
    exit $exitcode
fi

Here is the Slurm script that the latest nemo-run generated.

@KeitaW
Collaborator

KeitaW commented Mar 8, 2025

It turns out that the HyperPod cluster I'm running does not like:

#SBATCH --gpus-per-node=8

@KeitaW
Collaborator

KeitaW commented Mar 8, 2025

Related discussion #462 (comment)

@KeitaW
Collaborator

KeitaW commented Mar 8, 2025

Possibly a newer HP cluster does not have the issue. I will re-deploy an HP cluster to test it out.

@KeitaW
Collaborator

KeitaW commented Mar 8, 2025

(nemo-env) ubuntu@ip-10-1-0-217:~/nemo-review/3.test_cases/21.nemo-run/slurm$ tail -f /fsx/ubuntu/.nemo_run/experiments/aws-nemo2T9LV7/aws-nemo2T9LV7_1741441363/training/sbatch_ubuntu-ubuntu.training_25.out 
+ NVTE_FUSED_ATTN=0
export FI_EFA_USE_HUGE_PAGE=0
+ export FI_EFA_USE_HUGE_PAGE=0
+ FI_EFA_USE_HUGE_PAGE=0


# Command 1

srun --output /fsx/ubuntu/.nemo_run/experiments/aws-nemo2T9LV7/aws-nemo2T9LV7_1741441363/training/log-ubuntu-ubuntu.training_%j_${SLURM_RESTART_COUNT:-0}.out --container-image /fsx/ubuntu/aws-nemo-24-12.sqsh --container-mounts /fsx/ubuntu/megatron:/root/.cache/torch/megatron,/fsx/ubuntu/.nemo_run/experiments/aws-nemo2T9LV7/aws-nemo2T9LV7_1741441363/training:/nemo_run --container-workdir /nemo_run/code --wait=60 --kill-on-bad-exit=1 python -m nemo_run.core.runners.fdl_runner -n training /nemo_run/configs/training_fn_or_script
+ srun --output /fsx/ubuntu/.nemo_run/experiments/aws-nemo2T9LV7/aws-nemo2T9LV7_1741441363/training/log-ubuntu-ubuntu.training_%j_0.out --container-image /fsx/ubuntu/aws-nemo-24-12.sqsh --container-mounts /fsx/ubuntu/megatron:/root/.cache/torch/megatron,/fsx/ubuntu/.nemo_run/experiments/aws-nemo2T9LV7/aws-nemo2T9LV7_1741441363/training:/nemo_run --container-workdir /nemo_run/code --wait=60 --kill-on-bad-exit=1 python -m nemo_run.core.runners.fdl_runner -n training /nemo_run/configs/training_fn_or_script
^C
(nemo-env) ubuntu@ip-10-1-0-217:~/nemo-review/3.test_cases/21.nemo-run/slurm$ tail -f /fsx/ubuntu/.nemo_run/experiments/aws-nemo2T9LV7/aws-nemo2T9LV7_1741441363/training/log-ubuntu-ubuntu.training_25_0.out
Training epoch 11, iteration 0/999 | lr: 3.148e-05 | global_batch_size: 512 | global_step: 209 | reduced_train_loss: 10.83 | train_step_timing in s: 4.403 | consumed_samples: 107520
[NeMo I 2025-03-08 23:35:31 model_checkpoint:522] Async checkpoint save for step 209 (aws-nemo2T9LV7/checkpoints/model_name=0--val_loss=0.00-step=208-consumed_samples=107008.0-last.ckpt) finalized successfully.
Training epoch 11, iteration 1/999 | lr: 3.163e-05 | global_batch_size: 512 | global_step: 210 | reduced_train_loss: 10.83 | train_step_timing in s: 4.319 | consumed_samples: 108032
Training epoch 11, iteration 2/999 | lr: 3.178e-05 | global_batch_size: 512 | global_step: 211 | reduced_train_loss: 10.83 | train_step_timing in s: 4.278 | consumed_samples: 108544
Training epoch 11, iteration 3/999 | lr: 3.193e-05 | global_batch_size: 512 | global_step: 212 | reduced_train_loss: 10.83 | train_step_timing in s: 4.285 | consumed_samples: 109056
...

The config without gres works on HP.
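Until GRES is enabled on HyperPod, one workaround is to drop the directive from the generated script before submission; a minimal sketch, where the helper name strip_gres_directive and the scontrol probe are illustrative and not part of nemo_run:

```shell
#!/bin/bash
# Sketch: remove the --gpus-per-node directive from a generated sbatch
# script when the cluster has no GPU GRES configured. The helper name
# strip_gres_directive is illustrative, not part of nemo_run.
strip_gres_directive() {
    # $1: path to the generated sbatch script
    sed -i '/^#SBATCH --gpus-per-node/d' "$1"
}

# Only strip when Slurm reports no gpu GRES on any node, e.g.:
#   scontrol show nodes | grep -q 'Gres=gpu' || strip_gres_directive generated.sbatch
```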

@KeitaW
Collaborator

KeitaW commented Mar 10, 2025

Confirmed that GRES is not enabled by default on HP even today. Will check with the service team.

@KeitaW
Collaborator

KeitaW commented Mar 11, 2025

#570 (review) is addressed by 1925409 and 1043ab4.

@KeitaW KeitaW requested a review from mhuguesaws March 11, 2025 09:55
@KeitaW KeitaW self-assigned this Mar 11, 2025
Contributor

@mhuguesaws mhuguesaws left a comment

Left comment

Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>
@olaoyea4
Contributor Author

Left comment

The suggested change has been accepted.

@aroraakshit
Contributor

@mhuguesaws can we merge given all comments have been addressed?

Contributor

@mhuguesaws mhuguesaws left a comment

LGTM

@mhuguesaws mhuguesaws merged commit e67fc35 into awslabs:main Mar 12, 2025
@KeitaW
Collaborator

KeitaW commented Mar 13, 2025

One suggestion: we may want to refer to the specific commit e67fc35 in the upcoming blog post, as the directory structure of this repository might change in the near future.

KeitaW added a commit that referenced this pull request Feb 17, 2026
* Initial commit

* add readme

* update readme

* update readme

* update readme

* update readme and add new files

* re-organize directory

* Apply suggestions to Dockerfile from code review

Co-authored-by: Keita Watanabe <keitaw09@gmail.com>

* Update 3.test_cases/21.nemo-run/slurm/README.md

* Update 3.test_cases/21.nemo-run/slurm/README.md

* Update 3.test_cases/21.nemo-run/README.md

* Delete 3.test_cases/21.nemo-run/slurm/cluster-config-template.json

* Update 3.test_cases/21.nemo-run/slurm/README.md

* Update README.md

* Update venv.sh

* Update 3.test_cases/21.nemo-run/README.md

* update

* Update 3.test_cases/21.nemo-run/slurm/Dockerfile

Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>

* Update 3.test_cases/21.nemo-run/slurm/Dockerfile

Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>

* ignore virtual env

* update scripts

* Enable the use of NVLink SHARP (NVLS) by default.

It will be disabled if it is not supported by the hardware.
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-nvls-enable

* remove k8s subdirectory for now

* remove gres option

* move Dockerfile

* update Dockerfile

* Clean up

* Update 3.test_cases/21.nemo-run/slurm/run.py

Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>

* increment test case numbering

---------

Co-authored-by: Keita Watanabe <keitaw09@gmail.com>
Co-authored-by: Keita Watanabe <mlkeita@amazon.com>
Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>