Nemo2.0 on sagemaker hyperpod #570

Merged
mhuguesaws merged 34 commits into awslabs:main from olaoyea4:nemo2.0
Mar 12, 2025

Conversation

@olaoyea4
Contributor

Description of changes: This change includes code on how to run pretraining job on a sagemaker hyperpod cluster using NeMo 2.0 framework

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@olaoyea4 olaoyea4 changed the title Nemo2.0 Nemo2.0 on sagemaker hyperpod Feb 26, 2025
Collaborator

@KeitaW KeitaW left a comment

Thank you @olaoyea4 for contributing!
We are migrating to a new directory structure for each test case.
Could you please update this test case to follow the structure below?

  • testcase
    • model
      • slurm
      • kubernetes

Also, HyperPod cluster deployment guide is not necessary, as we already have it in https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod and https://catalog.workshops.aws/sagemaker-hyperpod/en-US.

return parser


def slurm_executor(
Collaborator

@KeitaW KeitaW Feb 26, 2025

I like how the training job is launched. It's more straightforward than the existing way. Can we also integrate a feature adding --auto-resume=1 when (and only when) we are on HyperPod? cf. https://github.com/aws-samples/awsome-distributed-training/blob/27531abd134a7db726b5babb00c19e97d451d8dd/3.test_cases/16.pytorch-cpu-ddp/slurm/1.conda-train.sbatch#L18-L21
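The conditional flag from the linked sbatch file could be sketched as a small helper; a minimal sketch, assuming (as in the 16.pytorch-cpu-ddp example) that HyperPod nodes can be detected via a marker directory such as /opt/sagemaker_cluster — the function name auto_resume_flag is illustrative:

```shell
#!/bin/bash
# Sketch: emit --auto-resume=1 only when running on a HyperPod cluster.
# Assumption: HyperPod nodes are detected via a marker directory
# (defaulting to /opt/sagemaker_cluster, as in the linked example);
# the helper name auto_resume_flag is illustrative.
auto_resume_flag() {
    local marker_dir="${1:-/opt/sagemaker_cluster}"
    if [ -d "$marker_dir" ]; then
        echo "--auto-resume=1"
    fi
}

# The flag could then be spliced into the srun line of the generated script:
#   srun $(auto_resume_flag) --output ... python -m nemo_run.core.runners.fdl_runner ...
```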

Contributor

@KeitaW - thank you for the suggestion. Currently the sbatch file is built automatically by nemo_run, and it does not support an auto-resume argument just yet. We are working on adding that support; however, it will take some time to build and test with SageMaker HyperPod. For this iteration, adding this support would be out of scope.

Collaborator

Understood, thanks. @aroraakshit please let us know when nemo_run supports custom flags.

@olaoyea4
Contributor Author

olaoyea4 commented Feb 27, 2025

Thank you @olaoyea4 for contributing! We are migrating to a new directory structure for each test case. Could you please update this test case to follow the structure below?

  • testcase

    • model

      • slurm
      • kubernetes

Also, HyperPod cluster deployment guide is not necessary, as we already have it in https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod and https://catalog.workshops.aws/sagemaker-hyperpod/en-US.

Thanks @KeitaW. Modified the README to point to the workshop link for HyperPod deployment.

For the folder structure, I'm not sure I understand; should I have a slurm folder, a kubernetes folder, or both? In this case, this example does not cover a specific model, and it's using a HyperPod cluster based on Slurm, not EKS. What would the folder structure look like?

&& rm -rf /opt/hpcx/nccl_rdma_sharp_plugin \
&& ldconfig

RUN DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated \
Contributor

Please list the packages in alphabetical order. Easier to read.

DEBIAN_FRONTEND=noninteractive apt autoremove -y

# EFA
RUN apt-get update && \
Contributor

Please remove the update; it is already done on the first line.

FROM nvcr.io/nvidia/nemo:24.12
ENV DEBIAN_FRONTEND=noninteractive

ENV GDRCOPY_VERSION=v2.4.1
Contributor

@mhuguesaws mhuguesaws left a comment

Left comments

@KeitaW
Collaborator

KeitaW commented Mar 5, 2025

Thank you @olaoyea4 for contributing! We are migrating to a new directory structure for each test case. Could you please update this test case to follow the structure below?

  • testcase

    • model

      • slurm
      • kubernetes

Also, HyperPod cluster deployment guide is not necessary, as we already have it in https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod and https://catalog.workshops.aws/sagemaker-hyperpod/en-US.

Thanks @KeitaW. Modified the README to point to the workshop link for HyperPod deployment.

For the folder structure, I'm not sure I understand; should I have a slurm folder, a kubernetes folder, or both? In this case, this example does not cover a specific model, and it's using a HyperPod cluster based on Slurm, not EKS. What would the folder structure look like?

Kindly use https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/16.pytorch-cpu-ddp as a template. You can leave the kubernetes folder empty with a .gitkeep file.
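The suggested layout could be created with a few shell commands; a minimal sketch, where the top-level test-case name nemo-run is illustrative:

```shell
# Sketch: create the suggested test-case layout. Git does not track
# empty directories, so the kubernetes folder is kept with a .gitkeep
# placeholder. The top-level name "nemo-run" is illustrative.
mkdir -p nemo-run/slurm nemo-run/kubernetes
touch nemo-run/kubernetes/.gitkeep
ls -R nemo-run
```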

@olaoyea4
Contributor Author

olaoyea4 commented Mar 5, 2025

Thank you @olaoyea4 for contributing! We are migrating to a new directory structure for each test case. Could you please update this test case to follow the structure below?

  • testcase

    • model

      • slurm
      • kubernetes

Also, HyperPod cluster deployment guide is not necessary, as we already have it in https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod and https://catalog.workshops.aws/sagemaker-hyperpod/en-US.

Thanks @KeitaW. Modified the README to point to the workshop link for HyperPod deployment.
For the folder structure, I'm not sure I understand; should I have a slurm folder, a kubernetes folder, or both? In this case, this example does not cover a specific model, and it's using a HyperPod cluster based on Slurm, not EKS. What would the folder structure look like?

Kindly use https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/16.pytorch-cpu-ddp as a template. You can leave the kubernetes folder empty with a .gitkeep file.

Resolved! I have reorganized the directory structure as mentioned.

@KeitaW
Collaborator

KeitaW commented Mar 8, 2025

#!/bin/bash
#
# Generated by NeMo Run
# Run with: sbatch --requeue --parsable
#

# Parameters
#SBATCH --account=ubuntu
#SBATCH --exclusive
#SBATCH --gpus-per-node=8
#SBATCH --job-name=ubuntu-ubuntu.training
#SBATCH --mem=0
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --open-mode=append
#SBATCH --output=/fsx/ubuntu/.nemo_run/experiments/aws-nemo2T9LV7/aws-nemo2T9LV7_1741441363/training/sbatch_ubuntu-ubuntu.training_%j.out
#SBATCH --partition=dev
#SBATCH --time=01:00:00

set -evx

export PYTHONUNBUFFERED=1
export SLURM_UNBUFFEREDIO=1
export TORCHX_MAX_RETRIES=0

set +e

# setup

nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

export TORCH_NCCL_AVOID_RECORD_STREAMS=1
export NCCL_NVLS_ENABLE=0
export NVTE_DP_AMAX_REDUCE_INTERVAL=0
export NVTE_ASYNC_AMAX_REDUCTION=1
export NVTE_FUSED_ATTN=0
export FI_EFA_USE_HUGE_PAGE=0


# Command 1

srun --output /fsx/ubuntu/.nemo_run/experiments/aws-nemo2T9LV7/aws-nemo2T9LV7_1741441363/training/log-ubuntu-ubuntu.training_%j_${SLURM_RESTART_COUNT:-0}.out --container-image /fsx/ubuntu/aws-nemo-24-12.sqsh --container-mounts /fsx/ubuntu/megatron:/root/.cache/torch/megatron,/fsx/ubuntu/.nemo_run/experiments/aws-nemo2T9LV7/aws-nemo2T9LV7_1741441363/training:/nemo_run --container-workdir /nemo_run/code --wait=60 --kill-on-bad-exit=1 python -m nemo_run.core.runners.fdl_runner -n training /nemo_run/configs/training_fn_or_script

exitcode=$?

set -e

echo "job exited with code $exitcode"
if [ $exitcode -ne 0 ]; then
    if [ "$TORCHX_MAX_RETRIES" -gt "${SLURM_RESTART_COUNT:-0}" ]; then
        scontrol requeue "$SLURM_JOB_ID"
    fi
    exit $exitcode
fi

Here is the Slurm script that the latest nemo-run generated.

@KeitaW
Collaborator

KeitaW commented Mar 8, 2025

It turns out that the HyperPod cluster I'm running does not like:

#SBATCH --gpus-per-node=8

@KeitaW
Collaborator

KeitaW commented Mar 8, 2025

Related discussion #462 (comment)

@KeitaW
Collaborator

KeitaW commented Mar 8, 2025

Possibly a newer HP cluster does not have the issue. I will re-deploy an HP cluster to test it out.

@KeitaW
Collaborator

KeitaW commented Mar 8, 2025

(nemo-env) ubuntu@ip-10-1-0-217:~/nemo-review/3.test_cases/21.nemo-run/slurm$ tail -f /fsx/ubuntu/.nemo_run/experiments/aws-nemo2T9LV7/aws-nemo2T9LV7_1741441363/training/sbatch_ubuntu-ubuntu.training_25.out 
+ NVTE_FUSED_ATTN=0
export FI_EFA_USE_HUGE_PAGE=0
+ export FI_EFA_USE_HUGE_PAGE=0
+ FI_EFA_USE_HUGE_PAGE=0


# Command 1

srun --output /fsx/ubuntu/.nemo_run/experiments/aws-nemo2T9LV7/aws-nemo2T9LV7_1741441363/training/log-ubuntu-ubuntu.training_%j_${SLURM_RESTART_COUNT:-0}.out --container-image /fsx/ubuntu/aws-nemo-24-12.sqsh --container-mounts /fsx/ubuntu/megatron:/root/.cache/torch/megatron,/fsx/ubuntu/.nemo_run/experiments/aws-nemo2T9LV7/aws-nemo2T9LV7_1741441363/training:/nemo_run --container-workdir /nemo_run/code --wait=60 --kill-on-bad-exit=1 python -m nemo_run.core.runners.fdl_runner -n training /nemo_run/configs/training_fn_or_script
+ srun --output /fsx/ubuntu/.nemo_run/experiments/aws-nemo2T9LV7/aws-nemo2T9LV7_1741441363/training/log-ubuntu-ubuntu.training_%j_0.out --container-image /fsx/ubuntu/aws-nemo-24-12.sqsh --container-mounts /fsx/ubuntu/megatron:/root/.cache/torch/megatron,/fsx/ubuntu/.nemo_run/experiments/aws-nemo2T9LV7/aws-nemo2T9LV7_1741441363/training:/nemo_run --container-workdir /nemo_run/code --wait=60 --kill-on-bad-exit=1 python -m nemo_run.core.runners.fdl_runner -n training /nemo_run/configs/training_fn_or_script
^C
(nemo-env) ubuntu@ip-10-1-0-217:~/nemo-review/3.test_cases/21.nemo-run/slurm$ tail -f /fsx/ubuntu/.nemo_run/experiments/aws-nemo2T9LV7/aws-nemo2T9LV7_1741441363/training/log-ubuntu-ubuntu.training_25_0.out
Training epoch 11, iteration 0/999 | lr: 3.148e-05 | global_batch_size: 512 | global_step: 209 | reduced_train_loss: 10.83 | train_step_timing in s: 4.403 | consumed_samples: 107520
[NeMo I 2025-03-08 23:35:31 model_checkpoint:522] Async checkpoint save for step 209 (aws-nemo2T9LV7/checkpoints/model_name=0--val_loss=0.00-step=208-consumed_samples=107008.0-last.ckpt) finalized successfully.
Training epoch 11, iteration 1/999 | lr: 3.163e-05 | global_batch_size: 512 | global_step: 210 | reduced_train_loss: 10.83 | train_step_timing in s: 4.319 | consumed_samples: 108032
Training epoch 11, iteration 2/999 | lr: 3.178e-05 | global_batch_size: 512 | global_step: 211 | reduced_train_loss: 10.83 | train_step_timing in s: 4.278 | consumed_samples: 108544
Training epoch 11, iteration 3/999 | lr: 3.193e-05 | global_batch_size: 512 | global_step: 212 | reduced_train_loss: 10.83 | train_step_timing in s: 4.285 | consumed_samples: 109056
...

The config without gres works on HP.
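Until GRES is enabled on HyperPod, one workaround is to drop the directive from the generated script before submission; a minimal sketch, where the helper name strip_gres_directive and the scontrol probe are illustrative and not part of nemo_run:

```shell
#!/bin/bash
# Sketch: remove the --gpus-per-node directive from a generated sbatch
# script when the cluster has no GPU GRES configured. The helper name
# strip_gres_directive is illustrative, not part of nemo_run.
strip_gres_directive() {
    # $1: path to the generated sbatch script
    sed -i '/^#SBATCH --gpus-per-node/d' "$1"
}

# Only strip when Slurm reports no gpu GRES on any node, e.g.:
#   scontrol show nodes | grep -q 'Gres=gpu' || strip_gres_directive generated.sbatch
```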

@KeitaW
Collaborator

KeitaW commented Mar 10, 2025

Confirmed that GRES is not enabled by default on HP even today. Will check with the service team.

@KeitaW
Collaborator

KeitaW commented Mar 11, 2025

#570 (review) is addressed by 1925409 and 1043ab4.

@KeitaW KeitaW requested a review from mhuguesaws March 11, 2025 09:55
@KeitaW KeitaW self-assigned this Mar 11, 2025
Contributor

@mhuguesaws mhuguesaws left a comment

Left comment

Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>
@olaoyea4
Contributor Author

Left comment

The suggested change has been accepted.

@aroraakshit
Contributor

@mhuguesaws can we merge given all comments have been addressed?

Contributor

@mhuguesaws mhuguesaws left a comment

LGTM

@mhuguesaws mhuguesaws merged commit e67fc35 into awslabs:main Mar 12, 2025
@KeitaW
Collaborator

KeitaW commented Mar 13, 2025

One suggestion: we may want to refer to the specific commit e67fc35 in the upcoming blog post, as the directory structure of this repository might change in the near future.

KeitaW added a commit that referenced this pull request Feb 17, 2026
* Initial commit

* add readme

* update readme

* update readme

* update readme

* update readme and add new files

* re-organize directory

* Apply suggestions to Dockerfile from code review

Co-authored-by: Keita Watanabe <keitaw09@gmail.com>

* Update 3.test_cases/21.nemo-run/slurm/README.md

* Update 3.test_cases/21.nemo-run/slurm/README.md

* Update 3.test_cases/21.nemo-run/README.md

* Delete 3.test_cases/21.nemo-run/slurm/cluster-config-template.json

* Update 3.test_cases/21.nemo-run/slurm/README.md

* Update README.md

* Update venv.sh

* Update 3.test_cases/21.nemo-run/README.md

* update

* Update 3.test_cases/21.nemo-run/slurm/Dockerfile

Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>

* Update 3.test_cases/21.nemo-run/slurm/Dockerfile

Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>

* ignore virtual env

* update scripts

* Enable the use of NVLink SHARP (NVLS) by default.

It will be disabled if it is not supported by the hardware.
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-nvls-enable

* remove k8s subdirectory for now

* remove gres option

* move Dockerfile

* update Dockerfile

* Clean up

* Update 3.test_cases/21.nemo-run/slurm/run.py

Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>

* increment test case numbering

---------

Co-authored-by: Keita Watanabe <keitaw09@gmail.com>
Co-authored-by: Keita Watanabe <mlkeita@amazon.com>
Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>