NeMo 2.0 on SageMaker HyperPod #570
Conversation
Thank you @olaoyea4 for contributing!
We are migrating to a new directory structure for each test case.
Could you please update this test case to follow this structure?
- testcase
  - slurm
    - model
  - kubernetes
    - model
Also, the HyperPod cluster deployment guide is not necessary, as we already have one in https://github.com/aws-samples/awsome-distributed-training/tree/main/1.architectures/5.sagemaker-hyperpod and https://catalog.workshops.aws/sagemaker-hyperpod/en-US.
    return parser

def slurm_executor(
I like how the training job is launched. It's more straightforward than the existing way. Can we also integrate a feature adding `--auto-resume=1` when (and only when) we are on HyperPod? cf. https://github.com/aws-samples/awsome-distributed-training/blob/27531abd134a7db726b5babb00c19e97d451d8dd/3.test_cases/16.pytorch-cpu-ddp/slurm/1.conda-train.sbatch#L18-L21
@KeitaW - thank you for the suggestion. Currently the sbatch file is generated automatically by nemo_run, and it does not support an auto-resume argument just yet. We are working on adding that support; however, it will take some time to build and test with SageMaker HyperPod. For this iteration, adding this support is out of scope.
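Until nemo_run supports custom flags, the suggested behavior can be sketched as a small shell helper. This is illustrative only: the helper name is mine, and the HyperPod marker directory `/opt/sagemaker_cluster` is an assumption borrowed from the `16.pytorch-cpu-ddp` pattern linked above; verify the path on your cluster.

```shell
# Sketch: emit --auto-resume=1 only when a HyperPod marker directory exists.
# The marker path /opt/sagemaker_cluster is an assumption; check your cluster.
hyperpod_auto_resume_flag() {
    marker_dir="${1:-/opt/sagemaker_cluster}"
    if [ -d "$marker_dir" ]; then
        printf '%s' '--auto-resume=1'
    fi
}

# The flag expands to nothing on non-HyperPod clusters, so the same
# srun line works everywhere:
AUTO_RESUME="$(hyperpod_auto_resume_flag)"
# srun ${AUTO_RESUME} ... python -m nemo_run.core.runners.fdl_runner ...
echo "auto-resume flag: '${AUTO_RESUME}'"
```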
Understood. Thanks @aroraakshit, let us know when nemo_run supports custom flags.
Thanks @KeitaW. Modified the README to point to the workshop link for HyperPod deployment. For the folder structure, I'm not sure I understand: should I have a slurm folder, a kubernetes folder, or both? This example does not cover a specific model, and it is using a HyperPod cluster based on Slurm, not EKS. What would the folder structure look like?
3.test_cases/21.nemo-run/Dockerfile
Outdated
    && rm -rf /opt/hpcx/nccl_rdma_sharp_plugin \
    && ldconfig

RUN DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated \
Please order packages alphabetically. Easier to read.
3.test_cases/21.nemo-run/Dockerfile
Outdated
    DEBIAN_FRONTEND=noninteractive apt autoremove -y

# EFA
RUN apt-get update && \
Please remove the `update`; it is already done on the first line.
3.test_cases/21.nemo-run/Dockerfile
Outdated
FROM nvcr.io/nvidia/nemo:24.12
ENV DEBIAN_FRONTEND=noninteractive

ENV GDRCOPY_VERSION=v2.4.1
Please follow the version convention used in the NCCL tests: https://github.com/aws-samples/awsome-distributed-training/blob/main/micro-benchmarks/nccl-tests/nccl-tests.Dockerfile#L5
We use the full release tag.
Kindly use https://github.com/aws-samples/awsome-distributed-training/tree/main/3.test_cases/16.pytorch-cpu-ddp as a template. You can leave
Resolved! I have reorganized the directory structure as mentioned.
#!/bin/bash
#
# Generated by NeMo Run
# Run with: sbatch --requeue --parsable
#
# Parameters
#SBATCH --account=ubuntu
#SBATCH --exclusive
#SBATCH --gpus-per-node=8
#SBATCH --job-name=ubuntu-ubuntu.training
#SBATCH --mem=0
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --open-mode=append
#SBATCH --output=/fsx/ubuntu/.nemo_run/experiments/aws-nemo2T9LV7/aws-nemo2T9LV7_1741441363/training/sbatch_ubuntu-ubuntu.training_%j.out
#SBATCH --partition=dev
#SBATCH --time=01:00:00
set -evx
export PYTHONUNBUFFERED=1
export SLURM_UNBUFFEREDIO=1
export TORCHX_MAX_RETRIES=0
set +e
# setup
nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
export TORCH_NCCL_AVOID_RECORD_STREAMS=1
export NCCL_NVLS_ENABLE=0
export NVTE_DP_AMAX_REDUCE_INTERVAL=0
export NVTE_ASYNC_AMAX_REDUCTION=1
export NVTE_FUSED_ATTN=0
export FI_EFA_USE_HUGE_PAGE=0
# Command 1
srun --output /fsx/ubuntu/.nemo_run/experiments/aws-nemo2T9LV7/aws-nemo2T9LV7_1741441363/training/log-ubuntu-ubuntu.training_%j_${SLURM_RESTART_COUNT:-0}.out --container-image /fsx/ubuntu/aws-nemo-24-12.sqsh --container-mounts /fsx/ubuntu/megatron:/root/.cache/torch/megatron,/fsx/ubuntu/.nemo_run/experiments/aws-nemo2T9LV7/aws-nemo2T9LV7_1741441363/training:/nemo_run --container-workdir /nemo_run/code --wait=60 --kill-on-bad-exit=1 python -m nemo_run.core.runners.fdl_runner -n training /nemo_run/configs/training_fn_or_script
exitcode=$?
set -e
echo "job exited with code $exitcode"
if [ $exitcode -ne 0 ]; then
if [ "$TORCHX_MAX_RETRIES" -gt "${SLURM_RESTART_COUNT:-0}" ]; then
scontrol requeue "$SLURM_JOB_ID"
fi
exit $exitcode
fi

Here is the Slurm script that the latest nemo-run generated.
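The retry handling at the tail of the generated script requeues the job only when it failed and `TORCHX_MAX_RETRIES` still exceeds `SLURM_RESTART_COUNT`. That condition can be factored into a tiny helper for testing; the function name is mine and not part of the NeMo Run output:

```shell
# Mirrors the requeue condition at the end of the generated sbatch script:
# succeed (i.e. requeue) only when the job failed and restarts remain.
# Helper name is illustrative, not part of NeMo Run.
should_requeue() {
    exitcode="$1"
    max_retries="$2"
    restart_count="${3:-0}"
    [ "$exitcode" -ne 0 ] && [ "$max_retries" -gt "$restart_count" ]
}

# In the generated script this condition guards: scontrol requeue "$SLURM_JOB_ID"
if should_requeue 1 "${TORCHX_MAX_RETRIES:-3}" "${SLURM_RESTART_COUNT:-0}"; then
    echo "would requeue"
fi
```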
It turns out that the HyperPod cluster I'm running does not like:
Related discussion #462 (comment)
Possibly, a newer HP cluster does not have this issue. I will re-deploy an HP cluster to test it out.
The config without gres works on HP.

Confirmed that GRES is not enabled by default on HP even today. Will check with the service team.
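To check whether GRES is configured on a node, `scontrol show node` prints a `Gres=` field (`Gres=(null)` when nothing is set). A small filter, shown here only as a sketch with an illustrative function name, makes that check scriptable:

```shell
# Sketch: succeed when scontrol-style node output (read from stdin)
# reports a GPU GRES entry. On a real cluster you would pipe in:
#   scontrol show node "$(hostname)" | node_reports_gpu_gres
node_reports_gpu_gres() {
    grep -Eq 'Gres=[^ ]*gpu'
}

# Demo with the string scontrol prints when GRES is not configured:
if printf 'Gres=(null)\n' | node_reports_gpu_gres; then
    echo "GRES enabled"
else
    echo "GRES not enabled"
fi
```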
It will be disabled if it is not supported by the hardware. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-nvls-enable
#570 (review)
Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>
Suggested change has been accepted.

@mhuguesaws can we merge, given all comments have been addressed?
One suggestion: we may refer to the particular commit e67fc35 in the upcoming blog post, as the directory structure of this repository might change in the near future.
* Initial commit
* add readme
* update readme
* update readme
* update readme
* update readme and add new files
* re-organize directory
* Apply suggestions to Dockerfile from code review (Co-authored-by: Keita Watanabe <keitaw09@gmail.com>)
* Update 3.test_cases/21.nemo-run/slurm/README.md
* Update 3.test_cases/21.nemo-run/slurm/README.md
* Update 3.test_cases/21.nemo-run/README.md
* Delete 3.test_cases/21.nemo-run/slurm/cluster-config-template.json
* Update 3.test_cases/21.nemo-run/slurm/README.md
* Update README.md
* Update venv.sh
* Update 3.test_cases/21.nemo-run/README.md
* update
* Update 3.test_cases/21.nemo-run/slurm/Dockerfile (Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>)
* Update 3.test_cases/21.nemo-run/slurm/Dockerfile (Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>)
* ignore virtual env
* update scripts
* Enable the use of NVLink SHARP (NVLS) by default. It will be disabled if it is not supported by the hardware. https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-nvls-enable
* remove k8s subdirectory for now
* remove gres option
* move Dockerfile
* update Dockerfile
* Clean up
* Update 3.test_cases/21.nemo-run/slurm/run.py (Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>)
* increment test case numbering

Co-authored-by: Keita Watanabe <keitaw09@gmail.com>
Co-authored-by: Keita Watanabe <mlkeita@amazon.com>
Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>
Description of changes: This change includes code showing how to run a pretraining job on a SageMaker HyperPod cluster using the NeMo 2.0 framework.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.