
Add optional SageMaker Training Plan support for HyperPod compute instance groups#930

Closed
newabdosheham wants to merge 1366 commits into awslabs:main from newabdosheham:feature/hyperpod-training-plan

Conversation

@newabdosheham

This PR adds optional support for attaching a SageMaker Training Plan to a specific HyperPod instance group (e.g. compute/workers) in the hyperpod-slurm-tf Terraform module.

The change enables users to run HyperPod clusters under a Training Plan while preserving full backward compatibility for existing workflows.


What’s new

  • Introduces optional variables to enable Training Plan usage:

    • use_training_plan

    • training_plan_arn

    • training_plan_instance_group_name

  • Attaches training_plan_arn only to the configured instance group (default: compute)

  • Supports custom instance group names (e.g. workers, compute-nodes)

  • Adds optional safety validation for instance type and instance count via:

    • training_plan_expected_instance_type

    • training_plan_expected_instance_count
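
The new inputs could be declared along these lines. This is a sketch only; the types, defaults, and descriptions shown here are illustrative assumptions, not the module's actual definitions:

```hcl
variable "use_training_plan" {
  description = "Whether to attach a SageMaker Training Plan to an instance group"
  type        = bool
  default     = false
}

variable "training_plan_arn" {
  description = "ARN of the SageMaker Training Plan to attach"
  type        = string
  default     = null
}

variable "training_plan_instance_group_name" {
  description = "Key of the instance group that receives the Training Plan"
  type        = string
  default     = "compute"
}
```

Keeping `default = false` on use_training_plan is what makes the feature fully opt-in for existing configurations.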


Behavior

| Scenario | Result |
| -- | -- |
| use_training_plan = false | No change from current behavior |
| use_training_plan = true + valid config | Training Plan attached to target group |
| ARN missing | Terraform fails fast with clear error |
| Group name not found | Terraform fails fast with clear error |
| Instance type/count mismatch (when provided) | Terraform fails fast with clear error |

Validation is implemented using Terraform precondition blocks, so misconfigurations fail fast at plan/apply time instead of causing runtime cluster failures.
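
A minimal sketch of what such preconditions might look like. The resource and variable names here are hypothetical, trimmed to the validation logic only:

```hcl
resource "null_resource" "training_plan_validation" {
  lifecycle {
    # Fail fast if the feature is enabled without an ARN.
    precondition {
      condition     = !var.use_training_plan || var.training_plan_arn != null
      error_message = "use_training_plan is true but training_plan_arn is not set."
    }
    # Fail fast if the target group name does not exist.
    precondition {
      condition     = !var.use_training_plan || contains(keys(var.instance_groups), var.training_plan_instance_group_name)
      error_message = "training_plan_instance_group_name does not match any key in instance_groups."
    }
  }
}
```

Because preconditions are evaluated during plan/apply, a bad configuration never reaches the SageMaker API.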


Backward compatibility

  • The feature is fully optional and disabled by default

  • Existing configurations continue to work without modification

  • No behavior changes unless use_training_plan = true is explicitly set


Implementation details

  • Training Plan is injected using a conditional merge() in instance_groups

  • Target group is resolved dynamically from var.instance_groups map key

  • Validation is implemented using Terraform 1.2+ lifecycle.precondition

  • Default target group name is compute (matching existing examples)
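
The conditional merge() described above might look roughly like this. The local name is hypothetical and the expression is trimmed to the relevant logic:

```hcl
locals {
  # Attach training_plan_arn only to the configured target group;
  # all other instance groups pass through unchanged.
  instance_groups_with_plan = {
    for name, group in var.instance_groups :
    name => (
      var.use_training_plan && name == var.training_plan_instance_group_name
      ? merge(group, { training_plan_arn = var.training_plan_arn })
      : group
    )
  }
}
```

When use_training_plan is false, the for-expression reduces to an identity mapping, which is what preserves backward compatibility.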


Example usage

```hcl
use_training_plan                 = true
training_plan_arn                 = "arn:aws:sagemaker:us-west-2:123456789012:training-plan/my-plan"
training_plan_instance_group_name = "compute-nodes"

training_plan_expected_instance_type  = "ml.trn1.32xlarge"
training_plan_expected_instance_count = 4
```


Testing

  • terraform validate

  • terraform plan

  • Verified:

    • Training Plan attached only to target group

    • Validation triggers correctly on mismatch

    • No behavior change when feature is disabled


Why this matters

SageMaker Training Plans are increasingly used for:

  • capacity reservation

  • cost optimization

  • scheduling control

This change allows users to adopt Training Plans without forking the module and keeps the official AWS sample aligned with current SageMaker capabilities.

KeitaW and others added 30 commits February 6, 2025 08:20
* add os grafana stack

Co-authored-by: Matthew Nightingale <nghtm@amazon.com>

* remove  sg

* update

* update

* add OS grafana README

* remove unrelated file

---------

Co-authored-by: Matthew Nightingale <nghtm@amazon.com>
Signed-off-by: Nisha <nisha.nadkarni@gmail.com>
…ion process (POSIX + IAM) (awslabs#542)

* Automating multihead cluster creation process + automating user creation process (POSIX + IAM)

* Adding Helm Chart Injector and Nested CloudFormation stacks

* adding resource prefix to studio stack

* adding hyperpod helper script

* adding modifications to nested stacks

* modified cidr format checking in helper script

* updating main template with asset bucket name map

* updated readmes to reflect changes

* adding get_yes_no validation to helper script and correcting TemplateURL to use FindInMap

* updating default accel instance type

* standardized resource naming convention

* renamed helper script to allow for standardization

* updated route table name

* updated regex of resource prefix function to limit length to 28

* remove create log group permission from lambda to avoid recreation after stack delete

* improved error handling and status reporting for deploy stack function

* added dynamic az id default lookup by region
…script (awslabs#530)

* Update the Neuron SDK to 2.21.0

* Update the Llama3-70B pretraining with the Neuron SDK 2.21

* Fix a typo

* Add --hw_backend trn1 in the convert_checkpoint command

* More update

* Update the update_neuron_sdk.sh by removing the neuron-top check

* Keep enable_update_neuron_sdk as False by default

* Update automate-eks-cluster-creation.sh (awslabs#529)

Minor bug fix

* Update according to the review comments.

* minor updates in doc

---------

Co-authored-by: Aman Shanbhag <55571601+amanshanbhag@users.noreply.github.com>
Co-authored-by: Keita Watanabe <mlkeita@amazon.com>
awslabs#562)

* Update Welcome message to clarify which HP cluster is going to be created
…wslabs#561)

* Changed docker run command for os observability stack to use IMDSv2

* Update 4.validation_and_observability/4.prometheus-grafana/cluster-observability-os-grafana.yaml

* Update 4.validation_and_observability/4.prometheus-grafana/cluster-observability-os-grafana.yaml

* Update 4.validation_and_observability/4.prometheus-grafana/cluster-observability-os-grafana.yaml

* Update 4.validation_and_observability/4.prometheus-grafana/cluster-observability-os-grafana.yaml

Co-authored-by: Keita Watanabe <mlkeita@amazon.com>

* Added MD options to LaunchTemplate

* reverting from --network host to -p 3000:3000

* Update 4.validation_and_observability/4.prometheus-grafana/cluster-observability-os-grafana.yaml

Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>

* Removing systemctl start docker because we are specifying --now in systemctl enable docker

* Update 4.validation_and_observability/4.prometheus-grafana/cluster-observability-os-grafana.yaml

Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>

---------

Co-authored-by: Keita Watanabe <mlkeita@amazon.com>
Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>
Signed-off-by: Ankur Srivastava <awsankur@amazon.com>
Signed-off-by: Ankur Srivastava <awsankur@amazon.com>
* updated instance count environment variables

* updated message for IAM execution role creation

* added check_jq function

* removed old todos

* updated order of hyperpod cluster config message

* updated hyperpod cluster stack to conditionally disable deep health checks

* put S3 endpoint into separate cfn stack

* updated helm chart injector to use kube-system namespace

* syntax fix in lambda function

* enabled passthrough of existing resource ids from tmp_env_vars to env_vars

* fixed execution role stack boolean variable and security group stack display

* bump k8s version to 1.31
* Add NCCL tuner flag to megatron-lm

* Remove torchrun from megatron

* Update 3.test_cases/1.megatron-lm/2.distributed-training.sbatch

---------

Co-authored-by: Keita Watanabe <mlkeita@amazon.com>
* Adding torchtitan sample showcasing how to pre-train Llama-3 8B and leverage torchtitan features (torch.compile, FP8 linear ops, FP8 Allgather) to accelerate pre-training

* updating directory structure

* updated README and sbatch script

* adding separate README for slurm

* Update 3.test_cases/21.torchtitan/README.md

* Update 3.test_cases/21.torchtitan/slurm/README.md

---------

Co-authored-by: Keita Watanabe <keitaw09@gmail.com>
* Update 0.NemoMegatron-aws-optimized.Dockerfile

* Update README.md

* Update 0.NemoMegatron-aws-optimized.Dockerfile
* Initial commit

* add readme

* update readme

* update readme

* update readme

* update readme and add new files

* re-organize directory

* Apply suggestions to Dockerfile from code review

Co-authored-by: Keita Watanabe <keitaw09@gmail.com>

* Update 3.test_cases/21.nemo-run/slurm/README.md

* Update 3.test_cases/21.nemo-run/slurm/README.md

* Update 3.test_cases/21.nemo-run/README.md

* Delete 3.test_cases/21.nemo-run/slurm/cluster-config-template.json

* Update 3.test_cases/21.nemo-run/slurm/README.md

* Update README.md

* Update venv.sh

* Update 3.test_cases/21.nemo-run/README.md

* update

* Update 3.test_cases/21.nemo-run/slurm/Dockerfile

Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>

* Update 3.test_cases/21.nemo-run/slurm/Dockerfile

Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>

* ignore virtual env

* update scripts

* Enable the use of NVLink SHARP (NVLS) by default.

It will be disabled if it is not supported by the hardware.
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-nvls-enable

* remove k8s subdirectory for now

* remove gres option

* move Dockerfile

* update Dockerfile

* Clean up

* Update 3.test_cases/21.nemo-run/slurm/run.py

Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>

* increment test case numbering

---------

Co-authored-by: Keita Watanabe <keitaw09@gmail.com>
Co-authored-by: Keita Watanabe <mlkeita@amazon.com>
Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>
Co-authored-by: Daisuke Miyamoto <midaisuk@gmail.com>
aravneelaws and others added 20 commits October 30, 2025 17:33
…r versions (awslabs#887)

Signed-off-by: rpovelik <rpovelik@amazon.co.uk>
Signed-off-by: Nathan Na <nzhenye@amazon.com>

Thank you for your contribution and customer obsession!
* Add OpenZFS support to SMHP Terraform modules

* Add support and validation  for different OpenZFS deployment types

* Add support and validation  for different OpenZFS deployment types
…awslabs#881)

* Custom aws-ofi-nccl support
* Cleanup LD_LIBRARY_PATH and PATH in favor of /etc/ld.so.conf.d
…#899)

* update openzfs mounting logic

* adding forceful creation of symlink for .ssh

* Added chown to the user for .ssh

* updating information for cluster user config based on openzfs present

* updated to include users other than ubuntu

* fixed relative path for shared_users.txt

* Adding xargs to strip carriage returns

* fixed the ownership of symlink .ssh dir

* Updated to include user flag for login

* Fixing race condition during file access testing for fsx lustre and openzfs
* RLVR Recipe in added post-training section

* fix env_vars

* rm verl submodule, and other revisions

* rm post-training, cleaned env_vars, download grafana dashboards

* Delete .gitmodules

* rlvr revisions

* rlvr readme quick fix

---------

Co-authored-by: Keita Watanabe <mlkeita@amazon.com>
* updates to hyperpod cluster and helm chart modules for RIG support

* updates to IAM exec role for RIG

* added booleans for autoscaling, automatic node recovery, and continuous provisioning

* changes made from testing RIG deployment

* adding karpenter role and policy for autoscaling

* converged private subnet routing and added override_vpc_config for RIG

* added SQS and Lambda VPC endpoints for RFT with RIG

* updated readme for RIG and scoped down Lambda/SQS permissions
…rt (awslabs#893)

* P6-b200: use Secrets Manager for SSH keys, remove NCCL cmd from bootstrap, update CFN to include secret and ECR

* Simplify AWS Batch P6 deployment with inline setup script

- Remove jq dependency and JSON parsing
- Auto-generate EC2 SSH key pair during CloudFormation deployment
- Store private key in Secrets Manager automatically
- Replace custom Dockerfile and bootstrap.sh with inline command in Job Definition
- Use base nccl-tests image directly from public ECR
- All setup logic now in single CloudFormation template
- Remove intermediate variables, use env vars directly

Author: yusongw@

* removed CHANGES.md

* Simplify AWS Batch P6 setup: remove jq dependency, inline container setup, manual SSH key generation

* Auto-create resource group in P6 template, simplify deployment to 3 steps

* Fix P6 deployment: use capacity reservation ID directly, add AL2023 ECS image, fix IMDSv2 and PATH issues

* Add SSH key parameter for deployment, start sshd, fix main node self-registration and worker IP passing

* Fix MNP networking: use container IP and exclude bridge interfaces

- Use hostname -i for container IP in awsvpc mode
- Set NCCL_SOCKET_IFNAME=^lo,docker,ecs to exclude bridge interfaces
- Add BatchJobRole with ecs-tasks trust for container credentials
- Simplify SSH key generation with runtime generation
- Remove debug output and set NCCL_DEBUG=WARN

* updated README.md to have P6 support

* Fix table of contents links in README

* fix: correct VPC template filename reference in README

* fix link

* delete backup file

* fix: address security scan findings

- Remove ECR repository (using public ECR image)
- Add KMS encryption with key rotation for Secrets Manager
- Convert inline IAM policies to managed policies
- Remove explicit resource names for auto-generation
- Enforce IMDSv2 on Launch Template
- Add suppression for SSH key rotation (not applicable)

* feat: update NCCL tests image to specific version for better P6 performance

Use public.ecr.aws/hpc-cloud/nccl-tests:cuda12.8.1-efa1.42.0-ofiv1.16.0-ncclv2.27.5-1-testsv2.16.4
- CUDA 12.8.1
- EFA 1.42.0
- OFI (libfabric) 1.16.0
- NCCL 2.27.5
- NCCL tests 2.16.4
Co-authored-by: Anshuman Kumar <anshumnn@amazon.com>
* dynamically set Global Batch Size

removing the static GBS to dynamically set GBS to correspond to num of nodes.

* Configure TP/PP/GBS based on node count

Updated TP/PP based on node count and adjusted global batch size calculation.
* ray dashboard integration improvement

* scrape target disclaimer
* auto-disabled igs and lcs for rig mode for better UX

* simplified logic for disabling s3_bucket module, added conditional outputs for s3_bucket module
* Improved lifecycle script for HP-EKS
- Separate bootstrap script and main script to redirect all stdout/stderr to CloudWatch Logs
- Redirect Kubelet data path in addition to containerd.
- Allow choosing volume for containerd and kubelet.

* Updated message to explain why 60 seconds

* Updating Terraform for HyperPod EKS, to upload multiple lifecycle script files to S3
@newabdosheham
Author

Hi maintainers,
This PR adds optional support for attaching a SageMaker Training Plan to a specific HyperPod instance group (e.g. compute/workers), with full backward compatibility.

Happy to adjust naming, defaults, or structure based on your feedback. Thanks for reviewing!

@bluecrayon52 bluecrayon52 self-requested a review February 3, 2026 19:51
Contributor


Convert instance_group to list of objects to preserve the user-defined ordering and embed the training_plan_arn parameter into the instance group object definition. See the HyperPod EKS implementation here for reference: https://github.com/aws-samples/awsome-distributed-training/blob/8fae1a909e91a9bc5288e0cb95006573b047f89b/1.architectures/7.sagemaker-hyperpod-eks/terraform-modules/hyperpod-eks-tf/modules/hyperpod_cluster/variables.tf#L42

Doing this should remove the need for the additional environment variables (use_training_plan, training_plan_instance_group_name). Given that the variables training_plan_expected_instance_type and training_plan_expected_instance_count are user defined, they provide no greater protection. This is essentially asking the user to verify their own inputs twice.
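
For illustration, the suggested list-of-objects shape might look like this. The attribute names are a sketch (and optional() with a default requires Terraform 1.3+); see the linked EKS module for the authoritative definition:

```hcl
variable "instance_groups" {
  type = list(object({
    name              = string
    instance_type     = string
    instance_count    = number
    # Per-group Training Plan attachment; null means no plan.
    training_plan_arn = optional(string, null)
  }))
}
```

Embedding training_plan_arn in each group object makes the separate toggle and group-name variables unnecessary, since a non-null ARN on a group is itself the opt-in signal.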

Contributor

@bluecrayon52 bluecrayon52 left a comment


Left comments recommending changes. We want to standardize on a list of objects for instance_groups and embed the name and training_plan_arn into the object definition for an instance group.

@KeitaW
Collaborator

KeitaW commented Feb 17, 2026

Hi @newabdosheham — this PR was automatically closed by GitHub because the repository history was rewritten as part of a cleanup to reduce clone size (details in #959).

Your fork's branch still references the old (pre-rewrite) history, so it can't be reopened directly. To restore your PR, please rebase your fork onto the new main:

```bash
# Update your fork's main branch
cd your-fork
git remote add upstream https://github.com/aws-samples/awsome-distributed-training.git  # if not already set
git fetch upstream
git checkout main
git reset --hard upstream/main
git push --force origin main

# Rebase your feature branch
git checkout feature/hyperpod-training-plan
git rebase main
# Resolve any conflicts if needed
git push --force origin feature/hyperpod-training-plan

# Then open a new PR from your fork
```
Apologies for the inconvenience! If you run into any issues rebasing, feel free to ask for help.
