
Add optional SageMaker Training Plan support for HyperPod compute instance groups#930

Closed
newabdosheham wants to merge 1366 commits into awslabs:main from newabdosheham:feature/hyperpod-training-plan

Conversation

@newabdosheham

This PR adds optional support for attaching a SageMaker Training Plan to a specific HyperPod instance group (e.g. compute/workers) in the hyperpod-slurm-tf Terraform module.

The change enables users to run HyperPod clusters under a Training Plan while preserving full backward compatibility for existing workflows.


What’s new

  • Introduces optional variables to enable Training Plan usage:

    • use_training_plan

    • training_plan_arn

    • training_plan_instance_group_name

  • Attaches training_plan_arn only to the configured instance group (default: compute)

  • Supports custom instance group names (e.g. workers, compute-nodes)

  • Adds optional safety validation for instance type and instance count via:

    • training_plan_expected_instance_type

    • training_plan_expected_instance_count
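
The new inputs could be declared along these lines. This is a sketch only; the types, defaults, and descriptions shown here are illustrative assumptions, not the module's actual definitions:

```hcl
variable "use_training_plan" {
  description = "Whether to attach a SageMaker Training Plan to an instance group"
  type        = bool
  default     = false
}

variable "training_plan_arn" {
  description = "ARN of the SageMaker Training Plan to attach"
  type        = string
  default     = null
}

variable "training_plan_instance_group_name" {
  description = "Key of the instance group that receives the Training Plan"
  type        = string
  default     = "compute"
}
```

Keeping `default = false` on use_training_plan is what makes the feature fully opt-in for existing configurations.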


Behavior

| Scenario | Result |
| -- | -- |
| use_training_plan = false | No change from current behavior |
| use_training_plan = true + valid config | Training Plan attached to target group |
| ARN missing | Terraform fails fast with clear error |
| Group name not found | Terraform fails fast with clear error |
| Instance type/count mismatch (when provided) | Terraform fails fast with clear error |

Validation is implemented using Terraform precondition blocks, so misconfigurations fail fast at plan/apply time instead of causing runtime cluster failures.
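
A minimal sketch of what such preconditions might look like. The resource and variable names here are hypothetical, trimmed to the validation logic only:

```hcl
resource "null_resource" "training_plan_validation" {
  lifecycle {
    # Fail fast if the feature is enabled without an ARN.
    precondition {
      condition     = !var.use_training_plan || var.training_plan_arn != null
      error_message = "use_training_plan is true but training_plan_arn is not set."
    }
    # Fail fast if the target group name does not exist.
    precondition {
      condition     = !var.use_training_plan || contains(keys(var.instance_groups), var.training_plan_instance_group_name)
      error_message = "training_plan_instance_group_name does not match any key in instance_groups."
    }
  }
}
```

Because preconditions are evaluated during plan/apply, a bad configuration never reaches the SageMaker API.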


Backward compatibility

  • The feature is fully optional and disabled by default

  • Existing configurations continue to work without modification

  • No behavior changes unless use_training_plan = true is explicitly set


Implementation details

  • Training Plan is injected using a conditional merge() in instance_groups

  • Target group is resolved dynamically from var.instance_groups map key

  • Validation is implemented using Terraform 1.2+ lifecycle.precondition

  • Default target group name is compute (matching existing examples)
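
The conditional merge() described above might look roughly like this. The local name is hypothetical and the expression is trimmed to the relevant logic:

```hcl
locals {
  # Attach training_plan_arn only to the configured target group;
  # all other instance groups pass through unchanged.
  instance_groups_with_plan = {
    for name, group in var.instance_groups :
    name => (
      var.use_training_plan && name == var.training_plan_instance_group_name
      ? merge(group, { training_plan_arn = var.training_plan_arn })
      : group
    )
  }
}
```

When use_training_plan is false, the for-expression reduces to an identity mapping, which is what preserves backward compatibility.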


Example usage

```hcl
use_training_plan                 = true
training_plan_arn                 = "arn:aws:sagemaker:us-west-2:123456789012:training-plan/my-plan"
training_plan_instance_group_name = "compute-nodes"

training_plan_expected_instance_type  = "ml.trn1.32xlarge"
training_plan_expected_instance_count = 4
```


Testing

  • terraform validate

  • terraform plan

  • Verified:

    • Training Plan attached only to target group

    • Validation triggers correctly on mismatch

    • No behavior change when feature is disabled


Why this matters

SageMaker Training Plans are increasingly used for:

  • capacity reservation

  • cost optimization

  • scheduling control

This change allows users to adopt Training Plans without forking the module and keeps the official AWS sample aligned with current SageMaker capabilities.

KeitaW and others added 30 commits February 6, 2025 08:20
* add os grafana stack

Co-authored-by: Matthew Nightingale <nghtm@amazon.com>

* remove  sg

* update

* update

* add OS grafana README

* remove unrelated file

---------

Co-authored-by: Matthew Nightingale <nghtm@amazon.com>
Signed-off-by: Nisha <nisha.nadkarni@gmail.com>
…ion process (POSIX + IAM) (awslabs#542)

* Automating multihead cluster creation process + automating user creation process (POSIX + IAM)

* Adding Helm Chart Injector and Nested CloudFormation stacks

* adding resource prefix to studio stack

* adding hyperpod helper script

* adding modifications to nested stacks

* modified cidr format checking in helper script

* updating main template with asset bucket name map

* updated readmes to reflect changes

* adding get_yes_no validation to helper script and correcting TemplateURL to use FindInMap

* updating default accel instance type

* standardized resource naming convention

* renamed helper script to allow for standardization

* updated route table name

* updated regex of resource prefix function to limit length to 28

* remove create log group permission from lambda to avoid recreation after stack delete

* improved error handling and status reporting for deploy stack function

* added dynamic az id default lookup by region
…script (awslabs#530)

* Update the Neuron SDK to 2.21.0

* Update the Llama3-70B pretraining with the Neuron SDK 2.21

* Fix a typo

* Add --hw_backend trn1 in the convert_checkpoint command

* More update

* Update the update_neuron_sdk.sh by removing the neuron-top check

* Keep enable_update_neuron_sdk as False by default

* Update automate-eks-cluster-creation.sh (awslabs#529)

Minor bug fix

* Update according to the review comments.

* minor updates in doc

---------

Co-authored-by: Aman Shanbhag <55571601+amanshanbhag@users.noreply.github.com>
Co-authored-by: Keita Watanabe <mlkeita@amazon.com>
awslabs#562)

* Update Welcome message to clarify which HP cluster is going to be created
…wslabs#561)

* Changed docker run command for os observability stack to use IMDSv2

* Update 4.validation_and_observability/4.prometheus-grafana/cluster-observability-os-grafana.yaml

* Update 4.validation_and_observability/4.prometheus-grafana/cluster-observability-os-grafana.yaml

* Update 4.validation_and_observability/4.prometheus-grafana/cluster-observability-os-grafana.yaml

* Update 4.validation_and_observability/4.prometheus-grafana/cluster-observability-os-grafana.yaml

Co-authored-by: Keita Watanabe <mlkeita@amazon.com>

* Added MD options to LaunchTemplate

* reverting from --network host to -p 3000:3000

* Update 4.validation_and_observability/4.prometheus-grafana/cluster-observability-os-grafana.yaml

Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>

* Removing systemctl start docker because we are specifying --now in systemctl enable docker

* Update 4.validation_and_observability/4.prometheus-grafana/cluster-observability-os-grafana.yaml

Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>

---------

Co-authored-by: Keita Watanabe <mlkeita@amazon.com>
Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>
Signed-off-by: Ankur Srivastava <awsankur@amazon.com>
Signed-off-by: Ankur Srivastava <awsankur@amazon.com>
* updated instance count environment variables

* updated message for IAM execution role creation

* added check_jq function

* removed old todos

* updated order of hyperpod cluster config message

* updated hyperpod cluster stack to conditionally disable deep health checks

* put S3 endpoint into separate cfn stack

* updated helm chart injector to use kube-system namespace

* syntax fix in lambda function

* enabled passthrough of existing resource ids from tmp_env_vars to env_vars

* fixed execution role stack boolean variable and security group stack display

* bump k8s version to 1.31
* Add NCCL tuner flag to megatron-lm

* Remove torchrun from megatron

* Update 3.test_cases/1.megatron-lm/2.distributed-training.sbatch

---------

Co-authored-by: Keita Watanabe <mlkeita@amazon.com>
* Adding torchtitan sample showcasing how to pre-train Llama-3 8B and leverage torchtitan features (torch.compile, FP8 linear ops, FP8 Allgather) to accelerate pre-training

* updating directory structure

* updated README and sbatch script

* adding separate README for slurm

* Update 3.test_cases/21.torchtitan/README.md

* Update 3.test_cases/21.torchtitan/slurm/README.md

---------

Co-authored-by: Keita Watanabe <keitaw09@gmail.com>
* Update 0.NemoMegatron-aws-optimized.Dockerfile

* Update README.md

* Update 0.NemoMegatron-aws-optimized.Dockerfile
* Initial commit

* add readme

* update readme

* update readme

* update readme

* update readme and add new files

* re-organize directory

* Apply suggestions to Dockerfile from code review

Co-authored-by: Keita Watanabe <keitaw09@gmail.com>

* Update 3.test_cases/21.nemo-run/slurm/README.md

* Update 3.test_cases/21.nemo-run/slurm/README.md

* Update 3.test_cases/21.nemo-run/README.md

* Delete 3.test_cases/21.nemo-run/slurm/cluster-config-template.json

* Update 3.test_cases/21.nemo-run/slurm/README.md

* Update README.md

* Update venv.sh

* Update 3.test_cases/21.nemo-run/README.md

* update

* Update 3.test_cases/21.nemo-run/slurm/Dockerfile

Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>

* Update 3.test_cases/21.nemo-run/slurm/Dockerfile

Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>

* ignore virtual env

* update scripts

* Enable the use of NVLink SHARP (NVLS) by default.

It will be disabled if it is not supported by the hardware.
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-nvls-enable

* remove k8s subdirectory for now

* remove gres option

* move Dockerfile

* update Dockerfile

* Clean up

* Update 3.test_cases/21.nemo-run/slurm/run.py

Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>

* increment test case numbering

---------

Co-authored-by: Keita Watanabe <keitaw09@gmail.com>
Co-authored-by: Keita Watanabe <mlkeita@amazon.com>
Co-authored-by: mhuguesaws <71357145+mhuguesaws@users.noreply.github.com>
Co-authored-by: Daisuke Miyamoto <midaisuk@gmail.com>
aravneelaws and others added 20 commits October 30, 2025 17:33
…r versions (awslabs#887)

Signed-off-by: rpovelik <rpovelik@amazon.co.uk>
Signed-off-by: Nathan Na <nzhenye@amazon.com>

Thank you for your contribution and customer obsession!
* Add OpenZFS support to SMHP Terraform modules

* Add support and validation  for different OpenZFS deployment types

* Add support and validation  for different OpenZFS deployment types
…awslabs#881)

* Custom aws-ofi-nccl support
* Cleanup LD_LIBRARY_PATH and PATH in favor of /etc/ld.so.conf.d
…#899)

* update openzfs mounting logic

* adding forceful creation of symlink for .ssh

* Added chown to the user for .ssh

* updating information for cluster user config based on openzfs present

* updated to include users other than ubuntu

* fixed relative path for shared_users.txt

* Adding xargs to strip carriage returns

* fixed the ownership of symlink .ssh dir

* Updated to include user flag for login

* Fixing race condition during file access testing for fsx lustre and openzfs
* RLVR Recipe in added post-training section

* fix env_vars

* rm verl submodule, and other revisions

* rm post-training, cleaned env_vars, download grafana dashboards

* Delete .gitmodules

* rlvr revisions

* rlvr readme quick fix

---------

Co-authored-by: Keita Watanabe <mlkeita@amazon.com>
* updates to hyperpod cluster and helm chart modules for RIG support

* updates to IAM exec role for RIG

* added booleans for autoscaling, automatic node recovery, and continuous provisioning

* changes made from testing RIG deployment

* adding karpenter role and policy for autoscaling

* converged private subnet routing and added override_vpc_config for RIG

* added SQS and Lambda VPC endpoints for RFT with RIG

* updated readme for RIG and scoped down Lambda/SQS permissions
…rt (awslabs#893)

* P6-b200: use Secrets Manager for SSH keys, remove NCCL cmd from bootstrap, update CFN to include secret and ECR

* Simplify AWS Batch P6 deployment with inline setup script

- Remove jq dependency and JSON parsing
- Auto-generate EC2 SSH key pair during CloudFormation deployment
- Store private key in Secrets Manager automatically
- Replace custom Dockerfile and bootstrap.sh with inline command in Job Definition
- Use base nccl-tests image directly from public ECR
- All setup logic now in single CloudFormation template
- Remove intermediate variables, use env vars directly

Author: yusongw@

* removed CHANGES.md

* Simplify AWS Batch P6 setup: remove jq dependency, inline container setup, manual SSH key generation

* Auto-create resource group in P6 template, simplify deployment to 3 steps

* Fix P6 deployment: use capacity reservation ID directly, add AL2023 ECS image, fix IMDSv2 and PATH issues

* Add SSH key parameter for deployment, start sshd, fix main node self-registration and worker IP passing

* Fix MNP networking: use container IP and exclude bridge interfaces

- Use hostname -i for container IP in awsvpc mode
- Set NCCL_SOCKET_IFNAME=^lo,docker,ecs to exclude bridge interfaces
- Add BatchJobRole with ecs-tasks trust for container credentials
- Simplify SSH key generation with runtime generation
- Remove debug output and set NCCL_DEBUG=WARN

* updated README.md to have P6 support

* Fix table of contents links in README

* fix: correct VPC template filename reference in README

* fix link

* delete backup file

* fix: address security scan findings

- Remove ECR repository (using public ECR image)
- Add KMS encryption with key rotation for Secrets Manager
- Convert inline IAM policies to managed policies
- Remove explicit resource names for auto-generation
- Enforce IMDSv2 on Launch Template
- Add suppression for SSH key rotation (not applicable)

* feat: update NCCL tests image to specific version for better P6 performance

Use public.ecr.aws/hpc-cloud/nccl-tests:cuda12.8.1-efa1.42.0-ofiv1.16.0-ncclv2.27.5-1-testsv2.16.4
- CUDA 12.8.1
- EFA 1.42.0
- OFI (libfabric) 1.16.0
- NCCL 2.27.5
- NCCL tests 2.16.4
Co-authored-by: Anshuman Kumar <anshumnn@amazon.com>
* dynamically set Global Batch Size

removing the static GBS to dynamically set GBS to correspond to num of nodes.

* Configure TP/PP/GBS based on node count

Updated TP/PP based on node count and adjusted global batch size calculation.
* ray dashboard integration improvement

* scrape target disclaimer
* auto-disabled igs and lcs for rig mode for better UX

* simplified logic for disabling s3_bucket module, added conditional outputs for s3_bucket module
* Improved lifecycle script for HP-EKS
- Separate bootstrap script and main script to redirect all stdout/stderr to CloudWatch Logs
- Redirect Kubelet data path in addition to containerd.
- Allow choosing volume for containerd and kubelet.

* Updated message to explain why 60 seconds

* Updating Terraform for HyperPod EKS, to upload multiple lifecycle script files to S3
@newabdosheham
Author

Hi maintainers,
This PR adds optional support for attaching a SageMaker Training Plan to a specific HyperPod instance group (e.g. compute/workers), with full backward compatibility.

Happy to adjust naming, defaults, or structure based on your feedback. Thanks for reviewing!

@bluecrayon52 bluecrayon52 self-requested a review February 3, 2026 19:51
Contributor


Convert instance_group to list of objects to preserve the user-defined ordering and embed the training_plan_arn parameter into the instance group object definition. See the HyperPod EKS implementation here for reference: https://github.com/aws-samples/awsome-distributed-training/blob/8fae1a909e91a9bc5288e0cb95006573b047f89b/1.architectures/7.sagemaker-hyperpod-eks/terraform-modules/hyperpod-eks-tf/modules/hyperpod_cluster/variables.tf#L42

Doing this should remove the need for the additional environment variables (use_training_plan, training_plan_instance_group_name). Given that the variables training_plan_expected_instance_type and training_plan_expected_instance_count are user defined, they provide no greater protection. This is essentially asking the user to verify their own inputs twice.
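
For illustration, the suggested list-of-objects shape might look like this. The attribute names are a sketch (and optional() with a default requires Terraform 1.3+); see the linked EKS module for the authoritative definition:

```hcl
variable "instance_groups" {
  type = list(object({
    name              = string
    instance_type     = string
    instance_count    = number
    # Per-group Training Plan attachment; null means no plan.
    training_plan_arn = optional(string, null)
  }))
}
```

Embedding training_plan_arn in each group object makes the separate toggle and group-name variables unnecessary, since a non-null ARN on a group is itself the opt-in signal.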

Contributor

@bluecrayon52 bluecrayon52 left a comment


Left comments recommending changes. We want to standardize on a list of objects for instance_groups and embed the name and training_plan_arn into the object definition for an instance group.

@KeitaW
Collaborator

KeitaW commented Feb 17, 2026

Hi @newabdosheham — this PR was automatically closed by GitHub because the repository history was rewritten as part of a cleanup to reduce clone size (details in #959).

Your fork's branch still references the old (pre-rewrite) history, so it can't be reopened directly. To restore your PR, please rebase your fork onto the new main:

```bash
# Update your fork's main branch
cd your-fork
git remote add upstream https://github.com/aws-samples/awsome-distributed-training.git  # if not already set
git fetch upstream
git checkout main
git reset --hard upstream/main
git push --force origin main

# Rebase your feature branch
git checkout feature/hyperpod-training-plan
git rebase main
# Resolve any conflicts if needed
git push --force origin feature/hyperpod-training-plan

# Then open a new PR from your fork
```
Apologies for the inconvenience! If you run into any issues rebasing, feel free to ask for help.
