Add optional SageMaker Training Plan support for HyperPod compute instance groups #930

newabdosheham wants to merge 1366 commits into awslabs:main
Conversation
Hi maintainers, happy to adjust naming, defaults, or structure based on your feedback. Thanks for reviewing!
Convert `instance_groups` to a list of objects to preserve the user-defined ordering, and embed the `training_plan_arn` parameter in the instance group object definition. See the HyperPod EKS implementation here for reference: https://github.com/aws-samples/awsome-distributed-training/blob/8fae1a909e91a9bc5288e0cb95006573b047f89b/1.architectures/7.sagemaker-hyperpod-eks/terraform-modules/hyperpod-eks-tf/modules/hyperpod_cluster/variables.tf#L42

Doing this should remove the need for the additional input variables (`use_training_plan`, `training_plan_instance_group_name`). And given that `training_plan_expected_instance_type` and `training_plan_expected_instance_count` are user-defined, they provide no greater protection: this is essentially asking the user to verify their own inputs twice.
Move the `instance_group_name` value into the object definition (`config.name`) of the instance group; see the HyperPod EKS implementation here for reference: https://github.com/aws-samples/awsome-distributed-training/blob/8fae1a909e91a9bc5288e0cb95006573b047f89b/1.architectures/7.sagemaker-hyperpod-eks/terraform-modules/hyperpod-eks-tf/modules/hyperpod_cluster/main.tf#L13

Make `training_plan_arn` conditional based on `config.training_plan_arn`; see the HyperPod EKS implementation here for reference:
https://github.com/aws-samples/awsome-distributed-training/blob/8fae1a909e91a9bc5288e0cb95006573b047f89b/1.architectures/7.sagemaker-hyperpod-eks/terraform-modules/hyperpod-eks-tf/modules/hyperpod_cluster/main.tf#L47
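A rough sketch of the variable shape the reviewer is suggesting (attribute names are assumptions based on the linked EKS module, not taken from this PR's diff):

```hcl
variable "instance_groups" {
  description = "Ordered list of HyperPod instance group definitions"
  type = list(object({
    name           = string
    instance_type  = string
    instance_count = number
    # Optional: attach a SageMaker Training Plan to this group only.
    # optional() attributes require Terraform 1.3+.
    training_plan_arn = optional(string, null)
  }))
}
```

In the cluster resource, the ARN can then be set per group from `each.value.training_plan_arn`, which removes the need for a separate `use_training_plan` flag.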
bluecrayon52 left a comment:
Left comments recommending changes. We want to standardize on a list of objects for `instance_groups` and embed the `name` and `training_plan_arn` into the object definition for an instance group.
Hi @newabdosheham, this PR was automatically closed by GitHub because the repository history was rewritten as part of a cleanup to reduce clone size (details in #959). Your fork's branch still references the old (pre-rewrite) history, so it can't be reopened directly. To restore your PR, please rebase your fork onto the new history:

```bash
# Update your fork's main branch
cd your-fork
git remote add upstream https://github.com/aws-samples/awsome-distributed-training.git  # if not already set
git fetch upstream
git checkout main
git reset --hard upstream/main
git push --force origin main

# Rebase your feature branch
git checkout feature/hyperpod-training-plan
git rebase main
# Resolve any conflicts if needed
git push --force origin feature/hyperpod-training-plan

# Then open a new PR from your fork
```

Apologies for the inconvenience! If you run into any issues rebasing, feel free to ask for help.
This PR adds optional support for attaching a SageMaker Training Plan to a specific HyperPod instance group (e.g. compute/workers) in the `hyperpod-slurm-tf` Terraform module. The change enables users to run HyperPod clusters under a Training Plan while preserving full backward compatibility for existing workflows.
## What’s new

Introduces optional variables to enable Training Plan usage:

- `use_training_plan`
- `training_plan_arn`
- `training_plan_instance_group_name`

Attaches `training_plan_arn` only to the configured instance group (default: `compute`). Supports custom instance group names (e.g. `workers`, `compute-nodes`).

Adds optional safety validation for instance type and instance count via:

- `training_plan_expected_instance_type`
- `training_plan_expected_instance_count`

## Behavior
Validation is implemented using a Terraform `precondition` to avoid runtime cluster failures.

## Backward compatibility
- The feature is fully optional and disabled by default
- Existing configurations continue to work without modification
- No behavior changes unless `use_training_plan = true` is explicitly set

## Implementation details
- Training Plan is injected using a conditional `merge()` in `instance_groups`
- Target group is resolved dynamically from the `var.instance_groups` map key
- Validation is implemented using a Terraform 1.2+ `lifecycle` `precondition`
- Default target group name is `compute` (matching existing examples)

## Example usage
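The original example block did not survive formatting; below is a minimal sketch of how the new variables might be set (module path and all values are illustrative only):

```hcl
module "hyperpod" {
  source = "./hyperpod-slurm-tf"

  # ... existing cluster configuration ...

  # Opt in to Training Plan support (disabled by default)
  use_training_plan                 = true
  training_plan_arn                 = "arn:aws:sagemaker:us-west-2:123456789012:training-plan/example-plan"
  training_plan_instance_group_name = "compute"

  # Optional safety checks against the plan's reserved capacity
  training_plan_expected_instance_type  = "ml.p5.48xlarge"
  training_plan_expected_instance_count = 4
}
```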
## Testing

- `terraform validate`
- `terraform plan`

Verified:
- Training Plan attached only to the target group
- Validation triggers correctly on mismatch
- No behavior change when the feature is disabled
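As a sketch of how the mismatch validation could be expressed with a `lifecycle` `precondition` (the resource type and attribute access are assumptions, not taken from the PR diff):

```hcl
resource "awscc_sagemaker_cluster" "this" {
  # ... instance group configuration ...

  lifecycle {
    precondition {
      # Only enforced when the feature is enabled; compares the target
      # group's instance type against the user-declared expectation.
      condition = (
        !var.use_training_plan ||
        var.instance_groups[var.training_plan_instance_group_name].instance_type
        == var.training_plan_expected_instance_type
      )
      error_message = "Instance type of the Training Plan target group does not match training_plan_expected_instance_type."
    }
  }
}
```

Because the check runs at plan time, a mismatch surfaces before any cluster creation is attempted.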
## Why this matters

SageMaker Training Plans are increasingly used for:

- capacity reservation
- cost optimization
- scheduling control
This change allows users to adopt Training Plans without forking the module and keeps the official AWS sample aligned with current SageMaker capabilities.