Terraform Mods for RIG support #902

Merged
nghtm merged 8 commits into main from rig-tf-updates on Nov 21, 2025

@bluecrayon52
Contributor

Issue #, if available:

Description of changes:
Added conditional boolean variables that, based on the restricted instance group (RIG) configuration, modify the Helm chart installations, modify the CoreDNS and VPC CNI plugins, extend the execution role permissions, and add Lambda and SQS VPC endpoints.
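
For readers unfamiliar with the pattern, here is a minimal sketch of how a boolean derived from the RIG configuration could gate optional resources such as the Lambda and SQS VPC endpoints. It is not the code from this PR: the `rig_mode` local, the `rig` resource label, and the `vpc_id`, `private_subnet_ids`, and `endpoint_security_group_id` variables are hypothetical names chosen for illustration, and `restricted_instance_groups` is assumed to be a map.

```hcl
# Hypothetical sketch; names are illustrative and not taken from the PR.
variable "aws_region" {
  type = string
}

variable "vpc_id" {
  type = string
}

variable "private_subnet_ids" {
  type = list(string)
}

variable "endpoint_security_group_id" {
  type = string
}

# Assumed to be a map keyed by group name; empty means "no RIG".
variable "restricted_instance_groups" {
  type    = map(any)
  default = {}
}

locals {
  # True when at least one restricted instance group is configured.
  rig_mode = length(var.restricted_instance_groups) > 0

  # Interface endpoints that are only created in RIG mode.
  rig_endpoint_services = local.rig_mode ? toset(["lambda", "sqs"]) : toset([])
}

# One interface endpoint per service; nothing is created when rig_mode is false.
resource "aws_vpc_endpoint" "rig" {
  for_each = local.rig_endpoint_services

  vpc_id              = var.vpc_id
  service_name        = "com.amazonaws.${var.aws_region}.${each.key}"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [var.endpoint_security_group_id]
  private_dns_enabled = true
}
```

Deriving the flag from the length of the map keeps the tfvars file as the single source of truth. An explicit boolean variable would work equally well if the module needs to toggle the behavior independently.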

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Contributor

@nghtm nghtm left a comment

Tried to deploy with the updates, but it's not clear which custom.tfvars values need to be set for a RIG deployment. Can we update the README with an example to test?

Looking at:
https://registry.terraform.io/providers/hashicorp/awscc/latest/docs/resources/sagemaker_cluster#nestedatt--instance_groups:~:text=restricted_instance_groups%20(Attributes%20List)%20The%20restricted%20instance%20groups%20of%20the%20SageMaker%20HyperPod%20cluster.%20(see%20below%20for%20nested%20schema)

I believe it should be:

cat > custom.tfvars << EOL
kubernetes_version    = "1.32"
eks_cluster_name      = "my-eks-cluster"
hyperpod_cluster_name = "my-hp-rig-cluster"
resource_name_prefix  = "hp-eks-rig-test"
aws_region            = "us-east-1"
availability_zone_id  = "use1-az6"
restricted_instance_groups = {
  accelerated-instance-group-1 = {
    instance_type                    = "ml.g5.12xlarge"
    instance_count                   = 1
    ebs_volume_size_in_gb            = 100
    threads_per_core                 = 2
    enable_stress_check              = false
    enable_connectivity_check        = false
    lifecycle_script                 = "on_create.sh"
    fsxl_per_unit_storage_throughput = 250
    fsxl_size_in_gi_b                = 2400
  }
}
EOL
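
For the tfvars above to be accepted, the module would need a matching variable declaration. The following is a hypothetical sketch of what such a declaration might look like. The attribute names simply mirror the example above, and the module's real definition (including whether the variable is a map or a list) may differ.

```hcl
# Hypothetical declaration; attribute names mirror the example tfvars,
# but the module's real variable definition may differ.
variable "restricted_instance_groups" {
  description = "Restricted instance groups (RIG), keyed by group name."

  type = map(object({
    instance_type                    = string
    instance_count                   = number
    ebs_volume_size_in_gb            = number
    threads_per_core                 = number
    enable_stress_check              = bool
    enable_connectivity_check        = bool
    lifecycle_script                 = string
    fsxl_per_unit_storage_throughput = number
    fsxl_size_in_gi_b                = number
  }))

  default = {}

  # Reject obviously invalid counts at plan time.
  validation {
    condition     = alltrue([for g in values(var.restricted_instance_groups) : g.instance_count >= 0])
    error_message = "instance_count must be zero or greater for every restricted instance group."
  }
}
```

With a declaration along these lines, a misspelled or missing attribute in custom.tfvars would surface as a type error during `terraform plan` rather than at apply time.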

@bluecrayon52 bluecrayon52 requested a review from nghtm November 21, 2025 16:55
@bluecrayon52
Contributor Author

@nghtm I've updated the README.md with a detailed RIG section. There is an example rig_custom.tfvars file that can be used as a reference.

@nghtm nghtm merged commit 8d33454 into main Nov 21, 2025
4 checks passed
@nghtm nghtm deleted the rig-tf-updates branch November 21, 2025 20:23
KeitaW pushed a commit that referenced this pull request Feb 17, 2026
* updates to hyperpod cluster and helm chart modules for RIG support

* updates to IAM exec role for RIG

* added booleans for autoscaling, automatic node recovery, and continuous provisioning

* changes made from testing RIG deployment

* adding karpenter role and policy for autoscaling

* converged private subnet routing and added override_vpc_config for RIG

* added SQS and Lambda VPC endpoints for RFT with RIG

* updated readme for RIG and scoped down Lambda/SQS permissions
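
As an illustration of the last two items, scoping Lambda and SQS permissions down in Terraform typically means replacing wildcard resources with ARN patterns and attaching the policy only when RIG is enabled. The sketch below is an assumption-laden example, not the policy from this PR: the `rig_mode` and `execution_role_name` variables, the prefix-based ARN patterns, and the specific actions listed are all hypothetical.

```hcl
# Hypothetical sketch: names, actions, and ARN patterns are illustrative only.
variable "rig_mode" {
  type    = bool
  default = false
}

variable "aws_region" {
  type = string
}

variable "resource_name_prefix" {
  type = string
}

variable "execution_role_name" {
  type = string
}

data "aws_caller_identity" "current" {}

locals {
  account_id = data.aws_caller_identity.current.account_id
}

# Policy document limited to resources whose names start with the prefix.
data "aws_iam_policy_document" "rig_lambda_sqs" {
  count = var.rig_mode ? 1 : 0

  statement {
    sid     = "InvokePrefixedLambdas"
    actions = ["lambda:InvokeFunction"]
    resources = [
      "arn:aws:lambda:${var.aws_region}:${local.account_id}:function:${var.resource_name_prefix}-*",
    ]
  }

  statement {
    sid = "UsePrefixedQueues"
    actions = [
      "sqs:SendMessage",
      "sqs:ReceiveMessage",
      "sqs:DeleteMessage",
      "sqs:GetQueueAttributes",
    ]
    resources = [
      "arn:aws:sqs:${var.aws_region}:${local.account_id}:${var.resource_name_prefix}-*",
    ]
  }
}

# Attach the inline policy to the execution role only in RIG mode.
resource "aws_iam_role_policy" "rig_lambda_sqs" {
  count = var.rig_mode ? 1 : 0

  name   = "${var.resource_name_prefix}-rig-lambda-sqs"
  role   = var.execution_role_name
  policy = data.aws_iam_policy_document.rig_lambda_sqs[0].json
}
```
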