
Add NeMo RL GRPO training with fault tolerance (NVRx) on EKS#1010

Open
dmvevents wants to merge 1 commit into awslabs:main from dmvevents:add-nemo-rl-grpo-example

Conversation

@dmvevents

Summary

  • Add a new test case for NVIDIA NeMo RL GRPO (Group Relative Policy Optimization) training on Amazon EKS with fault tolerance via NVIDIA Resiliency Extension (NVRx)
  • Includes a multi-arch Dockerfile (g5 A10G, g6e L40S, p5 H100), Kubernetes RayJob manifests, training scripts, and a workshop walkthrough
  • Demonstrates resilient LLM fine-tuning with automatic checkpoint recovery on GPU failure

Details

This example shows how to run reinforcement learning-based LLM alignment using NeMo RL's GRPO algorithm on EKS GPU clusters with:

  • KubeRay for declarative Ray cluster management
  • NVIDIA NVRx (ft_launcher, heartbeat monitoring, straggler detection) for process-level fault tolerance
  • Amazon FSx for Lustre for shared checkpoint storage and model caching
  • EFA networking via aws-ofi-nccl for low-latency GPU communication
  • LoRA for parameter-efficient training on smaller GPUs (A10G 24GB)

What's included

File                                    Purpose
README.md                               Overview, prerequisites, step-by-step instructions
Dockerfile                              Multi-arch container image (g5/g6e/p5)
kubernetes/rayjob.yaml                  RayJob manifest for EKS deployment
kubernetes/dataset-download-job.yaml    Pre-cache dataset and model to FSx
scripts/rayjob_entrypoint.sh            Training entrypoint
scripts/run_grpo_nvrx.py                NVRx wrapper (heartbeat + health check)
scripts/evaluate_before_after.py        Before/after training evaluation
patches/patch_nvrx_features.py          Runtime NVRx feature patches

Tested on

  • 2x g5.8xlarge (1x A10G each) with EKS 1.31, KubeRay 1.3.0
  • Qwen2.5-1.5B-Instruct, 20 GRPO steps, ~15 min end-to-end
  • Fault injection (kill -9) with automatic recovery from FSx checkpoint

Test plan

  • Deploy on fresh EKS cluster with 2x g5.8xlarge
  • Run dataset download job to completion
  • Deploy RayJob and verify 20/20 training steps complete
  • Inject fault at step 6+ and verify checkpoint recovery
  • Run evaluation script and verify improvement (3/6 -> 6/6)
  • Build Docker image from Dockerfile on CPU-only instance

New test case for NVIDIA NeMo RL GRPO training on Amazon EKS with:
- NVIDIA Resiliency Extension (NVRx) for fault tolerance
- Multi-arch Dockerfile (g5 A10G, g6e L40S, p5 H100)
- KubeRay RayJob manifests with FSx Lustre shared storage
- Training scripts with heartbeat monitoring and straggler detection
- Before/after evaluation demonstrating math reasoning improvement

Tested on 2x g5.8xlarge with Qwen2.5-1.5B-Instruct, 20 GRPO steps.

@KeitaW KeitaW left a comment


Review Batch 1/3 — Structure & Repository Hygiene

This PR adds a comprehensive new test case for NVIDIA NeMo RL GRPO training with fault tolerance (NVRx) on Amazon EKS. The contribution is well-structured and clearly written, though there are several repo-convention issues that I'd suggest addressing before merging.

Missing copyright and license headers on all files

None of the 8 files in this PR include the required copyright header:

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

I'd suggest adding it to the top of each file — for the Dockerfile, before the first comment block; for Python files, before the docstring; for YAML files, as the first line(s); for the shell script, after the shebang line.

Reference: CLAUDE.md conventions

Comment on lines +22 to +23
# Build (multi-arch, ~40 min on c5.4xlarge):
# docker build -f docker/Dockerfile.workshop -t nemo-rl-workshop .

Dockerfile header references wrong file path

This looks like a leftover from when the file was at a different path in the upstream NeMo RL repo. The actual file is 3.test_cases/pytorch/nemo-rl/Dockerfile, not docker/Dockerfile.workshop.

I'd suggest updating these comments to match the actual location in this repo:

Suggested change
# Build (multi-arch, ~40 min on c5.4xlarge):
# docker build -f docker/Dockerfile.workshop -t nemo-rl-workshop .
# Build (multi-arch, ~40 min on c5.4xlarge):
# docker build -t nemo-rl-workshop .

Comment on lines +182 to +184
RUN cd /tmp && \
curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz \
| tar xz && \

EFA installer uses unpinned latest URL

The EFA installer is fetched via aws-efa-installer-latest.tar.gz, which resolves to whatever the current version is at build time. This violates the repo convention that all external dependencies must be pinned to a specific version.

I'd suggest pinning to a specific version, consistent with how OFI_NCCL_VERSION is already pinned:

Suggested change
RUN cd /tmp && \
curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz \
| tar xz && \
RUN cd /tmp && \
curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-1.38.0.tar.gz \
| tar xz && \

Alternatively, add an ARG EFA_INSTALLER_VERSION=1.38.0 alongside the other version ARGs and use it here.

Reference: CI version check enforces EFA >= 1.47.0.
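
A sketch of the ARG-based variant, assuming an `ARG` block already exists near the other version pins (the version number shown is illustrative and has not been checked against the >= 1.47.0 CI gate):

```dockerfile
# Illustrative only: pin the EFA installer via a build ARG alongside
# the other version ARGs, so one line controls the version.
ARG EFA_INSTALLER_VERSION=1.38.0
RUN cd /tmp && \
    curl -sL "https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz" \
    | tar xz
```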


@KeitaW KeitaW left a comment


Review Batch 2/3 — Deployment Pipeline & Infrastructure

@@ -0,0 +1,89 @@
#!/bin/bash
set -e

Shell script uses set -e instead of set -euo pipefail

The repo convention requires shell scripts to start with set -euo pipefail (or set -ex at minimum). Using only set -e means unset variable references won't be caught (-u) and failures in piped commands won't propagate (-o pipefail).

Suggested change
set -e
set -exuo pipefail

# 3. NVRx scripts on FSx: /shared/nvrx-demo/{patches,scripts}/
# 4. NVIDIA + EFA device plugins on g5 nodes
#
# Deploy: kubectl apply -f g5-rayjob-qwen-nvrx.yaml

rayjob.yaml comment references wrong filename

The deploy comment references g5-rayjob-qwen-nvrx.yaml but the actual filename is rayjob.yaml.

Suggested change
# Deploy: kubectl apply -f g5-rayjob-qwen-nvrx.yaml
# Deploy: kubectl apply -f kubernetes/rayjob.yaml

Comment on lines +85 to +86
- name: NCCL_SOCKET_IFNAME
value: "ens5"

NCCL_SOCKET_IFNAME uses positive interface selection

The repo convention requires exclusion-based patterns (^lo for K8s, ^docker,lo,veth for Dockerfiles), not positive selection like ens5. While the comment explains the g5 dual-ENI rationale, positive selection breaks portability on other instance types.

I noticed the Dockerfile correctly uses ^lo,docker,veth,eni (line 218). I'd suggest using an exclusion pattern here too, since hostNetwork: true means container interfaces match the host:

Suggested change
- name: NCCL_SOCKET_IFNAME
value: "ens5"
- name: NCCL_SOCKET_IFNAME
value: "^lo,docker,veth"

Reference: EFA cheatsheet.


@KeitaW KeitaW left a comment


Review Batch 3/3 — Things That Look Great

  • Excellent README structure: The walkthrough is thorough, with clear architecture diagrams, a troubleshooting table, and step-by-step instructions. The ASCII architecture and resiliency stack diagrams are genuinely helpful.
  • Multi-arch Dockerfile is well-engineered: The multi-stage build with clear separation (clone → deps → python → release → EFA) is clean. The CUDA arch handling (SM 9.0-only for deep_ep/deep_gemm) shows deep understanding of the compilation requirements.
  • YAML anchor reuse: The &common-env, &common-mounts, and &common-volumes anchors in rayjob.yaml eliminate duplication between head and worker specs — exactly the right pattern.
  • Defensive patching code: patch_nvrx_features.py is careful about idempotency (checks if already patched), provides clear logging, and handles missing files gracefully.
  • Workshop-optimized design: The CLEAR_CHECKPOINTS env var, pre-cache job, and evaluation script show this was designed for live demos with real operator ergonomics in mind.
  • Conservative backoffLimit: backoffLimit: 2 in both manifests follows the repo convention for GPU jobs.
  • HF_HOME over TRANSFORMERS_CACHE: Correctly uses the non-deprecated env var throughout.
  • EFA configuration in Dockerfile: Proper --skip-kmod --skip-limit-conf --no-verify flags and exclusion-based NCCL_SOCKET_IFNAME pattern.
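
For readers unfamiliar with the anchor/alias pattern praised above, an illustrative-only fragment (field names abbreviated; not the actual manifest):

```yaml
# Define the env list once on the head spec with an anchor (&),
# then reuse it verbatim on the worker spec with an alias (*).
headGroupSpec:
  template:
    spec:
      containers:
      - name: ray-head
        env: &common-env
        - name: HF_HOME
          value: /shared/hf-cache
workerGroupSpecs:
- template:
    spec:
      containers:
      - name: ray-worker
        env: *common-env
```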


@KeitaW KeitaW left a comment


Thanks for the PR. Left a few comments.


KeitaW commented Mar 17, 2026

any update?


KeitaW commented Mar 18, 2026

Test case should live under 3.test_cases/megatron/, not pytorch/

NeMo RL is built on top of NVIDIA's Megatron-based stack — the training loop uses Megatron-Core for model parallelism and DTensor for distributed policy optimization. The pytorch/ top-level directory is reserved for frameworks that use PyTorch directly (FSDP, DDP, TorchTitan, etc.), while Megatron-derived frameworks belong under 3.test_cases/megatron/.

I'd suggest moving the entire test case to 3.test_cases/megatron/nemo-rl/. This keeps the framework taxonomy consistent and helps users find NeMo RL alongside related Megatron-based training examples.

(Same feedback on PR #1025.)


@paragao paragao left a comment


Cross-referencing with PR #1025 (same author, same NeMo RL GRPO pattern for P5en/H200). I tested #1025 hands-on and many of the same patterns appear here. These comments focus on issues not already covered by @KeitaW's review.

I agree with all of @KeitaW's existing comments (unpinned EFA installer, set -euo pipefail, YAML filename reference, NCCL_SOCKET_IFNAME, Dockerfile path).

echo "Date: $(date -u +%Y-%m-%dT%H:%M:%SZ)"

# Fix hostname for Gloo
MY_IP=$(hostname -I | awk '{print $1}')

/etc/hosts write fails without privileged containers

Same issue as PR #1025. This writes to /etc/hosts which requires either privileged: true or SYS_ADMIN capability. The rayjob.yaml only grants IPC_LOCK and SYS_RESOURCE — this line will fail with Permission denied.

With hostNetwork: true already set in the YAML, the container shares the host's /etc/hosts (which already maps the hostname), making this write redundant in the K8s case anyway.

If this is needed for Gloo backend hostname resolution, use Ray's built-in IP configuration instead:

  • Set RAY_ADDRESS environment variable, or
  • Pass --node-ip-address=$MY_IP to ray start

This avoids requiring privileged containers just for hostname resolution.

if init_anchor not in content:
print(f"[NVRx patch] FAIL: no __init__ anchor found in {target}")
return False


Missing nullcontext import in injected code

The straggler detection patch injects code that uses nullcontext() (the string template around line 159):

_nvrx_ctx = (
    _NVRxStragglerDetector.detection_section("forward_backward")
    if self._nvrx_straggler_enabled
    else nullcontext()
)

But nullcontext is never imported in the target file (dtensor_policy_worker_v2.py). This will cause a NameError at runtime when _nvrx_straggler_enabled is False.

Fix: add from contextlib import nullcontext to the import block injected at line ~53 of this file.

containers:
- name: ray-head
image: public.ecr.aws/v9l4g5s4/nemo-rl-full:main-efa-20260218
imagePullPolicy: Always

Container image doesn't match the Dockerfile in this PR

The YAML references public.ecr.aws/v9l4g5s4/nemo-rl-full:main-efa-20260218 but the Dockerfile in this PR builds nemo-rl-workshop. The README (Section 0) instructs users to build nemo-rl-workshop:latest.

Users following the README will build one image but the manifests deploy a different one. Either:

  1. Use a placeholder like <your-account>.dkr.ecr.<region>.amazonaws.com/nemo-rl-workshop:latest with instructions to replace, or
  2. Document that the pre-built image at this ECR path is the canonical one and skip the self-build step

value: "ens5"
- name: NCCL_DEBUG
value: "WARN"
- name: NCCL_TIMEOUT

NCCL_TIMEOUT=7200 (2 hours) seems too long for a fault-tolerance demo

For a fault-tolerance demo, a 2-hour NCCL timeout effectively disables NCCL-level failure detection. When a GPU fails or a rank dies, NCCL operations on surviving ranks will block for up to 2 hours before erroring out. Combined with TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=7200, this means:

  • NVRx heartbeat detects failure in ~30s ✓
  • But NCCL collective operations on surviving ranks still block for up to 2h ✗

Consider a shorter timeout (e.g., 300-600s) that lets NVRx handle recovery without NCCL blocking for hours. If these values were set to work around EFA Gen1 transient issues on g5, that's valid but should be documented.
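
A sketch of the shorter-timeout variant for the env block (values illustrative only, not validated on g5):

```yaml
# Illustrative only: let NVRx heartbeat (~30s) drive recovery while
# NCCL errors out within minutes instead of hours.
- name: NCCL_TIMEOUT
  value: "600"
- name: TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC
  value: "600"
```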

@@ -0,0 +1,272 @@
#!/usr/bin/env python3

Runtime source-file patching is fragile (design concern)

This script modifies installed NeMo RL source files at startup using exact string matching (e.g., finding peft_config: NotRequired[Any] and self.cp_size = distributed_manager.cp_size). This approach:

  • Breaks silently if NeMo RL updates the source (different whitespace, renamed variables, reordered code)
  • Requires deleting all __pycache__ to force recompilation
  • Makes debugging harder since the running code doesn't match the installed source

A more robust approach would be to subclass DTensorPolicyWorkerV2 and override train(), or use Python's unittest.mock.patch / functools.wraps to wrap methods at the API level. For CheckpointingConfig, consider proposing the is_async field upstream to NeMo RL.

Not blocking — the approach works as a workshop demo — but worth noting for maintainability.

containers:
- name: downloader
image: public.ecr.aws/v9l4g5s4/nemo-rl-full:main-efa-20260218
imagePullPolicy: Always

Same image mismatch as rayjob.yaml

Uses nemo-rl-full:main-efa-20260218 — same issue as the rayjob. Should reference the image built from the Dockerfile in this PR, or document the pre-built image as the canonical one.

@@ -0,0 +1,222 @@
# NeMo RL Workshop — Multi-arch (g5/g6e/p5) + EFA

Missing license headers on all 8 files

None of the files in this PR include the Apache 2.0 license header required by the repository. This applies to all 8 new files:

  • Dockerfile
  • README.md
  • kubernetes/rayjob.yaml
  • kubernetes/dataset-download-job.yaml
  • scripts/rayjob_entrypoint.sh
  • scripts/run_grpo_nvrx.py
  • scripts/evaluate_before_after.py
  • patches/patch_nvrx_features.py

@dmvevents
Author

Thanks @KeitaW — agreed. NeMo RL is built on Megatron-Core for model parallelism and DTensor for distributed policy optimization, so it belongs under 3.test_cases/megatron/ rather than pytorch/.

I'll restructure the test case to 3.test_cases/megatron/nemo-rl/ and update all internal paths. Will push the updated structure shortly.

@paragao — will address your feedback in the same update.
