
Add NeMo RL GRPO training with fault tolerance (NVRx) on EKS#1010

Open
dmvevents wants to merge 1 commit into awslabs:main from dmvevents:add-nemo-rl-grpo-example

Conversation

@dmvevents

Summary

  • Add a new test case for NVIDIA NeMo RL GRPO (Group Relative Policy Optimization) training on Amazon EKS with fault tolerance via NVIDIA Resiliency Extension (NVRx)
  • Includes a multi-arch Dockerfile (g5 A10G, g6e L40S, p5 H100), Kubernetes RayJob manifests, training scripts, and a workshop walkthrough
  • Demonstrates resilient LLM fine-tuning with automatic checkpoint recovery on GPU failure

Details

This example shows how to run reinforcement learning-based LLM alignment using NeMo RL's GRPO algorithm on EKS GPU clusters with:

  • KubeRay for declarative Ray cluster management
  • NVIDIA NVRx (ft_launcher, heartbeat monitoring, straggler detection) for process-level fault tolerance
  • Amazon FSx for Lustre for shared checkpoint storage and model caching
  • EFA networking via aws-ofi-nccl for low-latency GPU communication
  • LoRA for parameter-efficient training on smaller GPUs (A10G 24GB)

What's included

File                                    Purpose
README.md                               Overview, prerequisites, step-by-step instructions
Dockerfile                              Multi-arch container image (g5/g6e/p5)
kubernetes/rayjob.yaml                  RayJob manifest for EKS deployment
kubernetes/dataset-download-job.yaml    Pre-cache dataset and model to FSx
scripts/rayjob_entrypoint.sh            Training entrypoint
scripts/run_grpo_nvrx.py                NVRx wrapper (heartbeat + health check)
scripts/evaluate_before_after.py        Before/after training evaluation
patches/patch_nvrx_features.py          Runtime NVRx feature patches

Tested on

  • 2x g5.8xlarge (1x A10G each) with EKS 1.31, KubeRay 1.3.0
  • Qwen2.5-1.5B-Instruct, 20 GRPO steps, ~15 min end-to-end
  • Fault injection (kill -9) with automatic recovery from FSx checkpoint

Test plan

  • Deploy on fresh EKS cluster with 2x g5.8xlarge
  • Run dataset download job to completion
  • Deploy RayJob and verify 20/20 training steps complete
  • Inject fault at step 6+ and verify checkpoint recovery
  • Run evaluation script and verify improvement (3/6 -> 6/6)
  • Build Docker image from Dockerfile on CPU-only instance

New test case for NVIDIA NeMo RL GRPO training on Amazon EKS with:
- NVIDIA Resiliency Extension (NVRx) for fault tolerance
- Multi-arch Dockerfile (g5 A10G, g6e L40S, p5 H100)
- KubeRay RayJob manifests with FSx Lustre shared storage
- Training scripts with heartbeat monitoring and straggler detection
- Before/after evaluation demonstrating math reasoning improvement

Tested on 2x g5.8xlarge with Qwen2.5-1.5B-Instruct, 20 GRPO steps.

@KeitaW KeitaW left a comment


Review Batch 1/3 — Structure & Repository Hygiene

This PR adds a comprehensive new test case for NVIDIA NeMo RL GRPO training with fault tolerance (NVRx) on Amazon EKS. The contribution is well-structured and clearly written, though there are several repo-convention issues that I'd suggest addressing before merging.

Missing copyright and license headers on all files

None of the 8 files in this PR include the required copyright header:

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

I'd suggest adding it to the top of each file — for the Dockerfile, before the first comment block; for Python files, before the docstring; for YAML files, as the first line(s); for the shell script, after the shebang line.

Reference: CLAUDE.md conventions

Comment on lines +22 to +23
# Build (multi-arch, ~40 min on c5.4xlarge):
# docker build -f docker/Dockerfile.workshop -t nemo-rl-workshop .

Dockerfile header references wrong file path

This looks like a leftover from when the file was at a different path in the upstream NeMo RL repo. The actual file is 3.test_cases/pytorch/nemo-rl/Dockerfile, not docker/Dockerfile.workshop.

I'd suggest updating these comments to match the actual location in this repo:

Suggested change
# Build (multi-arch, ~40 min on c5.4xlarge):
# docker build -f docker/Dockerfile.workshop -t nemo-rl-workshop .
# Build (multi-arch, ~40 min on c5.4xlarge):
# docker build -t nemo-rl-workshop .

Comment on lines +182 to +184
RUN cd /tmp && \
curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz \
| tar xz && \

EFA installer uses unpinned latest URL

The EFA installer is fetched via aws-efa-installer-latest.tar.gz, which resolves to whatever the current version is at build time. This violates the repo convention that all external dependencies must be pinned to a specific version.

I'd suggest pinning to a specific version, consistent with how OFI_NCCL_VERSION is already pinned:

Suggested change
RUN cd /tmp && \
curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz \
| tar xz && \
RUN cd /tmp && \
curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-1.38.0.tar.gz \
| tar xz && \

Alternatively, add an ARG EFA_INSTALLER_VERSION=1.38.0 alongside the other version ARGs and use it here.

Reference: CI version check enforces EFA >= 1.47.0.
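
A sketch of the ARG-based variant, assuming an `ARG` block already exists near the other version pins (the version number shown is illustrative and has not been checked against the >= 1.47.0 CI gate):

```dockerfile
# Illustrative only: pin the EFA installer via a build ARG alongside
# the other version ARGs, so one line controls the version.
ARG EFA_INSTALLER_VERSION=1.38.0
RUN cd /tmp && \
    curl -sL "https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz" \
    | tar xz
```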


@KeitaW KeitaW left a comment


Review Batch 2/3 — Deployment Pipeline & Infrastructure

@@ -0,0 +1,89 @@
#!/bin/bash
set -e

Shell script uses set -e instead of set -euo pipefail

The repo convention requires shell scripts to start with set -euo pipefail (or set -ex at minimum). Using only set -e means unset variable references won't be caught (-u) and failures in piped commands won't propagate (-o pipefail).

Suggested change
set -e
set -exuo pipefail

# 3. NVRx scripts on FSx: /shared/nvrx-demo/{patches,scripts}/
# 4. NVIDIA + EFA device plugins on g5 nodes
#
# Deploy: kubectl apply -f g5-rayjob-qwen-nvrx.yaml

rayjob.yaml comment references wrong filename

The deploy comment references g5-rayjob-qwen-nvrx.yaml but the actual filename is rayjob.yaml.

Suggested change
# Deploy: kubectl apply -f g5-rayjob-qwen-nvrx.yaml
# Deploy: kubectl apply -f kubernetes/rayjob.yaml

Comment on lines +85 to +86
- name: NCCL_SOCKET_IFNAME
value: "ens5"

NCCL_SOCKET_IFNAME uses positive interface selection

The repo convention requires exclusion-based patterns (^lo for K8s, ^docker,lo,veth for Dockerfiles), not positive selection like ens5. While the comment explains the g5 dual-ENI rationale, positive selection breaks portability on other instance types.

I noticed the Dockerfile correctly uses ^lo,docker,veth,eni (line 218). I'd suggest using an exclusion pattern here too, since hostNetwork: true means container interfaces match the host:

Suggested change
- name: NCCL_SOCKET_IFNAME
value: "ens5"
- name: NCCL_SOCKET_IFNAME
value: "^lo,docker,veth"

Reference: EFA cheatsheet.


@KeitaW KeitaW left a comment


Review Batch 3/3 — Things That Look Great

  • Excellent README structure: The walkthrough is thorough, with clear architecture diagrams, a troubleshooting table, and step-by-step instructions. The ASCII architecture and resiliency stack diagrams are genuinely helpful.
  • Multi-arch Dockerfile is well-engineered: The multi-stage build with clear separation (clone → deps → python → release → EFA) is clean. The CUDA arch handling (SM 9.0-only for deep_ep/deep_gemm) shows deep understanding of the compilation requirements.
  • YAML anchor reuse: The &common-env, &common-mounts, and &common-volumes anchors in rayjob.yaml eliminate duplication between head and worker specs — exactly the right pattern.
  • Defensive patching code: patch_nvrx_features.py is careful about idempotency (checks if already patched), provides clear logging, and handles missing files gracefully.
  • Workshop-optimized design: The CLEAR_CHECKPOINTS env var, pre-cache job, and evaluation script show this was designed for live demos with real operator ergonomics in mind.
  • Conservative backoffLimit: backoffLimit: 2 in both manifests follows the repo convention for GPU jobs.
  • HF_HOME over TRANSFORMERS_CACHE: Correctly uses the non-deprecated env var throughout.
  • EFA configuration in Dockerfile: Proper --skip-kmod --skip-limit-conf --no-verify flags and exclusion-based NCCL_SOCKET_IFNAME pattern.
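
For readers unfamiliar with the anchor/alias pattern praised above, an illustrative-only fragment (field names abbreviated; not the actual manifest):

```yaml
# Define the env list once on the head spec with an anchor (&),
# then reuse it verbatim on the worker spec with an alias (*).
headGroupSpec:
  template:
    spec:
      containers:
      - name: ray-head
        env: &common-env
        - name: HF_HOME
          value: /shared/hf-cache
workerGroupSpecs:
- template:
    spec:
      containers:
      - name: ray-worker
        env: *common-env
```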


@KeitaW KeitaW left a comment


Thanks for the PR. Left a few comments.


KeitaW commented Mar 17, 2026

any update?


KeitaW commented Mar 18, 2026

Test case should live under 3.test_cases/megatron/, not pytorch/

NeMo RL is built on top of NVIDIA's Megatron-based stack — the training loop uses Megatron-Core for model parallelism and DTensor for distributed policy optimization. The pytorch/ top-level directory is reserved for frameworks that use PyTorch directly (FSDP, DDP, TorchTitan, etc.), while Megatron-derived frameworks belong under 3.test_cases/megatron/.

I'd suggest moving the entire test case to 3.test_cases/megatron/nemo-rl/. This keeps the framework taxonomy consistent and helps users find NeMo RL alongside related Megatron-based training examples.

(Same feedback on PR #1025.)


@paragao paragao left a comment


Cross-referencing with PR #1025 (same author, same NeMo RL GRPO pattern for P5en/H200). I tested #1025 hands-on and many of the same patterns appear here. These comments focus on issues not already covered by @KeitaW's review.

I agree with all of @KeitaW's existing comments (unpinned EFA installer, set -euo pipefail, YAML filename reference, NCCL_SOCKET_IFNAME, Dockerfile path).

echo "Date: $(date -u +%Y-%m-%dT%H:%M:%SZ)"

# Fix hostname for Gloo
MY_IP=$(hostname -I | awk '{print $1}')

/etc/hosts write fails without privileged containers

Same issue as PR #1025. This writes to /etc/hosts which requires either privileged: true or SYS_ADMIN capability. The rayjob.yaml only grants IPC_LOCK and SYS_RESOURCE — this line will fail with Permission denied.

With hostNetwork: true already set in the YAML, the container shares the host's /etc/hosts (which already maps the hostname), making this write redundant in the K8s case anyway.

If this is needed for Gloo backend hostname resolution, use Ray's built-in IP configuration instead:

  • Set RAY_ADDRESS environment variable, or
  • Pass --node-ip-address=$MY_IP to ray start

This avoids requiring privileged containers just for hostname resolution.

if init_anchor not in content:
print(f"[NVRx patch] FAIL: no __init__ anchor found in {target}")
return False


Missing nullcontext import in injected code

The straggler detection patch injects code that uses nullcontext() (the string template around line 159):

_nvrx_ctx = (
    _NVRxStragglerDetector.detection_section("forward_backward")
    if self._nvrx_straggler_enabled
    else nullcontext()
)

But nullcontext is never imported in the target file (dtensor_policy_worker_v2.py). This will cause a NameError at runtime when _nvrx_straggler_enabled is False.

Fix: add from contextlib import nullcontext to the import block injected at line ~53 of this file.

containers:
- name: ray-head
image: public.ecr.aws/v9l4g5s4/nemo-rl-full:main-efa-20260218
imagePullPolicy: Always

Container image doesn't match the Dockerfile in this PR

The YAML references public.ecr.aws/v9l4g5s4/nemo-rl-full:main-efa-20260218 but the Dockerfile in this PR builds nemo-rl-workshop. The README (Section 0) instructs users to build nemo-rl-workshop:latest.

Users following the README will build one image but the manifests deploy a different one. Either:

  1. Use a placeholder like <your-account>.dkr.ecr.<region>.amazonaws.com/nemo-rl-workshop:latest with instructions to replace, or
  2. Document that the pre-built image at this ECR path is the canonical one and skip the self-build step

value: "ens5"
- name: NCCL_DEBUG
value: "WARN"
- name: NCCL_TIMEOUT

NCCL_TIMEOUT=7200 (2 hours) seems too long for a fault-tolerance demo

For a fault-tolerance demo, a 2-hour NCCL timeout effectively disables NCCL-level failure detection. When a GPU fails or a rank dies, NCCL operations on surviving ranks will block for up to 2 hours before erroring out. Combined with TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=7200, this means:

  • NVRx heartbeat detects failure in ~30s ✓
  • But NCCL collective operations on surviving ranks still block for up to 2h ✗

Consider a shorter timeout (e.g., 300-600s) that lets NVRx handle recovery without NCCL blocking for hours. If these values were set to work around EFA Gen1 transient issues on g5, that's valid but should be documented.
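
A sketch of the shorter-timeout variant for the env block (values illustrative only, not validated on g5):

```yaml
# Illustrative only: let NVRx heartbeat (~30s) drive recovery while
# NCCL errors out within minutes instead of hours.
- name: NCCL_TIMEOUT
  value: "600"
- name: TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC
  value: "600"
```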

@@ -0,0 +1,272 @@
#!/usr/bin/env python3

Runtime source-file patching is fragile (design concern)

This script modifies installed NeMo RL source files at startup using exact string matching (e.g., finding peft_config: NotRequired[Any] and self.cp_size = distributed_manager.cp_size). This approach:

  • Breaks silently if NeMo RL updates the source (different whitespace, renamed variables, reordered code)
  • Requires deleting all __pycache__ to force recompilation
  • Makes debugging harder since the running code doesn't match the installed source

A more robust approach would be to subclass DTensorPolicyWorkerV2 and override train(), or use Python's unittest.mock.patch / functools.wraps to wrap methods at the API level. For CheckpointingConfig, consider proposing the is_async field upstream to NeMo RL.

Not blocking — the approach works as a workshop demo — but worth noting for maintainability.

containers:
- name: downloader
image: public.ecr.aws/v9l4g5s4/nemo-rl-full:main-efa-20260218
imagePullPolicy: Always

Same image mismatch as rayjob.yaml

Uses nemo-rl-full:main-efa-20260218 — same issue as the rayjob. Should reference the image built from the Dockerfile in this PR, or document the pre-built image as the canonical one.

@@ -0,0 +1,222 @@
# NeMo RL Workshop — Multi-arch (g5/g6e/p5) + EFA

Missing license headers on all 8 files

None of the files in this PR include the Apache 2.0 license header required by the repository. This applies to all 8 new files:

  • Dockerfile
  • README.md
  • kubernetes/rayjob.yaml
  • kubernetes/dataset-download-job.yaml
  • scripts/rayjob_entrypoint.sh
  • scripts/run_grpo_nvrx.py
  • scripts/evaluate_before_after.py
  • patches/patch_nvrx_features.py

@dmvevents
Author

Thanks @KeitaW — agreed. NeMo RL is built on Megatron-Core for model parallelism and DTensor for distributed policy optimization, so it belongs under 3.test_cases/megatron/ rather than pytorch/.

I'll restructure the test case to 3.test_cases/megatron/nemo-rl/ and update all internal paths. Will push the updated structure shortly.

@paragao — will address your feedback in the same update.
