Add NeMo RL GRPO training with fault tolerance (NVRx) on EKS #1010
dmvevents wants to merge 1 commit into awslabs:main from
Conversation
New test case for NVIDIA NeMo RL GRPO training on Amazon EKS with:
- NVIDIA Resiliency Extension (NVRx) for fault tolerance
- Multi-arch Dockerfile (g5 A10G, g6e L40S, p5 H100)
- KubeRay RayJob manifests with FSx Lustre shared storage
- Training scripts with heartbeat monitoring and straggler detection
- Before/after evaluation demonstrating math reasoning improvement

Tested on 2x g5.8xlarge with Qwen2.5-1.5B-Instruct, 20 GRPO steps.
KeitaW
left a comment
Review Batch 1/3 — Structure & Repository Hygiene
This PR adds a comprehensive new test case for NVIDIA NeMo RL GRPO training with fault tolerance (NVRx) on Amazon EKS. The contribution is well-structured and clearly written, though there are several repo-convention issues that I'd suggest addressing before merging.
Missing copyright and license headers on all files
None of the 8 files in this PR include the required copyright header:
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
I'd suggest adding it to the top of each file — for the Dockerfile, before the first comment block; for Python files, before the docstring; for YAML files, as the first line(s); for the shell script, after the shebang line.
Reference: CLAUDE.md conventions
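The placement rule above (shebang first, then the header) can be sketched as a small helper; the function name and the single-pass string handling are illustrative, not part of this PR:

```python
# Hypothetical helper: prepend the required header, keeping a shebang on line 1.
HEADER = (
    "# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.\n"
    "# SPDX-License-Identifier: MIT-0\n"
)

def add_header(text: str) -> str:
    if text.startswith("#!"):
        # Shell scripts: the header goes after the shebang line.
        shebang, _, rest = text.partition("\n")
        return f"{shebang}\n{HEADER}{rest}"
    # YAML, Dockerfile, Python: the header goes first (before any docstring).
    return HEADER + text

print(add_header("#!/bin/bash\nset -euo pipefail\n").splitlines()[1])
```

Running it over the eight files in the PR (skipping any that already contain `SPDX-License-Identifier`) would make the change mechanical.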
```
# Build (multi-arch, ~40 min on c5.4xlarge):
#   docker build -f docker/Dockerfile.workshop -t nemo-rl-workshop .
```
Dockerfile header references wrong file path
This looks like a leftover from when the file was at a different path in the upstream NeMo RL repo. The actual file is 3.test_cases/pytorch/nemo-rl/Dockerfile, not docker/Dockerfile.workshop.
I'd suggest updating these comments to match the actual location in this repo:
```diff
  # Build (multi-arch, ~40 min on c5.4xlarge):
- #   docker build -f docker/Dockerfile.workshop -t nemo-rl-workshop .
+ #   docker build -t nemo-rl-workshop .
```
```
RUN cd /tmp && \
    curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz \
    | tar xz && \
```
EFA installer uses unpinned latest URL
The EFA installer is fetched via aws-efa-installer-latest.tar.gz, which resolves to whatever the current version is at build time. This violates the repo convention that all external dependencies must be pinned to a specific version.
I'd suggest pinning to a specific version, consistent with how OFI_NCCL_VERSION is already pinned:
```diff
  RUN cd /tmp && \
-     curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz \
+     curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-1.38.0.tar.gz \
      | tar xz && \
```
Alternatively, add an ARG EFA_INSTALLER_VERSION=1.38.0 alongside the other version ARGs and use it here.
Reference: CI version check enforces EFA >= 1.47.0.
KeitaW
left a comment
Review Batch 2/3 — Deployment Pipeline & Infrastructure
```
@@ -0,0 +1,89 @@
#!/bin/bash
set -e
```
Shell script uses set -e instead of set -euo pipefail
The repo convention requires shell scripts to start with set -euo pipefail (or set -ex at minimum). Using only set -e means unset variable references won't be caught (-u) and failures in piped commands won't propagate (-o pipefail).
```diff
- set -e
+ set -exuo pipefail
```
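To see why `-o pipefail` matters concretely, here is a small demonstration (driven from Python for portability; the bash snippets themselves are the point):

```python
import subprocess

# With plain `set -e`, a failure on the left side of a pipe is masked by the
# exit status of the last command in the pipe, so the script keeps going.
plain = subprocess.run(
    ["bash", "-c", "set -e; false | cat; echo survived"],
    capture_output=True, text=True,
)
# With `pipefail`, the pipeline's status is the first failing command's status,
# and `set -e` aborts the script before `echo` runs.
strict = subprocess.run(
    ["bash", "-c", "set -euo pipefail; false | cat; echo survived"],
    capture_output=True, text=True,
)
print(plain.returncode, plain.stdout.strip())    # failure swallowed, "survived" prints
print(strict.returncode, strict.stdout.strip())  # non-zero exit, nothing printed
```

The `rayjob_entrypoint.sh` script pipes `curl` and `hostname` output in several places, so this is not hypothetical for this PR.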
```
# 3. NVRx scripts on FSx: /shared/nvrx-demo/{patches,scripts}/
# 4. NVIDIA + EFA device plugins on g5 nodes
#
# Deploy: kubectl apply -f g5-rayjob-qwen-nvrx.yaml
```
rayjob.yaml comment references wrong filename
The deploy comment references g5-rayjob-qwen-nvrx.yaml but the actual filename is rayjob.yaml.
```diff
- # Deploy: kubectl apply -f g5-rayjob-qwen-nvrx.yaml
+ # Deploy: kubectl apply -f kubernetes/rayjob.yaml
```
```yaml
- name: NCCL_SOCKET_IFNAME
  value: "ens5"
```
NCCL_SOCKET_IFNAME uses positive interface selection
The repo convention requires exclusion-based patterns (^lo for K8s, ^docker,lo,veth for Dockerfiles), not positive selection like ens5. While the comment explains the g5 dual-ENI rationale, positive selection breaks portability on other instance types.
I noticed the Dockerfile correctly uses ^lo,docker,veth,eni (line 218). I'd suggest using an exclusion pattern here too, since hostNetwork: true means container interfaces match the host:
```diff
  - name: NCCL_SOCKET_IFNAME
-   value: "ens5"
+   value: "^lo,docker,veth"
```
Reference: EFA cheatsheet.
KeitaW
left a comment
Review Batch 3/3 — Things That Look Great
- Excellent README structure: The walkthrough is thorough, with clear architecture diagrams, a troubleshooting table, and step-by-step instructions. The ASCII architecture and resiliency stack diagrams are genuinely helpful.
- Multi-arch Dockerfile is well-engineered: The multi-stage build with clear separation (clone → deps → python → release → EFA) is clean. The CUDA arch handling (SM 9.0-only for deep_ep/deep_gemm) shows deep understanding of the compilation requirements.
- YAML anchor reuse: The `&common-env`, `&common-mounts`, and `&common-volumes` anchors in rayjob.yaml eliminate duplication between head and worker specs — exactly the right pattern.
- Defensive patching code: `patch_nvrx_features.py` is careful about idempotency (checks if already patched), provides clear logging, and handles missing files gracefully.
- Workshop-optimized design: The `CLEAR_CHECKPOINTS` env var, pre-cache job, and evaluation script show this was designed for live demos with real operator ergonomics in mind.
- Conservative backoffLimit: `backoffLimit: 2` in both manifests follows the repo convention for GPU jobs.
- HF_HOME over TRANSFORMERS_CACHE: Correctly uses the non-deprecated env var throughout.
- EFA configuration in Dockerfile: Proper `--skip-kmod --skip-limit-conf --no-verify` flags and exclusion-based `NCCL_SOCKET_IFNAME` pattern.
KeitaW
left a comment
Thanks for the PR. Left a few comments.
any update?

Test case should live under
paragao
left a comment
Cross-referencing with PR #1025 (same author, same NeMo RL GRPO pattern for P5en/H200). I tested #1025 hands-on and many of the same patterns appear here. These comments focus on issues not already covered by @KeitaW's review.
I agree with all of @KeitaW's existing comments (unpinned EFA installer, set -euo pipefail, YAML filename reference, NCCL_SOCKET_IFNAME, Dockerfile path).
```bash
echo "Date: $(date -u +%Y-%m-%dT%H:%M:%SZ)"

# Fix hostname for Gloo
MY_IP=$(hostname -I | awk '{print $1}')
```
/etc/hosts write fails without privileged containers
Same issue as PR #1025. This writes to /etc/hosts which requires either privileged: true or SYS_ADMIN capability. The rayjob.yaml only grants IPC_LOCK and SYS_RESOURCE — this line will fail with Permission denied.
With hostNetwork: true already set in the YAML, the container shares the host's /etc/hosts (which already maps the hostname), making this write redundant in the K8s case anyway.
If this is needed for Gloo backend hostname resolution, use Ray's built-in IP configuration instead:
- Set the `RAY_ADDRESS` environment variable, or
- Pass `--node-ip-address=$MY_IP` to `ray start`
This avoids requiring privileged containers just for hostname resolution.
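A sketch of the second option (the function name and port defaults are illustrative; `--node-ip-address` and `--address` are real `ray start` flags):

```python
import shlex

def ray_start_cmd(node_ip: str, head: bool, head_addr: str = "ray-head:6379") -> list:
    # Pass the pod IP explicitly so Gloo/Ray never need to resolve the
    # container hostname: no /etc/hosts edit, no extra capabilities.
    args = ["ray", "start", f"--node-ip-address={node_ip}"]
    if head:
        args += ["--head", "--port=6379"]
    else:
        args += [f"--address={head_addr}"]
    return args

print(shlex.join(ray_start_cmd("10.0.1.23", head=False)))
# → ray start --node-ip-address=10.0.1.23 --address=ray-head:6379
```

The entrypoint already computes `MY_IP` via `hostname -I | awk '{print $1}'`, so only the consumer of that value changes.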
```python
if init_anchor not in content:
    print(f"[NVRx patch] FAIL: no __init__ anchor found in {target}")
    return False
```
Missing nullcontext import in injected code
The straggler detection patch injects code that uses nullcontext() (the string template around line 159):
```python
_nvrx_ctx = (
    _NVRxStragglerDetector.detection_section("forward_backward")
    if self._nvrx_straggler_enabled
    else nullcontext()
)
```

But `nullcontext` is never imported in the target file (dtensor_policy_worker_v2.py). This will cause a `NameError` at runtime when `_nvrx_straggler_enabled` is False.
Fix: add from contextlib import nullcontext to the import block injected at line ~53 of this file.
```yaml
containers:
- name: ray-head
  image: public.ecr.aws/v9l4g5s4/nemo-rl-full:main-efa-20260218
  imagePullPolicy: Always
```
Container image doesn't match the Dockerfile in this PR
The YAML references public.ecr.aws/v9l4g5s4/nemo-rl-full:main-efa-20260218 but the Dockerfile in this PR builds nemo-rl-workshop. The README (Section 0) instructs users to build nemo-rl-workshop:latest.
Users following the README will build one image but the manifests deploy a different one. Either:
- Use a placeholder like `<your-account>.dkr.ecr.<region>.amazonaws.com/nemo-rl-workshop:latest` with instructions to replace, or
- Document that the pre-built image at this ECR path is the canonical one and skip the self-build step
```yaml
  value: "ens5"
- name: NCCL_DEBUG
  value: "WARN"
- name: NCCL_TIMEOUT
```
NCCL_TIMEOUT=7200 (2 hours) seems too long for a fault-tolerance demo
For a fault-tolerance demo, a 2-hour NCCL timeout effectively disables NCCL-level failure detection. When a GPU fails or a rank dies, NCCL operations on surviving ranks will block for up to 2 hours before erroring out. Combined with TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=7200, this means:
- NVRx heartbeat detects failure in ~30s ✓
- But NCCL collective operations on surviving ranks still block for up to 2h ✗
Consider a shorter timeout (e.g., 300-600s) that lets NVRx handle recovery without NCCL blocking for hours. If these values were set to work around EFA Gen1 transient issues on g5, that's valid but should be documented.
```
@@ -0,0 +1,272 @@
#!/usr/bin/env python3
```
Runtime source-file patching is fragile (design concern)
This script modifies installed NeMo RL source files at startup using exact string matching (e.g., finding peft_config: NotRequired[Any] and self.cp_size = distributed_manager.cp_size). This approach:
- Breaks silently if NeMo RL updates the source (different whitespace, renamed variables, reordered code)
- Requires deleting all `__pycache__` directories to force recompilation
- Makes debugging harder since the running code doesn't match the installed source
A more robust approach would be to subclass DTensorPolicyWorkerV2 and override train(), or use Python's unittest.mock.patch / functools.wraps to wrap methods at the API level. For CheckpointingConfig, consider proposing the is_async field upstream to NeMo RL.
Not blocking — the approach works as a workshop demo — but worth noting for maintainability.
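To make the subclassing suggestion concrete, a hedged sketch follows; the class and method names are stand-ins for the real NeMo RL worker, which has a much larger interface:

```python
class DTensorPolicyWorkerV2:
    # Stand-in for the installed NeMo RL worker class.
    def train(self, batch):
        return {"loss": 0.1}

class NVRxPolicyWorker(DTensorPolicyWorkerV2):
    """Add NVRx hooks by overriding train() instead of rewriting source files."""
    def __init__(self, heartbeat):
        self._heartbeat = heartbeat

    def train(self, batch):
        self._heartbeat()            # liveness signal around the original step
        return super().train(batch)

beats = []
worker = NVRxPolicyWorker(heartbeat=lambda: beats.append(1))
result = worker.train(batch=None)
print(result, len(beats))  # → {'loss': 0.1} 1
```

Because the override lives in the workshop's own code, a NeMo RL upgrade that changes `train()` internals fails loudly at the `super()` boundary rather than silently skipping a string match.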
```yaml
containers:
- name: downloader
  image: public.ecr.aws/v9l4g5s4/nemo-rl-full:main-efa-20260218
  imagePullPolicy: Always
```
Same image mismatch as rayjob.yaml
Uses nemo-rl-full:main-efa-20260218 — same issue as the rayjob. Should reference the image built from the Dockerfile in this PR, or document the pre-built image as the canonical one.
```
@@ -0,0 +1,222 @@
# NeMo RL Workshop — Multi-arch (g5/g6e/p5) + EFA
```
Missing license headers on all 8 files
None of the files in this PR include the Apache 2.0 license header required by the repository. This applies to all 8 new files:
- `Dockerfile`
- `README.md`
- `kubernetes/rayjob.yaml`
- `kubernetes/dataset-download-job.yaml`
- `scripts/rayjob_entrypoint.sh`
- `scripts/run_grpo_nvrx.py`
- `scripts/evaluate_before_after.py`
- `patches/patch_nvrx_features.py`
Thanks @KeitaW — agreed. NeMo RL is built on Megatron-Core for model parallelism and DTensor for distributed policy optimization, so it belongs under
I'll restructure the test case to
@paragao — will address your feedback in the same update.
Summary
Details
This example shows how to run reinforcement learning-based LLM alignment using NeMo RL's GRPO algorithm on EKS GPU clusters with:
What's included
- `README.md`
- `Dockerfile`
- `kubernetes/rayjob.yaml`
- `kubernetes/dataset-download-job.yaml`
- `scripts/rayjob_entrypoint.sh`
- `scripts/run_grpo_nvrx.py`
- `scripts/evaluate_before_after.py`
- `patches/patch_nvrx_features.py`

Tested on
Test plan