Add NeMo RL GRPO training on P5en with EFA RDMA #1025
dmvevents wants to merge 3 commits into awslabs:main
Conversation
Multi-node GRPO reinforcement learning training using NVIDIA NeMo RL on Amazon EKS with 2x P5en.48xlarge (16x H200, 32x EFA). Trains Nemotron-Mini-4B-Instruct on synthetic math with attention-only LoRA, achieving a +7pp accuracy improvement (28% -> 35%) in 50 steps.

Includes:
- Dockerfile with EFA networking stack (libfabric 2.4, aws-ofi-nccl 1.18)
- RayJob K8s manifest for 2-node training with fault tolerance
- Training entrypoint with NCCL H200/NVSwitch workarounds
- Evaluation script for before/after accuracy comparison
- Full SBOM (Software Bill of Materials)

Stack: NeMo RL + Ray 2.49 + vLLM + DTensor + NCCL 2.27.5 + EFA
KeitaW
left a comment
Review Batch 1/3 — Structure & Repository Hygiene
Thanks for this contribution, Anton! The Dockerfile and README are excellent quality. I have several structural and documentation findings below.
Test case should live under 3.test_cases/megatron/, not pytorch/
NeMo RL is built on top of NVIDIA's Megatron-based stack — the training loop uses Megatron-Core for model parallelism and DTensor for distributed policy optimization. The pytorch/ top-level directory is reserved for frameworks that use PyTorch directly (FSDP, DDP, TorchTitan, etc.), while Megatron-derived frameworks belong under 3.test_cases/megatron/. Other Megatron-ecosystem test cases already live there.
I'd suggest moving the entire test case to 3.test_cases/megatron/nemo-rl/grpo/.
Missing license header on all files
The repo convention requires a copyright header on all contributed files:
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
I'd suggest adding this to the top of each file (after shebang lines where applicable).
Kubernetes manifest not in a platform-specific subdirectory
The repo convention places Kubernetes manifests in a kubernetes/ subdirectory within each test case. Currently rayjob-grpo.yaml is at the root level. I'd suggest moving it to kubernetes/rayjob-grpo.yaml to follow the pattern in other test cases like FSDP/kubernetes/.
```dockerfile
# GDRCopy — GPU Direct RDMA copy library (needed for EFA GDRDMA)
RUN cd /tmp && \
    git clone --depth 1 https://github.com/NVIDIA/gdrcopy.git && \
```
Unpinned dependency: GDRCopy
This clones HEAD of GDRCopy with no tag or commit pin. Upstream changes could silently break the build. I'd suggest pinning to the version that was tested:
```diff
-    git clone --depth 1 https://github.com/NVIDIA/gdrcopy.git && \
+    git clone --depth 1 --branch v2.4.4 https://github.com/NVIDIA/gdrcopy.git && \
```
Reference: repo convention on fixed version tags
```dockerfile
dpkg --purge --force-depends libfabric1-aws libfabric-aws-bin libfabric-aws-dev 2>/dev/null || true
rm -rf /opt/amazon/efa
cd /tmp
curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz | tar xz
```
Unpinned dependency: EFA installer
aws-efa-installer-latest.tar.gz always pulls the latest version. The repo convention requires pinned versions for reproducible builds. I'd suggest pinning to the version that was tested, e.g.:
```diff
-curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz | tar xz
+curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-1.38.0.tar.gz | tar xz
```
(Adjust version number to match what you tested with.) The same issue appears on line 272 for the minimal mode path.
KeitaW
left a comment
Review Batch 2/3 — Deployment Pipeline + Training Code + Infrastructure
```yaml
  namespace: antonai
  labels:
    app: grpo-nvrx-rayjob
spec:
  entrypoint: bash /shared/nvrx-demo/scripts/rayjob_countdown_entrypoint.sh
```
Hardcoded namespace and personal paths in RayJob manifest
The manifest hardcodes namespace: antonai and the entrypoint references /shared/nvrx-demo/scripts/rayjob_countdown_entrypoint.sh — a path on your personal shared filesystem. Additionally, the entrypoint script included in this PR (rayjob_entrypoint.sh) is not what the manifest points to.
I'd suggest:
- Change the namespace to a placeholder like `<your-namespace>` or a generic name like `nemo-rl`
- Update the entrypoint to reference the included `rayjob_entrypoint.sh` (e.g., bake it into the container image or mount it)
- Add a comment noting which values users need to customize
The same /shared/nvrx-demo/ prefix appears in 10+ places across the entrypoint script and manifest env vars. Centralizing this as a documented variable (e.g., $SHARED_DIR) would make adaptation clearer.
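A minimal sketch of that pattern, with illustrative variable names and defaults (only `SHARED_DIR` comes from this suggestion; the rest are hypothetical):

```shell
# Hypothetical sketch: every path derives from one documented variable,
# so adapting the test case means changing only SHARED_DIR.
SHARED_DIR="${SHARED_DIR:-/shared/nemo-rl}"   # <-- change to your FSx mount
CKPT_DIR="${CKPT_DIR:-${SHARED_DIR}/checkpoints}"
DATA_PATH="${DATA_PATH:-${SHARED_DIR}/goldilocks/train.jsonl}"
HF_HOME="${HF_HOME:-${SHARED_DIR}/hf_cache}"
echo "checkpoints: ${CKPT_DIR}"
```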
```python
try:
    from peft import PeftModel
except ImportError:
    import subprocess
    subprocess.check_call([sys.executable, "-m", "pip", "install", "peft", "-q"])
    from peft import PeftModel
```
Runtime pip install peft fallback
This try/except runs pip install peft -q at runtime if the package isn't installed. Runtime pip installs are non-reproducible (unpinned version, network dependency) and can fail in air-gapped environments. I'd suggest either:
- Adding `peft` to the Dockerfile's Python dependencies, or
- Documenting `peft` as a prerequisite in the README and removing the runtime fallback
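For the Dockerfile option, a one-line sketch (the pinned version is illustrative; use whatever was actually tested):

```dockerfile
# Build peft into the image so the eval script never installs at runtime
# (version pin is illustrative)
RUN pip install --no-cache-dir "peft==0.13.2"
```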
```
- `pip-freeze.txt` — Python packages (235 packages)
- `dpkg.txt` — System packages (356 packages)

Key versions: PyTorch 2.9.0+cu129, CUDA 12.9, NCCL 2.27.5, NeMo RL 0.5.0rc0, Ray 2.49.2, vLLM (in worker venvs), NVRx 0.4.1
```
NCCL version mismatch between README and SBOM
This line states "NCCL 2.27.5" but the SBOM (sbom/dpkg.txt) shows libnccl2,2.26.5-1+cuda12.9. Could you confirm which version was actually used in testing and update accordingly?
KeitaW
left a comment
Review Batch 3/3 — Documentation Consistency
Things That Look Great
- Multi-stage Dockerfile with excellent layer caching: The strategy of copying only `pyproject.toml` + `uv.lock` first to cache the expensive dependency install stage, then copying the full source later, is a production-quality pattern that saves significant rebuild time.
- Dual EFA install mode: Supporting both `full` and `minimal` EFA installation modes makes the image adaptable across instance types (P5en RDMA vs. socket-only).
- NRL_GIT_REF pinned to commit hash: Using a specific commit (`c40dba37789c`) rather than a branch name ensures build reproducibility for the NeMo RL source.
- Comprehensive SBOM: Including `dpkg.txt`, `pip-freeze.txt`, and `sbom.txt` gives full supply chain visibility — this is above and beyond what most test cases provide.
- Excellent README: The verified instance types table, architecture diagram, instance-specific configuration snippets, and known issues table with upstream issue links make this one of the most thorough READMEs in the repo.
- NVRx fault tolerance documentation: The fault injection test procedure and checkpoint resume behavior are well-documented with concrete steps, making the fault tolerance claims verifiable.
- YAML anchors for DRY configuration: Using `&common-env`, `&common-mounts`, and `&common-volumes` with YAML aliases avoids config drift between head and worker specs.
```yaml
# ============================================================
# RayJob: Qwen2.5-1.5B GRPO + NVRx on P5en (H200)
```
Model name mismatch between README, manifest, and entrypoint
There's a three-way inconsistency in which model is being trained:
- This comment (line 2): says "Qwen2.5-1.5B GRPO"
- `GRPO_MODEL` env var (line 115): sets `Qwen/Qwen2.5-0.5B-Instruct` (0.5B, not 1.5B)
- README title and results: describes training "Nemotron-Mini-4B-Instruct"
- Entrypoint script: references Nemotron and a phase 1 merge workflow
It looks like the manifest was adapted from a Qwen experiment but the README and entrypoint describe the Nemotron workflow. I'd suggest aligning these — a user following the README would encounter unexpected behavior from the manifest as-is.
```bash
if [ ! -f "$MERGED/model.safetensors" ]; then
    echo "Merging phase 1 LoRA checkpoint..."
    pip install peft -q 2>&1 | tail -1
    python3 /shared/nvrx-demo/scripts/merge_lora_checkpoint.py \
```
Entrypoint references non-existent files
This line references /shared/nvrx-demo/scripts/merge_lora_checkpoint.py, which isn't included in the PR. Other missing references:
- `/shared/nvrx-demo/goldilocks/train.jsonl` (dataset, line 80)
- `/shared/nvrx-demo/miracle-checkpoints` (phase 1 checkpoints, line 34)
These are on your FSx volume — other users won't have them. I'd suggest either:
- Including `merge_lora_checkpoint.py` in the PR (or documenting how to create it)
- Adding a "Data Preparation" section to the README explaining how to set up these paths
- Using a Hugging Face dataset directly, or documenting how to generate the JSONL
```python
os.environ.setdefault("HF_HOME", "/shared/nvrx-demo/hf_cache")

MODEL = "nvidia/Nemotron-Mini-4B-Instruct"
DATASET = "/shared/nvrx-demo/goldilocks/train.jsonl"
```
Eval script hardcodes local paths
The DATASET path points to /shared/nvrx-demo/goldilocks/train.jsonl on your shared filesystem. Users following the README won't have this file. I'd suggest making DATASET a command-line argument (like --model already is) with a helpful error message if the file doesn't exist.
```diff
-os.environ.setdefault("HF_HOME", "/shared/nvrx-demo/hf_cache")
-MODEL = "nvidia/Nemotron-Mini-4B-Instruct"
-DATASET = "/shared/nvrx-demo/goldilocks/train.jsonl"
+os.environ.setdefault("HF_HOME", "/shared/hf_cache")
+MODEL = "nvidia/Nemotron-Mini-4B-Instruct"
+DATASET = os.environ.get("EVAL_DATASET", "/shared/goldilocks/train.jsonl")
```
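Going a step further, the dataset could be a proper CLI flag. A hedged sketch (the flag names and `SHARED_DIR` fallback are illustrative, not the PR's actual interface):

```python
import argparse
import os

def parse_args(argv=None):
    # Illustrative: --dataset mirrors the existing --model flag, with a
    # SHARED_DIR-derived default instead of a hardcoded personal path.
    shared = os.environ.get("SHARED_DIR", "/shared")
    parser = argparse.ArgumentParser(description="Before/after accuracy eval")
    parser.add_argument("--model", default="nvidia/Nemotron-Mini-4B-Instruct")
    parser.add_argument(
        "--dataset", default=os.path.join(shared, "goldilocks", "train.jsonl"))
    return parser.parse_args(argv)

def require_dataset(path):
    # Fail early with a helpful message instead of a deep traceback later.
    if not os.path.exists(path):
        raise SystemExit(
            f"Dataset not found: {path} (pass --dataset or set SHARED_DIR)")
    return path

if __name__ == "__main__":
    args = parse_args(["--dataset", "/tmp/train.jsonl"])
    print(args.model, args.dataset)
```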
KeitaW
left a comment
Nice addition, a few comments.
Question: Why include the SBOM files in the repo? The Dockerfile already has what is needed to regenerate them from the built image, so committing them risks a stale inventory.
paragao
left a comment
Hands-On Testing Report — P5en H200 Cluster (Slurm + Pyxis)
I tested this PR on a 16-node P5en.48xlarge cluster (8x NVIDIA H200 143GB per node, EFA, FSx Lustre) running Slurm with Pyxis/Enroot.
What works
- Container image pulls and runs correctly on H200 hardware via `srun --container-image='docker://public.ecr.aws#v9l4g5s4/nemo-rl-workshop:p5en'`
- Verified versions inside container: PyTorch 2.9.0+cu129, CUDA 12.9, Ray 2.49.2, NeMo RL 0.5.0rc0, NCCL 2.27.5 (pip/runtime)
- EFA stack is functional: libfabric 2.4.0amzn1.0, EFA installer 1.47.0, `libnccl-ofi` 1.18.0, GDRCopy present
- 1-node GRPO training (8x H200) completes successfully — I ran 3 steps of the default NeMo RL GRPO example (Qwen2.5-1.5B + OpenMathInstruct-2) and all steps completed without errors. Step 1: 33s (cold), steps 2-3: ~10.3s each.
What doesn't work
- PR entrypoint fails immediately — all 4 data dependencies are missing (`goldilocks/train.jsonl`, `merge_lora_checkpoint.py`, `miracle-checkpoints/`, `nemotron-phase1-merged/`). These live on the author's FSx volume and are not included in the PR or documented for reproduction.
- `/etc/hosts` write in entrypoint fails — `echo "$MY_IP $(hostname)" >> /etc/hosts` returns `Permission denied` in containers that don't run as privileged/writable. The K8s YAML sets `privileged: true`, but this assumption should be documented.
Container EFA/NCCL Audit
I inspected the container's NCCL and EFA configuration in detail and found several issues:
1. Old OFI-NCCL path still exists
Both paths are present in the container:
| Path | Status |
|---|---|
| `/opt/amazon/aws-ofi-nccl/` | OLD — should not exist, leftover from base image |
| `/opt/amazon/ofi-nccl/` | NEW — correct path from EFA installer 1.47.0 |

The old path contains `libnccl-net-aws-ofi.so` and `libnccl-tuner-aws-ofi.so`, which could be loaded accidentally. The Dockerfile should clean up `/opt/amazon/aws-ofi-nccl/` after the EFA install.
2. NCCL_NET_PLUGIN container default is wrong
The container bakes NCCL_NET_PLUGIN=aws-ofi (old naming), but the correct value for EFA installer 1.47.0 is ofi. The PR YAML correctly overrides this to ofi, but anyone running the container without the YAML override gets the wrong plugin.
3. NCCL_TUNER_PLUGIN not set
The YAML does not set NCCL_TUNER_PLUGIN. The EFA tuner plugin exists at /opt/amazon/ofi-nccl/lib/libnccl-tuner-ofi.so and should be explicitly enabled for optimal performance.
4. AWS_OFI_NCCL_VERSION env var is stale
The container has AWS_OFI_NCCL_VERSION=1.14.0 baked in (from the base image), but the actually installed version is 1.18.0 (confirmed via dpkg -s libnccl-ofi).
5. Dual NCCL versions
| Source | Version |
|---|---|
| pip `nvidia-nccl-cu12` (what PyTorch loads at runtime) | 2.27.5 |
| dpkg `libnccl2` (system package) | 2.26.5 |
| `NCCL_VERSION` env var | 2.26.5 |
| `nccl.h` header | 2.26.5 |
PyTorch uses the pip wheel version (2.27.5) at runtime across all three Ray venvs (main, DTensor, vLLM). The system package 2.26.5 is dead weight. The README correctly states 2.27.5 for the runtime version, but this should be clarified since the SBOM and env var show 2.26.5.
6. GDRCopy should be pinned to v2.5.1
The Dockerfile clones GDRCopy HEAD with no tag (git clone --depth 1). This should be pinned to --branch v2.5.1.
```dockerfile
# GDRCopy — GPU Direct RDMA copy library (needed for EFA GDRDMA)
RUN cd /tmp && \
    git clone --depth 1 https://github.com/NVIDIA/gdrcopy.git && \
```
GDRCopy: pin to v2.5.1
This clones HEAD of GDRCopy with no tag or commit pin. Should be pinned to the tested version:
```diff
-    git clone --depth 1 https://github.com/NVIDIA/gdrcopy.git && \
+    git clone --depth 1 --branch v2.5.1 https://github.com/NVIDIA/gdrcopy.git && \
```
```dockerfile
dpkg --purge --force-depends libfabric1-aws libfabric-aws-bin libfabric-aws-dev 2>/dev/null || true
rm -rf /opt/amazon/efa
cd /tmp
curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz | tar xz
```
EFA installer: pin version and clean up old path
Two issues here:
1. `aws-efa-installer-latest.tar.gz` is not reproducible. Pin to a specific version.
2. After the EFA install, the old path `/opt/amazon/aws-ofi-nccl/` from the base image still exists alongside the new `/opt/amazon/ofi-nccl/`. Add a cleanup step: `rm -rf /opt/amazon/aws-ofi-nccl`. This prevents the old `libnccl-net-aws-ofi.so` from being loaded accidentally.
```dockerfile
# EFA / NCCL defaults
ENV FI_PROVIDER=efa
ENV FI_EFA_USE_DEVICE_RDMA=1
ENV NCCL_NET_PLUGIN=ofi
```
Container bakes wrong NCCL_NET_PLUGIN default
The container has NCCL_NET_PLUGIN=aws-ofi baked in (from the base image). With EFA installer 1.47.0, the correct value is ofi. The PR YAML correctly overrides this, but the container default should also be fixed.
Also, AWS_OFI_NCCL_VERSION=1.14.0 is stale — the actual installed version is 1.18.0.
```yaml
  value: "PHB"
- name: NCCL_CUMEM_ENABLE
  value: "0"
- name: NCCL_NVLS_ENABLE
```
Missing NCCL_TUNER_PLUGIN
The EFA tuner plugin is installed at /opt/amazon/ofi-nccl/lib/libnccl-tuner-ofi.so but NCCL_TUNER_PLUGIN is not set in the env vars. Add:
```yaml
- name: NCCL_TUNER_PLUGIN
  value: "ofi"
```

```bash
echo "Starting from step_60 merged model (has format knowledge)"
echo "=============================================="
MY_IP=$(hostname -I | awk '{print $1}')
echo "$MY_IP $(hostname)" >> /etc/hosts
```
Replace /etc/hosts write with Ray IP configuration
Tested on H200 cluster: `/etc/hosts: Permission denied`
This line writes to /etc/hosts to ensure Ray resolves the container's hostname to its real network IP instead of loopback. However, this requires privileged: true on the container and fails in non-privileged environments.
Instead, use Ray's built-in IP configuration:
- Set the `RAY_ADDRESS` environment variable, or
- Pass `--node-ip-address=$MY_IP` to `ray start`
This avoids the need for privileged containers just for hostname resolution, and is more portable across container runtimes (K8s, Enroot/Pyxis, etc.). With hostNetwork: true already set in the YAML, the container shares the host's /etc/hosts anyway, making this write redundant in the K8s case.
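A hedged sketch of the flag-based approach (`--node-ip-address` and `--address` are Ray's standard `ray start` options; the port and variable names are illustrative):

```shell
# Illustrative: give Ray the node IP explicitly; no /etc/hosts edit,
# so no privileged container is required.
MY_IP=$(hostname -I | awk '{print $1}')
RAY_HEAD_CMD="ray start --head --node-ip-address=${MY_IP} --port=6379"
# Workers would join with:
#   ray start --address=<head-ip>:6379 --node-ip-address=${MY_IP}
echo "${RAY_HEAD_CMD}"
```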
```diff
@@ -0,0 +1,89 @@
+#!/bin/bash
+set -e
```
Use set -euo pipefail
Currently only set -e. Missing -u (unset variable detection) and -o pipefail (piped command failure detection). Recommend:
```diff
-set -e
+set -euo pipefail
```
```
- `pip-freeze.txt` — Python packages (235 packages)
- `dpkg.txt` — System packages (356 packages)

Key versions: PyTorch 2.9.0+cu129, CUDA 12.9, NCCL 2.27.5, NeMo RL 0.5.0rc0, Ray 2.49.2, vLLM (in worker venvs), NVRx 0.4.1
```
NCCL version clarification
Adding context to @KeitaW's comment: I verified inside the container that there are two NCCL versions:
- pip `nvidia-nccl-cu12`: 2.27.5 — this is what PyTorch loads at runtime (confirmed via `torch.cuda.nccl.version()` on H200)
- dpkg `libnccl2`: 2.26.5 — system package, not used by PyTorch
The README's "NCCL 2.27.5" is correct for the runtime version. I'd suggest adding a note: "NCCL 2.27.5 (PyTorch wheel; system package is 2.26.5)" to avoid confusion with the SBOM.
```dockerfile
# MAX_JOBS            Parallel compilation limit (prevent OOM)
# EFA_INSTALL_MODE    "full" or "minimal" (default: full for P5en RDMA)

# syntax=docker/dockerfile:1.10
```
syntax directive must be line 1
# syntax=docker/dockerfile:1.10 is on line 43 (after the header comments). For BuildKit to recognize it, it must be the very first line of the Dockerfile. Move it above all comments.
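A sketch of the corrected file head (the base image reference is an illustrative placeholder):

```dockerfile
# syntax=docker/dockerfile:1.10
# ============================================================
# Header comments follow the directive; BuildKit only honors the
# syntax line when it is the very first line of the file.
# ============================================================
ARG BASE_IMAGE=<your-base-image>
FROM ${BASE_IMAGE}
```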
- Move test case from pytorch/nemo-rl to megatron/nemo-rl (NeMo RL is Megatron-based)
- Remove sbom/ directory from repo (too large); add extraction instructions in README
- Pin GDRCopy to v2.5.1 tag (was unpinned HEAD)
- Pin EFA installer to v1.47.0 (was -latest)
- Move `# syntax=docker/dockerfile:1.10` to line 1 of Dockerfile
- Add NCCL_TUNER_PLUGIN=ofi in Dockerfile, YAML, and entrypoint
- Fix model name: Qwen/Qwen2.5-0.5B-Instruct -> nvidia/Nemotron-Mini-4B-Instruct
- Fix namespace: antonai -> default with `<-- change` comment
- Fix entrypoint: set -euo pipefail, remove /etc/hosts write (Ray handles IP)
- Remove missing merge_lora_checkpoint.py reference; entrypoint now falls back to the Hugging Face model if no pre-merged model is found on shared storage
- Make all paths configurable via env vars (SHARED_DIR, CKPT_DIR, DATA_PATH, etc.)
- Fix eval script: add --dataset CLI arg, use SHARED_DIR env var for defaults
- Clarify NCCL version in README: 2.27.5 (pip/PyTorch) vs 2.26.5 (system dpkg)
- Replace hardcoded image refs with `<your-registry>` placeholders
Thanks @KeitaW and @paragao for the thorough reviews! All feedback has been addressed in the latest push.

Structural:
Dockerfile:
RayJob YAML:
Entrypoint:
Eval script:
README:
@paragao — thanks for testing on the P5en Slurm cluster! This PR targets EKS (RayJob), but the container image works on Slurm+Pyxis as you confirmed. Happy to add a Slurm section in a follow-up PR if that would be useful.
KeitaW
left a comment
Re-review 1/2 — Structure & Repository Hygiene
Great progress since the first review! Most findings addressed. A couple of structural items remain.
Missing license headers on all files
None of the 5 files include the repo-required copyright header:
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
For the Dockerfile, place it after the # syntax= directive. For Python, after the shebang. For YAML, at the top. For the README, use an HTML comment.
```diff
@@ -0,0 +1,213 @@
+# ============================================================
```
RayJob manifest not in a kubernetes/ subdirectory
The repo convention places Kubernetes manifests in a kubernetes/ subdirectory within each test case. I'd suggest moving this to 3.test_cases/megatron/nemo-rl/grpo/kubernetes/rayjob-grpo.yaml to follow the pattern in other test cases like FSDP/kubernetes/.
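For reference, one possible layout after the move (only the file names come from this PR; the `scripts/` grouping is illustrative):

```
3.test_cases/megatron/nemo-rl/grpo/
├── Dockerfile
├── README.md
├── kubernetes/
│   └── rayjob-grpo.yaml
└── scripts/
    ├── rayjob_entrypoint.sh
    └── eval_nemotron_goldilocks.py
```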
KeitaW
left a comment
Re-review 2/2 — Training Code Quality
Dataset preparation not documented
The entrypoint expects a Goldilocks math dataset at $DATA_PATH (default: ${SHARED_DIR}/goldilocks/train.jsonl), but the README doesn't explain how to obtain or generate this file. A brief "Data Preparation" section would help — even just noting the expected JSONL format ({"problem": "...", "answer": "..."}) and where to source the data.
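To make the expected format concrete, a minimal sketch that writes a valid `train.jsonl` (the sample problems are made up; only the key names come from the format noted above):

```python
import json
import os
import tempfile

# Two made-up rows in the expected {"problem": ..., "answer": ...} shape
rows = [
    {"problem": "What is 12 * 7?", "answer": "84"},
    {"problem": "Compute 15 + 27.", "answer": "42"},
]

path = os.path.join(tempfile.gettempdir(), "train.jsonl")
with open(path, "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Round-trip to confirm each line is standalone JSON
with open(path) as f:
    parsed = [json.loads(line) for line in f]
print(len(parsed), parsed[0]["answer"])
```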
Things That Look Great
- Excellent response to first review: Moved to `megatron/`, pinned EFA (1.47.0) and GDRCopy (v2.5.1), fixed model name alignment, parameterized the namespace with inline `# <-- change` markers, and explained the dual NCCL version situation clearly.
- `SHARED_DIR` env var pattern: The entrypoint and eval script now derive all paths from a single `SHARED_DIR` variable with sensible defaults — easy to adapt to different FSx mounts.
- Configurable training parameters: `GRPO_MODEL`, `GRPO_MAX_STEPS`, `GRPO_NUM_NODES`, `CKPT_DIR`, and `DATA_PATH` are all overridable via environment variables.
- Clear setup instructions in manifest comments: Numbered setup guide and `# <-- change` markers on every user-configurable value.
- SBOM moved to build-time: No stale inventory to maintain.
- NCCL version explanation: The README note about pip wheel 2.27.5 vs system 2.26.5 preempts common confusion.
- Proper `set -euo pipefail` in the entrypoint script.
- `NCCL_TUNER_PLUGIN=ofi`: Good addition for EFA-aware NCCL algorithm selection.
```python
try:
    from peft import PeftModel
except ImportError:
    import subprocess
    subprocess.check_call([sys.executable, "-m", "pip", "install", "peft", "-q"])
    from peft import PeftModel
```
Runtime pip install peft fallback still present
This try/except runs pip install peft -q at runtime if the package isn't installed. Runtime pip installs are non-reproducible (unpinned version, network dependency) and fail in air-gapped environments. I'd suggest either:
- Documenting `peft` as a prerequisite in the README's evaluation section, or
- Adding `peft` to the Dockerfile
Summary

Architecture
- `backoffLimit` for fault tolerance

Artifacts
- `Dockerfile`
- `rayjob-grpo.yaml`
- `rayjob_entrypoint.sh`
- `eval_nemotron_goldilocks.py`
- `sbom/`

Related upstream work

Test plan
- `backoffLimit: 2`