Add NeMo RL GRPO training on P5en with EFA RDMA #1025
dmvevents wants to merge 3 commits into awslabs:main
Conversation
Multi-node GRPO reinforcement learning training using NVIDIA NeMo RL on Amazon EKS with 2x P5en.48xlarge (16x H200, 32x EFA). Trains Nemotron-Mini-4B-Instruct on synthetic math with attention-only LoRA, achieving a +7pp accuracy improvement (28% -> 35%) in 50 steps.

Includes:
- Dockerfile with EFA networking stack (libfabric 2.4, aws-ofi-nccl 1.18)
- RayJob K8s manifest for 2-node training with fault tolerance
- Training entrypoint with NCCL H200/NVSwitch workarounds
- Evaluation script for before/after accuracy comparison
- Full SBOM (Software Bill of Materials)

Stack: NeMo RL + Ray 2.49 + vLLM + DTensor + NCCL 2.27.5 + EFA
KeitaW
left a comment
Review Batch 1/3 — Structure & Repository Hygiene
Thanks for this contribution, Anton! The Dockerfile and README are excellent quality. I have several structural and documentation findings below.
Test case should live under 3.test_cases/megatron/, not pytorch/
NeMo RL is built on top of NVIDIA's Megatron-based stack — the training loop uses Megatron-Core for model parallelism and DTensor for distributed policy optimization. The pytorch/ top-level directory is reserved for frameworks that use PyTorch directly (FSDP, DDP, TorchTitan, etc.), while Megatron-derived frameworks belong under 3.test_cases/megatron/. Other Megatron-ecosystem test cases already live there.
I'd suggest moving the entire test case to 3.test_cases/megatron/nemo-rl/grpo/.
Missing license header on all files
The repo convention requires a copyright header on all contributed files:
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
I'd suggest adding this to the top of each file (after shebang lines where applicable).
Kubernetes manifest not in a platform-specific subdirectory
The repo convention places Kubernetes manifests in a kubernetes/ subdirectory within each test case. Currently rayjob-grpo.yaml is at the root level. I'd suggest moving it to kubernetes/rayjob-grpo.yaml to follow the pattern in other test cases like FSDP/kubernetes/.
```dockerfile
# GDRCopy — GPU Direct RDMA copy library (needed for EFA GDRDMA)
RUN cd /tmp && \
    git clone --depth 1 https://github.com/NVIDIA/gdrcopy.git && \
```
Unpinned dependency: GDRCopy
This clones HEAD of GDRCopy with no tag or commit pin. Upstream changes could silently break the build. I'd suggest pinning to the version that was tested:
```diff
-    git clone --depth 1 https://github.com/NVIDIA/gdrcopy.git && \
+    git clone --depth 1 --branch v2.4.4 https://github.com/NVIDIA/gdrcopy.git && \
```
Reference: repo convention on fixed version tags
```dockerfile
dpkg --purge --force-depends libfabric1-aws libfabric-aws-bin libfabric-aws-dev 2>/dev/null || true
rm -rf /opt/amazon/efa
cd /tmp
curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz | tar xz
```
Unpinned dependency: EFA installer
aws-efa-installer-latest.tar.gz always pulls the latest version. The repo convention requires pinned versions for reproducible builds. I'd suggest pinning to the version that was tested, e.g.:
```diff
-curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz | tar xz
+curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-1.38.0.tar.gz | tar xz
```
(Adjust version number to match what you tested with.) The same issue appears on line 272 for the minimal mode path.
KeitaW
left a comment
Review Batch 2/3 — Deployment Pipeline + Training Code + Infrastructure
```yaml
  namespace: antonai
  labels:
    app: grpo-nvrx-rayjob
spec:
  entrypoint: bash /shared/nvrx-demo/scripts/rayjob_countdown_entrypoint.sh
```
Hardcoded namespace and personal paths in RayJob manifest
The manifest hardcodes namespace: antonai and the entrypoint references /shared/nvrx-demo/scripts/rayjob_countdown_entrypoint.sh — a path on your personal shared filesystem. Additionally, the entrypoint script included in this PR (rayjob_entrypoint.sh) is not what the manifest points to.
I'd suggest:
- Change the namespace to a placeholder like `<your-namespace>` or a generic name like `nemo-rl`
- Update the entrypoint to reference the included `rayjob_entrypoint.sh` (e.g., bake it into the container image or mount it)
- Add a comment noting which values users need to customize
The same /shared/nvrx-demo/ prefix appears in 10+ places across the entrypoint script and manifest env vars. Centralizing this as a documented variable (e.g., $SHARED_DIR) would make adaptation clearer.
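A minimal sketch of that pattern, with illustrative variable names and defaults (only `SHARED_DIR` comes from this suggestion; the rest are hypothetical):

```shell
# Hypothetical sketch: every path derives from one documented variable,
# so adapting the test case means changing only SHARED_DIR.
SHARED_DIR="${SHARED_DIR:-/shared/nemo-rl}"   # <-- change to your FSx mount
CKPT_DIR="${CKPT_DIR:-${SHARED_DIR}/checkpoints}"
DATA_PATH="${DATA_PATH:-${SHARED_DIR}/goldilocks/train.jsonl}"
HF_HOME="${HF_HOME:-${SHARED_DIR}/hf_cache}"
echo "checkpoints: ${CKPT_DIR}"
```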
```python
try:
    from peft import PeftModel
except ImportError:
    import subprocess
    subprocess.check_call([sys.executable, "-m", "pip", "install", "peft", "-q"])
    from peft import PeftModel
```
Runtime pip install peft fallback
This try/except runs pip install peft -q at runtime if the package isn't installed. Runtime pip installs are non-reproducible (unpinned version, network dependency) and can fail in air-gapped environments. I'd suggest either:
- Adding `peft` to the Dockerfile's Python dependencies, or
- Documenting `peft` as a prerequisite in the README and removing the runtime fallback
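For the Dockerfile option, a one-line sketch (the pinned version is illustrative; use whatever was actually tested):

```dockerfile
# Build peft into the image so the eval script never installs at runtime
# (version pin is illustrative)
RUN pip install --no-cache-dir "peft==0.13.2"
```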
```
- `pip-freeze.txt` — Python packages (235 packages)
- `dpkg.txt` — System packages (356 packages)

Key versions: PyTorch 2.9.0+cu129, CUDA 12.9, NCCL 2.27.5, NeMo RL 0.5.0rc0, Ray 2.49.2, vLLM (in worker venvs), NVRx 0.4.1
```
NCCL version mismatch between README and SBOM
This line states "NCCL 2.27.5" but the SBOM (sbom/dpkg.txt) shows libnccl2,2.26.5-1+cuda12.9. Could you confirm which version was actually used in testing and update accordingly?
KeitaW
left a comment
Review Batch 3/3 — Documentation Consistency
Things That Look Great
- Multi-stage Dockerfile with excellent layer caching: The strategy of copying only `pyproject.toml` + `uv.lock` first to cache the expensive dependency install stage, then copying the full source later, is a production-quality pattern that saves significant rebuild time.
- Dual EFA install mode: Supporting both `full` and `minimal` EFA installation modes makes the image adaptable across instance types (P5en RDMA vs. socket-only).
- NRL_GIT_REF pinned to commit hash: Using a specific commit (`c40dba37789c`) rather than a branch name ensures build reproducibility for the NeMo RL source.
- Comprehensive SBOM: Including `dpkg.txt`, `pip-freeze.txt`, and `sbom.txt` gives full supply chain visibility — this is above and beyond what most test cases provide.
- Excellent README: The verified instance types table, architecture diagram, instance-specific configuration snippets, and known issues table with upstream issue links make this one of the most thorough READMEs in the repo.
- NVRx fault tolerance documentation: The fault injection test procedure and checkpoint resume behavior are well-documented with concrete steps, making the fault tolerance claims verifiable.
- YAML anchors for DRY configuration: Using `&common-env`, `&common-mounts`, and `&common-volumes` with YAML aliases avoids config drift between head and worker specs.
```yaml
# ============================================================
# RayJob: Qwen2.5-1.5B GRPO + NVRx on P5en (H200)
```
Model name mismatch between README, manifest, and entrypoint
There's a three-way inconsistency in which model is being trained:
- This comment (line 2): says "Qwen2.5-1.5B GRPO"
- `GRPO_MODEL` env var (line 115): sets `Qwen/Qwen2.5-0.5B-Instruct` (0.5B, not 1.5B)
- README title and results: describes training "Nemotron-Mini-4B-Instruct"
- Entrypoint script: references Nemotron and a phase 1 merge workflow
It looks like the manifest was adapted from a Qwen experiment but the README and entrypoint describe the Nemotron workflow. I'd suggest aligning these — a user following the README would encounter unexpected behavior from the manifest as-is.
```bash
if [ ! -f "$MERGED/model.safetensors" ]; then
    echo "Merging phase 1 LoRA checkpoint..."
    pip install peft -q 2>&1 | tail -1
    python3 /shared/nvrx-demo/scripts/merge_lora_checkpoint.py \
```
Entrypoint references non-existent files
This line references /shared/nvrx-demo/scripts/merge_lora_checkpoint.py, which isn't included in the PR. Other missing references:
- `/shared/nvrx-demo/goldilocks/train.jsonl` (dataset, line 80)
- `/shared/nvrx-demo/miracle-checkpoints` (phase 1 checkpoints, line 34)
These are on your FSx volume — other users won't have them. I'd suggest either:
- Including `merge_lora_checkpoint.py` in the PR (or documenting how to create it)
- Adding a "Data Preparation" section to the README explaining how to set up these paths
- Using a Hugging Face dataset directly, or documenting how to generate the JSONL
```python
os.environ.setdefault("HF_HOME", "/shared/nvrx-demo/hf_cache")

MODEL = "nvidia/Nemotron-Mini-4B-Instruct"
DATASET = "/shared/nvrx-demo/goldilocks/train.jsonl"
```
Eval script hardcodes local paths
The DATASET path points to /shared/nvrx-demo/goldilocks/train.jsonl on your shared filesystem. Users following the README won't have this file. I'd suggest making DATASET a command-line argument (like --model already is) with a helpful error message if the file doesn't exist.
```diff
-os.environ.setdefault("HF_HOME", "/shared/nvrx-demo/hf_cache")
-MODEL = "nvidia/Nemotron-Mini-4B-Instruct"
-DATASET = "/shared/nvrx-demo/goldilocks/train.jsonl"
+os.environ.setdefault("HF_HOME", "/shared/hf_cache")
+MODEL = "nvidia/Nemotron-Mini-4B-Instruct"
+DATASET = os.environ.get("EVAL_DATASET", "/shared/goldilocks/train.jsonl")
```
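Going a step further, the dataset could be a proper CLI flag. A hedged sketch (the flag names and `SHARED_DIR` fallback are illustrative, not the PR's actual interface):

```python
import argparse
import os

def parse_args(argv=None):
    # Illustrative: --dataset mirrors the existing --model flag, with a
    # SHARED_DIR-derived default instead of a hardcoded personal path.
    shared = os.environ.get("SHARED_DIR", "/shared")
    parser = argparse.ArgumentParser(description="Before/after accuracy eval")
    parser.add_argument("--model", default="nvidia/Nemotron-Mini-4B-Instruct")
    parser.add_argument(
        "--dataset", default=os.path.join(shared, "goldilocks", "train.jsonl"))
    return parser.parse_args(argv)

def require_dataset(path):
    # Fail early with a helpful message instead of a deep traceback later.
    if not os.path.exists(path):
        raise SystemExit(
            f"Dataset not found: {path} (pass --dataset or set SHARED_DIR)")
    return path

if __name__ == "__main__":
    args = parse_args(["--dataset", "/tmp/train.jsonl"])
    print(args.model, args.dataset)
```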
KeitaW
left a comment
Nice addition, a few comments.
Question: Why include the SBOM files in the repo? The Dockerfile already has what is needed to regenerate them from the built image, so committing them risks a stale inventory.
paragao
left a comment
Hands-On Testing Report — P5en H200 Cluster (Slurm + Pyxis)
I tested this PR on a 16-node P5en.48xlarge cluster (8x NVIDIA H200 143GB per node, EFA, FSx Lustre) running Slurm with Pyxis/Enroot.
What works
- Container image pulls and runs correctly on H200 hardware via `srun --container-image='docker://public.ecr.aws#v9l4g5s4/nemo-rl-workshop:p5en'`
- Verified versions inside container: PyTorch 2.9.0+cu129, CUDA 12.9, Ray 2.49.2, NeMo RL 0.5.0rc0, NCCL 2.27.5 (pip/runtime)
- EFA stack is functional: libfabric 2.4.0amzn1.0, EFA installer 1.47.0, `libnccl-ofi` 1.18.0, GDRCopy present
- 1-node GRPO training (8x H200) completes successfully — I ran 3 steps of the default NeMo RL GRPO example (Qwen2.5-1.5B + OpenMathInstruct-2) and all steps completed without errors. Step 1: 33s (cold), steps 2-3: ~10.3s each.
What doesn't work
- PR entrypoint fails immediately — all 4 data dependencies are missing (`goldilocks/train.jsonl`, `merge_lora_checkpoint.py`, `miracle-checkpoints/`, `nemotron-phase1-merged/`). These live on the author's FSx volume and are not included in the PR or documented for reproduction.
- `/etc/hosts` write in entrypoint fails — `echo "$MY_IP $(hostname)" >> /etc/hosts` returns `Permission denied` in containers that don't run as privileged/writable. The K8s YAML sets `privileged: true`, but this assumption should be documented.
Container EFA/NCCL Audit
I inspected the container's NCCL and EFA configuration in detail and found several issues:
1. Old OFI-NCCL path still exists
Both paths are present in the container:
| Path | Status |
|---|---|
| `/opt/amazon/aws-ofi-nccl/` | OLD — should not exist, leftover from base image |
| `/opt/amazon/ofi-nccl/` | NEW — correct path from EFA installer 1.47.0 |

The old path contains `libnccl-net-aws-ofi.so` and `libnccl-tuner-aws-ofi.so`, which could be loaded accidentally. The Dockerfile should clean up `/opt/amazon/aws-ofi-nccl/` after the EFA install.
2. NCCL_NET_PLUGIN container default is wrong
The container bakes NCCL_NET_PLUGIN=aws-ofi (old naming), but the correct value for EFA installer 1.47.0 is ofi. The PR YAML correctly overrides this to ofi, but anyone running the container without the YAML override gets the wrong plugin.
3. NCCL_TUNER_PLUGIN not set
The YAML does not set NCCL_TUNER_PLUGIN. The EFA tuner plugin exists at /opt/amazon/ofi-nccl/lib/libnccl-tuner-ofi.so and should be explicitly enabled for optimal performance.
4. AWS_OFI_NCCL_VERSION env var is stale
The container has AWS_OFI_NCCL_VERSION=1.14.0 baked in (from the base image), but the actually installed version is 1.18.0 (confirmed via dpkg -s libnccl-ofi).
5. Dual NCCL versions
| Source | Version |
|---|---|
| pip `nvidia-nccl-cu12` (what PyTorch loads at runtime) | 2.27.5 |
| dpkg `libnccl2` (system package) | 2.26.5 |
| `NCCL_VERSION` env var | 2.26.5 |
| `nccl.h` header | 2.26.5 |
PyTorch uses the pip wheel version (2.27.5) at runtime across all three Ray venvs (main, DTensor, vLLM). The system package 2.26.5 is dead weight. The README correctly states 2.27.5 for the runtime version, but this should be clarified since the SBOM and env var show 2.26.5.
6. GDRCopy should be pinned to v2.5.1
The Dockerfile clones GDRCopy HEAD with no tag (git clone --depth 1). This should be pinned to --branch v2.5.1.
```dockerfile
# GDRCopy — GPU Direct RDMA copy library (needed for EFA GDRDMA)
RUN cd /tmp && \
    git clone --depth 1 https://github.com/NVIDIA/gdrcopy.git && \
```
GDRCopy: pin to v2.5.1
This clones HEAD of GDRCopy with no tag or commit pin. Should be pinned to the tested version:
```diff
-    git clone --depth 1 https://github.com/NVIDIA/gdrcopy.git && \
+    git clone --depth 1 --branch v2.5.1 https://github.com/NVIDIA/gdrcopy.git && \
```
```dockerfile
dpkg --purge --force-depends libfabric1-aws libfabric-aws-bin libfabric-aws-dev 2>/dev/null || true
rm -rf /opt/amazon/efa
cd /tmp
curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz | tar xz
```
EFA installer: pin version and clean up old path
Two issues here:
1. `aws-efa-installer-latest.tar.gz` is not reproducible. Pin to a specific version.
2. After the EFA install, the old path `/opt/amazon/aws-ofi-nccl/` from the base image still exists alongside the new `/opt/amazon/ofi-nccl/`. Add a cleanup step: `rm -rf /opt/amazon/aws-ofi-nccl`. This prevents the old `libnccl-net-aws-ofi.so` from being loaded accidentally.
```dockerfile
# EFA / NCCL defaults
ENV FI_PROVIDER=efa
ENV FI_EFA_USE_DEVICE_RDMA=1
ENV NCCL_NET_PLUGIN=ofi
```
Container bakes wrong NCCL_NET_PLUGIN default
The container has NCCL_NET_PLUGIN=aws-ofi baked in (from the base image). With EFA installer 1.47.0, the correct value is ofi. The PR YAML correctly overrides this, but the container default should also be fixed.
Also, AWS_OFI_NCCL_VERSION=1.14.0 is stale — the actual installed version is 1.18.0.
```yaml
  value: "PHB"
- name: NCCL_CUMEM_ENABLE
  value: "0"
- name: NCCL_NVLS_ENABLE
```
Missing NCCL_TUNER_PLUGIN
The EFA tuner plugin is installed at /opt/amazon/ofi-nccl/lib/libnccl-tuner-ofi.so but NCCL_TUNER_PLUGIN is not set in the env vars. Add:
```yaml
- name: NCCL_TUNER_PLUGIN
  value: "ofi"
```

```bash
echo "Starting from step_60 merged model (has format knowledge)"
echo "=============================================="
MY_IP=$(hostname -I | awk '{print $1}')
echo "$MY_IP $(hostname)" >> /etc/hosts
```
Replace /etc/hosts write with Ray IP configuration
Tested on H200 cluster: `/etc/hosts: Permission denied`
This line writes to /etc/hosts to ensure Ray resolves the container's hostname to its real network IP instead of loopback. However, this requires privileged: true on the container and fails in non-privileged environments.
Instead, use Ray's built-in IP configuration:
- Set the `RAY_ADDRESS` environment variable, or
- Pass `--node-ip-address=$MY_IP` to `ray start`
This avoids the need for privileged containers just for hostname resolution, and is more portable across container runtimes (K8s, Enroot/Pyxis, etc.). With hostNetwork: true already set in the YAML, the container shares the host's /etc/hosts anyway, making this write redundant in the K8s case.
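A hedged sketch of the flag-based approach (`--node-ip-address` and `--address` are Ray's standard `ray start` options; the port and variable names are illustrative):

```shell
# Illustrative: give Ray the node IP explicitly; no /etc/hosts edit,
# so no privileged container is required.
MY_IP=$(hostname -I | awk '{print $1}')
RAY_HEAD_CMD="ray start --head --node-ip-address=${MY_IP} --port=6379"
# Workers would join with:
#   ray start --address=<head-ip>:6379 --node-ip-address=${MY_IP}
echo "${RAY_HEAD_CMD}"
```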
```diff
@@ -0,0 +1,89 @@
+#!/bin/bash
+set -e
```
Use set -euo pipefail
Currently only set -e. Missing -u (unset variable detection) and -o pipefail (piped command failure detection). Recommend:
```diff
-set -e
+set -euo pipefail
```
```
- `pip-freeze.txt` — Python packages (235 packages)
- `dpkg.txt` — System packages (356 packages)

Key versions: PyTorch 2.9.0+cu129, CUDA 12.9, NCCL 2.27.5, NeMo RL 0.5.0rc0, Ray 2.49.2, vLLM (in worker venvs), NVRx 0.4.1
```
NCCL version clarification
Adding context to @KeitaW's comment: I verified inside the container that there are two NCCL versions:
- pip `nvidia-nccl-cu12`: 2.27.5 — this is what PyTorch loads at runtime (confirmed via `torch.cuda.nccl.version()` on H200)
- dpkg `libnccl2`: 2.26.5 — system package, not used by PyTorch
The README's "NCCL 2.27.5" is correct for the runtime version. I'd suggest adding a note: "NCCL 2.27.5 (PyTorch wheel; system package is 2.26.5)" to avoid confusion with the SBOM.
```dockerfile
# MAX_JOBS            Parallel compilation limit (prevent OOM)
# EFA_INSTALL_MODE    "full" or "minimal" (default: full for P5en RDMA)

# syntax=docker/dockerfile:1.10
```
syntax directive must be line 1
# syntax=docker/dockerfile:1.10 is on line 43 (after the header comments). For BuildKit to recognize it, it must be the very first line of the Dockerfile. Move it above all comments.
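A sketch of the corrected file head (the base image reference is an illustrative placeholder):

```dockerfile
# syntax=docker/dockerfile:1.10
# ============================================================
# Header comments follow the directive; BuildKit only honors the
# syntax line when it is the very first line of the file.
# ============================================================
ARG BASE_IMAGE=<your-base-image>
FROM ${BASE_IMAGE}
```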
- Move test case from pytorch/nemo-rl to megatron/nemo-rl (NeMo RL is Megatron-based)
- Remove sbom/ directory from repo (too large); add extraction instructions in README
- Pin GDRCopy to v2.5.1 tag (was unpinned HEAD)
- Pin EFA installer to v1.47.0 (was -latest)
- Move `# syntax=docker/dockerfile:1.10` to line 1 of Dockerfile
- Add NCCL_TUNER_PLUGIN=ofi in Dockerfile, YAML, and entrypoint
- Fix model name: Qwen/Qwen2.5-0.5B-Instruct -> nvidia/Nemotron-Mini-4B-Instruct
- Fix namespace: antonai -> default with `<-- change` comment
- Fix entrypoint: set -euo pipefail, remove /etc/hosts write (Ray handles IP)
- Remove missing merge_lora_checkpoint.py reference; entrypoint now falls back to the Hugging Face model if no pre-merged model is found on shared storage
- Make all paths configurable via env vars (SHARED_DIR, CKPT_DIR, DATA_PATH, etc.)
- Fix eval script: add --dataset CLI arg, use SHARED_DIR env var for defaults
- Clarify NCCL version in README: 2.27.5 (pip/PyTorch) vs 2.26.5 (system dpkg)
- Replace hardcoded image refs with `<your-registry>` placeholders
Thanks @KeitaW and @paragao for the thorough reviews! All feedback has been addressed in the latest push.

Structural:
Dockerfile:
RayJob YAML:
Entrypoint:
Eval script:
README:
@paragao — thanks for testing on the P5en Slurm cluster! This PR targets EKS (RayJob), but the container image works on Slurm+Pyxis as you confirmed. Happy to add a Slurm section in a follow-up PR if that would be useful.
KeitaW
left a comment
Re-review 1/2 — Structure & Repository Hygiene
Great progress since the first review! Most findings addressed. A couple of structural items remain.
Missing license headers on all files
None of the 5 files include the repo-required copyright header:
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
For the Dockerfile, place it after the # syntax= directive. For Python, after the shebang. For YAML, at the top. For the README, use an HTML comment.
```diff
@@ -0,0 +1,213 @@
+# ============================================================
```
RayJob manifest not in a kubernetes/ subdirectory
The repo convention places Kubernetes manifests in a kubernetes/ subdirectory within each test case. I'd suggest moving this to 3.test_cases/megatron/nemo-rl/grpo/kubernetes/rayjob-grpo.yaml to follow the pattern in other test cases like FSDP/kubernetes/.
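For reference, one possible layout after the move (only the file names come from this PR; the `scripts/` grouping is illustrative):

```
3.test_cases/megatron/nemo-rl/grpo/
├── Dockerfile
├── README.md
├── kubernetes/
│   └── rayjob-grpo.yaml
└── scripts/
    ├── rayjob_entrypoint.sh
    └── eval_nemotron_goldilocks.py
```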
KeitaW
left a comment
Re-review 2/2 — Training Code Quality
Dataset preparation not documented
The entrypoint expects a Goldilocks math dataset at $DATA_PATH (default: ${SHARED_DIR}/goldilocks/train.jsonl), but the README doesn't explain how to obtain or generate this file. A brief "Data Preparation" section would help — even just noting the expected JSONL format ({"problem": "...", "answer": "..."}) and where to source the data.
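To make the expected format concrete, a minimal sketch that writes a valid `train.jsonl` (the sample problems are made up; only the key names come from the format noted above):

```python
import json
import os
import tempfile

# Two made-up rows in the expected {"problem": ..., "answer": ...} shape
rows = [
    {"problem": "What is 12 * 7?", "answer": "84"},
    {"problem": "Compute 15 + 27.", "answer": "42"},
]

path = os.path.join(tempfile.gettempdir(), "train.jsonl")
with open(path, "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Round-trip to confirm each line is standalone JSON
with open(path) as f:
    parsed = [json.loads(line) for line in f]
print(len(parsed), parsed[0]["answer"])
```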
Things That Look Great
- Excellent response to first review: Moved to `megatron/`, pinned EFA (1.47.0) and GDRCopy (v2.5.1), fixed model name alignment, parameterized the namespace with inline `# <-- change` markers, and explained the dual NCCL version situation clearly.
- `SHARED_DIR` env var pattern: The entrypoint and eval script now derive all paths from a single `SHARED_DIR` variable with sensible defaults — easy to adapt to different FSx mounts.
- Configurable training parameters: `GRPO_MODEL`, `GRPO_MAX_STEPS`, `GRPO_NUM_NODES`, `CKPT_DIR`, and `DATA_PATH` are all overridable via environment variables.
- Clear setup instructions in manifest comments: Numbered setup guide and `# <-- change` markers on every user-configurable value.
- SBOM moved to build-time: No stale inventory to maintain.
- NCCL version explanation: The README note about pip wheel 2.27.5 vs system 2.26.5 preempts common confusion.
- Proper `set -euo pipefail` in the entrypoint script.
- `NCCL_TUNER_PLUGIN=ofi`: Good addition for EFA-aware NCCL algorithm selection.
```python
try:
    from peft import PeftModel
except ImportError:
    import subprocess
    subprocess.check_call([sys.executable, "-m", "pip", "install", "peft", "-q"])
    from peft import PeftModel
```
Runtime pip install peft fallback still present
This try/except runs pip install peft -q at runtime if the package isn't installed. Runtime pip installs are non-reproducible (unpinned version, network dependency) and fail in air-gapped environments. I'd suggest either:
- Documenting `peft` as a prerequisite in the README's evaluation section, or
- Adding `peft` to the Dockerfile
Summary

Architecture
- `backoffLimit` for fault tolerance

Artifacts
- `Dockerfile`
- `rayjob-grpo.yaml`
- `rayjob_entrypoint.sh`
- `eval_nemotron_goldilocks.py`
- `sbom/`

Related upstream work

Test plan
- `backoffLimit: 2`