
Add NeMo RL GRPO training on P5en with EFA RDMA #1025

Open
dmvevents wants to merge 3 commits into awslabs:main from dmvevents:feature/nemo-rl-grpo-p5en

Conversation

@dmvevents

Summary

  • Adds a new test case for multi-node GRPO (Group Relative Policy Optimization) training using NVIDIA NeMo RL on Amazon EKS with P5en.48xlarge instances (H200 GPUs + EFA networking)
  • Trains Nemotron-Mini-4B-Instruct on synthetic math problems using attention-only LoRA, achieving +7 percentage point accuracy improvement (28% to 35%) in 50 training steps
  • Includes production-ready Dockerfile, Kubernetes RayJob manifest, training entrypoint, evaluation script, and full SBOM

Architecture

  • Hardware: 2x P5en.48xlarge (16x H200 141GB, 32x EFA 200Gbps)
  • Orchestration: Ray 2.49.2 RayJob with backoffLimit for fault tolerance
  • Training: NeMo RL GRPO with DTensor policy optimization + vLLM generation rollouts
  • Networking: NCCL 2.27.5 + aws-ofi-nccl 1.18.0 + Libfabric 2.4 over EFA RDMA
  • Storage: FSx for Lustre (model weights, checkpoints, dataset)

Artifacts

| File | Purpose |
| --- | --- |
| `Dockerfile` | Multi-stage build: NeMo RL + EFA stack (libfabric 2.4, aws-ofi-nccl 1.18, GDRCopy) |
| `rayjob-grpo.yaml` | K8s RayJob manifest for 2-node P5en training with fault recovery |
| `rayjob_entrypoint.sh` | Training entrypoint with NCCL H200/NVSwitch workarounds |
| `eval_nemotron_goldilocks.py` | Before/after accuracy evaluation on 200 held-out problems |
| `sbom/` | Software Bill of Materials (pip packages, dpkg, platform summary) |

Related upstream work

Test plan

  • Verified training completes 50 GRPO steps on 2x P5en.48xlarge (CGK ap-southeast-3)
  • Verified evaluation script reports +7pp accuracy improvement (28% to 35%)
  • Verified fault injection and checkpoint recovery via backoffLimit: 2
  • Verified Dockerfile builds and produces working image
  • Verified SBOM matches running container inventory
  • Reviewer: confirm README follows repo conventions
  • Reviewer: confirm namespace/PVC placeholders are clear for adaptation

Multi-node GRPO reinforcement learning training using NVIDIA NeMo RL
on Amazon EKS with 2x P5en.48xlarge (16x H200, 32x EFA). Trains
Nemotron-Mini-4B-Instruct on synthetic math with attention-only LoRA,
achieving +7pp accuracy improvement (28% -> 35%) in 50 steps.

Includes:
- Dockerfile with EFA networking stack (libfabric 2.4, aws-ofi-nccl 1.18)
- RayJob K8s manifest for 2-node training with fault tolerance
- Training entrypoint with NCCL H200/NVSwitch workarounds
- Evaluation script for before/after accuracy comparison
- Full SBOM (Software Bill of Materials)

Stack: NeMo RL + Ray 2.49 + vLLM + DTensor + NCCL 2.27.5 + EFA
Collaborator

@KeitaW KeitaW left a comment


Review Batch 1/3 — Structure & Repository Hygiene

Thanks for this contribution, Anton! The Dockerfile and README are excellent quality. I have several structural and documentation findings below.

Test case should live under 3.test_cases/megatron/, not pytorch/

NeMo RL is built on top of NVIDIA's Megatron-based stack — the training loop uses Megatron-Core for model parallelism and DTensor for distributed policy optimization. The pytorch/ top-level directory is reserved for frameworks that use PyTorch directly (FSDP, DDP, TorchTitan, etc.), while Megatron-derived frameworks belong under 3.test_cases/megatron/. Other Megatron-ecosystem test cases already live there.

I'd suggest moving the entire test case to 3.test_cases/megatron/nemo-rl/grpo/.

Missing license header on all files

The repo convention requires a copyright header on all contributed files:

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

I'd suggest adding this to the top of each file (after shebang lines where applicable).
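If it helps, a small helper can apply the header across files while keeping an existing shebang on line 1. This is an illustrative sketch only; the `add_header` function name is mine, not a repo tool:

```shell
# Prepend the repo license header to a file. If the file starts with a
# shebang (#!...), the header goes on lines 2-3 instead of line 1.
add_header() {
  local f="$1"
  local header='# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0'
  if head -n1 "$f" | grep -q '^#!'; then
    # Keep the shebang first, then the header, then the rest of the file.
    { head -n1 "$f"; printf '%s\n' "$header"; tail -n +2 "$f"; } > "$f.tmp"
  else
    { printf '%s\n' "$header"; cat "$f"; } > "$f.tmp"
  fi
  mv "$f.tmp" "$f"
}
```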

Kubernetes manifest not in a platform-specific subdirectory

The repo convention places Kubernetes manifests in a kubernetes/ subdirectory within each test case. Currently rayjob-grpo.yaml is at the root level. I'd suggest moving it to kubernetes/rayjob-grpo.yaml to follow the pattern in other test cases like FSDP/kubernetes/.


# GDRCopy — GPU Direct RDMA copy library (needed for EFA GDRDMA)
RUN cd /tmp && \
git clone --depth 1 https://github.com/NVIDIA/gdrcopy.git && \
Collaborator


Unpinned dependency: GDRCopy

This clones HEAD of GDRCopy with no tag or commit pin. Upstream changes could silently break the build. I'd suggest pinning to the version that was tested:

Suggested change
git clone --depth 1 https://github.com/NVIDIA/gdrcopy.git && \
git clone --depth 1 --branch v2.4.4 https://github.com/NVIDIA/gdrcopy.git && \

Reference: repo convention on fixed version tags

dpkg --purge --force-depends libfabric1-aws libfabric-aws-bin libfabric-aws-dev 2>/dev/null || true
rm -rf /opt/amazon/efa
cd /tmp
curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz | tar xz
Collaborator


Unpinned dependency: EFA installer

aws-efa-installer-latest.tar.gz always pulls the latest version. The repo convention requires pinned versions for reproducible builds. I'd suggest pinning to the version that was tested, e.g.:

Suggested change
curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz | tar xz
curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-1.38.0.tar.gz | tar xz

(Adjust version number to match what you tested with.) The same issue appears on line 272 for the minimal mode path.

Collaborator

@KeitaW KeitaW left a comment


Review Batch 2/3 — Deployment Pipeline + Training Code + Infrastructure

Comment on lines +35 to +39
namespace: antonai
labels:
app: grpo-nvrx-rayjob
spec:
entrypoint: bash /shared/nvrx-demo/scripts/rayjob_countdown_entrypoint.sh
Collaborator


Hardcoded namespace and personal paths in RayJob manifest

The manifest hardcodes namespace: antonai and the entrypoint references /shared/nvrx-demo/scripts/rayjob_countdown_entrypoint.sh — a path on your personal shared filesystem. Additionally, the entrypoint script included in this PR (rayjob_entrypoint.sh) is not what the manifest points to.

I'd suggest:

  1. Change the namespace to a placeholder like <your-namespace> or a generic name like nemo-rl
  2. Update the entrypoint to reference the included rayjob_entrypoint.sh (e.g., bake it into the container image or mount it)
  3. Add a comment noting which values users need to customize

The same /shared/nvrx-demo/ prefix appears in 10+ places across the entrypoint script and manifest env vars. Centralizing this as a documented variable (e.g., $SHARED_DIR) would make adaptation clearer.
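For illustration, the top of the entrypoint could centralize paths like this (the default value and the derived variable names below are hypothetical, not from the PR):

```shell
# All shared-filesystem paths derive from one overridable root, so
# adapting the job means changing a single variable instead of editing
# 10+ hardcoded /shared/nvrx-demo/ prefixes.
SHARED_DIR="${SHARED_DIR:-/shared/nemo-rl}"   # illustrative default

DATA_PATH="${DATA_PATH:-${SHARED_DIR}/goldilocks/train.jsonl}"
CKPT_DIR="${CKPT_DIR:-${SHARED_DIR}/checkpoints}"
HF_HOME="${HF_HOME:-${SHARED_DIR}/hf_cache}"
export SHARED_DIR DATA_PATH CKPT_DIR HF_HOME
```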

Comment on lines +224 to +229
try:
from peft import PeftModel
except ImportError:
import subprocess
subprocess.check_call([sys.executable, "-m", "pip", "install", "peft", "-q"])
from peft import PeftModel
Collaborator


Runtime pip install peft fallback

This try/except runs pip install peft -q at runtime if the package isn't installed. Runtime pip installs are non-reproducible (unpinned version, network dependency) and can fail in air-gapped environments. I'd suggest either:

  • Adding peft to the Dockerfile's Python dependencies, or
  • Documenting peft as a prerequisite in the README and removing the runtime fallback

- `pip-freeze.txt` — Python packages (235 packages)
- `dpkg.txt` — System packages (356 packages)

Key versions: PyTorch 2.9.0+cu129, CUDA 12.9, NCCL 2.27.5, NeMo RL 0.5.0rc0, Ray 2.49.2, vLLM (in worker venvs), NVRx 0.4.1
Collaborator


NCCL version mismatch between README and SBOM

This line states "NCCL 2.27.5" but the SBOM (sbom/dpkg.txt) shows libnccl2,2.26.5-1+cuda12.9. Could you confirm which version was actually used in testing and update accordingly?

Collaborator

@KeitaW KeitaW left a comment


Review Batch 3/3 — Documentation Consistency


Things That Look Great

  • Multi-stage Dockerfile with excellent layer caching: The strategy of copying only pyproject.toml + uv.lock first to cache the expensive dependency install stage, then copying full source later, is a production-quality pattern that saves significant rebuild time.
  • Dual EFA install mode: Supporting both full and minimal EFA installation modes makes the image adaptable across instance types (P5en RDMA vs. socket-only).
  • NRL_GIT_REF pinned to commit hash: Using a specific commit (c40dba37789c) rather than a branch name ensures build reproducibility for the NeMo RL source.
  • Comprehensive SBOM: Including dpkg.txt, pip-freeze.txt, and sbom.txt gives full supply chain visibility — this is above and beyond what most test cases provide.
  • Excellent README: The verified instance types table, architecture diagram, instance-specific configuration snippets, and known issues table with upstream issue links make this one of the most thorough READMEs in the repo.
  • NVRx fault tolerance documentation: The fault injection test procedure and checkpoint resume behavior are well-documented with concrete steps, making the fault tolerance claims verifiable.
  • YAML anchors for DRY configuration: Using &common-env, &common-mounts, and &common-volumes with YAML aliases avoids config drift between head and worker specs.

Comment on lines +1 to +2
# ============================================================
# RayJob: Qwen2.5-1.5B GRPO + NVRx on P5en (H200)
Collaborator


Model name mismatch between README, manifest, and entrypoint

There's a three-way inconsistency in which model is being trained:

  1. This comment (line 2): says "Qwen2.5-1.5B GRPO"
  2. GRPO_MODEL env var (line 115): sets Qwen/Qwen2.5-0.5B-Instruct (0.5B, not 1.5B)
  3. README title and results: describes training "Nemotron-Mini-4B-Instruct"
  4. Entrypoint script: references Nemotron and a phase 1 merge workflow

It looks like the manifest was adapted from a Qwen experiment but the README and entrypoint describe the Nemotron workflow. I'd suggest aligning these — a user following the README would encounter unexpected behavior from the manifest as-is.

if [ ! -f "$MERGED/model.safetensors" ]; then
echo "Merging phase 1 LoRA checkpoint..."
pip install peft -q 2>&1 | tail -1
python3 /shared/nvrx-demo/scripts/merge_lora_checkpoint.py \
Collaborator


Entrypoint references non-existent files

This line references /shared/nvrx-demo/scripts/merge_lora_checkpoint.py, which isn't included in the PR. Other missing references:

  • /shared/nvrx-demo/goldilocks/train.jsonl (dataset, line 80)
  • /shared/nvrx-demo/miracle-checkpoints (phase 1 checkpoints, line 34)

These are on your FSx volume — other users won't have them. I'd suggest either:

  1. Including merge_lora_checkpoint.py in the PR (or documenting how to create it)
  2. Adding a "Data Preparation" section to the README explaining how to set up these paths
  3. Using a Hugging Face dataset directly, or documenting how to generate the JSONL

Comment on lines +20 to +23
os.environ.setdefault("HF_HOME", "/shared/nvrx-demo/hf_cache")

MODEL = "nvidia/Nemotron-Mini-4B-Instruct"
DATASET = "/shared/nvrx-demo/goldilocks/train.jsonl"
Collaborator


Eval script hardcodes local paths

The DATASET path points to /shared/nvrx-demo/goldilocks/train.jsonl on your shared filesystem. Users following the README won't have this file. I'd suggest making DATASET a command-line argument (like --model already is) with a helpful error message if the file doesn't exist.

Suggested change
os.environ.setdefault("HF_HOME", "/shared/nvrx-demo/hf_cache")
MODEL = "nvidia/Nemotron-Mini-4B-Instruct"
DATASET = "/shared/nvrx-demo/goldilocks/train.jsonl"
os.environ.setdefault("HF_HOME", "/shared/hf_cache")
MODEL = "nvidia/Nemotron-Mini-4B-Instruct"
DATASET = os.environ.get("EVAL_DATASET", "/shared/goldilocks/train.jsonl")

Collaborator

@KeitaW KeitaW left a comment


Nice addition, a few comments.

@KeitaW
Collaborator

KeitaW commented Mar 18, 2026

Question: Why include the SBOM files in the repo?

I noticed the sbom/ directory adds ~1,200 lines of package inventory (dpkg.txt, pip-freeze.txt, sbom.txt). These are a snapshot of one specific container build and will drift from the actual image contents as the Dockerfile evolves — the SBOM would need to be regenerated and recommitted on every dependency change to stay accurate.

The Dockerfile already has generate_sbom.sh baked in, so the SBOM is generated at build time inside the image itself. Could you help me understand the motivation for also checking it into the repo? If the goal is supply chain visibility, the in-image SBOM (already produced by the build) may be sufficient without committing a static copy that can go stale.

Contributor

@paragao paragao left a comment


Hands-On Testing Report — P5en H200 Cluster (Slurm + Pyxis)

I tested this PR on a 16-node P5en.48xlarge cluster (8x NVIDIA H200 143GB per node, EFA, FSx Lustre) running Slurm with Pyxis/Enroot.

What works

  • Container image pulls and runs correctly on H200 hardware via srun --container-image='docker://public.ecr.aws#v9l4g5s4/nemo-rl-workshop:p5en'
  • Verified versions inside container: PyTorch 2.9.0+cu129, CUDA 12.9, Ray 2.49.2, NeMo RL 0.5.0rc0, NCCL 2.27.5 (pip/runtime)
  • EFA stack is functional: libfabric 2.4.0amzn1.0, EFA installer 1.47.0, libnccl-ofi 1.18.0, GDRCopy present
  • 1-node GRPO training (8x H200) completes successfully — I ran 3 steps of the default NeMo RL GRPO example (Qwen2.5-1.5B + OpenMathInstruct-2) and all steps completed without errors. Step 1: 33s (cold), Steps 2-3: ~10.3s each.

What doesn't work

  • PR entrypoint fails immediately — all 4 data dependencies are missing (goldilocks/train.jsonl, merge_lora_checkpoint.py, miracle-checkpoints/, nemotron-phase1-merged/). These live on the author's FSx volume and are not included in the PR or documented for reproduction.
  • /etc/hosts write in entrypoint fails: echo "$MY_IP $(hostname)" >> /etc/hosts returns Permission denied in containers that don't run as privileged/writable. The K8s YAML sets privileged: true but this assumption should be documented.

Container EFA/NCCL Audit

I inspected the container's NCCL and EFA configuration in detail and found several issues:

1. Old OFI-NCCL path still exists

Both paths are present in the container:

| Path | Status |
| --- | --- |
| `/opt/amazon/aws-ofi-nccl/` | OLD — should not exist, leftover from base image |
| `/opt/amazon/ofi-nccl/` | NEW — correct path from EFA installer 1.47.0 |

The old path contains libnccl-net-aws-ofi.so and libnccl-tuner-aws-ofi.so which could be loaded accidentally. The Dockerfile should clean up /opt/amazon/aws-ofi-nccl/ after the EFA install.

2. NCCL_NET_PLUGIN container default is wrong

The container bakes NCCL_NET_PLUGIN=aws-ofi (old naming), but the correct value for EFA installer 1.47.0 is ofi. The PR YAML correctly overrides this to ofi, but anyone running the container without the YAML override gets the wrong plugin.

3. NCCL_TUNER_PLUGIN not set

The YAML does not set NCCL_TUNER_PLUGIN. The EFA tuner plugin exists at /opt/amazon/ofi-nccl/lib/libnccl-tuner-ofi.so and should be explicitly enabled for optimal performance.

4. AWS_OFI_NCCL_VERSION env var is stale

The container has AWS_OFI_NCCL_VERSION=1.14.0 baked in (from the base image), but the actually installed version is 1.18.0 (confirmed via dpkg -s libnccl-ofi).

5. Dual NCCL versions

| Source | Version |
| --- | --- |
| pip `nvidia-nccl-cu12` (what PyTorch loads at runtime) | 2.27.5 |
| dpkg `libnccl2` (system package) | 2.26.5 |
| `NCCL_VERSION` env var | 2.26.5 |
| `nccl.h` header | 2.26.5 |

PyTorch uses the pip wheel version (2.27.5) at runtime across all three Ray venvs (main, DTensor, vLLM). The system package 2.26.5 is dead weight. The README correctly states 2.27.5 for the runtime version, but this should be clarified since the SBOM and env var show 2.26.5.

6. GDRCopy should be pinned to v2.5.1

The Dockerfile clones GDRCopy HEAD with no tag (git clone --depth 1). This should be pinned to --branch v2.5.1.


# GDRCopy — GPU Direct RDMA copy library (needed for EFA GDRDMA)
RUN cd /tmp && \
git clone --depth 1 https://github.com/NVIDIA/gdrcopy.git && \
Contributor


GDRCopy: pin to v2.5.1

This clones HEAD of GDRCopy with no tag or commit pin. Should be pinned to the tested version:

Suggested change
git clone --depth 1 https://github.com/NVIDIA/gdrcopy.git && \
git clone --depth 1 --branch v2.5.1 https://github.com/NVIDIA/gdrcopy.git && \

dpkg --purge --force-depends libfabric1-aws libfabric-aws-bin libfabric-aws-dev 2>/dev/null || true
rm -rf /opt/amazon/efa
cd /tmp
curl -sL https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz | tar xz
Contributor


EFA installer: pin version and clean up old path

Two issues here:

  1. aws-efa-installer-latest.tar.gz is not reproducible. Pin to a specific version.
  2. After the EFA install, the old path /opt/amazon/aws-ofi-nccl/ from the base image still exists alongside the new /opt/amazon/ofi-nccl/. Add a cleanup step:
rm -rf /opt/amazon/aws-ofi-nccl

This prevents the old libnccl-net-aws-ofi.so from being loaded accidentally.

# EFA / NCCL defaults
ENV FI_PROVIDER=efa
ENV FI_EFA_USE_DEVICE_RDMA=1
ENV NCCL_NET_PLUGIN=ofi
Contributor


Container bakes wrong NCCL_NET_PLUGIN default

The container has NCCL_NET_PLUGIN=aws-ofi baked in (from the base image). With EFA installer 1.47.0, the correct value is ofi. The PR YAML correctly overrides this, but the container default should also be fixed.

Also, AWS_OFI_NCCL_VERSION=1.14.0 is stale — the actual installed version is 1.18.0.

value: "PHB"
- name: NCCL_CUMEM_ENABLE
value: "0"
- name: NCCL_NVLS_ENABLE
Contributor


Missing NCCL_TUNER_PLUGIN

The EFA tuner plugin is installed at /opt/amazon/ofi-nccl/lib/libnccl-tuner-ofi.so but NCCL_TUNER_PLUGIN is not set in the env vars. Add:

- name: NCCL_TUNER_PLUGIN
  value: "ofi"

echo "Starting from step_60 merged model (has format knowledge)"
echo "=============================================="
MY_IP=$(hostname -I | awk '{print $1}')
echo "$MY_IP $(hostname)" >> /etc/hosts
Contributor


Replace /etc/hosts write with Ray IP configuration

Tested on H200 cluster:

/etc/hosts: Permission denied

This line writes to /etc/hosts to ensure Ray resolves the container's hostname to its real network IP instead of loopback. However, this requires privileged: true on the container and fails in non-privileged environments.

Instead, use Ray's built-in IP configuration:

  • Set RAY_ADDRESS environment variable, or
  • Pass --node-ip-address=$MY_IP to ray start

This avoids the need for privileged containers just for hostname resolution, and is more portable across container runtimes (K8s, Enroot/Pyxis, etc.). With hostNetwork: true already set in the YAML, the container shares the host's /etc/hosts anyway, making this write redundant in the K8s case.
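A sketch of that alternative, using Ray's standard ray start options (RAY_HEAD_IP below is a placeholder for however the head address is injected in your setup):

```shell
# Resolve this node's primary IP without touching /etc/hosts.
MY_IP="$(hostname -I 2>/dev/null | awk '{print $1}')"
MY_IP="${MY_IP:-127.0.0.1}"   # fallback for minimal environments

# The head node would run:
head_cmd="ray start --head --node-ip-address=${MY_IP} --port=6379"
# Each worker would run (RAY_HEAD_IP is a placeholder):
worker_cmd="ray start --address=${RAY_HEAD_IP:-<head-ip>}:6379 --node-ip-address=${MY_IP}"

echo "$head_cmd"
echo "$worker_cmd"
```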

@@ -0,0 +1,89 @@
#!/bin/bash
set -e
Contributor


Use set -euo pipefail

Currently only set -e. Missing -u (unset variable detection) and -o pipefail (piped command failure detection). Recommend:

Suggested change
set -e
set -euo pipefail
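To see what each added flag catches, here is a quick illustrative demonstration (not code from the PR):

```shell
# Without pipefail, a failure early in a pipeline is masked by the
# last command's exit status:
bash -c 'false | cat'; echo "no pipefail -> exit $?"

# With pipefail, the pipeline reports the failure:
bash -c 'set -o pipefail; false | cat' || echo "pipefail -> failure caught"

# Without -u, an unset variable silently expands to empty:
bash -c 'echo "value=[$NO_SUCH_VAR_XYZ]"'

# With -u, the same expansion is a hard error:
bash -c 'set -u; echo "$NO_SUCH_VAR_XYZ"' 2>/dev/null || echo "-u -> unset var caught"
```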

- `pip-freeze.txt` — Python packages (235 packages)
- `dpkg.txt` — System packages (356 packages)

Key versions: PyTorch 2.9.0+cu129, CUDA 12.9, NCCL 2.27.5, NeMo RL 0.5.0rc0, Ray 2.49.2, vLLM (in worker venvs), NVRx 0.4.1
Contributor


NCCL version clarification

Adding context to @KeitaW's comment: I verified inside the container that there are two NCCL versions:

  • pip nvidia-nccl-cu12: 2.27.5 — this is what PyTorch loads at runtime (confirmed via torch.cuda.nccl.version() on H200)
  • dpkg libnccl2: 2.26.5 — system package, not used by PyTorch

The README's "NCCL 2.27.5" is correct for the runtime version. I'd suggest adding a note: "NCCL 2.27.5 (PyTorch wheel; system package is 2.26.5)" to avoid confusion with the SBOM.
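For anyone wanting to reproduce this audit inside the container, a diagnostic sketch with graceful fallbacks (nothing here is from the PR itself):

```shell
# Report the two NCCL versions that can coexist in the image.
PIP_NCCL="$(pip show nvidia-nccl-cu12 2>/dev/null | awk '/^Version:/{print $2}')"
DPKG_NCCL="$(dpkg -s libnccl2 2>/dev/null | awk '/^Version:/{print $2}')"

echo "pip  nvidia-nccl-cu12 : ${PIP_NCCL:-not installed}"   # what PyTorch loads
echo "dpkg libnccl2         : ${DPKG_NCCL:-not installed}"  # system package

# Ground truth from PyTorch itself, when torch is importable:
python3 -c 'import torch; print("torch runtime:", torch.cuda.nccl.version())' \
  2>/dev/null || echo "torch runtime         : torch not importable here"
```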

# MAX_JOBS Parallel compilation limit (prevent OOM)
# EFA_INSTALL_MODE "full" or "minimal" (default: full for P5en RDMA)

# syntax=docker/dockerfile:1.10
Contributor


syntax directive must be line 1

# syntax=docker/dockerfile:1.10 is on line 43 (after the header comments). For BuildKit to recognize it, it must be the very first line of the Dockerfile. Move it above all comments.

- Move test case from pytorch/nemo-rl to megatron/nemo-rl (NeMo RL is Megatron-based)
- Remove sbom/ directory from repo (too large); add extraction instructions in README
- Pin GDRCopy to v2.5.1 tag (was unpinned HEAD)
- Pin EFA installer to v1.47.0 (was -latest)
- Move `# syntax=docker/dockerfile:1.10` to line 1 of Dockerfile
- Add NCCL_TUNER_PLUGIN=ofi in Dockerfile, YAML, and entrypoint
- Fix model name: Qwen/Qwen2.5-0.5B-Instruct -> nvidia/Nemotron-Mini-4B-Instruct
- Fix namespace: antonai -> default with <-- change comment
- Fix entrypoint: set -euo pipefail, remove /etc/hosts write (Ray handles IP)
- Remove missing merge_lora_checkpoint.py reference; entrypoint now falls back
  to HuggingFace model if no pre-merged model found on shared storage
- Make all paths configurable via env vars (SHARED_DIR, CKPT_DIR, DATA_PATH, etc.)
- Fix eval script: add --dataset CLI arg, use SHARED_DIR env var for defaults
- Clarify NCCL version in README: 2.27.5 (pip/PyTorch) vs 2.26.5 (system dpkg)
- Replace hardcoded image refs with <your-registry> placeholders
@dmvevents
Author

Thanks @KeitaW and @paragao for the thorough reviews! All feedback has been addressed in the latest push:

Structural:

  • ✅ Moved to 3.test_cases/megatron/nemo-rl/grpo/ (NeMo RL uses Megatron Core)
  • ✅ Removed sbom/ directory — added extraction instructions in README instead

Dockerfile:

  • ✅ # syntax=docker/dockerfile:1.10 moved to line 1
  • ✅ GDRCopy pinned to v2.5.1
  • ✅ EFA installer pinned to 1.47.0
  • ✅ Added NCCL_TUNER_PLUGIN=ofi

RayJob YAML:

  • ✅ Namespace changed to default with comment to customize
  • ✅ Model name consistently nvidia/Nemotron-Mini-4B-Instruct
  • ✅ Removed hardcoded personal paths
  • ✅ Added NCCL_TUNER_PLUGIN env var

Entrypoint:

  • ✅ set -euo pipefail with ${VAR:-default} pattern throughout
  • ✅ Removed /etc/hosts write (Ray GCS handles IP discovery)
  • ✅ Removed dependency on merge_lora_checkpoint.py — falls back to HuggingFace model

Eval script:

  • ✅ Paths configurable via --dataset, --checkpoint-dir CLI args and SHARED_DIR env var

README:

  • ✅ NCCL version clarified: pip 2.27.5 (PyTorch runtime) vs system 2.26.5 (dpkg base image)
  • ✅ Model name consistent
  • ✅ SBOM extraction instructions added

@paragao — thanks for testing on the P5en Slurm cluster! This PR targets EKS (RayJob), but the container image works on Slurm+Pyxis as you confirmed. Happy to add a Slurm section in a follow-up PR if that would be useful.

Collaborator

@KeitaW KeitaW left a comment


Re-review 1/2 — Structure & Repository Hygiene

Great progress since the first review! Most findings addressed. A couple of structural items remain.

Missing license headers on all files

None of the 5 files include the repo-required copyright header:

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

For the Dockerfile, place it after the # syntax= directive. For Python, after the shebang. For YAML, at the top. For the README, use an HTML comment.

@@ -0,0 +1,213 @@
# ============================================================
Collaborator


RayJob manifest not in a kubernetes/ subdirectory

The repo convention places Kubernetes manifests in a kubernetes/ subdirectory within each test case. I'd suggest moving this to 3.test_cases/megatron/nemo-rl/grpo/kubernetes/rayjob-grpo.yaml to follow the pattern in other test cases like FSDP/kubernetes/.

Collaborator

@KeitaW KeitaW left a comment


Re-review 2/2 — Training Code Quality

Dataset preparation not documented

The entrypoint expects a Goldilocks math dataset at $DATA_PATH (default: ${SHARED_DIR}/goldilocks/train.jsonl), but the README doesn't explain how to obtain or generate this file. A brief "Data Preparation" section would help — even just noting the expected JSONL format ({"problem": "...", "answer": "..."}) and where to source the data.
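To make that concrete, a sample in the expected JSON Lines shape could be created and validated like this (the two records are made up, and the default path is illustrative):

```shell
# Create a tiny sample dataset in the format the entrypoint expects
# ({"problem": ..., "answer": ...}); the records are invented.
SHARED_DIR="${SHARED_DIR:-/tmp/shared}"   # illustrative default
mkdir -p "${SHARED_DIR}/goldilocks"
DATA_PATH="${SHARED_DIR}/goldilocks/train.jsonl"
cat > "$DATA_PATH" <<'EOF'
{"problem": "What is 17 + 25?", "answer": "42"}
{"problem": "Compute 6 * 7.", "answer": "42"}
EOF

# Validate: each line must parse as standalone JSON with both keys.
python3 - "$DATA_PATH" <<'PY'
import json, sys
rows = [json.loads(line) for line in open(sys.argv[1])]
assert all({"problem", "answer"} <= set(r) for r in rows), "missing keys"
print(f"{len(rows)} records OK")
PY
```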


Things That Look Great

  • Excellent response to first review: Moved to megatron/, pinned EFA (1.47.0) and GDRCopy (v2.5.1), fixed model name alignment, parameterized namespace with inline # <-- change markers, and explained the dual NCCL version situation clearly.
  • SHARED_DIR env var pattern: The entrypoint and eval script now derive all paths from a single SHARED_DIR variable with sensible defaults — easy to adapt to different FSx mounts.
  • Configurable training parameters: GRPO_MODEL, GRPO_MAX_STEPS, GRPO_NUM_NODES, CKPT_DIR, and DATA_PATH are all overridable via environment variables.
  • Clear setup instructions in manifest comments: Numbered setup guide and # <-- change markers on every user-configurable value.
  • SBOM moved to build-time: No stale inventory to maintain.
  • NCCL version explanation: The README note about pip wheel 2.27.5 vs system 2.26.5 preempts common confusion.
  • Proper set -euo pipefail in the entrypoint script.
  • NCCL_TUNER_PLUGIN=ofi: Good addition for EFA-aware NCCL algorithm selection.

Comment on lines +244 to +249
try:
from peft import PeftModel
except ImportError:
import subprocess
subprocess.check_call([sys.executable, "-m", "pip", "install", "peft", "-q"])
from peft import PeftModel
Collaborator


Runtime pip install peft fallback still present

This try/except runs pip install peft -q at runtime if the package isn't installed. Runtime pip installs are non-reproducible (unpinned version, network dependency) and fail in air-gapped environments. I'd suggest either:

  • Documenting peft as a prerequisite in the README's evaluation section, or
  • Adding peft to the Dockerfile
