feat: add OpenRLHF GRPO training recipe for gpt-oss-20b on HyperPod EKS (g5.12xlarge) #1053
Conversation
Add a complete OpenRLHF v0.9.0 recipe for GRPO training of openai/gpt-oss-20b (20B MoE) on 6x g5.12xlarge with a Non-Hybrid Engine architecture.

Architecture: 5 GPU workers (160Gi, 4x A10G, 1 EFA each) + 1 Ray head (8Gi, num-gpus=0). vLLM inference runs on 1 dedicated worker (TP=4); DeepSpeed ZeRO-3 training runs on 4 workers (16 GPUs, adam_offload ~80GB/node).

Includes: Dockerfile (NGC 25.02 + EFA + numpy/cv2 fixes), KubeRay manifest, training script, custom reward function (language compliance), evaluation scripts, data loader, and CodeBuild spec.

Training validated: 60+ steps completed, rewards 4.88-5.97, ~2.3 min/step, HF checkpoints saved at steps 20 and 40 (39GB each).
Force-pushed 506185b to 34b476f
KeitaW left a comment
Review Batch 1/4 — Structure & Repository Hygiene
KeitaW left a comment
Review Batch 2/4 — Deployment Pipeline (K8s)
envsubst without explicit variable whitelist
File: README.md (line 145)
The README instructs envsubst < setup/raycluster.yaml | kubectl apply -f - without a variable whitelist. This replaces all $VAR patterns in the YAML, which could unintentionally substitute env vars that are meant to be literal values in the pod spec. I'd suggest using an explicit whitelist:
```
envsubst '$REGISTRY $IMAGE $TAG $INSTANCE_TYPE $FI_EFA_USE_DEVICE_RDMA $HF_TOKEN $HEAD_CPU $HEAD_MEMORY $WORKER_CPU $WORKER_MEMORY $NUM_NODES $NUM_GPU_PER_NODE $NUM_EFA_PER_NODE' \
  < setup/raycluster.yaml | kubectl apply -f -
```

```yaml
- name: HF_TOKEN
  value: ${HF_TOKEN}
```
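As a rough illustration of why the whitelist matters, here is a hypothetical Python stand-in for `envsubst` (not the real gettext tool; the helper and sample values are invented for the demo): non-whitelisted `$VAR` references stay literal, exactly what you want for values Kubernetes should see unexpanded.

```python
import re

# Hypothetical stand-in for envsubst: expands $VAR / ${VAR} from `env`,
# but when a whitelist is given, leaves non-whitelisted references literal
# (mimicking the behavior of `envsubst '$A $B' < file`).
def envsubst(text, env, whitelist=None):
    def repl(match):
        name = match.group(1) or match.group(2)
        if whitelist is not None and name not in whitelist:
            return match.group(0)  # keep $VAR literal in the output
        return env.get(name, "")
    return re.sub(r"\$(?:(\w+)|\{(\w+)\})", repl, text)

yaml_text = "image: ${REGISTRY}/${IMAGE}:${TAG}\ncmd: echo $HOME\n"
env = {"REGISTRY": "123.dkr.ecr", "IMAGE": "openrlhf", "TAG": "v1", "HOME": "/root"}

print(envsubst(yaml_text, env))                                # $HOME expanded -- unintended
print(envsubst(yaml_text, env, {"REGISTRY", "IMAGE", "TAG"}))  # $HOME stays literal
```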
HF_TOKEN embedded directly in pod spec
The HuggingFace token is injected via envsubst as a plain-text env var, making it visible in the RayCluster resource (kubectl get raycluster -o yaml). I'd suggest using a Kubernetes Secret instead:
```yaml
- name: HF_TOKEN
  valueFrom:
    secretKeyRef:
      name: hf-secret
      key: token
```

With a setup step:

```
kubectl create secret generic hf-secret --from-literal=token=$HF_TOKEN
```
```
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: Apache-2.0
```
License header mismatch
This file uses Apache-2.0 while all other files in the PR (and the repo convention) use MIT-0.
```diff
-# SPDX-License-Identifier: Apache-2.0
+# SPDX-License-Identifier: MIT-0
```
KeitaW left a comment
Review Batch 3/4 — Infrastructure & NCCL Configuration
```yaml
- name: NCCL_PROTO
  value: "simple"
```
NCCL_PROTO=simple hardcoded despite env_vars configurability
The env_vars.example correctly parameterizes NCCL_PROTO (empty for p5en with GPUDirect RDMA, simple for g5 without). But here and on line 128 it's hardcoded to "simple", so switching to p5en requires manually editing the YAML in addition to changing env_vars. I'd suggest making it consistent with the other substituted variables:
```diff
 - name: NCCL_PROTO
-  value: "simple"
+  value: "${NCCL_PROTO}"
```
KeitaW left a comment
Review Batch 4/4 — Documentation Consistency
Things That Look Great

- Exceptional documentation: The README is one of the best I've seen in this repo — complete with ASCII architecture diagrams, detailed memory budget tables, a comparison with veRL, and a troubleshooting section that reads like production runbook notes.
- Memory optimization analysis: The "Why Non-Hybrid?" explanation with concrete numbers (adam_offload ~80GB + vLLM sleep ~40GB > 160Gi) is exactly the kind of practical reasoning that helps users make informed decisions.
- Key Discoveries section: Documenting the `num-gpus: "0"` head node gotcha, the `RAY_memory_monitor_refresh_ms=0` placement requirement, and the NumPy/cv2 fix saves future users significant debugging time.
- Reward function API documentation: The note about `extra_logs` requiring 1-d tensors (not 0-d scalars) for `torch.cat()` compatibility is a subtle catch that would be hard to debug.
- Dual instance type support: The `env_vars.example` with both g5.12xlarge and p5en.48xlarge configurations, with clear comments on the differences (GPUDirect RDMA, NCCL_PROTO), makes it easy to adapt.
- Clean Dockerfile pattern: Follows the repo convention — EFA installer with `--skip-kmod --no-verify`, HPC-X cleanup, proper `NCCL_SOCKET_IFNAME` exclusion pattern (`^docker,lo,veth`).
- Self-contained with shared reward function: The pattern of `kubectl cp`-ing the reward function to FSx (accessible by all Ray workers) is practical and well-documented.
- Copyright and license headers: Present on every file.
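The `extra_logs` shape constraint mentioned above is easy to reproduce in isolation (a minimal sketch, assuming PyTorch is installed):

```python
import torch

# torch.cat() rejects 0-d (scalar) tensors, so values destined for
# extra_logs must be 1-d: torch.tensor([value]), not torch.tensor(value).
scalar = torch.tensor(1.0)    # 0-d tensor
vector = torch.tensor([1.0])  # 1-d tensor

try:
    torch.cat([scalar, scalar])
except RuntimeError as err:
    print("0-d fails:", err)  # "zero-dimensional tensor ... cannot be concatenated"

print(torch.cat([vector, vector]))  # tensor([1., 1.])
```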
```python
gpu_memory_utilization=gpu_mem,
max_model_len=max_model_len,
enforce_eager=True,
trust_remote_code=True,
```
trust_remote_code=True used without comment
The review checklist flags unconditional trust_remote_code=True since it executes arbitrary code from the HuggingFace Hub. For openai/gpt-oss-20b this is likely required (MoE models often need custom code), but I'd suggest adding a brief comment explaining why:
```diff
-    trust_remote_code=True,
+    trust_remote_code=True,  # Required for gpt-oss-20b MoE architecture
```
Summary
Architecture
Why Non-Hybrid? On 24GB GPUs with `adam_offload`, the vLLM sleep/wake CPU backup mechanism adds ~40GB/node, causing OOM when colocated with DeepSpeed. Separate nodes eliminate this entirely.

Memory budget per training node:

- `adam_offload`: 320GB / 16 GPUs x 4 GPUs/node = 80GB per node

Training Results
Validated on HyperPod EKS cluster (`trl-gptoss-eks`), 60+ steps.

Checkpoints saved at steps 20 and 40 in HuggingFace format directly (`--save_hf_ckpt`).

Key Differences from veRL (Companion PR #1054)

- OpenRLHF: `--adam_offload` (optimizer to CPU)
- veRL: `offload_policy=True` (params + optimizer)

Files

- `Dockerfile`
- `buildspec.yml`
- `README.md`
- `recipe/run_gptoss_grpo.sh`
- `recipe/language_reward.py`
- `recipe/evaluate_gptoss.py`
- `recipe/evaluate_gptoss.sh`
- `setup/env_vars.example`
- `setup/load_data_gptoss.sh`
- `setup/raycluster.yaml`

Docker Image Notes
Key fixes baked into the Dockerfile:
Testing
`NCCL_PROTO=simple` for g5)