
feat: add OpenRLHF GRPO training recipe for gpt-oss-20b on HyperPod EKS (g5.12xlarge)#1053

Merged
KeitaW merged 3 commits into main from feature/openrlhf-grpo-gptoss-g5 on Apr 8, 2026


Conversation

@nkumaraws (Contributor) commented Apr 4, 2026

Summary

  • Add complete OpenRLHF v0.9.0 GRPO recipe for training openai/gpt-oss-20b (20B MoE) on g5.12xlarge (4x A10G 24GB GPUs) with HyperPod EKS
  • Non-Hybrid Engine architecture: vLLM inference on 1 dedicated worker (TP=4), DeepSpeed ZeRO-3 training on 4 separate workers (16 GPUs)
  • Same multilingual language compliance task and reward function as the veRL companion PR

Architecture

6x g5.12xlarge (4x A10G 24GB each)

  Ray Head (8Gi, 0 GPU)    5 GPU Workers (160Gi, 4 GPU, 1 EFA each)
  ┌──────────────────┐     ┌────────────┐  ┌──────────────────────┐
  │  Ray driver      │     │  Worker 1  │  │  Workers 2-5         │
  │  (orchestration) │     │  vLLM TP=4 │  │  DeepSpeed ZeRO-3    │
  │                  │     │ (inference)│  │  (training, 16 GPUs) │
  └──────────────────┘     └────────────┘  └──────────────────────┘

Why Non-Hybrid? On 24GB GPUs with adam_offload, the vLLM sleep/wake CPU backup mechanism adds ~40GB/node, causing OOM when colocated with DeepSpeed. Separate nodes eliminate this entirely.

Memory budget per training node:

  • adam_offload: 320GB total across 16 GPUs → 20GB/GPU × 4 GPUs/node = 80GB per node
  • Pod limit: 160Gi → ~80GB headroom for checkpoints and overhead
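The budget above can be sanity-checked in a few lines of Python (a sketch using only the figures quoted in this PR; the 160Gi pod limit is treated loosely as 160GB, as in the bullet above):

```python
# Sanity check of the memory budget quoted above (figures from this PR).
optimizer_total_gb = 320            # adam_offload: CPU optimizer state across all training GPUs
training_gpus = 16
gpus_per_node = 4

per_node_gb = optimizer_total_gb / training_gpus * gpus_per_node   # 80.0 GB per training node
pod_limit_gb = 160                  # 160Gi pod limit, treated loosely as GB
headroom_gb = pod_limit_gb - per_node_gb                           # ~80 GB for checkpoints/overhead

print(f"optimizer per node: {per_node_gb:.0f} GB, headroom: ~{headroom_gb:.0f} GB")
```

Colocating vLLM would add its ~40GB sleep/wake CPU backup to the same node, consuming half of that headroom before checkpoints and other overhead are counted — which is why the recipe keeps inference on a separate worker.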

Training Results

Validated on HyperPod EKS cluster (trl-gptoss-eks), 60+ steps:

| Metric | Value |
|---|---|
| Steps completed | 60+ (61/61 rollout batches) |
| Reward range | 4.88 – 5.97 / 6.0 |
| Step time | ~2.3 min |
| Total training time | ~2h 38m |
| HF checkpoint size | 39 GB (safetensors, no conversion needed) |
| DeepSpeed checkpoint size | ~234 GB (optimizer states) |

Checkpoints saved at steps 20 and 40 in HuggingFace format directly (--save_hf_ckpt).

Key Differences from veRL (Companion PR #1054)

| Aspect | OpenRLHF | veRL |
|---|---|---|
| Training framework | DeepSpeed ZeRO-3 | FSDP2 |
| GPU layout | Separate vLLM + training nodes | Shared (inline vLLM) |
| Memory offload | --adam_offload (optimizer to CPU) | offload_policy=True (params + optimizer) |
| Checkpoint format | HuggingFace directly | FSDP shards → merge step |
| Nodes used | 6 (1 head + 5 workers) | 4 (1 head + 3 workers) |
| GPUs for training | 16 | 12 |

Files

| File | Description |
|---|---|
| Dockerfile | NGC 25.02 + OpenRLHF v0.9.0 + vLLM 0.11.0 + EFA + numpy/cv2 fixes |
| buildspec.yml | AWS CodeBuild spec for ECR image build |
| README.md | Full documentation with architecture diagram, results, memory budget, troubleshooting |
| recipe/run_gptoss_grpo.sh | Training launcher (ray job submit via REST API) |
| recipe/language_reward.py | Custom reward function (OpenRLHF API) |
| recipe/evaluate_gptoss.py | 50-question vLLM evaluation across 5 languages |
| recipe/evaluate_gptoss.sh | Evaluation wrapper script |
| setup/env_vars.example | g5.12xlarge + p5en.48xlarge config templates |
| setup/load_data_gptoss.sh | HuggingFaceH4/Multilingual-Thinking data prep |
| setup/raycluster.yaml | KubeRay manifest (num-gpus=0 head, Ray OOM killer disabled) |

Docker Image Notes

Key fixes baked into the Dockerfile:

  • numpy < 2.3: vLLM 0.11.0 pulls numba which requires numpy <= 2.2 (NGC ships 2.4)
  • opencv-python-headless removed: Crashes with numpy 2.x in the NGC container
  • flash-attn not compiled separately: vLLM 0.11.0 bundles its own (can't compile in CodeBuild — no GPU)
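A sketch of how these fixes might appear as Dockerfile steps (illustrative only — the actual Dockerfile in this PR is authoritative):

```dockerfile
# Illustrative sketch of the fixes described above; see the PR's Dockerfile for the real steps.
# Pin numpy below 2.3 so numba (pulled in by vLLM 0.11.0) resolves; NGC ships 2.4.
RUN pip install --no-cache-dir "numpy<2.3"
# Remove opencv-python-headless, which crashes with numpy 2.x in the NGC container.
RUN pip uninstall -y opencv-python-headless
# No separate flash-attn build: vLLM 0.11.0 bundles its own, and CodeBuild has no GPU to compile against.
```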

Testing

  • Trained 60+ steps on 5x ml.g5.12xlarge workers (20 GPUs: 4 vLLM + 16 training)
  • HF checkpoints saved at steps 20 and 40 (39GB each)
  • Zero pod restarts, no OOM during training
  • EFA networking validated (NCCL over OFI, NCCL_PROTO=simple for g5)
  • Step 60 checkpoint save failed due to FSx disk full (1.2TB capacity) — not a code issue

Related

Add complete OpenRLHF v0.9.0 recipe for GRPO training of openai/gpt-oss-20b
(20B MoE) on 6x g5.12xlarge with Non-Hybrid Engine architecture.

Architecture: 5 GPU workers (160Gi, 4xA10G, 1 EFA each) + 1 Ray head
(8Gi, num-gpus=0). vLLM inference on 1 dedicated worker (TP=4),
DeepSpeed ZeRO-3 training on 4 workers (16 GPUs, adam_offload ~80GB/node).

Includes: Dockerfile (NGC 25.02 + EFA + numpy/cv2 fixes), KubeRay manifest,
training script, custom reward function (language compliance), evaluation
scripts, data loader, and CodeBuild spec.

Training validated: 60+ steps completed, rewards 4.88-5.97, ~2.3 min/step,
HF checkpoints saved at steps 20 and 40 (39GB each).
@nkumaraws force-pushed the feature/openrlhf-grpo-gptoss-g5 branch from 506185b to 34b476f on April 4, 2026 07:20
@KeitaW (Collaborator) left a comment


Review Batch 1/4 — Structure & Repository Hygiene

@KeitaW (Collaborator) left a comment


Review Batch 2/4 — Deployment Pipeline (K8s)

envsubst without explicit variable whitelist

File: README.md (line 145)

The README instructs envsubst < setup/raycluster.yaml | kubectl apply -f - without a variable whitelist. This replaces all $VAR patterns in the YAML, which could unintentionally substitute env vars that are meant to be literal values in the pod spec. I'd suggest using an explicit whitelist:

envsubst '$REGISTRY $IMAGE $TAG $INSTANCE_TYPE $FI_EFA_USE_DEVICE_RDMA $HF_TOKEN $HEAD_CPU $HEAD_MEMORY $WORKER_CPU $WORKER_MEMORY $NUM_NODES $NUM_GPU_PER_NODE $NUM_EFA_PER_NODE' \
    < setup/raycluster.yaml | kubectl apply -f -
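To illustrate why the whitelist matters, here is a small Python analogue of whitelisted substitution (the helper name and sample values are ours, not from the PR):

```python
import re

def subst_whitelisted(text, values):
    """Replace $VAR / ${VAR} only for names present in `values`, leaving every
    other dollar pattern literal -- the behavior `envsubst 'LIST'` gives you.
    (Helper name and sample values are illustrative, not from the PR.)"""
    def repl(match):
        name = match.group(1) or match.group(2)
        return str(values[name]) if name in values else match.group(0)
    return re.sub(r"\$\{(\w+)\}|\$(\w+)", repl, text)

manifest = 'image: ${REGISTRY}/${IMAGE}:${TAG}\ncommand: ["sh", "-c", "echo $HOSTNAME"]'
out = subst_whitelisted(manifest, {"REGISTRY": "example-registry", "IMAGE": "openrlhf", "TAG": "latest"})
print(out)  # $HOSTNAME survives untouched; only the whitelisted names are replaced
```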

Comment on lines +56 to +57
- name: HF_TOKEN
value: ${HF_TOKEN}
Collaborator


HF_TOKEN embedded directly in pod spec

The HuggingFace token is injected via envsubst as a plain-text env var, making it visible in the RayCluster resource (kubectl get raycluster -o yaml). I'd suggest using a Kubernetes Secret instead:

- name: HF_TOKEN
  valueFrom:
    secretKeyRef:
      name: hf-secret
      key: token

With a setup step: kubectl create secret generic hf-secret --from-literal=token=$HF_TOKEN

@@ -0,0 +1,171 @@
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: Apache-2.0
Collaborator


License header mismatch

This file uses Apache-2.0 while all other files in the PR (and the repo convention) use MIT-0.

Suggested change
- # SPDX-License-Identifier: Apache-2.0
+ # SPDX-License-Identifier: MIT-0

@KeitaW (Collaborator) left a comment


Review Batch 3/4 — Infrastructure & NCCL Configuration

Comment on lines +46 to +47
- name: NCCL_PROTO
value: "simple"
Collaborator


NCCL_PROTO=simple hardcoded despite env_vars configurability

The env_vars.example correctly parameterizes NCCL_PROTO (empty for p5en with GPUDirect RDMA, simple for g5 without). But here and on line 128 it's hardcoded to "simple", so switching to p5en requires manually editing the YAML in addition to changing env_vars. I'd suggest making it consistent with the other substituted variables:

Suggested change
- - name: NCCL_PROTO
-   value: "simple"
+ - name: NCCL_PROTO
+   value: "${NCCL_PROTO}"
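With NCCL_PROTO parameterized, env_vars can derive it from the instance type along these lines (a sketch; the variable names follow env_vars.example, but the case logic is ours):

```shell
# Sketch: derive NCCL_PROTO from the instance type (the case logic is illustrative;
# variable names follow env_vars.example).
: "${INSTANCE_TYPE:=g5.12xlarge}"
case "$INSTANCE_TYPE" in
  p5en.48xlarge) NCCL_PROTO="" ;;        # GPUDirect RDMA: let NCCL pick the protocol
  *)             NCCL_PROTO="simple" ;;  # g5 and other non-RDMA instances
esac
export NCCL_PROTO
echo "NCCL_PROTO='${NCCL_PROTO}' for ${INSTANCE_TYPE}"
```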

@KeitaW (Collaborator) left a comment


Review Batch 4/4 — Documentation Consistency


Things That Look Great

  • Exceptional documentation: The README is one of the best I've seen in this repo — complete with ASCII architecture diagrams, detailed memory budget tables, a comparison with veRL, and a troubleshooting section that reads like production runbook notes.
  • Memory optimization analysis: The "Why Non-Hybrid?" explanation with concrete numbers (adam_offload ~80GB + vLLM sleep ~40GB > 160Gi) is exactly the kind of practical reasoning that helps users make informed decisions.
  • Key Discoveries section: Documenting the num-gpus: "0" head node gotcha, RAY_memory_monitor_refresh_ms=0 placement requirement, and NumPy/cv2 fix saves future users significant debugging time.
  • Reward function API documentation: The note about extra_logs requiring 1-d tensors (not 0-d scalars) for torch.cat() compatibility is a subtle catch that would be hard to debug.
  • Dual instance type support: The env_vars.example with both g5.12xlarge and p5en.48xlarge configurations, with clear comments on the differences (GPUDirect RDMA, NCCL_PROTO), makes it easy to adapt.
  • Clean Dockerfile pattern: Follows the repo convention — EFA installer with --skip-kmod --no-verify, HPC-X cleanup, proper NCCL_SOCKET_IFNAME exclusion pattern (^docker,lo,veth).
  • Self-contained with shared reward function: The pattern of kubectl cp the reward function to FSx (accessible by all Ray workers) is practical and well-documented.
  • Copyright and license headers: Present on every file.
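The extra_logs gotcha called out above can be reproduced in a few lines (assuming PyTorch is available; this is a standalone illustration, not code from the PR):

```python
import torch

# extra_logs values must be 1-d tensors: torch.cat raises on 0-d scalars.
scalar = torch.tensor(1.0)        # 0-d, shape ()
try:
    torch.cat([scalar, scalar])
    scalars_concatenate = True
except RuntimeError:
    scalars_concatenate = False   # "zero-dimensional tensor ... cannot be concatenated"

vec = scalar.reshape(1)           # 1-d, shape (1,)
combined = torch.cat([vec, vec])  # works: shape (2,)
```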

gpu_memory_utilization=gpu_mem,
max_model_len=max_model_len,
enforce_eager=True,
trust_remote_code=True,
Collaborator


trust_remote_code=True used without comment

The review checklist flags unconditional trust_remote_code=True since it executes arbitrary code from the HuggingFace Hub. For openai/gpt-oss-20b this is likely required (MoE models often need custom code), but I'd suggest adding a brief comment explaining why:

Suggested change
- trust_remote_code=True,
+ trust_remote_code=True,  # Required for gpt-oss-20b MoE architecture

@KeitaW KeitaW merged commit 66e7860 into main Apr 8, 2026
4 checks passed
@KeitaW KeitaW deleted the feature/openrlhf-grpo-gptoss-g5 branch April 8, 2026 07:02
