
feat: add OpenRLHF GRPO training recipe for gpt-oss-20b on HyperPod EKS (g5.12xlarge)#1053

Merged
KeitaW merged 3 commits into main from feature/openrlhf-grpo-gptoss-g5 on Apr 8, 2026


Conversation

@nkumaraws (Contributor) commented Apr 4, 2026

Summary

  • Add complete OpenRLHF v0.9.0 GRPO recipe for training openai/gpt-oss-20b (20B MoE) on g5.12xlarge (4x A10G 24GB GPUs) with HyperPod EKS
  • Non-Hybrid Engine architecture: vLLM inference on 1 dedicated worker (TP=4), DeepSpeed ZeRO-3 training on 4 separate workers (16 GPUs)
  • Same multilingual language compliance task and reward function as the veRL companion PR

Architecture

6x g5.12xlarge (4x A10G 24GB each)

  Ray Head (8Gi, 0 GPU)    5 GPU Workers (160Gi, 4 GPU, 1 EFA each)
  ┌──────────────────┐     ┌────────────┐  ┌──────────────────────┐
  │  Ray driver      │     │  Worker 1  │  │  Workers 2-5         │
  │  (orchestration) │     │  vLLM TP=4 │  │  DeepSpeed ZeRO-3    │
  │                  │     │ (inference)│  │  (training, 16 GPUs) │
  └──────────────────┘     └────────────┘  └──────────────────────┘

Why Non-Hybrid? On 24GB GPUs with adam_offload, the vLLM sleep/wake CPU backup mechanism adds ~40GB/node, causing OOM when colocated with DeepSpeed. Separate nodes eliminate this entirely.

Memory budget per training node:

  • adam_offload: 320GB total across 16 GPUs → 20GB/GPU × 4 GPUs/node = 80GB per node
  • Pod limit: 160Gi → ~80GB headroom for checkpoints and overhead
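The budget above can be sanity-checked in a few lines of Python (a sketch using only the figures quoted in this PR; the 160Gi pod limit is treated loosely as 160GB, as in the bullet above):

```python
# Sanity check of the memory budget quoted above (figures from this PR).
optimizer_total_gb = 320            # adam_offload: CPU optimizer state across all training GPUs
training_gpus = 16
gpus_per_node = 4

per_node_gb = optimizer_total_gb / training_gpus * gpus_per_node   # 80.0 GB per training node
pod_limit_gb = 160                  # 160Gi pod limit, treated loosely as GB
headroom_gb = pod_limit_gb - per_node_gb                           # ~80 GB for checkpoints/overhead

print(f"optimizer per node: {per_node_gb:.0f} GB, headroom: ~{headroom_gb:.0f} GB")
```

Colocating vLLM would add its ~40GB sleep/wake CPU backup to the same node, consuming half of that headroom before checkpoints and other overhead are counted — which is why the recipe keeps inference on a separate worker.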

Training Results

Validated on HyperPod EKS cluster (trl-gptoss-eks), 60+ steps:

| Metric | Value |
|---|---|
| Steps completed | 60+ (61/61 rollout batches) |
| Reward range | 4.88 – 5.97 / 6.0 |
| Step time | ~2.3 min |
| Total training time | ~2h 38m |
| HF checkpoint size | 39 GB (safetensors, no conversion needed) |
| DeepSpeed checkpoint size | ~234 GB (optimizer states) |

Checkpoints saved at steps 20 and 40 in HuggingFace format directly (--save_hf_ckpt).

Key Differences from veRL (Companion PR #1054)

| Aspect | OpenRLHF | veRL |
|---|---|---|
| Training framework | DeepSpeed ZeRO-3 | FSDP2 |
| GPU layout | Separate vLLM + training nodes | Shared (inline vLLM) |
| Memory offload | --adam_offload (optimizer to CPU) | offload_policy=True (params + optimizer) |
| Checkpoint format | HuggingFace directly | FSDP shards → merge step |
| Nodes used | 6 (1 head + 5 workers) | 4 (1 head + 3 workers) |
| GPUs for training | 16 | 12 |

Files

| File | Description |
|---|---|
| Dockerfile | NGC 25.02 + OpenRLHF v0.9.0 + vLLM 0.11.0 + EFA + numpy/cv2 fixes |
| buildspec.yml | AWS CodeBuild spec for ECR image build |
| README.md | Full documentation with architecture diagram, results, memory budget, troubleshooting |
| recipe/run_gptoss_grpo.sh | Training launcher (ray job submit via REST API) |
| recipe/language_reward.py | Custom reward function (OpenRLHF API) |
| recipe/evaluate_gptoss.py | 50-question vLLM evaluation across 5 languages |
| recipe/evaluate_gptoss.sh | Evaluation wrapper script |
| setup/env_vars.example | g5.12xlarge + p5en.48xlarge config templates |
| setup/load_data_gptoss.sh | HuggingFaceH4/Multilingual-Thinking data prep |
| setup/raycluster.yaml | KubeRay manifest (num-gpus=0 head, Ray OOM killer disabled) |

Docker Image Notes

Key fixes baked into the Dockerfile:

  • numpy < 2.3: vLLM 0.11.0 pulls numba which requires numpy <= 2.2 (NGC ships 2.4)
  • opencv-python-headless removed: Crashes with numpy 2.x in the NGC container
  • flash-attn not compiled separately: vLLM 0.11.0 bundles its own (can't compile in CodeBuild — no GPU)
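A sketch of how these fixes might appear as Dockerfile steps (illustrative only — the actual Dockerfile in this PR is authoritative):

```dockerfile
# Illustrative sketch of the fixes described above; see the PR's Dockerfile for the real steps.
# Pin numpy below 2.3 so numba (pulled in by vLLM 0.11.0) resolves; NGC ships 2.4.
RUN pip install --no-cache-dir "numpy<2.3"
# Remove opencv-python-headless, which crashes with numpy 2.x in the NGC container.
RUN pip uninstall -y opencv-python-headless
# No separate flash-attn build: vLLM 0.11.0 bundles its own, and CodeBuild has no GPU to compile against.
```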

Testing

  • Trained 60+ steps on 5x ml.g5.12xlarge workers (20 GPUs: 4 vLLM + 16 training)
  • HF checkpoints saved at steps 20 and 40 (39GB each)
  • Zero pod restarts, no OOM during training
  • EFA networking validated (NCCL over OFI, NCCL_PROTO=simple for g5)
  • Step 60 checkpoint save failed due to FSx disk full (1.2TB capacity) — not a code issue

Related

Add complete OpenRLHF v0.9.0 recipe for GRPO training of openai/gpt-oss-20b
(20B MoE) on 6x g5.12xlarge with Non-Hybrid Engine architecture.

Architecture: 5 GPU workers (160Gi, 4xA10G, 1 EFA each) + 1 Ray head
(8Gi, num-gpus=0). vLLM inference on 1 dedicated worker (TP=4),
DeepSpeed ZeRO-3 training on 4 workers (16 GPUs, adam_offload ~80GB/node).

Includes: Dockerfile (NGC 25.02 + EFA + numpy/cv2 fixes), KubeRay manifest,
training script, custom reward function (language compliance), evaluation
scripts, data loader, and CodeBuild spec.

Training validated: 60+ steps completed, rewards 4.88-5.97, ~2.3 min/step,
HF checkpoints saved at steps 20 and 40 (39GB each).
@nkumaraws force-pushed the feature/openrlhf-grpo-gptoss-g5 branch from 506185b to 34b476f on April 4, 2026 07:20
@KeitaW (Collaborator) left a comment


Review Batch 1/4 — Structure & Repository Hygiene

@KeitaW (Collaborator) left a comment


Review Batch 2/4 — Deployment Pipeline (K8s)

envsubst without explicit variable whitelist

File: README.md (line 145)

The README instructs envsubst < setup/raycluster.yaml | kubectl apply -f - without a variable whitelist. This replaces all $VAR patterns in the YAML, which could unintentionally substitute env vars that are meant to be literal values in the pod spec. I'd suggest using an explicit whitelist:

envsubst '$REGISTRY $IMAGE $TAG $INSTANCE_TYPE $FI_EFA_USE_DEVICE_RDMA $HF_TOKEN $HEAD_CPU $HEAD_MEMORY $WORKER_CPU $WORKER_MEMORY $NUM_NODES $NUM_GPU_PER_NODE $NUM_EFA_PER_NODE' \
    < setup/raycluster.yaml | kubectl apply -f -
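To illustrate why the whitelist matters, here is a small Python analogue of whitelisted substitution (the helper name and sample values are ours, not from the PR):

```python
import re

def subst_whitelisted(text, values):
    """Replace $VAR / ${VAR} only for names present in `values`, leaving every
    other dollar pattern literal -- the behavior `envsubst 'LIST'` gives you.
    (Helper name and sample values are illustrative, not from the PR.)"""
    def repl(match):
        name = match.group(1) or match.group(2)
        return str(values[name]) if name in values else match.group(0)
    return re.sub(r"\$\{(\w+)\}|\$(\w+)", repl, text)

manifest = 'image: ${REGISTRY}/${IMAGE}:${TAG}\ncommand: ["sh", "-c", "echo $HOSTNAME"]'
out = subst_whitelisted(manifest, {"REGISTRY": "example-registry", "IMAGE": "openrlhf", "TAG": "latest"})
print(out)  # $HOSTNAME survives untouched; only the whitelisted names are replaced
```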

Comment on lines +56 to +57
- name: HF_TOKEN
value: ${HF_TOKEN}
Collaborator


HF_TOKEN embedded directly in pod spec

The HuggingFace token is injected via envsubst as a plain-text env var, making it visible in the RayCluster resource (kubectl get raycluster -o yaml). I'd suggest using a Kubernetes Secret instead:

- name: HF_TOKEN
  valueFrom:
    secretKeyRef:
      name: hf-secret
      key: token

With a setup step: kubectl create secret generic hf-secret --from-literal=token=$HF_TOKEN

@@ -0,0 +1,171 @@
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: Apache-2.0
Collaborator


License header mismatch

This file uses Apache-2.0 while all other files in the PR (and the repo convention) use MIT-0.

Suggested change
- # SPDX-License-Identifier: Apache-2.0
+ # SPDX-License-Identifier: MIT-0

@KeitaW (Collaborator) left a comment


Review Batch 3/4 — Infrastructure & NCCL Configuration

Comment on lines +46 to +47
- name: NCCL_PROTO
value: "simple"
Collaborator


NCCL_PROTO=simple hardcoded despite env_vars configurability

The env_vars.example correctly parameterizes NCCL_PROTO (empty for p5en with GPUDirect RDMA, simple for g5 without). But here and on line 128 it's hardcoded to "simple", so switching to p5en requires manually editing the YAML in addition to changing env_vars. I'd suggest making it consistent with the other substituted variables:

Suggested change
- - name: NCCL_PROTO
-   value: "simple"
+ - name: NCCL_PROTO
+   value: "${NCCL_PROTO}"
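With NCCL_PROTO parameterized, env_vars can derive it from the instance type along these lines (a sketch; the variable names follow env_vars.example, but the case logic is ours):

```shell
# Sketch: derive NCCL_PROTO from the instance type (the case logic is illustrative;
# variable names follow env_vars.example).
: "${INSTANCE_TYPE:=g5.12xlarge}"
case "$INSTANCE_TYPE" in
  p5en.48xlarge) NCCL_PROTO="" ;;        # GPUDirect RDMA: let NCCL pick the protocol
  *)             NCCL_PROTO="simple" ;;  # g5 and other non-RDMA instances
esac
export NCCL_PROTO
echo "NCCL_PROTO='${NCCL_PROTO}' for ${INSTANCE_TYPE}"
```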

@KeitaW (Collaborator) left a comment


Review Batch 4/4 — Documentation Consistency


Things That Look Great

  • Exceptional documentation: The README is one of the best I've seen in this repo — complete with ASCII architecture diagrams, detailed memory budget tables, a comparison with veRL, and a troubleshooting section that reads like production runbook notes.
  • Memory optimization analysis: The "Why Non-Hybrid?" explanation with concrete numbers (adam_offload ~80GB + vLLM sleep ~40GB > 160Gi) is exactly the kind of practical reasoning that helps users make informed decisions.
  • Key Discoveries section: Documenting the num-gpus: "0" head node gotcha, RAY_memory_monitor_refresh_ms=0 placement requirement, and NumPy/cv2 fix saves future users significant debugging time.
  • Reward function API documentation: The note about extra_logs requiring 1-d tensors (not 0-d scalars) for torch.cat() compatibility is a subtle catch that would be hard to debug.
  • Dual instance type support: The env_vars.example with both g5.12xlarge and p5en.48xlarge configurations, with clear comments on the differences (GPUDirect RDMA, NCCL_PROTO), makes it easy to adapt.
  • Clean Dockerfile pattern: Follows the repo convention — EFA installer with --skip-kmod --no-verify, HPC-X cleanup, proper NCCL_SOCKET_IFNAME exclusion pattern (^docker,lo,veth).
  • Self-contained with shared reward function: The pattern of kubectl cp the reward function to FSx (accessible by all Ray workers) is practical and well-documented.
  • Copyright and license headers: Present on every file.
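The extra_logs gotcha called out above can be reproduced in a few lines (assuming PyTorch is available; this is a standalone illustration, not code from the PR):

```python
import torch

# extra_logs values must be 1-d tensors: torch.cat raises on 0-d scalars.
scalar = torch.tensor(1.0)        # 0-d, shape ()
try:
    torch.cat([scalar, scalar])
    scalars_concatenate = True
except RuntimeError:
    scalars_concatenate = False   # "zero-dimensional tensor ... cannot be concatenated"

vec = scalar.reshape(1)           # 1-d, shape (1,)
combined = torch.cat([vec, vec])  # works: shape (2,)
```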

gpu_memory_utilization=gpu_mem,
max_model_len=max_model_len,
enforce_eager=True,
trust_remote_code=True,
Collaborator


trust_remote_code=True used without comment

The review checklist flags unconditional trust_remote_code=True since it executes arbitrary code from the HuggingFace Hub. For openai/gpt-oss-20b this is likely required (MoE models often need custom code), but I'd suggest adding a brief comment explaining why:

Suggested change
- trust_remote_code=True,
+ trust_remote_code=True,  # Required for gpt-oss-20b MoE architecture

@KeitaW KeitaW merged commit 66e7860 into main Apr 8, 2026
4 checks passed
@KeitaW KeitaW deleted the feature/openrlhf-grpo-gptoss-g5 branch April 8, 2026 07:02
