
Commit 66e7860 — authored by nkumaraws and KeitaW
feat: add OpenRLHF GRPO training recipe for gpt-oss-20b on HyperPod EKS (g5.12xlarge) (#1053)
* feat: add OpenRLHF GRPO training recipe for gpt-oss-20b on HyperPod EKS

  Add a complete OpenRLHF v0.9.0 recipe for GRPO training of openai/gpt-oss-20b (20B MoE) on 6x g5.12xlarge with a Non-Hybrid Engine architecture.

  Architecture: 5 GPU workers (160Gi, 4x A10G, 1 EFA each) + 1 Ray head (8Gi, num-gpus=0). vLLM inference on 1 dedicated worker (TP=4); DeepSpeed ZeRO-3 training on 4 workers (16 GPUs, adam_offload, ~80GB/node).

  Includes: Dockerfile (NGC 25.02 + EFA + numpy/cv2 fixes), KubeRay manifest, training script, custom reward function (language compliance), evaluation scripts, data loader, and CodeBuild spec.

  Training validated: 60+ steps completed, rewards 4.88-5.97, ~2.3 min/step, HF checkpoints saved at steps 20 and 40 (39GB each).

* Update 3.test_cases/pytorch/openrlhf/Dockerfile

* Update 3.test_cases/pytorch/openrlhf/Dockerfile

---------

Co-authored-by: Keita Watanabe <mlkeita@amazon.com>
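The GPU accounting stated in the message (16 training GPUs, TP=4 for inference) can be sanity-checked with a short shell sketch; the variable names below are illustrative and not part of the recipe:

```shell
# Resource math from the commit message: 6x g5.12xlarge, 4x A10G per node.
# 1 Ray head runs with --num-gpus=0; the other 5 nodes are GPU workers.
GPUS_PER_NODE=4
VLLM_NODES=1      # dedicated vLLM inference worker (tensor parallel, TP=4)
TRAIN_NODES=4     # DeepSpeed ZeRO-3 training workers

echo "training GPUs: $((TRAIN_NODES * GPUS_PER_NODE))"
echo "vLLM TP size: $((VLLM_NODES * GPUS_PER_NODE))"
```

The printed values (16 and 4) match the "16 GPUs" and "TP=4" figures in the commit message.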
Parent: 311109b

File tree: 10 files changed (+1850, −0 lines)
3.test_cases/pytorch/openrlhf/Dockerfile — 115 additions, 0 deletions
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
#
# OpenRLHF with EFA support for Amazon EKS (HyperPod)
#
# Base: NGC PyTorch 25.02 (CUDA 12.8, PyTorch 2.8, Python 3.12)
# OpenRLHF: v0.9.0 with vLLM 0.11.0 + DeepSpeed ZeRO-3 + Ray
#
# Build:
#   docker build -t openrlhf-rlvr:latest .
#
# The image supports both g5.12xlarge (4× A10G 24GB) and p5en.48xlarge
# (8× H100 80GB) with Non-Hybrid Engine (separate vLLM + training nodes).

FROM nvcr.io/nvidia/pytorch:25.02-py3

# ---------------------------------------------------------------------------
# System dependencies + EFA
# ---------------------------------------------------------------------------
ARG EFA_VERSION=1.47.0

RUN apt-get update && apt-get install -y --no-install-recommends \
        git wget curl ninja-build autoconf build-essential \
        pciutils environment-modules tcl tcl-dev \
        libnl-3-dev libnl-route-3-dev libevent-dev libhwloc-dev \
        dmidecode ethtool iproute2 \
        openssh-server openssh-client \
        systemd udev \
    && rm -rf /var/lib/apt/lists/*

# SSH configuration for multi-node
RUN mkdir -p /var/run/sshd \
    && sed -i 's/[ #]\(.*StrictHostKeyChecking \).*/ \1no/g' /etc/ssh/ssh_config \
    && echo " UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config \
    && sed -i 's/#\(StrictModes \).*/\1no/g' /etc/ssh/sshd_config

# ---------------------------------------------------------------------------
# EFA installer (skip kernel modules — provided by the host)
# ---------------------------------------------------------------------------
RUN cd /tmp \
    && curl -O https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_VERSION}.tar.gz \
    && tar -xf aws-efa-installer-${EFA_VERSION}.tar.gz \
    && cd aws-efa-installer \
    && ./efa_installer.sh -y --skip-kmod --skip-limit-conf --no-verify \
    && rm -rf /tmp/aws-efa-installer*

# Clean up HPC-X to avoid conflicts with EFA
RUN rm -rf /opt/hpcx /usr/local/mpi \
    && rm -f /etc/ld.so.conf.d/hpcx.conf \
    && ldconfig

# EFA / OpenMPI paths
ENV PATH="/opt/amazon/openmpi/bin:/opt/amazon/efa/bin:${PATH}"
ENV LD_LIBRARY_PATH="/opt/amazon/openmpi/lib:/opt/nccl/build/lib:/opt/amazon/efa/lib:/opt/amazon/ofi-nccl/lib/x86_64-linux-gnu:${LD_LIBRARY_PATH}"
ENV OMPI_MCA_pml=^ucx
ENV OMPI_MCA_btl=tcp,self
ENV OMPI_MCA_btl_tcp_if_exclude=lo,docker0,veth_def_agent
ENV OPAL_PREFIX=/opt/amazon/openmpi

# EFA / NCCL tuning
ENV FI_PROVIDER=efa
ENV FI_EFA_USE_DEVICE_RDMA=1
ENV FI_EFA_FORK_SAFE=1
ENV FI_EFA_ENABLE_SHM_TRANSFER=1
ENV NCCL_PROTO=simple
ENV NCCL_NET_GDR_LEVEL=LOC
ENV NCCL_SOCKET_IFNAME=^docker,lo,veth
ENV NCCL_TUNER_PLUGIN=/opt/amazon/ofi-nccl/lib/x86_64-linux-gnu/libnccl-ofi-tuner.so
ENV PMIX_MCA_gds=hash

# ---------------------------------------------------------------------------
# NCCL tests (optional — useful for cluster validation)
# ---------------------------------------------------------------------------
RUN git clone --branch v2.13.11 --depth 1 https://github.com/NVIDIA/nccl-tests.git /opt/nccl-tests \
    && cd /opt/nccl-tests \
    && make -j $(nproc) MPI=1 MPI_HOME=/opt/amazon/openmpi CUDA_HOME=/usr/local/cuda NCCL_HOME=/opt/nccl/build

# ---------------------------------------------------------------------------
# Python dependencies — OpenRLHF + vLLM
# ---------------------------------------------------------------------------
# Remove NGC packages that conflict with OpenRLHF's pinned versions
RUN pip uninstall -y xgboost transformer_engine flash_attn pynvml opencv-python-headless 2>/dev/null || true

# Install vLLM first (heavy dependency — bundles its own flash-attention)
RUN pip install --no-cache-dir vllm==0.11.0

# Fix NumPy / cv2 compatibility issues introduced by vLLM 0.11.0:
#   1. vLLM pulls opencv-python-headless, which crashes with NumPy 2.4 from NGC
#   2. vLLM imports numba, which requires NumPy ≤ 2.2
RUN pip install --no-cache-dir 'numpy<2.3' \
    && pip uninstall -y opencv-python-headless 2>/dev/null || true \
    && rm -rf /usr/local/lib/python3.12/dist-packages/cv2*

# Note: flash-attn is NOT installed separately. vLLM 0.11.0 bundles its own
# flash-attention backend, and HuggingFace Transformers' flash_attention_2
# implementation uses it automatically. Building flash-attn from source
# requires a GPU (CUDA compilation), which is unavailable in CI/CodeBuild.

# Install OpenRLHF v0.9.0
RUN pip install --no-cache-dir openrlhf==0.9.0

# Additional dependencies for our reward function and evaluation
RUN pip install --no-cache-dir \
        langdetect \
        boto3 \
        botocore \
        s3torchconnector

# ---------------------------------------------------------------------------
# Working directory and ports
# ---------------------------------------------------------------------------
WORKDIR /workspace

# Ray dashboard (8265), Ray client (10001), Ray GCS (6379), metrics (8080)
EXPOSE 8265 10001 6379 8080
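Following the Build comment in the Dockerfile header, the image can be built locally and, on a GPU host, smoke-tested with the bundled NCCL tests. The `all_reduce_perf` invocation below is a hypothetical single-node check (flag values chosen for a 4-GPU g5.12xlarge), not part of the recipe itself:

```shell
# Build the image, using the tag from the Dockerfile's Build comment
docker build -t openrlhf-rlvr:latest .

# Hypothetical single-node smoke test with the bundled NCCL tests
# (assumes a 4-GPU host with the NVIDIA container runtime installed)
docker run --rm --gpus all openrlhf-rlvr:latest \
    /opt/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
```

Multi-node EFA validation would instead launch the same binary through `mpirun` across nodes, as in the usual HyperPod NCCL-test workflow.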
