This repository serves as a collection of research notes and resources on training large language models (LLMs) and Reinforcement Learning from Human Feedback (RLHF). It focuses on the latest research, methodologies, and techniques for fine-tuning language models.
A curated list of materials providing an introduction to RL and RLHF:
- Research papers and books covering key concepts in reinforcement learning.
- Video lectures explaining the fundamentals of RLHF.
An extensive collection of state-of-the-art approaches for optimizing preferences and model alignment:
- Key techniques such as PPO, DPO, KTO, ORPO, and more.
- The latest ArXiv publications and publicly available implementations.
- Analysis of effectiveness across different optimization strategies.
This repository is designed as a reference for researchers and engineers working on reinforcement learning and large language models. If you're interested in model alignment, experiments with DPO and its variants, or alternative RL-based methods, you will find valuable resources here.
- Reinforcement Learning: An Overview
- A COMPREHENSIVE SURVEY OF LLM ALIGNMENT TECHNIQUES: RLHF, RLAIF, PPO, DPO AND MORE
- Book-Mathematical-Foundation-of-Reinforcement-Learning
- The FASTEST introduction to Reinforcement Learning on the internet
- rlhf-book
- Notes on reinforcement learning
- PPO - Proximal Policy Optimization Algorithm - OpenAI
- DPO - Direct Preference Optimization: Your Language Model is Secretly a Reward Model - Stanford
- online DPO
- KTO - KTO: Model Alignment as Prospect Theoretic Optimization
- SimPO - Simple Preference Optimization with a Reference-Free Reward - Princeton
- ORPO - Monolithic Preference Optimization without Reference Model - Kaist AI
- Sample Efficient Reinforcement Learning with REINFORCE
- REINFORCE++
- RPO Reward-aware Preference Optimization: A Unified Mathematical Framework for Model Alignment
- RLOO - Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
- GRPO
- ReMax - Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models
- DPOP - Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive
- BCO - Binary Classifier Optimization for Large Language Model Alignment
- PRIME - Process Reinforcement through Implicit Rewards
- DAPO - DAPO: an Open-Source LLM Reinforcement Learning System at Scale - introduces the Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) algorithm, released publicly to give the broader research community practical access to scalable reinforcement learning.
- VAPO - VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks
- DR-GRPO - Understanding R1-Zero-Like Training: A Critical Perspective
- KL_Cov & Clip_Cov - The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
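Many of the offline methods above reduce to a simple loss over preference pairs. As a quick reference, here is a minimal sketch of the DPO loss for a single pair (my own illustration, not code from any listed repository):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(beta * margin)).

    Inputs are the summed log-probabilities of the chosen/rejected responses
    under the policy and under the frozen reference model.
    """
    # Implicit reward margin: difference of policy-vs-reference log-ratios.
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # Numerically stable -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# When the policy equals the reference, the margin is 0 and the loss is log(2).
loss_at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)
```

The beta parameter here plays the same role as the KL coefficient in RLHF: larger beta keeps the policy closer to the reference.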
Notes for learning RL: Value Iteration -> Q Learning -> DQN -> REINFORCE -> Policy Gradient Theorem -> TRPO -> PPO
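The learning path above runs through REINFORCE and the policy gradient theorem; the core update, grad log pi(a) * advantage, fits in a few lines. A minimal REINFORCE-with-baseline sketch on a toy bandit (an illustrative example, not from any listed course):

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def train_bandit(arm_means, steps=2000, lr=0.1, seed=0):
    """REINFORCE on a Gaussian bandit with a running-average baseline."""
    rng = random.Random(seed)
    logits = [0.0] * len(arm_means)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        a = rng.choices(range(len(arm_means)), weights=probs)[0]
        r = rng.gauss(arm_means[a], 0.1)
        baseline += 0.05 * (r - baseline)      # variance-reducing baseline
        adv = r - baseline
        for i in range(len(logits)):           # gradient of log softmax(a)
            grad = (1.0 - probs[i]) if i == a else -probs[i]
            logits[i] += lr * adv * grad
    return softmax(logits)

probs = train_bandit([0.1, 0.9, 0.3])  # policy should concentrate on arm 1
```

Swapping the running-average baseline for a learned state-value function is exactly the step from REINFORCE toward actor-critic methods and PPO.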
- CS234: Reinforcement Learning Winter 2025
- CS285 Deep Reinforcement Learning
- Welcome to Spinning Up in Deep RL
- deep-rl-course from Huggingface
- RL Course by David Silver
- Reinforcement Learning from Human Feedback explained with math derivations and the PyTorch code.
- Direct Preference Optimization (DPO) explained: Bradley-Terry model, log probabilities, math
- GRPO vs PPO
- Unraveling RLHF and Its Variants: Progress and Practical Engineering Insights
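A key difference covered in the GRPO-vs-PPO comparisons above is that GRPO drops the learned value baseline and instead normalizes each rollout's reward against its own sampling group. A minimal sketch of that advantage computation (illustrative, not code from any framework):

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as in GRPO: z-score each rollout's reward
    within its group, so no separate value model is needed."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts for the same prompt, two correct (reward 1) and two wrong.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the advantages are computed per prompt group, they always sum to roughly zero: correct rollouts are pushed up exactly as much as incorrect ones are pushed down.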
- VERL - Volcano Engine Reinforcement Learning for LLMs
- OpenRLHF - OpenRLHF is the first easy-to-use, high-performance open-source RLHF framework built on Ray, vLLM, ZeRO-3 and HuggingFace Transformers, designed to make RLHF training simple and accessible
- TRL - TRL is a full stack library where we provide a set of tools to train transformer language models with methods like Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO), Direct Preference Optimization (DPO), Reward Modeling, and more.
- Nemo-RL - Nemo RL: A Scalable and Efficient Post-Training Library
- ROLL - Large scale training with megatron support, a feature-rich codebase from Alibaba
- RL2 - Ray Less Reinforcement Learning. The NanoGPT of RL with its small and hackable size (<1k lines)
- AReal - AReaL (Ant Reasoning RL): LLM generation runs in a streaming manner, with each rollout worker continuously producing outputs without waiting
- OAT - Oat 🌾 is a simple yet efficient framework for running online LLM alignment algorithms.
- LlamaRL - Meta GenAI - LlamaRL: A Distributed Asynchronous Reinforcement Learning Framework for Efficient Large-scale LLM Training
- Verifiers: Reinforcement Learning with LLMs in Verifiable Environments - verifiers is a set of tools and abstractions for training LLMs with reinforcement learning in verifiable multi-turn environments via Group-Relative Policy Optimization.
- RAGEN - RAGEN (Reasoning AGENt, pronounced like "region") leverages reinforcement learning (RL) to train LLM reasoning agents in interactive, stochastic environments.
- ART - Agent Reinforcement Trainer - ART is an open-source reinforcement training library for improving LLM performance in agentic workflows. ART utilizes the powerful GRPO reinforcement learning algorithm to train models from their own experiences.
- Atropos - Nous Research's LLM RL Gym - Atropos is an environment microservice framework for async RL with LLMs.
- slime - slime is an LLM post-training framework for RL scaling
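Most of the PPO-style frameworks above shape per-token rewards the same way: a KL penalty against a frozen reference model at every token, plus the scalar preference/verifier reward added at the final token. A minimal sketch of that common pattern (an illustration of the idea, not the API of any listed framework):

```python
def shaped_rewards(policy_logps, ref_logps, terminal_reward, kl_coef=0.05):
    """Per-token rewards for PPO-style RLHF.

    policy_logps / ref_logps: per-token log-probs of the generated response
    under the current policy and the frozen reference model.
    """
    # KL penalty term at every token keeps the policy near the reference.
    rewards = [-kl_coef * (p - r) for p, r in zip(policy_logps, ref_logps)]
    # Sequence-level reward (reward model score or verifier) at the last token.
    rewards[-1] += terminal_reward
    return rewards
```

Tokens where the policy is more confident than the reference (p > r) get a small negative reward, which is what prevents reward hacking from drifting the model arbitrarily far from its starting point.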
- DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL
- On the Emergence of Thinking in LLMs I: Searching for the Right Intuition
- s1: Simple test-time scaling and s1.1
- The 37 Implementation Details of Proximal Policy Optimization
- Online-DPO-R1: Unlocking Effective Reasoning Without the PPO Overhead (paper and GitHub)
- How to align open LLMs in 2025 with DPO & synthetic data
- DeepSeek-R1 -> The Illustrated DeepSeek-R1; DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs; DeepSeek R1 and R1-Zero Explained
- 2025.03.23
- 2025.02.22
- Small Models Struggle to Learn from Strong Reasoners
- Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning
- LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization
- Open Reasoner Zero An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
- 2025.06.22
- Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
- POLARIS: A POst-training recipe for scaling reinforcement Learning on Advanced ReasonIng modelS - data difficulty, Diversity-Based Rollout, Inference-Time Length, Exploration Efficiency.
- ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning in LLMs - cross-domain generalization arises from shared abstract reasoning prototypes — fundamental reasoning patterns that capture the essence of problems across domains.
- Truncated Proximal Policy Optimization - Truncated Proximal Policy Optimization (T-PPO), a novel extension to PPO that improves training efficiency by streamlining policy update and length-restricted response generation. T-PPO mitigates the issue of low hardware utilization, an inherent drawback of fully synchronized long-generation procedures, where resources often sit idle during the waiting periods for complete rollouts.
- GRESO - Act Only When It Pays: Efficient Reinforcement Learning for LLM Reasoning via Selective Rollouts - GRESO is a lightweight pre-rollout filter that skips uninformative prompts using reward dynamics, saving RL training time without hurting accuracy.
- AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy - In this work, we investigate the synergy between supervised fine-tuning (SFT) and reinforcement learning (RL) in developing strong reasoning models. We begin by curating the SFT training data through two scaling strategies: increasing the number of collected prompts and the number of generated responses per prompt.
- Self-Adapting Language Models - We introduce Self-Adapting LLMs (SEAL) 🦭, a framework that enables LLMs to self-adapt by generating their own finetuning data and update directives. Given a new input, the model produces a self-edit — a generation that may restructure the information in different ways, specify optimization hyperparameters, or invoke tools for data augmentation and gradient-based updates.
- VeriFree: Reinforcing General Reasoning without Verifiers
- Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning - This repository contains the source code for the experiments in paper Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning. We implemented GRPO-PODS (Policy Optimization with Down-Sampling) and compared its performance with vanilla GRPO.
- Spurious Rewards: Rethinking Training Signals in RLVR - We show that you can do RLVR on Qwen2.5-Math models with completely random or incorrect rewards, and still get massive math benchmark gains.
- Towards a More Efficient Reasoning LLM: AIMO2 Solution Summary and Introduction to Fast-Math Models
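Several entries above (GRESO, PODS-style down-sampling) share one theme: spend rollout and gradient compute only where it is informative. A simple heuristic in that spirit, which keeps the rollouts with the most extreme rewards, can be sketched as follows (an illustration of the idea, not the papers' exact procedures):

```python
def downsample_rollouts(rollouts, rewards, k):
    """Keep the k rollouts with the most extreme rewards: half from the
    lowest-reward end and half from the highest-reward end.

    The intuition is that near-average rollouts produce near-zero advantages
    and thus contribute little gradient signal.
    """
    order = sorted(range(len(rewards)), key=lambda i: rewards[i])
    keep = order[: k // 2] + order[-(k - k // 2):]
    # Preserve the original rollout order for the kept subset.
    return [rollouts[i] for i in sorted(keep)]

kept = downsample_rollouts(["a", "b", "c", "d", "e", "f"], [3, 1, 2, 6, 5, 4], 4)
```

With group-normalized advantages (as in GRPO), dropping mid-reward rollouts this way reduces per-step compute while keeping most of the gradient's magnitude.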
- 2025.06.23
- Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers - Large language models (LLMs) can acquire new knowledge through fine-tuning, but this process exhibits a puzzling duality: models can generalize remarkably from new facts, yet are also prone to hallucinating incorrect information.
- The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning - Reinforcement learning with verifiable rewards (RLVR) is a promising approach for training language models (LMs) on reasoning tasks that elicit emergent long chains of thought (CoTs).
- Thinker: Learning to Think Fast and Slow - Recent studies show that the reasoning capabilities of Large Language Models (LLMs) can be improved by applying Reinforcement Learning (RL) to question-answering (QA) tasks in areas such as math and coding. With a long context length, LLMs may learn to perform search, as indicated by the self-correction behavior observed in DeepSeek R1.
- OpenThoughts3 - A new SOTA Reasoning Data Recipe
- BLEUBERI: BLEU is a surprisingly effective reward for instruction following
- How much do language models memorize?
- Reinforcement Pre-Training
- SPEED-RL: Faster Training of Reasoning Models via Online Curriculum Learning
- e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs
- 2025.06.24
- 2025.06.27
- OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling - Different base language model families, such as Llama and Qwen, exhibit divergent behaviors during post-training with reinforcement learning (RL), especially on reasoning-intensive tasks. What makes a base language model suitable for reinforcement learning?
- SynLogic - This repository contains the code and data for SynLogic, a comprehensive logical reasoning data synthesis framework that generates diverse, verifiable reasoning data at scale.
- Reasoning Gym - Reasoning Gym is a community-created Python library of procedural dataset generators and algorithmically verifiable reasoning environments for training reasoning models with reinforcement learning (RL). The goal is to generate virtually infinite training data with adjustable complexity.
- atropos - environments - This directory contains various environments for training and evaluating language models on different tasks. Each environment implements a specific task with its own input format, reward function, and evaluation metrics.
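The environment libraries above (SynLogic, Reasoning Gym, atropos environments) build on the same primitive: a procedurally generated task paired with a programmatic verifier that emits a checkable reward, i.e. RLVR-style training data of adjustable difficulty. A minimal sketch of such a generator (a hypothetical example, not code from any listed library):

```python
import random

def make_arithmetic_task(rng):
    """Procedurally generate one verifiable task: a question plus a checker."""
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    question = f"What is {a} + {b}?"

    def verify(answer: str) -> float:
        # Binary verifiable reward: 1.0 for the exact correct answer, else 0.0.
        return 1.0 if answer.strip() == str(a + b) else 0.0

    return question, verify

# Seeded generator -> reproducible, effectively unlimited training tasks.
rng = random.Random(0)
q, verify = make_arithmetic_task(rng)
```

Scaling difficulty (operand ranges, number of operations, nesting) and adding new task families with their own verifiers is what turns a toy like this into a full procedural reasoning gym.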
- ✨ LLM Reasoning: Curated Insights
- LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!
- LLM Post-Training: A Deep Dive into Reasoning Large Language Models
- SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models
- ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates
- A Minimalist Approach to Offline Reinforcement Learning
- Training Language Models to Reason Efficiently
- Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search
- Medical
- Math
- [R1 - distill] OpenR1-Math-220k
- [R1 - distill] s1K-1.1
- [R1 - distill] OpenThoughts-114k
- [R1 - distill] LIMO
- [R1 - distill] NuminaMath-CoT
- [Llama-70B - distill] natural_reasoning - license for non-commercial use
- Open Reasoning Data
- Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models