This repository serves as a collection of research notes and resources on training large language models (LLMs) and Reinforcement Learning from Human Feedback (RLHF). It focuses on the latest research, methodologies, and techniques for fine-tuning language models.
A curated list of materials providing an introduction to RL and RLHF:
- Research papers and books covering key concepts in reinforcement learning.
- Video lectures explaining the fundamentals of RLHF.
An extensive collection of state-of-the-art approaches for optimizing preferences and model alignment:
- Key techniques such as PPO, DPO, KTO, ORPO, and more.
- The latest ArXiv publications and publicly available implementations.
- Analysis of effectiveness across different optimization strategies.
This repository is designed as a reference for researchers and engineers working on reinforcement learning and large language models. If you're interested in model alignment, experiments with DPO and its variants, or alternative RL-based methods, you will find valuable resources here.
- Reinforcement Learning: An Overview
- A COMPREHENSIVE SURVEY OF LLM ALIGNMENT TECHNIQUES: RLHF, RLAIF, PPO, DPO AND MORE
- Book-Mathematical-Foundation-of-Reinforcement-Learning
- The FASTEST introduction to Reinforcement Learning on the internet
- rlhf-book
- Notes on reinforcement learning
- PPO - Proximal Policy Optimization Algorithm - OpenAI
- DPO - Direct Preference Optimization: Your Language Model is Secretly a Reward Model - Stanford
- online DPO
- KTO - KTO: Model Alignment as Prospect Theoretic Optimization
- SimPO - Simple Preference Optimization with a Reference-Free Reward - Princeton
- ORPO - Monolithic Preference Optimization without Reference Model - Kaist AI
- Sample Efficient Reinforcement Learning with REINFORCE
- REINFORCE++
- RPO Reward-aware Preference Optimization: A Unified Mathematical Framework for Model Alignment
- RLOO - Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
- GRPO
- ReMax - Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models
- DPOP - Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive
- BCO - Binary Classifier Optimization for Large Language Model Alignment
- PRIME - Process Reinforcement through Implicit Rewards
- DAPO - DAPO: an Open-Source LLM Reinforcement Learning System at Scale - introduces the Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) algorithm, released publicly to give the broader research community practical access to scalable reinforcement learning.
- VAPO - VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks
- DR-GRPO - Understanding R1-Zero-Like Training: A Critical Perspective
- KL_Cov & Clip_Cov - The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
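Many of the offline methods above reduce to a simple loss over preference pairs. As a quick reference, here is a minimal sketch of the DPO loss for a single pair (my own illustration, not code from any listed repository):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(beta * margin)).

    Inputs are the summed log-probabilities of the chosen/rejected responses
    under the policy and under the frozen reference model.
    """
    # Implicit reward margin: difference of policy-vs-reference log-ratios.
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # Numerically stable -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# When the policy equals the reference, the margin is 0 and the loss is log(2).
loss_at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)
```

The beta parameter here plays the same role as the KL coefficient in RLHF: larger beta keeps the policy closer to the reference.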
Notes for learning RL: Value Iteration -> Q Learning -> DQN -> REINFORCE -> Policy Gradient Theorem -> TRPO -> PPO
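The learning path above runs through REINFORCE and the policy gradient theorem; the core update, grad log pi(a) * advantage, fits in a few lines. A minimal REINFORCE-with-baseline sketch on a toy bandit (an illustrative example, not from any listed course):

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def train_bandit(arm_means, steps=2000, lr=0.1, seed=0):
    """REINFORCE on a Gaussian bandit with a running-average baseline."""
    rng = random.Random(seed)
    logits = [0.0] * len(arm_means)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        a = rng.choices(range(len(arm_means)), weights=probs)[0]
        r = rng.gauss(arm_means[a], 0.1)
        baseline += 0.05 * (r - baseline)      # variance-reducing baseline
        adv = r - baseline
        for i in range(len(logits)):           # gradient of log softmax(a)
            grad = (1.0 - probs[i]) if i == a else -probs[i]
            logits[i] += lr * adv * grad
    return softmax(logits)

probs = train_bandit([0.1, 0.9, 0.3])  # policy should concentrate on arm 1
```

Swapping the running-average baseline for a learned state-value function is exactly the step from REINFORCE toward actor-critic methods and PPO.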
- CS234: Reinforcement Learning Winter 2025
- CS285 Deep Reinforcement Learning
- Welcome to Spinning Up in Deep RL
- deep-rl-course from Huggingface
- RL Course by David Silver
- Reinforcement Learning from Human Feedback explained with math derivations and the PyTorch code.
- Direct Preference Optimization (DPO) explained: Bradley-Terry model, log probabilities, math
- GRPO vs PPO
- Unraveling RLHF and Its Variants: Progress and Practical Engineering Insights
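A key difference covered in the GRPO-vs-PPO comparisons above is that GRPO drops the learned value baseline and instead normalizes each rollout's reward against its own sampling group. A minimal sketch of that advantage computation (illustrative, not code from any framework):

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as in GRPO: z-score each rollout's reward
    within its group, so no separate value model is needed."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts for the same prompt, two correct (reward 1) and two wrong.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the advantages are computed per prompt group, they always sum to roughly zero: correct rollouts are pushed up exactly as much as incorrect ones are pushed down.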
- VERL - Volcano Engine Reinforcement Learning for LLMs
- OpenRLHF - OpenRLHF is the first easy-to-use, high-performance open-source RLHF framework built on Ray, vLLM, ZeRO-3 and HuggingFace Transformers, designed to make RLHF training simple and accessible
- TRL - TRL is a full stack library where we provide a set of tools to train transformer language models with methods like Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO), Direct Preference Optimization (DPO), Reward Modeling, and more.
- Nemo-RL - Nemo RL: A Scalable and Efficient Post-Training Library
- ROLL - Large scale training with megatron support, a feature-rich codebase from Alibaba
- RL2 - Ray Less Reinforcement Learning. The NanoGPT of RL with its small and hackable size (<1k lines)
- AReal - AReaL (Ant Reasoning RL): LLM generation runs in a streaming manner, with each rollout worker continuously producing outputs without waiting
- OAT - Oat 🌾 is a simple yet efficient framework for running online LLM alignment algorithms.
- LlamaRL - Meta GenAI - LlamaRL: A Distributed Asynchronous Reinforcement Learning Framework for Efficient Large-scale LLM Training
- Verifiers: Reinforcement Learning with LLMs in Verifiable Environments - verifiers is a set of tools and abstractions for training LLMs with reinforcement learning in verifiable multi-turn environments via Group-Relative Policy Optimization.
- RAGEN - RAGEN (Reasoning AGENt, pronounced like "region") leverages reinforcement learning (RL) to train LLM reasoning agents in interactive, stochastic environments.
- ART - Agent Reinforcement Trainer - ART is an open-source reinforcement training library for improving LLM performance in agentic workflows. ART utilizes the powerful GRPO reinforcement learning algorithm to train models from their own experiences.
- Atropos - Nous Research's LLM RL Gym - Atropos is an environment microservice framework for async RL with LLMs.
- slime - slime is an LLM post-training framework for RL scaling
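Most of the PPO-style frameworks above shape per-token rewards the same way: a KL penalty against a frozen reference model at every token, plus the scalar preference/verifier reward added at the final token. A minimal sketch of that common pattern (an illustration of the idea, not the API of any listed framework):

```python
def shaped_rewards(policy_logps, ref_logps, terminal_reward, kl_coef=0.05):
    """Per-token rewards for PPO-style RLHF.

    policy_logps / ref_logps: per-token log-probs of the generated response
    under the current policy and the frozen reference model.
    """
    # KL penalty term at every token keeps the policy near the reference.
    rewards = [-kl_coef * (p - r) for p, r in zip(policy_logps, ref_logps)]
    # Sequence-level reward (reward model score or verifier) at the last token.
    rewards[-1] += terminal_reward
    return rewards
```

Tokens where the policy is more confident than the reference (p > r) get a small negative reward, which is what prevents reward hacking from drifting the model arbitrarily far from its starting point.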
- DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL
- On the Emergence of Thinking in LLMs I: Searching for the Right Intuition
- s1: Simple test-time scaling and s1.1
- The 37 Implementation Details of Proximal Policy Optimization
- Online-DPO-R1: Unlocking Effective Reasoning Without the PPO Overhead (paper and GitHub)
- How to align open LLMs in 2025 with DPO & synthetic data
- DeepSeek-R1 -> The Illustrated DeepSeek-R1; DeepSeek R1's recipe to replicate o1 and the future of reasoning LMs; DeepSeek R1 and R1-Zero Explained
- 2025.03.23
- 2025.02.22
- Small Models Struggle to Learn from Strong Reasoners
- Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning
- LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization
- Open Reasoner Zero An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
- 2025.06.22
- Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
- POLARIS: A POst-training recipe for scaling reinforcement Learning on Advanced ReasonIng modelS - data difficulty, Diversity-Based Rollout, Inference-Time Length, Exploration Efficiency.
- ProtoReasoning: Prototypes as the Foundation for Generalizable Reasoning in LLMs - cross-domain generalization arises from shared abstract reasoning prototypes — fundamental reasoning patterns that capture the essence of problems across domains.
- Truncated Proximal Policy Optimization - Truncated Proximal Policy Optimization (T-PPO), a novel extension to PPO that improves training efficiency by streamlining policy update and length-restricted response generation. T-PPO mitigates the issue of low hardware utilization, an inherent drawback of fully synchronized long-generation procedures, where resources often sit idle during the waiting periods for complete rollouts.
- GRESO - Act Only When It Pays: Efficient Reinforcement Learning for LLM Reasoning via Selective Rollouts - GRESO is a lightweight pre-rollout filter that skips uninformative prompts using reward dynamics, saving RL training time without hurting accuracy.
- AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy - In this work, we investigate the synergy between supervised fine-tuning (SFT) and reinforcement learning (RL) in developing strong reasoning models. We begin by curating the SFT training data through two scaling strategies: increasing the number of collected prompts and the number of generated responses per prompt.
- Self-Adapting Language Models - We introduce Self-Adapting LLMs (SEAL) 🦭, a framework that enables LLMs to self-adapt by generating their own finetuning data and update directives. Given a new input, the model produces a self-edit — a generation that may restructure the information in different ways, specify optimization hyperparameters, or invoke tools for data augmentation and gradient-based updates.
- VeriFree: Reinforcing General Reasoning without Verifiers
- Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning - This repository contains the source code for the experiments in paper Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning. We implemented GRPO-PODS (Policy Optimization with Down-Sampling) and compared its performance with vanilla GRPO.
- Spurious Rewards: Rethinking Training Signals in RLVR - We show that you can do RLVR on Qwen2.5-Math models with completely random or incorrect rewards, and still get massive math benchmark gains.
- Towards a More Efficient Reasoning LLM: AIMO2 Solution Summary and Introduction to Fast-Math Models
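Several entries above (GRESO, PODS-style down-sampling) share one theme: spend rollout and gradient compute only where it is informative. A simple heuristic in that spirit, which keeps the rollouts with the most extreme rewards, can be sketched as follows (an illustration of the idea, not the papers' exact procedures):

```python
def downsample_rollouts(rollouts, rewards, k):
    """Keep the k rollouts with the most extreme rewards: half from the
    lowest-reward end and half from the highest-reward end.

    The intuition is that near-average rollouts produce near-zero advantages
    and thus contribute little gradient signal.
    """
    order = sorted(range(len(rewards)), key=lambda i: rewards[i])
    keep = order[: k // 2] + order[-(k - k // 2):]
    # Preserve the original rollout order for the kept subset.
    return [rollouts[i] for i in sorted(keep)]

kept = downsample_rollouts(["a", "b", "c", "d", "e", "f"], [3, 1, 2, 6, 5, 4], 4)
```

With group-normalized advantages (as in GRPO), dropping mid-reward rollouts this way reduces per-step compute while keeping most of the gradient's magnitude.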
- 2025.06.23
- Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers - Large language models (LLMs) can acquire new knowledge through fine-tuning, but this process exhibits a puzzling duality: models can generalize remarkably from new facts, yet are also prone to hallucinating incorrect information.
- The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning - Reinforcement learning with verifiable rewards (RLVR) is a promising approach for training language models (LMs) on reasoning tasks that elicit emergent long chains of thought (CoTs).
- Thinker: Learning to Think Fast and Slow - Recent studies show that the reasoning capabilities of Large Language Models (LLMs) can be improved by applying Reinforcement Learning (RL) to question-answering (QA) tasks in areas such as math and coding. With a long context length, LLMs may learn to perform search, as indicated by the self-correction behavior observed in DeepSeek R1.
- OpenThoughts3 - A new SOTA Reasoning Data Recipe
- BLEUBERI: BLEU is a surprisingly effective reward for instruction following
- How much do language models memorize?
- Reinforcement Pre-Training
- SPEED-RL: Faster Training of Reasoning Models via Online Curriculum Learning
- e3: Learning to Explore Enables Extrapolation of Test-Time Compute for LLMs
- 2025.06.24
- 2025.06.27
- OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling - Different base language model families, such as Llama and Qwen, exhibit divergent behaviors during post-training with reinforcement learning (RL), especially on reasoning-intensive tasks. What makes a base language model suitable for reinforcement learning?
- SynLogic - This repository contains the code and data for SynLogic, a comprehensive logical reasoning data synthesis framework that generates diverse, verifiable reasoning data at scale.
- Reasoning Gym - Reasoning Gym is a community-created Python library of procedural dataset generators and algorithmically verifiable reasoning environments for training reasoning models with reinforcement learning (RL). The goal is to generate virtually infinite training data with adjustable complexity.
- atropos - environments - This directory contains various environments for training and evaluating language models on different tasks. Each environment implements a specific task with its own input format, reward function, and evaluation metrics.
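The environment libraries above (SynLogic, Reasoning Gym, atropos environments) build on the same primitive: a procedurally generated task paired with a programmatic verifier that emits a checkable reward, i.e. RLVR-style training data of adjustable difficulty. A minimal sketch of such a generator (a hypothetical example, not code from any listed library):

```python
import random

def make_arithmetic_task(rng):
    """Procedurally generate one verifiable task: a question plus a checker."""
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    question = f"What is {a} + {b}?"

    def verify(answer: str) -> float:
        # Binary verifiable reward: 1.0 for the exact correct answer, else 0.0.
        return 1.0 if answer.strip() == str(a + b) else 0.0

    return question, verify

# Seeded generator -> reproducible, effectively unlimited training tasks.
rng = random.Random(0)
q, verify = make_arithmetic_task(rng)
```

Scaling difficulty (operand ranges, number of operations, nesting) and adding new task families with their own verifiers is what turns a toy like this into a full procedural reasoning gym.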
- ✨ LLM Reasoning: Curated Insights
- LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!
- LLM Post-Training: A Deep Dive into Reasoning Large Language Models
- SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models
- ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates
- A Minimalist Approach to Offline Reinforcement Learning
- Training Language Models to Reason Efficiently
- Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search
- Medical
- Math
- [R1 - distill] OpenR1-Math-220k
- [R1 - distill] s1K-1.1
- [R1 - distill] OpenThoughts-114k
- [R1 - distill] LIMO
- [R1 - distill] NuminaMath-CoT
- [Llama-70B - distill] natural_reasoning - license for non-commercial use
- Open Reasoning Data
- Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models