
Reasoning Model and RLHF Research Notes

This repository serves as a collection of research notes and resources on training large language models (LLMs) and Reinforcement Learning from Human Feedback (RLHF). It focuses on the latest research, methodologies, and techniques for fine-tuning language models.

Repository Contents

Reinforcement Learning and RLHF Overview

A curated list of materials providing an introduction to RL and RLHF:

  • Research papers and books covering key concepts in reinforcement learning.
  • Video lectures explaining the fundamentals of RLHF.

Methods for LLM Training

An extensive collection of state-of-the-art approaches for optimizing preferences and model alignment:

  • Key techniques such as PPO, DPO, KTO, ORPO, and more.
  • The latest ArXiv publications and publicly available implementations.
  • Analysis of effectiveness across different optimization strategies.
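To make one of these methods concrete, here is a minimal sketch of the DPO loss on a single preference pair. It assumes per-sequence log-probabilities have already been computed under the trainable policy and a frozen reference model; the function name and numbers are illustrative, not taken from any repository listed here.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are summed token log-probabilities of the chosen and
    rejected responses under the policy and under a frozen
    reference model; beta scales the implicit KL penalty.
    """
    # Implicit reward margin: how much more the policy prefers the
    # chosen response over the reference, minus the same quantity
    # for the rejected response.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid: shrinks as the margin grows positive.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss falls as the policy's preference margin improves.
loss_good = dpo_loss(-10.0, -14.0, -12.0, -12.0)  # margin > 0
loss_bad = dpo_loss(-14.0, -10.0, -12.0, -12.0)   # margin < 0
```

Note that only log-probability sums are needed, which is why DPO avoids training a separate reward model.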

Purpose of this Repository

This repository is designed as a reference for researchers and engineers working on reinforcement learning and large language models. If you're interested in model alignment, experiments with DPO and its variants, or alternative RL-based methods, you will find valuable resources here.

RL overview

Methods for LLM training

Minimal implementation

  • DPO

Tutorials

Notes for learning RL: Value Iteration -> Q Learning -> DQN -> REINFORCE -> Policy Gradient Theorem -> TRPO -> PPO
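The first stop on that path, value iteration, fits in a few lines. Below is a hedged sketch on a toy two-state MDP; the states, actions, rewards, and transitions are invented for illustration and are not from the linked notes.

```python
# Value iteration on a tiny deterministic MDP: two states and two
# actions, where "stay" yields reward 0 and "move" yields reward 1
# while flipping the state.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = ["stay", "move"]

def step(state, action):
    """Return (next_state, reward) for a deterministic transition."""
    if action == "move":
        return 1 - state, 1.0
    return state, 0.0

def value_iteration(tol=1e-8):
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            # Bellman optimality backup: best one-step lookahead.
            best = max(r + GAMMA * V[s2]
                       for s2, r in (step(s, a) for a in ACTIONS))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V

V = value_iteration()
```

Since the optimal policy always moves, both states converge to V*(s) = 1 / (1 - gamma) = 10, which makes the toy easy to check by hand.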

RLHF training techniques explained

Training frameworks

  • VERL - Volcano Engine Reinforcement Learning for LLMs
  • OpenRLHF - an easy-to-use, high-performance open-source RLHF framework built on Ray, vLLM, ZeRO-3, and Hugging Face Transformers, designed to make RLHF training simple and accessible
  • TRL - a full-stack library providing tools to train transformer language models with methods such as Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO), Direct Preference Optimization (DPO), reward modeling, and more
  • Nemo-RL - Nemo RL: A Scalable and Efficient Post-Training Library
  • ROLL - large-scale training with Megatron support; a feature-rich codebase from Alibaba
  • RL2 - Ray-less reinforcement learning; the NanoGPT of RL, with its small and hackable codebase (<1k lines)
  • AReaL (Ant Reasoning RL) - LLM generation runs in a streaming manner, with each rollout worker continuously producing outputs without waiting
  • OAT - Oat 🌾 is a simple yet efficient framework for running online LLM alignment algorithms.
  • LlamaRL - Meta GenAI's distributed asynchronous reinforcement learning framework for efficient large-scale LLM training
  • Verifiers: Reinforcement Learning with LLMs in Verifiable Environments - a set of tools and abstractions for training LLMs with reinforcement learning in verifiable multi-turn environments via Group-Relative Policy Optimization (GRPO)
  • RAGEN - RAGEN (Reasoning AGENt, pronounced like "region") leverages reinforcement learning (RL) to train LLM reasoning agents in interactive, stochastic environments.
  • ART - Agent Reinforcement Trainer - ART is an open-source reinforcement training library for improving LLM performance in agentic workflows. ART utilizes the powerful GRPO reinforcement learning algorithm to train models from their own experiences.
  • Atropos - Nous Research's LLM RL Gym - Atropos is an environment microservice framework for async RL with LLMs.
  • slime - an LLM post-training framework for RL scaling

RLHF method implementations (only those with detailed explanations)

Articles

Books

Thinking process

Model case studies

Dataset generators and verifiers

  • SynLogic - This repository contains the code and data for SynLogic, a comprehensive logical reasoning data synthesis framework that generates diverse, verifiable reasoning data at scale.
  • Reasoning Gym - Reasoning Gym is a community-created Python library of procedural dataset generators and algorithmically verifiable reasoning environments for training reasoning models with reinforcement learning (RL). The goal is to generate virtually infinite training data with adjustable complexity.
  • atropos - environments - This directory contains various environments for training and evaluating language models on different tasks. Each environment implements a specific task with its own input format, reward function, and evaluation metrics.
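The common pattern behind these generators is a pair of functions: one produces a problem together with its ground truth, and one verifies a candidate answer to produce a reward. A hypothetical minimal example in that spirit (not taken from any of the repositories above):

```python
import random

def generate_task(rng, max_terms=4, max_value=20):
    """Generate an addition problem and its ground-truth answer."""
    terms = [rng.randint(1, max_value)
             for _ in range(rng.randint(2, max_terms))]
    prompt = " + ".join(map(str, terms)) + " = ?"
    return {"prompt": prompt, "answer": sum(terms)}

def verify(task, candidate):
    """Reward 1.0 iff the candidate parses to the exact answer."""
    try:
        return 1.0 if int(str(candidate).strip()) == task["answer"] else 0.0
    except ValueError:
        return 0.0

rng = random.Random(0)  # seeded so generated tasks are reproducible
task = generate_task(rng)
```

Because the verifier is algorithmic, such tasks can be generated in effectively unlimited quantity with adjustable difficulty, which is exactly the property RL training on reasoning data relies on.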

Repos

Articles

Papers

Open-source projects reproducing DeepSeek-R1

Datasets - thinking models

Evaluation and benchmarks
