THU-KEG/AdaptThink


AdaptThink: LLM Can Learn When to Think

🤗 HF Repo • 📃 Paper

🤖️ AdaptThink

We present AdaptThink, a novel reinforcement learning (RL) algorithm that enables reasoning models to adaptively choose between Thinking and NoThinking modes according to the difficulty of each input problem, thereby achieving automatic hybrid reasoning. Specifically, the model engages in thinking only when it judges the problem to be challenging; for simpler questions, it bypasses the thinking process and directly produces a concise final solution. This approach substantially reduces inference costs while further improving overall performance.

image

⚙️ Released Models

All Available Datasets and Models

We apply the AdaptThink algorithm to DeepSeek-R1-Distill-Qwen-1.5B with $\delta$ ranging from 0 to 0.1, and to DeepSeek-R1-Distill-Qwen-7B with $\delta=0.05$. A larger $\delta$ yields a higher proportion of NoThinking responses, which cuts inference costs further but also shrinks the resulting accuracy improvement.
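One way to picture the effect of $\delta$ is as a constant bonus granted to NoThinking responses, added to a correctness reward that is measured against the reference model's accuracy on the same problem. The sketch below illustrates that reading only; it is an assumption made for explanation, not code from this repository, and the function name `adaptive_advantage` and its arguments are hypothetical.

```python
def adaptive_advantage(is_nothinking, reward, ref_accuracy, delta):
    """Illustrative advantage: reward relative to the reference model,
    plus a bonus of `delta` when the response skips thinking.

    ASSUMPTION: this form is a simplified reading of how delta trades
    accuracy against NoThinking usage; it is not the repository's code.
    """
    bonus = delta if is_nothinking else 0.0
    return bonus + reward - ref_accuracy


# Easy problem: the reference model already solves it 90% of the time.
easy_nothinking = adaptive_advantage(True, reward=1.0, ref_accuracy=0.9, delta=0.05)
easy_thinking = adaptive_advantage(False, reward=1.0, ref_accuracy=0.9, delta=0.05)
print(easy_nothinking > easy_thinking)
```

Under this assumption, a correct NoThinking response always scores `delta` higher than an equally correct Thinking response, so raising $\delta$ tilts the policy toward NoThinking, while a wrong NoThinking response on an easy problem is still penalized heavily.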

All the trained models are available on HuggingFace.

| Name | HF Repo |
| --- | --- |
| AdaptThink-1.5B-delta0 | 🤗 HF Repo |
| AdaptThink-1.5B-delta0.01 | 🤗 HF Repo |
| AdaptThink-1.5B-delta0.02 | 🤗 HF Repo |
| AdaptThink-1.5B-delta0.05 | 🤗 HF Repo |
| AdaptThink-1.5B-delta0.075 | 🤗 HF Repo |
| AdaptThink-1.5B-delta0.1 | 🤗 HF Repo |
| AdaptThink-7B-delta0.05 | 🤗 HF Repo |

🔥 Training

Our training code is based on the VeRL framework.

1. Creating Environment

We use vLLM 0.8.2, which supports flash-attention.

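conda create -n adapt_think python=3.10
# Activate the new environment so the packages below are installed into it
conda activate adapt_think
pip install -r requirements.txt
pip install flash-attn --no-build-isolation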

2. Check the chat template in HF models

After downloading the DeepSeek models, check chat_template in tokenizer_config.json and make sure the template ends with <|Assistant|><think>\\n; otherwise, our code will not run correctly.
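This check can be scripted in a few lines of Python. The sketch below assumes `chat_template` is stored as a plain string in `tokenizer_config.json`, and it only looks for the `<think>` marker near the end of the template, since the template may close with Jinja control tags after the generation prompt. The helper name `template_tail_has_marker` and the 80-character window are illustrative choices, not part of this repository.

```python
import json
from pathlib import Path


def template_tail_has_marker(config_path, marker="<think>", tail_chars=80):
    """Return True if `marker` appears in the last `tail_chars` characters
    of the `chat_template` string in a tokenizer_config.json file."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    template = config.get("chat_template")
    if not isinstance(template, str):
        # Missing or non-string template: treat it as a failed check.
        return False
    return marker in template[-tail_chars:]
```

For example, calling `template_tail_has_marker("/path/to/model/tokenizer_config.json")` (the path is a placeholder) returns `False` when the template has no `<think>` prompt near its end, which is the case that needs a manual fix before training.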

3. Pre-sampling from reference models

First, we need to pre-sample multiple responses from the reference model for each training problem to evaluate its instance-level accuracy. The sampling process will take several hours. For convenience, we have released our post-processed results in ./data/train/ref_results, which can be directly used for training.

# Start a vLLM server. You can start multiple servers to accelerate pre-sampling.
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --served_model_name DeepSeek-R1-Distill-Qwen-1.5B --tensor_parallel_size 4

# Sample 16 responses for each training problem.
python src/presampling_ref_responses.py --K 16 --dataset_path ./data/train/deepscaler.json --model_name DeepSeek-R1-Distill-Qwen-1.5B --max_tokens 16384

# Postprocess to get instance-level accuracy
python src/postprocess_ref_results.py --input_path ./data/train/ref_presampling/DeepSeek-R1-Distill-Qwen-1.5B_deepscaler_n0_K16_len16384.json --output_path ./data/train/ref_results/DeepSeek-R1-Distill-Qwen-1.5B_deepscaler_K16_len16384.json
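Instance-level accuracy can be read as the fraction of the K sampled responses that are correct for a given problem. The sketch below shows that computation on a hypothetical in-memory format (a mapping from problem ID to a list of per-response correctness flags); the actual input and output layout used by `src/postprocess_ref_results.py` may differ.

```python
def instance_accuracy(correct_flags):
    """Fraction of sampled responses that are correct for one problem."""
    if not correct_flags:
        raise ValueError("need at least one sampled response")
    return sum(1 for flag in correct_flags if flag) / len(correct_flags)


def per_problem_accuracy(results):
    """Map each problem ID to its instance-level accuracy.

    `results` is a hypothetical format: {problem_id: [bool, bool, ...]}.
    """
    return {pid: instance_accuracy(flags) for pid, flags in results.items()}


# Toy example with K = 16 samples per problem.
toy = {
    "easy": [True] * 16,
    "medium": [True] * 12 + [False] * 4,
    "hard": [False] * 16,
}
print(per_problem_accuracy(toy))
```

With K = 16, accuracies are multiples of 1/16, so the toy problems above score 1.0, 0.75, and 0.0 respectively.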

4. Preprocess training and test datasets

bash scripts/preprocess_dataset.sh

5. Training

The training context size, batch size, and learning rate are set to 16K, 128, and 2e-6, respectively. We train the models for 1 epoch, which is 314 steps in total. For the 1.5B model, we use one 8*H800 node, and training takes about 32 hours. For the 7B model, we use four 8*H800 nodes, and training takes about 28 hours. Finally, we select the checkpoints at steps 300 and 150 for the 1.5B and 7B models, respectively, where the models' accuracy and response lengths achieve a good balance.
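These figures imply a few derived quantities that are handy when budgeting a run. The snippet below is plain arithmetic on the numbers quoted in this section; the variable names are illustrative, it assumes each step consumes one batch of 128 problems, and GPU-hours are computed simply as the number of GPUs times wall-clock hours.

```python
BATCH_SIZE = 128
STEPS_PER_EPOCH = 314
GPUS_PER_NODE = 8

# Problems consumed in one epoch, assuming one full batch per step.
samples_per_epoch = BATCH_SIZE * STEPS_PER_EPOCH


def gpu_hours(nodes, hours, gpus_per_node=GPUS_PER_NODE):
    """Total GPU-hours for a run on `nodes` nodes lasting `hours` hours."""
    return nodes * gpus_per_node * hours


gpu_hours_1_5b = gpu_hours(nodes=1, hours=32)  # 1.5B model, one node
gpu_hours_7b = gpu_hours(nodes=4, hours=28)    # 7B model, four nodes

print(samples_per_epoch, gpu_hours_1_5b, gpu_hours_7b)
```

This works out to 40,192 problems per epoch, about 256 H800 GPU-hours for the 1.5B model, and about 896 for the 7B model.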

To speed up training, you can set a larger learning rate, such as 5e-5. However, this may make training less stable.

# 1.5b, single-node
bash scripts/run_adapt_think_1.5b_deepscaler_16k_delta0.05_btz128_lr2e-6.sh

# 7b, single-node
bash scripts/run_adapt_think_7b_deepscaler_16k_delta0.05_btz128_lr2e-6.sh

# 7b, multi-node
bash submit_mpi.sh scripts/run_adapt_think_7b_deepscaler_16k_delta0.05_btz128_lr2e-6_multinode.sh

📊 Evaluation

During training, VeRL automatically evaluates the model on your selected test sets every trainer.test_freq steps.
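As a rough illustration of what `trainer.test_freq` controls, the sketch below lists the steps at which evaluation would run, assuming evaluation fires whenever the global step is a multiple of `test_freq`. This is a simplification: VeRL's exact scheduling (for example, any extra evaluation before training or at the final step) is not modeled, and `evaluation_steps` is a hypothetical helper.

```python
def evaluation_steps(total_steps, test_freq):
    """Steps (1-based) at which evaluation runs, assuming it triggers
    whenever the step number is a multiple of `test_freq`."""
    if test_freq <= 0:
        # A non-positive frequency is treated here as "never evaluate".
        return []
    return [step for step in range(1, total_steps + 1) if step % test_freq == 0]


# One epoch of 314 steps, evaluating every 50 steps.
print(evaluation_steps(314, 50))
```

With 314 steps and a frequency of 50, this yields evaluations at steps 50 through 300, which gives a sense of how often test-set accuracy and response length can be compared when picking a checkpoint.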

We also provide additional scripts for evaluation.

# convert checkpoint to HF model
bash scripts/convert_to_hf.sh

# eval
bash scripts/run_eval_verl_hf.sh

You can also evaluate downloaded HF models by running:

bash scripts/run_eval_hf.sh

We list our evaluation results as follows:

1. Comparison with existing methods for efficient reasoning on mathematics datasets

image

2. NoThinking response ratio and accuracy across different difficulty levels on MATH500

image

3. Comparison of different $\delta$ values

image

4. Evaluation results on MMLU

image

🧐 Cases

Simple problem

image image

Difficult problem

image

📝 Citation

If you find our work useful, please consider citing AdaptThink:

@article{zhang2025adapt_think,
  title={AdaptThink: LLM Can Learn When to Think},
  author={Jiajie Zhang and Nianyi Lin and Lei Hou and Ling Feng and Juanzi Li},
  journal={arXiv preprint arXiv:2505.13417},
  url={https://arxiv.org/abs/2505.13417},
  year={2025}
}
