We present AdaptThink, a novel reinforcement learning (RL) algorithm that enables reasoning models to adaptively choose between Thinking and NoThinking modes according to the difficulty of each input problem, thereby achieving automatic hybrid reasoning. Specifically, the model engages in thinking only when the problem is judged to be challenging; for simpler questions, it bypasses the thinking process and directly produces a concise final solution. This approach substantially reduces inference costs while further improving overall performance.
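For intuition, here is a minimal, simplified sketch of the role of the δ hyperparameter (illustrative Python only, not the actual training objective or code from this repo): during RL, a NoThinking response earns a small bonus δ on top of its correctness reward, so the model learns to skip thinking whenever doing so does not hurt accuracy relative to the reference model.

```python
def toy_advantage(is_correct: bool, is_nothinking: bool,
                  ref_accuracy: float, delta: float) -> float:
    """Toy per-response advantage: a 0/1 correctness reward plus a `delta`
    bonus for NoThinking responses, compared against the reference model's
    instance-level accuracy on the same problem. Illustrative only."""
    reward = float(is_correct) + (delta if is_nothinking else 0.0)
    return reward - ref_accuracy

# On an easy problem (ref_accuracy = 1.0), a correct NoThinking response
# (advantage = delta) is preferred over a correct Thinking response (advantage = 0).
print(toy_advantage(True, True, ref_accuracy=1.0, delta=0.05))   # 0.05
print(toy_advantage(True, False, ref_accuracy=1.0, delta=0.05))  # 0.0
```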
We apply the AdaptThink algorithm to DeepSeek-R1-Distill-Qwen-1.5B and DeepSeek-R1-Distill-Qwen-7B with different values of the hyperparameter δ (the `delta` in the model names below).
All the trained models are available on HuggingFace.
| Name | HF Repo |
|---|---|
| AdaptThink-1.5B-delta0 | 🤗 HF Repo |
| AdaptThink-1.5B-delta0.01 | 🤗 HF Repo |
| AdaptThink-1.5B-delta0.02 | 🤗 HF Repo |
| AdaptThink-1.5B-delta0.05 | 🤗 HF Repo |
| AdaptThink-1.5B-delta0.075 | 🤗 HF Repo |
| AdaptThink-1.5B-delta0.1 | 🤗 HF Repo |
| AdaptThink-7B-delta0.05 | 🤗 HF Repo |
Our training code is based on the VeRL framework.
We use vLLM 0.8.2, which supports flash-attention.
conda create -n adapt_think python=3.10
conda activate adapt_think
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
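Optionally, you can sanity-check the environment from Python (a quick illustrative check, assuming the packages installed cleanly):

```python
# Optional sanity check: both packages should import without errors.
import vllm
import flash_attn

print(vllm.__version__)       # expected: 0.8.2
print(flash_attn.__version__)
```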
After downloading the DeepSeek models, check the chat_template in tokenizer_config.json and make sure the template ends with <｜Assistant｜><think>\n; otherwise, our code will not run correctly.
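One quick way to verify this (an illustrative snippet, not part of the repo) is to render a prompt with the tokenizer and inspect the suffix:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "1 + 1 = ?"}],
    tokenize=False,
    add_generation_prompt=True,
)
# The rendered prompt should end with "<｜Assistant｜><think>\n";
# if it does not, edit chat_template in tokenizer_config.json accordingly.
print(repr(prompt[-40:]))
```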
First, we need to pre-sample multiple responses from the reference model for each training problem to estimate its instance-level accuracy. The sampling process takes several hours. For convenience, we have released our post-processed results in ./data/train/ref_results, which can be used directly for training.
# Start a vLLM server. You can start multiple servers to accelerate pre-sampling.
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --served_model_name DeepSeek-R1-Distill-Qwen-1.5B --tensor_parallel_size 4
# Sampling 16 responses for each training problem.
python src/presampling_ref_responses.py --K 16 --dataset_path ./data/train/deepscaler.json --model_name DeepSeek-R1-Distill-Qwen-1.5B --max_tokens 16384
# Postprocess to get instance-level accuracy
python src/postprocess_ref_results.py --input_path ./data/train/ref_presampling/DeepSeek-R1-Distill-Qwen-1.5B_deepscaler_n0_K16_len16384.json --output_path ./data/train/ref_results/DeepSeek-R1-Distill-Qwen-1.5B_deepscaler_K16_len16384.json
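For reference, the instance-level accuracy is simply the fraction of the K pre-sampled responses that are correct. A minimal sketch (with a hypothetical `is_correct` checker standing in for the answer verification in src/postprocess_ref_results.py):

```python
def instance_accuracy(responses, gold_answer, is_correct) -> float:
    """Fraction of the K pre-sampled responses judged correct for one problem.
    `is_correct(response, gold_answer)` is a hypothetical stand-in for the
    actual answer-checking logic used during post-processing."""
    return sum(is_correct(r, gold_answer) for r in responses) / len(responses)

# e.g., if 12 of the K = 16 sampled responses are correct, the reference
# model's instance-level accuracy on this problem is 12 / 16 = 0.75.
```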
Then, preprocess the training dataset:
bash scripts/preprocess_dataset.sh
The training context size, batch size, and learning rate are set to 16K, 128, and 2e-6, respectively. We train the models for 1 epoch (314 steps in total). Training the 1.5B model takes about 32 hours on one 8×H800 node; training the 7B model takes about 28 hours on four 8×H800 nodes. Finally, we select the checkpoints at step 300 for the 1.5B model and step 150 for the 7B model, where accuracy and response length reach a good balance.
To speed up training, you can use a larger learning rate, such as 5e-5. However, this may make training less stable.
# 1.5b, single-node
bash scripts/run_adapt_think_1.5b_deepscaler_16k_delta0.05_btz128_lr2e-6.sh
# 7b, single-node
bash scripts/run_adapt_think_7b_deepscaler_16k_delta0.05_btz128_lr2e-6.sh
# 7b, multi-node
bash submit_mpi.sh scripts/run_adapt_think_7b_deepscaler_16k_delta0.05_btz128_lr2e-6_multinode.sh
During training, VeRL will automatically evaluate on your selected test sets every trainer.test_freq steps.
We also provide additional scripts for evaluation.
# convert checkpoint to HF model
bash scripts/convert_to_hf.sh
# eval
bash scripts/run_eval_verl_hf.sh
You can also evaluate downloaded HF models by running:
bash scripts/run_eval_hf.sh
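You can also try a downloaded checkpoint directly with vLLM. The snippet below is only a sketch: the repo id is a placeholder for whichever AdaptThink checkpoint you downloaded, and the sampling parameters are illustrative.

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "path/or/repo-of/AdaptThink-7B-delta0.05"  # placeholder: point this to your downloaded checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 15 * 23?"}],
    tokenize=False,
    add_generation_prompt=True,  # appends "<｜Assistant｜><think>\n"
)
outputs = llm.generate([prompt], SamplingParams(temperature=0.6, max_tokens=4096))

# For easy problems the model may skip thinking: the generation then starts
# with "</think>" (an empty thinking segment) followed by the final solution.
print(outputs[0].outputs[0].text)
```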
We list our evaluation results as follows:
If you find our work useful, please consider citing AdaptThink:
@article{zhang2025adapt_think,
  title={AdaptThink: LLM Can Learn When to Think},
  author={Jiajie Zhang and Nianyi Lin and Lei Hou and Ling Feng and Juanzi Li},
  journal={arXiv preprint arXiv:2505.13417},
  url={https://arxiv.org/abs/2505.13417},
  year={2025}
}