[ICML 2025] MME-CoT: Benchmarking Chain-of-Thought in LMMs for Reasoning Quality, Robustness, and Efficiency
Official repository for "MME-CoT: Benchmarking Chain-of-Thought in LMMs for Reasoning Quality, Robustness, and Efficiency".
For more details, please refer to the project page with dataset exploration and visualization tools.
[Project Page] [Paper] [Huggingface Dataset] [Leaderboard] [Visualization]
- [2025.05.01] MME-CoT is accepted by ICML 2025.
- [2025.03.29] We have integrated MME-CoT into lmms-eval. Thanks to Luodian!
- [2025.03.08] We have integrated MME-CoT into VLMEvalKit.
- [2025.02.14] We are very proud to launch MME-CoT, the first-ever comprehensive CoT evaluation benchmark for LMMs in visual reasoning! We release the arXiv paper and all data samples in the Huggingface dataset.
Answering questions with Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs), yet its impact on Large Multimodal Models (LMMs) still lacks a systematic assessment and in-depth investigation.
In this paper, we introduce MME-CoT, a specialized benchmark evaluating the CoT reasoning performance of LMMs, spanning six domains: math, science, OCR, logic, space-time, and general scenes. As the first comprehensive study in this area, we propose a thorough evaluation suite incorporating three novel metrics that assess the reasoning quality, robustness, and efficiency at a fine-grained level.
Leveraging curated high-quality data and a unique evaluation strategy, we conduct an in-depth analysis of state-of-the-art LMMs, uncovering several key insights: (1) Models with reflection mechanisms demonstrate superior CoT quality, with Kimi k1.5 outperforming GPT-4o and achieving the highest-quality results; (2) CoT prompting often degrades LMM performance on perception-heavy tasks, suggesting a potentially harmful overthinking behavior; (3) Although their CoT quality is high, LMMs with reflection exhibit significant inefficiency in both the normal response and self-correction phases. We hope MME-CoT serves as a foundation for advancing multimodal reasoning in LMMs.
We support running inference on MME-CoT with lmms-eval. You can then run the evaluation of each metric as detailed in the Eval section.
Please first install lmms-eval as demonstrated in its official GitHub repo here.
Then, run the inference with the CoT prompt (needed for: Precision, Recall, Stability, Efficacy, Reflection Quality, and Relevance Rate):
accelerate launch --num_processes=8 --main_process_port=12345 -m lmms_eval \
--model TESTED_MODEL \
--model_args=pretrained=TESTED_MODEL_NAME \
--tasks mme_cot_reason \
--batch_size 1 --log_samples --log_samples_suffix output_cot --output_path ./logs/
Run the inference with the Direct prompt (needed for: Stability and Efficacy):
accelerate launch --num_processes=8 --main_process_port=12345 -m lmms_eval \
--model TESTED_MODEL \
--model_args=pretrained=TESTED_MODEL_NAME \
--tasks mme_cot_direct \
--batch_size 1 --log_samples --log_samples_suffix output_dir --output_path ./logs/
Then, convert the output JSON file into the format expected by the evaluation scripts, as illustrated here:
cd tasks/mme_cot
# For CoT prompt
python tools/update_lmmseval_json.py \
--lmms_eval_json_path mmecot_reasoning_test_for_submission.json \
--save_path results/json/YOUR_MODEL_NAME_cot.json
# For direct prompt
python tools/update_lmmseval_json.py \
--lmms_eval_json_path mmecot_direct_test_for_submission.json \
--save_path results/json/YOUR_MODEL_NAME_dir.json
Finally, run the evaluation illustrated below.
We also support running inference on MME-CoT with VLMEvalKit. You can then run the evaluation of each metric as detailed in the Eval section.
Please first install VLMEvalKit as demonstrated in its official GitHub repo here.
Then, run the inference with the CoT prompt (needed for: Precision, Recall, Stability, Efficacy, Reflection Quality, and Relevance Rate):
USE_COT_PROMPT=1 \
python run.py \
--data MME_CoT_TEST \
--model TESTED_MODEL \
--verbose \
--work-dir cot_results
Run the inference with the Direct prompt (needed for: Stability and Efficacy):
USE_COT_PROMPT=0 \
python run.py \
--data MME_CoT_TEST \
--model TESTED_MODEL \
--verbose \
--work-dir direct_results
Rename the result file `MODELNAME_MME_CoT_TEST.xlsx` to either `MODELNAME_MME_CoT_TEST_cot.xlsx` or `MODELNAME_MME_CoT_TEST_dir.xlsx`, depending on the prompt used.
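If you prefer to script this step, here is a minimal Python sketch of the renaming; the file locations are assumptions based on the work directories used above and may differ in your VLMEvalKit output layout:

```python
from pathlib import Path

# Hypothetical helper: append a suffix to a VLMEvalKit result file so the
# evaluation scripts can tell the CoT and Direct runs apart. Adjust the
# paths below to wherever VLMEvalKit actually wrote the xlsx files.
def add_suffix(xlsx_path: str, suffix: str) -> None:
    src = Path(xlsx_path)
    src.rename(src.with_name(f"{src.stem}_{suffix}{src.suffix}"))

add_suffix("cot_results/MODELNAME_MME_CoT_TEST.xlsx", "cot")     # CoT prompt run
add_suffix("direct_results/MODELNAME_MME_CoT_TEST.xlsx", "dir")  # Direct prompt run
```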
Finally, run the evaluation illustrated below.
To calculate the six metrics (precision, recall, efficacy, stability, relevance rate, and reflection quality), please follow these steps:
- Install the required packages.
pip install -r requirements.txt
- Format the model answer.
  - If you evaluate with lmms-eval, please follow the instructions above to convert the output into a valid JSON format.
  - If you evaluate with VLMEvalKit, you can directly use the output xlsx file.
  - We also provide examples in `results/xlsx` (the output from VLMEvalKit) and `results/json`. The JSON file should be in JSONL format, with the answer to each question on its own line. All the other information about the question in the dataset should be preserved in that line. The suffix `_cot.json` denotes answers produced with the CoT prompt, and `_dir.json` denotes answers produced with the direct prompt.
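For reference, a minimal sketch of loading one of these JSONL answer files; it only inspects the keys, so it does not assume any specific field names beyond the file-path convention above:

```python
import json

# Minimal sketch: read a converted answer file (JSONL, one record per line).
# The path matches the save path used in the lmms-eval conversion step above;
# which keys each record contains depends on the dataset fields plus the
# model-answer field, so no specific schema is assumed here.
with open("results/json/YOUR_MODEL_NAME_cot.json", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(len(records))       # number of answered questions
print(records[0].keys())  # dataset fields preserved in each line, plus the model answer
```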
- Run the evaluation script.
You can either run the metrics one by one. For example, to evaluate recall:
bash scripts/recall.sh
Simply change `YOUR_MODEL_NAME` and the data path in the `recall.sh` file.
Or you can run all the metrics for all the models in one directory with:
python batch_scripts/run_all.py --result_dir results/xlsx
After GPT evaluation, you are expected to obtain a `cache/` directory like this:
```
cache
├── recall
│   └── YOUR_MODEL_NAME
│       ├── 1.json
│       ├── 2.json
│       └── ...
├── precision
│   └── YOUR_MODEL_NAME
├── relevance_rate
│   └── YOUR_MODEL_NAME
├── reflection_quality
│   └── YOUR_MODEL_NAME
├── extract
│   ├── YOUR_MODEL_NAME_dir
│   └── YOUR_MODEL_NAME_cot
└── judge
    ├── YOUR_MODEL_NAME_dir
    └── YOUR_MODEL_NAME_cot
```
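If you want to verify that every metric actually produced cache entries, a minimal sketch that only counts the cached files per metric directory shown above (a convenience check, not part of the official pipeline):

```python
from pathlib import Path

# Count the cached per-question judgment files for each metric directory.
cache = Path("cache")
for metric_dir in sorted(p for p in cache.iterdir() if p.is_dir()):
    num_files = sum(1 for _ in metric_dir.rglob("*.json"))
    print(f"{metric_dir.name}: {num_files} cached files")
```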
Note that if your model does not include a reflection process, you do not need to run `reflection_quality.sh`. The metric calculation script below will handle that automatically.
- Calculate the metrics.
We cache the evaluation results of all the questions in the cache dir. Here we read the results from the cache dir and calculate the metrics.
For example, to calculate quality:
python final_score/quality.py --cache_dir cache --save_path final_results
The script will automatically calculate recall and precision, and then compute the F1 score or average score.
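For reference, the F1 combination mentioned above is the standard one; a minimal sketch with placeholder numbers (not real results):

```python
# Standard F1 combination of precision and recall, used here for the quality score.
def f1(precision: float, recall: float) -> float:
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1(0.80, 0.70))  # placeholder values, prints ~0.7467
```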
Or, you can calculate each metric one by one. For example, to calculate recall:
python final_score/recall.py --cache_dir cache/recall --save_path final_results
- The structure of the `scripts` directory:
```
scripts
├── recall.sh              # evaluate recall
├── precision.sh           # evaluate precision
├── reflection_quality.sh  # evaluate reflection quality
├── relevance_rate.sh      # evaluate relevance rate
├── extract.sh             # First step of direct evaluation (for robustness): extract final answers from model responses
└── judge.sh               # Second step of direct evaluation (for robustness): judge the correctness of the extracted answers
```
The Leaderboard is continuously being updated, and we welcome contributions of your excellent LMMs!
To contribute your model to the leaderboard, please email the prediction files of the four tasks to [email protected].
We release the MME-CoT data and evaluation prompts for benchmarking on the leaderboard.
You can download the dataset from Huggingface with the following command (make sure that you have installed the related packages):
from datasets import load_dataset
dataset = load_dataset("CaraJ/MME-CoT")
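To quickly check what you downloaded, a minimal sketch that does not assume any particular split or field names:

```python
from datasets import load_dataset

# Minimal sketch: inspect the downloaded benchmark without assuming a
# specific split name or schema.
dataset = load_dataset("CaraJ/MME-CoT")
split = next(iter(dataset))         # first available split
print(dataset)                      # DatasetDict with split sizes
print(dataset[split].column_names)  # fields provided by the benchmark
```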
If you find MME-CoT useful for your research and applications, please kindly cite using this BibTeX:
@article{jiang2025mme,
title={MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency},
author={Jiang, Dongzhi and Zhang, Renrui and Guo, Ziyu and Li, Yanwei and Qi, Yu and Chen, Xinyan and Wang, Liuhui and Jin, Jianhan and Guo, Claire and Yan, Shen and others},
journal={arXiv preprint arXiv:2502.09621},
year={2025}
}
Explore our additional research on Vision-Language Large Models:
- [MME-Survey] MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
- [MME] MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
- [MMSearch] MMSearch: Benchmarking the potential of large models as multi-modal search engines
- [MathVerse] MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
- [LLaMA-Adapter] LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
- [ImageBind-LLM] Imagebind-LLM: Multi-modality Instruction Tuning
- [SPHINX] The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal LLMs
- [Point-Bind & Point-LLM] Multi-modality 3D Understanding, Generation, and Instruction Following
- [PerSAM] Personalize segment anything model with one shot
- [CoMat] CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching