[ICML 2025] MME-CoT: Benchmarking Chain-of-Thought in LMMs for Reasoning Quality, Robustness, and Efficiency
Official repository for "MME-CoT: Benchmarking Chain-of-Thought in LMMs for Reasoning Quality, Robustness, and Efficiency".
For more details, please refer to the project page with dataset exploration and visualization tools.
[Project Page] [Paper] [Huggingface Dataset] [Leaderboard] [Visualization]
- [2025.05.01] MME-CoT is accepted by ICML 2025.
- [2025.03.29] We have integrated MME-CoT into lmms-eval. Thanks to Luodian!
- [2025.03.08] We have integrated MME-CoT into VLMEvalKit.
- [2025.02.14] We are very proud to launch MME-CoT, the first-ever comprehensive CoT evaluation benchmark for LMMs in visual reasoning! We release the arXiv paper and all data samples in the Huggingface dataset.
Answering questions with Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs), yet its impact on Large Multimodal Models (LMMs) still lacks a systematic assessment and in-depth investigation.
In this paper, we introduce MME-CoT, a specialized benchmark evaluating the CoT reasoning performance of LMMs, spanning six domains: math, science, OCR, logic, space-time, and general scenes. As the first comprehensive study in this area, we propose a thorough evaluation suite incorporating three novel metrics that assess the reasoning quality, robustness, and efficiency at a fine-grained level.
Leveraging curated high-quality data and a unique evaluation strategy, we conduct an in-depth analysis of state-of-the-art LMMs, uncovering several key insights: (1) Models with reflection mechanisms demonstrate superior CoT quality, with Kimi k1.5 outperforming GPT-4o and achieving the highest-quality results; (2) CoT prompting often degrades LMM performance on perception-heavy tasks, suggesting a potentially harmful overthinking behavior; (3) Although their CoT quality is high, LMMs with reflection exhibit significant inefficiency in both the normal response and self-correction phases. We hope MME-CoT serves as a foundation for advancing multimodal reasoning in LMMs.
We support running inference on MME-CoT with lmms-eval. You can then run the evaluation of each metric as detailed in the Eval section.
Please first install lmms-eval as demonstrated in its official GitHub repo here.
Then, run the inference with the CoT prompt (needed for: Precision, Recall, Stability, Efficacy, Reflection Quality, and Relevance Rate):
accelerate launch --num_processes=8 --main_process_port=12345 -m lmms_eval \
--model TESTED_MODEL \
--model_args=pretrained=TESTED_MODEL_NAME \
--tasks mme_cot_reason \
--batch_size 1 --log_samples --log_samples_suffix output_cot --output_path ./logs/
Run the inference with the Direct prompt (needed for: Stability and Efficacy):
accelerate launch --num_processes=8 --main_process_port=12345 -m lmms_eval \
--model TESTED_MODEL \
--model_args=pretrained=TESTED_MODEL_NAME \
--tasks mme_cot_direct \
--batch_size 1 --log_samples --log_samples_suffix output_dir --output_path ./logs/
Then, convert the output JSON file into the format expected by the evaluation scripts, as illustrated here:
cd tasks/mme_cot
# For CoT prompt
python tools/update_lmmseval_json.py \
--lmms_eval_json_path mmecot_reasoning_test_for_submission.json \
--save_path results/json/YOUR_MODEL_NAME_cot.json
# For direct prompt
python tools/update_lmmseval_json.py \
--lmms_eval_json_path mmecot_direct_test_for_submission.json \
--save_path results/json/YOUR_MODEL_NAME_dir.json
Finally, run the evaluation illustrated below.
We also support running inference on MME-CoT with VLMEvalKit. You can then run the evaluation of each metric as detailed in the Eval section.
Please first install VLMEvalKit as demonstrated in its official GitHub repo here.
Then, run the inference with the CoT prompt (needed for: Precision, Recall, Stability, Efficacy, Reflection Quality, and Relevance Rate):
USE_COT_PROMPT=1 \
python run.py \
--data MME_CoT_TEST \
--model TESTED_MODEL \
--verbose \
--work-dir cot_results
Run the inference with the Direct prompt (needed for: Stability and Efficacy):
USE_COT_PROMPT=0 \
python run.py \
--data MME_CoT_TEST \
--model TESTED_MODEL \
--verbose \
--work-dir direct_results
Rename the result file `MODELNAME_MME_CoT_TEST.xlsx` to either `MODELNAME_MME_CoT_TEST_cot.xlsx` or `MODELNAME_MME_CoT_TEST_dir.xlsx`, depending on the prompt used.
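If you prefer to script this step, here is a minimal Python sketch of the renaming; the file locations are assumptions based on the work directories used above and may differ in your VLMEvalKit output layout:

```python
from pathlib import Path

# Hypothetical helper: append a suffix to a VLMEvalKit result file so the
# evaluation scripts can tell the CoT and Direct runs apart. Adjust the
# paths below to wherever VLMEvalKit actually wrote the xlsx files.
def add_suffix(xlsx_path: str, suffix: str) -> None:
    src = Path(xlsx_path)
    src.rename(src.with_name(f"{src.stem}_{suffix}{src.suffix}"))

add_suffix("cot_results/MODELNAME_MME_CoT_TEST.xlsx", "cot")     # CoT prompt run
add_suffix("direct_results/MODELNAME_MME_CoT_TEST.xlsx", "dir")  # Direct prompt run
```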
Finally, run the evaluation illustrated below.
To calculate the six metrics (precision, recall, efficacy, stability, relevance rate, and reflection quality), please follow these steps:
- Install the required packages.
pip install -r requirements.txt
- Format the model answer.
  - If you evaluate with lmms-eval, please follow the instructions above to convert the output into a valid JSON format.
  - If you evaluate with VLMEvalKit, you can directly use the output xlsx file.
  - We also provide examples in `results/xlsx` (the output from VLMEvalKit) and `results/json`. The JSON file should be in JSONL format, with the answer to each question on its own line. All the other information about the question in the dataset should be preserved in that line. The suffix `_cot.json` denotes answers produced with the CoT prompt, and `_dir.json` denotes answers produced with the direct prompt.
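For reference, a minimal sketch of loading one of these JSONL answer files; it only inspects the keys, so it does not assume any specific field names beyond the file-path convention above:

```python
import json

# Minimal sketch: read a converted answer file (JSONL, one record per line).
# The path matches the save path used in the lmms-eval conversion step above;
# which keys each record contains depends on the dataset fields plus the
# model-answer field, so no specific schema is assumed here.
with open("results/json/YOUR_MODEL_NAME_cot.json", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(len(records))       # number of answered questions
print(records[0].keys())  # dataset fields preserved in each line, plus the model answer
```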
- Run the evaluation script.
You can either run the metrics one by one. For example, to evaluate recall:
bash scripts/recall.sh
Simply change `YOUR_MODEL_NAME` and the data path in the `recall.sh` file.
Or you can run all the metrics for all the models in one directory with:
python batch_scripts/run_all.py --result_dir results/xlsx
After GPT evaluation, you are expected to obtain a `cache/` directory like this:
```
cache
├── recall
│   └── YOUR_MODEL_NAME
│       ├── 1.json
│       ├── 2.json
│       └── ...
├── precision
│   └── YOUR_MODEL_NAME
├── relevance_rate
│   └── YOUR_MODEL_NAME
├── reflection_quality
│   └── YOUR_MODEL_NAME
├── extract
│   ├── YOUR_MODEL_NAME_dir
│   └── YOUR_MODEL_NAME_cot
└── judge
    ├── YOUR_MODEL_NAME_dir
    └── YOUR_MODEL_NAME_cot
```
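If you want to verify that every metric actually produced cache entries, a minimal sketch that only counts the cached files per metric directory shown above (a convenience check, not part of the official pipeline):

```python
from pathlib import Path

# Count the cached per-question judgment files for each metric directory.
cache = Path("cache")
for metric_dir in sorted(p for p in cache.iterdir() if p.is_dir()):
    num_files = sum(1 for _ in metric_dir.rglob("*.json"))
    print(f"{metric_dir.name}: {num_files} cached files")
```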
Note that if your model does not include a reflection process, you do not need to run `reflection_quality.sh`. The metric calculation script below will handle that automatically.
- Calculate the metrics.
We cache the evaluation results of all the questions in the cache dir. Here we read the results from the cache dir and calculate the metrics.
For example, to calculate quality:
python final_score/quality.py --cache_dir cache --save_path final_results
The script will automatically calculate recall and precision, and then compute the F1 score or average score.
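For reference, the F1 combination mentioned above is the standard one; a minimal sketch with placeholder numbers (not real results):

```python
# Standard F1 combination of precision and recall, used here for the quality score.
def f1(precision: float, recall: float) -> float:
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1(0.80, 0.70))  # placeholder values, prints ~0.7467
```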
Or, you can calculate each metric one by one. For example, to calculate recall:
python final_score/recall.py --cache_dir cache/recall --save_path final_results
- The structure of the `scripts` directory:
```
scripts
├── recall.sh              # evaluate recall
├── precision.sh           # evaluate precision
├── reflection_quality.sh  # evaluate reflection quality
├── relevance_rate.sh      # evaluate relevance rate
├── extract.sh             # First step of direct evaluation (for robustness): extract final answers from model responses
└── judge.sh               # Second step of direct evaluation (for robustness): judge the correctness of the extracted answers
```
The Leaderboard is continuously being updated, and we welcome contributions of your excellent LMMs!
To contribute your model to the leaderboard, please email the prediction files of the four tasks to [email protected].
We release the MME-CoT data and evaluation prompts for benchmarking on the leaderboard.
You can download the dataset from Huggingface with the following command (make sure that you have installed the related packages):
from datasets import load_dataset
dataset = load_dataset("CaraJ/MME-CoT")
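To quickly check what you downloaded, a minimal sketch that does not assume any particular split or field names:

```python
from datasets import load_dataset

# Minimal sketch: inspect the downloaded benchmark without assuming a
# specific split name or schema.
dataset = load_dataset("CaraJ/MME-CoT")
split = next(iter(dataset))         # first available split
print(dataset)                      # DatasetDict with split sizes
print(dataset[split].column_names)  # fields provided by the benchmark
```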
If you find MME-CoT useful for your research and applications, please kindly cite using this BibTeX:
@article{jiang2025mme,
title={MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency},
author={Jiang, Dongzhi and Zhang, Renrui and Guo, Ziyu and Li, Yanwei and Qi, Yu and Chen, Xinyan and Wang, Liuhui and Jin, Jianhan and Guo, Claire and Yan, Shen and others},
journal={arXiv preprint arXiv:2502.09621},
year={2025}
}
Explore our additional research on Vision-Language Large Models:
- [MME-Survey] MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs
- [MME] MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
- [MMSearch] MMSearch: Benchmarking the potential of large models as multi-modal search engines
- [MathVerse] MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?
- [LLaMA-Adapter] LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
- [ImageBind-LLM] Imagebind-LLM: Multi-modality Instruction Tuning
- [SPHINX] The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal LLMs
- [Point-Bind & Point-LLM] Multi-modality 3D Understanding, Generation, and Instruction Following
- [PerSAM] Personalize segment anything model with one shot
- [CoMat] CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching