MMFineReason
A Large-Scale Dataset for Fine-Grained Multimodal Reasoning with Long-Form Chain-of-Thought Supervision
Figure 1: Benchmark results on mathematical reasoning and general understanding tasks. MMFineReason(MFR)-8B demonstrates strong performance relative to thinking models with significantly more parameters.
Recent advances in Vision Language Models (VLMs) have driven significant progress in visual reasoning. However, open-source multimodal models still lag behind proprietary systems, largely due to the lack of high-quality reasoning data. Existing datasets offer limited coverage of challenging domains such as STEM diagrams and visual puzzles, and lack consistent, long-form Chain-of-Thought (CoT) annotations essential for eliciting strong reasoning capabilities.
To bridge this gap, we introduce MMFineReason, a large-scale multimodal reasoning dataset comprising 1.8M samples and 5.1B solution tokens, featuring high-quality reasoning annotations distilled from Qwen3-VL-235B-A22B-Thinking.
Our dataset is constructed through a systematic four-stage pipeline: large-scale data collection and standardization, filtering to retain high-value reasoning samples, CoT rationale generation, and rigorous multi-dimensional quality verification. The resulting dataset spans STEM problems, visual puzzles, games, and complex diagrams, with each sample annotated with detailed, visually grounded reasoning traces.
We fine-tune Qwen3-VL-Instruct on MMFineReason to develop MMFineReason-2B/4B/8B. Our models establish new state-of-the-art results for their size class. Notably, MMFineReason-4B surpasses Qwen3-VL-8B-Thinking, and MMFineReason-8B outperforms Qwen3-VL-30B-A3B-Thinking while approaching Qwen3-VL-32B-Thinking, demonstrating remarkable parameter efficiency.
A systematic four-stage pipeline ensuring data diversity and high-quality reasoning annotations
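The four stages can be expressed as composable functions. The sketch below is illustrative only, assuming a simple record schema (`image`, `question`, `answer`, `cot`); the filtering and verification heuristics shown here are assumptions, not the authors' actual implementation.

```python
# Illustrative sketch of the four-stage pipeline. Schema fields and the
# filtering/verification heuristics are assumptions, not the authors' code.

def standardize(raw: dict) -> dict:
    """Stage 1: normalize heterogeneous source records into one schema."""
    return {
        "image": raw.get("image"),
        "question": str(raw.get("question", "")).strip(),
        "answer": str(raw.get("answer", "")).strip(),
        "cot": "",
    }

def is_high_value(s: dict) -> bool:
    """Stage 2: keep samples likely to require genuine visual reasoning
    (here: an image, a non-trivial question, and a reference answer)."""
    return bool(s["image"]) and len(s["question"]) > 20 and s["answer"] != ""

def generate_cot(s: dict, teacher) -> dict:
    """Stage 3: distill a long-form rationale from a teacher model
    (Qwen3-VL-235B-A22B-Thinking in the paper); `teacher` is any callable."""
    s["cot"] = teacher(s)
    return s

def verify(s: dict) -> bool:
    """Stage 4: quality check; one cheap dimension is answer consistency,
    i.e. the rationale must conclude on the reference answer."""
    return len(s["cot"]) > 0 and s["answer"] in s["cot"]

def run_pipeline(raw_samples, teacher):
    """Chain the four stages, dropping samples that fail either filter."""
    kept = []
    for raw in raw_samples:
        s = standardize(raw)
        if not is_high_value(s):
            continue
        s = generate_cot(s, teacher)
        if verify(s):
            kept.append(s)
    return kept
```

In practice each stage would run at scale with batched teacher inference; the value of the decomposition is that each filter can be tuned and audited independently.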
Comprehensive statistics and domain distribution of MMFineReason
Dataset composition of MMFineReason-SFT. The outer ring shows the proportion of major categories; the inner ring shows the distribution of specific source datasets. Mathematics dominates (79.4%), followed by Science (13.8%), Puzzle/Game (4.6%), and General/OCR (2.2%).
Table 2: STEM/Diagrammatic images dominate, with diverse natural image coverage.
MMFineReason provides significantly deeper reasoning traces than existing datasets
We report the distribution metrics (mean, median, and percentiles) for reasoning chains (CoT) and image captions.
| Dataset | Type | Count | Total Tokens | Mean | Median | Std Dev | Min | Max | P25 | P75 | P95 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MMFineReason (Ours) | CoT | 1,770,926 | 5,152,806,394 | 2,909.67 | 2,038 | 2,463.83 | 239 | 16,316 | 1,321 | 3,569 | 8,207 |
| OpenMMReasoner | CoT | 874,357 | 590,096,263 | 674.89 | 180 | 1,477.53 | 26 | 16,483 | 102 | 464 | 3,318 |
| HoneyBee | CoT | 2,481,229 | 2,636,405,079 | 1,062.54 | 972 | 428.00 | 203 | 7,190 | 745 | 1,298 | 1,931 |
| MMFineReason (Ours) | Caption | 1,770,926 | 1,079,313,259 | 609.46 | 582 | 184.88 | 1 | 5,187 | 494 | 688 | 920 |
| HoneyBee | Caption | 1,439,921 | 431,096,653 | 299.39 | 264 | 157.99 | 21 | 2,739 | 201 | 350 | 598 |
MMFineReason achieves an average CoT length of 2,910 tokens, approximately 2.7× longer than HoneyBee and 4.3× longer than OpenMMReasoner, underscoring its substantially deeper reasoning traces.
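The table's distribution metrics can be reproduced from per-sample token counts with the standard library alone. A minimal sketch follows; the nearest-rank percentile convention and the use of population standard deviation are assumptions, since the paper does not specify either.

```python
import statistics

def percentile(xs, p):
    """Nearest-rank percentile (one common convention; assumed here)."""
    xs = sorted(xs)
    k = max(0, min(len(xs) - 1, round(p / 100 * (len(xs) - 1))))
    return xs[k]

def length_stats(token_counts):
    """Distribution metrics as reported in the table: count, total tokens,
    mean, median, std dev, min/max, and P25/P75/P95."""
    return {
        "count": len(token_counts),
        "total": sum(token_counts),
        "mean": statistics.mean(token_counts),
        "median": statistics.median(token_counts),
        "std": statistics.pstdev(token_counts),  # population std dev (assumed)
        "min": min(token_counts),
        "max": max(token_counts),
        "p25": percentile(token_counts, 25),
        "p75": percentile(token_counts, 75),
        "p95": percentile(token_counts, 95),
    }
```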
MMFineReason models achieve remarkable performance across diverse benchmarks
| Model | MMMU (val) | MathVista (mini) | MathVision (test) | MathVerse (mini) | DynaMath (test) | LogicVista (test) | VisuLogic (test) | ScienceQA | RWQA (test) | MMBench-EN | MMStar (test) | AI2D (test) | CharXiv (reas.) | CharXiv (desc.) | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-32B-Thinking | 78.1 | 85.9 | 70.2 | 82.6 | 82.0 | 70.9 | 32.4 | 97.2 | 78.4 | 89.5 | 79.4 | 88.9 | 65.2 | 90.2 | 77.9 |
| GPT5-mini-high | 79.0 | 79.1 | 71.9 | 78.8 | 81.4 | 71.4 | 27.2 | 96.9 | 79.0 | 86.6 | 74.1 | 88.2 | 68.6 | 89.4 | 76.5 |
| Ours MMFineReason-8B | 71.3 | 81.7 | 67.1 | 81.5 | 83.4 | 68.5 | 30.5 | 97.5 | 75.6 | 88.9 | 75.2 | 87.9 | 60.0 | 90.8 | 75.7 |
| Gemini-2.5-Flash | 77.7 | 79.4 | 64.3 | 77.7 | 75.9 | 67.3 | 31.0 | 97.1 | 76.0 | 87.0 | 76.5 | 88.7 | 61.7 | 90.1 | 75.0 |
| Qwen3-VL-30B-A3B-Thinking | 76.0 | 81.9 | 65.7 | 79.6 | 80.1 | 65.8 | 26.6 | 96.4 | 77.4 | 87.0 | 75.5 | 86.9 | 56.6 | 86.9 | 74.5 |
| Ours MMFineReason-4B | 69.6 | 82.2 | 61.3 | 78.7 | 80.6 | 67.6 | 29.8 | 95.8 | 74.9 | 88.7 | 72.8 | 86.5 | 58.1 | 87.7 | 73.9 |
| Qwen3-VL-8B-Thinking | 74.1 | 81.4 | 62.7 | 77.7 | 73.2 | 65.1 | 27.5 | 94.8 | 73.5 | 85.3 | 75.3 | 84.9 | 53.0 | 85.9 | 72.5 |
| MMR1-8B | 62.8 | 75.3 | 48.4 | 67.3 | 73.6 | 54.6 | 25.4 | 95.4 | 71.0 | 86.9 | 69.3 | 83.4 | 48.8 | 81.5 | 67.4 |
| Ours MMFineReason-2B | 54.8 | 74.6 | 45.3 | 69.2 | 71.4 | 53.8 | 28.3 | 94.4 | 68.3 | 84.5 | 67.7 | 82.5 | 45.4 | 74.3 | 65.3 |
| OpenMMReasoner-7B | 57.8 | 79.5 | 43.6 | 63.8 | 69.1 | 50.0 | 24.4 | 96.8 | 69.4 | 85.9 | 69.0 | 85.0 | 46.1 | 73.5 | 65.3 |
| HoneyBee-8B | 63.1 | 71.9 | 37.4 | 60.9 | 69.4 | 47.8 | 25.9 | 95.2 | 70.5 | 87.4 | 73.3 | 86.0 | 47.4 | 75.8 | 65.1 |
Key Results: MFR-8B outperforms Qwen3-VL-30B-A3B-Thinking on nearly all mathematical benchmarks while using 3.75× fewer parameters. Despite minimal chart/real-world training data, MFR-8B achieves strong results on CharXiv (90.8%) and RealWorldQA (75.6%), demonstrating excellent cross-domain generalization.
Main results of different model scales across various multimodal benchmarks. We compare our SFT and RL models against the base model.
| Size | Model | MMMU (val) | MathVista (mini) | MathVision (test) | MathVerse (mini) | DynaMath (test) | LogicVista (test) | VisuLogic (test) | ScienceQA | RWQA (test) | MMBench-EN | MMStar (test) | AI2D (test) | CharXiv (reas.) | CharXiv (desc.) | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2B | Qwen3-VL-Instruct | 53.4 | 61.3 | 31.6 | 52.1 | 54.2 | 35.8 | 11.5 | 87.4 | 63.9 | 78.4 | 58.3 | 76.9 | 26.8 | 62.3 | 53.9 |
| 2B | Qwen3-VL-Thinking | 61.4 | 73.6 | 45.9 | 66.9 | 66.7 | 50.0 | 25.4 | 88.0 | 69.5 | 79.9 | 68.1 | 80.4 | 37.1 | 70.1 | 63.1 |
| 2B | MFR-SFT | 54.6 | 73.3 | 40.9 | 70.4 | 68.7 | 52.8 | 24.7 | 92.3 | 67.9 | 83.2 | 63.6 | 78.5 | 39.0 | 74.1 | 63.1 |
| 2B | MFR-RL | 54.8 | 74.6 | 45.3 | 69.2 | 71.4 | 53.8 | 28.3 | 94.4 | 68.2 | 84.5 | 67.7 | 82.5 | 45.4 | 74.3 | 65.3 |
| 4B | Qwen3-VL-Instruct | 67.4 | 73.7 | 51.6 | 46.8 | 65.3 | 53.2 | 19.0 | 88.0 | 70.9 | 83.9 | 69.8 | 84.1 | 39.7 | 76.2 | 63.5 |
| 4B | Qwen3-VL-Thinking | 70.8 | 79.5 | 60.0 | 75.2 | 74.4 | 61.1 | 30.2 | 94.1 | 73.2 | 84.6 | 73.2 | 84.9 | 50.3 | 83.9 | 71.1 |
| 4B | MFR-SFT | 69.3 | 80.1 | 62.4 | 78.4 | 79.9 | 66.7 | 27.8 | 96.6 | 71.5 | 88.7 | 73.0 | 86.1 | 55.9 | 87.7 | 73.2 |
| 4B | MFR-RL | 69.6 | 82.2 | 61.3 | 78.7 | 80.6 | 67.6 | 29.8 | 95.8 | 74.9 | 88.7 | 72.8 | 86.5 | 58.1 | 87.7 | 73.9 |
| 8B | Qwen3-VL-Instruct | 69.6 | 77.2 | 53.9 | 62.1 | 67.7 | 55.3 | 22.5 | 95.4 | 71.5 | 84.5 | 70.9 | 85.7 | 46.4 | 83.0 | 67.6 |
| 8B | Qwen3-VL-Thinking | 74.1 | 81.4 | 62.7 | 77.7 | 73.2 | 65.1 | 27.5 | 94.8 | 73.5 | 85.3 | 75.3 | 84.9 | 53.0 | 85.9 | 72.5 |
| 8B | MFR-SFT | 71.3 | 81.2 | 67.6 | 82.2 | 82.6 | 68.7 | 29.9 | 95.4 | 74.1 | 87.8 | 74.8 | 86.5 | 58.4 | 89.9 | 75.0 |
| 8B | MFR-RL | 71.3 | 81.7 | 67.1 | 81.5 | 83.4 | 68.5 | 30.5 | 97.5 | 75.6 | 88.9 | 75.2 | 87.9 | 60.0 | 90.8 | 75.7 |
Key Results: SFT provides the primary performance boost, establishing strong reasoning foundations. RL training significantly improves generalization on general understanding and chart benchmarks.
Key insights from our comprehensive experiments on data and training strategies
We filter samples by difficulty using Qwen3-VL-4B-Thinking pass rates.
Key Results: High-quality challenging samples provide most of the training signal, enabling faster convergence with minimal data.
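Pass-rate-based difficulty filtering can be sketched as follows. The bucket names and the policy of keeping only partially solved samples are illustrative assumptions based on the description above, not the paper's exact thresholds.

```python
# Illustrative pass-rate difficulty filter. Bucket names, thresholds, and the
# keep policy are assumptions, not the paper's exact settings.
from collections import Counter

def pass_rate(correct_flags):
    """Fraction of k sampled model attempts the grader marked correct."""
    return sum(correct_flags) / len(correct_flags)

def difficulty_bucket(rate):
    if rate == 1.0:
        return "trivial"      # solved every time: little training signal
    if rate == 0.0:
        return "unsolved"     # possibly noisy or unanswerable; needs review
    return "challenging"      # partially solved: most informative samples

def filter_by_difficulty(samples, keep=("challenging",)):
    """Bucket samples by pass rate and keep only the requested buckets."""
    buckets = Counter()
    kept = []
    for s in samples:
        b = difficulty_bucket(pass_rate(s["attempts"]))
        buckets[b] += 1
        if b in keep:
            kept.append(s)
    return kept, buckets
```

Under this policy, samples the verifier model always solves contribute no gradient worth paying for, while never-solved samples risk rewarding noise, which is consistent with the finding that challenging samples carry most of the training signal.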
Analysis of how dataset properties affect STEM reasoning performance.
Explore detailed Chain-of-Thought reasoning traces across different domains
Find the value of the variable y in the figure. Choices: 55, 115, 125, 135.
Answer: 125
If you find our work useful, please cite our paper
@misc{lin2026mmfinereasonclosingmultimodalreasoning,
      title={MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods},
      author={Honglin Lin and Zheng Liu and Yun Zhu and Chonghan Qin and Juekai Lin and Xiaoran Shang and Conghui He and Wentao Zhang and Lijun Wu},
      year={2026},
      eprint={2601.21821},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.21821},
}