[2511.05705] Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale
Summary
The paper presents a novel framework for synthesizing vision-centric problems and reasoning chains, generating over 1 million high-quality visual problems that enhance multimodal reasoning capabilities.
Why It Matters
This research addresses limitations in multimodal reasoning by providing a systematic approach to creating diverse visual datasets. Models fine-tuned on the resulting data show significant improvements across vision-centric benchmarks, pointing to advances in how AI systems understand visual information.
Key Takeaways
- Introduces a framework for synthesizing complex visual problems.
- Generates over 1 million high-quality visual reasoning problems.
- Demonstrates improved performance of models fine-tuned on the new dataset.
- Shows positive transfer effects to text-only and audio reasoning tasks.
- Analyzes the VLM post-training pipeline, revealing insights on SFT and RL.
Paper Details
arXiv:2511.05705 [cs.CV] (Computer Science > Computer Vision and Pattern Recognition)
Submitted on 7 Nov 2025 (v1); last revised 17 Feb 2026 (this version, v2)
Authors: David Acuna, Chao-Han Huck Yang, Yuntian Deng, Jaehun Jung, Ximing Lu, Prithviraj Ammanabrolu, Hyunwoo Kim, Yuan-Hong Liao, Yejin Choi
Abstract: Despite rapid progress, multimodal reasoning still lacks a systematic approach to synthesizing large-scale vision-centric datasets beyond visual math. We introduce a framework that synthesizes vision-centric problems spanning diverse levels of complexity, along with the resulting dataset of over 1M high-quality problems, including reasoning traces, preference data, and instruction prompts supporting SFT as well as offline and online RL. Our vision-centric synthesis framework uses a two-stage process: (1) generating diverse verifiable questions from existing images at scale, and (2) creating complex compositional visual problems by merging simpler questions. Remarkably, fine-tuning Qwen2.5-VL-7B on our data outperforms existing open-data baselines across the evaluated vision-centric benchmarks, and our best configurations match or surpass strong closed-data models such as MiMo-VL-7B-RL on Vstar Bench, CV-Bench and M...
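The two-stage process described in the abstract can be sketched in a few lines. This is a hedged illustration only: the function names, question templates, and merge rule below are hypothetical assumptions, not the authors' actual implementation, which would use real images and model- or detector-generated questions.

```python
# Illustrative sketch of a two-stage visual-problem synthesis pipeline:
# stage 1 derives simple verifiable questions per image; stage 2 merges
# pairs of simple questions into compositional problems.
from dataclasses import dataclass
from itertools import combinations


@dataclass
class VisualProblem:
    image_id: str
    question: str
    answer: str  # verifiable ground-truth answer


def generate_atomic_problems(image_id: str) -> list[VisualProblem]:
    """Stage 1 (assumed): produce simple, verifiable questions for an image.
    A real pipeline would query detectors or VLMs; placeholders stand in here."""
    return [
        VisualProblem(image_id, "How many chairs are visible?", "3"),
        VisualProblem(image_id, "What color is the leftmost chair?", "red"),
    ]


def compose_problems(a: VisualProblem, b: VisualProblem) -> VisualProblem:
    """Stage 2 (assumed): merge two simple questions into one compositional
    problem whose answer chains both ground truths."""
    merged = f"{a.question[:-1]}, and {b.question[0].lower()}{b.question[1:]}"
    return VisualProblem(a.image_id, merged, f"{a.answer}; {b.answer}")


def synthesize(image_ids: list[str]) -> list[VisualProblem]:
    dataset: list[VisualProblem] = []
    for img in image_ids:
        atoms = generate_atomic_problems(img)
        dataset.extend(atoms)                # keep the simple problems
        for a, b in combinations(atoms, 2):  # add compositional merges
            dataset.append(compose_problems(a, b))
    return dataset


problems = synthesize(["img_001"])
print(len(problems))  # 2 atomic + 1 composed = 3
```

Composing pairs (or larger tuples) of verified atomic questions lets complexity scale combinatorially while answers remain checkable, which is what makes the resulting data usable for SFT and RL with verifiable rewards.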