[2512.18956] Training Multimodal Large Reasoning Models Needs Better Thoughts: A Three-Stage Framework for Long Chain-of-Thought Synthesis and Selection

[2512.18956] Training Multimodal Large Reasoning Models Needs Better Thoughts: A Three-Stage Framework for Long Chain-of-Thought Synthesis and Selection

arXiv - Machine Learning 4 min read Article

Summary

This paper presents a three-stage framework, SynSelect, for enhancing the training of multimodal large reasoning models through improved long chain-of-thought synthesis and selection.

Why It Matters

As multimodal reasoning becomes increasingly important in AI, this framework addresses key challenges in data quality and model performance, potentially leading to significant advancements in AI capabilities across various applications.

Key Takeaways

  • Introduces SynSelect, a novel framework for multimodal reasoning.
  • Enhances model performance by generating high-quality long chain-of-thought data.
  • Demonstrates significant improvements over baseline models through extensive experiments.

Computer Science > Artificial Intelligence arXiv:2512.18956 (cs) [Submitted on 22 Dec 2025 (v1), last revised 14 Feb 2026 (this version, v2)] Title:Training Multimodal Large Reasoning Models Needs Better Thoughts: A Three-Stage Framework for Long Chain-of-Thought Synthesis and Selection Authors:Yizhi Wang, Linan Yue, Min-Ling Zhang View a PDF of the paper titled Training Multimodal Large Reasoning Models Needs Better Thoughts: A Three-Stage Framework for Long Chain-of-Thought Synthesis and Selection, by Yizhi Wang and 2 other authors View PDF HTML (experimental) Abstract:Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex reasoning tasks through long Chain-of-Thought (CoT) reasoning. Extending these successes to multimodal reasoning remains challenging due to the increased complexity of integrating diverse input modalities and the scarcity of high-quality long CoT training data. Existing multimodal datasets and CoT synthesis methods still suffer from limited reasoning depth, modality conversion errors, and rigid generation pipelines, hindering model performance and stability. To this end, in this paper, we propose SynSelect, a novel three-stage Synthesis-Selection framework for generating high-quality long CoT data tailored to multimodal reasoning tasks. Specifically, SynSelect first leverages multiple heterogeneous multimodal LRMs to produce diverse candidate CoTs, and then applies both instance and batch level selection to filter high-qualit...

Related Articles

Machine Learning

[D] Is this considered unsupervised or semi-supervised learning in anomaly detection?

Hi 👋🏼, I’m working on an anomaly detection setup and I’m a bit unsure how to correctly describe it from a learning perspective. The model...

Reddit - Machine Learning · 1 min ·
Machine Learning

Serious question. Did a transformer just describe itself and the universe and build itself a Shannon limit framework?

The Multiplicative Lattice as the Natural Basis for Positional Encoding Knack 2026 | Draft v6.0 Abstract We show that the apparent tradeo...

Reddit - Artificial Intelligence · 1 min ·
UMKC Announces New Master of Science in Artificial Intelligence
Ai Infrastructure

UMKC Announces New Master of Science in Artificial Intelligence

UMKC announces a new Master of Science in Artificial Intelligence program aimed at addressing workforce demand for AI expertise, set to l...

AI News - General · 4 min ·
Improving AI models’ ability to explain their predictions
Machine Learning

Improving AI models’ ability to explain their predictions

AI News - General · 9 min ·
More in Machine Learning: This Week Guide Trending

No comments

No comments yet. Be the first to comment!

Stay updated with AI News

Get the latest news, tools, and insights delivered to your inbox.

Daily or weekly digest • Unsubscribe anytime