[2602.12916] Reliable Thinking with Images
Summary
The paper proposes Reliable Thinking with Images (RTWI), a method that makes Thinking with Images (TWI) reasoning in Multi-modal Large Language Models (MLLMs) more robust to Noisy Thinking (NT), that is, errors in mining visual cues and in the reasoning built on them.
Why It Matters
Existing TWI methods assume that the interleaved image-text chain of thought is faultless, an assumption that real-world scenarios easily violate because multimodal understanding is hard. An error in an early visual cue or reasoning step accumulates through the rest of the chain and degrades the final answer, so estimating the reliability of each step and filtering out the unreliable ones could improve MLLM performance in real-world applications.
Key Takeaways
- Identifies Noisy Thinking (NT), the imperfect mining of visual cues and imperfect answer reasoning, as an under-explored problem in TWI.
- Proposes Reliable Thinking with Images (RTWI) to mitigate NT effects.
- Demonstrates the effectiveness of RTWI through extensive experiments on multiple benchmarks.
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.12916 (cs)
[Submitted on 13 Feb 2026]
Title: Reliable Thinking with Images
Authors: Haobin Li, Yutong Yang, Yijie Lin, Dai Xiang, Mouxing Yang, Xi Peng
Abstract: As a multimodal extension of Chain-of-Thought (CoT), Thinking with Images (TWI) has recently emerged as a promising avenue to enhance the reasoning capability of Multi-modal Large Language Models (MLLMs), which generates interleaved CoT by incorporating visual cues into the textual reasoning process. However, the success of existing TWI methods heavily relies on the assumption that interleaved image-text CoTs are faultless, which is easily violated in real-world scenarios due to the complexity of multimodal understanding. In this paper, we reveal and study a highly-practical yet under-explored problem in TWI, termed Noisy Thinking (NT). Specifically, NT refers to the imperfect visual cues mining and answer reasoning process. As the saying goes, "One mistake leads to another", erroneous interleaved CoT would cause error accumulation, thus significantly degrading the performance of MLLMs. To solve the NT problem, we propose a novel method dubbed Reliable Thinking with Images (RTWI). In brief, RTWI estimates the reliability of visual cues and textual CoT in a unified text-centric manner and accordingly employs robust filtering ...
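To make the idea of reliability-based filtering concrete, here is a minimal Python sketch. It is not the authors' implementation: the `Step` type, the `filter_steps` function, the 0.5 threshold, and the hand-assigned scores are all hypothetical, and the abstract above does not say how RTWI actually computes reliability. The sketch only shows the general pattern of scoring each visual cue or textual reasoning step and dropping the low-scoring ones before they feed later reasoning.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One element of an interleaved chain of thought (hypothetical type)."""
    kind: str           # "visual" for a visual cue, "text" for a reasoning step
    content: str        # textual description of the cue or the reasoning step
    reliability: float  # assumed score in [0, 1]; how it is estimated is not specified here


def filter_steps(steps: List[Step], threshold: float = 0.5) -> List[Step]:
    """Keep only the steps whose reliability score meets the threshold."""
    return [s for s in steps if s.reliability >= threshold]


# Hand-assigned scores, purely for illustration.
chain = [
    Step("visual", "crop shows a red stop sign", 0.9),
    Step("visual", "crop shows a blurry license plate", 0.2),
    Step("text", "the sign implies the car must stop", 0.8),
]

kept = filter_steps(chain)
for step in kept:
    print(step.kind, "|", step.content)
```

With these illustrative scores, the blurry-plate cue falls below the threshold and is dropped, so it cannot propagate an error into later steps. A fixed threshold is the simplest possible choice; a real system would need a principled way to set it.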