arXiv:2503.06749 (cs)
Computer Science > Computer Vision and Pattern Recognition
[Submitted on 9 Mar 2025 (v1), last revised 28 Feb 2026 (this version, v4)]

Title: Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
Authors: Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Xu Tang, Yao Hu, Shaohui Lin

Abstract: DeepSeek-R1-Zero has successfully demonstrated the emergence of reasoning capabilities in LLMs purely through Reinforcement Learning (RL). Inspired by this breakthrough, we explore how RL can be utilized to enhance the reasoning capability of MLLMs. However, direct training with RL struggles to activate complex reasoning capabilities, such as questioning and reflection, in MLLMs, due to the absence of substantial high-quality multimodal reasoning data. To address this issue, we propose Vision-R1, a reasoning MLLM with improved multimodal reasoning capability. Specifically, we first construct a high-quality 200K multimodal CoT dataset without human annotations, the Vision-R1-cold dataset, by leveraging an existing MLLM and DeepSeek-R1 through modality bridging and data filtering. It serves as cold-start initialization data for Vision-R1. To mitigate the optimization challenges caused by overthinking after...
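
The modality-bridging construction the abstract describes lends itself to a short sketch: an existing MLLM renders each image as a textual description, a text-only reasoner (DeepSeek-R1) produces a chain of thought over that description, and a filtering step keeps only well-formed samples. The sketch below is a minimal illustration under assumptions; the model interfaces (`mllm.generate`, `reasoner.generate`), prompt wording, output parsing, and the specific filtering rule are hypothetical placeholders, not the authors' released pipeline.

```python
from dataclasses import dataclass

@dataclass
class CoTSample:
    image_path: str
    question: str
    reasoning: str   # chain-of-thought text from the reasoning model
    answer: str

def describe_image(mllm, image_path: str, question: str) -> str:
    """Modality bridging: use an MLLM to render the visual content as text."""
    prompt = ("Describe the image in detail, focusing on what is needed "
              f"to answer: {question}")
    return mllm.generate(image=image_path, prompt=prompt)  # hypothetical API

def reason_over_text(reasoner, description: str, question: str) -> tuple[str, str]:
    """Ask a text-only reasoner (e.g. DeepSeek-R1) for a CoT and final answer."""
    prompt = (f"Image description: {description}\nQuestion: {question}\n"
              "Think step by step, then state 'Final answer:' on the last line.")
    output = reasoner.generate(prompt)  # hypothetical API
    reasoning, _, answer = output.rpartition("Final answer:")
    return reasoning.strip(), answer.strip()

def build_cold_start_dataset(mllm, reasoner, items) -> list[CoTSample]:
    """Assemble cold-start CoT data; items are (image_path, question, gold) triples."""
    dataset = []
    for image_path, question, gold_answer in items:
        description = describe_image(mllm, image_path, question)
        reasoning, answer = reason_over_text(reasoner, description, question)
        # Data filtering (hypothetical rule): keep samples whose answer matches
        # the reference and whose reasoning is non-trivial in length.
        if answer == gold_answer and len(reasoning.split()) > 20:
            dataset.append(CoTSample(image_path, question, reasoning, answer))
    return dataset
```

The key design point this illustrates is that no human annotation enters the loop: the MLLM supplies the vision-to-text bridge, the text-only reasoner supplies the reasoning traces, and an automatic filter stands in for manual quality control.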