[2512.08980] Training Multi-Image Vision Agents via End2End Reinforcement Learning
Computer Science > Computer Vision and Pattern Recognition
arXiv:2512.08980 (cs)
[Submitted on 5 Dec 2025 (v1), last revised 3 Apr 2026 (this version, v3)]

Title: Training Multi-Image Vision Agents via End2End Reinforcement Learning
Authors: Chengqi Dong, Chuhuai Yue, Hang He, Rongge Mao, Fenghe Tang, S Kevin Zhou, Zekun Xu, Xiaohan Wang, Jiajun Chai, Guojun Yin

Abstract: Recent VLM-based agents aim to replicate OpenAI O3's "thinking with images" via tool use, yet most open-source methods restrict inputs to a single image, limiting their applicability to real-world multi-image QA tasks. To address this gap, we propose IMAgent, an open-source visual agent trained with end-to-end reinforcement learning for fine-grained single- and multi-image reasoning. During inference, VLMs tend to gradually neglect visual inputs; to mitigate this, we design two dedicated tools for visual reflection and verification, enabling the model to actively refocus its attention on image content. Beyond that, we reveal, for the first time, how tool usage enhances agent performance from an attention perspective. Equipped with a carefully designed two-layer motion trajectory masking strategy and a tool-use reward gain, IMAgent acquires an effective tool-use paradigm through pure reinforcement learning, eliminating the need for costly supervised ...
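The abstract mentions a trajectory masking strategy for tool-using RL. A common ingredient of such schemes is excluding tool-injected tokens (e.g. returned image observations) from the policy-gradient loss, since the policy did not generate them. The sketch below illustrates that general idea only; the function names, roles, and the masking scheme are illustrative assumptions, not IMAgent's actual two-layer implementation.

```python
# Hypothetical sketch of loss masking for a tool-using RL agent.
# Tokens injected by tools appear in the context but are not sampled
# from the policy, so they are masked out of the per-token loss.
# All names here are illustrative assumptions, not the paper's code.

def build_loss_mask(token_roles):
    """Return 1.0 for policy-generated tokens, 0.0 otherwise.

    token_roles: one string per token:
      "model"  - generated by the agent (kept in the loss)
      "tool"   - injected tool output (masked)
      "prompt" - user/system input (masked)
    """
    return [1.0 if role == "model" else 0.0 for role in token_roles]


def masked_mean(per_token_loss, mask):
    """Average the loss over unmasked (model-generated) tokens only."""
    kept = sum(v * m for v, m in zip(per_token_loss, mask))
    denom = sum(mask)
    return kept / denom if denom else 0.0


if __name__ == "__main__":
    roles = ["prompt", "model", "model", "tool", "tool", "model"]
    mask = build_loss_mask(roles)
    losses = [9.0, 2.0, 4.0, 9.0, 9.0, 6.0]
    print(mask)                       # [0.0, 1.0, 1.0, 0.0, 0.0, 1.0]
    print(masked_mean(losses, mask))  # 4.0 — only the three "model" tokens count
```

In a real trainer this mask would multiply the per-token log-probability (or advantage-weighted loss) tensor before reduction, so gradients flow only through the agent's own actions.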