Computer Science > Machine Learning
arXiv:2505.19255 (cs)
[Submitted on 25 May 2025 (v1), last revised 4 Mar 2026 (this version, v4)]

Title: VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use

Authors: Mingyuan Wu, Jingcheng Yang, Jize Jiang, Meitang Li, Kaizhuo Yan, Hanchao Yu, Minjia Zhang, Chengxiang Zhai, Klara Nahrstedt

Abstract: Reinforcement Learning Finetuning (RFT) has significantly advanced the reasoning capabilities of large language models (LLMs) by enabling long chains of thought, self-correction, and effective tool use. While recent works attempt to extend RFT to vision-language models (VLMs), these efforts largely produce text-only reasoning conditioned on static image inputs, falling short of true multimodal reasoning in the response. In contrast, test-time methods like Visual Sketchpad incorporate visual steps but lack training mechanisms. We introduce VTool-R1, the first framework that trains VLMs to generate multimodal chains of thought by interleaving text and intermediate visual reasoning steps. VTool-R1 integrates Python-based visual editing tools into the RFT process, enabling VLMs to learn when and how to generate visual reasoning steps that benefit final reasoning. Trained with outcome-based rewards tied to task accuracy, our a...
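To make the abstract's mechanism concrete, here is a minimal sketch (not the authors' released code) of an interleaved multimodal rollout with an outcome-based reward. The policy API (`generate_step`), the `Step` record, and the choice of a crop operation as the Python-based visual editing tool are all hypothetical placeholders; the binary exact-match reward is likewise an assumption consistent with "outcome-based rewards tied to task accuracy."

```python
# Hypothetical sketch of a VTool-R1-style rollout: text reasoning steps are
# interleaved with Python-based visual edits, and only the final answer is
# rewarded. All names below are illustrative, not the paper's actual API.
from dataclasses import dataclass
from PIL import Image


@dataclass
class Step:
    text: str
    tool_call: dict | None  # e.g. {"tool": "crop", "box": (x0, y0, x1, y1)}


def crop_region(image: Image.Image, box: tuple[int, int, int, int]) -> Image.Image:
    """One plausible visual editing tool: crop a region of interest."""
    return image.crop(box)


def rollout(policy, question: str, image: Image.Image, max_steps: int = 8):
    """Interleave text reasoning with intermediate visual reasoning steps."""
    context = [("text", question), ("image", image)]
    for _ in range(max_steps):
        step: Step = policy.generate_step(context)  # hypothetical policy API
        context.append(("text", step.text))
        if step.tool_call is None:  # the model decides no further edit helps
            break
        if step.tool_call["tool"] == "crop":
            edited = crop_region(image, step.tool_call["box"])
            context.append(("image", edited))  # edited image re-enters the chain
    return context


def outcome_reward(final_answer: str, gold: str) -> float:
    """Outcome-based reward tied only to task accuracy, with no step-level labels."""
    return 1.0 if final_answer.strip().lower() == gold.strip().lower() else 0.0
```

Because the reward depends only on the final answer, RFT must discover on its own when inserting a visual step (here, a crop) improves accuracy, which is the learning signal the abstract describes.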