[2603.27482] Difference Feedback: Generating Multimodal Process-Level Supervision for VLM Reinforcement Learning
Computer Science > Computer Vision and Pattern Recognition

arXiv:2603.27482 (cs) [Submitted on 29 Mar 2026]

Title: Difference Feedback: Generating Multimodal Process-Level Supervision for VLM Reinforcement Learning

Authors: Feiding, Yongkang Zhang, Yuhao Liao, Zijian Zeng, Chunzheng Zhu, Yaozong Zheng, Yafei Liu, Yeling Peng, Youwei Wang, Sibo Wang, Huiming Yang, Linglin Liao, Shunzhi Yang

Abstract: Vision-language models (VLMs) are increasingly aligned via Group Relative Policy Optimization (GRPO)-style training. However, relying solely on terminal outcome rewards yields sparse credit assignment in multi-step reasoning, weakening the link between visual evidence and intermediate steps and often causing unstable optimization and visual hallucinations. We propose Difference Feedback, which automatically constructs token- and step-level supervision masks by repairing erroneous reasoning trajectories and explicitly marking the positions that require correction. Without costly large-scale step-by-step human annotation, our method enables process-level visual alignment and integrates seamlessly into existing GRPO-like frameworks. Experiments on multimodal reasoning benchmarks, including MMStar and MathVista, show an average 3% improvement under matched compute budgets. Our approach ...
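The core idea in the abstract, deriving a token-level mask by comparing an erroneous trajectory with its repaired version, can be sketched as a sequence alignment. This is a minimal illustration, not the paper's actual construction: the function name `difference_mask` and the whitespace tokenization are assumptions for the sake of the example.

```python
import difflib

def difference_mask(original_tokens, repaired_tokens):
    """Mark positions in the original trajectory that the repair changed.

    Returns a 0/1 mask over original_tokens: 1 where the repaired
    trajectory replaced or deleted tokens (positions needing correction),
    0 where the two trajectories agree.
    """
    mask = [0] * len(original_tokens)
    matcher = difflib.SequenceMatcher(a=original_tokens, b=repaired_tokens)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag != "equal":          # 'replace' or 'delete' spans in the original
            for i in range(i1, i2):
                mask[i] = 1
    return mask

# Toy example: a flawed reasoning step repaired by a corrector model.
flawed   = "area = 2 * pi * r".split()
repaired = "area = pi * r ** 2".split()
print(difference_mask(flawed, repaired))  # → [0, 0, 1, 1, 0, 0, 0]
```

In a GRPO-style update, such a mask could then up-weight the per-token loss at the marked positions, concentrating credit assignment on the steps the repair actually changed.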