[2506.22832] Listener-Rewarded Thinking in VLMs for Image Preferences
Computer Science > Computer Vision and Pattern Recognition
arXiv:2506.22832 (cs)

This paper has been withdrawn by Alexander Gambashidze.
[Submitted on 28 Jun 2025 (v1), last revised 9 Apr 2026 (this version, v3)]

Title: Listener-Rewarded Thinking in VLMs for Image Preferences
Authors: Alexander Gambashidze, Li Pengyi, Matvey Skripkin, Andrey Galichin, Anton Gusarov, Konstantin Sobolev, Andrey Kuznetsov, Ivan Oseledets

Abstract: Training robust and generalizable reward models for human visual preferences is essential for aligning text-to-image and text-to-video generative models with human intent. However, current reward models often fail to generalize, and supervised fine-tuning leads to memorization, demanding complex annotation pipelines. While reinforcement learning (RL), specifically Group Relative Policy Optimization (GRPO), improves generalization, we uncover a key failure mode: a significant drop in reasoning accuracy occurs when a model's reasoning trace contradicts that of an independent, frozen vision-language model ("listener") evaluating the same output. To address this, we introduce a listener-augmented GRPO framework. Here, the listener re-evaluates the reasoner's chain-of-thought to provide a dense, calibrated confidence score, shaping the RL reward signal. This encourages the reasoner...
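The abstract only outlines the reward-shaping mechanism, so the following is a minimal, hypothetical Python sketch of one way a frozen listener's confidence could be blended with a sparse correctness reward and then normalized into GRPO-style group advantages. The names (Sample, listener_confidence, shaped_reward, group_advantages), the stub listener, and the linear mixing weight alpha are all assumptions for illustration, not the paper's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    chain_of_thought: str   # reasoner's reasoning trace for one response
    predicted_choice: int   # which image the reasoner prefers
    true_choice: int        # human-preferred image (ground truth)


def listener_confidence(chain_of_thought: str, true_choice: int) -> float:
    """Placeholder for a frozen VLM 'listener' that re-reads the reasoning
    trace and returns a calibrated probability of the ground-truth choice.
    A real implementation would query a frozen vision-language model."""
    return 0.5  # dummy value; an assumption, not the paper's listener


def shaped_reward(sample: Sample, alpha: float = 0.5) -> float:
    """Blend the sparse correctness reward with the listener's dense
    confidence. The linear combination and alpha are assumed, not specified
    in the abstract."""
    correct = float(sample.predicted_choice == sample.true_choice)
    confidence = listener_confidence(sample.chain_of_thought, sample.true_choice)
    return (1.0 - alpha) * correct + alpha * confidence


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standard GRPO step: normalize rewards within a group of sampled
    responses, A_i = (r_i - mean(r)) / (std(r) + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Under this sketch, a response whose answer is correct but whose chain-of-thought fails to convince the listener receives a reduced reward, which is one plausible way to realize the abstract's goal of penalizing reasoning traces that contradict the independent listener.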