[2510.02240] RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning
Summary
The paper presents RewardMap, a multi-stage reinforcement learning (RL) framework that aims to improve fine-grained visual reasoning in multimodal large language models (MLLMs) by countering the sparse rewards and unstable optimization that impede standard RL on tasks such as transit-map reasoning.
Why It Matters
Fine-grained visual reasoning underpins applications that require spatial understanding, yet ReasonMap shows that even advanced multimodal large language models (MLLMs) struggle with it in structured, information-rich settings such as transit maps. By giving reinforcement learning denser reward signals and a staged training scheme, this work strengthens the visual understanding and reasoning of MLLMs, with potential impact on domains such as robotics and computer vision.
Key Takeaways
- RewardMap tackles sparse rewards in visual reasoning tasks.
- The framework utilizes a difficulty-aware reward design for richer supervision.
- It introduces a multi-stage RL scheme for effective cold-start training.
- Experiments show an average improvement of 3.47% across multiple benchmarks.
- The proposed methods enhance both visual understanding and reasoning capabilities.
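To make the first three takeaways concrete, here is a minimal Python sketch of how a sparse exact-match reward differs from a reward with partial credit and difficulty scaling, and how samples might be ordered from easy to hard. It is an illustration under stated assumptions, not the paper's implementation: the `Sample` fields, the `detail_weight` parameter, the difficulty multiplier, and the three function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One training example (all fields are hypothetical, for illustration)."""
    task: str          # e.g. "vqa" or "route_planning"
    difficulty: int    # assumed scale: 1 (easy) to 3 (hard)
    reference: tuple   # reference answer as a tuple of steps
    prediction: tuple  # model answer as a tuple of steps


def sparse_reward(s):
    """All-or-nothing reward: 1.0 only if the whole answer matches."""
    return 1.0 if s.prediction == s.reference else 0.0


def detail_reward(s):
    """Partial credit: fraction of positions that match the reference."""
    if not s.reference:
        return 0.0
    matches = sum(p == r for p, r in zip(s.prediction, s.reference))
    # Dividing by the longer length penalizes missing or extra steps.
    return matches / max(len(s.reference), len(s.prediction))


def difficulty_aware_reward(s, detail_weight=0.5):
    """Assumed combination: exact-match plus weighted partial credit,
    scaled up for harder samples (the scaling rule is an assumption)."""
    base = sparse_reward(s) + detail_weight * detail_reward(s)
    return base * (1.0 + 0.5 * (s.difficulty - 1))


def stage_order(samples):
    """Assumed curriculum: present easier samples before harder ones."""
    return sorted(samples, key=lambda s: s.difficulty)


if __name__ == "__main__":
    hard = Sample("route_planning", 3,
                  ("Line 1", "Line 4", "Line 9"),
                  ("Line 1", "Line 4", "Line 2"))
    easy = Sample("vqa", 1, ("3",), ("3",))
    for s in stage_order([hard, easy]):
        print(s.task, sparse_reward(s), difficulty_aware_reward(s))
```

In this toy setup, a route answer that gets two of three steps right earns nothing under the sparse reward but a nonzero value under the combined reward, which is the kind of richer supervision the takeaways describe.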
arXiv:2510.02240 (cs), Computer Vision and Pattern Recognition
Submitted on 2 Oct 2025 (v1), last revised 21 Feb 2026 (this version, v2)
Authors: Sicheng Feng, Kaiwen Tuo, Song Wang, Lingdong Kong, Jianke Zhu, Huan Wang
Abstract: Fine-grained visual reasoning remains a core challenge for multimodal large language models (MLLMs). The recently introduced ReasonMap highlights this gap by showing that even advanced MLLMs struggle with spatial reasoning in structured and information-rich settings such as transit maps, a task of clear practical and scientific importance. However, standard reinforcement learning (RL) on such tasks is impeded by sparse rewards and unstable optimization. To address this, we first construct ReasonMap-Plus, an extended dataset that introduces dense reward signals through Visual Question Answering (VQA) tasks, enabling effective cold-start training of fine-grained visual understanding skills. Next, we propose RewardMap, a multi-stage RL framework designed to improve both visual understanding and reasoning capabilities of MLLMs. RewardMap incorporates two key designs. First, we introduce a difficulty-aware reward design tha...