[2511.22715] ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering
Computer Science > Computer Vision and Pattern Recognition
arXiv:2511.22715 (cs)
[Submitted on 27 Nov 2025 (v1), last revised 31 Mar 2026 (this version, v2)]

Title: ReAG: Reasoning-Augmented Generation for Knowledge-based Visual Question Answering
Authors: Alberto Compagnoni, Marco Morini, Sara Sarto, Federico Cocchi, Davide Caffagni, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Abstract: Multimodal Large Language Models (MLLMs) have shown impressive capabilities in jointly understanding text, images, and videos, often evaluated via Visual Question Answering (VQA). However, even state-of-the-art MLLMs struggle with domain-specific or knowledge-intensive queries, where relevant information is underrepresented in pre-training data. Knowledge-based VQA (KB-VQA) addresses this by retrieving external documents to condition answer generation, but current retrieval-augmented approaches suffer from low precision, noisy passages, and limited reasoning. To address this, we propose ReAG, a novel Reasoning-Augmented Multimodal RAG approach that combines coarse- and fine-grained retrieval with a critic model that filters irrelevant passages, ensuring high-quality additional context. The model follows a multi-stage training strategy leveraging reinforcement learning to enhance reasoning over retrieved c...
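The abstract describes a pipeline of coarse retrieval, fine-grained retrieval, and a critic model that discards irrelevant passages before generation. A minimal sketch of that control flow is below; note this is not the paper's implementation, and all scoring functions here are toy word-overlap stand-ins for the learned retrievers and critic, with hypothetical names (`coarse_retrieve`, `fine_rerank`, `critic_filter`) and thresholds chosen for illustration only.

```python
# Hypothetical sketch of a coarse -> fine -> critic retrieval pipeline.
# Real systems would use learned dense retrievers and a trained critic model;
# word overlap is used here only to make the stages runnable end to end.

def coarse_retrieve(query, corpus, k=4):
    """Coarse stage: rank passages by raw word overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda p: len(q & set(p.lower().split())),
                    reverse=True)
    return ranked[:k]

def fine_rerank(query, passages, k=2):
    """Fine stage: rerank candidates by overlap normalized by passage length."""
    q = set(query.lower().split())
    def score(p):
        words = set(p.lower().split())
        return len(q & words) / max(len(words), 1)
    return sorted(passages, key=score, reverse=True)[:k]

def critic_filter(query, passages, threshold=0.2):
    """Critic stage: keep only passages whose relevance clears a threshold,
    so that noisy context never reaches the generator."""
    q = set(query.lower().split())
    kept = []
    for p in passages:
        words = set(p.lower().split())
        if len(q & words) / max(len(words), 1) >= threshold:
            kept.append(p)
    return kept

corpus = [
    "the eiffel tower is in paris france",
    "bananas are rich in potassium",
    "paris is the capital of france",
    "the moon orbits the earth",
]
query = "what city is the eiffel tower in"
context = critic_filter(query, fine_rerank(query, coarse_retrieve(query, corpus)))
print(context)
```

The staged design mirrors the abstract's motivation: the coarse pass keeps recall high over the full corpus, the fine pass spends more computation on a few candidates, and the critic trades recall for precision so the generator conditions only on high-quality context.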