[2602.15915] MaS-VQA: A Mask-and-Select Framework for Knowledge-Based Visual Question Answering
Summary
The paper presents MaS-VQA, a novel framework for Knowledge-Based Visual Question Answering that enhances answer accuracy by integrating visual data with filtered external knowledge.
Why It Matters
This research addresses the challenge of noisy and irrelevant retrieved knowledge in visual question answering systems, proposing a method that improves answer accuracy by filtering retrieved knowledge before reasoning over it. This has implications for AI and machine learning applications, particularly in fields requiring accurate visual interpretation grounded in external knowledge.
Key Takeaways
- MaS-VQA integrates explicit knowledge filtering with implicit reasoning.
- The Mask-and-Select mechanism enhances the relevance of visual and knowledge inputs.
- Experiments show consistent performance improvements across multiple model backbones.
- The framework addresses common issues of noise and irrelevance in knowledge retrieval.
- Ablation studies confirm the effectiveness of the selection mechanism.
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.15915 (cs) [Submitted on 17 Feb 2026]
Title: MaS-VQA: A Mask-and-Select Framework for Knowledge-Based Visual Question Answering
Authors: Xianwei Mao, Kai Ye, Sheng Zhou, Nan Zhang, Haikuan Huang, Bin Li, Jiajun Bu
Abstract: Knowledge-based Visual Question Answering (KB-VQA) requires models to answer questions by integrating visual information with external knowledge. However, retrieved knowledge is often noisy, partially irrelevant, or misaligned with the visual content, while internal model knowledge is difficult to control and interpret. Naive aggregation of these sources limits reasoning effectiveness and reduces answer accuracy. To address this, we propose MaS-VQA, a selection-driven framework that tightly couples explicit knowledge filtering with implicit knowledge reasoning. MaS-VQA first retrieves candidate passages and applies a Mask-and-Select mechanism to jointly prune irrelevant image regions and weakly relevant knowledge fragments, producing compact, high-signal multimodal knowledge. This filtered knowledge then guides the activation of internal knowledge in a constrained semantic space, enabling complementary co-modeling of explicit and implicit knowledge for robust answer prediction. Experiments on Encyclopedic-VQA a...
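To make the selection idea concrete, the sketch below shows a generic relevance-based mask-and-select filter: candidate image regions and knowledge passages are scored against the question, low-scoring candidates are masked out, and only the top-scoring ones are kept. This is a minimal illustrative sketch, not the paper's implementation; the `mask_and_select` function, the cosine-similarity scoring, and the toy embeddings are all assumptions made for the example.

```python
import math

def cosine(a, b):
    # Cosine similarity between two dense vectors (small epsilon avoids
    # division by zero for all-zero vectors).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-8)

def mask_and_select(question_emb, candidate_embs, top_k=2, threshold=0.3):
    """Hypothetical mask-and-select filter (illustration only).

    Scores each candidate (an image region or a knowledge passage)
    against the question embedding, masks candidates whose score falls
    below `threshold`, and keeps at most `top_k` of the rest.
    Returns the indices of the retained candidates, in original order.
    """
    scores = [cosine(question_emb, c) for c in candidate_embs]
    ranked = sorted(range(len(candidate_embs)),
                    key=lambda i: scores[i], reverse=True)
    return sorted(i for i in ranked[:top_k] if scores[i] >= threshold)

# Toy 4-d embeddings: one aligned region, one orthogonal, one opposed.
q = [1.0, 0.0, 0.0, 0.0]
regions = [[0.9, 0.1, 0.0, 0.0],   # relevant
           [0.0, 1.0, 0.0, 0.0],   # orthogonal -> masked by threshold
           [-1.0, 0.0, 0.0, 0.0]]  # opposed -> masked
passages = [[1.0, 0.0, 0.0, 0.0],  # relevant
            [0.0, 0.0, 1.0, 0.0],  # irrelevant
            [0.5, 0.0, 0.0, 0.0],  # relevant (same direction as q)
            [-1.0, 0.0, 0.0, 0.0]] # opposed

print(mask_and_select(q, regions))   # -> [0]
print(mask_and_select(q, passages))  # -> [0, 2]
```

In the paper's framework, the surviving regions and passages would then be concatenated into the compact multimodal context that guides the model's internal knowledge; here the filter only returns indices so the selection logic stays visible.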