[2602.15915] MaS-VQA: A Mask-and-Select Framework for Knowledge-Based Visual Question Answering
Summary
The paper presents MaS-VQA, a novel framework for Knowledge-Based Visual Question Answering that enhances answer accuracy by integrating visual data with filtered external knowledge.
Why It Matters
This research addresses the challenge of noisy and irrelevant retrieved knowledge in visual question answering systems, proposing a method that improves answer accuracy by filtering retrieved knowledge before reasoning over it. This has implications for AI and machine learning applications, particularly in fields requiring accurate visual interpretation grounded in external knowledge.
Key Takeaways
- MaS-VQA integrates explicit knowledge filtering with implicit reasoning.
- The Mask-and-Select mechanism enhances the relevance of visual and knowledge inputs.
- Experiments show consistent performance improvements across multiple model backbones.
- The framework addresses common issues of noise and irrelevance in knowledge retrieval.
- Ablation studies confirm the effectiveness of the selection mechanism.
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.15915 (cs) [Submitted on 17 Feb 2026]
Title: MaS-VQA: A Mask-and-Select Framework for Knowledge-Based Visual Question Answering
Authors: Xianwei Mao, Kai Ye, Sheng Zhou, Nan Zhang, Haikuan Huang, Bin Li, Jiajun Bu
Abstract: Knowledge-based Visual Question Answering (KB-VQA) requires models to answer questions by integrating visual information with external knowledge. However, retrieved knowledge is often noisy, partially irrelevant, or misaligned with the visual content, while internal model knowledge is difficult to control and interpret. Naive aggregation of these sources limits reasoning effectiveness and reduces answer accuracy. To address this, we propose MaS-VQA, a selection-driven framework that tightly couples explicit knowledge filtering with implicit knowledge reasoning. MaS-VQA first retrieves candidate passages and applies a Mask-and-Select mechanism to jointly prune irrelevant image regions and weakly relevant knowledge fragments, producing compact, high-signal multimodal knowledge. This filtered knowledge then guides the activation of internal knowledge in a constrained semantic space, enabling complementary co-modeling of explicit and implicit knowledge for robust answer prediction. Experiments on Encyclopedic-VQA a...
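To make the selection idea concrete, the sketch below shows a generic relevance-based mask-and-select filter: candidate image regions and knowledge passages are scored against the question, low-scoring candidates are masked out, and only the top-scoring ones are kept. This is a minimal illustrative sketch, not the paper's implementation; the `mask_and_select` function, the cosine-similarity scoring, and the toy embeddings are all assumptions made for the example.

```python
import math

def cosine(a, b):
    # Cosine similarity between two dense vectors (small epsilon avoids
    # division by zero for all-zero vectors).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-8)

def mask_and_select(question_emb, candidate_embs, top_k=2, threshold=0.3):
    """Hypothetical mask-and-select filter (illustration only).

    Scores each candidate (an image region or a knowledge passage)
    against the question embedding, masks candidates whose score falls
    below `threshold`, and keeps at most `top_k` of the rest.
    Returns the indices of the retained candidates, in original order.
    """
    scores = [cosine(question_emb, c) for c in candidate_embs]
    ranked = sorted(range(len(candidate_embs)),
                    key=lambda i: scores[i], reverse=True)
    return sorted(i for i in ranked[:top_k] if scores[i] >= threshold)

# Toy 4-d embeddings: one aligned region, one orthogonal, one opposed.
q = [1.0, 0.0, 0.0, 0.0]
regions = [[0.9, 0.1, 0.0, 0.0],   # relevant
           [0.0, 1.0, 0.0, 0.0],   # orthogonal -> masked by threshold
           [-1.0, 0.0, 0.0, 0.0]]  # opposed -> masked
passages = [[1.0, 0.0, 0.0, 0.0],  # relevant
            [0.0, 0.0, 1.0, 0.0],  # irrelevant
            [0.5, 0.0, 0.0, 0.0],  # relevant (same direction as q)
            [-1.0, 0.0, 0.0, 0.0]] # opposed

print(mask_and_select(q, regions))   # -> [0]
print(mask_and_select(q, passages))  # -> [0, 2]
```

In the paper's framework, the surviving regions and passages would then be concatenated into the compact multimodal context that guides the model's internal knowledge; here the filter only returns indices so the selection logic stays visible.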