[2602.15915] MaS-VQA: A Mask-and-Select Framework for Knowledge-Based Visual Question Answering


Summary

The paper presents MaS-VQA, a novel framework for Knowledge-Based Visual Question Answering that enhances answer accuracy by integrating visual data with filtered external knowledge.

Why It Matters

This research tackles a core weakness of knowledge-based visual question answering systems: retrieved external knowledge is often noisy or irrelevant, and naively feeding it to the model degrades answers. By filtering knowledge before reasoning over it, the proposed method improves answer accuracy, with implications for AI applications that depend on accurate visual interpretation grounded in external knowledge.

Key Takeaways

  • MaS-VQA integrates explicit knowledge filtering with implicit reasoning.
  • The Mask-and-Select mechanism enhances the relevance of visual and knowledge inputs.
  • Experiments show consistent performance improvements across multiple model backbones.
  • The framework addresses common issues of noise and irrelevance in knowledge retrieval.
  • Ablation studies confirm the effectiveness of the selection mechanism.

Computer Science > Computer Vision and Pattern Recognition, arXiv:2602.15915 (cs). Submitted on 17 Feb 2026.

Title: MaS-VQA: A Mask-and-Select Framework for Knowledge-Based Visual Question Answering
Authors: Xianwei Mao, Kai Ye, Sheng Zhou, Nan Zhang, Haikuan Huang, Bin Li, Jiajun Bu

Abstract: Knowledge-based Visual Question Answering (KB-VQA) requires models to answer questions by integrating visual information with external knowledge. However, retrieved knowledge is often noisy, partially irrelevant, or misaligned with the visual content, while internal model knowledge is difficult to control and interpret. Naive aggregation of these sources limits reasoning effectiveness and reduces answer accuracy. To address this, we propose MaS-VQA, a selection-driven framework that tightly couples explicit knowledge filtering with implicit knowledge reasoning. MaS-VQA first retrieves candidate passages and applies a Mask-and-Select mechanism to jointly prune irrelevant image regions and weakly relevant knowledge fragments, producing compact, high-signal multimodal knowledge. This filtered knowledge then guides the activation of internal knowledge in a constrained semantic space, enabling complementary co-modeling of explicit and implicit knowledge for robust answer prediction. Experiments on Encyclopedic-VQA a...
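The selection step the abstract describes, scoring candidate image regions and knowledge passages against the question and keeping only the most relevant of each, can be illustrated with a toy sketch. Everything below (the function name, cosine-similarity scoring, fixed top-k selection) is a hypothetical simplification for illustration, not the paper's actual Mask-and-Select implementation:

```python
import numpy as np

def mask_and_select(region_feats, passage_feats, query_feat,
                    k_regions=2, k_passages=2):
    """Toy Mask-and-Select: keep only the image regions and knowledge
    passages whose embeddings are most similar to the question embedding.
    A hypothetical simplification of MaS-VQA's selection mechanism."""
    def cos(mat, vec):
        # Cosine similarity of each row of `mat` against `vec`.
        return mat @ vec / (np.linalg.norm(mat, axis=-1) * np.linalg.norm(vec) + 1e-8)

    region_scores = cos(region_feats, query_feat)
    passage_scores = cos(passage_feats, query_feat)

    # "Mask" = drop everything outside the top-k; "select" = keep the rest.
    keep_regions = np.sort(np.argsort(-region_scores)[:k_regions])
    keep_passages = np.sort(np.argsort(-passage_scores)[:k_passages])
    return keep_regions, keep_passages

# Tiny 2-D example: query points along the first axis.
query = np.array([1.0, 0.0])
regions = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])   # 3 region embeddings
passages = np.array([[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]])  # 3 passage embeddings
kept_r, kept_p = mask_and_select(regions, passages, query)
print(kept_r, kept_p)  # regions 0 and 2, passages 1 and 2 survive the mask
```

In the actual framework the surviving regions and passages would then be fused and used to steer the model's internal (implicit) knowledge; this sketch only shows the pruning half.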
