[2510.06638] StaR-KVQA: Structured Reasoning Traces for Implicit-Knowledge Visual Question Answering
Computer Science > Computer Vision and Pattern Recognition

arXiv:2510.06638 (cs)

[Submitted on 8 Oct 2025 (v1), last revised 22 Mar 2026 (this version, v3)]

Title: StaR-KVQA: Structured Reasoning Traces for Implicit-Knowledge Visual Question Answering

Authors: Zhihao Wen, Wenkang Wei, Yuan Fang, Xingtong Yu, Hui Zhang, Weicheng Zhu, Xin Zhang

Abstract: Knowledge-based Visual Question Answering (KVQA) requires models to ground entities in images and reason over factual knowledge. Recent work has introduced its implicit-knowledge variant, IK-KVQA, where a multimodal large language model (MLLM) is the sole knowledge source and answers are produced without external retrieval. Existing IK-KVQA approaches, however, are typically trained with answer-only supervision: reasoning remains implicit, justifications are often weak or inconsistent, and generalization after standard supervised fine-tuning (SFT) can be brittle. We propose StaR-KVQA, a framework that equips IK-KVQA with dual-path structured reasoning traces - symbolic relation paths over text and vision together with path-grounded natural-language explanations - to provide a stronger inductive bias than generic answer-only supervision. These traces act as modality-aware scaffolds that guide the model toward relevant entities and attributes, offering ...
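The abstract does not pin down a concrete format for these dual-path traces. As an illustration only, here is a minimal Python sketch of one plausible representation: a symbolic relation path whose hops are tagged with the modality that grounds them, paired with a path-grounded explanation, serialized into a single SFT training target so the model is supervised on the reasoning rather than on the answer alone. The names `RelationHop`, `ReasoningTrace`, and `to_sft_target`, and the serialization format, are hypothetical and not the paper's API.

from dataclasses import dataclass

@dataclass
class RelationHop:
    """One hop in a symbolic relation path, e.g. (Eiffel Tower, located_in, Paris)."""
    head: str
    relation: str
    tail: str
    modality: str  # "vision" if grounded in the image, "text" if from the question/implicit knowledge

@dataclass
class ReasoningTrace:
    """Hypothetical dual-path trace: a symbolic relation path plus a path-grounded explanation."""
    path: list[RelationHop]
    explanation: str
    answer: str

    def to_sft_target(self) -> str:
        """Serialize path, explanation, and answer into one training target string,
        so fine-tuning supervises the structured reasoning, not just the answer."""
        hops = " ; ".join(
            f"[{h.modality}] {h.head} --{h.relation}--> {h.tail}" for h in self.path
        )
        return f"Path: {hops}\nExplanation: {self.explanation}\nAnswer: {self.answer}"

# Example (invented) trace for "What country is the landmark in the photo located in?"
trace = ReasoningTrace(
    path=[
        RelationHop("photo", "depicts", "Eiffel Tower", modality="vision"),
        RelationHop("Eiffel Tower", "located_in", "Paris", modality="text"),
        RelationHop("Paris", "capital_of", "France", modality="text"),
    ],
    explanation="The image shows the Eiffel Tower, which stands in Paris, the capital of France.",
    answer="France",
)
print(trace.to_sft_target())

Tagging each hop with its grounding modality is one way to realize the "modality-aware scaffold" the abstract describes: the vision-grounded hop anchors the trace to an entity in the image, while the text-grounded hops carry the implicit factual knowledge supplied by the MLLM itself.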