[2602.21054] VAUQ: Vision-Aware Uncertainty Quantification for LVLM Self-Evaluation
Summary
The paper introduces VAUQ, a vision-aware uncertainty quantification framework for large vision-language models (LVLMs). It improves self-evaluation by measuring how strongly a model's output depends on visual evidence.
Why It Matters
LVLMs frequently hallucinate, which limits their safe deployment in real-world applications. Existing self-evaluation methods depend heavily on language priors, so they are ill-suited to judging predictions that are conditioned on an image. VAUQ targets this gap by scoring answers according to how much they rely on the visual input.
Key Takeaways
- VAUQ improves self-evaluation of LVLMs by quantifying uncertainty based on visual evidence.
- The framework introduces the Image-Information Score (IS), which captures how much the visual input reduces predictive uncertainty.
- VAUQ outperforms existing self-evaluation methods across multiple datasets.
- An unsupervised core-region masking strategy amplifies the influence of salient visual regions.
- The scoring function combines predictive entropy with the core-masked IS and is training-free.
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.21054 (cs)
[Submitted on 24 Feb 2026]
Title: VAUQ: Vision-Aware Uncertainty Quantification for LVLM Self-Evaluation
Authors: Seongheon Park, Changdae Oh, Hyeong Kyu Choi, Xuefeng Du, Sharon Li
Abstract: Large Vision-Language Models (LVLMs) frequently hallucinate, limiting their safe deployment in real-world applications. Existing LLM self-evaluation methods rely on a model's ability to estimate the correctness of its own outputs, which can improve deployment reliability; however, they depend heavily on language priors and are therefore ill-suited for evaluating vision-conditioned predictions. We propose VAUQ, a vision-aware uncertainty quantification framework for LVLM self-evaluation that explicitly measures how strongly a model's output depends on visual evidence. VAUQ introduces the Image-Information Score (IS), which captures the reduction in predictive uncertainty attributable to visual input, and an unsupervised core-region masking strategy that amplifies the influence of salient regions. Combining predictive entropy with this core-masked IS yields a training-free scoring function that reliably reflects answer correctness. Comprehensive experiments show that VAUQ consistently outperforms existing self-evaluation methods across multipl...
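
To make the idea of "reduction in predictive uncertainty attributable to visual input" concrete, here is a minimal sketch. It is not the paper's implementation, and the abstract does not give the exact formulas. The sketch assumes that IS can be approximated as the entropy of the answer distribution without the visual evidence (image withheld, or its salient region masked) minus the entropy with the full image. The function names, the `alpha` weight, and the subtraction used to combine entropy with IS are all hypothetical choices made for illustration.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def image_information_score(
    p_with_image: Sequence[float],
    p_without_evidence: Sequence[float],
) -> float:
    """ASSUMED form of IS: uncertainty without the visual evidence
    minus uncertainty with the full image. Larger values mean the
    image did more to sharpen the prediction."""
    return entropy(p_without_evidence) - entropy(p_with_image)


def uncertainty_score(
    p_with_image: Sequence[float],
    p_core_masked: Sequence[float],
    alpha: float = 1.0,  # hypothetical weight, not taken from the paper
) -> float:
    """HYPOTHETICAL combination: predictive entropy minus alpha * IS.
    Lower scores suggest a confident, visually grounded answer."""
    info = image_information_score(p_with_image, p_core_masked)
    return entropy(p_with_image) - alpha * info


# Toy answer distributions over four candidate answers.
grounded = [0.97, 0.01, 0.01, 0.01]    # confident when the image is shown
masked = [0.25, 0.25, 0.25, 0.25]      # uniform once the evidence is hidden
prior_driven = [0.97, 0.01, 0.01, 0.01]  # equally confident without evidence

score_grounded = uncertainty_score(grounded, masked)
score_prior = uncertainty_score(grounded, prior_driven)
```

In this toy setup both answers have the same predictive entropy, but the first loses its confidence when the evidence is hidden, so its IS is large and its score is lower. The second is just as confident without the evidence, which is the language-prior behaviour the abstract says entropy-only methods cannot detect.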