[2510.08138] Understanding Temporal Logic Consistency in Video-Language Models through Cross-Modal Attention Discriminability
Computer Science > Computer Vision and Pattern Recognition

arXiv:2510.08138 (cs)

[Submitted on 9 Oct 2025 (v1), last revised 23 Mar 2026 (this version, v2)]

Title: Understanding Temporal Logic Consistency in Video-Language Models through Cross-Modal Attention Discriminability

Authors: Chengzhi Li, Heyan Huang, Ping Jian, Zhen Yang, Yaning Tian, Zhongbin Guo

Abstract: Large language models (LLMs) often generate self-contradictory outputs, which severely impacts their reliability and hinders their adoption in practical applications. In video-language models (Video-LLMs), this phenomenon has recently drawn the attention of researchers. Specifically, these models fail to provide logically consistent responses to rephrased questions based on their own grounding outputs. However, the underlying causes of this behavior remain underexplored. In this work, we adopt an interpretability-driven approach to analyze, statistically summarize, and intervene on the potential contributing factors. We find that a primary cause of the inconsistent responses is the inability of cross-modal attention heads to effectively distinguish video tokens across different timestamps. To address this, we propose an attention enhancement method called Temporally Conditioned Attention ...
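The core diagnostic, "attention heads failing to distinguish video tokens across timestamps," can be made concrete with a toy discriminability score. The sketch below is an illustration under assumed shapes, not the paper's actual metric: it takes a hypothetical cross-modal attention matrix (text queries attending to video tokens), pools attention mass per frame, and measures how far each head's per-frame distribution is from uniform. All tensor names and the normalization are this sketch's own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
num_heads, num_text, num_frames, tokens_per_frame = 4, 8, 6, 4
num_video = num_frames * tokens_per_frame

# Hypothetical cross-modal attention: one (text x video) matrix per head,
# rows normalized to sum to 1 (softmax over video tokens).
logits = rng.normal(size=(num_heads, num_text, num_video))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

def timestamp_discriminability(attn_head, tokens_per_frame):
    """Ratio of the variance of per-frame attention mass to its maximum.

    1.0 -> the head puts all its mass on a single frame (sharply temporal);
    0.0 -> mass is spread uniformly, i.e. timestamps are indistinguishable.
    """
    # Pool attention mass per frame (timestamp group) for each text query.
    per_frame = attn_head.reshape(attn_head.shape[0], -1, tokens_per_frame).sum(-1)
    between = per_frame.var(axis=1).mean()   # variance across frames, avg over queries
    uniform = 1.0 / per_frame.shape[1]       # per-frame mass under a flat head
    # A one-hot distribution over F frames has variance (1/F)(1 - 1/F).
    return between / (uniform * (1.0 - uniform))

scores = [timestamp_discriminability(attn[h], tokens_per_frame)
          for h in range(num_heads)]
```

Heads with scores near zero would be the candidates the paper's analysis flags: their attention carries almost no information about which timestamp a video token came from.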