[2603.24934] CVA: Context-aware Video-text Alignment for Video Temporal Grounding
Computer Science > Machine Learning

arXiv:2603.24934 (cs) [Submitted on 26 Mar 2026]

Title: CVA: Context-aware Video-text Alignment for Video Temporal Grounding

Authors: Sungho Moon, Seunghun Lee, Jiwan Seo, Sunghoon Im

Abstract: We propose Context-aware Video-text Alignment (CVA), a novel framework that addresses a central challenge in video temporal grounding: achieving temporally sensitive video-text alignment that remains robust to irrelevant background context. Our framework is built on three key components. First, we propose Query-aware Context Diversification (QCD), a new data augmentation strategy that ensures only semantically unrelated content is mixed in: it builds a video-text similarity-based pool of replacement clips to simulate diverse contexts while preventing the "false negatives" caused by query-agnostic mixing. Second, we introduce the Context-invariant Boundary Discrimination (CBD) loss, a contrastive loss that enforces semantic consistency at challenging temporal boundaries, making their representations robust to contextual shifts and hard negatives. Third, we introduce the Context-enhanced Transformer Encoder (CTE), a hierarchical architecture that combines windowed self-attention and bidirectional cross-attention with learnable queries to capture multi-scale temporal context. Through the syn...
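To make the QCD idea concrete, the following is a minimal NumPy sketch of query-aware background replacement as the abstract describes it: candidate clips from other videos are admitted to the replacement pool only if their similarity to the text query is low, so no query-relevant ("false negative") content is mixed into the background. The function name `qcd_augment`, the cosine-similarity measure, and the threshold interface are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def cosine_sim(clips, query):
    # Cosine similarity between each row of `clips` (N, D) and `query` (D,).
    c = clips / np.linalg.norm(clips, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    return c @ q

def qcd_augment(video, query_emb, candidate_clips, bg_mask,
                sim_thresh=0.3, rng=None):
    """Sketch of query-aware context diversification (hypothetical interface).

    video:           (T, D) clip features of the source video
    query_emb:       (D,)   text-query embedding
    candidate_clips: (N, D) clip features drawn from other videos
    bg_mask:         (T,)   bool, True for background (non-ground-truth) clips

    Only candidates whose similarity to the query falls BELOW `sim_thresh`
    enter the replacement pool, so semantically related content is never
    mixed in as fake background.
    """
    rng = rng or np.random.default_rng(0)
    sims = cosine_sim(candidate_clips, query_emb)
    pool = candidate_clips[sims < sim_thresh]   # semantically unrelated pool
    out = video.copy()
    bg_idx = np.flatnonzero(bg_mask)
    if len(pool) and len(bg_idx):
        # Replace each background clip with a random unrelated candidate.
        repl = pool[rng.integers(0, len(pool), size=len(bg_idx))]
        out[bg_idx] = repl
    return out                                  # ground-truth clips untouched
```

Foreground (ground-truth) clips are left intact, so the grounding target is unchanged while the surrounding context varies across augmented samples.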