[2603.22121] Mamba-VMR: Multimodal Query Augmentation via Generated Videos for Precise Temporal Grounding
Computer Science > Computer Vision and Pattern Recognition
arXiv:2603.22121 (cs)
[Submitted on 23 Mar 2026]

Title: Mamba-VMR: Multimodal Query Augmentation via Generated Videos for Precise Temporal Grounding
Authors: Yunzhuo Sun, Xinyue Liu, Yanyang Li, Nanding Wu, Yifang Xu, Linlin Zong, Xianchao Zhang, Wenxin Liang

Abstract: Text-driven video moment retrieval (VMR) remains challenging because hidden temporal dynamics in untrimmed videos are only partially captured, leading to imprecise grounding in long sequences. Traditional methods rely on natural language queries (NLQs) or static image augmentations, overlooking motion sequences and incurring high computational costs in Transformer-based architectures. Because existing approaches fail to integrate subtitle contexts and generated temporal priors effectively, we propose a novel two-stage framework for enhanced temporal grounding. In the first stage, LLM-guided subtitle matching identifies relevant textual cues from video subtitles; these cues are fused with the query to generate auxiliary short videos via text-to-video models, capturing implicit motion information as temporal priors. In the second stage, the augmented queries are processed by a multimodal controlled Mamba network that extends text-controlled selection with video-guided gating for efficient…
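
To make the second-stage idea concrete, below is a minimal sketch (not the authors' code) of how "video-guided gating" might extend a text-controlled selective state-space (Mamba-style) block: pooled features from the text query and from the generated auxiliary video jointly gate the input before the input-dependent selection parameters (Δ, B, C) are computed. All names (VideoGuidedMambaBlock, d_model, d_state, the gate layers) are illustrative assumptions, and the scan is written sequentially for clarity rather than speed.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoGuidedMambaBlock(nn.Module):
    """Sketch: selective SSM whose selection is modulated by text + generated-video features."""
    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.in_proj = nn.Linear(d_model, d_model)
        # Input-dependent SSM parameters, as in selective state-space models.
        self.to_delta = nn.Linear(d_model, d_model)
        self.to_B = nn.Linear(d_model, d_state)
        self.to_C = nn.Linear(d_model, d_state)
        self.A_log = nn.Parameter(torch.zeros(d_model, d_state))
        # Hypothetical gates: text and generated-video features each contribute
        # to one multiplicative gate over the sequence features.
        self.text_gate = nn.Linear(d_model, d_model)
        self.video_gate = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, text_feat, video_feat):
        # x: (B, L, d_model) untrimmed-video features
        # text_feat, video_feat: (B, d_model) pooled query / generated-video features
        u = self.in_proj(x)
        gate = torch.sigmoid(self.text_gate(text_feat) + self.video_gate(video_feat))
        u = u * gate.unsqueeze(1)  # broadcast the multimodal gate over positions
        delta = F.softplus(self.to_delta(u))     # (B, L, d_model) step sizes
        A = -torch.exp(self.A_log)               # (d_model, d_state) stable decay
        Bmat = self.to_B(u)                      # (B, L, d_state)
        Cmat = self.to_C(u)                      # (B, L, d_state)
        h = torch.zeros(x.size(0), x.size(2), A.size(1), device=x.device)
        ys = []
        for t in range(x.size(1)):               # naive recurrence, one step per frame
            dA = torch.exp(delta[:, t].unsqueeze(-1) * A)                      # (B, D, N)
            dBu = delta[:, t].unsqueeze(-1) * Bmat[:, t].unsqueeze(1) \
                  * u[:, t].unsqueeze(-1)                                      # (B, D, N)
            h = dA * h + dBu
            ys.append((h * Cmat[:, t].unsqueeze(1)).sum(-1))                   # (B, D)
        y = torch.stack(ys, dim=1)               # (B, L, d_model)
        return self.out_proj(y)

Under this reading, the generated short video supplies a temporal prior that modulates which frames the selective scan retains, in the same way the text query does in text-controlled selection; the linear-time recurrence is what gives the Mamba variant its efficiency edge over Transformer attention on long untrimmed videos.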