[2508.06869] VSI: Visual Subtitle Integration for Keyframe Selection to enhance Long Video Understanding
arXiv:2508.06869 (cs), Computer Science > Computer Vision and Pattern Recognition
Submitted on 9 Aug 2025 (v1), last revised 10 Apr 2026 (this version, v4)

Authors: Jianxiang He, Meisheng Hong, Jungang Li, Weiyu Guo, Xuming Hu, Hui Xiong

Abstract: Multimodal large language models (MLLMs) demonstrate exceptional performance in vision-language tasks, yet their processing of long videos is constrained by input context length and high computational costs. Sparse frame sampling thus becomes a necessary preprocessing step, with sampled frame quality directly impacting downstream performance. Existing keyframe search algorithms achieve a balance between efficiency and sampled frame quality but heavily rely on the visual modality alone. This makes them difficult to adapt to text-related tasks and often leads to retrieval results deviating from core semantic content. To address this, we propose the VISUAL-SUBTITLE INTEGRATION (VSI), a multimodal keyframe retrieval framework. It employs a dual-branch collaborative retrieval approach combining Video Search and Subtitle Match to fuse complementary visual and textual information for precise localization. Experiments on LongVideoBench and VideoMME demonstr...
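
The abstract does not spell out how the two branches are combined, so the following is only an illustrative sketch of one common way to fuse per-frame scores from a visual branch and a subtitle branch, not the authors' method. All names here (`min_max_normalize`, `fuse_scores`, `select_keyframes`, the weight `alpha`) and the choice of min-max normalization with a weighted sum are assumptions made for illustration.

```python
from typing import List, Sequence


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; a constant sequence maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_scores(
    visual: Sequence[float],
    subtitle: Sequence[float],
    alpha: float = 0.5,
) -> List[float]:
    """Weighted sum of normalized per-frame scores (hypothetical fusion rule).

    alpha = 1.0 uses only the visual branch, alpha = 0.0 only the subtitle branch.
    """
    if len(visual) != len(subtitle):
        raise ValueError("both branches must score the same set of frames")
    v = min_max_normalize(visual)
    t = min_max_normalize(subtitle)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(v, t)]


def select_keyframes(fused: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring frames in temporal order."""
    ranked = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    return sorted(ranked[:k])


# Toy scores for four candidate frames (made-up numbers, not from the paper).
visual_scores = [0.0, 1.0, 0.0, 0.5]
subtitle_scores = [0.0, 0.0, 0.5, 1.0]
fused = fuse_scores(visual_scores, subtitle_scores, alpha=0.5)
print(select_keyframes(fused, k=2))  # prints [1, 3]
```

In this toy setup, frame 3 ranks first because both branches score it well, while frame 1 is kept on the strength of the visual branch alone. That is the kind of complementarity the abstract attributes to combining visual and textual evidence.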