[2512.04000] Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding
Computer Science > Computer Vision and Pattern Recognition
arXiv:2512.04000 (cs)
[Submitted on 3 Dec 2025 (v1), last revised 24 Mar 2026 (this version, v2)]

Title: Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding
Authors: Jialuo Li, Bin Li, Jiahao Li, Yan Lu

Abstract: The application of Large Multimodal Models (LMMs) to long-form video understanding is constrained by limited context lengths and the computationally prohibitive cost of processing dense video tokens. Consequently, recent research has focused on query-aware frame selection methods, which often incur significant computational overhead. This paper challenges the assumption that such complex search mechanisms are universally necessary. We first identify and validate a query typology distinguishing between global and localized queries. We demonstrate that while uniform sampling is both effective and efficient for global queries, localized queries necessitate query-aware selection for optimal performance. Building on this insight, we propose DIG, a training-free frame selection framework that adapts its strategy based on the query type. Specifically, DIG employs efficient uniform sampling for global queries while activating a specialized pipeline to extract query-relevant frames...