[2512.04000] Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding
Computer Science > Computer Vision and Pattern Recognition
arXiv:2512.04000 (cs)
[Submitted on 3 Dec 2025 (v1), last revised 24 Mar 2026 (this version, v2)]

Title: Divide, then Ground: Adapting Frame Selection to Query Types for Long-Form Video Understanding
Authors: Jialuo Li, Bin Li, Jiahao Li, Yan Lu

Abstract: The application of Large Multimodal Models (LMMs) to long-form video understanding is constrained by limited context lengths and the computationally prohibitive cost of processing dense video tokens. Consequently, recent research has focused on query-aware frame selection methods, which often incur significant computational overhead. This paper challenges the assumption that such complex search mechanisms are universally necessary. We first identify and validate a query typology distinguishing between global and localized queries. We demonstrate that while uniform sampling is both effective and efficient for global queries, localized queries necessitate query-aware selection for optimal performance. Building on this insight, we propose DIG, a training-free frame selection framework that adapts its strategy based on the query type. Specifically, DIG employs efficient uniform sampling for global queries while activating a specialized pipeline to extract query-relevant frames...