[2512.14698] TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
Computer Science > Computer Vision and Pattern Recognition

arXiv:2512.14698 (cs)

[Submitted on 16 Dec 2025 (v1), last revised 26 Mar 2026 (this version, v2)]

Title: TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs

Authors: Jun Zhang, Teng Wang, Yuying Ge, Yixiao Ge, Xinhao Li, Ying Shan, Limin Wang

Abstract: This paper does not introduce a novel method but instead establishes a straightforward, incremental, yet essential baseline for video temporal grounding (VTG), a core capability in video understanding. While multimodal large language models (MLLMs) excel at various video understanding tasks, the recipes for optimizing them for VTG remain under-explored. In this paper, we present TimeLens, a systematic investigation into building MLLMs with strong VTG ability, along two primary dimensions: data quality and algorithmic design. We first expose critical quality issues in existing VTG benchmarks and introduce TimeLens-Bench, comprising meticulously re-annotated versions of three popular benchmarks with strict quality criteria. Our analysis reveals dramatic model re-rankings compared to legacy benchmarks, confirming the unreliability of prior evaluation standards. We also address noisy training data through an automated re-annotation pipeline, yielding TimeLens-100K, a large-scale, high-quality training...
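For context on the task being evaluated: VTG systems predict a [start, end] segment of a video for a language query, and benchmarks like those re-annotated in TimeLens-Bench are conventionally scored with temporal IoU between the predicted and annotated segments (e.g., Recall@1 at an IoU threshold). A minimal sketch of that standard metric, not code from the paper; the function names and thresholds are illustrative:

def temporal_iou(pred, gt):
    """IoU between two [start, end] segments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_iou(preds, gts, threshold=0.5):
    """Fraction of queries whose top-1 prediction meets the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)

# Example: a prediction of [12.0, 30.0] against ground truth [15.0, 28.0]
# overlaps for 13s over an 18s union, so temporal_iou(...) ~= 0.722.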