[2602.18702] Think with Grounding: Curriculum Reinforced Reasoning with Video Grounding for Long Video Understanding
Summary
The paper presents Video-TwG, a curriculum reinforced framework for improving long video understanding through selective video grounding and reasoning.
Why It Matters
Long videos are temporally redundant, so under a limited video context length the detailed clues a question depends on are often dropped. Text-only reasoning over that fixed context can then exacerbate hallucinations. Video-TwG targets this gap by letting the model zoom into question-relevant clips on demand while it reasons, which makes the work relevant to computer vision and AI-driven content analysis.
Key Takeaways
- Introduces Video-TwG, a framework for enhanced long video understanding.
- Employs a Two-stage Reinforced Curriculum Strategy for training.
- Utilizes fine-grained grounding rewards to improve reasoning accuracy.
- Demonstrates superior performance on multiple video understanding benchmarks.
- Addresses challenges of temporal redundancy and hallucinations in video analysis.
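To make the selective-grounding idea concrete, here is a minimal Python sketch of an interleaved reasoning loop in which a policy either requests a clip or answers. It is an illustration under stated assumptions, not the paper's implementation: `Ground`, `Answer`, `sample_timestamps`, `think_with_grounding`, `toy_policy`, and all parameter names and defaults are hypothetical, and frames are stood in for by timestamps.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple, Union


@dataclass
class Ground:
    """Hypothetical action: request a zoom into the clip [start, end), in seconds."""
    start: float
    end: float


@dataclass
class Answer:
    """Hypothetical action: stop reasoning and return an answer."""
    text: str


Action = Union[Ground, Answer]
Context = List[Tuple[str, object]]


def sample_timestamps(start: float, end: float, n: int) -> List[float]:
    """Return n evenly spaced timestamps in [start, end)."""
    step = (end - start) / n
    return [start + i * step for i in range(n)]


def think_with_grounding(
    policy: Callable[[Context], Action],
    duration: float,
    question: str,
    coarse_frames: int = 4,
    clip_frames: int = 4,
    max_steps: int = 3,
) -> Tuple[Optional[str], Context]:
    """Run the loop: start from a coarse view, add clip frames only on request."""
    context: Context = [
        ("question", question),
        ("frames", sample_timestamps(0.0, duration, coarse_frames)),
    ]
    for _ in range(max_steps):
        action = policy(context)
        if isinstance(action, Answer):
            return action.text, context
        # Clamp the requested clip to the video bounds before sampling.
        start, end = max(0.0, action.start), min(duration, action.end)
        if end <= start:
            break
        context.append(("frames", sample_timestamps(start, end, clip_frames)))
    return None, context  # no answer within the step budget


def toy_policy(context: Context) -> Action:
    """Stand-in for a video LLM: ground once, then answer."""
    views = sum(1 for kind, _ in context if kind == "frames")
    return Ground(120.0, 140.0) if views == 1 else Answer("B")


answer, context = think_with_grounding(toy_policy, 600.0, "What happens next?")
```

In this sketch the context grows only when the policy asks for a clip, so the coarse view stays small and extra frames are spent on the question-relevant segment.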
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.18702 (cs)
[Submitted on 21 Feb 2026]
Title: Think with Grounding: Curriculum Reinforced Reasoning with Video Grounding for Long Video Understanding
Authors: Houlun Chen, Xin Wang, Guangyao Li, Yuwei Zhou, Yihan Chen, Jia Jia, Wenwu Zhu
Abstract: Long video understanding is challenging due to rich and complicated multimodal clues in long temporal [...] methods adopt reasoning to improve the model's ability to analyze complex video clues in long videos via text-form [...]. However, the existing literature suffers from the fact that text-only reasoning under a fixed video context may exacerbate hallucinations, since detailed crucial clues are often ignored under limited video context length due to the temporal redundancy of long [...]. To address this gap, we propose Video-TwG, a curriculum reinforced framework that employs a novel Think-with-Grounding paradigm, enabling video LLMs to actively decide when to perform on-demand grounding during interleaved text-video reasoning, selectively zooming into question-relevant clips only when [...]. Video-TwG can be trained end-to-end in a straightforward manner, without relying on complex auxiliary modules or heavily annotated reasoning traces. In detail, we design...