[2602.13602] Towards Sparse Video Understanding and Reasoning
Summary
The paper introduces REVISE (Reasoning with Video Sparsity), a multi-round agent for video question answering (VQA). Rather than sampling frames uniformly, it selects a small set of informative frames, carries a summary-as-state across rounds, and stops early when confident.
Why It Matters
Video understanding and reasoning can be resource-intensive, particularly in VQA, where traditional methods sample frames uniformly rather than selectively. REVISE improves accuracy while reducing the frames, rounds, and prompt tokens needed per question, which matters for latency-sensitive uses such as real-time video analysis.
Key Takeaways
- Introduces REVISE, a multi-round agent for video question answering.
- Utilizes a summary-as-state approach to enhance efficiency.
- Implements EAGER, a new annotation-free reward mechanism for fine-tuning.
- Demonstrates improved accuracy across multiple VQA benchmarks.
- Reduces the number of frames and rounds needed for effective reasoning.
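The multi-round, summary-as-state idea in the takeaways above can be pictured as a short loop. What follows is only a minimal sketch under stated assumptions, not the paper's implementation: the function `run_sparse_agent`, the `StepFn` interface, the confidence threshold, and the toy model are all hypothetical names and values chosen for illustration.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical model interface (an assumption, not the paper's API):
# (question, summary, new_frames)
#   -> (updated_summary, answer_or_None, confidence, frames_to_request_next)
StepFn = Callable[[str, str, List[int]],
                  Tuple[str, Optional[str], float, List[int]]]


def run_sparse_agent(question: str,
                     initial_frames: List[int],
                     step: StepFn,
                     max_rounds: int = 4,
                     confidence_threshold: float = 0.8):
    """Run a multi-round loop that carries only a text summary between rounds."""
    summary = ""                   # the only state passed from round to round
    frames = list(initial_frames)  # frame indices to inspect this round
    frames_used: List[int] = []
    answer: Optional[str] = None
    rounds = 0
    while rounds < max_rounds and frames:
        rounds += 1
        frames_used.extend(frames)
        summary, answer, confidence, frames = step(question, summary, frames)
        # Assumed stopping rule: stop once an answer clears the threshold.
        if answer is not None and confidence >= confidence_threshold:
            break
    return answer, rounds, frames_used, summary


def toy_step(question: str, summary: str, new_frames: List[int]):
    """Stand-in for a VLM call: becomes confident once frame 42 is shown."""
    note = f"looked at {new_frames}"
    summary = f"{summary}; {note}" if summary else note
    if 42 in new_frames:
        return summary, "B", 0.9, []
    return summary, None, 0.2, [42]  # not confident yet: request frame 42


answer, rounds, frames_used, summary = run_sparse_agent(
    "What happens after the door opens?", [0, 10], toy_step)
```

In this toy run the agent inspects three frames over two rounds and passes only the text summary from one round to the next, never the earlier frames.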
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.13602 (cs)
[Submitted on 14 Feb 2026]
Title: Towards Sparse Video Understanding and Reasoning
Authors: Chenwei Xu, Zhen Ye, Shang Wu, Weijian Li, Zihan Wang, Zhuofan Xia, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Han Liu
Abstract: We present REVISE (Reasoning with Video Sparsity), a multi-round agent for video question answering (VQA). Instead of uniformly sampling frames, REVISE selects a small set of informative frames, maintains a summary-as-state across rounds, and stops early when confident. It supports proprietary vision-language models (VLMs) in a "plug-and-play" setting and enables reinforcement fine-tuning for open-source models. For fine-tuning, we introduce EAGER (Evidence-Adjusted Gain for Efficient Reasoning), an annotation-free reward with three terms: (1) Confidence gain: after new frames are added, we reward the increase in the log-odds gap between the correct option and the strongest alternative; (2) Summary sufficiency: at answer time we re-ask using only the last committed summary and reward success; (3) Correct-and-early stop: answering correctly within a small turn budget is rewarded. Across multiple VQA benchmarks, REVISE improves accuracy while reducing frames, rounds, and prompt tokens, demonstrat...
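The three EAGER terms named in the abstract can be made concrete with a small sketch. The abstract does not give the exact formula, so several choices here are assumptions for illustration only: the log-odds gap is read as log p(correct) minus log p(strongest alternative), only positive round-to-round increases are rewarded, the three terms are combined as a weighted sum, and the weights and `turn_budget` value are placeholders. The names `margin` and `eager_reward` are hypothetical.

```python
import math
from typing import Dict, List


def margin(probs: Dict[str, float], correct: str, eps: float = 1e-6) -> float:
    """Gap between the correct option and the strongest alternative.

    Assumed reading of "log-odds gap": log p(correct) - log p(best other).
    """
    p_correct = max(probs[correct], eps)
    p_alt = max(max(p for k, p in probs.items() if k != correct), eps)
    return math.log(p_correct) - math.log(p_alt)


def eager_reward(round_probs: List[Dict[str, float]],
                 correct: str,
                 final_answer: str,
                 summary_only_answer: str,
                 rounds_used: int,
                 turn_budget: int = 3,
                 w_gain: float = 1.0,
                 w_summary: float = 1.0,
                 w_early: float = 1.0) -> float:
    """Illustrative weighted sum of the three terms (weights are assumptions)."""
    margins = [margin(p, correct) for p in round_probs]
    # (1) Confidence gain: increase in the margin after new frames are added.
    gain = sum(max(0.0, after - before)
               for before, after in zip(margins, margins[1:]))
    # (2) Summary sufficiency: re-asking with only the summary still succeeds.
    summary_ok = 1.0 if summary_only_answer == correct else 0.0
    # (3) Correct-and-early stop: right answer within the turn budget.
    early_ok = 1.0 if (final_answer == correct
                       and rounds_used <= turn_budget) else 0.0
    return w_gain * gain + w_summary * summary_ok + w_early * early_ok


# Option probabilities before and after one extra round of frames.
probs_per_round = [
    {"A": 0.25, "B": 0.5, "C": 0.25},
    {"A": 0.1, "B": 0.8, "C": 0.1},
]
reward = eager_reward(probs_per_round, correct="B", final_answer="B",
                      summary_only_answer="B", rounds_used=2)
```

Under these assumptions the first term rewards frames that widen the margin for the correct option, while the second and third are simple success indicators. A real training setup would need to settle the scaling and weighting that this sketch leaves open.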