[2603.22918] EVA: Efficient Reinforcement Learning for End-to-End Video Agent
Computer Science > Computer Vision and Pattern Recognition
arXiv:2603.22918 (cs)
[Submitted on 24 Mar 2026]

Title: EVA: Efficient Reinforcement Learning for End-to-End Video Agent
Authors: Yaolun Zhang, Ruohui Wang, Jiahao Wang, Yepeng Tang, Xuanyu Zheng, Haonan Duan, Hao Lu, Hanming Deng, Lewei Lu

Abstract: Video understanding with multimodal large language models (MLLMs) remains challenging due to the long token sequences of videos, which contain extensive temporal dependencies and redundant frames. Existing approaches typically treat MLLMs as passive recognizers, processing entire videos or uniformly sampled frames without adaptive reasoning. Recent agent-based methods introduce external tools, yet still depend on manually designed workflows and perception-first strategies, resulting in inefficiency on long videos. We present EVA, an Efficient Reinforcement Learning framework for an End-to-End Video Agent, which enables planning-before-perception through iterative summary-plan-action-reflection reasoning. EVA autonomously decides what to watch, when to watch, and how to watch, achieving query-driven and efficient video understanding. To train such agents, we design a simple yet effective three-stage learning pipeline comprising supervised fine-tuning (SFT), Kahneman-Tversky Optimization (KTO), and Generalized Rewar...
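The summary-plan-action-reflection loop described in the abstract can be sketched as a minimal toy agent. Everything below is an illustrative assumption, not the paper's actual interface: `run_video_agent`, the frame-selection policy, and the string-matching "reflection" stand in for the learned MLLM policy that EVA trains with reinforcement learning.

```python
def run_video_agent(query, video_frames, perceive, max_steps=4):
    """Toy summary-plan-action-reflection loop (hypothetical sketch).

    Instead of reading every frame, the agent iteratively summarizes
    what it has seen, plans which frame to watch next, perceives it
    (action), and reflects on whether the query is answered.
    """
    history = []                         # (frame_index, observation) pairs
    unseen = list(range(len(video_frames)))
    for _ in range(max_steps):
        # summarize: condense observations so far into a short state
        summary = "; ".join(f"frame {i}: {o}" for i, o in history)
        if not unseen:
            break
        # plan: a trivial stand-in policy that picks the middle unseen
        # frame; in EVA this choice would come from the trained MLLM
        target = unseen[len(unseen) // 2]
        unseen.remove(target)
        # action: actually look at the chosen frame
        observation = perceive(video_frames[target])
        history.append((target, observation))
        # reflection: stop early once the observation addresses the query
        if query.lower() in observation.lower():
            break
    return history
```

With frames represented as caption strings and `perceive` as the identity function, the loop stops as soon as a relevant frame is observed, illustrating the query-driven early exit that makes planning-before-perception cheaper than uniform sampling.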