[2602.24289] Mode Seeking meets Mean Seeking for Fast Long Video Generation
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.24289 (cs)
[Submitted on 27 Feb 2026]

Title: Mode Seeking meets Mean Seeking for Fast Long Video Generation
Authors: Shengqu Cai, Weili Nie, Chao Liu, Julius Berner, Lvmin Zhang, Nanye Ma, Hansheng Chen, Maneesh Agrawala, Leonidas Guibas, Gordon Wetzstein, Arash Vahdat

Abstract: Scaling video generation from seconds to minutes faces a critical bottleneck: while short-video data is abundant and high-fidelity, coherent long-form data is scarce and limited to narrow domains. To address this, we propose a training paradigm where Mode Seeking meets Mean Seeking, decoupling local fidelity from long-term coherence over a unified representation via a Decoupled Diffusion Transformer. Our approach uses a global Flow Matching head, trained via supervised learning on long videos, to capture narrative structure, while simultaneously employing a local Distribution Matching head that aligns sliding windows to a frozen short-video teacher via a mode-seeking reverse-KL divergence. This strategy enables the synthesis of minute-scale videos: the model learns long-range coherence and motion from limited long videos via supervised flow matching, while inheriting local realism by aligning every sliding-window segment of the student to the frozen short-video teacher, result...
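The two decoupled objectives described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: it assumes a linear interpolation path for the flow matching loss and a DMD-style approximation of the reverse-KL gradient (difference of student and teacher score estimates on each window); all function names and the window/stride parameters are illustrative.

```python
import numpy as np

def flow_matching_loss(model, x1, x0, t):
    """Mean-seeking global objective: supervised flow matching on long
    videos. With a linear path x_t = (1-t)*x0 + t*x1, the regression
    target is the constant velocity x1 - x0."""
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    v_pred = model(x_t, t)
    return float(np.mean((v_pred - v_target) ** 2))

def reverse_kl_window_grad(student_score, teacher_score, window):
    """Mode-seeking local objective (DMD-style sketch): the gradient of
    the reverse KL w.r.t. a generated window is approximated by the
    difference of the student's and the frozen teacher's score
    estimates on that window."""
    return student_score(window) - teacher_score(window)

def sliding_windows(video, win, stride):
    """Split a long generated video (list/array of frames) into the
    overlapping segments that are each aligned to the short-video
    teacher."""
    return [video[i:i + win] for i in range(0, len(video) - win + 1, stride)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy "frames": a 12-frame video with 3 scalar features per frame.
    long_video = rng.normal(size=(12, 3))

    # Global head: flow matching between noise x0 and data x1.
    zero_model = lambda x_t, t: np.zeros_like(x_t)  # placeholder network
    fm = flow_matching_loss(zero_model, long_video, np.zeros_like(long_video), t=0.5)

    # Local head: reverse-KL gradients on each 4-frame sliding window.
    student = lambda w: 0.5 * w          # placeholder score estimates
    teacher = lambda w: np.zeros_like(w)
    grads = [reverse_kl_window_grad(student, teacher, w)
             for w in sliding_windows(long_video, win=4, stride=2)]
    print(fm, len(grads))
```

In a real system the placeholder lambdas would be the two heads of the Decoupled Diffusion Transformer, the teacher would be a frozen pretrained short-video diffusion model, and the per-window gradients would be backpropagated into the generator rather than returned directly.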