[2603.25730] PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference
Computer Science > Computer Vision and Pattern Recognition

arXiv:2603.25730 (cs)

[Submitted on 26 Mar 2026]

Title: PackForcing: Short Video Training Suffices for Long Video Sampling and Long Context Inference

Authors: Xiaofeng Mao, Shaohao Rui, Kaining Ying, Bo Zheng, Chuanhao Li, Mingmin Chi, Kaipeng Zhang

Abstract: Autoregressive video diffusion models have demonstrated remarkable progress, yet they remain bottlenecked by intractable linear KV-cache growth, temporal repetition, and compounding errors during long-video generation. To address these challenges, we present PackForcing, a unified framework that efficiently manages the generation history through a novel three-partition KV-cache strategy. Specifically, we categorize the historical context into three distinct types: (1) Sink tokens, which preserve early anchor frames at full resolution to maintain global semantics; (2) Mid tokens, which achieve massive spatiotemporal compression (32x token reduction) via a dual-branch network fusing progressive 3D convolutions with low-resolution VAE re-encoding; and (3) Recent tokens, kept at full resolution to ensure local temporal coherence. To strictly bound the memory footprint without sacrificing quality, we introduce a dynamic top-$k$ context selection mechanism for the mid tokens, cou...
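The three-partition cache described in the abstract can be sketched in a few lines. The following is a minimal illustrative sketch, not the paper's implementation: `build_context`, the average-pooling "compressor", and the similarity-based scoring are all hypothetical stand-ins (the paper uses a dual-branch network of progressive 3D convolutions and low-resolution VAE re-encoding, and its top-$k$ criterion is not specified in the truncated abstract). What the sketch shows is the structural idea: full-resolution sink and recent frames, 32x-compressed mid frames, and a top-$k$ cap that bounds the context size regardless of how long the history grows.

```python
import numpy as np

def build_context(frames, num_sink=1, num_recent=2, compress_ratio=32, k=4):
    """Assemble a bounded attention context from a growing frame history.

    frames: list of (tokens, dim) arrays, one entry per generated frame.
    Sink and recent frames are kept at full resolution; mid frames are
    compressed by `compress_ratio` and only the top-k highest-scoring
    compressed frames are retained, so the context size is bounded.
    """
    sink = frames[:num_sink]
    recent = frames[max(num_sink, len(frames) - num_recent):]
    mid = frames[num_sink:len(frames) - num_recent]

    def compress(x):
        # Stand-in for the dual-branch compressor: average-pool groups of
        # `compress_ratio` tokens down to one token each.
        n = max(1, x.shape[0] // compress_ratio)
        return x[: n * compress_ratio].reshape(n, -1, x.shape[1]).mean(axis=1)

    mid_c = [compress(f) for f in mid]

    # Dynamic top-k selection: score each compressed mid frame against a
    # query (here, the mean token of the most recent frame) and keep the
    # k best, preserving their temporal order.
    if mid_c and recent:
        query = recent[-1].mean(axis=0)
        scores = [float(f.mean(axis=0) @ query) for f in mid_c]
        keep = sorted(np.argsort(scores)[::-1][:k])
        mid_c = [mid_c[i] for i in keep]

    return np.concatenate(sink + mid_c + recent, axis=0)
```

Under these assumptions, a 10-frame history of 64 tokens per frame yields a context of 1 sink frame (64 tokens), at most k=4 compressed mid frames (2 tokens each), and 2 recent frames (128 tokens): 200 tokens total, a bound that no longer grows with video length.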