[2602.13191] CoPE-VideoLM: Codec Primitives For Efficient Video Language Models
Summary
The paper presents CoPE-VideoLM, an approach that uses video codec primitives (motion vectors and residuals) to make video language models more efficient, substantially reducing computational overhead while maintaining performance across diverse benchmarks.
Why It Matters
As AI systems increasingly rely on video data, optimizing video language models is crucial for improving their understanding of temporal dynamics. CoPE-VideoLM addresses key inefficiencies in current methods, making it relevant for researchers and developers in machine learning and computer vision.
Key Takeaways
- CoPE-VideoLM leverages codec primitives to reduce computational costs.
- The approach reduces time-to-first-token by up to 86% and token usage by up to 93% compared to standard VideoLMs.
- Performance is maintained or exceeded on 14 diverse video understanding benchmarks.
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.13191 (cs) [Submitted on 13 Feb 2026]
Title: CoPE-VideoLM: Codec Primitives For Efficient Video Language Models
Authors: Sayan Deb Sarkar, Rémi Pautrat, Ondrej Miksik, Marc Pollefeys, Iro Armeni, Mahdi Rad, Mihai Dusmanu
Abstract: Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. To fit within the maximum context window constraint, current methods use keyframe sampling, which can miss both macro-level events and micro-level details due to sparse temporal coverage. Furthermore, processing full images and their tokens for each frame incurs substantial computational overhead. To address these limitations, we propose to leverage video codec primitives (specifically motion vectors and residuals), which natively encode video redundancy and sparsity without requiring expensive full-image encoding for most frames. To this end, we introduce lightweight transformer-based encoders that aggregate codec primitives and align their representations with image encoder embeddings through a pre-training strategy that accelerates convergence during end-to-end fine-tuning. Our approach reduces the time-to-first-token by up to 86% and token usage by up to 93% compared to standard VideoLMs. Moreover, by varying the k...
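To make the core idea concrete, here is a minimal numpy sketch of the kind of aggregation the abstract describes: per-macroblock motion vectors and residual descriptors are projected into token embeddings and mixed by a single self-attention layer standing in for the paper's lightweight transformer encoder. All shapes, dimensions, and the mean-pooled frame embedding are illustrative assumptions, not the authors' actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical codec primitives for one P-frame over a 4x4 grid of macroblocks.
n_blocks = 16   # macroblocks per frame (assumed)
mv_dim = 2      # (dx, dy) motion vector per macroblock
res_dim = 8     # pooled residual descriptor per macroblock (assumed)
d_model = 32    # embedding width of the lightweight encoder (assumed)

motion_vectors = rng.normal(size=(n_blocks, mv_dim))
residuals = rng.normal(size=(n_blocks, res_dim))

# Project concatenated primitives into token embeddings.
W_in = rng.normal(scale=0.1, size=(mv_dim + res_dim, d_model))
tokens = np.concatenate([motion_vectors, residuals], axis=-1) @ W_in

def self_attention(x, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over primitive tokens."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))
encoded = self_attention(tokens, W_q, W_k, W_v)

# Aggregate to a single frame-level embedding that would be aligned with the
# image encoder's embedding space during pre-training (per the abstract).
frame_embedding = encoded.mean(axis=0)
print(frame_embedding.shape)
```

The point of the sketch is the cost asymmetry: only keyframes would pass through a full image encoder, while the (much cheaper) primitive tokens above cover the frames in between.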