[2602.16132] CHAI: CacHe Attention Inference for text2video
Summary
The paper presents CHAI, an approach that speeds up text-to-video diffusion inference by using Cache Attention to reuse latents cached from earlier inferences, achieving 1.65x to 3.35x speedups over the OpenSora 1.2 baseline while maintaining video quality.
Why It Matters
As text-to-video models become more prevalent, optimizing their performance without sacrificing quality is crucial. CHAI addresses the challenge of slow inference times, making these models more practical for real-world applications. This innovation could lead to broader adoption and development in the field of generative AI.
Key Takeaways
- CHAI uses cross-inference caching to reduce inference latency in text-to-video diffusion models.
- The Cache Attention mechanism enables high-quality video generation with as few as 8 denoising steps.
- Inference is 1.65x to 3.35x faster than the OpenSora 1.2 baseline.
Abstract
arXiv:2602.16132 [cs.CV] (Computer Science > Computer Vision and Pattern Recognition), submitted on 18 Feb 2026
Title: CHAI: CacHe Attention Inference for text2video
Authors: Joel Mathew Cherian, Ashutosh Muralidhara Bharadwaj, Vima Gupta, Anand Padmanabha Iyer
Text-to-video diffusion models deliver impressive results but remain slow because of the sequential denoising of 3D latents. Existing approaches to speed up inference either require expensive model retraining or use heuristic-based step skipping, which struggles to maintain video quality as the number of denoising steps decreases. Our work, CHAI, aims to use cross-inference caching to reduce latency while maintaining video quality. We introduce Cache Attention as an effective method for attending to shared objects/scenes across cross-inference latents. This selective attention mechanism enables effective reuse of cached latents across semantically related prompts, yielding high cache hit rates. We show that it is possible to generate high-quality videos using Cache Attention with as few as 8 denoising steps. When integrated into the overall system, CHAI is 1.65x - 3.35x faster than baseline OpenSora 1.2 while maintaining video quality.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
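To make the cross-inference caching idea concrete, the sketch below shows a minimal semantic latent cache keyed by prompt embeddings: a new prompt hits the cache when its embedding is sufficiently similar to that of an earlier prompt, and the cached latent can then seed a shortened denoising run. This is an illustrative toy, not the paper's implementation; the class, method names, and similarity threshold are all assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors (plain lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class LatentCache:
    """Toy cross-inference cache: prompt embedding -> cached latent.

    Hypothetical sketch of the caching idea described in the abstract;
    CHAI's actual Cache Attention mechanism operates inside the
    diffusion model's attention layers, not as a lookup table.
    """

    def __init__(self, threshold=0.85):
        self.threshold = threshold  # similarity required for a cache hit
        self.keys = []              # prompt embeddings
        self.values = []            # latents saved from earlier inferences

    def lookup(self, emb):
        """Return the latent of the most similar cached prompt, or None."""
        best, best_sim = None, self.threshold
        for k, v in zip(self.keys, self.values):
            sim = cosine(emb, k)
            if sim >= best_sim:
                best, best_sim = v, sim
        return best

    def insert(self, emb, latent):
        """Store a finished inference's latent under its prompt embedding."""
        self.keys.append(emb)
        self.values.append(latent)
```

On a hit, the reused latent lets the sampler run far fewer denoising steps (the paper reports as few as 8); on a miss, the model falls back to a full run and the result is inserted for future prompts.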