[2501.16997] Resolving Spatio-Temporal Entanglement in Video Prediction via Multi-Modal Attention
Computer Science > Computer Vision and Pattern Recognition
arXiv:2501.16997 (cs)
[Submitted on 28 Jan 2025 (v1), last revised 29 Mar 2026 (this version, v2)]

Title: Resolving Spatio-Temporal Entanglement in Video Prediction via Multi-Modal Attention

Authors: Shreyam Gupta (1), P. Agrawal (2), Priyam Gupta (3)
((1) Indian Institute of Technology (BHU), Varanasi, India, (2) University of Colorado, Boulder, USA, (3) Intelligent Field Robotic Systems (IFRoS), University of Girona, Spain)

Abstract: Rapid progress in computer vision has created a need for more advanced methods of temporal sequence modeling, an area essential to autonomous systems, real-time surveillance, and anomaly prediction. As the demand for accurate video prediction grows, the limitations of traditional deterministic models have become clear, particularly their difficulty in maintaining long-term temporal coherence while preserving high-frequency spatial detail. This report provides an exhaustive analysis of the Multi-Attention Unit Cell (MAUCell), a novel architectural framework that represents a significant leap forward in video frame prediction. By synergizing Generative Adversarial Networks (GANs) with a hierarchical "STAR-GAN" processing strategy and a triad of specialized attention mechanisms ...