[2412.14294] TRecViT: A Recurrent Video Transformer
Summary
TRecViT introduces a recurrent video transformer architecture for causal video modeling that matches or outperforms the non-causal ViViT-L model on large-scale video benchmarks while using substantially fewer parameters, less memory, and fewer FLOPs.
Why It Matters
This research is significant as it presents a new approach to video modeling that balances performance and efficiency, addressing the growing demand for real-time video processing in various applications such as surveillance, autonomous driving, and content creation.
Key Takeaways
- TRecViT matches or outperforms the (non-causal) ViViT-L model on large-scale video datasets (SSv2, Kinetics400) with 3× fewer parameters, a 12× smaller memory footprint, and 5× lower FLOPs.
- The model uses a time-space-channel factorization with a dedicated block per dimension: gated linear recurrent units (LRUs) mix information over time, self-attention layers mix over space, and MLPs mix over channels.
- It runs in real time, with an inference throughput of roughly 300 frames per second, making it suitable for practical applications.
Computer Science > Computer Vision and Pattern Recognition
arXiv:2412.14294 (cs)
[Submitted on 18 Dec 2024 (v1), last revised 15 Feb 2026 (this version, v2)]
Title: TRecViT: A Recurrent Video Transformer
Authors: Viorica Pătrăucean, Xu Owen He, Joseph Heyward, Chuhan Zhang, Mehdi S. M. Sajjadi, George-Cristian Muraru, Artem Zholus, Mahdi Karami, Ross Goroshin, Yutian Chen, Simon Osindero, João Carreira, Razvan Pascanu
Abstract: We propose a novel block for causal video modelling. It relies on a time-space-channel factorisation with dedicated blocks for each dimension: gated linear recurrent units (LRUs) perform information mixing over time, self-attention layers perform mixing over space, and MLPs over channels. The resulting architecture, TRecViT, is causal and shows strong performance on sparse and dense tasks, trained in supervised or self-supervised regimes, being the first causal video model in the state-space models family. Notably, our model outperforms or is on par with the popular (non-causal) ViViT-L model on large-scale video datasets (SSv2, Kinetics400), while having 3× fewer parameters, a 12× smaller memory footprint, and 5× lower FLOPs count than the full self-attention ViViT, with an inference throughput of about 300 frames per second, running comfortably in real-time. When compared ...
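As a rough illustration of how the time-space-channel factorisation routes information, the sketch below reduces each stage to its simplest form. This is not the paper's implementation: the diagonal recurrence standing in for the gated LRU, the single-head attention, the fixed `decay` value, and all weight shapes are placeholder assumptions, chosen only to show which axis each sub-block mixes over.

```python
# Minimal NumPy sketch of a TRecViT-style time-space-channel
# factorised block. Tensors are (T frames, N spatial tokens, D channels).
# Normalisation layers, gating, and multi-head attention are omitted.
import numpy as np

rng = np.random.default_rng(0)

def lru_time_mixing(x, decay=0.9):
    """Causal diagonal linear recurrence over time (LRU stand-in):
    h_t = decay * h_{t-1} + (1 - decay) * x_t, per token and channel."""
    h = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = decay * h + (1.0 - decay) * x[t]
        out[t] = h
    return out

def self_attention_space(x, Wq, Wk, Wv):
    """Single-head softmax attention over the N spatial tokens,
    applied independently to every frame."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v

def mlp_channels(x, W1, W2):
    """Two-layer MLP over the channel dimension (ReLU in place of GELU)."""
    return np.maximum(x @ W1, 0.0) @ W2

def trecvit_block(x, params):
    # Residual connection around each stage, as in a standard
    # transformer block; only the time stage looks across frames,
    # so the whole block stays causal in time.
    x = x + lru_time_mixing(x)
    x = x + self_attention_space(x, *params["attn"])
    x = x + mlp_channels(x, *params["mlp"])
    return x

T, N, D = 8, 16, 32  # frames, spatial tokens, channels (arbitrary)
params = {
    "attn": [rng.normal(scale=D ** -0.5, size=(D, D)) for _ in range(3)],
    "mlp": [rng.normal(scale=D ** -0.5, size=(D, 4 * D)),
            rng.normal(scale=(4 * D) ** -0.5, size=(4 * D, D))],
}
x = rng.normal(size=(T, N, D))
y = trecvit_block(x, params)  # same shape as x
```

Because only the recurrence crosses frame boundaries, perturbing a later frame leaves the outputs for all earlier frames unchanged, which is the causality property the paper emphasises.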