[2412.14294] TRecViT: A Recurrent Video Transformer

Summary

TRecViT introduces a recurrent video transformer block for causal video modeling that matches or outperforms the much larger non-causal ViViT-L while being substantially more efficient.

Why It Matters

This work is significant because it offers a video modeling approach that balances performance and efficiency, addressing the demand for real-time video processing in applications such as surveillance, autonomous driving, and content creation.

Key Takeaways

  • TRecViT outperforms or matches the non-causal ViViT-L on large-scale video datasets (SSv2, Kinetics400) with 3× fewer parameters, a 12× smaller memory footprint, and 5× lower FLOPs.
  • The model uses a time-space-channel factorization: gated linear recurrent units (LRUs) mix information over time, self-attention layers mix over space, and MLPs mix over channels (see the sketch after this list).
  • It runs comfortably in real time, with an inference throughput of about 300 frames per second, making it suitable for practical applications.
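
To make the factorization concrete, here is a minimal PyTorch sketch of one such block, written from the abstract's description alone. The module names, dimensions, and the simplified (ungated) linear recurrence are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TimeLRU(nn.Module):
    """Illustrative, simplified linear recurrent unit: mixes information over
    the time axis only, independently for every spatial token.
    h_t = a * h_{t-1} + b * x_t, with a learned per-channel decay a in (0, 1)."""
    def __init__(self, dim):
        super().__init__()
        self.log_a = nn.Parameter(torch.full((dim,), -1.0))  # decay logits
        self.proj_in = nn.Linear(dim, dim)
        self.proj_out = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (B, T, N, D)
        a = torch.sigmoid(self.log_a)          # per-channel decay in (0, 1)
        u = self.proj_in(x) * (1 - a)          # input scaling keeps the state bounded
        h = torch.zeros_like(u[:, 0])          # (B, N, D) recurrent state
        outs = []
        for t in range(u.shape[1]):            # causal scan over time
            h = a * h + u[:, t]
            outs.append(h)
        return self.proj_out(torch.stack(outs, dim=1))

class TRecViTBlock(nn.Module):
    """One time-space-channel factorised block, per the abstract:
    LRU over time, self-attention over space (within a frame), MLP over channels."""
    def __init__(self, dim=256, heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        self.time_mix = TimeLRU(dim)
        self.space_mix = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.channel_mix = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim))

    def forward(self, x):                      # x: (B, T, N, D) patch tokens
        B, T, N, D = x.shape
        x = x + self.time_mix(self.norm1(x))   # causal mixing over time
        y = self.norm2(x).reshape(B * T, N, D) # attention within each frame only
        y, _ = self.space_mix(y, y, y, need_weights=False)
        x = x + y.reshape(B, T, N, D)
        return x + self.channel_mix(self.norm3(x))

tokens = torch.randn(2, 8, 196, 256)           # (batch, frames, patches, dim)
print(TRecViTBlock()(tokens).shape)            # torch.Size([2, 8, 196, 256])
```

Note that only the recurrence crosses the time axis; attention and the MLP act within a single frame, which is what keeps the whole block causal.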

Computer Science > Computer Vision and Pattern Recognition

arXiv:2412.14294 (cs) [Submitted on 18 Dec 2024 (v1), last revised 15 Feb 2026 (this version, v2)]

Title: TRecViT: A Recurrent Video Transformer

Authors: Viorica Pătrăucean, Xu Owen He, Joseph Heyward, Chuhan Zhang, Mehdi S. M. Sajjadi, George-Cristian Muraru, Artem Zholus, Mahdi Karami, Ross Goroshin, Yutian Chen, Simon Osindero, João Carreira, Razvan Pascanu

Abstract: We propose a novel block for causal video modelling. It relies on a time-space-channel factorisation with dedicated blocks for each dimension: gated linear recurrent units (LRUs) perform information mixing over time, self-attention layers perform mixing over space, and MLPs over channels. The resulting architecture, TRecViT, is causal and shows strong performance on sparse and dense tasks, trained in supervised or self-supervised regimes, being the first causal video model in the state-space models family. Notably, our model outperforms or is on par with the popular (non-causal) ViViT-L model on large-scale video datasets (SSv2, Kinetics400), while having 3× fewer parameters, a 12× smaller memory footprint, and a 5× lower FLOPs count than the full self-attention ViViT, with an inference throughput of about 300 frames per second, running comfortably in real time. When compared ...
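
The efficiency claims follow from the recurrent time mixing: each token carries a fixed-size state that is updated one frame at a time, so inference memory does not grow with video length, unlike full spatio-temporal self-attention. Below is a minimal NumPy sketch of one step of a gated linear recurrence in the spirit of the LRU family; the gating form, the sqrt(1 - a^2) normalisation, and all names here are illustrative assumptions, not the paper's exact parameterisation.

```python
import numpy as np

def gated_lru_step(h, x, W_a, W_x):
    """One time step of a gated linear recurrence (illustrative form):
        a_t = sigmoid(W_a @ x_t)                          # input-dependent decay gate
        h_t = a_t * h_{t-1} + sqrt(1 - a_t**2) * (W_x @ x_t)
    The sqrt(1 - a^2) factor keeps the state magnitude stable over long videos."""
    a = 1.0 / (1.0 + np.exp(-(W_a @ x)))
    return a * h + np.sqrt(1.0 - a**2) * (W_x @ x)

rng = np.random.default_rng(0)
D = 16
W_a = rng.normal(size=(D, D)) / np.sqrt(D)
W_x = rng.normal(size=(D, D)) / np.sqrt(D)
h = np.zeros(D)
for t in range(300):                      # stream 300 frames' worth of features
    x_t = rng.normal(size=D)
    h = gated_lru_step(h, x_t, W_a, W_x)  # state size is constant in t
print(h.shape)                            # (16,): memory does not grow with video length
```

Running one such update per incoming frame is what makes streaming inference at a few hundred frames per second plausible: the cost of processing a new frame is constant rather than growing with the number of frames already seen.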

