[2512.19941] Block-Recurrent Dynamics in Vision Transformers

arXiv - Machine Learning · 4 min read

Summary

This paper introduces the Block-Recurrent Hypothesis (BRH) for Vision Transformers, a framework that interprets their depth-wise computation as a small set of distinct blocks applied recurrently rather than a long stack of unrelated layers.

Why It Matters

As Vision Transformers become increasingly prevalent in computer vision tasks, understanding their internal mechanisms is crucial for improving model efficiency and interpretability. The BRH offers a novel perspective that could enhance the design and application of these models, making it relevant for researchers and practitioners in AI and machine learning.

Key Takeaways

  • The Block-Recurrent Hypothesis proposes that the computation of a trained ViT's L blocks can be accurately rewritten using only k ≪ L distinct blocks applied recurrently.
  • Across diverse ViTs, between-layer representational similarity matrices reveal a small number of contiguous depth phases, suggesting genuinely reusable computation.
  • The study introduces Raptor (Recurrent Approximations to Phase-structured TransfORmers), block-recurrent surrogates of pretrained ViTs that reach high accuracy with far fewer distinct blocks; a minimal sketch of this weight-tying idea follows this list.
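The recurrent reuse described above can be made concrete with a short, hypothetical PyTorch sketch. This is not the authors' Raptor implementation: the block design, the choice of k = 3 shared blocks, and 4 repeats per block are illustrative assumptions meant only to show what "k ≪ L distinct blocks applied recurrently" means in code.

```python
# Hypothetical sketch of block recurrence (not the paper's Raptor code):
# a standard ViT applies L distinct transformer blocks, while a
# block-recurrent surrogate reuses k << L blocks, each applied
# repeatedly within its "phase" of the depth.
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class BlockRecurrentEncoder(nn.Module):
    """k distinct blocks, each unrolled `repeats` times (k * repeats ~ L)."""

    def __init__(self, dim: int, k: int = 3, repeats: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList([TransformerBlock(dim) for _ in range(k)])
        self.repeats = repeats

    def forward(self, tokens):
        for block in self.blocks:            # one loop per depth "phase"
            for _ in range(self.repeats):    # recurrent reuse of the same weights
                tokens = block(tokens)
        return tokens


# Usage: 3 shared blocks * 4 repeats mimic the depth of a 12-block ViT
# while storing parameters for only 3 distinct blocks.
x = torch.randn(2, 197, 384)                 # (batch, tokens incl. CLS, dim)
y = BlockRecurrentEncoder(dim=384)(x)
print(y.shape)                               # torch.Size([2, 197, 384])
```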

Computer Science > Computer Vision and Pattern Recognition
arXiv:2512.19941 (cs) [Submitted on 23 Dec 2025 (v1), last revised 19 Feb 2026 (this version, v5)]
Title: Block-Recurrent Dynamics in Vision Transformers
Authors: Mozes Jacobs, Thomas Fel, Richard Hakim, Alessandra Brondetta, Demba Ba, T. Andy Keller

Abstract: As Vision Transformers (ViTs) become standard vision backbones, a mechanistic account of their computational phenomenology is essential. Despite architectural cues that hint at dynamical structure, there is no settled framework that interprets Transformer depth as a well-characterized flow. In this work, we introduce the Block-Recurrent Hypothesis (BRH), arguing that trained ViTs admit a block-recurrent depth structure such that the computation of the original $L$ blocks can be accurately rewritten using only $k \ll L$ distinct blocks applied recurrently. Across diverse ViTs, between-layer representational similarity matrices suggest few contiguous phases. To determine whether these phases reflect genuinely reusable computation, we train block-recurrent surrogates of pretrained ViTs: Recurrent Approximations to Phase-structured TransfORmers (Raptor). At small scale, we demonstrate that stochastic depth and training promote recurrent structure and subsequently correlate with our ability to accurately fit Raptor. We then ...
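The abstract refers to between-layer representational similarity matrices without fixing a particular measure in the excerpt above. The sketch below assumes linear CKA, one common choice for comparing layer activations; the function names and the stand-in random activations are purely illustrative and not taken from the paper.

```python
# Hedged sketch: build a layer-by-layer similarity matrix from per-block
# activations, assuming linear CKA as the similarity measure.
import torch


def linear_cka(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Linear CKA between two (n_samples, features) activation matrices."""
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    hsic = (y.T @ x).norm() ** 2
    return hsic / ((x.T @ x).norm() * (y.T @ y).norm())


def layer_similarity_matrix(activations: list[torch.Tensor]) -> torch.Tensor:
    """activations[i]: flattened token features after block i, shape (n, d)."""
    L = len(activations)
    sim = torch.zeros(L, L)
    for i in range(L):
        for j in range(L):
            sim[i, j] = linear_cka(activations[i], activations[j])
    return sim


# A few contiguous blocks of high similarity along the diagonal of `sim`
# would be the kind of phase structure the BRH points to.
acts = [torch.randn(512, 384) for _ in range(12)]    # stand-in activations
print(layer_similarity_matrix(acts).shape)            # torch.Size([12, 12])
```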
