[2512.19941] Block-Recurrent Dynamics in Vision Transformers
Summary
This article introduces the Block-Recurrent Hypothesis (BRH) for Vision Transformers, proposing that the depth of a trained model can be understood as a small set of distinct blocks applied recurrently, rather than as a long sequence of unrelated layers.
Why It Matters
As Vision Transformers become increasingly prevalent in computer vision tasks, understanding their internal mechanisms is crucial for improving model efficiency and interpretability. The BRH offers a novel perspective that could enhance the design and application of these models, making it relevant for researchers and practitioners in AI and machine learning.
Key Takeaways
- The Block-Recurrent Hypothesis suggests that Vision Transformers can be understood through a block-recurrent structure.
- Empirical evidence shows that this structure can substantially reduce the number of distinct blocks needed at depth while maintaining performance.
- The study introduces a new model, Raptor, which demonstrates the effectiveness of the BRH in achieving high accuracy with fewer blocks.
Computer Science > Computer Vision and Pattern Recognition
arXiv:2512.19941 (cs)
[Submitted on 23 Dec 2025 (v1), last revised 19 Feb 2026 (this version, v5)]
Title: Block-Recurrent Dynamics in Vision Transformers
Authors: Mozes Jacobs, Thomas Fel, Richard Hakim, Alessandra Brondetta, Demba Ba, T. Andy Keller
Abstract: As Vision Transformers (ViTs) become standard vision backbones, a mechanistic account of their computational phenomenology is essential. Despite architectural cues that hint at dynamical structure, there is no settled framework that interprets Transformer depth as a well-characterized flow. In this work, we introduce the Block-Recurrent Hypothesis (BRH), arguing that trained ViTs admit a block-recurrent depth structure such that the computation of the original $L$ blocks can be accurately rewritten using only $k \ll L$ distinct blocks applied recurrently. Across diverse ViTs, between-layer representational similarity matrices suggest few contiguous phases. To determine whether these phases reflect genuinely reusable computation, we train block-recurrent surrogates of pretrained ViTs: Recurrent Approximations to Phase-structured TransfORmers (Raptor). In small-scale, we demonstrate that stochastic depth and training promote recurrent structure and subsequently correlate with our ability to accurately fit Raptor. We then ...
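To make the hypothesis concrete, the following is a minimal sketch of the block-recurrent idea: a network of depth $L$ is executed using only $k \ll L$ distinct blocks, with each depth position assigned to one block by a schedule of contiguous phases. The block internals here (simple residual tanh maps) are hypothetical stand-ins for illustration, not the paper's actual ViT blocks or the Raptor training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
L, k, d = 12, 3, 8  # original depth, number of distinct blocks, feature dim

# k distinct "blocks"; each is a residual nonlinear map standing in for a
# full Transformer block (attention + MLP) in this toy example.
weights = [0.1 * rng.standard_normal((d, d)) for _ in range(k)]

def block(x, W):
    # Residual update, mirroring the skip-connection form of a ViT block.
    return x + np.tanh(x @ W)

def block_recurrent_forward(x, schedule):
    # schedule maps each of the L depth positions to one of the k blocks,
    # e.g. contiguous phases [0,0,0,0, 1,1,1,1, 2,2,2,2]: each block is
    # reused (applied recurrently) within its phase.
    for i in schedule:
        x = block(x, weights[i])
    return x

# L depth positions partitioned into k contiguous phases.
schedule = [i * k // L for i in range(L)]
x = rng.standard_normal(d)
y = block_recurrent_forward(x, schedule)
print(schedule)   # L entries drawn from only k distinct block indices
print(y.shape)
```

The point of the sketch is the schedule: the forward pass still takes $L$ steps, but the depth-wise parameter count scales with $k$, which is what a Raptor-style surrogate would be fit to match against the original network's activations.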