[2603.05228] The Geometric Inductive Bias of Grokking: Bypassing Phase Transitions via Architectural Topology
Computer Science > Machine Learning
arXiv:2603.05228 (cs) [Submitted on 5 Mar 2026]
Title: The Geometric Inductive Bias of Grokking: Bypassing Phase Transitions via Architectural Topology
Authors: Alper Yıldırım
Abstract: Mechanistic interpretability typically relies on post-hoc analysis of trained networks. We instead adopt an interventional approach: testing hypotheses a priori by modifying architectural topology and observing the resulting training dynamics. We study grokking - delayed generalization in Transformers trained on cyclic modular addition (Zp) - and investigate whether specific architectural degrees of freedom prolong the memorization phase. We identify two independent structural factors in standard Transformers: unbounded representational magnitude and data-dependent attention routing. First, we introduce a fully bounded spherical topology that enforces L2 normalization throughout the residual stream and uses an unembedding matrix with a fixed temperature scale. This removes magnitude-based degrees of freedom and reduces grokking onset time by over 20x without weight decay. Second, a Uniform Attention Ablation overrides data-dependent query-key routing with a uniform distribution, reducing the attention layer to a Continuous Bag-of-Words (CBOW) aggregator. Despite removing adaptive routing, these models achieve 1...
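The two interventions described in the abstract can be sketched in a few lines. The snippet below is a minimal NumPy illustration, not the paper's implementation: all dimensions and names (`seq_len`, `d_model`, `tau`, `uniform_attention`) are hypothetical, and it shows only the core ideas - projecting the residual stream onto the unit sphere, scoring against a unit-norm unembedding at a fixed temperature, and replacing query-key attention weights with a uniform distribution.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Project vectors onto the unit sphere (the 'spherical topology')."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def uniform_attention(v):
    """Uniform Attention Ablation (sketch): ignore query-key scores and
    average the value vectors, reducing attention to a CBOW aggregator."""
    seq_len = v.shape[0]
    weights = np.full((seq_len, seq_len), 1.0 / seq_len)  # fixed, data-independent
    return weights @ v

# toy dimensions (hypothetical, not taken from the paper)
seq_len, d_model, vocab, tau = 3, 16, 97, 10.0
rng = np.random.default_rng(0)

x = l2_normalize(rng.standard_normal((seq_len, d_model)))  # bounded residual stream
x = l2_normalize(x + uniform_attention(x))                 # re-normalize after each update
W_U = l2_normalize(rng.standard_normal((vocab, d_model)))  # unit-norm unembedding rows
logits = tau * x @ W_U.T  # cosine similarities at a fixed temperature scale
```

Because both the residual stream and the unembedding rows lie on the unit sphere, every logit is a cosine similarity scaled by `tau`, so representational magnitude can no longer grow without bound - the degree of freedom the abstract argues prolongs memorization.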