[2602.22600] Transformers converge to invariant algorithmic cores
Summary
The paper shows that independently trained transformers, despite learning different weights, converge to the same invariant algorithmic cores: compact subspaces that are necessary and sufficient for task performance, revealing structure in their internal workings.
Why It Matters
Understanding the internal structures of transformer-based large language models is crucial for advancing AI interpretability and improving model design. This research identifies algorithmic cores that persist across different training runs, offering a pathway to more reliable mechanistic interpretability.
Key Takeaways
- Transformers exhibit invariant algorithmic cores despite different weight configurations.
- Identifying these cores can enhance mechanistic interpretability of AI models.
- The study reveals low-dimensional invariants that persist across training runs and scales.
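To make the first takeaway concrete, here is a minimal numpy sketch (an illustration, not the paper's extraction procedure) of how two models can have entirely different weight matrices yet span the identical low-dimensional core. Principal angles between the spanned subspaces detect the match; all names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 3                              # ambient dim, core dim
core = rng.standard_normal((d, k))        # shared 3-D "algorithmic core"

# Two "training runs": each expresses the same subspace in its own basis,
# so the raw weight matrices look completely different element-wise.
run_a = core @ rng.standard_normal((k, k))
run_b = core @ rng.standard_normal((k, k))

def principal_angle_cosines(A, B):
    """Cosines of the principal angles between the column spans of A and B."""
    Qa, _ = np.linalg.qr(A)               # orthonormal basis for span(A)
    Qb, _ = np.linalg.qr(B)               # orthonormal basis for span(B)
    return np.linalg.svd(Qa.T @ Qb, compute_uv=False)

print(np.allclose(principal_angle_cosines(run_a, run_b), 1.0))  # True: same span
```

All cosines equal 1 because the two runs differ only by an invertible change of basis within the core; against an unrelated random subspace they would fall well below 1.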
Computer Science > Machine Learning
arXiv:2602.22600 (cs)
[Submitted on 26 Feb 2026]
Title: Transformers converge to invariant algorithmic cores
Authors: Joshua S. Schiffman
Abstract: Large language models exhibit sophisticated capabilities, yet understanding how they work internally remains a central challenge. A fundamental obstacle is that training selects for behavior, not circuitry, so many weight configurations can implement the same function. Which internal structures reflect the computation, and which are accidents of a particular training run? This work extracts algorithmic cores: compact subspaces necessary and sufficient for task performance. Independently trained transformers learn different weights but converge to the same cores. Markov-chain transformers embed 3D cores in nearly orthogonal subspaces yet recover identical transition spectra. Modular-addition transformers discover compact cyclic operators at grokking that later inflate, yielding a predictive model of the memorization-to-generalization transition. GPT-2 language models govern subject-verb agreement through a single axis that, when flipped, inverts grammatical number throughout generation across scales. These results reveal low-dimensional invariants that persist across training runs and scales, suggesting that transformer computations are organized around compact,...
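The abstract's claim that Markov-chain transformers "recover identical transition spectra" rests on a basis-invariance fact: the eigenvalues of a transition matrix are unchanged by any invertible change of coordinates. A short numpy sketch of that fact (illustrative only; the matrices and names are hypothetical, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# A 3-state Markov transition matrix (rows sum to 1).
T = rng.random((3, 3))
T /= T.sum(axis=1, keepdims=True)

# Two models may represent the same chain in different internal coordinates;
# similar matrices P T P^{-1} share T's spectrum exactly.
P = rng.standard_normal((3, 3))
T_other_basis = P @ T @ np.linalg.inv(P)

eig_a = np.sort_complex(np.linalg.eigvals(T))
eig_b = np.sort_complex(np.linalg.eigvals(T_other_basis))
print(np.allclose(eig_a, eig_b))   # True: spectra agree across bases
```

This is why comparing spectra, rather than raw weights, can expose a shared computation across runs whose parameters look unrelated.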