[2603.19348] Anatomical Heterogeneity in Transformer Language Models
Computer Science > Machine Learning
arXiv:2603.19348 (cs) [Submitted on 19 Mar 2026]

Title: Anatomical Heterogeneity in Transformer Language Models
Authors: Tomasz Wietrzykowski

Abstract: Current transformer language models are trained with uniform computational budgets across all layers, implicitly assuming layer homogeneity. We challenge this assumption through empirical analysis of SmolLM2-135M, a 30-layer, 135M-parameter causal language model, using five diagnostic metrics: weight predictability (R²), ablation degradation, recovery speed, weight manipulation robustness, and structural analysis. We find profound anatomical heterogeneity: (1) Layer weights follow strong mathematical regularity (R² = 0.91) with a universal oscillatory delta pattern (correlation ≈ -0.50), yet predicted weights cause catastrophic failure due to nonlinear error accumulation. (2) Layer importance spans a 10^7 range, from a critical core (L8-11, up to +63,419% PPL degradation) to anti-layers (L14, L17) whose removal improves performance. (3) Recovery speed correlates with layer importance, indicating differential training requirements. (4) Only weight scaling (α = 0.9) preserves model quality among five tested manipulation strategies. (5) Growth Transformer Training, allocating budget by layer importance, achieves ~54% cost reduction. A proof-of-concept...
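The layer-ablation metric in point (2) can be sketched as a percent change in perplexity relative to the intact model, with removal that *lowers* perplexity marking an anti-layer. This is a minimal illustration, not the paper's code; the perplexity values below are hypothetical placeholders, not measurements from SmolLM2-135M.

```python
# Hedged sketch of the ablation-degradation metric described in the abstract.
# All PPL values here are hypothetical; the paper's actual measurements differ.

def ppl_degradation(base_ppl: float, ablated_ppl: float) -> float:
    """Percent change in perplexity after ablating a single layer.

    Positive -> the layer was load-bearing (its removal hurts);
    negative -> an "anti-layer" whose removal improves performance.
    """
    return (ablated_ppl - base_ppl) / base_ppl * 100.0

base = 20.0  # hypothetical baseline perplexity of the intact model
ablated = {8: 12704.0, 14: 19.4}  # hypothetical per-layer ablated PPLs

for layer, ppl in ablated.items():
    d = ppl_degradation(base, ppl)
    label = "anti-layer (removal helps)" if d < 0 else "critical"
    print(f"L{layer}: {d:+.1f}% PPL change -> {label}")
```

Under this convention, the abstract's "+63,419% PPL degradation" for the critical core and the improvement from removing L14/L17 are both readable off the sign and magnitude of the same statistic.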