[2602.10496] Low-Dimensional Execution Manifolds in Transformer Learning Dynamics: Evidence from Modular Arithmetic Tasks
Summary
This paper explores the geometric structure of learning dynamics in transformer models, revealing that training trajectories collapse onto low-dimensional execution manifolds, with implications for interpretability and training strategies.
Why It Matters
Understanding the low-dimensional execution manifolds in transformer learning dynamics provides insights into how these models operate in high-dimensional spaces. This has implications for improving model interpretability, optimizing training processes, and leveraging overparameterization effectively in neural networks.
Key Takeaways
- Transformer training trajectories collapse onto low-dimensional manifolds of dimension 3–4.
- Sharp attention concentration emerges from saturation along routing coordinates within these manifolds.
- SGD commutators align preferentially with the execution subspace early in training.
- Sparse autoencoders capture auxiliary routing structures but do not isolate execution.
- The findings suggest a geometric framework for understanding transformer learning dynamics.
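The dimensional collapse in the first takeaway can be probed with a standard technique: apply PCA to flattened parameter snapshots collected along a training run and count how many components are needed to explain most of the variance. The sketch below is illustrative, not the paper's method; the function name, the variance threshold, and the synthetic rank-3 trajectory are all assumptions.

```python
import numpy as np

def effective_dimension(snapshots, var_threshold=0.95):
    """Estimate the effective dimension of a training trajectory.

    snapshots: (T, d) array of flattened parameter vectors, one row
    per training step. Returns the number of principal components
    needed to explain `var_threshold` of the total variance.
    """
    centered = snapshots - snapshots.mean(axis=0)
    # Singular values of the centered trajectory give PCA variances.
    s = np.linalg.svd(centered, compute_uv=False)
    var = s**2 / np.sum(s**2)
    return int(np.searchsorted(np.cumsum(var), var_threshold) + 1)

# Synthetic check: a trajectory confined to a 3-D subspace of d=128,
# mirroring the reported collapse onto 3-4 dimensions.
rng = np.random.default_rng(0)
basis = np.linalg.qr(rng.normal(size=(128, 3)))[0]  # orthonormal 3-D basis
coords = rng.normal(size=(500, 3))                  # 500 "training steps"
trajectory = coords @ basis.T                       # shape (500, 128)
print(effective_dimension(trajectory))
```

On the synthetic rank-3 trajectory this reports 3, matching the subspace it was built in; on real checkpoints one would substitute the actual parameter snapshots.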
Computer Science > Machine Learning
arXiv:2602.10496 (cs)
[Submitted on 11 Feb 2026 (v1), last revised 13 Feb 2026 (this version, v2)]
Title: Low-Dimensional Execution Manifolds in Transformer Learning Dynamics: Evidence from Modular Arithmetic Tasks
Authors: Yongzhong Xu
Abstract: We investigate the geometric structure of learning dynamics in overparameterized transformer models through carefully controlled modular arithmetic tasks. Our primary finding is that despite operating in high-dimensional parameter spaces ($d=128$), transformer training trajectories rapidly collapse onto low-dimensional execution manifolds of dimension $3$--$4$. This dimensional collapse is robust across random seeds and moderate task difficulties, though the orientation of the manifold in parameter space varies between runs. We demonstrate that this geometric structure underlies several empirically observed phenomena: (1) sharp attention concentration emerges as saturation along routing coordinates within the execution manifold, (2) SGD commutators are preferentially aligned with the execution subspace (up to $10\times$ random baseline) early in training, with $>92\%$ of non-commutativity confined to orthogonal staging directions and this alignment decreasing as training converges, and (3) sparse autoencoders captur...
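The abstract describes "carefully controlled modular arithmetic tasks" as the training setting. A minimal sketch of such a dataset, assuming the common modular-addition formulation (the paper's exact task format may differ):

```python
import itertools

def modular_addition_dataset(p):
    """Enumerate all (a, b, (a + b) mod p) triples for addition mod p.

    This is the standard fully-enumerable task family used to study
    learning dynamics under controlled conditions: the input space is
    finite, so train/test splits and difficulty are easy to control.
    """
    return [(a, b, (a + b) % p) for a, b in itertools.product(range(p), repeat=2)]

data = modular_addition_dataset(7)
print(len(data))   # 49 examples: every ordered pair (a, b) mod 7
print(data[:3])    # [(0, 0, 0), (0, 1, 1), (0, 2, 2)]
```

Because the full input space has only $p^2$ examples, task difficulty can be varied simply by changing $p$ or the train/test split fraction.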