[2602.16746] Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking
Summary
This article presents a geometric analysis of optimization dynamics in transformers, focusing on the phenomenon of grokking, where models transition from memorization to generalization in small algorithmic tasks.
Why It Matters
Understanding grokking is crucial for improving machine learning models, particularly in optimizing their training processes. The insights into low-dimensional dynamics and curvature can inform better strategies for model generalization, which is vital for real-world applications in AI.
Key Takeaways
- Grokking involves a delayed transition from memorization to generalization in transformers.
- Training evolves predominantly within a low-dimensional execution subspace: a single principal component captures 68-83% of trajectory variance.
- Curvature in the loss landscape grows sharply in directions orthogonal to the execution subspace, even as the trajectory itself remains largely confined to that subspace.
- Causal interventions show that motion along the learned subspace is necessary for grokking, while increasing curvature alone does not suffice.
- These findings replicate across various learning rates and hyperparameter settings, suggesting robustness in the observed dynamics.
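The trajectory-PCA takeaway above can be illustrated with a minimal sketch. The paper's exact pipeline is not specified here; this assumes attention weights are flattened into one vector per checkpoint and analyzed with an SVD-based PCA, with a toy trajectory standing in for real training data:

```python
import numpy as np

def trajectory_pca(weight_snapshots):
    """PCA over a training trajectory of flattened weight snapshots.

    weight_snapshots: array of shape (T, D), one flattened attention
    weight vector per checkpoint (hypothetical layout, for illustration).
    Returns (explained_ratio, components): the fraction of trajectory
    variance captured by each principal component, and the components.
    """
    W = np.asarray(weight_snapshots, dtype=float)
    W = W - W.mean(axis=0)                    # center the trajectory
    # SVD of the centered trajectory yields the principal directions
    _, S, Vt = np.linalg.svd(W, full_matrices=False)
    explained_ratio = S**2 / np.sum(S**2)     # variance per component
    return explained_ratio, Vt

# Toy trajectory that mostly moves along a single direction,
# mimicking a dominant first principal component
rng = np.random.default_rng(0)
direction = rng.normal(size=50)
t = np.linspace(0.0, 1.0, 200)[:, None]
trajectory = t * direction + 0.01 * rng.normal(size=(200, 50))

ratio, components = trajectory_pca(trajectory)
print(ratio[0] > 0.9)  # → True: PC1 dominates the trajectory variance
```

A finding like "PC1 captures 68-83% of variance" would correspond to `ratio[0]` landing in that range when `weight_snapshots` holds the real checkpoints.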
Computer Science > Machine Learning
arXiv:2602.16746 (cs)
[Submitted on 18 Feb 2026]
Title: Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking
Authors: Yongzhong Xu

Abstract: Grokking -- the delayed transition from memorization to generalization in small algorithmic tasks -- remains poorly understood. We present a geometric analysis of optimization dynamics in transformers trained on modular arithmetic. PCA of attention weight trajectories reveals that training evolves predominantly within a low-dimensional execution subspace, with a single principal component capturing 68-83% of trajectory variance. To probe loss-landscape geometry, we measure commutator defects -- the non-commutativity of successive gradient steps -- and project them onto this learned subspace. We find that curvature grows sharply in directions orthogonal to the execution subspace while the trajectory remains largely confined to it. Importantly, curvature growth consistently precedes generalization across learning rates and hyperparameter regimes, with the lead time obeying a power law in the grokking timescale. Causal intervention experiments show that motion along the learned subspace is necessary for grokking, while artificially increasing curvature is insufficient. Together, these results support a geometric account in which grokkin...
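The commutator-defect idea in the abstract -- measuring how far two successive gradient steps are from commuting -- can be sketched as follows. The paper's exact estimator is not given here; this toy version applies two gradient steps in both orders and takes the difference, which on a quadratic loss is, to leading order, eta^2 (BA - AB) theta, so it vanishes when the two local Hessians commute and grows with curvature mismatch:

```python
import numpy as np

def commutator_defect(grad_a, grad_b, theta, eta=0.1):
    """Non-commutativity of two gradient steps (illustrative sketch).

    Applies step a then b, and step b then a, from the same point
    theta; returns the difference of the two endpoints, which is
    zero exactly when the steps commute.
    """
    ab = theta - eta * grad_a(theta)
    ab = ab - eta * grad_b(ab)          # a first, then b
    ba = theta - eta * grad_b(theta)
    ba = ba - eta * grad_a(ba)          # b first, then a
    return ab - ba

# Quadratic toy losses with non-commuting Hessians A and B
A = np.diag([1.0, 4.0])
B = np.array([[2.0, 1.0],
              [1.0, 2.0]])
theta = np.array([1.0, 1.0])

defect = commutator_defect(lambda x: A @ x, lambda x: B @ x, theta)
print(np.linalg.norm(defect) > 0)  # → True: the steps do not commute
```

Projecting such a defect vector onto the PCA subspace (versus its orthogonal complement) is how one would separate curvature along the execution subspace from the transverse curvature the paper emphasizes.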