[2602.16746] Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking
Summary
This article presents a geometric analysis of optimization dynamics in transformers, focusing on the phenomenon of grokking, where models transition from memorization to generalization in small algorithmic tasks.
Why It Matters
Understanding grokking is crucial for improving machine learning models, particularly in optimizing their training processes. The insights into low-dimensional dynamics and curvature can inform better strategies for model generalization, which is vital for real-world applications in AI.
Key Takeaways
- Grokking involves a delayed transition from memorization to generalization in transformers.
- Training evolves predominantly within a low-dimensional execution subspace: a single principal component captures 68-83% of trajectory variance.
- Curvature in the loss landscape grows sharply in directions orthogonal to the execution subspace, even as the trajectory itself remains largely confined to that subspace.
- Causal interventions show that motion along the learned subspace is necessary for grokking, while increasing curvature alone does not suffice.
- These findings replicate across various learning rates and hyperparameter settings, suggesting robustness in the observed dynamics.
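The trajectory-PCA takeaway above can be illustrated with a minimal sketch. The paper's exact pipeline is not specified here; this assumes attention weights are flattened into one vector per checkpoint and analyzed with an SVD-based PCA, with a toy trajectory standing in for real training data:

```python
import numpy as np

def trajectory_pca(weight_snapshots):
    """PCA over a training trajectory of flattened weight snapshots.

    weight_snapshots: array of shape (T, D), one flattened attention
    weight vector per checkpoint (hypothetical layout, for illustration).
    Returns (explained_ratio, components): the fraction of trajectory
    variance captured by each principal component, and the components.
    """
    W = np.asarray(weight_snapshots, dtype=float)
    W = W - W.mean(axis=0)                    # center the trajectory
    # SVD of the centered trajectory yields the principal directions
    _, S, Vt = np.linalg.svd(W, full_matrices=False)
    explained_ratio = S**2 / np.sum(S**2)     # variance per component
    return explained_ratio, Vt

# Toy trajectory that mostly moves along a single direction,
# mimicking a dominant first principal component
rng = np.random.default_rng(0)
direction = rng.normal(size=50)
t = np.linspace(0.0, 1.0, 200)[:, None]
trajectory = t * direction + 0.01 * rng.normal(size=(200, 50))

ratio, components = trajectory_pca(trajectory)
print(ratio[0] > 0.9)  # → True: PC1 dominates the trajectory variance
```

A finding like "PC1 captures 68-83% of variance" would correspond to `ratio[0]` landing in that range when `weight_snapshots` holds the real checkpoints.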
Computer Science > Machine Learning
arXiv:2602.16746 (cs)
[Submitted on 18 Feb 2026]
Title: Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking
Authors: Yongzhong Xu

Abstract: Grokking -- the delayed transition from memorization to generalization in small algorithmic tasks -- remains poorly understood. We present a geometric analysis of optimization dynamics in transformers trained on modular arithmetic. PCA of attention weight trajectories reveals that training evolves predominantly within a low-dimensional execution subspace, with a single principal component capturing 68-83% of trajectory variance. To probe loss-landscape geometry, we measure commutator defects -- the non-commutativity of successive gradient steps -- and project them onto this learned subspace. We find that curvature grows sharply in directions orthogonal to the execution subspace while the trajectory remains largely confined to it. Importantly, curvature growth consistently precedes generalization across learning rates and hyperparameter regimes, with the lead time obeying a power law in the grokking timescale. Causal intervention experiments show that motion along the learned subspace is necessary for grokking, while artificially increasing curvature is insufficient. Together, these results support a geometric account in which grokkin...
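The commutator-defect idea in the abstract -- measuring how far two successive gradient steps are from commuting -- can be sketched as follows. The paper's exact estimator is not given here; this toy version applies two gradient steps in both orders and takes the difference, which on a quadratic loss is, to leading order, eta^2 (BA - AB) theta, so it vanishes when the two local Hessians commute and grows with curvature mismatch:

```python
import numpy as np

def commutator_defect(grad_a, grad_b, theta, eta=0.1):
    """Non-commutativity of two gradient steps (illustrative sketch).

    Applies step a then b, and step b then a, from the same point
    theta; returns the difference of the two endpoints, which is
    zero exactly when the steps commute.
    """
    ab = theta - eta * grad_a(theta)
    ab = ab - eta * grad_b(ab)          # a first, then b
    ba = theta - eta * grad_b(theta)
    ba = ba - eta * grad_a(ba)          # b first, then a
    return ab - ba

# Quadratic toy losses with non-commuting Hessians A and B
A = np.diag([1.0, 4.0])
B = np.array([[2.0, 1.0],
              [1.0, 2.0]])
theta = np.array([1.0, 1.0])

defect = commutator_defect(lambda x: A @ x, lambda x: B @ x, theta)
print(np.linalg.norm(defect) > 0)  # → True: the steps do not commute
```

Projecting such a defect vector onto the PCA subspace (versus its orthogonal complement) is how one would separate curvature along the execution subspace from the transverse curvature the paper emphasizes.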