[2602.16746] Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking

Summary

The paper presents a geometric analysis of optimization dynamics in transformers trained on modular arithmetic. Its focus is grokking: the delayed transition from memorization to generalization in small algorithmic tasks.

Why It Matters

Grokking remains poorly understood, and explaining it could inform how models are trained. The paper's findings on low-dimensional dynamics and curvature point to geometric signals that precede generalization, which may help guide training strategies.

Key Takeaways

  • Grokking is a delayed transition from memorization to generalization, studied here in transformers trained on modular arithmetic.
  • Training evolves predominantly within a low-dimensional execution subspace: PCA of attention weight trajectories shows a single principal component capturing 68-83% of trajectory variance.
  • Curvature, measured via commutator defects (the non-commutativity of successive gradient steps), grows sharply in directions orthogonal to the execution subspace while the trajectory stays largely confined to it.
  • Causal interventions show that motion along the learned subspace is necessary for grokking, while artificially increasing curvature is insufficient.
  • Curvature growth consistently precedes generalization across learning rates and hyperparameter regimes, with the lead time obeying a power law in the grokking timescale.
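The PCA measurement behind the low-dimensional claim can be sketched roughly as follows. This is a minimal illustration on a synthetic trajectory, not the paper's code: the function name `trajectory_pca`, the checkpoint layout (one flattened weight vector per saved step), and the toy data are all assumptions made for the example.

```python
import numpy as np

def trajectory_pca(checkpoints, k=1):
    """PCA of a weight trajectory (hypothetical helper, not from the paper).

    checkpoints: array of shape (T, D), one flattened weight vector per step.
    Returns (basis, explained): basis has shape (k, D) with orthonormal rows,
    explained[i] is the fraction of trajectory variance along component i.
    """
    X = np.asarray(checkpoints, dtype=float)
    Xc = X - X.mean(axis=0, keepdims=True)  # center over time
    # Rows of Vt are the principal directions of the centered trajectory.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2
    return Vt[:k], (var / var.sum())[:k]

# Synthetic stand-in for saved attention weights: steady drift along one
# direction plus small isotropic noise (assumed data, for illustration only).
rng = np.random.default_rng(0)
T, D = 200, 50
direction = rng.normal(size=D)
direction /= np.linalg.norm(direction)
drift = np.linspace(0.0, 10.0, T)[:, None] * direction[None, :]
noise = 0.2 * rng.normal(size=(T, D))

basis, explained = trajectory_pca(drift + noise, k=1)
print(f"variance captured by PC1: {explained[0]:.2f}")
```

With real checkpoints, the rows of `checkpoints` would be the flattened attention weights saved during training, and `basis` would play the role of the execution subspace.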

Computer Science > Machine Learning

arXiv:2602.16746 (cs), submitted on 18 Feb 2026

Title: Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking

Authors: Yongzhong Xu

Abstract: Grokking -- the delayed transition from memorization to generalization in small algorithmic tasks -- remains poorly understood. We present a geometric analysis of optimization dynamics in transformers trained on modular arithmetic. PCA of attention weight trajectories reveals that training evolves predominantly within a low-dimensional execution subspace, with a single principal component capturing 68-83% of trajectory variance. To probe loss-landscape geometry, we measure commutator defects -- the non-commutativity of successive gradient steps -- and project them onto this learned subspace. We find that curvature grows sharply in directions orthogonal to the execution subspace while the trajectory remains largely confined to it. Importantly, curvature growth consistently precedes generalization across learning rates and hyperparameter regimes, with the lead time obeying a power law in the grokking timescale. Causal intervention experiments show that motion along the learned subspace is necessary for grokking, while artificially increasing curvature is insufficient. Together, these results support a geometric account in which grokkin...
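The abstract does not spell out how the commutator defect is computed, so the following is only one plausible reading: take two gradient steps in both orders and compare the end points, then split the difference into its parts inside and orthogonal to a given subspace. The names `commutator_defect` and `split_by_subspace`, the quadratic toy losses, and the random basis are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def commutator_defect(theta, grad_a, grad_b, lr):
    """Difference between stepping A-then-B and B-then-A (assumed definition)."""
    def step(t, grad):
        return t - lr * grad(t)
    theta_ab = step(step(theta, grad_a), grad_b)
    theta_ba = step(step(theta, grad_b), grad_a)
    return theta_ab - theta_ba

def split_by_subspace(vec, basis):
    """Split vec into parts inside and orthogonal to span(rows of basis).

    basis: shape (k, D) with orthonormal rows.
    """
    parallel = basis.T @ (basis @ vec)
    return parallel, vec - parallel

# Toy setup: two quadratic losses 0.5 * theta^T M theta with gradient M @ theta,
# standing in for the losses on two mini-batches (assumed, for illustration).
rng = np.random.default_rng(1)
D, k, lr = 20, 2, 0.1
A = rng.normal(size=(D, D)); A = A + A.T
B = rng.normal(size=(D, D)); B = B + B.T
theta = rng.normal(size=D)

defect = commutator_defect(theta, lambda t: A @ t, lambda t: B @ t, lr)

# Random orthonormal basis standing in for the learned execution subspace.
Q, _ = np.linalg.qr(rng.normal(size=(D, k)))
basis = Q.T
parallel, orthogonal = split_by_subspace(defect, basis)
print("in-subspace norm:", np.linalg.norm(parallel))
print("orthogonal norm: ", np.linalg.norm(orthogonal))
```

For these quadratic losses the defect equals `lr**2 * (B @ A - A @ B) @ theta`, so it vanishes exactly when the two Hessians commute. That is why this kind of quantity can act as a probe of curvature.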
