[2602.18523] The Geometry of Multi-Task Grokking: Transverse Instability, Superposition, and Weight Decay Phase Structure
Summary
This paper extends the geometric analysis of grokking to multi-task modular arithmetic. It reports five consistent phenomena observed when training shared-trunk Transformers on dual-task and tri-task objectives across a systematic weight decay sweep.
Why It Matters
Grokking has been studied mainly in single-task settings, so extending the analysis to models that share parameters across tasks fills a gap. This research shows how weight decay shapes training dynamics and generalization in that setting, which can inform how multi-task models are trained and regularized.
Key Takeaways
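- Grokking transitions from memorization to generalization occur in a staggered order: multiplication generalizes first, then squaring, then addition, with consistent delays across seeds.
- Optimization trajectories stay confined to a low-dimensional manifold, and commutator defects orthogonal to that manifold reliably precede generalization.
- Grokking timescale, curvature depth, reconstruction threshold, and defect lead all covary with weight decay, revealing distinct dynamical regimes and a sharp failure mode when weight decay is zero.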
- Final solutions are fragile and highly sensitive to perturbations, indicating a need for robust training methods.
- Redundant parameters in overparameterized models can recover performance even after significant gradient component removal.
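One way to check whether a training trajectory stays in a low-dimensional subspace is to run PCA on saved parameter checkpoints and count how many principal directions carry most of the variance. The sketch below is only an illustration of that idea, not the paper's procedure: the function names, the 99% threshold, and the synthetic trajectory are all assumptions.

```python
import numpy as np

def explained_variance_ratio(checkpoints: np.ndarray) -> np.ndarray:
    """Fraction of trajectory variance captured by each principal direction.

    checkpoints: array of shape (num_steps, num_params), one flattened
    parameter vector per saved training step.
    """
    centered = checkpoints - checkpoints.mean(axis=0, keepdims=True)
    # Squared singular values of the centered matrix give the PCA spectrum.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2
    return variance / variance.sum()

def effective_rank(checkpoints: np.ndarray, threshold: float = 0.99) -> int:
    """Smallest number of principal directions explaining `threshold` of the variance."""
    cumulative = np.cumsum(explained_variance_ratio(checkpoints))
    return int(np.searchsorted(cumulative, threshold) + 1)

# Synthetic stand-in for a trajectory (NOT data from the paper):
# 200 checkpoints of a 500-parameter model moving inside a 3-dimensional
# subspace, plus a tiny amount of isotropic noise.
rng = np.random.default_rng(0)
basis = rng.normal(size=(3, 500))
coords = rng.normal(size=(200, 3))
trajectory = coords @ basis + 1e-6 * rng.normal(size=(200, 500))

print(effective_rank(trajectory))
```

With real checkpoints, `trajectory` would be replaced by a matrix of flattened model weights saved during training. The threshold is a free choice and changes the reported count.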
Computer Science > Machine Learning
arXiv:2602.18523 (cs)
[Submitted on 19 Feb 2026]
Title: The Geometry of Multi-Task Grokking: Transverse Instability, Superposition, and Weight Decay Phase Structure
Authors: Yongzhong Xu
Abstract: Grokking -- the abrupt transition from memorization to generalization long after near-zero training loss -- has been studied mainly in single-task settings. We extend geometric analysis to multi-task modular arithmetic, training shared-trunk Transformers on dual-task (mod-add + mod-mul) and tri-task (mod-add + mod-mul + mod-sq) objectives across a systematic weight decay sweep. Five consistent phenomena emerge. (1) Staggered grokking order: multiplication generalizes first, followed by squaring, then addition, with consistent delays across seeds. (2) Universal integrability: optimization trajectories remain confined to an empirically invariant low-dimensional execution manifold; commutator defects orthogonal to this manifold reliably precede generalization. (3) Weight decay phase structure: grokking timescale, curvature depth, reconstruction threshold, and defect lead covary systematically with weight decay, revealing distinct dynamical regimes and a sharp no-decay failure mode. (4) Holographic incompressibility: final solutions occupy only 4--8 principal trajector...
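The dual-task setting named in the abstract (mod-add + mod-mul) can be sketched as a dataset in which every input pair carries one label per task, so a shared trunk sees the same inputs for both objectives. This is a minimal sketch under stated assumptions: the modulus 97, the 50% train split, and the function name `make_dual_task_data` are hypothetical and are not taken from the paper. The mod-sq task is left out because the abstract does not define it.

```python
import numpy as np

def make_dual_task_data(p: int, train_fraction: float, seed: int = 0):
    """Enumerate all (a, b) pairs mod p, label them for both tasks, and split.

    Returns (train, test); each is a dict with keys "inputs", "add", "mul".
    """
    a, b = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")
    inputs = np.stack([a.ravel(), b.ravel()], axis=1)  # shape (p*p, 2)
    labels_add = (inputs[:, 0] + inputs[:, 1]) % p     # mod-add target
    labels_mul = (inputs[:, 0] * inputs[:, 1]) % p     # mod-mul target

    # Random split shared by both tasks, so the tasks see identical inputs.
    rng = np.random.default_rng(seed)
    order = rng.permutation(p * p)
    n_train = int(train_fraction * p * p)

    def take(idx):
        return {"inputs": inputs[idx], "add": labels_add[idx], "mul": labels_mul[idx]}

    return take(order[:n_train]), take(order[n_train:])

P = 97  # hypothetical modulus; the abstract does not state the value used
train, test = make_dual_task_data(P, train_fraction=0.5)
print(train["inputs"].shape, test["inputs"].shape)
```

A weight decay sweep would then repeat training on the same split for each decay value, so that differences in grokking time come from the regularization and not from the data.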