[2602.22988] Residual Koopman Spectral Profiling for Predicting and Preventing Transformer Training Instability
Summary
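This paper introduces Residual Koopman Spectral Profiling (RKSP), which estimates a transformer's risk of training divergence from a single forward pass at initialization, and Koopman Spectral Shaping (KSS), which reshapes spectra during training to prevent divergence when that risk is high.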
Why It Matters
Training divergence in transformers wastes compute, and practitioners usually discover instability only after an expensive run has begun. By estimating the probability of failure before training starts, RKSP lets them mitigate the risk up front instead of after resources have been spent.
Key Takeaways
- RKSP predicts transformer training instability from a single forward pass at initialization.
- The method achieves an AUROC of 0.995 for predicting divergence, outperforming the best gradient baseline.
- Koopman Spectral Shaping (KSS) effectively prevents divergence when RKSP indicates high risk.
- KSS significantly reduces divergence rates and allows training at higher learning rates.
- The findings are applicable across various transformer architectures and datasets.
Computer Science > Machine Learning
arXiv:2602.22988 (cs)
[Submitted on 26 Feb 2026]
Title: Residual Koopman Spectral Profiling for Predicting and Preventing Transformer Training Instability
Authors: Bum Jun Kim, Shohei Taniguchi, Makoto Kawano, Yusuke Iwasawa, Yutaka Matsuo
Abstract: Training divergence in transformers wastes compute, yet practitioners discover instability only after expensive runs begin. They therefore need an expected probability of failure for a transformer before training starts. Our study of Residual Koopman Spectral Profiling (RKSP) provides such an estimate. From a single forward pass at initialization, RKSP extracts Koopman spectral features by applying whitened dynamic mode decomposition to layer-wise residual snapshots. Our central diagnostic, the near-unit spectral mass, quantifies the fraction of modes concentrated near the unit circle, which captures instability risk. For predicting divergence across extensive configurations, this estimator achieves an AUROC of 0.995, outperforming the best gradient baseline. We further make this diagnostic actionable through Koopman Spectral Shaping (KSS), which reshapes spectra during training. We empirically validate that our method works in practice: RKSP predicts divergence at initialization, and when RKSP flags hig...
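The profiling step described in the abstract (layer-wise residual snapshots, whitened dynamic mode decomposition, near-unit spectral mass) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the SVD-based whitening, the lack of mean-centering, and the band half-width `delta` are all hypothetical choices, because the abstract does not specify them.

```python
import numpy as np


def whitened_dmd_eigenvalues(states, tol=1e-10):
    """Estimate Koopman eigenvalues from layer-wise residual snapshots.

    states: array of shape (num_layers + 1, num_tokens, d) holding the
            residual-stream state entering each layer plus the final output,
            collected in one forward pass (assumed layout).
    """
    states = np.asarray(states, dtype=float)
    d = states.shape[-1]
    # Pair every state with its successor one layer later.
    X = states[:-1].reshape(-1, d).T  # (d, m) inputs
    Y = states[1:].reshape(-1, d).T   # (d, m) outputs, same column order

    # Whiten with the SVD of X, discarding numerically null directions.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    keep = s > tol * s[0]
    U, s, Vt = U[:, keep], s[keep], Vt[keep]

    # In whitened coordinates Zx = diag(1/s) U^T X = Vt has orthonormal rows,
    # so the least-squares operator mapping Zx to Zy is simply Zy @ Zx^T.
    Zy = (U.T @ Y) / s[:, None]
    K = Zy @ Vt.T
    return np.linalg.eigvals(K)


def near_unit_spectral_mass(eigenvalues, delta=0.05):
    """Fraction of eigenvalues whose modulus lies within delta of 1.

    delta is an assumed band half-width, not a value taken from the paper.
    """
    moduli = np.abs(np.asarray(eigenvalues))
    return float(np.mean(np.abs(moduli - 1.0) < delta))
```

As a sanity check on the sketch, a toy linear "network" whose layer map is a pure rotation yields eigenvalues of unit modulus, so the mass is 1; scaling the same map by 0.5 moves every eigenvalue to modulus 0.5 and the mass falls to 0. How the mass translates into a divergence probability is part of the paper's method and is not reproduced here.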