[2602.10956] Stochastic Parroting in Temporal Attention -- Regulating the Diagonal Sink
Summary
The paper examines biases in the temporal attention mechanisms of spatio-temporal machine learning models and proposes regularization methods to mitigate them.
Why It Matters
Understanding the biases in temporal attention is crucial for improving the performance of machine learning models that rely on spatio-temporal data. This research provides insights into potential regularization techniques that can enhance model accuracy and reliability, which is essential for applications in various fields such as natural language processing and robotics.
Key Takeaways
- Temporal attention mechanisms can suffer from biases due to over-squashing.
- The paper derives sensitivity bounds on the Jacobian of temporal attention layers.
- Regularization methods are proposed to address diagonal attention sinks.
- Experimental results demonstrate the effectiveness of the suggested methods.
- Insights from this research can enhance model performance in spatio-temporal tasks.
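The "diagonal attention sink" in the takeaways above can be reproduced with a minimal NumPy sketch. This is illustrative only, not the paper's derivation: when queries and keys are computed from the same features (here simply Q = K = X), each diagonal score is a squared norm while off-diagonal scores concentrate near zero, so softmax mass collapses onto the diagonal. All variable names (`X`, `diag_mass`) are hypothetical.

```python
import numpy as np

# Illustrative sketch of a diagonal attention sink (not the paper's setup):
# with Q = K = X, the diagonal score x_t . x_t is a squared norm (~d in
# expectation), while off-diagonal scores x_t . x_s are roughly zero-mean,
# so the softmax concentrates on the diagonal.
rng = np.random.default_rng(0)
T, d = 16, 64                      # sequence length, feature dimension
X = rng.standard_normal((T, d))
scores = X @ X.T / np.sqrt(d)      # scaled dot-product attention logits

# Numerically stable row-wise softmax.
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)

diag_mass = float(np.mean(np.diag(attn)))   # average self-attention weight
uniform_mass = 1.0 / T                      # baseline under uniform attention
```

In this toy setup `diag_mass` is close to 1, far above the uniform baseline of 1/T, which is the qualitative behavior the paper's regularizers are meant to counteract.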
Computer Science > Machine Learning
arXiv:2602.10956 (cs) [Submitted on 11 Feb 2026 (v1), last revised 18 Feb 2026 (this version, v2)]

Title: Stochastic Parroting in Temporal Attention -- Regulating the Diagonal Sink
Authors: Victoria Hankemeier, Malte Schilling

Abstract: Spatio-temporal models analyze spatial structures and temporal dynamics, which makes them prone to information degeneration among space and time. Prior literature has demonstrated that over-squashing in causal attention or temporal convolutions creates a bias on the first tokens. To analyze whether such a bias is present in temporal attention mechanisms, we derive sensitivity bounds on the expected value of the Jacobian of a temporal attention layer. We theoretically show how off-diagonal attention scores depend on the sequence length, and that temporal attention matrices suffer a diagonal attention sink. We suggest regularization methods, and experimentally demonstrate their effectiveness.

Subjects: Machine Learning (cs.LG)
Cite as: arXiv:2602.10956 [cs.LG] (or arXiv:2602.10956v2 [cs.LG] for this version), https://doi.org/10.48550/arXiv.2602.10956
Submission history: From Victoria Hankemeier. [v1] Wed, 11 Feb 2026 15:45:34 UTC (374 KB) [v2] Wed, 18 Feb 2026 ...
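The abstract proposes regularization methods against the diagonal sink but does not spell them out here. One plausible form such a regularizer could take (a sketch, not the paper's actual method; `diagonal_sink_penalty` and its baseline are assumptions) is a penalty on the excess attention mass that rows place on their own diagonal entry, relative to the 1/T mass a uniform attention matrix would assign:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def diagonal_sink_penalty(scores):
    """Hypothetical penalty on excess diagonal attention mass.

    `scores` is a (T, T) matrix of raw attention logits. Uniform
    attention puts 1/T on each diagonal entry, so the penalty is the
    average diagonal mass above that baseline (zero if none). This is
    an illustrative sketch, not the regularizer from the paper.
    """
    attn = softmax(scores, axis=-1)        # rows sum to 1
    T = scores.shape[0]
    diag_mass = float(np.mean(np.diag(attn)))
    return max(diag_mass - 1.0 / T, 0.0)

# Logits with a strong self-preference incur a positive penalty ...
rng = np.random.default_rng(0)
T = 8
sinky = rng.standard_normal((T, T)) + 5.0 * np.eye(T)
penalty = diagonal_sink_penalty(sinky)

# ... while perfectly uniform logits incur none.
flat_penalty = diagonal_sink_penalty(np.zeros((T, T)))
```

Added to a training loss with a small weight, such a term would push attention mass off the diagonal; the 1/T baseline keeps it from penalizing sequence-length effects that uniform attention already exhibits.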