[2512.22213] On the Existence and Behavior of Secondary Attention Sinks
Summary
This paper identifies and characterizes secondary attention sinks in transformer models, contrasting their properties and behavior with the primary attention sinks studied in prior work.
Why It Matters
Understanding secondary attention sinks is crucial for improving the interpretability and efficiency of attention mechanisms in machine learning models. This research could lead to advancements in model design and performance, particularly in natural language processing and generative AI applications.
Key Takeaways
- Secondary attention sinks differ from primary sinks, emerging in middle layers of models.
- These sinks draw a smaller but significant amount of attention mass.
- The formation of secondary sinks is influenced by specific middle-layer MLP modules.
- Larger models exhibit sink behavior that is more deterministic and more frequent.
- Understanding these sinks can enhance the design of attention mechanisms.
Computer Science > Machine Learning
arXiv:2512.22213 (cs)
[Submitted on 22 Dec 2025 (v1), last revised 19 Feb 2026 (this version, v2)]
Title: On the Existence and Behavior of Secondary Attention Sinks
Authors: Jeffrey T. H. Wong, Cheng Zhang, Louis Mahon, Wayne Luk, Anton Isopoussu, Yiren Zhao
Abstract: Attention sinks are tokens, often the beginning-of-sequence (BOS) token, that receive disproportionately high attention despite limited semantic relevance. In this work, we identify a class of attention sinks, which we term secondary sinks, that differ fundamentally from the sinks studied in prior works, which we term primary sinks. While prior works have identified that tokens other than BOS can sometimes become sinks, those tokens were found to exhibit properties analogous to the BOS token: they emerge at the same layer, persist throughout the network, and draw a large amount of attention mass. In contrast, we find secondary sinks that arise primarily in middle layers, persist for a variable number of layers, and draw a smaller, but still significant, amount of attention mass. Through extensive experiments across 11 model families, we analyze where these secondary sinks appear, their properties, how they are formed, and their impact on the attention mechanism. Specifically, we show that: (1)...
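To make the core notion concrete, a sink token is one whose column of the attention matrix collects a disproportionate share of attention mass. The sketch below is a minimal, illustrative way to flag such tokens from a single head's attention weights; the function names and the 0.3 threshold are assumptions for the example, not values from the paper (primary sinks draw far more mass than secondary ones, so in practice separate thresholds would be needed).

```python
import numpy as np

def attention_mass(attn: np.ndarray) -> np.ndarray:
    """Average attention mass received by each key token.

    `attn` is a (num_queries, num_keys) row-stochastic attention
    matrix (each row sums to 1). The mass for key j is the mean of
    column j: the average attention all queries pay to token j.
    """
    return attn.mean(axis=0)

def find_sinks(attn: np.ndarray, threshold: float = 0.3) -> list[int]:
    """Indices of tokens whose received mass exceeds `threshold`.

    The 0.3 cutoff is an illustrative choice, not from the paper.
    """
    mass = attention_mass(attn)
    return [j for j, m in enumerate(mass) if m > threshold]

# Toy example: 4 queries over 4 keys, where token 0 (a BOS-like
# sink) draws most of the attention from every query.
attn = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.65, 0.15, 0.10, 0.10],
    [0.60, 0.10, 0.20, 0.10],
    [0.75, 0.05, 0.10, 0.10],
])
print(find_sinks(attn))  # -> [0]: only token 0 exceeds the cutoff
```

Tracking this per-token mass layer by layer is what distinguishes the two sink classes in the paper's framing: primary sinks appear early and persist throughout the network, while secondary sinks emerge in middle layers and persist for a variable number of layers.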