[2602.13524] Singular Vectors of Attention Heads Align with Features
Summary
This paper explores the alignment of singular vectors of attention heads with feature representations in language models, providing theoretical justification and practical implications for mechanistic interpretability.
Why It Matters
Understanding how attention mechanisms in language models align with features is crucial for advancing interpretability in AI. This research addresses a gap in existing literature by providing a theoretical framework and empirical evidence, which can enhance the development of more transparent AI systems.
Key Takeaways
- Singular vectors of attention heads can align with observable features in language models.
- The paper provides theoretical conditions under which this alignment is expected.
- Sparse attention decomposition is proposed as a testable prediction for recognizing alignment in real models.
- Empirical evidence supports the theoretical claims about alignment.
- This research contributes to the field of mechanistic interpretability in AI.
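The sparse attention decomposition mentioned in the takeaways can be illustrated with a small numpy sketch. This is not the paper's actual construction; the matrix sizes and token vectors below are invented for illustration. The point is that the bilinear attention score x_q^T W_QK x_k decomposes exactly into one term per singular direction, so when the query representation lies along a single singular vector, the score is carried by (at most) a few terms:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_head = 64, 16  # toy dimensions, chosen for illustration

# Toy low-rank "attention" matrix W_QK = W_Q W_K^T for one head.
W_Q = rng.normal(scale=0.1, size=(d_model, d_head))
W_K = rng.normal(scale=0.1, size=(d_model, d_head))
W_QK = W_Q @ W_K.T

U, S, Vt = np.linalg.svd(W_QK)

# Hypothetical token representations: x_q is placed exactly along the
# top left singular vector, x_k near the top right one plus noise.
x_q = U[:, 0]
x_k = Vt[0] + 0.3 * rng.normal(size=d_model)

# The score decomposes over singular directions:
#   x_q^T W_QK x_k = sum_i S[i] * (u_i . x_q) * (v_i . x_k)
terms = S * (U.T @ x_q) * (Vt @ x_k)
score = x_q @ W_QK @ x_k
assert np.isclose(terms.sum(), score)

# Fraction of the score carried by the single largest term:
# here nearly all of it, i.e. the decomposition is sparse.
print(np.abs(terms).max() / np.abs(terms).sum())
```

Because `x_q` is exactly a left singular vector here, all but one term vanish; in a real model the prediction is softer, namely that only a few terms dominate.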
Computer Science > Machine Learning
arXiv:2602.13524 (cs)
[Submitted on 13 Feb 2026]
Authors: Gabriel Franco, Carson Loughridge, Mark Crovella
Abstract: Identifying feature representations in language models is a central task in mechanistic interpretability. Several recent studies have made an implicit assumption that feature representations can be inferred in some cases from singular vectors of attention matrices. However, sound justification for this assumption is lacking. In this paper we address that question, asking: why and when do singular vectors align with features? First, we demonstrate that singular vectors robustly align with features in a model where features can be directly observed. We then show theoretically that such alignment is expected under a range of conditions. We close by asking how, operationally, alignment may be recognized in real models where feature representations are not directly observable. We identify sparse attention decomposition as a testable prediction of alignment, and show evidence that it emerges in a manner consistent with predictions in real models. Together these results suggest that alignment of singular vectors with features can be a sound and theoretically justified basis for feature identification in language models.
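The central claim of the abstract, that singular vectors of an attention matrix can align with a feature direction, can be sketched in a few lines of numpy. Everything here is invented for illustration: the dimensions, the "feature" direction, and the way the query weights are biased toward it are assumptions, not the paper's model. The sketch shows that when a head reads strongly along one direction, that direction surfaces as the top left singular vector of W_QK = W_Q W_K^T:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 64, 16  # toy dimensions, chosen for illustration

# Hypothetical feature direction in the residual stream (unit norm).
feature = rng.normal(size=d_model)
feature /= np.linalg.norm(feature)

# Toy query/key weights for one head; the query weights are biased so
# the head "reads" the feature direction strongly (illustrative only).
W_Q = rng.normal(scale=0.1, size=(d_model, d_head))
W_K = rng.normal(scale=0.1, size=(d_model, d_head))
W_Q += 2.0 * np.outer(feature, rng.normal(size=d_head))

# Effective bilinear attention matrix (d_model x d_model).
W_QK = W_Q @ W_K.T

# Alignment of the top left singular vector with the feature
# (absolute value, since singular vectors have a sign ambiguity).
U, S, Vt = np.linalg.svd(W_QK)
alignment = abs(U[:, 0] @ feature)
print(f"cosine similarity of top singular vector with feature: {alignment:.3f}")
```

The rank-one bias dominates the random component of the weights, so the cosine similarity comes out close to 1; with unbiased random weights it would be near 0.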