[2512.05865] Sparse Attention Post-Training for Mechanistic Interpretability
Summary
The paper presents a post-training method that makes transformer attention sparse while maintaining performance, and shows that the resulting sparsity substantially simplifies mechanistic-interpretability analyses.
Why It Matters
This research is significant as it addresses the redundancy in transformer models, suggesting that increased sparsity can lead to more interpretable AI systems. By simplifying attention mechanisms, it opens pathways for better understanding and optimizing machine learning models, which is crucial for advancing AI safety and efficiency.
Key Takeaways
- The proposed method reduces attention connectivity to approximately 0.4% of its edges without sacrificing performance, on models up to 7B parameters.
- Sparsity serves as a structural prior, enhancing interpretability of transformer models.
- The approach leads to global circuit simplification, reducing the complexity of task-specific circuits.
- Cross-layer transcoders facilitate a unified view of feature-based and circuit-based perspectives.
- The findings suggest that much of the computation in transformers may be redundant.
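The headline "0.4% connectivity" number refers to the fraction of attention edges that carry non-negligible weight. A minimal sketch of how such edge density could be measured (the `attention_weights`, `edge_density` names and the threshold value are illustrative, not from the paper):

```python
import numpy as np

def attention_weights(scores: np.ndarray) -> np.ndarray:
    """Row-wise softmax over raw attention scores."""
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def edge_density(attn: np.ndarray, threshold: float = 1e-3) -> float:
    """Fraction of attention edges whose weight exceeds `threshold`."""
    return float((attn > threshold).mean())

rng = np.random.default_rng(0)
# Near-uniform scores: almost every query-key edge carries weight.
dense = attention_weights(rng.normal(0.0, 0.1, size=(64, 64)))
# Widely spread scores: softmax concentrates mass on a few edges per row.
sparse = attention_weights(rng.normal(0.0, 8.0, size=(64, 64)))

print(f"dense  connectivity: {edge_density(dense):.3f}")
print(f"sparse connectivity: {edge_density(sparse):.3f}")
```

The same statistic applied to a sparsity-regularised model is what a connectivity figure like 0.4% would summarise across heads and layers.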
Computer Science > Machine Learning
arXiv:2512.05865 (cs)
[Submitted on 5 Dec 2025 (v1), last revised 25 Feb 2026 (this version, v3)]
Title: Sparse Attention Post-Training for Mechanistic Interpretability
Authors: Florent Draye, Anson Lei, Hsiao-Ru Pan, Ingmar Posner, Bernhard Schölkopf
Abstract: We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible sparsity regularisation under a constrained-loss objective, we show on models up to 7B parameters that it is possible to retain the original pretraining loss while reducing attention connectivity to $\approx 0.4\%$ of its edges. Unlike sparse-attention methods designed for computational efficiency, our approach leverages sparsity as a structural prior: it preserves capability while exposing a more organized and interpretable connectivity pattern. We find that this local sparsity cascades into global circuit simplification: task-specific circuits involve far fewer components (attention heads and MLPs) with up to 100x fewer edges connecting them. Additionally, using cross-layer transcoders, we show that sparse attention substantially simplifies attention attribution, enabling a unified view of feature-based and circuit-based perspectives. These results demonstrate that transformer attention c...
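The abstract's "sparsity regularisation under a constrained-loss objective" describes optimising for sparsity subject to the task loss staying within a budget. A toy 1-D dual-ascent sketch of that pattern (the scalar problem, step sizes, and all function names are illustrative assumptions, not the paper's actual formulation):

```python
# Constrained-loss objective:
#   minimise sparsity_penalty(theta)  subject to  task_loss(theta) <= budget.
# The Lagrangian  sparsity_penalty + lam * (task_loss - budget)  is descended
# in theta, while lam rises whenever the loss constraint is violated.

def task_loss(theta: float) -> float:
    return (theta - 1.0) ** 2          # stand-in for the pretraining loss

def sparsity_penalty(theta: float) -> float:
    return abs(theta)                  # L1-style penalty driving theta to 0

def grad(f, x: float, eps: float = 1e-5) -> float:
    """Central-difference gradient, enough for this scalar toy."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

budget = 0.25                          # allowed task-loss level
theta, lam = 2.0, 0.0
for _ in range(2000):
    g = grad(sparsity_penalty, theta) + lam * grad(task_loss, theta)
    theta -= 0.01 * g                                         # primal descent
    lam = max(0.0, lam + 0.1 * (task_loss(theta) - budget))   # dual ascent

print(f"theta={theta:.3f}  task_loss={task_loss(theta):.3f}")
```

The solution settles near the sparsest point that still satisfies the loss budget, mirroring how the paper reports retaining the original pretraining loss while pushing connectivity down.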