[2508.04581] Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning
Summary
The paper presents MASA (Matrix Atom Sharing in Attention), a framework for structured weight sharing across transformer layers that reduces attention-module parameters by 66.7% while maintaining performance, addressing efficiency in large language models.
Why It Matters
As large language models grow in size, their deployment becomes challenging due to high computational and memory demands. MASA offers a scalable way to improve transformer efficiency, making advanced AI applications more accessible and sustainable.
Key Takeaways
- MASA reduces transformer attention parameters by 66.7% without performance loss.
- The method leverages matrix-based dictionary learning for structured weight sharing.
- Experiments show MASA outperforms existing compression techniques in benchmark accuracy.
- The approach is a drop-in replacement, requiring no architectural changes.
- When extended to Vision Transformers, MASA achieves similar efficiency gains in image classification.
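The dictionary-sharing idea behind these takeaways can be sketched in a few lines: instead of storing an independent projection matrix per layer, each layer keeps only a small vector of mixing coefficients over a bank of full-size atom matrices shared by all layers. The numpy sketch below is illustrative only; the names (`atoms`, `coeffs`, `layer_weight`) and toy sizes are assumptions, not the paper's actual code.

```python
import numpy as np

# Illustrative sketch of MASA-style weight sharing (toy sizes, hypothetical
# names). Every layer's projection matrix is reconstructed as a linear
# combination of dictionary atoms shared across all layers.

rng = np.random.default_rng(0)

n_layers, d = 12, 64   # toy model: 12 layers, 64x64 projection matrices
n_atoms = 4            # shared dictionary size (n_atoms << n_layers)

# Shared dictionary: n_atoms full-size matrices reused by every layer.
atoms = rng.standard_normal((n_atoms, d, d))

# Per-layer parameters: only n_atoms mixing coefficients each.
coeffs = rng.standard_normal((n_layers, n_atoms))

def layer_weight(layer: int) -> np.ndarray:
    """Reconstruct one layer's projection as a weighted sum of shared atoms."""
    return np.einsum("m,mij->ij", coeffs[layer], atoms)

# Parameter count: independent per-layer weights vs. shared dictionary.
dense_params = n_layers * d * d                    # 12 * 64 * 64 = 49152
shared_params = n_atoms * d * d + n_layers * n_atoms  # 16384 + 48 = 16432

print(layer_weight(0).shape)  # (64, 64)
print(dense_params, shared_params)
```

With these toy numbers the shared parameterization stores roughly a third of the dense parameter count, which mirrors the scale of the reduction the paper reports for attention projections; in the actual method the coefficients and atoms are trained jointly with standard optimizers.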
Computer Science > Computation and Language, arXiv:2508.04581 (cs)
[Submitted on 6 Aug 2025 (v1), last revised 19 Feb 2026 (this version, v2)]
Authors: Magauiya Zhussip, Dmitriy Shopkhoev, Ammar Ali, Stamatios Lefkimmiatis
Abstract: Large language models have revolutionized AI applications, yet their high computational and memory demands hinder their widespread deployment. Existing compression techniques focus on intra-block optimizations (e.g., low-rank approximation or attention pruning), while the repetitive layered structure of transformers implies significant inter-block redundancy - a dimension largely unexplored beyond key-value (KV) caching. Inspired by dictionary learning in convolutional networks, we propose a framework for structured weight sharing across transformer layers. Our approach decomposes attention projection matrices (Q, K, V, O) into shared dictionary atoms, reducing the attention module's parameters by 66.7% while achieving on-par performance. Unlike complex methods requiring distillation or architectural changes, MASA (Matrix Atom Sharing in Attention) operates as a drop-in replacement - trained with standard optimizers - and represents each layer's weights as a linear combination...