[2508.04581] Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning
Summary
The paper presents MASA (Matrix Atom Sharing in Attention), a framework for structured weight sharing across transformer layers that reduces attention-module parameters by 66.7% while maintaining performance, addressing efficiency in large language models.
Why It Matters
As large language models grow in size, their deployment becomes challenging due to high computational and memory demands. MASA offers a scalable way to improve transformer efficiency, making advanced AI applications more accessible and sustainable.
Key Takeaways
- MASA reduces transformer attention parameters by 66.7% without performance loss.
- The method leverages matrix-based dictionary learning for structured weight sharing.
- Experiments show MASA outperforms existing compression techniques in benchmark accuracy.
- The approach is a drop-in replacement, requiring no architectural changes.
- When extended to Vision Transformers, MASA achieves similar efficiency gains in image classification.
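The dictionary-sharing idea behind these takeaways can be sketched in a few lines: instead of storing an independent projection matrix per layer, each layer keeps only a small vector of mixing coefficients over a bank of full-size atom matrices shared by all layers. The numpy sketch below is illustrative only; the names (`atoms`, `coeffs`, `layer_weight`) and toy sizes are assumptions, not the paper's actual code.

```python
import numpy as np

# Illustrative sketch of MASA-style weight sharing (toy sizes, hypothetical
# names). Every layer's projection matrix is reconstructed as a linear
# combination of dictionary atoms shared across all layers.

rng = np.random.default_rng(0)

n_layers, d = 12, 64   # toy model: 12 layers, 64x64 projection matrices
n_atoms = 4            # shared dictionary size (n_atoms << n_layers)

# Shared dictionary: n_atoms full-size matrices reused by every layer.
atoms = rng.standard_normal((n_atoms, d, d))

# Per-layer parameters: only n_atoms mixing coefficients each.
coeffs = rng.standard_normal((n_layers, n_atoms))

def layer_weight(layer: int) -> np.ndarray:
    """Reconstruct one layer's projection as a weighted sum of shared atoms."""
    return np.einsum("m,mij->ij", coeffs[layer], atoms)

# Parameter count: independent per-layer weights vs. shared dictionary.
dense_params = n_layers * d * d                    # 12 * 64 * 64 = 49152
shared_params = n_atoms * d * d + n_layers * n_atoms  # 16384 + 48 = 16432

print(layer_weight(0).shape)  # (64, 64)
print(dense_params, shared_params)
```

With these toy numbers the shared parameterization stores roughly a third of the dense parameter count, which mirrors the scale of the reduction the paper reports for attention projections; in the actual method the coefficients and atoms are trained jointly with standard optimizers.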
Computer Science > Computation and Language, arXiv:2508.04581 (cs)
[Submitted on 6 Aug 2025 (v1), last revised 19 Feb 2026 (this version, v2)]
Authors: Magauiya Zhussip, Dmitriy Shopkhoev, Ammar Ali, Stamatios Lefkimmiatis
Abstract: Large language models have revolutionized AI applications, yet their high computational and memory demands hinder their widespread deployment. Existing compression techniques focus on intra-block optimizations (e.g., low-rank approximation or attention pruning), while the repetitive layered structure of transformers implies significant inter-block redundancy - a dimension largely unexplored beyond key-value (KV) caching. Inspired by dictionary learning in convolutional networks, we propose a framework for structured weight sharing across transformer layers. Our approach decomposes attention projection matrices (Q, K, V, O) into shared dictionary atoms, reducing the attention module's parameters by 66.7% while achieving on-par performance. Unlike complex methods requiring distillation or architectural changes, MASA (Matrix Atom Sharing in Attention) operates as a drop-in replacement - trained with standard optimizers - and represents each layer's weights as a linear combination...