[2508.04581] Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning

arXiv - Machine Learning · 4 min read

Summary

The paper presents MASA, a framework for structured weight sharing across transformer layers that cuts attention parameters by 66.7% while maintaining performance, improving the efficiency of large language models.

Why It Matters

As large language models grow in complexity, their deployment becomes challenging due to high computational demands. MASA offers a scalable solution for improving transformer efficiency, making advanced AI applications more accessible and sustainable.

Key Takeaways

  • MASA reduces transformer attention parameters by 66.7% without performance loss.
  • The method leverages matrix-based dictionary learning for structured weight sharing, as sketched in the code after this list.
  • Experiments show MASA outperforms existing compression techniques in benchmark accuracy.
  • The approach is a drop-in replacement, requiring no architectural changes.
  • When extended to Vision Transformers, MASA achieves similar efficiency gains in image classification.
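
As a rough sketch of the dictionary-learning takeaway above (this is not the authors' released code; the module name, atom count, and tensor shapes are all illustrative assumptions), each layer can rebuild its projection matrix as a linear combination of a small bank of atom matrices shared across every layer:

```python
import torch
import torch.nn as nn

class SharedAtomProjection(nn.Module):
    """Hypothetical sketch of matrix-based dictionary sharing: every layer
    rebuilds its projection matrix from one globally shared atom bank."""

    def __init__(self, num_atoms: int, num_layers: int, d_model: int):
        super().__init__()
        # One bank of atom matrices, shared by every layer.
        self.atoms = nn.Parameter(torch.randn(num_atoms, d_model, d_model) * 0.02)
        # Per-layer mixing coefficients are the only layer-specific weights.
        self.coeffs = nn.Parameter(torch.randn(num_layers, num_atoms) / num_atoms)

    def weight(self, layer_idx: int) -> torch.Tensor:
        # W_l = sum_k c[l, k] * atoms[k]: a linear combination of shared atoms.
        return torch.einsum("k,kij->ij", self.coeffs[layer_idx], self.atoms)

    def forward(self, x: torch.Tensor, layer_idx: int) -> torch.Tensor:
        # Project x with layer layer_idx's reconstructed weight matrix.
        return x @ self.weight(layer_idx).transpose(0, 1)
```

A full model would presumably hold one such bank per projection type (Q, K, V, O); the per-layer cost then shrinks to a few scalar coefficients, while each d_model × d_model atom is paid for only once.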

Computer Science > Computation and Language
arXiv:2508.04581 (cs)
[Submitted on 6 Aug 2025 (v1), last revised 19 Feb 2026 (this version, v2)]

Title: Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning
Authors: Magauiya Zhussip, Dmitriy Shopkhoev, Ammar Ali, Stamatios Lefkimmiatis

Abstract: Large language models have revolutionized AI applications, yet their high computational and memory demands hinder their widespread deployment. Existing compression techniques focus on intra-block optimizations (e.g., low-rank approximation or attention pruning), while the repetitive layered structure of transformers implies significant inter-block redundancy - a dimension largely unexplored beyond key-value (KV) caching. Inspired by dictionary learning in convolutional networks, we propose a framework for structured weight sharing across transformer layers. Our approach decomposes attention projection matrices (Q, K, V, O) into shared dictionary atoms, reducing the attention module's parameters by 66.7% while achieving on-par performance. Unlike complex methods requiring distillation or architectural changes, MASA (Matrix Atom Sharing in Attention) operates as a drop-in replacement - trained with standard optimizers - and represents each layer's weights as a linear combination...
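
To see where the headline 66.7% figure can come from, here is a back-of-the-envelope count under assumed sizes (the paper's actual dimensions and atom budget may differ): with one bank of num_layers/3 atoms per projection type, the attention parameters drop to roughly one third.

```python
d_model, num_layers, num_atoms = 1024, 24, 8  # illustrative sizes, not from the paper

# Baseline: four dense projection matrices (Q, K, V, O) in every layer.
baseline = 4 * num_layers * d_model**2

# Shared: one atom bank per projection type, plus per-layer mixing coefficients.
shared = 4 * (num_atoms * d_model**2 + num_layers * num_atoms)

print(f"parameter reduction: {1 - shared / baseline:.1%}")  # ~66.7%
```

The coefficients are negligible next to the atoms, so the reduction is essentially 1 - num_atoms/num_layers, which equals 2/3 whenever the bank holds a third as many atoms as there are layers.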

Related Articles

Llms

Anthropic Teams Up With Its Rivals to Keep AI From Hacking Everything | WIRED

The AI lab's Project Glasswing will bring together Apple, Google, and more than 45 other organizations. They'll use the new Claude Mythos...

Wired - AI · 7 min

Llms

The public needs to control AI-run infrastructure, labor, education, and governance— NOT private actors

A lot of discussion around AI is becoming siloed, and I think that is dangerous. People in AI-focused spaces often talk as if the only qu...

Reddit - Artificial Intelligence · 1 min

Llms

Agents that write their own code at runtime and vote on capabilities, no human in the loop

hollowOS just hit v4.4 and I added something that I haven’t seen anyone else do. Previous versions gave you an OS for agents: structured ...

Reddit - Artificial Intelligence · 1 min

Llms

Google Maps can now write captions for your photos using AI | TechCrunch

Gemini can now create captions when users are looking to share a photo or video.

TechCrunch - AI · 4 min

