[2509.10406] Multipole Semantic Attention: A Fast Approximation of Softmax Attention for Pretraining

arXiv - Machine Learning

Summary

The paper introduces Multipole Semantic Attention (MuSe), a drop-in approximation of softmax attention that accelerates transformer pretraining on long sequences by 36% at 64k context while matching baseline loss.

Why It Matters

As transformer models become increasingly complex, optimizing their training processes is crucial for efficiency and scalability. MuSe addresses the bottleneck of quadratic attention costs, making it relevant for researchers and practitioners focused on improving model performance and resource utilization.

Key Takeaways

  • MuSe accelerates 64k-context pretraining by 36% while maintaining baseline loss.
  • The method clusters queries and keys in representation space for improved efficiency.
  • MuSe is compatible with existing pretrained models, requiring no architectural changes.
  • The approach has been validated on Llama 3.1-8B and Llama 3.2-1B without retraining.
  • Pretraining with MuSe preserves quality and long-context utilization.

Computer Science > Machine Learning

arXiv:2509.10406 (cs) [Submitted on 12 Sep 2025 (v1), last revised 13 Feb 2026 (this version, v3)]

Title: Multipole Semantic Attention: A Fast Approximation of Softmax Attention for Pretraining
Authors: Rupert Mitchell, Kristian Kersting

Abstract: Pretraining transformers on long sequences (entire code repositories, collections of related documents) is bottlenecked by quadratic attention costs. We present Multipole Semantic Attention (MuSe), which accelerates 64k-context pretraining by 36% while matching baseline loss, requiring no architectural changes. MuSe clusters queries and keys separately in representation space. This yields query-specific summaries that substantially outperform spatial blocking at matched sparsity, while also enabling drop-in compatibility with existing pretrained models; we validate on Llama 3.1-8B and 3.2-1B without retraining. We pretrain language models up to 1B parameters at 64k context on code and scientific documents, confirming that MuSe preserves quality and long-context utilization during training.

Subjects: Machine Learning (cs.LG)
MSC classes: 68W25, 68T50 (primary); 68W40, 68T07 (secondary)
ACM classes: I.2.6; I.2.7
Cite as: arXiv:2509.10406 [cs.LG] (or arXiv:2509.10406v3 [cs.LG] for this version)
https://doi.org/...
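The core idea of clustered attention can be sketched in a few lines: instead of each query attending to every key, keys are grouped by k-means in representation space, and queries attend to cluster centroids whose logits are reweighted by cluster size. The sketch below is a minimal "monopole"-style illustration of this general idea under our own simplifying assumptions; it is not the paper's actual algorithm (MuSe additionally clusters queries, and the multipole framing suggests higher-order corrections beyond plain centroid summaries).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def exact_attention(Q, K, V):
    """Standard O(n^2) softmax attention, for comparison."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns centroids and point labels."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(0)
    return centroids, labels

def clustered_attention(Q, K, V, k=8):
    """Monopole-style approximation: each key cluster is replaced by its
    centroid, and each cluster's values by their mean. The log-sum-exp over
    a cluster of near-identical keys equals the centroid logit plus
    log(cluster size), hence the count reweighting."""
    centroids, labels = kmeans(K, k)
    counts = np.bincount(labels, minlength=k)
    Vc = np.zeros((k, V.shape[-1]))
    for j in range(k):
        if counts[j]:
            Vc[j] = V[labels == j].mean(0)
    logits = Q @ centroids.T / np.sqrt(Q.shape[-1])
    # Reweight by cluster size; mask out empty clusters entirely.
    logits = np.where(counts > 0,
                      logits + np.log(np.maximum(counts, 1)),
                      -np.inf)
    return softmax(logits) @ Vc
```

With k clusters the cost drops from O(n^2 d) to O(nkd) per attention map; in the degenerate limit k = n the approximation recovers exact attention, and for tightly clustered keys the centroid summary is nearly lossless.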
