[2604.03260] Why Attend to Everything? Focus is the Key


arXiv - AI 4 min read

About this article

Abstract page for arXiv paper 2604.03260: Why Attend to Everything? Focus is the Key

Computer Science > Computation and Language

arXiv:2604.03260 (cs) [Submitted on 12 Mar 2026]

Title: Why Attend to Everything? Focus is the Key

Authors: Hengshuai Yao, Xing Chen, Ahmed Murtadha, Jin Li, Shuai Shao, Yasin Abbasi Yadkori, Guan Wang, Mingli Yuan, William Chen, Sen Song

Abstract: We introduce Focus, a method that learns which token pairs matter rather than approximating all of them. Learnable centroids assign tokens to groups; distant attention is restricted to same-group pairs, while local attention operates at full resolution. Because all model weights stay frozen, Focus is purely additive: centroid-only training (as few as 148K parameters) improves domain perplexity with zero degradation on downstream benchmarks, from 124M to 70B parameters, across five attention architectures. No existing efficient attention method achieves this in the retrofit setting. At 124M, Focus surpasses full attention (30.3 vs 31.4 PPL); trained from scratch at 7B scale (2B tokens), Focus again beats full attention (13.82 vs 13.89 PPL). At inference, restricting each token to its top-k highest-scoring groups discretizes the soft routing into a hard sparsity pattern, yielding a 2x speedup while beating the pretrained baseline (41.3 vs 42.8 PPL); decomposing this pattern into two standard FlashAttention calls reaches 8.6x wall-clock speedup at 1M token...
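The group-restricted attention pattern the abstract describes can be illustrated with a small sketch. This is an assumption-laden toy in NumPy, not the paper's implementation: the function name `focus_attention_mask`, the hard nearest-centroid assignment, and the sliding local window are all simplifications chosen for clarity (the paper's routing is soft and learned).

```python
import numpy as np

def focus_attention_mask(x, centroids, local_window=4):
    """Toy sketch of a Focus-style allowed-pairs mask (not the paper's code).

    x:         (T, d) token embeddings
    centroids: (G, d) group centers (learnable in the real method)
    Returns a boolean (T, T) mask: True where attention is permitted.
    """
    # Hard-assign each token to its nearest centroid for illustration.
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)  # (T, G)
    groups = dists.argmin(axis=1)                                       # (T,)

    T = x.shape[0]
    idx = np.arange(T)
    # Local attention: full resolution inside a sliding window.
    local = np.abs(idx[:, None] - idx[None, :]) < local_window
    # Distant attention: restricted to same-group token pairs.
    same_group = groups[:, None] == groups[None, :]
    return local | same_group

rng = np.random.default_rng(0)
mask = focus_attention_mask(rng.normal(size=(16, 8)), rng.normal(size=(4, 8)))
```

Taking only each token's top-k highest-scoring groups, as the abstract describes for inference, would further sparsify `same_group` into a hard routing pattern.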

Originally published on April 07, 2026. Curated by AI News.

Related Articles

Machine Learning

Google signs deal with Pentagon, allowing 'any lawful' use of AI models

I feel li...

Reddit - Artificial Intelligence · 1 min ·
Llms

Karpathy dropped a 200-line GPT, so I used the math to turn pandas DataFrames into searchable context windows and open sourced it (and automated my stats pipeline). [P]

TL;DR: I got tired of manually running Shapiro-Wilk tests and copy-pasting p-values at 2 AM. I built an open-source, async Python pipelin...

Reddit - Machine Learning · 1 min ·
Machine Learning

Google and Pentagon reportedly agree deal for ‘any lawful’ use of AI | The Verge

Google has signed a classified deal that allows the US Department of Defense to use its AI models for “any lawful government purpose.”

The Verge - AI · 4 min ·
Machine Learning

Fresher in AI/ML looking for entry-level opportunities

submitted by /u/SlowButAqurate

Reddit - ML Jobs · 1 min ·

