[2602.18196] RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference

arXiv - Machine Learning · 3 min read

Summary

The paper introduces RAT+, an architecture that augments attention with full-sequence recurrence so that a model pretrained with dense attention can be switched to sparse (dilated) attention at inference time, preserving accuracy while improving efficiency.
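
As a rough illustration of the train-dense, infer-sparse idea, the PyTorch sketch below builds a causal attention mask that is fully dense for pretraining and dilated, with an optional local window, for inference. The function names and the exact masking rule are assumptions for illustration only, not the paper's implementation.

    import torch

    def causal_mask(seq_len, dilation=1, local_window=0, device=None):
        """Boolean (seq_len, seq_len) mask: True where query i may attend to key j.

        dilation=1 gives ordinary dense causal attention (the pretraining setting);
        dilation=D keeps only keys whose distance from the query is a multiple of D,
        optionally plus a local window of recent tokens (the sparse inference setting).
        """
        i = torch.arange(seq_len, device=device)[:, None]  # query positions
        j = torch.arange(seq_len, device=device)[None, :]  # key positions
        causal = j <= i
        dilated = (i - j) % dilation == 0
        local = (i - j) < local_window
        return causal & (dilated | local)

    def attend(q, k, v, mask):
        """Plain masked scaled-dot-product attention over (batch, heads, seq, dim)."""
        scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
        scores = scores.masked_fill(~mask, float("-inf"))
        return torch.softmax(scores, dim=-1) @ v

    # Pretrain with the dense mask ...
    q = k = v = torch.randn(1, 4, 128, 64)
    dense_out = attend(q, k, v, causal_mask(128))

    # ... then reuse the same weights with a dilated mask plus a small local
    # window at inference, attending to roughly 1/16 of the past positions.
    sparse_out = attend(q, k, v, causal_mask(128, dilation=16, local_window=32))

The same weights are used in both calls; only the mask changes, which is what would let a single densely pretrained model be served at different sparsity levels.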

Why It Matters

RAT+ addresses the severe accuracy degradation that occurs when a pretrained dense-attention model is sparsified to a dilated pattern. Because dilated attention cuts attention FLOPs and KV-cache size by the dilation factor, this matters for applications that need efficient long-context inference, and RAT+ keeps accuracy on commonsense reasoning and LongBench tasks close to the dense baseline.

Key Takeaways

  • RAT+ combines dense pretraining with sparse inference for improved efficiency.
  • The model maintains accuracy close to dense models while reducing computational costs.
  • RAT+ can be switched to different attention patterns at inference with only a short (about 1B-token) adaptation instead of retraining separate sparse models.

Computer Science > Machine Learning
arXiv:2602.18196 (cs) [Submitted on 20 Feb 2026]
Title: RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference
Authors: Xiuying Wei, Caglar Gulcehre

Abstract: Structured dilated attention has an appealing inference-time efficiency knob: it reduces the FLOPs of attention and the KV cache size by a factor of the dilation size D, while preserving long-range connectivity. However, we find a persistent failure mode: sparsifying a pretrained attention model to a dilated pattern leads to severe accuracy degradation. We introduce RAT+, a dense-pretraining architecture that augments attention with full-sequence recurrence and active recurrence learning. A single RAT+ model is pretrained densely once, then flexibly switched at inference time to dilated attention (optionally with local windows) or hybrid layer/head compositions, requiring only a short 1B-token resolution adaptation rather than retraining separate sparse models. At 1.5B parameters trained on 100B tokens, RAT+ closely matches dense accuracy at a dilation of 16 and drops by about 2-3 points at 64 on commonsense reasoning and LongBench tasks, respectively. Moreover, RAT+ outperforms attention when sparsifying to top-k block attention. We further scale to 2.6B parameters a...
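
To make the abstract's "efficiency knob" concrete: at dilation D, an inference step only needs every D-th past key/value (optionally plus a recent local window), so the KV cache shrinks by roughly a factor of D. The sketch below prunes a cached key/value tensor this way; the function name and selection rule are illustrative assumptions and may differ from how RAT+ actually manages its cache.

    import torch

    def dilate_kv_cache(keys, values, dilation=16, local_window=64):
        """Keep every `dilation`-th cached position plus a recent local window,
        shrinking the KV cache (and the attention FLOPs over it) by roughly a
        factor of `dilation`. Illustrative only; RAT+'s cache layout may differ.

        keys, values: (batch, heads, seq_len, head_dim)
        """
        seq_len = keys.shape[2]
        pos = torch.arange(seq_len, device=keys.device)
        keep = (pos % dilation == 0) | (pos >= seq_len - local_window)
        idx = pos[keep]
        return keys[:, :, idx], values[:, :, idx]

    k = v = torch.randn(1, 16, 4096, 64)
    k_sparse, v_sparse = dilate_kv_cache(k, v)
    print(k.shape[2], "->", k_sparse.shape[2])  # 4096 -> 316 kept positions

With dilation 16 and a 64-token local window, a 4096-token cache keeps 316 positions, about a 13x reduction, approaching the factor-of-16 saving the abstract describes for D = 16.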
