[2602.18196] RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference
Summary
The paper introduces RAT+, an architecture that augments attention with full-sequence recurrence so that a single densely pretrained model can be switched to sparse (dilated) attention at inference time, retaining accuracy while improving efficiency.
Why It Matters
Sparsifying a pretrained attention model to a dilated pattern typically causes severe accuracy degradation. RAT+ addresses this failure mode, preserving near-dense accuracy on commonsense reasoning and long-context benchmarks while cutting attention FLOPs and KV cache size, which matters for applications that must process long sequences efficiently.
Key Takeaways
- RAT+ combines dense pretraining with sparse inference for improved efficiency.
- The model maintains accuracy close to dense models while reducing computational costs.
- RAT+ can switch to different attention patterns without retraining separate sparse models; only a short 1B-token adaptation is needed.
Computer Science > Machine Learning
arXiv:2602.18196 (cs) [Submitted on 20 Feb 2026]
Title: RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference
Authors: Xiuying Wei, Caglar Gulcehre
Abstract: Structured dilated attention has an appealing inference-time efficiency knob: it reduces the FLOPs of attention and the KV cache size by a factor of the dilation size D, while preserving long-range connectivity. However, we find a persistent failure mode of these patterns -- sparsifying a pretrained attention model to a dilated pattern leads to severe accuracy degradation. We introduce RAT+, a dense-pretraining architecture that augments attention with full-sequence recurrence and active recurrence learning. A single RAT+ model is pretrained densely once, then flexibly switched at inference time to dilated attention (optionally with local windows) or hybrid layer/head compositions, requiring only a short 1B-token resolution adaptation rather than retraining separate sparse models. At 1.5B parameters trained on 100B tokens, RAT+ closely matches dense accuracy at dilation D=16 and drops by about 2-3 points at D=64 on commonsense reasoning and LongBench tasks, respectively. Moreover, RAT+ outperforms attention when sparsifying to top-k block attention. We further scale to 2.6B parameters a...
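To make the efficiency knob concrete, here is a minimal single-head sketch of dilated attention (an illustration of the general pattern, not the paper's implementation): each query at position t attends only to earlier positions sharing its phase t mod D, so the number of attended keys, and the KV entries that must be kept for them, shrinks by roughly a factor of D. Names and shapes here are hypothetical.

```python
import numpy as np

def dilated_attention(q, k, v, D=4):
    """Toy causal dilated attention for one head.

    q, k, v: arrays of shape (T, d). Query t attends to positions
    {t mod D, t mod D + D, ..., t}, i.e. ~T/D keys instead of T.
    """
    T, d = q.shape
    out = np.zeros_like(v)
    for t in range(T):
        idx = np.arange(t % D, t + 1, D)          # causal, same-phase positions
        scores = q[t] @ k[idx].T / np.sqrt(d)     # scaled dot-product scores
        w = np.exp(scores - scores.max())         # stable softmax
        w /= w.sum()
        out[t] = w @ v[idx]
    return out

# With T=8 and D=4, position 7 attends only to positions {3, 7};
# position 0 attends only to itself, so its output equals v[0].
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 16))
k = rng.standard_normal((8, 16))
v = rng.standard_normal((8, 16))
y = dilated_attention(q, k, v, D=4)
```

Dense pretraining followed by switching to such a pattern is exactly the regime where the paper reports accuracy collapse for plain attention, and where RAT+'s added recurrence preserves the connectivity that the dropped positions would otherwise provide.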