[2505.22842] Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation
Summary
The paper introduces the Bayesian Attention Mechanism (BAM), a probabilistic framework for positional encoding in transformer models that improves context length extrapolation and long-context generalization.
Why It Matters
As transformer models become increasingly prevalent in natural language processing, understanding and improving positional encoding is crucial. BAM offers a theoretical foundation that enhances the performance of these models, particularly in tasks requiring long-context understanding, which is vital for applications in AI and machine learning.
Key Takeaways
- BAM formulates positional encoding as a probabilistic prior, enhancing theoretical clarity.
- The framework unifies existing positional encoding methods and introduces a new Generalized Gaussian positional prior.
- BAM significantly improves long-context generalization, enabling effective information retrieval at 500 times the training context length.
- The approach maintains comparable perplexity while adding minimal parameters.
- BAM sets a new state-of-the-art in long-context retrieval accuracy.
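The core idea of treating positional encoding as a log-prior added to attention logits can be sketched as follows. This is a hedged illustration, not the paper's implementation: the function names, the parameterization `-(|i - j| / alpha)**beta` for the Generalized Gaussian log-prior, and the default values of `alpha` and `beta` are assumptions made for this example. Setting `beta = 1` yields an ALiBi-like linear distance penalty, and letting `alpha` grow large approaches a flat, NoPE-like prior.

```python
import numpy as np

def generalized_gaussian_bias(seq_len, alpha=8.0, beta=2.0):
    """Additive attention bias from a Generalized Gaussian positional
    log-prior, log p(d) proportional to -|d / alpha|**beta, where
    d = i - j is the query-key distance. (Parameterization is an
    assumption for illustration, not the paper's exact form.)"""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    d = np.abs(i - j)
    return -((d / alpha) ** beta)

def attention_with_prior(q, k, v, bias):
    """Scaled dot-product attention with the positional log-prior
    added to the logits before the softmax."""
    scores = q @ k.T / np.sqrt(q.shape[-1]) + bias
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
L, D = 6, 4
q, k, v = (rng.standard_normal((L, D)) for _ in range(3))
bias = generalized_gaussian_bias(L, alpha=4.0, beta=2.0)
out = attention_with_prior(q, k, v, bias)
print(out.shape)  # (6, 4)
```

Because the bias depends only on token distance rather than on learned per-position embeddings, the same formula extends to sequences longer than any seen in training, which is the mechanism behind the extrapolation claims above.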
Computer Science > Computation and Language
arXiv:2505.22842 (cs)
[Submitted on 28 May 2025 (v1), last revised 23 Feb 2026 (this version, v3)]
Authors: Arthur S. Bianchessi, Yasmin C. Aguirre, Rodrigo C. Barros, Lucas S. Kupssinskü
Abstract: Transformer-based language models rely on positional encoding (PE) to handle token order and support context length extrapolation. However, existing PE methods lack theoretical clarity and rely on limited evaluation metrics to substantiate their extrapolation claims. We propose the Bayesian Attention Mechanism (BAM), a theoretical framework that formulates positional encoding as a prior within a probabilistic model. BAM unifies existing methods (e.g., NoPE and ALiBi) and motivates a new Generalized Gaussian positional prior that substantially improves long-context generalization. Empirically, BAM enables accurate information retrieval at $500\times$ the training context length, outperforming previous state-of-the-art context length generalization in long context retrieval accuracy while maintaining comparable perplexity and introducing minimal additional parameters.