[2602.16092] Why Any-Order Autoregressive Models Need Two-Stream Attention: A Structural-Semantic Tradeoff
Summary
The paper argues that two-stream attention in any-order autoregressive models does more than decouple token content from position: it resolves a structural-semantic tradeoff in which prediction and summarization objectives compete for attention capacity within a single stream.
Why It Matters
Understanding the structural-semantic tradeoff clarifies when and why any-order models need a second attention stream. That insight can guide the design of attention mechanisms for efficient generative models, including masked diffusion approaches to language modeling.
Key Takeaways
- Two-stream attention lets semantic prediction and structural summarization specialize across separate streams instead of competing for attention capacity in a single one.
- Decoupled RoPE modifies rotary position embeddings to provide target position information without revealing target content; it performs competitively at short sequence lengths but degrades as sequences grow.
- The findings suggest that two-stream attention does more than separate position from content: it addresses a deeper structural-semantic tradeoff in any-order generation.
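The takeaways above hinge on how the two streams split the work. Below is a minimal single-head NumPy sketch of the masking pattern, assuming the standard two-stream setup: a content stream that may summarize its own token, and a query stream that must predict it without seeing it. All weights and names here are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8  # sequence length, model dimension

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, k, v, mask):
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)  # block masked positions
    return softmax(scores) @ v

# Shared single-head projections (illustrative; real models use per-layer weights).
Wq, Wk, Wv = (0.1 * rng.standard_normal((d, d)) for _ in range(3))

h = rng.standard_normal((T, d))  # content stream: token content (+ position)
g = rng.standard_normal((T, d))  # query stream: target position info, no content

order = np.array([2, 0, 4, 1, 3])  # an arbitrary generation order
rank = np.empty(T, dtype=int)
rank[order] = np.arange(T)         # rank[i] = step at which token i is generated

mask_h = rank[None, :] <= rank[:, None]  # summarize: may attend to own content
mask_g = rank[None, :] < rank[:, None]   # predict: must NOT see own content
# (the first step in `order` has an empty context row; real implementations
#  give the query stream a learned start vector for that step)

h_out = attend(h @ Wq, h @ Wk, h @ Wv, mask_h)
g_out = attend(g @ Wq, h @ Wk, h @ Wv, mask_g)  # queries from g, context from h
```

The key property is visible in the two masks: `g_out[i]` is invariant to token `i`'s own content, so it can safely drive the prediction at that position, while `h_out[i]` includes it for summarization.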
Paper Details
arXiv:2602.16092 (Computer Science > Machine Learning), submitted 17 Feb 2026
Authors: Patrick Pynadath, Ruqi Zhang

Abstract
Any-order autoregressive models (AO-ARMs) offer a promising path toward efficient masked diffusion by enabling native key-value caching, but competitive performance has so far required two-stream attention, typically motivated as a means of decoupling token content from position. In this work, we argue that two-stream attention may be serving a more subtle role. We identify a structural-semantic tradeoff in any-order generation: the hidden representation at each step must simultaneously attend to semantically informative tokens for prediction and structurally recent tokens for summarization, objectives that compete for attention capacity in a single stream but can specialize across two streams. To isolate this tradeoff from position-content separation, we propose Decoupled RoPE, a modification to rotary position embeddings that provides target position information without revealing target content. Decoupled RoPE performs competitively at short sequence lengths--where semantic and structural proximity coincide--but degrades as sequence length increases and...
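The abstract describes Decoupled RoPE only at a high level, so the sketch below is one plausible instantiation rather than the paper's construction: standard rotary embeddings, with the target's query built by rotating a shared, content-free vector (the `mask_embedding` name is hypothetical) to the target position, so the query encodes where to predict without revealing what is there.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate vector x (even dim) to position pos with rotary embeddings."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)  # one frequency per 2-D pair
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(1)
d = 8
mask_embedding = rng.standard_normal(d)     # hypothetical learned, content-free seed
q_target = rope(mask_embedding, pos=7)      # "predict position 7", no target content
k_context = rope(rng.standard_normal(d), pos=3)  # key of a context token at pos 3
score = q_target @ k_context                # depends only on the offset 7 - 3
```

Because each coordinate pair undergoes an orthogonal rotation, attention scores depend only on relative position, which is how such a query can track structural recency; per the abstract, this still degrades at longer lengths, where semantic and structural proximity diverge.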