[2510.02228] xLSTM Scaling Laws: Competitive Performance with Linear Time-Complexity
Summary
The paper derives scaling laws for xLSTM, showing that it matches Transformer performance while scaling linearly (rather than quadratically) with context length, and draws out implications for future model design.
Why It Matters
As large language models (LLMs) dominate AI applications, understanding the scaling behavior of alternative architectures like xLSTM is crucial for optimizing performance and resource allocation. This research quantifies model efficiency across compute budgets and context lengths, informing architecture and compute-allocation decisions in AI development.
Key Takeaways
- xLSTM models scale linearly with context length, making them efficient for long contexts.
- At matched compute budgets, xLSTM consistently reaches lower cross-entropy loss than comparable Transformers.
- The research highlights the importance of context length in determining optimal model sizes, an area often overlooked in prior studies.
- xLSTM shows favorable scaling characteristics during both training and inference phases.
- Findings suggest that xLSTM could be a viable alternative to Transformers in LLM applications.
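The linear-versus-quadratic distinction behind these takeaways can be made concrete with a back-of-the-envelope FLOP count. The formulas and constant factors below are illustrative assumptions, not numbers from the paper:

```python
# Illustrative (not from the paper): approximate per-sequence FLOP counts
# for self-attention vs. a linear-time recurrent layer as context grows.

def attention_flops(seq_len: int, d_model: int) -> int:
    # QK^T scores and the attention-weighted sum over V each cost
    # on the order of seq_len^2 * d_model FLOPs -> quadratic in context.
    return 2 * seq_len**2 * d_model

def linear_recurrence_flops(seq_len: int, d_model: int) -> int:
    # A recurrent layer (e.g. an xLSTM-style gated update) does a
    # constant amount of work per token; the factor 8 is a placeholder.
    return seq_len * d_model * 8

d = 1024
for T in (1_024, 8_192, 65_536):
    ratio = attention_flops(T, d) / linear_recurrence_flops(T, d)
    print(f"context {T:>6}: attention / linear ~ {ratio:,.0f}x")
```

With these toy formulas the ratio grows as T/4, so the advantage of a linear-time layer widens roughly in proportion to context length, which is why the paper's context-length analysis matters.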
Computer Science > Machine Learning
arXiv:2510.02228 (cs)
Submitted on 2 Oct 2025 (v1), last revised 20 Feb 2026 (this version, v2)
Title: xLSTM Scaling Laws: Competitive Performance with Linear Time-Complexity
Authors: Maximilian Beck, Kajetan Schweighofer, Sebastian Böck, Sebastian Lehner, Sepp Hochreiter
Abstract: Scaling laws play a central role in the success of Large Language Models (LLMs), enabling the prediction of model performance relative to compute budgets prior to training. While Transformers have been the dominant architecture, recent alternatives such as xLSTM offer linear complexity with respect to context length while remaining competitive in the billion-parameter regime. We conduct a comparative investigation on the scaling behavior of Transformers and xLSTM along the following lines, providing insights to guide future model design and deployment. First, we study the scaling behavior for xLSTM in compute-optimal and over-training regimes using both IsoFLOP and parametric fit approaches on a wide range of model sizes (80M-7B) and number of training tokens (2B-2T). Second, we examine the dependence of optimal model sizes on context length, a pivotal aspect that was largely ignored in previous work. Finally, we analyze inference-time scaling characteristics. Our findings reveal that in typi...
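The IsoFLOP approach mentioned in the abstract can be sketched in a few lines: hold the compute budget fixed, sweep model size, and fit a parabola in log model size to locate the compute-optimal point. All numbers below are synthetic placeholders (a Chinchilla-style loss form with made-up constants), not the paper's data:

```python
# A minimal IsoFLOP-style sketch with hypothetical numbers, not the
# paper's measurements. At a fixed compute budget C ~ 6*N*D, sweep the
# model size N, record loss, and fit a parabola in log N to estimate
# the compute-optimal model size.
import numpy as np

C = 1e20                                    # fixed FLOP budget (assumed)
N = np.array([1e8, 2e8, 4e8, 8e8, 1.6e9])   # model sizes in parameters
D = C / (6 * N)                             # training tokens implied by C ~ 6*N*D

# Synthetic losses from a Chinchilla-style form L = E + A/N^a + B/D^b
# (E, A, B, a, b are illustrative constants, not fitted to real runs).
loss = 1.7 + 4e3 / N**0.34 + 4e3 / D**0.28

a, b, c = np.polyfit(np.log(N), loss, 2)    # parabola a*x^2 + b*x + c in x = ln N
N_opt = np.exp(-b / (2 * a))                # vertex = estimated optimum
print(f"estimated compute-optimal N ~ {N_opt:.2e} parameters")
```

Repeating this sweep across several budgets C traces out how the optimal model size scales with compute; the parametric-fit approach instead fits the loss form L(N, D) directly to all runs at once.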