[2508.19228] Predicting the Order of Upcoming Tokens Improves Language Modeling
Summary
The paper introduces token order prediction (TOP), an auxiliary training objective that teaches language models to rank upcoming tokens by their proximity, and shows that it improves on next-token prediction (NTP) and multi-token prediction (MTP) across nine standard NLP benchmarks at model scales from 340M to 7B parameters.
Why It Matters
This research matters because auxiliary objectives such as MTP have shown inconsistent gains: the authors argue that exactly predicting future tokens is too difficult an auxiliary task. By replacing it with the easier task of ranking upcoming tokens by proximity, TOP delivers consistent benchmark improvements at little extra cost, which could make language model pretraining more effective without larger architectures.
Key Takeaways
- Models pretrained with the TOP auxiliary objective outperform models trained with next-token prediction (NTP) alone, as well as MTP and DeepSeek MTP (DS-MTP).
- TOP adds only a single extra unembedding layer, whereas MTP requires multiple additional transformer layers, making TOP cheaper to train.
- The advantage holds across nine standard NLP benchmarks and across model scales of 340M, 1.8B, and 7B parameters.
- Continued training on math and code data further improves performance on four relevant benchmarks.
- On the synthetic star graph task, TOP enables pathfinding where NTP, MTP, and DS-MTP fail.
- The code is publicly released, enabling replication and further exploration.
Computer Science > Machine Learning
arXiv:2508.19228 (cs)
[Submitted on 26 Aug 2025 (v1), last revised 16 Feb 2026 (this version, v2)]
Title: Predicting the Order of Upcoming Tokens Improves Language Modeling
Authors: Zayd M. K. Zuhri, Erland Hilman Fuadi, Alham Fikri Aji
Abstract: Multi-token prediction (MTP) has been proposed as an auxiliary objective to improve next-token prediction (NTP) in language model training but shows inconsistent improvements, underperforming in standard NLP benchmarks. We found MTP's exact future token prediction to be too difficult as an auxiliary loss. Instead, we propose token order prediction (TOP), which trains models to order upcoming tokens by their proximity using a learning-to-rank loss. TOP requires only a single additional unembedding layer compared to MTP's multiple transformer layers. We pretrain models of 340M, 1.8B, and 7B parameters using NTP, MTP, DeepSeek MTP (DS-MTP) and TOP objectives. The results of nine standard NLP benchmarks show that TOP overall outperforms NTP, MTP, and DS-MTP even at scale. TOP models with continued training on math and code also perform better on 4 relevant benchmarks. On the synthetic star graph task, TOP enables pathfinding on graphs where NTP, MTP, and DS-MTP fail. Our code is available at this https URL
Subjects: Machine Learning (cs....
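The learning-to-rank idea in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the target construction (each token's closest occurrence within a small lookahead window gets the highest proximity score) and the ListNet-style softmax cross-entropy loss are assumptions, and the names `top_targets`, `listnet_loss`, and `window` are hypothetical.

```python
import numpy as np

def top_targets(tokens, t, vocab_size, window):
    """Build a proximity score vector for position t: a token that
    appears sooner in the next `window` positions gets a higher score;
    tokens that never appear score 0. If a token repeats, its closest
    occurrence wins. (Hypothetical target scheme for illustration.)"""
    target = np.zeros(vocab_size)
    for i, tok in enumerate(tokens[t + 1 : t + 1 + window]):
        target[tok] = max(target[tok], float(window - i))
    return target

def listnet_loss(scores, target):
    """ListNet-style listwise ranking loss: cross-entropy between the
    softmax of the target proximities and the softmax of the model's
    scores (in TOP, logits from the single extra unembedding layer)."""
    p = np.exp(target - target.max())
    p /= p.sum()
    log_q = scores - scores.max()
    log_q = log_q - np.log(np.exp(log_q).sum())
    return -(p * log_q).sum()

# Toy example: sequence [7, 3, 5, 3], ranking upcoming tokens at
# position 0 over a vocabulary of 8 tokens with a lookahead of 3.
seq = [7, 3, 5, 3]
target = top_targets(seq, t=0, vocab_size=8, window=3)
# token 3 first appears at offset 1 -> score 3; token 5 at offset 2 -> score 2
```

A model whose scores rank tokens in the same order as the targets incurs a lower loss than one that ranks them in reverse, which is the gradient signal the auxiliary head would provide during pretraining.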