[2502.17421] LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification
Computer Science > Computation and Language
arXiv:2502.17421 (cs)
[Submitted on 24 Feb 2025 (v1), last revised 7 Apr 2026 (this version, v3)]

Title: LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification
Authors: Penghui Yang, Cunxiao Du, Fengzhuo Zhang, Haonan Wang, Tianyu Pang, Chao Du, Bo An

Abstract: As Large Language Models (LLMs) can now process extremely long contexts, efficient inference over these extended inputs has become increasingly important, especially for emerging applications such as LLM agents that depend heavily on this capability. Speculative decoding (SD) offers promising lossless acceleration, in contrast to lossy alternatives such as quantization and model cascades. However, most state-of-the-art SD methods are trained on short texts (typically fewer than 4k tokens), making them unsuitable for long-context scenarios. Specifically, adapting these methods to long contexts presents three key challenges: (1) the excessive memory demands posed by draft models due to large Key-Value (KV) caches; (2) performance degradation resulting from the mismatch between short-context training and long-context inference; and (3) inefficiencies in tree attention mechanisms when managing long token sequences. This work introduces LongSpec, a framework...
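
For context on the technique the abstract builds on, the following is a minimal Python sketch of the generic draft-then-verify loop underlying speculative decoding. This is an illustrative assumption, not LongSpec's actual method: the greedy acceptance rule, the draft_next/target_next callables, and all parameter names are hypothetical stand-ins for real draft and target models.

from typing import Callable, List

def speculative_decode(
    prefix: List[int],
    draft_next: Callable[[List[int]], int],   # cheap draft model, one greedy token
    target_next: Callable[[List[int]], int],  # expensive target model, one greedy token
    num_draft: int = 4,                       # tokens drafted per verification step
    max_new: int = 32,
) -> List[int]:
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        # 1) Draft: the cheap model proposes num_draft tokens autoregressively.
        drafts = []
        ctx = list(out)
        for _ in range(num_draft):
            t = draft_next(ctx)
            drafts.append(t)
            ctx.append(t)
        # 2) Verify: the target checks the drafted tokens; real systems score
        #    all draft positions in one batched forward pass, shown sequentially here.
        for t in drafts:
            expected = target_next(out)
            if expected == t:
                out.append(t)               # draft token accepted (output is lossless)
            else:
                out.append(expected)        # first mismatch: take the target's token, discard the rest
                break
        else:
            out.append(target_next(out))    # all drafts accepted: target adds one bonus token
    return out[: len(prefix) + max_new]

if __name__ == "__main__":
    # Toy integer-token models: the target repeats a fixed pattern; the draft
    # agrees most of the time, so most drafted tokens are accepted.
    pattern = [1, 2, 3, 4]
    target = lambda ctx: pattern[len(ctx) % 4]
    draft = lambda ctx: pattern[len(ctx) % 4] if len(ctx) % 7 else 0
    print(speculative_decode([1, 2], draft, target, max_new=10))

Because every emitted token is either confirmed or produced by the target model, the output matches what the target alone would generate, which is what makes SD lossless; the speedup comes from verifying several drafted tokens per target pass. The three challenges listed in the abstract concern scaling this loop to long contexts: the draft model's KV cache in step 1, the train/inference length mismatch, and the tree-attention variant of step 2.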