[2510.22876] Batch Speculative Decoding Done Right
Summary
The paper presents a framework for batch speculative decoding that fixes correctness failures in existing implementations, achieving significant throughput improvements while preserving output equivalence with standard autoregressive generation.
Why It Matters
This research matters for serving large language models efficiently: speculative decoding only accelerates generation legitimately if its outputs match standard autoregressive decoding. By resolving the synchronization issues that corrupt outputs in batched settings, the work makes batch speculative decoding both fast and correct, which is essential for production inference systems.
Key Takeaways
- Existing batch speculative decoding methods fail to maintain output equivalence, leading to corrupted outputs.
- The authors introduce EQSPEC, the first algorithm that guarantees output equivalence in batch speculative decoding, and show that the alignment overhead it must pay grows superlinearly, consuming up to 40% of computation.
- A companion technique, EXSPEC, reduces this overhead through dynamic cross-batch scheduling that groups compatible sequences, improving throughput by up to 3x.
- The proposed methods achieve 95% empirical output equivalence; the remaining mismatches are attributable to floating-point non-determinism rather than algorithmic error.
- The research provides a significant step forward in optimizing AI decoding processes, essential for real-time applications.
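The ragged tensor problem behind these takeaways can be illustrated with a minimal sketch (hypothetical code, not the paper's EQSPEC implementation): after a verification step, each sequence in a batch accepts a different number of draft tokens, so each row's KV-cache length and next position ID must be advanced independently to stay synchronized.

```python
def realign_batch(kv_lens, position_ids, accepted):
    """Advance each sequence's KV-cache length and next position ID by the
    number of draft tokens it accepted, plus one token emitted by the
    verifier (the correction or bonus token)."""
    new_kv_lens, new_pos = [], []
    for kv, pos, acc in zip(kv_lens, position_ids, accepted):
        advanced = acc + 1  # accepted drafts + verifier's emitted token
        new_kv_lens.append(kv + advanced)
        new_pos.append(pos + advanced)
    return new_kv_lens, new_pos

# Three sequences start aligned, but acceptance counts are ragged:
kv_lens = [10, 10, 10]
position_ids = [10, 10, 10]
accepted = [3, 0, 2]  # each row accepted a different number of drafts
print(realign_batch(kv_lens, position_ids, accepted))
# -> ([14, 11, 13], [14, 11, 13])
```

Naively advancing all rows by the same amount, as a non-batched implementation padded to batch form would do, desynchronizes position IDs, attention masks, and the KV cache, which is exactly the failure mode the paper reports.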
Computer Science > Computation and Language
arXiv:2510.22876 (cs)
[Submitted on 26 Oct 2025 (v1), last revised 15 Feb 2026 (this version, v3)]
Title: Batch Speculative Decoding Done Right
Authors: Ranran Haoran Zhang, Soumik Dey, Ashirbad Mishra, Hansi Wu, Binbin Li, Rui Zhang
Abstract: Speculative decoding must produce output distributions identical to standard autoregressive generation: this output equivalence is not an optimization target but the defining criterion of valid speculative decoding. We demonstrate that all existing batch speculative decoding implementations violate this fundamental requirement, producing corrupted outputs ranging from repetitive tokens to gibberish. These failures stem from the ragged tensor problem: sequences in the same batch accept different numbers of draft tokens, desynchronizing position IDs, attention masks, and KV-cache state. We present the first authentic batch speculative decoding framework. We (1) formalize the synchronization invariants that valid batch speculative decoding must satisfy, (2) present EQSPEC, the first algorithm that guarantees output equivalence, and analyze its cost structure to show that alignment overhead grows superlinearly and consumes up to 40% of computation, and (3) introduce EXSPEC, which reduces this overhead through cross-batch scheduling that dynamically groups sa...
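The per-sequence acceptance step that produces ragged batches can be sketched for greedy decoding. This is the generic speculative-decoding verification rule, shown here as an assumed baseline rather than the paper's algorithm: accept the longest prefix of draft tokens matching the target model's argmax, then emit the target's token at the first mismatch (or a bonus token if all drafts match).

```python
def verify_greedy(draft_tokens, target_argmax):
    """Greedy-mode verification. `target_argmax` holds the target model's
    argmax token at each draft position plus one extra (bonus) position,
    so len(target_argmax) == len(draft_tokens) + 1."""
    accepted = []
    for d, t in zip(draft_tokens, target_argmax):
        if d == t:
            accepted.append(d)  # draft agrees with the target model
        else:
            return accepted + [t]  # reject rest, emit target's correction
    return accepted + [target_argmax[len(draft_tokens)]]  # all accepted + bonus

print(verify_greedy([5, 7, 9], [5, 7, 2, 8]))  # -> [5, 7, 2]
print(verify_greedy([5, 7, 9], [5, 7, 9, 8]))  # -> [5, 7, 9, 8]
```

Because each sequence can reject at a different position, the number of emitted tokens varies per row, which is why a batched implementation must realign state per sequence rather than per batch.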