[2512.22420] Nightjar: Dynamic Adaptive Speculative Decoding for Large Language Models Serving
Summary
The paper presents Nightjar, a novel algorithm for dynamic adaptive speculative decoding in large language models, enhancing throughput and reducing latency in real-time applications.
Why It Matters
As large language models become integral to various applications, optimizing their performance under different load conditions is crucial. Nightjar addresses the limitations of existing speculative decoding methods, offering a more efficient solution that can adapt to real-world demands, thus improving user experience and resource utilization.
Key Takeaways
- Nightjar dynamically adjusts speculative decoding length based on request load.
- It achieves up to 14.8% higher throughput compared to standard speculative decoding.
- The algorithm can disable speculative decoding when it is not beneficial.
- Nightjar reduces latency by 20.2% in high-load scenarios.
- This advancement enhances the efficiency of large language models in real-time serving.
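The paper does not publish its controller code; the mechanism the takeaways describe (learning the best speculative length per load level, including length zero to disable speculation) can be sketched as a simple epsilon-greedy bandit over observed throughput. All names below (`AdaptiveSpecController`, `choose_length`, `record`) are hypothetical, not from the paper.

```python
import random
from collections import defaultdict

class AdaptiveSpecController:
    """Hypothetical sketch: pick a speculative length per batch-size bucket
    via an epsilon-greedy bandit over measured throughput.
    A chosen length of 0 means speculative decoding is disabled."""

    def __init__(self, max_spec_len=8, epsilon=0.1):
        self.max_spec_len = max_spec_len
        self.epsilon = epsilon
        # (bucket, spec_len) -> [summed tokens/sec, observation count]
        self.stats = defaultdict(lambda: [0.0, 0])

    def _bucket(self, batch_size):
        # Coarse power-of-two buckets so similar loads share statistics.
        return batch_size.bit_length()

    def choose_length(self, batch_size):
        bucket = self._bucket(batch_size)
        if random.random() < self.epsilon:
            return random.randint(0, self.max_spec_len)  # explore
        best_len, best_avg = 0, float("-inf")
        for k in range(self.max_spec_len + 1):
            total, n = self.stats[(bucket, k)]
            # Untried lengths get +inf so each arm is sampled at least once.
            avg = total / n if n else float("inf")
            if avg > best_avg:
                best_len, best_avg = k, avg
        return best_len

    def record(self, batch_size, spec_len, tokens_per_sec):
        entry = self.stats[(self._bucket(batch_size), spec_len)]
        entry[0] += tokens_per_sec
        entry[1] += 1
```

After each decoding step the serving loop would call `record` with the measured throughput, so at high batch sizes (compute-bound) the controller converges toward length 0, and at low batch sizes (memory-bound) toward longer drafts. This is a minimal illustration of the adaptation idea, not the paper's learning algorithm.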
Computer Science > Distributed, Parallel, and Cluster Computing
arXiv:2512.22420 (cs)
[Submitted on 27 Dec 2025 (v1), last revised 16 Feb 2026 (this version, v2)]
Authors: Rui Li, Zhaoning Zhang, Libo Zhang, Huaimin Wang, Xiang Fu, Zhiquan Lai
Abstract: Speculative decoding (SD) accelerates LLM inference by verifying draft tokens in parallel. However, this method presents a critical trade-off: it improves throughput in low-load, memory-bound systems but degrades performance in high-load, compute-bound environments due to verification overhead. Current SD implementations use a fixed speculative length, failing to adapt to dynamic request rates and creating a significant performance bottleneck in real-world serving scenarios. To overcome this, we propose Nightjar, a novel learning-based algorithm for adaptive speculative inference that adjusts to request load by dynamically selecting the optimal speculative length for different batch sizes and even disabling speculative decoding when it provides no benefit. Experiments show that Nightjar achieves up to 14.8% higher throughput and 20.2% lower latency compared to standard speculative decoding, demonstrating robust efficiency for real-time serving.