[2602.20217] KnapSpec: Self-Speculative Decoding via Adaptive Layer Selection as a Knapsack Problem
Summary
The paper introduces KnapSpec, a framework for self-speculative decoding that optimizes layer selection in LLMs as a knapsack problem, enhancing inference speed without additional training.
Why It Matters
KnapSpec addresses the inefficiencies of existing self-speculative decoding methods by adapting layer selection to dynamic computational conditions such as context length, making it significant for deploying large language models in real-world applications. This advancement can yield faster inference and better hardware utilization in AI systems.
Key Takeaways
- KnapSpec reformulates draft model selection as a knapsack problem.
- It achieves up to 1.47x speedup in LLM inference without extra training.
- The method maintains high drafting faithfulness by modeling hardware-specific latencies.
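To make the knapsack framing concrete, here is a minimal, hypothetical sketch: each skippable layer is treated as a knapsack item whose "cost" is its latency and whose "value" stands in for its contribution to drafting faithfulness, and a standard 0/1 knapsack DP picks the subset that maximizes value within a latency budget. The per-layer costs, values, and budget below are illustrative assumptions, not measurements from the paper, and this is textbook DP rather than KnapSpec's parallel algorithm.

```python
# Hypothetical sketch of layer selection as a 0/1 knapsack problem.
# costs  = per-layer latency in integer units (assumed numbers)
# values = per-layer faithfulness contribution (assumed numbers)
# budget = latency budget for the draft model (assumed)

def select_layers(costs, values, budget):
    """Classic 0/1 knapsack DP: choose layers maximizing total value
    while total latency stays within `budget`."""
    n = len(costs)
    dp = [[0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        c, v = costs[i - 1], values[i - 1]
        for b in range(budget + 1):
            dp[i][b] = dp[i - 1][b]  # option: skip layer i-1
            if c <= b:
                dp[i][b] = max(dp[i][b], dp[i - 1][b - c] + v)
    # Backtrack to recover which layers were kept.
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if dp[i][b] != dp[i - 1][b]:
            chosen.append(i - 1)
            b -= costs[i - 1]
    return dp[n][budget], sorted(chosen)

best, layers = select_layers(costs=[3, 4, 2, 5], values=[4, 5, 3, 8], budget=8)
# best == 12, layers == [0, 3]: the DP keeps the two layers whose combined
# value is highest without exceeding the latency budget.
```

KnapSpec's contribution is in how the costs and values are obtained (hardware-specific latency models and a principled faithfulness proxy), not in the DP itself.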
Computer Science > Machine Learning
arXiv:2602.20217 (cs) [Submitted on 23 Feb 2026]
Authors: Seongjin Cha, Gyuwan Kim, Dongsu Han, Tao Yang, Insu Han
Abstract: Self-speculative decoding (SSD) accelerates LLM inference by skipping layers to create an efficient draft model, yet existing methods often rely on static heuristics that ignore the dynamic computational overhead of attention in long-context scenarios. We propose KnapSpec, a training-free framework that reformulates draft model selection as a knapsack problem to maximize tokens-per-time throughput. By decoupling Attention and MLP layers and modeling their hardware-specific latencies as functions of context length, KnapSpec adaptively identifies optimal draft configurations on the fly via a parallel dynamic programming algorithm. Furthermore, we provide the first rigorous theoretical analysis establishing cosine similarity between hidden states as a mathematically sound proxy for the token acceptance rate. This foundation allows our method to maintain high drafting faithfulness while navigating the shifting bottlenecks of real-world hardware. Our experiments on Qwen3 and Llama3 demonstrate that KnapSpec consistently outperforms state-of-the-art S...
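The abstract's proxy for the token acceptance rate, cosine similarity between hidden states, can be sketched as follows. This is a generic cosine-similarity computation with toy vectors standing in for the full model's and the pruned draft's hidden states; the vectors and variable names are illustrative assumptions, and the paper's contribution is the theoretical link between this score and the acceptance rate, not the formula itself.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy hidden states (assumed values, not from the paper):
full_hidden = [0.2, 1.0, -0.5, 0.3]     # full model's hidden state
draft_hidden = [0.25, 0.9, -0.4, 0.35]  # pruned draft's hidden state

score = cosine_similarity(full_hidden, draft_hidden)
# A score near 1.0 indicates the draft's hidden state closely tracks the
# full model's, which the paper ties to a high token acceptance rate.
```

Because this proxy only needs hidden states, it can be evaluated without running the costly verification step for every candidate layer configuration.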