[2603.01399] Quasar: Quantized Self-Speculative Acceleration for Rapid Inference via Memory-Efficient Verification
Computer Science > Distributed, Parallel, and Cluster Computing
arXiv:2603.01399 (cs)
[Submitted on 2 Mar 2026]
Title: Quasar: Quantized Self-Speculative Acceleration for Rapid Inference via Memory-Efficient Verification
Authors: Guang Huang, Zeyi Wen
Abstract: Speculative Decoding (SD) has emerged as a premier technique for accelerating Large Language Model (LLM) inference by decoupling token generation into rapid drafting and parallel verification. While recent advancements in self-speculation and lookahead decoding have successfully minimized drafting overhead, they have shifted the primary performance bottleneck to the verification phase. Since verification requires a full forward pass of the target model, it remains strictly memory-bandwidth bound, fundamentally limiting the maximum achievable speedup. In this paper, we introduce Quasar (Quantized Self-speculative Acceleration for Rapid Inference), a novel, training-free framework designed to overcome this "memory wall" by employing low-bit quantization specifically for the verification stage. Our empirical analysis reveals that while aggressive structural pruning significantly degrades verification accuracy, quantization-based verification preserves the logit distribution with high fide...