[2602.23200] InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models
Summary
InnerQ is a hardware-aware, tuning-free quantization method for the key-value (KV) cache in large language models that lowers decoding latency while preserving accuracy.
Why It Matters
As large language models generate longer sequences, the KV cache grows with context length and can dominate memory use during decoding. InnerQ addresses this memory and latency bottleneck, making it relevant for developers and researchers focused on efficient LLM inference.
Key Takeaways
- InnerQ reduces decoding latency by optimizing KV cache quantization.
- The method achieves up to 22% speedup over prior KV-cache quantization methods and up to 88% over half-precision vector-matrix multiplication.
- Hybrid quantization and per-channel normalization preserve fidelity under aggressive compression.
- InnerQ maintains performance comparable to non-quantized caches.
- The approach is particularly beneficial for applications requiring long-sequence generation.
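The per-channel normalization mentioned above can be illustrated with a minimal sketch. The excerpt does not describe the exact scheme, so the version below (scaling each channel by its maximum absolute value before quantization, to tame outlier channels) is an assumption, not the paper's method:

```python
import numpy as np

def per_channel_normalize(cache, eps=1e-6):
    """Divide each channel (column) of a KV-cache matrix by its max |value|.

    Hypothetical illustration: equalizing channel magnitudes before
    quantization limits the influence of outlier channels on shared
    scale factors. The scheme in the paper may differ.
    """
    norms = np.abs(cache).max(axis=0) + eps  # one normalizer per channel
    return cache / norms, norms

rng = np.random.default_rng(1)
K = rng.standard_normal((16, 8)).astype(np.float32)
K[:, 3] *= 50.0  # simulate one outlier channel
K_norm, norms = per_channel_normalize(K)
print(np.abs(K_norm).max(axis=0))  # every channel now has comparable range
```

Storing the per-channel normalizers alongside the quantized cache lets the original values be recovered by a single multiply per channel after dequantization.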
Computer Science > Machine Learning
arXiv:2602.23200 (cs)
[Submitted on 26 Feb 2026]
Title: InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models
Authors: Sayed Mohammadreza Tayaranian Hosseini, Amir Ardakani, Warren J. Gross
Abstract: Reducing the hardware footprint of large language models (LLMs) during decoding is critical for efficient long-sequence generation. A key bottleneck is the key-value (KV) cache, whose size scales with sequence length and easily dominates the memory footprint of the model. Previous work proposed quantization methods focused on compressing the KV cache while maintaining its information. We introduce InnerQ, a hardware-aware KV-cache quantization scheme that lowers decode latency without sacrificing accuracy. InnerQ applies group-wise quantization, grouping the cache matrices over their inner dimension. Unlike previous work that groups over the outer dimension, InnerQ aligns dequantization with the vector-matrix multiplication and enables scale-factor reuse across GPU compute units. This reduces memory accesses and accelerates dequantization, yielding up to 22% speedup over previous work and up to 88% over half-precision vector-matrix multiplication. To preserve fidelity under aggressive compression, InnerQ inco...
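The inner-dimension grouping described in the abstract can be sketched in NumPy. This is an illustrative reconstruction, not the paper's implementation: function names, the 4-bit symmetric format, and the group size of 32 are all assumptions. The key point it demonstrates is that when groups lie along the inner (reduction) dimension, one scale factor covers an entire partial dot product, so dequantization reduces to one multiply per group instead of one per element:

```python
import numpy as np

def quantize_inner_groups(cache, group_size=32, bits=4):
    """Symmetric group-wise quantization along the inner (last) dimension.

    Each row of the cache matrix is split into contiguous groups of
    `group_size` along the inner dimension; each group shares one scale.
    Hypothetical sketch of the idea in the abstract, not the paper's code.
    """
    rows, cols = cache.shape
    assert cols % group_size == 0
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for signed 4-bit
    groups = cache.reshape(rows, cols // group_size, group_size)
    scales = np.abs(groups).max(axis=-1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)  # guard all-zero groups
    q = np.clip(np.round(groups / scales), -qmax, qmax).astype(np.int8)
    return q, scales

def dequant_matvec(x, q, scales):
    """Vector-matrix product fused with dequantization.

    Because each group spans the reduction axis, its scale is applied once
    to the group's partial sum -- the scale-reuse effect the abstract
    attributes to inner-dimension grouping.
    """
    rows, n_groups, group_size = q.shape
    xg = x.reshape(n_groups, group_size)
    # Per-group partial dot products, then one scale multiply per group.
    partial = np.einsum('gs,rgs->rg', xg, q.astype(np.float32))
    return (partial * scales.squeeze(-1)).sum(axis=-1)

rng = np.random.default_rng(0)
K = rng.standard_normal((8, 64)).astype(np.float32)
x = rng.standard_normal(64).astype(np.float32)
q, s = quantize_inner_groups(K, group_size=32)
approx = dequant_matvec(x, q, s)
exact = K @ x
print(np.max(np.abs(approx - exact)))  # quantization error of the matvec
```

With outer-dimension grouping, by contrast, each element of a partial sum would carry a different scale, forcing a per-element multiply (and extra scale loads) inside the reduction loop.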