[2602.23200] InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models

arXiv - Machine Learning · 4 min read · Article

Summary

InnerQ is a hardware-aware, tuning-free quantization scheme for the key-value (KV) cache of large language models that lowers decoding latency while preserving accuracy.

Why It Matters

As large language models generate longer sequences, the KV cache grows with sequence length and can dominate the model's memory footprint during decoding. InnerQ addresses these memory and latency challenges, making it relevant for developers and researchers focused on efficient LLM inference.

Key Takeaways

  • InnerQ reduces decoding latency by grouping KV-cache quantization along the cache's inner dimension, aligning dequantization with the vector-matrix multiplication.
  • The method achieves up to 22% speedup over previous quantization techniques and up to 88% over half-precision vector-matrix multiplication.
  • Hybrid quantization and per-channel normalization preserve model fidelity under aggressive compression (see the sketch after this list).
  • InnerQ maintains performance comparable to non-quantized caches.
  • The approach is particularly beneficial for applications requiring long-sequence generation.
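
The takeaways name per-channel normalization as one of InnerQ's fidelity techniques without spelling it out. Below is a minimal sketch of per-channel normalization applied to a key-cache matrix before quantization; the choice of statistic (per-channel max magnitude), the 4-bit symmetric quantizer, and all function names are illustrative assumptions, not the paper's exact scheme.

```python
# Minimal sketch (assumptions, not InnerQ's exact recipe): normalize each
# channel of the key cache by its max magnitude, then quantize the
# normalized matrix and fold the normalization back at dequantization.
import numpy as np

def normalize_per_channel(K):
    """Divide each channel (column) of the key cache by its max magnitude."""
    ch_scale = np.abs(K).max(axis=0, keepdims=True)   # (1, head_dim)
    ch_scale = np.where(ch_scale == 0, 1.0, ch_scale)
    return K / ch_scale, ch_scale

def quantize_symmetric(K_norm, bits=4):
    """Symmetric quantization of the normalized cache (assumed 4-bit)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(K_norm).max() / qmax
    q = np.clip(np.round(K_norm / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

K = np.random.randn(16, 64).astype(np.float32)        # (seq_len, head_dim)
K_norm, ch_scale = normalize_per_channel(K)
q, scale = quantize_symmetric(K_norm)
K_hat = (q.astype(np.float32) * scale) * ch_scale      # fold normalization back
```

Dividing out a per-channel scale flattens outlier channels, so the shared quantization scale wastes fewer levels on a handful of large values.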

Computer Science > Machine Learning

arXiv:2602.23200 (cs) · [Submitted on 26 Feb 2026]

Title: InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models

Authors: Sayed Mohammadreza Tayaranian Hosseini, Amir Ardakani, Warren J. Gross

Abstract: Reducing the hardware footprint of large language models (LLMs) during decoding is critical for efficient long-sequence generation. A key bottleneck is the key-value (KV) cache, whose size scales with sequence length and easily dominates the memory footprint of the model. Previous work proposed quantization methods that are focused on compressing the KV cache while maintaining its information. We introduce InnerQ, a hardware-aware KV-cache quantization scheme that lowers decode latency without sacrificing accuracy. InnerQ applies group-wise quantization while grouping the cache matrices over their inner dimension. Unlike previous work that groups over the outer dimension, InnerQ aligns dequantization with the vector-matrix multiplication and enables scale factor reuse across GPU compute units. This reduces memory accesses and accelerates dequantization, yielding up to $22\%$ speedup over previous work and up to $88\%$ over half-precision vector-matrix multiplication. To preserve fidelity under aggressive compression, InnerQ inco...
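
The abstract's central idea, grouping over the inner (reduction) dimension so that dequantization can be folded into the dot product, can be illustrated with a short NumPy sketch. The group size, bit width, and asymmetric min/max quantizer below are assumptions for illustration; the paper's kernel runs on GPU and its exact configuration is not given in this excerpt.

```python
# Illustrative sketch of inner-dimension group-wise quantization: groups lie
# along the reduction axis of the key matrix, so each partial dot product
# over a group needs only one scale and zero point. Group size and bit width
# are assumptions, not the paper's exact configuration.
import numpy as np

GROUP = 32   # assumed group size along the inner dimension
BITS = 4     # assumed bit width

def quantize_inner(K, group=GROUP, bits=BITS):
    """Quantize K (seq_len, head_dim) with one scale/offset per inner-dim group."""
    seq, dim = K.shape
    Kg = K.reshape(seq, dim // group, group)
    kmin = Kg.min(axis=-1, keepdims=True)
    kmax = Kg.max(axis=-1, keepdims=True)
    scale = (kmax - kmin) / (2 ** bits - 1)
    scale = np.where(scale == 0, 1.0, scale)
    q = np.round((Kg - kmin) / scale).astype(np.uint8)
    return q, scale, kmin

def qk_scores(query, q, scale, kmin):
    """Compute query . K^T with dequantization folded into group-wise dot products."""
    dim = query.shape[0]
    group = q.shape[2]
    qg = query.reshape(1, dim // group, group)           # (1, n_groups, group)
    # Sum_i query_i * (q_i * scale + kmin)
    #   = scale * Sum_i(query_i * q_i) + kmin * Sum_i(query_i)
    partial = (qg * q).sum(axis=-1) * scale[..., 0] \
              + qg.sum(axis=-1) * kmin[..., 0]           # (seq_len, n_groups)
    return partial.sum(axis=-1)                          # (seq_len,)

# Tiny usage example against the unquantized reference
K = np.random.randn(8, 64).astype(np.float32)
query = np.random.randn(64).astype(np.float32)
q, scale, kmin = quantize_inner(K)
approx = qk_scores(query, q, scale, kmin)
exact = K @ query
```

Because each group spans the reduction axis, the partial dot product over a group touches exactly one scale and one offset, which is the property that allows scale-factor reuse across the GPU compute units accumulating that group, in contrast to outer-dimension grouping where every element of the reduction carries its own group metadata.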
