[2604.02638] AXELRAM: Quantize Once, Never Dequantize

arXiv - Machine Learning

Title: AXELRAM: Quantize Once, Never Dequantize
Author: Yasushi Nishida
arXiv:2604.02638 (Computer Science > Machine Learning), submitted on 3 Apr 2026

Abstract: We propose AXELRAM, a smart SRAM macro architecture that computes attention scores directly from quantized KV cache indices without dequantization. The key enabler is a design-time fixed codebook: orthogonal-transform-based quantization concentrates each coordinate's distribution to N(0,1/d), so the optimal quantizer depends only on dimension d and bit-width b, not on input data. The asymmetric path design (transform on write, table-lookup on read with no inverse transform) reduces per-query multiplications by 102.4× (a mathematical identity).

Through multi-seed evaluation (10 seeds × 3 models), we discover that sign pattern sensitivity causes catastrophic PPL spikes (Δ > 50) on certain models (Qwen2.5-3B), while others (LLaMA-3.1-8B) are fully stable. This phenomenon extends SpinQuant's observation of rotation variance in weight quantization to the KV cache domain, where the effect is qualitatively more severe. We trace the root cause to layer-wise norm heterogeneity and propose a gradient-free sign pattern selection (200 candidates, 8 calibration samples, one-time) that eliminates catastrophic spikes with zero additional hardware cost. All source code is available at this https URL. ...
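The abstract rests on one identity: an orthogonal transform T preserves inner products, so q·k = (Tq)·(Tk). Keys can therefore be rotated and quantized once on write, and on read only the query is rotated and turned into a lookup table indexed by the stored codes, with no inverse transform. The Python sketch below illustrates that path together with a one-time, gradient-free sign-pattern search. It is not the author's implementation and does not model the SRAM macro or reproduce the 102.4× figure. Its assumptions, none taken from the paper, are: T is a normalized Walsh-Hadamard transform preceded by a ±1 sign pattern; each key is scaled to unit norm and its norm stored next to the indices; the "optimal quantizer" is a Lloyd-Max quantizer for N(0, 1/d); and the selection objective is a mean-squared attention-score error over 8 synthetic (query, key) pairs, standing in for whatever criterion the paper actually uses. All function names (`lloyd_max_codebook`, `write_key`, `query_table`, `select_sign_pattern`, etc.) are hypothetical.

```python
# Hypothetical sketch: every name here is illustrative, not taken from the AXELRAM code.
import math
import random
from statistics import NormalDist

_STD = NormalDist()  # standard normal N(0, 1)


def lloyd_max_codebook(d, b, iters=100):
    """Assumed Lloyd-Max levels for N(0, 1/d); depends only on d and b, not on data."""
    n = 2 ** b
    levels = [_STD.inv_cdf((i + 0.5) / n) for i in range(n)]  # start at quantiles
    for _ in range(iters):
        # Nearest-level decision thresholds sit midway between adjacent levels.
        edges = [-math.inf] + [(levels[i] + levels[i + 1]) / 2 for i in range(n - 1)] + [math.inf]
        # Move each level to the conditional mean of N(0, 1) on its interval.
        levels = [(_STD.pdf(lo) - _STD.pdf(hi)) / (_STD.cdf(hi) - _STD.cdf(lo))
                  for lo, hi in zip(edges[:-1], edges[1:])]
    return [v / math.sqrt(d) for v in levels]  # rescale from N(0, 1) to N(0, 1/d)


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x, n, h = list(x), len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                x[j], x[j + h] = x[j] + x[j + h], x[j] - x[j + h]
        h *= 2
    return [v / math.sqrt(n) for v in x]


def rotate(x, signs):
    """Assumed orthogonal transform T = H * diag(signs); preserves inner products."""
    return fwht([s * v for s, v in zip(signs, x)])


def write_key(k, signs, codebook):
    """Write path (once per key): normalize, rotate, store b-bit indices plus the norm."""
    norm = math.sqrt(sum(v * v for v in k))
    y = rotate([v / norm for v in k], signs)
    idx = [min(range(len(codebook)), key=lambda i: abs(v - codebook[i])) for v in y]
    return idx, norm


def query_table(q, signs, codebook):
    """Read path (once per query): rotate q and tabulate (Tq)[j] * codebook[i]."""
    return [[qj * c for c in codebook] for qj in rotate(q, signs)]


def score(table, idx, norm):
    """Per key: d lookups and additions plus one scale; no dequantization or inverse T."""
    return norm * sum(row[i] for row, i in zip(table, idx))


def score_error(signs, codebook, calib):
    """Stand-in calibration objective: mean squared attention-score error."""
    err = 0.0
    for q, k in calib:
        idx, norm = write_key(k, signs, codebook)
        exact = sum(a * c for a, c in zip(q, k))
        err += (score(query_table(q, signs, codebook), idx, norm) - exact) ** 2
    return err / len(calib)


def select_sign_pattern(d, codebook, calib, n_candidates=200, seed=0):
    """Gradient-free, one-time search: keep the random +/-1 pattern with lowest error."""
    rng = random.Random(seed)
    best, best_err = None, math.inf
    for _ in range(n_candidates):
        signs = [rng.choice((-1, 1)) for _ in range(d)]
        err = score_error(signs, codebook, calib)
        if err < best_err:
            best, best_err = signs, err
    return best, best_err


if __name__ == "__main__":
    d, b = 64, 3
    rng = random.Random(1)
    codebook = lloyd_max_codebook(d, b)
    # 8 synthetic (query, key) pairs stand in for the paper's 8 calibration samples.
    calib = [([rng.gauss(0, 1) for _ in range(d)], [rng.gauss(0, 1) for _ in range(d)])
             for _ in range(8)]
    signs, err = select_sign_pattern(d, codebook, calib)
    print(f"best calibration score MSE over 200 candidates: {err:.4f}")
```

Because the codebook in this sketch depends only on d and b, `lloyd_max_codebook` could be evaluated once at design time. Only `query_table` runs per query, and each stored key then costs d table lookups and additions plus one scale by its stored norm. The sign search runs once offline, so under these assumptions it changes which fixed pattern is used but adds nothing to the read path.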

Originally published on April 06, 2026. Curated by AI News.