[2604.02556] Fast NF4 Dequantization Kernels for Large Language Model Inference


arXiv - Machine Learning 3 min read

About this article


Computer Science > Machine Learning

arXiv:2604.02556 (cs) · Submitted on 2 Apr 2026

Title: Fast NF4 Dequantization Kernels for Large Language Model Inference

Authors: Xiangbo Qi, Chaoyi Jiang, Murali Annavaram

Abstract: Large language models (LLMs) have grown beyond the memory capacity of single GPU devices, necessitating quantization techniques for practical deployment. While NF4 (4-bit NormalFloat) quantization enables 4$\times$ memory reduction, inference on current NVIDIA GPUs (e.g., Ampere A100) requires expensive dequantization back to FP16 format, creating a critical performance bottleneck. This paper presents a lightweight shared memory optimization that addresses this gap through principled memory hierarchy exploitation while maintaining full ecosystem compatibility. We compare our technique against the open-source BitsAndBytes implementation, achieving 2.0--2.2$\times$ kernel speedup across three models (Gemma 27B, Qwen3 32B, and Llama3.3 70B) and up to 1.54$\times$ end-to-end improvement by leveraging the 12--15$\times$ latency advantage of shared memory over global memory access. Our optimization reduces instruction counts through simplified indexing logic while using only 64 bytes of shared memory per thread block, demonstrating that lightweight optimizations can deliver substantial performance gains wit...
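To make the technique concrete: NF4 stores two 4-bit codes per byte, and each code indexes a 16-entry codebook of normalized float values, which is then rescaled by a per-block absolute-maximum. A 16-entry FP32 table is exactly the 64 bytes the abstract says the kernel keeps in shared memory. The sketch below is a CPU reference of that lookup-and-scale step, not the paper's GPU kernel; the codebook constants shown are the NF4 values published with QLoRA/bitsandbytes (reproduced here from memory, so treat them as illustrative), and the high-nibble-first packing order is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 16-entry NF4 codebook (values as published for NF4 in
   QLoRA/bitsandbytes; listed from memory, illustrative only).
   At 16 x 4 bytes = 64 bytes, this is the table a GPU kernel
   could hold in shared memory per thread block. */
static const float nf4_codebook[16] = {
    -1.0f,        -0.6961928f,  -0.52507305f, -0.39491749f,
    -0.28444138f, -0.18477343f, -0.09105004f,  0.0f,
     0.0795803f,   0.1609302f,   0.24611239f,  0.33791524f,
     0.44070983f,  0.562617f,    0.72295684f,  1.0f,
};

/* Dequantize n_bytes of packed NF4 data (two 4-bit codes per byte,
   high nibble first in this sketch) into 2*n_bytes floats, rescaled
   by the block's absmax. */
void nf4_dequantize(const uint8_t *packed, size_t n_bytes,
                    float absmax, float *out) {
    for (size_t i = 0; i < n_bytes; i++) {
        uint8_t b = packed[i];
        out[2 * i]     = nf4_codebook[b >> 4]  * absmax; /* high nibble */
        out[2 * i + 1] = nf4_codebook[b & 0xF] * absmax; /* low nibble  */
    }
}
```

On a GPU, each thread would perform the same two table lookups per byte, but with the codebook resident in shared memory rather than registers or global memory; the 12--15$\times$ latency advantage the abstract cites comes from those repeated reads hitting the on-chip table.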

Originally published on April 06, 2026. Curated by AI News.

