[2601.20088] Quantization-Aware Distillation for NVFP4 Inference Accuracy Recovery
Computer Science > Machine Learning

arXiv:2601.20088 (cs) [Submitted on 27 Jan 2026 (v1), last revised 1 Mar 2026 (this version, v2)]

Title: Quantization-Aware Distillation for NVFP4 Inference Accuracy Recovery

Authors: Meng Xin, Sweta Priyadarshi, Jingyu Xin, Bilal Kartal, Aditya Vavre, Asma Kuriparambil Thekkumpate, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Ido Shahaf, Akhiad Bercovich, Kinjal Patel, Suguna Varshini Velury, Chenjie Luo, Zhiyu Cheng, Jenny Chen, Chen-Han Yu, Wei Ping, Oleg Rybakov, Nima Tajbakhsh, Oluwatobi Olabiyi, Dusan Stosic, Di Wu, Song Han, Eric Chung, Sharath Turuvekere Sreenivas, Bryan Catanzaro, Yoshi Suhara, Tijmen Blankevoort, Huizi Mao

Abstract: This technical report presents quantization-aware distillation (QAD) and our best practices for recovering the accuracy of NVFP4-quantized large language models (LLMs) and vision-language models (VLMs). QAD distills a full-precision teacher model into a quantized student model using a KL divergence loss. While applying distillation to quantized models is not a new idea, we observe key advantages of QAD for today's LLMs: 1. It shows remarkable effectiveness and stability for models trained through multi-stage post-training pipelines, including supervised fine-tuning (SFT), reinforcement learning (RL), and model merging, where traditional quantizatio...
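The core objective the abstract describes, distilling a full-precision teacher into a quantized student with a KL divergence loss, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `fake_quantize` helper is a simple symmetric integer fake-quantizer standing in for NVFP4 (which actually uses a 4-bit floating-point format with block scale factors), and the temperature parameter is an assumption of standard distillation practice.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax over the vocabulary axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def qad_kl_loss(student_logits, teacher_logits, temperature=1.0):
    """KL(teacher || student) averaged over tokens: the distillation
    objective described in the abstract. Logits have shape
    (..., vocab_size)."""
    t = temperature
    p = softmax(teacher_logits / t)          # teacher distribution
    q = softmax(student_logits / t)          # quantized-student distribution
    eps = 1e-12                              # avoid log(0)
    kl = (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1)
    return float(kl.mean() * t * t)          # t^2 rescales gradients

def fake_quantize(w, num_bits=4):
    """Illustrative symmetric fake-quantization of a weight tensor.
    A stand-in only: real NVFP4 is FP4 with per-block scaling."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(float(np.abs(w).max()), 1e-8) / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale
```

In a training loop, the student's forward pass would run with fake-quantized weights (gradients typically flow through the rounding via a straight-through estimator), while the teacher runs at full precision; the KL loss above is then minimized with respect to the student's underlying full-precision weights.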