[2508.02343] MicroMix: Efficient Mixed-Precision Quantization with Microscaling Formats for Large Language Models
Computer Science > Machine Learning
arXiv:2508.02343 (cs)
[Submitted on 4 Aug 2025 (v1), last revised 30 Mar 2026 (this version, v2)]

Title: MicroMix: Efficient Mixed-Precision Quantization with Microscaling Formats for Large Language Models
Authors: Wenyuan Liu, Haoqian Meng, Yilun Luo, Yafei Zhao, Peng Zhang, Xindian Ma

Abstract: Quantization significantly accelerates inference in large language models (LLMs) by replacing original high-precision matrices with low-precision counterparts. Recent advances in weight-activation quantization have primarily focused on mapping both weights and activations to the INT4 format. Although the new FP4 Tensor Cores in NVIDIA's Blackwell architecture offer up to 4x speedup over FP16, existing INT4-based kernels fail to fully exploit this capability due to mismatched data formats. To bridge this gap, we propose MicroMix, a co-designed mixed-precision quantization algorithm and GEMM kernel based on Microscaling (MX) data formats. Tailored for the Blackwell architecture, the MicroMix kernel supports arbitrary combinations of MXFP4, MXFP6, and MXFP8 channels, and produces BFloat16 outputs. To achieve a favorable trade-off between accuracy and efficiency for each linear layer, we introduce quantization thresholds that identify activation elements where lower-precision fo...
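To make the Microscaling (MX) idea the abstract relies on concrete, here is a minimal sketch of MX-style block quantization, simulating the MXFP4 (E2M1) element format with a shared power-of-two scale per 32-element block, following the structure of the OCP Microscaling specification. This is an illustrative assumption-laden sketch, not the paper's MicroMix algorithm or its Blackwell GEMM kernel; the function name `quantize_block_mxfp4` is invented for illustration.

```python
# Sketch of microscaling (MX) block quantization: each block of up to 32
# elements shares one power-of-two scale, and each element is rounded to
# the nearest FP4 (E2M1) representable value. Illustrative only; the
# paper's actual kernel operates on hardware MX formats.
import math

# Representable magnitudes of FP4 E2M1 (sign handled separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block_mxfp4(block):
    """Quantize one block (<= 32 floats) to simulated MXFP4 with a shared scale."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Shared power-of-two scale, chosen so the block's largest magnitude
    # maps near the top of the FP4 grid (emax of E2M1 is 2, i.e. 6.0 = 1.5 * 2^2).
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    out = []
    for x in block:
        # Round |x| / scale to the nearest representable FP4 magnitude
        # (values beyond 6.0 saturate to the grid maximum).
        q = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(q * scale, x))
    return out
```

A block whose largest element is 6.0 gets scale 1, so small entries like 0.3 snap to the nearest grid point (0.5) while exactly representable values pass through unchanged. In a mixed-precision scheme like the one the abstract describes, channels whose elements round poorly at FP4 would instead be routed to MXFP6 or MXFP8.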