[2602.11184] KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models
Summary
The paper presents KBVQ-MoE, a framework for improving vector quantization in Mixture of Experts (MoE) large language models. It targets two causes of performance degradation under quantization: redundant representations across experts and cumulative output bias amplified by expert aggregation.
Why It Matters
As large language models grow, efficient deployment in resource-constrained environments becomes critical. KBVQ-MoE addresses this by enabling ultra-low-bit compression of MoE models, potentially preserving accuracy without the heavy memory footprint typically associated with their large parameter counts.
Key Takeaways
- KBVQ-MoE improves vector quantization for MoE models, reducing the accuracy loss of ultra-low-bit compression.
- The framework addresses redundant representations across experts and cumulative output bias amplified by expert aggregation.
- Experimental results show KBVQ-MoE maintains accuracy close to FP16 baselines.
- This approach is suitable for deployment on edge devices.
- The integration of KLT and SVD techniques optimizes weight sharing across experts.
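The last takeaway mentions KLT and SVD. As generic background rather than the paper's specific pipeline, the Karhunen-Loève transform (KLT) is the eigenbasis of the data covariance, and it can be obtained from an SVD of the centered data matrix. A minimal sketch (function name and random data are illustrative):

```python
import numpy as np

def klt_basis(X):
    """Return the KLT basis of X: eigenvectors of the centered covariance.

    Computed via SVD of the centered data; the right singular vectors
    (columns of V) are the covariance eigenvectors.
    """
    Xc = X - X.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt.T  # columns are the KLT basis vectors

# Projecting centered data onto the KLT basis decorrelates its coordinates.
X = np.random.default_rng(1).normal(size=(100, 6))
U = klt_basis(X)
Y = (X - X.mean(axis=0)) @ U
cov = Y.T @ Y / (len(Y) - 1)
off = cov - np.diag(np.diag(cov))
print(np.abs(off).max())  # off-diagonal covariance is numerically ~0
```

This decorrelating property is why KLT bases are a natural tool when looking for shared structure across expert weight matrices; how KBVQ-MoE combines it with SVD-based weight sharing is described in the paper itself.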
Computer Science > Machine Learning
arXiv:2602.11184 (cs)
[Submitted on 30 Jan 2026 (v1), last revised 24 Feb 2026 (this version, v2)]
Title: KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models
Authors: Zukang Xu, Zhixiong Zhao, Xing Hu, Zhixuan Chen, Dawei Yang
Abstract: Mixture of Experts (MoE) models have achieved great success by significantly improving performance while maintaining computational efficiency through sparse expert activation. However, their enormous parameter sizes and memory demands pose major challenges for deployment in resource-constrained environments. Vector Quantization (VQ) offers a promising approach for ultra-low-bit compression in Large Language Models (LLMs) by leveraging a codebook, where weight vectors are mapped to the most similar discrete codewords. Yet, directly applying VQ to MoEs often leads to substantial performance degradation due to two critical obstacles: (1) redundant representations among experts cause VQ to repeatedly quantize similar representations for each expert, resulting in inefficient use of limited codebook capacity; and (2) cumulative output bias is amplified by expert aggregation in MoE layers, leading to distributional shifts in the quantized outputs. To address these issues, we propose KBVQ...
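The abstract's description of VQ, mapping each weight vector to the most similar discrete codeword in a codebook, can be illustrated with a minimal nearest-codeword sketch (NumPy; the function name, random weights, and codebook here are illustrative, not the paper's codebook construction):

```python
import numpy as np

def vq_quantize(weights, codebook):
    """Map each row of `weights` to its nearest codeword (Euclidean distance).

    weights:  (n, d) array of weight vectors
    codebook: (k, d) array of codewords
    Returns the chosen indices and the quantized weights.
    """
    # Pairwise squared distances between every weight vector and every codeword.
    d2 = ((weights[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (n, k)
    idx = d2.argmin(axis=1)           # index of the closest codeword per row
    return idx, codebook[idx]         # quantized weights are looked-up codewords

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))   # 8 weight vectors of dimension 4
C = rng.normal(size=(3, 4))   # a tiny 3-entry codebook
idx, W_q = vq_quantize(W, C)
print(idx)  # only the small indices need to be stored per weight vector
```

Storing the index array plus the shared codebook instead of full-precision weights is what makes VQ attractive for ultra-low-bit compression; the abstract's two obstacles (expert redundancy wasting codebook capacity, and bias accumulating across aggregated experts) arise on top of this basic scheme.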