[2602.11184] KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models

arXiv - Machine Learning · 4 min read

Summary

The paper presents KBVQ-MoE, a framework that adapts vector quantization to Mixture of Experts (MoE) large language models, tackling the performance degradation that arises when standard quantization methods are applied directly to expert weights.

Why It Matters

As large language models grow, efficient deployment in resource-constrained environments becomes critical, and MoE models are especially demanding: every expert's parameters must be stored even though only a few experts are active per token. KBVQ-MoE addresses this by making ultra-low-bit vector quantization workable for MoE weights, reducing memory demands without the accuracy loss that naive quantization incurs.

Key Takeaways

  • KBVQ-MoE makes ultra-low-bit vector quantization viable for MoE models while preserving performance.
  • The framework targets two failure modes of VQ on MoEs: redundant representations across experts and cumulative output bias amplified by expert aggregation.
  • Experimental results show KBVQ-MoE maintains accuracy close to FP16 baselines.
  • The approach makes MoE models practical for resource-constrained deployment, such as on edge devices.
  • KLT and SVD are combined to share weight structure across experts (see the sketch after this list).
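
The abstract does not spell out the exact decomposition, but the general pattern of a KLT-guided low-rank factorization can be sketched: compute a Karhunen-Loève (KLT) basis from calibration activations, fold it into the stacked expert weights so the SVD minimizes an activation-aware error, and truncate to a right factor shared by all experts. Everything below (the function name, the whitening recipe, the shapes) is an illustrative assumption, not the paper's algorithm.

```python
import numpy as np

def klt_guided_svd(expert_weights, acts, rank):
    """Sketch of a KLT-guided low-rank factorization shared across experts.

    expert_weights: list of (d_out, d_in) matrices, one per expert.
    acts:           (n, d_in) calibration activations feeding the experts.
    rank:           number of retained components.
    """
    # KLT basis: eigenvectors of the input activation covariance.
    cov = acts.T @ acts / acts.shape[0]                 # (d_in, d_in)
    evals, U = np.linalg.eigh(cov)                      # ascending eigenvalues
    S = U * np.sqrt(np.maximum(evals, 1e-8))            # "coloring" map, cov = S @ S.T
    S_inv = np.linalg.inv(S)

    # Stack experts to expose cross-expert redundancy; folding S into the
    # weights makes the truncated SVD minimize activation-weighted error
    # rather than plain weight error.
    W = np.concatenate(expert_weights, axis=0)          # (E * d_out, d_in)
    P, s, Qt = np.linalg.svd(W @ S, full_matrices=False)
    shared = Qt[:rank] @ S_inv                          # (rank, d_in), shared by all
    lefts = np.split(P[:, :rank] * s[:rank], len(expert_weights), axis=0)
    return lefts, shared                                # W_e ~= lefts[e] @ shared
```

The point of the activation-derived basis is that the Frobenius error of the truncated factor, measured in the colored coordinates, equals the expected output error under the calibration distribution, so the retained components are the ones the experts actually use; each expert then keeps only its small rank-r left factor while the right factor is stored once.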

Computer Science > Machine Learning

arXiv:2602.11184 (cs) [Submitted on 30 Jan 2026 (v1), last revised 24 Feb 2026 (this version, v2)]

Title: KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models

Authors: Zukang Xu, Zhixiong Zhao, Xing Hu, Zhixuan Chen, Dawei Yang

Abstract: Mixture of Experts (MoE) models have achieved great success by significantly improving performance while maintaining computational efficiency through sparse expert activation. However, their enormous parameter sizes and memory demands pose major challenges for deployment in resource-constrained environments. Vector Quantization (VQ) offers a promising approach for ultra-low-bit compression in Large Language Models (LLMs) by leveraging a codebook, where weight vectors are mapped to the most similar discrete codewords. Yet, directly applying VQ to MoEs often leads to substantial performance degradation due to two critical obstacles: (1) redundant representations among experts cause VQ to repeatedly quantize similar representations for each expert, resulting in inefficient use of limited codebook capacity; and (2) cumulative output bias is amplified by expert aggregation in MoE layers, leading to distributional shifts in the quantized outputs. To address these issues, we propose KBVQ...
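
The abstract pins down both ingredients of the title: weights are mapped sub-vector by sub-vector to their nearest codewords, and the quantization-induced output shift is corrected before expert outputs are aggregated. The NumPy sketch below illustrates one plausible reading; the function name, the sub-vector layout, and the simple mean-matching correction are assumptions for illustration, not the paper's actual procedure.

```python
import numpy as np

def quantize_with_bias_correction(W, codebook, X):
    """Nearest-codeword VQ plus a mean-matching output-bias correction.

    W:        (d_out, d_in) weight matrix, quantized in d-dim sub-vectors.
    codebook: (k, d) codewords; d must divide d_in.
    X:        (n, d_in) calibration activations.
    """
    d = codebook.shape[1]
    vecs = W.reshape(-1, d)                              # all sub-vectors
    # Squared Euclidean distance from every sub-vector to every codeword;
    # keep only the nearest index (log2(k) bits per sub-vector).
    dists = ((vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = dists.argmin(axis=1)
    W_hat = codebook[idx].reshape(W.shape)

    # Correct the mean output shift on calibration data so the error does
    # not accumulate when many quantized expert outputs are aggregated.
    bias = X.mean(axis=0) @ (W - W_hat).T                # (d_out,)
    return idx, W_hat, bias

# Toy usage: 4-dim sub-vectors with a 256-entry codebook store 8 bits per
# sub-vector, i.e. 2 bits per weight versus 16 for an FP16 baseline.
rng = np.random.default_rng(0)
idx, W_hat, bias = quantize_with_bias_correction(
    rng.standard_normal((64, 128)), rng.standard_normal((256, 4)),
    rng.standard_normal((512, 128)))
```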

Related Articles

LLMs

"Oops! ChatGPT is Temporarily Unavailable!": A Diary Study on Knowledge Workers' Experiences of LLM Withdrawal

Reddit - Artificial Intelligence · 1 min
LLMs

I built a Star Trek LCARS terminal that reads your entire AI coding setup

Side project that got out of hand. It's a dashboard for Claude Code that scans your ~/.claude/ directory and renders everything as a TNG ...

Reddit - Artificial Intelligence · 1 min
LLMs

[R] Is autoresearch really better than classic hyperparameter tuning?

We did experiments comparing Optuna & autoresearch. Autoresearch converges faster, is more cost-efficient, and even generalizes bette...

Reddit - Machine Learning · 1 min
LLMs

Claude Source Code?

Has anyone been able to successfully download the leaked source code yet? I've not been able to find it. If anyone has, please reach out....

Reddit - Artificial Intelligence · 1 min