[2602.22592] pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training
Summary
The paper presents pQuant, a quantization-aware training method that decouples each linear layer into specialized branches, improving both the efficiency and the accuracy of low-bit language models.
Why It Matters
As large language models (LLMs) become increasingly important for various applications, optimizing their efficiency for edge deployment is crucial. pQuant targets a key bottleneck in current quantization-aware training, the homogenization of parameter sensitivity, potentially making sub-2-bit LLMs more accurate and scalable.
Key Takeaways
- pQuant introduces a decoupled approach to quantization, restoring expressivity that is otherwise lost to the parameter democratization effect.
- The method splits linear layers into a 1-bit branch for computation and a high-precision branch for sensitive parameters.
- Extensive experiments demonstrate pQuant's state-of-the-art performance in low-bit quantization.
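The decoupling idea above can be sketched in plain Python. This is an illustrative reconstruction, not the paper's implementation: the top-magnitude heuristic for choosing "sensitive" weights and the mean-absolute scale for the 1-bit branch are assumptions (the paper instead guides the allocation via tailored feature scaling during training).

```python
# Hedged sketch of a decoupled linear layer: a dominant 1-bit branch
# (sign(w) * alpha) plus a small high-precision branch that keeps the
# largest-magnitude weights exact. Heuristics here are assumptions.

def decoupled_linear(x, W, hp_fraction=0.05):
    """Compute y = x @ (W_1bit + W_hp), splitting W by weight magnitude."""
    rows, cols = len(W), len(W[0])
    k = max(1, int(hp_fraction * rows * cols))
    # Proxy for sensitivity: keep the k largest-magnitude weights exact.
    flat = sorted(((abs(W[i][j]), i, j)
                   for i in range(rows) for j in range(cols)), reverse=True)
    hp = {(i, j) for _, i, j in flat[:k]}
    # 1-bit branch scale: mean |w| over the binarized weights.
    one_bit = [abs(W[i][j]) for i in range(rows) for j in range(cols)
               if (i, j) not in hp]
    alpha = sum(one_bit) / len(one_bit) if one_bit else 0.0
    # Recombine: exact weight if sensitive, otherwise sign(w) * alpha.
    W_eff = [[(W[i][j] if (i, j) in hp
               else alpha * (1.0 if W[i][j] >= 0 else -1.0))
              for j in range(cols)] for i in range(rows)]
    return [sum(x[i] * W_eff[i][j] for i in range(rows))
            for j in range(cols)]
```

With `hp_fraction=0.5` on a 2x2 weight matrix, half the weights pass through exactly while the rest collapse to a single shared scale, mimicking the accuracy/efficiency split the paper describes.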
Computer Science > Machine Learning
arXiv:2602.22592 (cs)
[Submitted on 26 Feb 2026]
Title: pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training
Authors: Wenzheng Zhang, Bingzheng Liu, Yang Hu, Xiaoying Bai, Wentao Zhang, Bin Cui
Abstract: Quantization-Aware Training from scratch has emerged as a promising approach for building efficient large language models (LLMs) with extremely low-bit weights (sub-2-bit), which can offer substantial advantages for edge deployment. However, existing methods still fail to achieve satisfactory accuracy and scalability. In this work, we identify a parameter democratization effect as a key bottleneck: the sensitivity of all parameters becomes homogenized, severely limiting expressivity. To address this, we propose pQuant, a method that decouples parameters by splitting linear layers into two specialized branches: a dominant 1-bit branch for efficient computation and a compact high-precision branch dedicated to preserving the most sensitive parameters. Through tailored feature scaling, we explicitly guide the model to allocate sensitive parameters to the high-precision branch. Furthermore, we extend this branch into multiple, sparsely-activated experts, enabling efficient capacity scaling. Extensive exper...
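The abstract's final idea, extending the high-precision branch into multiple sparsely-activated experts, can be sketched as a small mixture-of-experts router. The linear router, softmax gate, and top-1 selection here are illustrative assumptions; the paper does not specify these details in the abstract.

```python
# Hedged sketch of a sparsely-activated high-precision branch:
# a linear router scores each expert, softmax turns scores into gates,
# and only the top-1 expert is actually computed. All design choices
# (top-1, softmax gating) are assumptions for illustration.
import math

def route_top1(x, router_w):
    """Return (expert_index, gate) via a linear router + softmax."""
    logits = [sum(xi * wi for xi, wi in zip(x, col)) for col in router_w]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]

def moe_hp_branch(x, experts, router_w):
    """Sparse high-precision branch: only the routed expert runs."""
    idx, gate = route_top1(x, router_w)
    W = experts[idx]
    y = [sum(x[i] * W[i][j] for i in range(len(x)))
         for j in range(len(W[0]))]
    return [gate * v for v in y]
```

Because only one expert's weights are touched per input, total high-precision capacity can grow with the number of experts while per-token compute stays roughly constant, which is the "efficient capacity scaling" the abstract refers to.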