[2602.17693] A Case Study of Selected PTQ Baselines for Reasoning LLMs on Ascend NPU
Summary
This article presents a case study on the effectiveness of Post-Training Quantization (PTQ) methods for reasoning-oriented large language models (LLMs) on Ascend NPU, highlighting the challenges and performance implications of various quantization techniques.
Why It Matters
As LLMs grow in size and complexity, efficient deployment becomes critical. This study sheds light on the specific challenges of applying PTQ to reasoning LLMs on Ascend NPU, providing practical guidance for researchers and practitioners optimizing LLM inference in real-world deployments.
Key Takeaways
- Post-Training Quantization (PTQ) is essential for efficient model deployment.
- 4-bit weight-only quantization is viable for larger models, but aggressive 4-bit weight-activation schemes suffer calibration instability that can collapse long-context reasoning.
- Standard 8-bit quantization remains numerically stable where aggressive 4-bit schemes do not.
- Dynamic quantization overheads currently limit end-to-end acceleration despite optimized kernels.
- The findings provide a practical reference for deploying quantized reasoning models on Ascend NPU.
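The schemes in the takeaways above all map float weights to low-bit integers with a per-channel scale. A minimal NumPy sketch of the symmetric round-to-nearest INT8 baseline that methods like AWQ and GPTQ improve upon (function names are hypothetical; this is not the paper's implementation):

```python
import numpy as np

def quantize_int8_per_channel(w):
    # One scale per output channel (row), mapping the channel's
    # largest |w| onto the INT8 extreme value 127.
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover a float approximation of the original weights.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 16)).astype(np.float32)
q, scale = quantize_int8_per_channel(w)
w_hat = dequantize(q, scale)
# Round-to-nearest error is bounded by half a quantization step.
max_err = np.abs(w - w_hat).max()
```

At 4 bits the integer range shrinks from [-128, 127] to [-8, 7], so each step is sixteen times coarser; this is why rounding error that is benign at 8 bits can become destructive in the aggressive 4-bit settings the study examines.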
Paper Details
Computer Science > Machine Learning, arXiv:2602.17693 (cs). Submitted on 6 Feb 2026.
Title: A Case Study of Selected PTQ Baselines for Reasoning LLMs on Ascend NPU
Authors: Yuchen Luo, Fangyue Zhu, Ruining Zhou, Mingzhe Huang, Jian Zhu, Fanyu Fan, Wei Shao
Abstract: Post-Training Quantization (PTQ) is crucial for efficient model deployment, yet its effectiveness on Ascend NPU remains under-explored compared to GPU architectures. This paper presents a case study of representative PTQ baselines applied to reasoning-oriented models such as the DeepSeek-R1-Distill-Qwen series (1.5B/7B/14B) and QwQ-32B. We evaluate four distinct algorithms, including AWQ, GPTQ, SmoothQuant, and FlatQuant, to cover the spectrum from weight-only compression to advanced rotation-based methods. Our empirical results reveal significant platform sensitivity. While 4-bit weight-only quantization proves viable for larger models, aggressive 4-bit weight-activation schemes suffer from layer-wise calibration instability on the NPU, leading to logic collapse in long-context reasoning tasks. Conversely, standard 8-bit quantization remains numerically stable. Furthermore, a real-world INT8 deployment demonstrates that although optimized kernels reduce latency, dynamic quantization overheads currently limit end-to-end acceleration. These findings offer ...
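The dynamic-quantization overhead the abstract mentions can be made concrete: in W8A8 inference, activation scales are not known ahead of time, so every forward pass pays for a per-token max-reduction and an INT8 cast before the fast integer kernel runs. A NumPy sketch under that assumption (names are hypothetical, not the paper's code):

```python
import numpy as np

def quantize_int8_rows(m):
    # Symmetric INT8 quantization with one scale per row.
    scale = np.maximum(np.abs(m).max(axis=1, keepdims=True), 1e-8) / 127.0
    q = np.clip(np.round(m / scale), -128, 127).astype(np.int8)
    return q, scale

def dynamic_w8a8_matmul(x, w_q, w_scale):
    # Runtime cost: activation scales must be computed per token,
    # on every forward pass, before the INT8 kernel can run --
    # this is the dynamic-quantization overhead in question.
    x_q, x_scale = quantize_int8_rows(x)
    # INT8 x INT8 multiply, accumulated in INT32, then rescaled.
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T
    return acc.astype(np.float32) * x_scale * w_scale.T

rng = np.random.default_rng(1)
x = rng.standard_normal((2, 8)).astype(np.float32)   # 2 tokens
w = rng.standard_normal((4, 8)).astype(np.float32)   # 4 output dims
w_q, w_scale = quantize_int8_rows(w)  # weights quantized offline
y = dynamic_w8a8_matmul(x, w_q, w_scale)
```

Static quantization would precompute `x_scale` from calibration data and skip the per-token reduction and cast, which is why the integer matmul alone speeding up does not guarantee end-to-end acceleration.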