[2602.17693] A Case Study of Selected PTQ Baselines for Reasoning LLMs on Ascend NPU

arXiv - Machine Learning · 3 min read

Summary

This article presents a case study on the effectiveness of Post-Training Quantization (PTQ) methods for reasoning-oriented large language models (LLMs) on Ascend NPU, highlighting the challenges and performance implications of various quantization techniques.

Why It Matters

As AI models grow in size and complexity, efficient deployment becomes critical. This study sheds light on the specific challenges of applying PTQ to reasoning LLMs on Ascend NPU, providing a practical reference for researchers and practitioners optimizing LLM inference in real-world deployments.

Key Takeaways

  • Post-Training Quantization (PTQ) is essential for efficient model deployment.
  • 4-bit weight-only quantization is viable for larger models but can lead to instability in long-context reasoning tasks.
  • Standard 8-bit quantization remains more stable compared to aggressive 4-bit schemes.
  • Dynamic quantization overheads currently limit end-to-end acceleration despite optimized kernels.
  • The findings provide a practical reference for deploying quantized reasoning models on Ascend NPU.
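The gap between 8-bit and 4-bit weight quantization noted above can be illustrated with a minimal symmetric round-to-nearest quantizer. This is a generic sketch of the basic PTQ mechanism, not the paper's implementation (which evaluates AWQ, GPTQ, SmoothQuant, and FlatQuant); the error gap it shows is the root cause of the 4-bit instability the study reports.

```python
import numpy as np

def quantize_dequantize(w, bits):
    """Symmetric per-tensor round-to-nearest PTQ: quantize, then dequantize."""
    qmax = 2 ** (bits - 1) - 1            # 127 for INT8, 7 for INT4
    scale = np.max(np.abs(w)) / qmax      # map the largest weight to qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale                      # float approximation of the original w

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)  # stand-in for a weight tensor

err8 = np.abs(w - quantize_dequantize(w, 8)).mean()
err4 = np.abs(w - quantize_dequantize(w, 4)).mean()
print(f"INT8 mean abs error: {err8:.5f}")
print(f"INT4 mean abs error: {err4:.5f}")
```

With only 15 representable levels instead of 255, INT4 incurs a much larger rounding error per weight, which is why methods like AWQ and GPTQ add calibration on top of plain rounding.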

Computer Science > Machine Learning — arXiv:2602.17693 (cs)
[Submitted on 6 Feb 2026]

Title: A Case Study of Selected PTQ Baselines for Reasoning LLMs on Ascend NPU
Authors: Yuchen Luo, Fangyue Zhu, Ruining Zhou, Mingzhe Huang, Jian Zhu, Fanyu Fan, Wei Shao

Abstract: Post-Training Quantization (PTQ) is crucial for efficient model deployment, yet its effectiveness on Ascend NPU remains under-explored compared to GPU architectures. This paper presents a case study of representative PTQ baselines applied to reasoning-oriented models such as the DeepSeek-R1-Distill-Qwen series (1.5B/7B/14B) and QwQ-32B. We evaluate four distinct algorithms, including AWQ, GPTQ, SmoothQuant, and FlatQuant, to cover the spectrum from weight-only compression to advanced rotation-based methods. Our empirical results reveal significant platform sensitivity. While 4-bit weight-only quantization proves viable for larger models, aggressive 4-bit weight-activation schemes suffer from layer-wise calibration instability on the NPU, leading to logic collapse in long-context reasoning tasks. Conversely, standard 8-bit quantization remains numerically stable. Furthermore, a real-world INT8 deployment demonstrates that although optimized kernels reduce latency, dynamic quantization overheads currently limit end-to-end acceleration. These findings offer ...
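The dynamic-quantization overhead the abstract mentions arises because activation scales must be computed from live values at every forward pass, on top of the INT8 matmul itself. A generic per-token INT8 sketch of that runtime step (illustrative only; function name and shapes are assumptions, not the authors' NPU kernel):

```python
import numpy as np

def dynamic_per_token_int8(x):
    """Per-token dynamic INT8 activation quantization.

    The scale is derived from the live activations on every call --
    this extra reduction over each token is the runtime overhead that
    can eat into the speedup of the optimized INT8 kernels.
    """
    scale = np.max(np.abs(x), axis=-1, keepdims=True) / 127.0  # one scale per token
    scale = np.maximum(scale, 1e-8)                            # guard against all-zero rows
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

x = np.random.default_rng(1).normal(size=(4, 16)).astype(np.float32)  # (tokens, features)
q, scale = dynamic_per_token_int8(x)
x_hat = q.astype(np.float32) * scale   # dequantize to check reconstruction
print("max reconstruction error:", np.abs(x - x_hat).max())
```

Static (offline-calibrated) scales avoid this per-pass reduction but must generalize to unseen activations, which is exactly the trade-off a deployment study like this one has to measure end to end.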

