[2509.22935] Compute-Optimal Quantization-Aware Training


Summary

This paper studies compute-optimal quantization-aware training (QAT), showing how to allocate a fixed compute budget between a full-precision (FP) training phase and a subsequent QAT phase to minimize the final loss of the quantized model across model sizes and bit widths.

Why It Matters

As models grow, quantization is a key lever for serving them efficiently, and QAT is a leading way to preserve accuracy under quantization. This research gives practitioners a concrete rule for splitting a training budget between FP and QAT phases, enabling higher-quality quantized models at the same compute cost.

Key Takeaways

  • Optimal compute allocation between full-precision and QAT phases improves model accuracy.
  • The loss-optimal ratio of QAT to FP training increases with total compute.
  • A derived loss scaling law predicts optimal QAT ratios and model performance.
  • A novel cooldown and QAT fusion approach reduces redundant updates, saving compute.
  • Findings enable training of higher-quality quantized models within the same compute budget.

Computer Science > Machine Learning
arXiv:2509.22935 (cs)
[Submitted on 26 Sep 2025 (v1), last revised 26 Feb 2026 (this version, v2)]

Title: Compute-Optimal Quantization-Aware Training
Authors: Aleksandr Dremov, David Grangier, Angelos Katharopoulos, Awni Hannun

Abstract: Quantization-aware training (QAT) is a leading technique for improving the accuracy of quantized neural networks. Previous work has shown that decomposing training into a full-precision (FP) phase followed by a QAT phase yields superior accuracy compared to QAT alone. However, the optimal allocation of compute between the FP and QAT phases remains unclear. We conduct extensive experiments with various compute budgets, QAT bit widths, and model sizes from 86.0M to 2.2B to investigate how different QAT durations impact final performance. We demonstrate that, contrary to previous findings, the loss-optimal ratio of QAT to FP training increases with the total amount of compute. Moreover, the optimal fraction can be accurately predicted for a wide range of model sizes and quantization widths using the tokens-per-parameter-byte statistic. From experimental data, we derive a loss scaling law that predicts both optimal QAT ratios and final model performance across different QAT/FP compute allocation strategies and QAT bit widths. We use the scaling law to make further pred...
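To make the two quantities in the abstract concrete, here is a minimal sketch of the tokens-per-parameter-byte statistic (training tokens divided by the byte size of the quantized parameters) and of splitting a fixed token budget between the FP and QAT phases. The function names and the example numbers are illustrative assumptions, not the paper's code; the paper's actual scaling law is not reproduced here.

```python
def tokens_per_parameter_byte(total_tokens: float, num_params: float, bit_width: int) -> float:
    """Tokens seen per byte of quantized parameters (illustrative helper).

    bit_width is the QAT quantization width, so each parameter
    occupies bit_width / 8 bytes in the quantized model.
    """
    bytes_per_param = bit_width / 8.0
    return total_tokens / (num_params * bytes_per_param)


def split_token_budget(total_tokens: int, qat_fraction: float) -> tuple[int, int]:
    """Split a fixed token budget into FP-phase and QAT-phase tokens.

    qat_fraction is the share of the budget spent in the QAT phase;
    the paper predicts its loss-optimal value, here it is just an input.
    """
    qat_tokens = int(total_tokens * qat_fraction)
    fp_tokens = total_tokens - qat_tokens
    return fp_tokens, qat_tokens


# Hypothetical example: a 2.2B-parameter model (the largest size in the
# paper's sweep), 100B training tokens, 4-bit QAT, 30% of tokens on QAT.
stat = tokens_per_parameter_byte(100e9, 2.2e9, 4)
fp_tokens, qat_tokens = split_token_budget(100_000_000_000, 0.30)
```

The statistic collapses model size and bit width into one number, which is what lets a single curve predict the optimal QAT fraction across configurations.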
