[2602.20309] QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models
Summary
QuantVLA introduces a novel post-training quantization framework for Vision-Language-Action models, enhancing efficiency without additional training.
Why It Matters
As AI models grow in complexity, their deployment becomes increasingly resource-intensive. QuantVLA addresses these challenges by enabling efficient model quantization, which is crucial for practical applications in embodied intelligence, especially under stringent compute and memory constraints.
Key Takeaways
- QuantVLA is, to the authors' knowledge, the first post-training quantization method for Vision-Language-Action models, and the first to quantize a diffusion transformer (DiT) action head.
- It achieves significant memory savings (about 70%) and a 1.22x inference speedup without any additional training.
- The framework utilizes a small unlabeled calibration buffer for efficient quantization.
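The calibration step above can be sketched in a few lines. This is a minimal, generic max-abs post-training quantization example, not the paper's actual method: the function names, the symmetric 8-bit grid, and the per-channel/per-tensor split are illustrative assumptions, since the abstract does not specify the estimator.

```python
import numpy as np

def calibrate_scales(weight, calib_acts, n_bits=8):
    """Estimate symmetric quantization scales from a small unlabeled
    calibration buffer (max-abs heuristic; illustrative only).

    weight:     (out_channels, in_channels) float array
    calib_acts: list of activation arrays collected on calibration inputs
    """
    qmax = 2 ** (n_bits - 1) - 1  # 127 for 8-bit symmetric
    # Per-output-channel weight scales from the weight tensor itself.
    w_scale = np.abs(weight).max(axis=1) / qmax
    # Per-tensor activation scale from calibration statistics.
    a_scale = max(np.abs(a).max() for a in calib_acts) / qmax
    return w_scale, a_scale

def quant_dequant(x, scale):
    """Fake-quantize: snap to the integer grid, then map back to float."""
    return np.clip(np.round(x / scale), -128, 127) * scale
```

Because the scales are derived from observed maxima, the round-trip error of `quant_dequant` is bounded by half a quantization step per channel, which is the basic accuracy/efficiency trade-off PTQ methods tune.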
Computer Science > Machine Learning
arXiv:2602.20309 (cs) [Submitted on 23 Feb 2026]
Title: QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models
Authors: Jingxuan Zhang, Yunta Hsieh, Zhongwei Wang, Haokun Lin, Xin Wang, Ziqi Wang, Yingtie Lei, Mi Zhang
Abstract: Vision-language-action (VLA) models unify perception, language, and control for embodied agents but face significant challenges in practical deployment due to rapidly increasing compute and memory demands, especially as models scale to longer horizons and larger backbones. To address these bottlenecks, we introduce QuantVLA, a training-free post-training quantization (PTQ) framework that, to our knowledge, is the first PTQ approach for VLA systems and the first to successfully quantize a diffusion transformer (DiT) action head. QuantVLA incorporates three scale-calibrated components: (1) a selective quantization layout that integerizes all linear layers in both the language backbone and the DiT while keeping attention projections in floating point to preserve the original operator schedule; (2) attention temperature matching, a lightweight per-head scaling mechanism that stabilizes attention logits and is folded into the dequantization scales at inference; and (3) output head balancing, a per-layer r...
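Of the three components, attention temperature matching is the most concrete in the abstract: a per-head scale that stabilizes attention logits after quantization and is folded into the dequantization scales. A minimal sketch of one plausible version, using std-matching against a floating-point reference (the paper's exact matching criterion is not given in the truncated abstract, and `head_temperatures` is a hypothetical name):

```python
import numpy as np

def head_temperatures(logits_fp, logits_q, eps=1e-8):
    """Per-head temperatures that rescale quantized attention logits to
    match the floating-point reference statistics (std-matching heuristic).

    logits_fp, logits_q: (heads, tokens, tokens) attention logits computed
    in floating point and after quantization, respectively.
    """
    # One scalar per head; at inference this factor can be folded into the
    # per-head dequantization scale instead of being applied separately.
    return logits_fp.std(axis=(1, 2)) / (logits_q.std(axis=(1, 2)) + eps)
```

Folding the temperature into the dequantization scale is attractive because it adds no extra operator to the attention schedule, which matches the paper's stated goal of preserving the original operator layout.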