[2602.22268] AutoQRA: Joint Optimization of Mixed-Precision Quantization and Low-rank Adapters for Efficient LLM Fine-Tuning
Summary
The paper presents AutoQRA, a framework that optimizes mixed-precision quantization and low-rank adapters for efficient fine-tuning of large language models, achieving performance close to full precision with reduced memory usage.
Why It Matters
As machine learning models grow in size, optimizing their performance while managing memory constraints becomes critical. AutoQRA addresses this by enhancing the fine-tuning process, making it more efficient and accessible for practitioners working with large language models.
Key Takeaways
- AutoQRA optimizes both quantization bit-width and LoRA rank simultaneously.
- The framework employs a two-stage optimization process to efficiently navigate the search space.
- Experiments demonstrate that AutoQRA can achieve near full-precision performance with a significantly reduced memory footprint.
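To make the joint search concrete, here is a minimal sketch of picking a per-layer (bit-width, LoRA rank) pair under a memory budget. The candidate grids, the memory model, and the `proxy_score` quality signal are all illustrative assumptions, not the paper's actual two-stage method, which navigates this space far more efficiently than exhaustive enumeration.

```python
from itertools import product

# Hypothetical candidate grids (not taken from the paper).
BITS = [2, 4, 8]
RANKS = [4, 8, 16]

def memory_cost(n_params, dim, bits, rank):
    """Approximate bytes: quantized weights plus two fp16 LoRA factors."""
    weight_bytes = n_params * bits / 8
    lora_bytes = 2 * dim * rank * 2  # A (dim x rank) and B (rank x dim) in fp16
    return weight_bytes + lora_bytes

def proxy_score(bits, rank):
    """Toy stand-in for fine-tuning quality; AutoQRA uses a learned two-stage
    search rather than a fixed heuristic like this."""
    return bits + 0.1 * rank

def best_config(n_params, dim, budget_bytes):
    """Exhaustively pick the feasible (bits, rank) pair with the best score."""
    feasible = [
        (bits, rank)
        for bits, rank in product(BITS, RANKS)
        if memory_cost(n_params, dim, bits, rank) <= budget_bytes
    ]
    return max(feasible, key=lambda c: proxy_score(*c), default=None)

# For a 4096x4096 layer under a 10 MB budget, 8-bit weights alone exceed
# the budget, so the search trades bits for a larger adapter rank.
print(best_config(4096 * 4096, 4096, 10_000_000))
```

Even this toy version shows the interaction the paper targets: under a fixed budget, lowering the bit-width frees memory that can be spent on a higher LoRA rank, so the two choices cannot be optimized independently.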
arXiv:2602.22268 (cs) — Computer Science > Machine Learning
Submitted on 25 Feb 2026
Title: AutoQRA: Joint Optimization of Mixed-Precision Quantization and Low-rank Adapters for Efficient LLM Fine-Tuning
Authors: Changhai Zhou, Shiyang Zhang, Yuhua Zhou, Qian Qiao, Jun Gao, Cheng Jin, Kaizhou Qin, Weizhong Zhang
Abstract: Quantization followed by parameter-efficient fine-tuning has emerged as a promising paradigm for downstream adaptation under tight GPU memory constraints. However, this sequential pipeline fails to leverage the intricate interaction between quantization bit-width and LoRA rank. Specifically, a carefully optimized quantization allocation with low quantization error does not always translate to strong fine-tuning performance, and different bit-width and rank configurations can lead to significantly varying outcomes under the same memory budget. To address this limitation, we propose AutoQRA, a joint optimization framework that simultaneously optimizes the bit-width and LoRA rank configuration for each layer during the mixed quantized fine-tuning process. To tackle the challenges posed by the large discrete search space and the high evaluation cost associated with frequent fine-tuning iterations, AutoQRA decomposes the optimization process into two stages...