[2512.03383] UniQL: Unified Quantization and Low-rank Compression for Adaptive Edge LLMs
Summary
The paper presents UniQL, a unified framework for post-training quantization and low-rank compression of large language models (LLMs) tailored for edge devices, reducing memory footprint while preserving accuracy.
Why It Matters
As mobile platforms increasingly deploy large language models, optimizing their performance while managing resource constraints is crucial. UniQL addresses these challenges by integrating advanced compression techniques, making it relevant for developers and researchers focused on edge AI applications.
Key Takeaways
- UniQL integrates post-training quantization and low-rank compression for edge LLMs.
- The framework allows on-device configurable pruning rates, enhancing adaptability.
- Experiments show a memory reduction of 4x-5.7x with minimal accuracy loss.
- An efficient structured weight-sorting method speeds up the compression computation by 20x.
- The framework supports various model types, including Transformers and State Space Models.
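The takeaways above combine two standard compression ideas. As a rough illustration only (this is not the paper's actual algorithm, and the function names `lowrank_quantize` and `dequant_matmul` are hypothetical), a weight matrix can be factored with truncated SVD and each low-rank factor then quantized to int8:

```python
import numpy as np

def lowrank_quantize(W, rank):
    """Truncated SVD followed by symmetric per-tensor int8 quantization."""
    # W ≈ A @ B, keeping only the top-`rank` singular components.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # (m, r) left factor, singular values folded in
    B = Vt[:rank, :]             # (r, n) right factor

    def quant(x):
        # Symmetric quantization: map [-max|x|, max|x|] onto [-127, 127].
        scale = np.abs(x).max() / 127.0
        return np.round(x / scale).astype(np.int8), scale

    (qa, sa), (qb, sb) = quant(A), quant(B)
    return qa, sa, qb, sb

def dequant_matmul(qa, sa, qb, sb):
    """Dequantize both factors and reconstruct the approximate weight."""
    return (qa.astype(np.float32) * sa) @ (qb.astype(np.float32) * sb)
```

For an m x n fp32 matrix compressed to rank r, storage drops from 4*m*n bytes to roughly (m + n)*r bytes, which is where multi-fold memory reductions like the reported 4x-5.7x come from (the paper's exact quantization scheme and rank selection will differ).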
Computer Science > Machine Learning
arXiv:2512.03383 (cs)
[Submitted on 3 Dec 2025 (v1), last revised 26 Feb 2026 (this version, v3)]
Title: UniQL: Unified Quantization and Low-rank Compression for Adaptive Edge LLMs
Authors: Hung-Yueh Chiang, Chi-Chih Chang, Yu-Chen Lu, Chien-Yu Lin, Kai-Chiang Wu, Mohamed S. Abdelfattah, Diana Marculescu
Abstract: Deploying large language models (LLMs) on mobile platforms faces significant challenges due to the limited memory and shared computational resources of the device. Resource availability may be an issue as it is directly impacted by the current device workload, adding to the uncertainty of model deployment. We introduce UniQL, a unified post-training quantization and low-rank compression framework with on-device configurable pruning rates for edge LLMs. UniQL is a general framework that integrates quantization and low-rank compression for Transformers, State Space Models (SSMs), and hybrid models to support diverse edge applications. In our proposed joint framework, we introduce an efficient structured weight-sorting method that speeds up computation by 20x, quantization-aware singular value decomposition (SVD) to minimize quantization errors, state-aware weight sorting for SSMs, and a fused rotary positional embedding (RoPE) kernel for pruned models. Our fr...
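The on-device configurable pruning rate mentioned in the abstract depends on components being stored in order of importance, so the effective rank can be chosen at inference time without re-compressing. A minimal sketch of that idea (the class name `AdaptiveLowRankLinear` is hypothetical, and the paper's structured weight sorting, quantization-aware SVD, and fused kernels are not reproduced here):

```python
import numpy as np

class AdaptiveLowRankLinear:
    """Low-rank layer whose effective rank can be chosen per-call.

    np.linalg.svd returns singular values in descending order, so slicing
    the first r components always keeps the most important directions --
    a crude stand-in for importance-sorted weights.
    """

    def __init__(self, W, max_rank):
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        self.A = (U[:, :max_rank] * s[:max_rank]).astype(np.float32)  # (m, max_rank)
        self.B = Vt[:max_rank, :].astype(np.float32)                  # (max_rank, n)

    def forward(self, x, rank):
        # Prune to `rank` components on the fly: two skinny matmuls
        # whose cost and memory traffic scale linearly with `rank`.
        return (x @ self.A[:, :rank]) @ self.B[:rank, :]
```

Under load, a runtime could lower `rank` to shrink compute and memory traffic, and raise it again when resources free up, trading accuracy for latency without touching the stored weights.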