[2512.03383] UniQL: Unified Quantization and Low-rank Compression for Adaptive Edge LLMs

arXiv - Machine Learning

Summary

The paper presents UniQL, a unified framework for quantization and low-rank compression of large language models (LLMs) tailored for edge devices, enhancing efficiency and performance.

Why It Matters

As mobile platforms increasingly deploy large language models, optimizing their performance while managing resource constraints is crucial. UniQL addresses these challenges by integrating advanced compression techniques, making it relevant for developers and researchers focused on edge AI applications.

Key Takeaways

  • UniQL integrates post-training quantization and low-rank compression for edge LLMs.
  • The framework allows on-device configurable pruning rates, enhancing adaptability.
  • Experiments show a memory reduction of 4x-5.7x with minimal accuracy loss.
  • Efficient weight-sorting methods improve computation speed by 20x.
  • The framework supports various model types, including Transformers and State Space Models.
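The takeaways above pair pre-computed weight sorting with on-device configurable pruning rates. As a rough illustration of why sorting helps, here is a minimal NumPy sketch: channels are ranked once offline, so changing the pruning rate at run time is just a slice, with no re-sorting. The L2-norm importance criterion and the function names are assumptions for illustration, not the paper's actual sorting method:

```python
import numpy as np

def sort_channels_by_norm(W):
    # Rank output channels once, offline, by descending L2 norm
    # (a common importance proxy; UniQL's criterion may differ).
    order = np.argsort(-np.linalg.norm(W, axis=1))
    return W[order]

def prune_sorted(W_sorted, prune_rate):
    # At run time, keep the top (1 - prune_rate) fraction of the
    # pre-sorted channels; adjusting prune_rate needs no re-sorting,
    # which is what makes the pruning rate configurable on-device.
    keep = int(W_sorted.shape[0] * (1.0 - prune_rate))
    return W_sorted[:keep]

rng = np.random.default_rng(1)
W = rng.standard_normal((128, 64)).astype(np.float32)
W_sorted = sort_channels_by_norm(W)
W_half = prune_sorted(W_sorted, prune_rate=0.5)  # 64 channels kept
```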

Computer Science > Machine Learning
arXiv:2512.03383 (cs)
[Submitted on 3 Dec 2025 (v1), last revised 26 Feb 2026 (this version, v3)]

Title: UniQL: Unified Quantization and Low-rank Compression for Adaptive Edge LLMs
Authors: Hung-Yueh Chiang, Chi-Chih Chang, Yu-Chen Lu, Chien-Yu Lin, Kai-Chiang Wu, Mohamed S. Abdelfattah, Diana Marculescu

Abstract: Deploying large language models (LLMs) on mobile platforms faces significant challenges due to the limited memory and shared computational resources of the device. Resource availability may be an issue as it is directly impacted by the current device workload, adding to the uncertainty of model deployment. We introduce UniQL, a unified post-training quantization and low-rank compression framework with on-device configurable pruning rates for edge LLMs. UniQL is a general framework that integrates quantization and low-rank compression for Transformers, State Space Models (SSMs), and hybrid models to support diverse edge applications. In our proposed joint framework, we introduce an efficient structured weight-sorting method that speeds up computation by 20x, quantization-aware singular value decomposition (SVD) to minimize quantization errors, state-aware weight sorting for SSMs, and a fused rotary positional embedding (RoPE) kernel for pruned models. Our fr...
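To make the combination of low-rank compression and quantization concrete, here is a minimal NumPy sketch: approximate a weight matrix with a truncated SVD, then quantize both low-rank factors to int8. This is an illustrative sketch only, not UniQL's method; the paper's quantization-aware SVD explicitly minimizes quantization error during the decomposition, and all function names below are hypothetical:

```python
import numpy as np

def lowrank_quantize(W, rank, n_bits=8):
    """Rank-`rank` SVD approximation of W, with each factor quantized
    to signed n_bits integers using a single per-matrix scale."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (m, r) left factor, singular values folded in
    B = Vt[:rank, :]             # (r, n) right factor

    def quantize(M):
        qmax = 2 ** (n_bits - 1) - 1
        scale = np.abs(M).max() / qmax
        q = np.clip(np.round(M / scale), -qmax, qmax).astype(np.int8)
        return q, scale

    (qA, sA), (qB, sB) = quantize(A), quantize(B)
    return qA, sA, qB, sB

def reconstruct(qA, sA, qB, sB):
    # Dequantize and multiply the factors back into a dense weight.
    return (qA.astype(np.float32) * sA) @ (qB.astype(np.float32) * sB)

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)
qA, sA, qB, sB = lowrank_quantize(W, rank=64)
W_hat = reconstruct(qA, sA, qB, sB)
rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
```

Even this naive version shows where the memory savings come from: the fp32 matrix takes 256*256*4 bytes, while the two int8 factors take 256*64 + 64*256 bytes, an 8x reduction before any further packing.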
