[2602.14452] WiSparse: Boosting LLM Inference Efficiency with Weight-Aware Mixed Activation Sparsity
Summary
The paper presents WiSparse, a training-free method that improves large language model (LLM) inference efficiency through weight-aware, mixed-granularity activation sparsity, achieving meaningful end-to-end speedups without any retraining.
Why It Matters
As LLMs become integral to more applications, their high inference costs pose a practical barrier. WiSparse tackles this by optimizing activation sparsity at inference time, with no retraining required, making faster and more cost-effective LLM serving accessible to developers and researchers.
Key Takeaways
- WiSparse improves LLM inference efficiency by integrating weight and activation information.
- The method achieves a 21.4% acceleration in end-to-end inference speed at 50% sparsity.
- WiSparse retains 97% of Llama3.1's performance, surpassing existing baselines.
- The approach uses a mixed-granularity allocation scheme for optimal resource distribution.
- This research pushes the boundaries of training-free methods for LLMs.
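The first takeaway, combining weight and activation information to decide which channels to keep, can be sketched in a few lines. The paper does not spell out its exact scoring rule in this summary, so the rule below (activation magnitude times the precomputed L2 norm of the matching weight column) and the names `weight_aware_scores` and `sparsify` are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def weight_aware_scores(x, W):
    """Score each input channel by combining its activation magnitude with
    the (precomputed) L2 norm of the corresponding weight column. A small
    activation can still rank highly if it feeds large weights, the
    interplay the paper highlights. Hypothetical scoring rule."""
    col_norms = np.linalg.norm(W, axis=0)  # precompute once per layer
    return np.abs(x) * col_norms

def sparsify(x, W, sparsity=0.5):
    """Zero out the lowest-scoring channels at the given sparsity ratio."""
    scores = weight_aware_scores(x, W)
    k = int(len(x) * sparsity)             # number of channels to drop
    drop = np.argsort(scores)[:k]          # indices of least salient channels
    x_sparse = x.copy()
    x_sparse[drop] = 0.0
    return x_sparse
```

Note how a purely activation-based rule would drop channel 0 in the usage below (its activation is tiny), whereas the weight-aware score keeps it because its weight column is large.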
Computer Science > Machine Learning
arXiv:2602.14452 (cs)
[Submitted on 16 Feb 2026]
Title: WiSparse: Boosting LLM Inference Efficiency with Weight-Aware Mixed Activation Sparsity
Authors: Lei Chen, Yuan Meng, Xiaoyu Zhan, Zhi Wang, Wenwu Zhu
Abstract: Large Language Models (LLMs) offer strong capabilities but incur high inference costs due to dense computation and memory access. Training-free activation sparsity is a promising approach for efficient LLM inference, yet existing methods often rely solely on activation information and uniform sparsity ratios. This overlooks the critical interplay with weights and inter-block sensitivity variation, leading to suboptimal performance. We identify two key phenomena in modern LLMs: 1) less significant activations may align with highly important weights, and 2) sparsity sensitivity varies non-monotonically across model blocks. We propose Weight-aware Mixed-Granularity Training-free Activation Sparsity (WiSparse), which leverages both activation and weight information for adaptive sparsity allocation. Specifically, we introduce a weight-aware mechanism integrating activation magnitudes with precomputed weight norms to accurately identify salient channels. This is combined with a mixed-granularity allocation scheme: a global budget is distributed across bloc...
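The abstract cuts off while describing the mixed-granularity allocation scheme, so the details of how the global budget is split are not available here. As a heavily hedged sketch only: one simple way to distribute a global sparsity budget across blocks so that more sensitive blocks receive lower sparsity is inverse-sensitivity weighting, rescaled to match the global average. The function name `allocate_sparsity`, the weighting, and the clipping bounds are all assumptions for illustration:

```python
import numpy as np

def allocate_sparsity(sensitivity, global_sparsity=0.5, lo=0.2, hi=0.8):
    """Distribute a global sparsity budget over blocks: more sensitive
    blocks get lower per-block sparsity. Illustrative scheme, not the
    paper's actual allocation algorithm."""
    s = np.asarray(sensitivity, dtype=float)
    inv = 1.0 / (s + 1e-8)                     # inverse-sensitivity weights
    ratios = global_sparsity * inv / inv.mean()  # mean equals the global budget
    # Clipping keeps ratios in a sane range; it can perturb the mean slightly.
    return np.clip(ratios, lo, hi)
```

With uniform sensitivity every block gets exactly the global ratio; with non-uniform sensitivity the mean still matches the budget while sensitive blocks are sparsified less.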