[2602.03295] POP: Prefill-Only Pruning for Efficient Large Model Inference
Computer Science > Computation and Language

arXiv:2602.03295 (cs)

[Submitted on 3 Feb 2026 (v1), last revised 16 Apr 2026 (this version, v2)]

Title: POP: Prefill-Only Pruning for Efficient Large Model Inference

Authors: Junhui He, Zhihui Fu, Jun Wang, Qingan Li

Abstract: Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated remarkable capabilities. However, their deployment is hindered by significant computational costs. Existing structured pruning methods, while hardware-efficient, often suffer from significant accuracy degradation. In this paper, we argue that this failure stems from a stage-agnostic pruning approach that overlooks the asymmetric roles between the prefill and decode stages. By introducing a virtual gate mechanism, our importance analysis reveals that deep layers are critical for next-token prediction (decode) but largely redundant for context encoding (prefill). Leveraging this insight, we propose Prefill-Only Pruning (POP), a stage-aware inference strategy that safely omits deep layers during the computationally intensive prefill stage while retaining the full model for the sensitive decode stage. To enable the transition between stages, we introduce independent Key-Value (KV) projections to maintain cache integrity, and a boundary handling strategy to ensure the accuracy of t...
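The stage-aware idea in the abstract can be sketched in a toy form: run only the shallow layers during prefill, but fill the KV cache for every layer via independent KV projections so that the full-depth decode stage finds a complete cache. The sketch below is an illustrative assumption based solely on the abstract; all names (`prefill`, `decode_step`, `kv_proj`, `KEEP`) are hypothetical, and the toy "attention" stands in for a real transformer layer.

```python
# Hypothetical sketch of POP-style stage-aware inference.
# Everything here is an illustrative assumption, not the paper's code.
import numpy as np

rng = np.random.default_rng(0)
D, N_LAYERS, KEEP = 8, 6, 3  # hidden dim, total layers, shallow layers run at prefill

# Toy "layers": each is a weight matrix. Every layer also gets an
# independent KV projection so its cache can be filled from the
# shallow output during prefill (assumption based on the abstract).
layers = [rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(N_LAYERS)]
kv_proj = [rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(N_LAYERS)]

def prefill(x):
    """Run only the shallow layers; fill the KV cache for every layer."""
    cache = []
    h = x
    for i in range(N_LAYERS):
        if i < KEEP:
            h = np.tanh(h @ layers[i])   # full computation for shallow layers only
        # deep layers are skipped, but their independent KV projection
        # still produces a cache entry of the expected shape
        cache.append(h @ kv_proj[i])
    return h, cache

def decode_step(h, cache):
    """Decode uses the full model, reading the prefilled cache."""
    for i in range(N_LAYERS):
        # toy "attention": mix the current state with that layer's cached context
        h = np.tanh((h + cache[i].mean(axis=0)) @ layers[i])
    return h

prompt = rng.normal(size=(5, D))         # 5 prompt tokens
h, cache = prefill(prompt)               # only KEEP layers actually computed
out = decode_step(h[-1], cache)          # all N_LAYERS layers run at decode
print(out.shape)                         # (8,)
```

The point of the sketch is the asymmetry: prefill cost scales with `KEEP` layers over the whole prompt, while decode still traverses all `N_LAYERS` per token, matching the abstract's claim that deep layers matter for next-token prediction but are largely redundant for context encoding.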