[2602.03295] POP: Prefill-Only Pruning for Efficient Large Model Inference


Computer Science > Computation and Language
arXiv:2602.03295 (cs) [Submitted on 3 Feb 2026 (v1); last revised 16 Apr 2026 (this version, v2)]

Title: POP: Prefill-Only Pruning for Efficient Large Model Inference
Authors: Junhui He, Zhihui Fu, Jun Wang, Qingan Li

Abstract: Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated remarkable capabilities. However, their deployment is hindered by significant computational costs. Existing structured pruning methods, while hardware-efficient, often suffer from significant accuracy degradation. In this paper, we argue that this failure stems from a stage-agnostic pruning approach that overlooks the asymmetric roles of the prefill and decode stages. By introducing a virtual gate mechanism, our importance analysis reveals that deep layers are critical for next-token prediction (decode) but largely redundant for context encoding (prefill). Leveraging this insight, we propose Prefill-Only Pruning (POP), a stage-aware inference strategy that safely omits deep layers during the computationally intensive prefill stage while retaining the full model for the sensitive decode stage. To enable the transition between stages, we introduce independent Key-Value (KV) projections to maintain cache integrity, and a boundary handling strategy to ensure the accuracy of t...
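The stage-aware idea in the abstract can be illustrated with a minimal sketch. This is a toy model, not the paper's implementation: all shapes, layer math, and the attention rule are hypothetical. It shows the key mechanism, though: prefill runs only the shallow layers, yet every layer, including the skipped deep ones, still gets a KV cache entry via its own lightweight projection, so decoding with the full stack finds a complete cache.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_LAYERS, PREFILL_DEPTH = 8, 6, 3  # hypothetical sizes

# Per-layer weights: a main transform plus independent K/V projections,
# so that layers skipped at prefill can still populate the KV cache.
W  = [rng.standard_normal((D, D)) * 0.1 for _ in range(N_LAYERS)]
Wk = [rng.standard_normal((D, D)) * 0.1 for _ in range(N_LAYERS)]
Wv = [rng.standard_normal((D, D)) * 0.1 for _ in range(N_LAYERS)]

def prefill(ctx, kv_cache):
    """Run only the shallow layers; fill deep-layer KV from the last hidden state."""
    h = ctx
    for i in range(N_LAYERS):
        if i < PREFILL_DEPTH:
            h = h + np.tanh(h @ W[i])              # full layer compute
        # Every layer, skipped or not, writes its cache entry via its
        # own K/V projection of the deepest state actually computed.
        kv_cache[i] = (h @ Wk[i], h @ Wv[i])
    return h

def decode_step(x, kv_cache):
    """Decode a single token with the FULL stack, reading the cache at every layer."""
    h = x
    for i in range(N_LAYERS):
        k, v = kv_cache[i]                         # cache exists for all layers
        scores = np.tanh((h * k).sum(-1))          # toy attention scores, (ctx_len,)
        attn = (scores[:, None] * v).mean(0, keepdims=True)
        h = h + np.tanh(h @ W[i]) + 0.1 * attn
    return h

cache = {}
ctx = rng.standard_normal((4, D))                  # 4 context tokens
prefill(ctx, cache)
assert len(cache) == N_LAYERS                      # cache complete despite skipping
out = decode_step(rng.standard_normal((1, D)), cache)
print(out.shape)  # (1, 8)
```

Prefill here costs roughly `PREFILL_DEPTH / N_LAYERS` of the full forward pass over the context, which is where the savings come from; the decode path is untouched, matching the abstract's claim that decode is the accuracy-sensitive stage.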

Originally published on April 17, 2026. Curated by AI News.

Related Articles

[2603.13683] Preconditioned Test-Time Adaptation for Out-of-Distribution Debiasing in Narrative Generation (arXiv - AI · 3 min)

[2601.15488] Multi-Persona Thinking for Bias Mitigation in Large Language Models (arXiv - AI · 3 min)

[2601.14724] HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding (arXiv - AI · 4 min)

[2601.10120] TopoDIM: One-shot Topology Generation of Diverse Interaction Modes for Multi-Agent Systems (arXiv - AI · 3 min)