[2602.03295] POP: Prefill-Only Pruning for Efficient Large Model Inference


Computer Science > Computation and Language
arXiv:2602.03295 (cs) [Submitted on 3 Feb 2026 (v1); last revised 16 Apr 2026 (this version, v2)]

Title: POP: Prefill-Only Pruning for Efficient Large Model Inference
Authors: Junhui He, Zhihui Fu, Jun Wang, Qingan Li

Abstract: Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated remarkable capabilities. However, their deployment is hindered by significant computational costs. Existing structured pruning methods, while hardware-efficient, often suffer from significant accuracy degradation. In this paper, we argue that this failure stems from a stage-agnostic pruning approach that overlooks the asymmetric roles of the prefill and decode stages. By introducing a virtual gate mechanism, our importance analysis reveals that deep layers are critical for next-token prediction (decode) but largely redundant for context encoding (prefill). Leveraging this insight, we propose Prefill-Only Pruning (POP), a stage-aware inference strategy that safely omits deep layers during the computationally intensive prefill stage while retaining the full model for the sensitive decode stage. To enable the transition between stages, we introduce independent Key-Value (KV) projections to maintain cache integrity, and a boundary handling strategy to ensure the accuracy of t...
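The stage-aware idea in the abstract can be illustrated with a minimal sketch. This is a toy model, not the paper's implementation: all shapes, layer math, and the attention rule are hypothetical. It shows the key mechanism, though: prefill runs only the shallow layers, yet every layer, including the skipped deep ones, still gets a KV cache entry via its own lightweight projection, so decoding with the full stack finds a complete cache.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_LAYERS, PREFILL_DEPTH = 8, 6, 3  # hypothetical sizes

# Per-layer weights: a main transform plus independent K/V projections,
# so that layers skipped at prefill can still populate the KV cache.
W  = [rng.standard_normal((D, D)) * 0.1 for _ in range(N_LAYERS)]
Wk = [rng.standard_normal((D, D)) * 0.1 for _ in range(N_LAYERS)]
Wv = [rng.standard_normal((D, D)) * 0.1 for _ in range(N_LAYERS)]

def prefill(ctx, kv_cache):
    """Run only the shallow layers; fill deep-layer KV from the last hidden state."""
    h = ctx
    for i in range(N_LAYERS):
        if i < PREFILL_DEPTH:
            h = h + np.tanh(h @ W[i])              # full layer compute
        # Every layer, skipped or not, writes its cache entry via its
        # own K/V projection of the deepest state actually computed.
        kv_cache[i] = (h @ Wk[i], h @ Wv[i])
    return h

def decode_step(x, kv_cache):
    """Decode a single token with the FULL stack, reading the cache at every layer."""
    h = x
    for i in range(N_LAYERS):
        k, v = kv_cache[i]                         # cache exists for all layers
        scores = np.tanh((h * k).sum(-1))          # toy attention scores, (ctx_len,)
        attn = (scores[:, None] * v).mean(0, keepdims=True)
        h = h + np.tanh(h @ W[i]) + 0.1 * attn
    return h

cache = {}
ctx = rng.standard_normal((4, D))                  # 4 context tokens
prefill(ctx, cache)
assert len(cache) == N_LAYERS                      # cache complete despite skipping
out = decode_step(rng.standard_normal((1, D)), cache)
print(out.shape)  # (1, 8)
```

Prefill here costs roughly `PREFILL_DEPTH / N_LAYERS` of the full forward pass over the context, which is where the savings come from; the decode path is untouched, matching the abstract's claim that decode is the accuracy-sensitive stage.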

Originally published on April 17, 2026. Curated by AI News.

Related Articles

[2603.13683] Preconditioned Test-Time Adaptation for Out-of-Distribution Debiasing in Narrative Generation (arXiv - AI · 3 min)

[2601.15488] Multi-Persona Thinking for Bias Mitigation in Large Language Models (arXiv - AI · 3 min)

[2601.14724] HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding (arXiv - AI · 4 min)

[2601.10120] TopoDIM: One-shot Topology Generation of Diverse Interaction Modes for Multi-Agent Systems (arXiv - AI · 3 min)