[2602.13980] Cognitive Chunking for Soft Prompts: Accelerating Compressor Learning via Block-wise Causal Masking
Summary
This article presents a novel method called Parallelized Iterative Compression (PIC) for enhancing soft prompt compression in Large Language Models (LLMs), significantly improving training efficiency and performance in various tasks.
Why It Matters
As LLMs become integral to AI applications, optimizing their performance and reducing latency is crucial. This research addresses the challenge of context compression, offering a solution that enhances model efficiency and effectiveness, which is vital for real-world applications.
Key Takeaways
- PIC improves soft prompt compression by focusing on local chunks, enhancing training efficiency.
- The method reduces training time by approximately 40% while achieving better performance metrics.
- Significant improvements in F1 and EM scores demonstrate PIC's effectiveness in high compression scenarios.
Computer Science > Artificial Intelligence
arXiv:2602.13980 (cs) [Submitted on 15 Feb 2026]
Title: Cognitive Chunking for Soft Prompts: Accelerating Compressor Learning via Block-wise Causal Masking
Authors: Guojie Liu, Yiqi Wang, Yanfeng Yang, Wenqi Fan, Songlei Jian, Jianfeng Zhang, Jie Yu
Abstract: Providing extensive context via prompting is vital for leveraging the capabilities of Large Language Models (LLMs). However, lengthy contexts significantly increase inference latency, as the computational cost of self-attention grows quadratically with sequence length. To mitigate this issue, context compression, particularly soft prompt compression, has emerged as a widely studied solution: it converts long contexts into shorter memory embeddings via a trained compressor. Existing methods typically compress the entire context indiscriminately into a set of memory tokens, requiring the compressor to capture global dependencies and necessitating extensive pre-training data to learn effective patterns. Inspired by the chunking mechanism in human working memory and by empirical observations of the spatial specialization of memory embeddings relative to original tokens, we propose Parallelized Iterative Compression (PIC). By simply modifying the Transformer's attention mask, PIC explicitly rest...
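The abstract is truncated before the mask construction is spelled out, so the following is only a minimal sketch of what a block-wise causal attention mask can look like. It assumes one common interpretation: each token may attend only to earlier tokens within its own chunk, so the compressor learns local rather than global dependencies. The function name `block_causal_mask` and the chunk-size parameter are illustrative, not from the paper.

```python
def block_causal_mask(seq_len, block_size):
    """Build a boolean attention mask of shape (seq_len, seq_len).

    mask[i][j] is True iff token i is allowed to attend to token j,
    i.e. j is in the same chunk as i (chunks of `block_size` tokens)
    and j is not in i's future (causal within the chunk).
    """
    return [
        [
            (j // block_size == i // block_size) and j <= i
            for j in range(seq_len)
        ]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    # With seq_len=4 and block_size=2, token 3 can see tokens 2 and 3,
    # but not tokens 0 and 1 in the previous chunk.
    for row in block_causal_mask(4, 2):
        print(["x" if allowed else "." for allowed in row])
```

In a real Transformer this mask would be converted to additive form (0 for allowed positions, a large negative value for blocked ones) and added to the attention logits before the softmax; the key point in PIC is that the restriction comes purely from the mask, with no change to the model architecture itself.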