[2602.09789] When Less is More: The LLM Scaling Paradox in Context Compression
Summary
The paper explores the paradox of scaling large language models (LLMs) in context compression, revealing that larger models may reduce the fidelity of reconstructed contexts despite lower training loss.
Why It Matters
Understanding the limitations of scaling LLMs is crucial for researchers and developers in machine learning and NLP. This study challenges the conventional belief that larger models always yield better performance, highlighting the need for careful consideration of model size and context fidelity in AI applications.
Key Takeaways
- Larger LLMs may compromise the accuracy of context reconstruction.
- The Size-Fidelity Paradox arises from knowledge overwriting and semantic drift.
- Increased model size leads to higher generative uncertainty and prior knowledge intrusion.
- The study emphasizes the importance of context fidelity over mere parameter count.
- Conventional scaling laws may not hold for faithful context preservation in open-ended generation.
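The fidelity failures above can be checked mechanically. Below is a minimal sketch of one way to score a reconstruction against its source and flag likely knowledge overwriting; the function names and the substring heuristic are illustrative assumptions, not the paper's actual metric:

```python
from difflib import SequenceMatcher

def reconstruction_fidelity(source: str, reconstruction: str) -> float:
    """Crude verbatim-fidelity score in [0, 1]: ratio of matching
    character spans between source and reconstruction."""
    return SequenceMatcher(None, source, reconstruction).ratio()

def flag_substitutions(source_facts: list[str], reconstruction: str) -> list[str]:
    """Return source facts that no longer appear verbatim in the
    reconstruction -- a rough proxy for knowledge overwriting."""
    lowered = reconstruction.lower()
    return [fact for fact in source_facts if fact.lower() not in lowered]

# Toy example mirroring the paper's illustrations.
src = "The white strawberry grew in the garden. Alice hit Bob."
rec = "The red strawberry grew in the garden. Bob hit Alice."

print(reconstruction_fidelity(src, rec))  # high similarity, yet facts flipped
print(flag_substitutions(["white strawberry", "Alice hit Bob"], rec))
# -> ['white strawberry', 'Alice hit Bob']
```

Note how a high surface-similarity score can coexist with both flagged facts being lost, which is exactly why the paper argues training loss alone is a misleading signal for fidelity.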
Computer Science > Machine Learning
arXiv:2602.09789 (cs)
[Submitted on 10 Feb 2026 (v1), last revised 26 Feb 2026 (this version, v2)]
Title: When Less is More: The LLM Scaling Paradox in Context Compression
Authors: Ruishan Guo, Yibing Liu, Guoxin Ma, Yan Wang, Yueyang Zhang, Long Xia, Kecheng Chen, Zhiyuan Sun, Daiting Shi
Abstract: Scaling up model parameters has long been a prevalent training paradigm, driven by the assumption that larger models yield superior generation capabilities. However, under lossy context compression in a compressor-decoder setup, we observe a Size-Fidelity Paradox: increasing the compressor size can lessen the faithfulness of reconstructed contexts even though training loss decreases. Through extensive experiments across models from 0.6B to 90B parameters, we attribute this paradox to two dominant factors: 1) knowledge overwriting: larger models increasingly replace source facts with their own prior beliefs, e.g., "the white strawberry" → "the red strawberry"; and 2) semantic drift: larger models tend to paraphrase or restructure content instead of reproducing it verbatim, e.g., "Alice hit Bob" → "Bob hit Alice". By holding model size fixed, we reflect on the emergent properties of compressed context representations. We show that the culprit is not parameter count itself, but the exc...