[2510.02375] Pretraining with hierarchical memories: separating long-tail and common knowledge
Computer Science > Computation and Language
arXiv:2510.02375 (cs)
[Submitted on 29 Sep 2025 (v1), last revised 23 Mar 2026 (this version, v3)]

Title: Pretraining with hierarchical memories: separating long-tail and common knowledge
Authors: Hadi Pouransari, David Grangier, C Thomas, Michael Kirchhof, Oncel Tuzel

Abstract: The impressive performance gains of modern language models currently rely on scaling parameters: larger models store more world knowledge and reason better. Yet compressing all world knowledge into parameters is unnecessary, as only a fraction is used per prompt, and impractical for edge devices with limited inference-time memory and compute. We address this shortcoming with a memory-augmented architecture and a pretraining strategy aligned with existing hardware paradigms. We introduce small language models that access large hierarchical parametric memory banks encoding world knowledge. During pretraining and inference, we fetch a small, context-dependent memory block and add it to the model. Our pretraining learns to store long-tail world knowledge in the memory parameters, while the small language model acts as an anchor capturing common knowledge and general reasoning abilities. Through trillion-token-scale experiments, we show significant gains: a 160M-parameter model augmented with an 18M-param...
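The fetch-and-augment idea in the abstract can be sketched in a toy form. Everything below is an illustrative assumption rather than the paper's implementation: the bank layout, the bag-of-words context vectors, and the nearest-neighbour selection rule are stand-ins for whatever learned mechanism the authors use.

```python
# Illustrative sketch (not the paper's method): a small "anchor" model is
# augmented with one context-dependent block fetched from a hierarchical
# memory bank. Selection here is a toy cosine match over bag-of-words
# context vectors; the real system learns its fetch mechanism.

from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class HierarchicalMemoryBank:
    """Memory blocks keyed by (level, name); the convention that deeper
    levels hold more specific, longer-tail knowledge is an assumption."""

    def __init__(self):
        self.blocks = {}  # (level, name) -> (context Counter, params)

    def add(self, level, name, context_words, params):
        self.blocks[(level, name)] = (Counter(context_words), params)

    def fetch(self, prompt_words):
        """Return the single block whose stored context best matches the
        prompt, mirroring the paper's small context-dependent fetch."""
        ctx = Counter(prompt_words)
        return max(self.blocks.items(),
                   key=lambda kv: cosine(ctx, kv[1][0]))

bank = HierarchicalMemoryBank()
bank.add(0, "general", ["the", "a", "is"], params="common-knowledge block")
bank.add(1, "astronomy", ["planet", "orbit", "star"], params="long-tail block")

(level, name), (_, params) = bank.fetch(["which", "planet", "has", "rings"])
print(level, name, params)  # the astronomy block matches this prompt
```

In the paper, the fetched block contains trained parameters that are added to the small model at pretraining and inference time; here `params` is just a placeholder string standing in for those weights.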