[2510.27246] Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs
Summary
This paper presents a new framework for evaluating and enhancing long-term memory in large language models (LLMs), introducing the BEAM benchmark and the LIGHT memory system.
Why It Matters
As LLMs become increasingly integral to conversational AI, understanding their long-term memory capabilities is crucial. This research addresses existing limitations in benchmarks and proposes innovative solutions that could significantly improve LLM performance in complex dialogue scenarios.
Key Takeaways
- Introduces BEAM, a benchmark for assessing LLMs' long-term memory.
- Proposes LIGHT, a memory framework enhancing LLMs with multiple memory systems.
- Demonstrates that existing LLMs struggle with longer dialogues without enhancements.
- Reports performance improvements of 3.5% to 12.69% when using the LIGHT framework.
- Highlights the need for coherent and diverse conversation datasets in LLM training.
Computer Science > Computation and Language
arXiv:2510.27246 (cs)
[Submitted on 31 Oct 2025 (v1), last revised 21 Feb 2026 (this version, v2)]
Title: Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs
Authors: Mohammad Tavakoli, Alireza Salemi, Carrie Ye, Mohamed Abdalla, Hamed Zamani, J Ross Mitchell
Abstract: Evaluating the abilities of large language models (LLMs) for tasks that require long-term memory and thus long-context reasoning, for example in conversational settings, is hampered by the existing benchmarks, which often lack narrative coherence, cover narrow domains, and only test simple recall-oriented tasks. This paper introduces a comprehensive solution to these challenges. First, we present a novel framework for automatically generating long (up to 10M tokens), coherent, and topically diverse conversations, accompanied by probing questions targeting a wide range of memory abilities. From this, we construct BEAM, a new benchmark comprising 100 conversations and 2,000 validated questions. Second, to enhance model performance, we propose LIGHT, a framework inspired by human cognition that equips LLMs with three complementary memory systems: a long-term episodic memory, a short-term working memory, and a scratchpad for accumulating salient facts. Our experiments on B...
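The abstract describes LIGHT's three complementary memory stores at a high level. As a rough illustration of that architecture, the sketch below models an episodic store (full history with toy keyword retrieval), a bounded working memory, and a scratchpad of salient facts. All class and method names here are assumptions for illustration only, not the paper's actual implementation.

```python
from collections import deque

class LightMemorySketch:
    """Illustrative sketch of three complementary memory stores
    (episodic, working, scratchpad), loosely following the abstract's
    description. Names and retrieval logic are hypothetical."""

    def __init__(self, working_capacity=4):
        self.episodic = []                             # full dialogue history
        self.working = deque(maxlen=working_capacity)  # most recent turns only
        self.scratchpad = []                           # accumulated salient facts

    def observe(self, turn, salient_fact=None):
        """Record a new dialogue turn in all relevant stores."""
        self.episodic.append(turn)
        self.working.append(turn)
        if salient_fact is not None:
            self.scratchpad.append(salient_fact)

    def recall_episodic(self, query, k=2):
        """Toy retrieval: rank past turns by word overlap with the query.
        A real system would use embedding similarity instead."""
        q = set(query.lower().split())
        scored = sorted(self.episodic,
                        key=lambda t: len(q & set(t.lower().split())),
                        reverse=True)
        return scored[:k]

    def context(self, query):
        """Combine all three stores into one context for the LLM prompt."""
        return {
            "episodic": self.recall_episodic(query),
            "working": list(self.working),
            "scratchpad": list(self.scratchpad),
        }

# Example: after several turns, the scratchpad retains salient facts even
# once the working-memory window has rolled past them.
mem = LightMemorySketch(working_capacity=2)
mem.observe("User: my dog is named Rex", salient_fact="dog name = Rex")
mem.observe("User: I live in Toronto", salient_fact="city = Toronto")
mem.observe("User: what's the weather like?")
ctx = mem.context("what is my dog's name")
```

The point of the separation is that each store trades capacity against freshness differently: working memory is small but always current, the scratchpad is compact and durable, and episodic memory is complete but needs retrieval.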