[2601.00671] Fast-weight Product Key Memory
Summary
The paper introduces Fast-weight Product Key Memory (FwPKM), a novel memory layer designed to enhance sequence modeling in language models by balancing storage capacity and computational efficiency.
Why It Matters
This research addresses a central challenge in language modeling: optimizing memory usage while maintaining performance. FwPKM's ability to generalize to contexts far longer than its training sequences can improve long-context applications in natural language processing, making it relevant for developers and researchers in the field.
Key Takeaways
- FwPKM is a sparse memory layer that updates only its activated parameters, at both training and inference time.
- The method allows for rapid memorization and retrieval of key-value associations with low computational costs.
- Experiments demonstrate significant perplexity reductions on long-context datasets.
- FwPKM can generalize to contexts up to 128K tokens, despite being trained on shorter sequences.
- This approach complements existing memory modules, potentially leading to advancements in episodic memory applications.
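The "product key" part of the name refers to the retrieval scheme popularized by Product Key Memory layers: an n × n grid of memory slots is addressed via two small sub-key tables, so selecting the top-k slots costs O(n) score computations instead of O(n²). The paper does not spell out its retrieval code, so the following is a minimal NumPy sketch of that generic trick; all function and variable names (`product_key_topk`, `subkeys_a`, `subkeys_b`) are illustrative, not from the paper.

```python
import numpy as np

def product_key_topk(query, subkeys_a, subkeys_b, k):
    """Select top-k slots from an n*n grid of composite keys by scoring
    only 2n half-dimensional sub-keys (the product-key retrieval trick)."""
    d = query.shape[0] // 2
    qa, qb = query[:d], query[d:]
    sa = subkeys_a @ qa            # scores vs. first sub-key table, shape (n,)
    sb = subkeys_b @ qb            # scores vs. second sub-key table, shape (n,)
    ia = np.argsort(sa)[-k:]       # top-k candidate indices per half
    ib = np.argsort(sb)[-k:]
    # composite score of slot (i, j) is sa[i] + sb[j]; the exact global
    # top-k is guaranteed to lie in these k*k candidate pairs
    cand = sa[ia][:, None] + sb[ib][None, :]
    flat = np.argsort(cand.ravel())[-k:]
    rows, cols = np.unravel_index(flat, (k, k))
    slot_ids = ia[rows] * subkeys_b.shape[0] + ib[cols]  # flat index into n*n slots
    return slot_ids, cand.ravel()[flat]

rng = np.random.default_rng(0)
n, d, k = 16, 8, 4
ids, scores = product_key_topk(rng.normal(size=2 * d),
                               rng.normal(size=(n, d)),
                               rng.normal(size=(n, d)), k)
```

The factorization is exact, not approximate: any slot in the global top-k must have each of its two sub-keys in the corresponding per-half top-k, so searching the k² candidate pairs recovers the true winners.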
Computer Science > Computation and Language
arXiv:2601.00671 (cs)
[Submitted on 2 Jan 2026 (v1), last revised 22 Feb 2026 (this version, v2)]
Title: Fast-weight Product Key Memory
Authors: Tianyu Zhao, Llion Jones
Abstract: Sequence modeling layers in modern language models typically face a trade-off between storage capacity and computational efficiency. While softmax attention offers unbounded storage at prohibitive quadratic cost, linear variants are more efficient but suffer from limited, fixed-size storage. We introduce Fast-weight Product Key Memory (FwPKM), a sparse fast-weight memory layer that resolves this tension. FwPKM updates sparsely activated parameters at both training and inference time using chunk-level gradient descent on a local memory-rewrite objective. This performs Test-Time Training (TTT)-style gradient updates on activated slots in a sparse memory, enabling rapid memorization and retrieval of many new key-value associations while keeping per-token compute low and fixed. Experiments show that FwPKM functions as an effective episodic memory that complements the semantic memory of standard modules, yielding significant perplexity reductions on long-context datasets. Notably, in Needle-in-a-Haystack evaluations, FwPKM generalizes to 128K-token contexts despite being trained on only 4K-token sequences.
Subjects: Computation and Language
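The abstract's core mechanism is "chunk-level gradient descent on a local memory-rewrite objective": over a chunk of tokens, only the activated slots of a value table receive a gradient step that pushes the memory read toward the target values. The paper's exact objective and update rule are not reproduced here; the sketch below shows the general shape of such a sparse TTT-style update under a simple squared-error rewrite loss, with all names (`chunk_update`, `slots`, `targets`) being illustrative assumptions.

```python
import numpy as np

def chunk_update(V, slots, weights, targets, lr=0.1):
    """One chunk-level gradient step on a sparse value table V of shape (N, d).

    For token t, the memory read is sum_i weights[t, i] * V[slots[t, i]];
    the local rewrite loss is 0.5 * ||read - targets[t]||^2 summed over the
    chunk. Only the activated rows of V receive gradient, so per-token
    compute stays low and fixed regardless of the table size N."""
    T, k = slots.shape
    read = np.einsum('tk,tkd->td', weights, V[slots])  # (T, d) memory reads
    err = read - targets                               # d(loss)/d(read)
    # scatter-add gradients into the activated rows only
    grad = np.zeros_like(V)
    np.add.at(grad, slots.ravel(),
              (weights[..., None] * err[:, None, :]).reshape(T * k, -1))
    V -= lr * grad
    return V
```

A single step strictly reduces the chunk's rewrite loss for a small enough learning rate, which is the sense in which the layer "memorizes" the chunk's key-value associations at test time.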