[2602.21204] Test-Time Training with KV Binding Is Secretly Linear Attention
Summary
This paper shows that Test-Time Training (TTT) with KV binding, rather than memorizing a key-value mapping at test time, functions as learned linear attention, a reframing that enables architectural simplifications and efficiency improvements.
Why It Matters
Reinterpreting TTT as learned linear attention explains previously puzzling model behaviors and admits fully parallel formulations that improve efficiency without sacrificing performance. It also shifts the account of what these layers compute from test-time memorization to a more robust learned representation, which changes how such architectures should be designed and simplified.
Key Takeaways
- TTT with KV binding is reinterpreted as learned linear attention.
- This approach simplifies model architecture and improves efficiency.
- The findings challenge existing interpretations of TTT as mere memorization.
- A systematic reduction of TTT variants to linear attention is proposed.
- The reframing attributes TTT's performance to enhanced representational capacity rather than memorization.
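The "fully parallel formulation" takeaway can be illustrated with a small sketch. This is a hypothetical toy (not the paper's exact architecture): a sequential fast-weight recurrence W_t = W_{t-1} + v_t k_t^T with read-out o_t = W_t q_t, recomputed for all positions at once as a causally masked matrix product, which is the standard parallel form of linear attention.

```python
import numpy as np

# Toy illustration (assumed setup, not the paper's formulation): a
# sequential fast-weight recurrence and its parallel equivalent.
rng = np.random.default_rng(1)
T, d = 5, 3
Q = rng.normal(size=(T, d))  # queries
K = rng.normal(size=(T, d))  # keys
V = rng.normal(size=(T, d))  # values

# Sequential (TTT-style) pass: update the memory W token by token.
W = np.zeros((d, d))
seq_out = np.empty((T, d))
for t in range(T):
    W += np.outer(V[t], K[t])   # W_t = W_{t-1} + v_t k_t^T
    seq_out[t] = W @ Q[t]       # o_t = W_t q_t

# Parallel form: o_t = sum_{i<=t} v_i (k_i . q_t), i.e. a causally
# masked Q K^T followed by a multiply with V -- one batched matmul.
scores = Q @ K.T                 # (T, T) pairwise k_i . q_t
mask = np.tril(np.ones((T, T)))  # keep only i <= t
par_out = (scores * mask) @ V

assert np.allclose(seq_out, par_out)
```

Both computations produce identical outputs; the parallel form simply exposes the recurrence as matrix products that can be batched on accelerators, which is the efficiency benefit the summary refers to.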
Computer Science > Machine Learning
arXiv:2602.21204 (cs) [Submitted on 24 Feb 2026]
Title: Test-Time Training with KV Binding Is Secretly Linear Attention
Authors: Junchen Liu, Sven Elflein, Or Litany, Zan Gojcic, Ruilong Li
Abstract: Test-time training (TTT) with KV binding as a sequence modeling layer is commonly interpreted as a form of online meta-learning that memorizes a key-value mapping at test time. However, our analysis reveals multiple phenomena that contradict this memorization-based interpretation. Motivated by these findings, we revisit the formulation of TTT and show that a broad class of TTT architectures can be expressed as a form of learned linear attention operator. Beyond explaining previously puzzling model behaviors, this perspective yields multiple practical benefits: it enables principled architectural simplifications, admits fully parallel formulations that preserve performance while improving efficiency, and provides a systematic reduction of diverse TTT variants to a standard linear attention form. Overall, our results reframe TTT not as test-time memorization, but as learned linear attention with enhanced representational capacity.
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2602.21204 [cs.LG]
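The abstract's central claim, that a class of TTT layers reduces to linear attention, can be sketched in a few lines. This is a minimal illustrative example under assumed choices (the inner-product binding loss and zero initialization are assumptions, not the paper's exact setup): one gradient step per token on a linear fast-weight memory W under KV binding yields a read-out identical to unnormalized linear attention.

```python
import numpy as np

# Hypothetical sketch of the reduction, under assumed choices:
# per-token binding loss L_t(W) = -v_t^T W k_t, so one gradient
# step gives W_t = W_{t-1} + lr * v_t k_t^T. With W_0 = 0, the
# read-out W_T q equals unnormalized linear attention.
rng = np.random.default_rng(0)
d, T, lr = 4, 6, 0.5

K = rng.normal(size=(T, d))  # keys
V = rng.normal(size=(T, d))  # values
q = rng.normal(size=(d,))    # query at the final position

# TTT view: sequential gradient descent on the binding loss.
W = np.zeros((d, d))
for k, v in zip(K, V):
    W += lr * np.outer(v, k)  # -grad of -v^T W k is v k^T

ttt_out = W @ q

# Linear-attention view: o = lr * sum_i v_i (k_i . q), computed
# directly without ever materializing the memory W.
attn_out = lr * (V.T @ (K @ q))

assert np.allclose(ttt_out, attn_out)
```

Richer losses or learned feature maps change what plays the role of the key/query features, which is why the paper describes the result as *learned* linear attention with enhanced representational capacity rather than plain linear attention.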