[2602.23197] Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models
Summary
This paper analyzes how fine-tuning affects in-context learning in linear attention models, characterizing the conditions under which fine-tuning enhances or degrades few-shot performance on downstream tasks.
Why It Matters
Understanding how fine-tuning affects in-context learning is crucial for optimizing large language models. This research provides theoretical insights that can guide practitioners in preserving few-shot ability while improving zero-shot performance on target tasks.
Key Takeaways
- Fine-tuning can degrade in-context learning performance if not managed properly.
- Restricting updates to the value matrix during fine-tuning can preserve in-context learning.
- Incorporating an auxiliary few-shot loss enhances in-context learning on the target task, but at the expense of degraded performance elsewhere.
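The second takeaway can be illustrated with a toy NumPy sketch (dimensions, weight scales, and the learning rate are arbitrary choices, not from the paper): in a softmax-free linear attention layer, the query and key matrices determine the score matrix that mixes tokens, so fine-tuning only the value matrix fits the zero-shot targets while leaving that mixing pattern untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_attention(X, W_Q, W_K, W_V):
    """Softmax-free attention: out = (X W_Q)(X W_K)^T (X W_V) / n."""
    n = X.shape[0]
    scores = (X @ W_Q) @ (X @ W_K).T      # (n, n): how tokens are mixed
    return (scores @ (X @ W_V)) / n

d, n = 4, 8                               # toy sizes (illustrative only)
W_Q, W_K, W_V = (0.3 * rng.normal(size=(d, d)) for _ in range(3))
X = rng.normal(size=(n, d))
Y = rng.normal(size=(n, d))               # toy zero-shot fine-tuning targets

loss0 = np.sum((linear_attention(X, W_Q, W_K, W_V) - Y) ** 2)

# Value-only fine-tuning: W_Q and W_K stay frozen, so the score matrix
# that implements the in-context mixing is preserved exactly.
S = (X @ W_Q) @ (X @ W_K).T / n           # fixed while W_Q, W_K are frozen
for _ in range(200):
    out = S @ (X @ W_V)
    grad_V = 2.0 * X.T @ S.T @ (out - Y)  # gradient of squared error w.r.t. W_V
    W_V -= 0.05 * grad_V                  # small step size for stability

loss1 = np.sum((linear_attention(X, W_Q, W_K, W_V) - Y) ** 2)
```

Because the layer's output is linear in `W_V`, this fine-tuning problem is a convex least-squares fit, and the frozen score matrix `S` is what carries the in-context behavior the paper shows is preserved.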
Computer Science > Computation and Language
arXiv:2602.23197 (cs)
[Submitted on 26 Feb 2026]
Title: Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models
Authors: Chungpa Lee, Jy-yong Sohn, Kangwook Lee
Abstract: Transformer-based large language models exhibit in-context learning, enabling adaptation to downstream tasks via few-shot prompting with demonstrations. In practice, such models are often fine-tuned to improve zero-shot performance on downstream tasks, allowing them to solve tasks without examples and thereby reducing inference costs. However, fine-tuning can degrade in-context learning, limiting the performance of fine-tuned models on tasks not seen during fine-tuning. Using linear attention models, we provide a theoretical analysis that characterizes how fine-tuning objectives modify attention parameters and identifies conditions under which this leads to degraded few-shot performance. We show that fine-tuning all attention parameters can harm in-context learning, whereas restricting updates to the value matrix improves zero-shot performance while preserving in-context learning. We further show that incorporating an auxiliary few-shot loss enhances in-context learning primarily on the target task, at the expense of degraded...
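The zero-shot versus few-shot objectives contrasted in the abstract can be sketched for the regression-style in-context setting commonly used in linear attention analyses (the prompt construction, weight scales, and the weighting λ below are illustrative assumptions, not the paper's exact setup): each token is an (x, y) pair, the query row has its label masked to zero, and the auxiliary objective adds a few-shot prompt loss on the target task to the zero-shot loss.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3                                      # input dimension; tokens are (x, y) rows in R^{d+1}

def make_prompt(w, n_demo, rng):
    """ICL regression prompt: n_demo labeled (x, y) rows plus a query row with y masked to 0."""
    X = rng.normal(size=(n_demo + 1, d))
    y = X @ w
    E = np.hstack([X, y[:, None]])
    E[-1, -1] = 0.0                        # hide the query's label
    return E, y[-1]

def attn_predict(E, W_Q, W_K, W_V):
    """One-layer linear attention; read the prediction off the query row's label slot."""
    scores = (E @ W_Q) @ (E @ W_K).T / max(E.shape[0] - 1, 1)
    return (scores @ (E @ W_V))[-1, -1]

W_Q, W_K, W_V = (0.1 * rng.normal(size=(d + 1, d + 1)) for _ in range(3))
w_task = rng.normal(size=d)                # hypothetical target task

# Zero-shot loss: the query appears with no demonstrations.
E0, y0 = make_prompt(w_task, 0, rng)
zero_shot = (attn_predict(E0, W_Q, W_K, W_V) - y0) ** 2

# Few-shot loss: the same task presented with 8 demonstration pairs.
E8, y8 = make_prompt(w_task, 8, rng)
few_shot = (attn_predict(E8, W_Q, W_K, W_V) - y8) ** 2

lam = 0.5                                  # assumed weighting for the auxiliary term
objective = zero_shot + lam * few_shot
```

Training on `objective` rather than `zero_shot` alone is the kind of auxiliary few-shot regularization the abstract describes: it steers the model to keep using demonstrations, but only for prompts drawn from the target task's distribution.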