[2510.13900] Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences
Computer Science > Computation and Language
arXiv:2510.13900 (cs)
[Submitted on 14 Oct 2025 (v1), last revised 2 Mar 2026 (this version, v2)]

Title: Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences
Authors: Julian Minder, Clément Dumas, Stewart Slocum, Helena Casademunt, Cameron Holmes, Robert West, Neel Nanda

Abstract: Finetuning on narrow domains has become an essential tool to adapt Large Language Models (LLMs) to specific tasks and to create models with known unusual properties that are useful for research. We show that narrow finetuning creates strong biases in LLM activations that can be interpreted to understand the finetuning domain. These biases can be discovered using simple tools from model diffing, the study of differences between models before and after finetuning. In particular, analyzing activation differences on the first few tokens of random text, and steering by adding this difference to the model activations, produces text similar in format and general content to the finetuning data. We demonstrate that these analyses contain crucial information by creating an LLM-based interpretability agent to understand the finetuning domain. With access to the bias, the agent performs significantly better than baseline agents using simple prompting. Our analysis sp...
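The core model-diffing step described in the abstract can be illustrated with a minimal sketch: compute the mean activation difference between a finetuned and a base model on the first few tokens of random text, then "steer" by adding that difference vector back into the activations. The toy arrays below stand in for real residual-stream activations; the injected `bias` vector is a hypothetical stand-in for the finetuning-induced bias, not data from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_tokens = 16, 5

# Stand-in for the base model's residual-stream activations on the
# first few tokens of random text (toy data, not real model outputs).
base_acts = rng.normal(size=(n_tokens, d_model))

# Hypothetical finetuning-induced bias: narrow finetuning shifts the
# finetuned model's activations by a roughly constant direction.
bias = np.full(d_model, 0.5)
ft_acts = base_acts + bias + rng.normal(scale=0.01, size=(n_tokens, d_model))

# Model diffing: average the activation difference over the first tokens.
diff = (ft_acts - base_acts).mean(axis=0)

# Steering: add the recovered difference vector to the base activations,
# nudging the base model toward the finetuning domain.
steered_acts = base_acts + diff
```

In a real setting `base_acts` and `ft_acts` would come from forward passes of the two models at a chosen layer, and the steering vector would be added during generation; the averaging-and-adding logic is the same.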