[2602.20904] Transcoder Adapters for Reasoning-Model Diffing
Summary
This paper introduces transcoder adapters, a method for learning an interpretable approximation of how fine-tuning changes a reasoning model's internal MLP computation, and demonstrates their effectiveness in capturing reasoning behaviors.
Why It Matters
Understanding the internal mechanisms of reasoning models is crucial for interpreting and improving them. This research provides insight into how fine-tuning reshapes a model's internal computation, which can guide future work on training and analyzing reasoning models.
Key Takeaways
- Transcoder adapters help interpret changes in reasoning models after fine-tuning.
- The study reveals that only a small fraction of adapter features (~8%) have activating examples directly related to reasoning behaviors.
- Hesitation tokens (e.g., "wait") in responses can be traced to a small set (~2.4%) of adapter features, highlighting their role in model outputs.
- Adapters typically recover 50-90% of the accuracy gains from reasoning fine-tuning on reasoning benchmarks.
- The findings suggest broader applications for transcoder adapters in studying model fine-tuning.
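To make the attribution idea behind the hesitation finding concrete, here is a minimal, hypothetical sketch of direct-effect feature attribution: each adapter feature's contribution to a "wait" logit is its activation times the alignment of its decoder column with that token's unembedding direction. All names (`W_dec`, `v_wait`) and sizes are illustrative, and this simplified direct-path scoring is a stand-in for the paper's full attribution graphs:

```python
import numpy as np

rng = np.random.default_rng(1)
D, F = 16, 32  # residual-stream dim and number of adapter features (toy sizes)

# Hypothetical trained adapter decoder and one token position's activations.
W_dec = rng.normal(size=(D, F))          # decoder: feature -> residual direction
f = np.maximum(rng.normal(size=F), 0.0)  # sparse ReLU feature activations

# Hypothetical residual-stream direction whose logit corresponds to a
# hesitation token such as "wait" (i.e., that token's unembedding row).
v_wait = rng.normal(size=D)

# Direct-effect attribution: feature i contributes f[i] * (decoder column i
# dot v_wait) to the "wait" logit, since the adapter output is W_dec @ f.
scores = f * (W_dec.T @ v_wait)

# Rank features by absolute contribution; typically a small subset dominates.
top = np.argsort(-np.abs(scores))[:3]
print(top, scores[top])
```

Sorting by `|scores|` rather than raw scores keeps strongly suppressive features visible alongside promoting ones, mirroring how a feature can either drive or inhibit hesitation.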
Computer Science > Machine Learning
arXiv:2602.20904 (cs)
[Submitted on 24 Feb 2026]
Title: Transcoder Adapters for Reasoning-Model Diffing
Authors: Nathan Hu, Jake Ward, Thomas Icard, Christopher Potts
Abstract: While reasoning models are increasingly ubiquitous, the effects of reasoning training on a model's internal mechanisms remain poorly understood. In this work, we introduce transcoder adapters, a technique for learning an interpretable approximation of the difference in MLP computation before and after fine-tuning. We apply transcoder adapters to characterize the differences between Qwen2.5-Math-7B and its reasoning-distilled variant, DeepSeek-R1-Distill-Qwen-7B. Learned adapters are faithful to the target model's internal computation and next-token predictions. When evaluated on reasoning benchmarks, adapters match the reasoning model's response lengths and typically recover 50-90% of the accuracy gains from reasoning fine-tuning. Adapter features are sparsely activating and interpretable. When examining adapter features, we find that only ~8% have activating examples directly related to reasoning behaviors. We deeply study one such behavior -- the production of hesitation tokens (e.g., "wait"). Using attribution graphs, we trace hesitation to only ~2.4% of adapter features (5.6k total) performing one of two functions. These fe...
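The core idea described in the abstract can be sketched in a few lines: an adapter with sparse ReLU features is fit to reproduce the *difference* between a base MLP and a fine-tuned MLP, so that base output plus adapter output approximates the fine-tuned computation. The sketch below is not the authors' implementation; the toy MLPs stand in for matching layers of Qwen2.5-Math-7B and DeepSeek-R1-Distill-Qwen-7B, and for brevity it freezes a random encoder and fits only the decoder by least squares rather than training end to end:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, F, N = 16, 64, 32, 512  # residual dim, MLP hidden, adapter features, samples

# Frozen "base" MLP (stand-in for one layer of Qwen2.5-Math-7B).
W_in = rng.normal(size=(H, D)) / np.sqrt(D)
W_out = rng.normal(size=(D, H)) / np.sqrt(H)
base = lambda X: np.maximum(X @ W_in.T, 0.0) @ W_out.T

# "Fine-tuned" MLP: base weights plus an unknown perturbation
# (stand-in for the same layer of DeepSeek-R1-Distill-Qwen-7B).
dW = 0.1 * rng.normal(size=(H, D)) / np.sqrt(D)
tuned = lambda X: np.maximum(X @ (W_in + dW).T, 0.0) @ W_out.T

# Transcoder adapter: sparse ReLU features of the MLP input, decoded linearly,
# fit to reproduce the computation diff tuned(X) - base(X).
W_enc = rng.normal(size=(F, D)) / np.sqrt(D)
X = rng.normal(size=(N, D))
feats = np.maximum(X @ W_enc.T, 0.0)  # sparse activations (about half are zero)
target = tuned(X) - base(X)           # the difference the adapter must explain
W_dec, *_ = np.linalg.lstsq(feats, target, rcond=None)

# Diffing check: base MLP + adapter approximates the fine-tuned MLP
# better than the base MLP alone does.
approx = base(X) + feats @ W_dec
err_with = np.mean((approx - tuned(X)) ** 2)
err_without = np.mean((base(X) - tuned(X)) ** 2)
print(err_with < err_without)  # True
```

Fitting the diff rather than the fine-tuned MLP directly is what makes the adapter a *diffing* tool: features only activate where the two models' computations disagree, so they localize what fine-tuning changed.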