[2602.20904] Transcoder Adapters for Reasoning-Model Diffing

arXiv - Machine Learning

Summary

This paper introduces transcoder adapters, a method for analyzing how a reasoning model's internal computations change after fine-tuning, and demonstrates their effectiveness in capturing reasoning behaviors.

Why It Matters

Understanding the internal mechanisms of reasoning models is crucial for interpreting and improving them. This research shows how fine-tuning changes a model's internal computation, which can guide future work on model diffing and interpretability.

Key Takeaways

  • Transcoder adapters help interpret changes in reasoning models after fine-tuning.
  • The study finds that only ~8% of adapter features have activating examples directly related to reasoning behaviors.
  • Hesitation tokens (e.g., "wait") in responses can be traced to ~2.4% of adapter features, highlighting their role in model outputs.
  • Adapters typically recover 50-90% of the accuracy gains from reasoning fine-tuning.
  • The findings suggest broader applications for transcoder adapters in studying model fine-tuning.

Computer Science > Machine Learning

arXiv:2602.20904 (cs) [Submitted on 24 Feb 2026]

Title: Transcoder Adapters for Reasoning-Model Diffing

Authors: Nathan Hu, Jake Ward, Thomas Icard, Christopher Potts

Abstract: While reasoning models are increasingly ubiquitous, the effects of reasoning training on a model's internal mechanisms remain poorly understood. In this work, we introduce transcoder adapters, a technique for learning an interpretable approximation of the difference in MLP computation before and after fine-tuning. We apply transcoder adapters to characterize the differences between Qwen2.5-Math-7B and its reasoning-distilled variant, DeepSeek-R1-Distill-Qwen-7B. Learned adapters are faithful to the target model's internal computation and next-token predictions. When evaluated on reasoning benchmarks, adapters match the reasoning model's response lengths and typically recover 50-90% of the accuracy gains from reasoning fine-tuning. Adapter features are sparsely activating and interpretable. When examining adapter features, we find that only ~8% have activating examples directly related to reasoning behaviors. We deeply study one such behavior -- the production of hesitation tokens (e.g., "wait"). Using attribution graphs, we trace hesitation to only ~2.4% of adapter features (5.6k total) performing one of two functions. These fe...
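The core idea described in the abstract -- a sparse, interpretable module trained to approximate the *difference* between a base model's and a fine-tuned model's MLP computation -- can be illustrated with a minimal numpy sketch. This is not the paper's implementation; the layer sizes, the single-layer ReLU "MLPs", and the L1 sparsity penalty are all illustrative assumptions standing in for the real Qwen2.5-Math-7B / DeepSeek-R1-Distill setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_feat = 16, 64  # toy dimensions, not the paper's

# Hypothetical stand-ins for a base MLP and its fine-tuned variant.
W_base = rng.normal(size=(d_model, d_model)) * 0.1
W_ft = W_base + rng.normal(size=(d_model, d_model)) * 0.02  # small fine-tuning drift

def mlp_base(x):
    return np.maximum(x @ W_base, 0.0)

def mlp_ft(x):
    return np.maximum(x @ W_ft, 0.0)

# Transcoder adapter: an encoder producing sparse (ReLU) feature
# activations, and a decoder mapping features back to model space.
W_enc = rng.normal(size=(d_model, d_feat)) * 0.1
b_enc = np.zeros(d_feat)
W_dec = rng.normal(size=(d_feat, d_model)) * 0.1

def adapter(x):
    feats = np.maximum(x @ W_enc + b_enc, 0.0)  # sparse feature activations
    return feats, feats @ W_dec                  # decoded approximation of the diff

x = rng.normal(size=(8, d_model))

# Training target: the change in MLP output caused by fine-tuning.
target_diff = mlp_ft(x) - mlp_base(x)
feats, approx_diff = adapter(x)

# Training would minimize reconstruction error plus a sparsity penalty,
# so that individual features stay sparsely activating and interpretable.
loss = np.mean((target_diff - approx_diff) ** 2) + 1e-3 * np.mean(np.abs(feats))

# The "adapted" base model approximates the fine-tuned model's computation:
adapted_out = mlp_base(x) + approx_diff
```

Under this framing, inspecting which rows of `feats` fire on which inputs is what lets the paper attribute behaviors such as hesitation tokens to a small subset of adapter features.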
