[2504.02922] Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning
Summary
This article discusses advances in model diffing with crosscoders to interpret how AI models change during chat-tuning, addressing sparsity artifacts that distort the analysis.
Why It Matters
Understanding how fine-tuning affects AI models is crucial for improving their interpretability and effectiveness. This research provides a methodology that enhances the analysis of model behaviors, which is essential for developers and researchers in AI and machine learning fields.
Key Takeaways
- Model diffing helps interpret changes in AI models during fine-tuning.
- Sparsity artifacts can cause concepts shared by both models to be misattributed as unique to the fine-tuned model.
- Latent Scaling improves the measurement of latent presence across models.
- Replacing the L1 training loss with a BatchTopK loss mitigates these sparsity artifacts in crosscoder training.
- The study identifies chat-specific latents that enhance model interpretability.
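To make the takeaways concrete, here is a minimal, hedged sketch of the crosscoder idea with a BatchTopK sparsity rule: shared latents are encoded from both models' activations, only the top-k activations across the whole batch are kept (rather than an L1 penalty), and each latent decodes through a separate decoder per model. All function and variable names are illustrative, not the paper's implementation.

```python
import numpy as np

def batch_topk(latents, k):
    """Keep only the k largest activations across the entire batch, zeroing the rest.
    latents: (batch, n_latents) array of non-negative activations."""
    flat = latents.ravel()
    if k >= flat.size:
        return latents
    thresh = np.partition(flat, -k)[-k]  # k-th largest value over the batch
    return np.where(latents >= thresh, latents, 0.0)

def crosscoder_forward(x_base, x_chat, W_enc, W_dec_base, W_dec_chat, k):
    """Toy crosscoder forward pass: one shared dictionary of latents,
    decoded into both the base and chat models' activation spaces."""
    # encode concatenated activations into shared latents (ReLU nonlinearity)
    pre = np.concatenate([x_base, x_chat], axis=-1) @ W_enc
    latents = batch_topk(np.maximum(pre, 0.0), k)
    # per-model decoders: a latent whose base decoder norm is ~0 looks "chat-only"
    return latents @ W_dec_base, latents @ W_dec_chat, latents
```

The BatchTopK rule enforces sparsity directly by construction, which is how it sidesteps the shrinkage and misattribution issues an L1 penalty can introduce.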
Computer Science > Machine Learning
arXiv:2504.02922 (cs)
[Submitted on 3 Apr 2025 (v1), last revised 20 Feb 2026 (this version, v4)]
Title: Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning
Authors: Julian Minder, Clément Dumas, Caden Juang, Bilal Chugtai, Neel Nanda
Abstract: Model diffing is the study of how fine-tuning changes a model's representations and internal algorithms. Many behaviors of interest are introduced during fine-tuning, and model diffing offers a promising lens to interpret such behaviors. Crosscoders are a recent model diffing method that learns a shared dictionary of interpretable concepts, represented as latent directions, in both the base and fine-tuned models, allowing us to track how concepts shift or emerge during fine-tuning. Notably, prior work has observed concepts with no direction in the base model, and it was hypothesized that these model-specific latents were concepts introduced during fine-tuning. However, we identify two issues which stem from the crosscoder's L1 training loss that can misattribute concepts as unique to the fine-tuned model, when they really exist in both models. We develop Latent Scaling to flag these issues by more accurately measuring each latent's presence across models. In experiments comparing Gemma 2 2B base and chat models, we observe that the standard...
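The core idea behind Latent Scaling, as described in the abstract, can be sketched as a per-latent least-squares fit: for each latent, estimate the scale at which its decoder direction appears in a model's activations; a near-zero scale in the base model supports the "introduced during fine-tuning" reading, while a large scale flags misattribution. This is a hedged illustration of the concept, not the paper's exact estimator, and all names are assumptions.

```python
import numpy as np

def latent_scaling(acts, latents, decoder):
    """For each latent j, fit a least-squares scalar beta_j measuring how strongly
    the latent's contribution f_j(x) * d_j is present in the activations a(x).
    acts: (batch, d_model); latents: (batch, n_latents); decoder: (n_latents, d_model)."""
    betas = np.zeros(decoder.shape[0])
    for j in range(decoder.shape[0]):
        pred = np.outer(latents[:, j], decoder[j])  # (batch, d_model) contribution
        denom = (pred * pred).sum()
        # closed-form least-squares solution: beta = <acts, pred> / <pred, pred>
        betas[j] = (pred * acts).sum() / denom if denom > 0 else 0.0
    return betas
```

Comparing the fitted scales against each model's activations gives a quantitative check on whether a "fine-tuned-only" latent truly has no presence in the base model.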