[2602.00924] Supervised sparse auto-encoders for interpretable and compositional representations
Computer Science > Artificial Intelligence
arXiv:2602.00924 (cs)
[Submitted on 31 Jan 2026 (v1), last revised 8 May 2026 (this version, v2)]

Title: Supervised sparse auto-encoders for interpretable and compositional representations
Authors: Ouns El Harzli, Hugo Wallner, Yoonsoo Nam, Haixuan Xavier Tao

Abstract: Sparse auto-encoders (SAEs) have re-emerged as a prominent method for mechanistic interpretability, yet they face two significant challenges: the non-smoothness of the $L_1$ penalty, which hinders reconstruction and scalability, and a lack of alignment between learned features and human semantics. In this paper, we address these limitations by adapting unconstrained feature models, a mathematical framework from neural collapse theory, and by supervising the task. We supervise (decoder-only) SAEs to reconstruct feature vectors by jointly learning sparse concept embeddings and decoder weights. Validated on Stable Diffusion 3.5, our approach demonstrates compositional generalization, successfully reconstructing images with concept combinations unseen during training and enabling feature-level intervention for semantic image editing without prompt modification.

Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:2602.00924 [cs.AI] (or arXiv:2602.00924v2 [cs.AI] for this version)
https://doi.org/10.48550/a...
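The core idea in the abstract (jointly learning sparse concept embeddings and decoder weights, with sparsity coming from supervision rather than an $L_1$ penalty) can be illustrated with a toy sketch. This is a hypothetical NumPy illustration of that spirit, not the paper's actual objective or optimizer: feature vectors `X` are factored as `Z @ D`, where the support of the codes `Z` is fixed by known concept labels (`mask`), so no non-smooth penalty is needed.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "feature vectors": each sample mixes a few ground-truth concept directions.
n, k, d = 64, 3, 16
labels = (rng.random((n, k)) < 0.4).astype(float)   # which concepts are present
true_dirs = rng.normal(size=(k, d))
X = labels @ true_dirs

def supervised_sparse_decoder(X, mask, lr=0.01, steps=2000, seed=0):
    """Jointly learn concept embeddings Z and decoder weights D so that
    Z @ D reconstructs X. Sparsity comes from supervision: Z is kept
    supported on the labeled concepts (mask) instead of using an L1 penalty.
    Hypothetical toy version, not the paper's exact method."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    k = mask.shape[1]
    Z = rng.normal(scale=0.1, size=(n, k)) * mask
    D = rng.normal(scale=0.1, size=(k, d))
    for _ in range(steps):
        R = Z @ D - X                       # reconstruction residual
        Z = (Z - lr * (R @ D.T)) * mask     # gradient step, then re-mask
        D -= lr * (Z.T @ R) / n             # averaged gradient step on decoder
    return Z, D

Z, D = supervised_sparse_decoder(X, labels)
err = np.linalg.norm(Z @ D - X) / np.linalg.norm(X)
print(f"relative reconstruction error: {err:.3f}")

# Feature-level intervention in the spirit of the abstract's semantic editing:
# activate concept 0 in a sample that lacks it and decode the edited vector.
i = int(np.argmax(labels[:, 0] == 0))
z_edit = Z[i].copy()
z_edit[0] = 1.0
x_edit = z_edit @ D
```

Because the decoder is linear and codes are concept-aligned, reconstructing a label combination unseen during training only requires that each individual concept direction was learned, which is one way to read the abstract's compositional-generalization claim.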