[2508.01916] Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning
Summary
This paper shows that the representation space of a neural network can be decomposed into interpretable, non-basis-aligned subspaces in a purely unsupervised way, using a method the authors call neighbor distance minimization (NDM).
Why It Matters
Understanding the internal representations of neural models is crucial for advancing mechanistic interpretability in AI. This research introduces a novel method for identifying interpretable subspaces, enhancing our ability to analyze and improve neural network architectures, which is vital for responsible AI development.
Key Takeaways
- The study presents a method called neighbor distance minimization (NDM) for unsupervised learning of interpretable subspaces.
- Qualitative analysis indicates that the identified subspaces often correspond to abstract concepts across different inputs.
- Quantitative experiments demonstrate a strong correlation between learned subspaces and known circuit variables in models like GPT-2.
- The findings suggest scalability to larger models, potentially improving understanding of complex AI systems.
- Because it requires no labels or predefined features, the approach offers a general tool for probing how models organize information internally.
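To make the idea of "neighbor distance minimization" concrete, the sketch below illustrates one plausible reading of such an objective: project activations into candidate subspaces and score a decomposition by each point's distance to its nearest neighbor within each subspace. This is an assumption-laden toy, not the paper's implementation; the dimensions, the equal split into subspaces, and the scoring details are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: N activation vectors of dimension D, which we try to split
# into K subspaces. The equal d_sub split is an illustrative assumption;
# the paper's actual partitioning scheme may differ.
N, D, K = 64, 8, 2
d_sub = D // K
X = rng.normal(size=(N, D))

def ndm_loss(Q, X):
    """Illustrative neighbor-distance objective: for each candidate
    subspace (a block of columns of the orthogonal matrix Q), project
    all activations into it and average each point's distance to its
    nearest neighbor there. A lower score means points cluster more
    tightly per subspace, suggesting each subspace isolates one
    'variable'. This scoring rule is a guess at the spirit of NDM,
    not the published objective."""
    loss = 0.0
    for k in range(K):
        B = Q[:, k * d_sub:(k + 1) * d_sub]   # basis of subspace k
        P = X @ B                              # coordinates in subspace k
        dists = np.linalg.norm(P[:, None] - P[None, :], axis=-1)
        np.fill_diagonal(dists, np.inf)        # exclude self-distances
        loss += dists.min(axis=1).mean()       # mean nearest-neighbor distance
    return loss

# Score a random orthogonal decomposition. A real method would optimize Q
# (e.g. gradient descent with an orthogonality constraint) rather than
# sample it once.
Q, _ = np.linalg.qr(rng.normal(size=(D, D)))
print(f"NDM-style loss for a random decomposition: {ndm_loss(Q, X):.3f}")
```

Because Q is orthogonal, the learned subspaces need not align with the standard coordinate axes, which matches the paper's emphasis on non-basis-aligned decompositions.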
Computer Science > Machine Learning
arXiv:2508.01916 (cs)
[Submitted on 3 Aug 2025 (v1), last revised 20 Feb 2026 (this version, v2)]
Title: Decomposing Representation Space into Interpretable Subspaces with Unsupervised Learning
Authors: Xinting Huang, Michael Hahn
Abstract: Understanding internal representations of neural models is a core interest of mechanistic interpretability. Due to its large dimensionality, the representation space can encode various aspects about inputs. To what extent are different aspects organized and encoded in separate subspaces? Is it possible to find these "natural" subspaces in a purely unsupervised way? Somewhat surprisingly, we can indeed achieve this and find interpretable subspaces by a seemingly unrelated training objective. Our method, neighbor distance minimization (NDM), learns non-basis-aligned subspaces in an unsupervised manner. Qualitative analysis shows subspaces are interpretable in many cases, and encoded information in obtained subspaces tends to share the same abstract concept across different inputs, making such subspaces similar to "variables" used by the model. We also conduct quantitative experiments using known circuits in GPT-2; results show a strong connection between subspaces and circuit variables. We also provide evidence showing sc...