[2604.02685] Finding Belief Geometries with Sparse Autoencoders
Computer Science > Machine Learning
arXiv:2604.02685 (cs)
[Submitted on 3 Apr 2026]

Title: Finding Belief Geometries with Sparse Autoencoders
Authors: Matthew Levinson

Abstract: Understanding the geometric structure of internal representations is a central goal of mechanistic interpretability. Prior work has shown that transformers trained on sequences generated by hidden Markov models encode probabilistic belief states as simplex-shaped geometries in their residual stream, with vertices corresponding to latent generative states. Whether large language models trained on naturalistic text develop analogous geometric representations remains an open question. We introduce a pipeline for discovering candidate simplex-structured subspaces in transformer representations, combining sparse autoencoders (SAEs), $k$-subspace clustering of SAE features, and simplex fitting using AANet. We validate the pipeline on a transformer trained on a multipartite hidden Markov model with known belief-state geometry. Applied to Gemma-2-9B, we identify 13 priority clusters exhibiting candidate simplex geometry ($K \geq 3$). A key challenge is distinguishing genuine belief-state encoding from tiling artifacts: latents can span a simplex-shaped subspace without the mixture coordinates carrying predictive signal beyond any individual feature. We therefore adopt barycentric...
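The abstract's notion of mixture coordinates over a fitted simplex can be made concrete with a small sketch. The function below (illustrative only, not the paper's pipeline; the name `barycentric_coords` is our own) recovers barycentric coordinates of a point with respect to given simplex vertices by solving a least-squares system augmented with the sum-to-one constraint. Coordinates may come out negative if the point lies outside the simplex; a full treatment would project onto the probability simplex.

```python
import numpy as np

def barycentric_coords(x, vertices):
    """Affine (barycentric) coordinates of point x w.r.t. simplex vertices.

    vertices: (K, D) array of K simplex vertices in D-dimensional space.
    Solves min_w ||vertices.T @ w - x|| with sum(w) = 1 appended as an
    extra least-squares row. Coordinates can be negative when x falls
    outside the simplex hull.
    """
    K, _ = vertices.shape
    A = np.vstack([vertices.T, np.ones((1, K))])  # (D+1, K) system matrix
    b = np.concatenate([x, [1.0]])                # target point plus constraint
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

# Example: the centroid of a triangle has equal barycentric weights.
V = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
w = barycentric_coords(V.mean(axis=0), V)  # → approximately [1/3, 1/3, 1/3]
```

For $K - 1$ vertices spanning the ambient dimension (as in the triangle example) the system is exactly determined; in higher-dimensional residual-stream subspaces the least-squares solution gives the best affine fit.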