[2512.15987] Provably Extracting the Features from a General Superposition
Computer Science > Machine Learning

arXiv:2512.15987 (cs)

[Submitted on 17 Dec 2025 (v1), last revised 31 Mar 2026 (this version, v2)]

Title: Provably Extracting the Features from a General Superposition
Authors: Allen Liu

Abstract: It is widely believed that complex machine learning models generally encode features through linear representations. This is the foundational hypothesis behind a vast body of work on interpretability. A key challenge in extracting interpretable features, however, is that they exist in superposition. In this work, we study the question of extracting features in superposition from a learning-theoretic perspective. We start with the following fundamental setting: we are given query access to a function \[ f(x)=\sum_{i=1}^n \sigma_i(v_i^\top x), \] where each unit vector $v_i$ encodes a feature direction and $\sigma_i:\mathbb{R}\to\mathbb{R}$ is an arbitrary response function, and our goal is to recover the $v_i$ and the function $f$. In learning-theoretic terms, superposition refers to the \emph{overcomplete regime}, in which the number of features exceeds the underlying dimension (i.e. $n > d$), a setting that has proven especially challenging for typical algorithmic approaches. Our main result is an efficient query algorithm that, from noisy oracle access to $f$, identifies all feature directions whose responses are non-degenerate an…
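To make the setting concrete, here is a minimal sketch of the oracle model described in the abstract: an overcomplete superposition $f(x)=\sum_i \sigma_i(v_i^\top x)$ with $n > d$, queried with additive noise. The particular choice of response functions ($\tanh$ and $\cos$) and the dimensions are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 8, 16  # overcomplete regime: n > d features in d dimensions

# Hypothetical instance: n random unit feature directions v_i.
V = rng.normal(size=(n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)

# Arbitrary response functions sigma_i; bounded nonlinearities chosen
# purely for illustration.
sigmas = [np.tanh if i % 2 == 0 else np.cos for i in range(n)]

def f(x, noise=0.0):
    """One noisy oracle query: f(x) = sum_i sigma_i(v_i . x) + noise."""
    vals = V @ x  # inner products v_i^T x for all i at once
    out = sum(sigma(t) for sigma, t in zip(sigmas, vals))
    return out + noise * rng.normal()

# A single query to the superposition; the learner only sees such values.
y = f(rng.normal(size=d))
print(y)
```

The learner's task, in this framing, is to recover the rows of `V` (up to the usual symmetries) using only such black-box evaluations, despite `V` having more rows than columns.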