[2603.25325] How Pruning Reshapes Features: Sparse Autoencoder Analysis of Weight-Pruned Language Models
Computer Science > Machine Learning
arXiv:2603.25325 (cs) [Submitted on 26 Mar 2026]
Authors: Hector Borobia, Elies Seguí-Mas, Guillermina Tormo-Carbó

Abstract: Weight pruning is a standard technique for compressing large language models, yet its effect on learned internal representations remains poorly understood. We present the first systematic study of how unstructured pruning reshapes the feature geometry of language models, using Sparse Autoencoders (SAEs) as interpretability probes. Across three model families (Gemma 3 1B, Gemma 2 2B, Llama 3.2 1B), two pruning methods (magnitude and Wanda), and six sparsity levels (0-60%), we investigate five research questions spanning seed stability, feature survival, SAE transferability, feature fragility, and causal relevance. Our most striking finding is that rare SAE features, those with low firing rates, survive pruning far better than frequent ones, with within-condition Spearman correlations of rho = -1.0 in 11 of 17 experimental conditions. This counter-intuitive result suggests that pruning acts as implicit feature selection, preferentially destroying high-frequency generic features while preserving specialized rare ones. We further show t...
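
To make the reported statistic concrete, the following is a minimal sketch of how a Spearman correlation between feature firing rate and feature survival could be computed. The function names and all numbers are hypothetical and chosen for illustration; they are not the authors' code or data, and the way the paper groups features within a condition is not specified in the abstract.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical values for illustration only, not data from the paper.
firing_rate = [0.001, 0.005, 0.02, 0.08, 0.30]  # fraction of tokens a feature fires on
survival = [0.95, 0.90, 0.70, 0.55, 0.20]       # fraction of features surviving pruning
rho = spearman_rho(firing_rate, survival)
print(rho)
```

Because Spearman correlation depends only on rank order, a value of -1.0 means survival decreases strictly as firing rate increases; it says nothing about how steep the decrease is.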