[2603.22593] Language Models Can Explain Visual Features via Steering
Computer Science > Computer Vision and Pattern Recognition
arXiv:2603.22593 (cs) [Submitted on 23 Mar 2026]

Title: Language Models Can Explain Visual Features via Steering
Authors: Javier Ferrando, Enrique Lopez-Cuena, Pablo Agustin Martin-Torres, Daniel Hinjos, Anna Arias-Duart, Dario Garcia-Gasulla

Abstract: Sparse Autoencoders (SAEs) uncover thousands of features in vision models, yet explaining these features without human intervention remains an open challenge. While previous work has proposed generating correlation-based explanations from top activating input examples, we present a fundamentally different alternative based on causal interventions. We leverage the structure of Vision-Language Models and steer individual SAE features in the vision encoder after providing an empty image. We then prompt the language model to explain what it "sees", effectively eliciting the visual concept represented by each feature. Results show that steering offers a scalable alternative that complements traditional approaches based on input examples, serving as a new axis for automated interpretability in vision models. Moreover, the quality of explanations improves consistently with the scale of the language model, highlighting our method as a promising direction for future research. Finally, we propose Steering-informed ...
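The steering intervention the abstract describes can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's implementation: the names (`W_dec`, `steer`, `alpha`) and the additive rule (shifting activations along one SAE feature's decoder direction) are standard conventions in the SAE-steering literature, used here only to make the idea concrete.

```python
import numpy as np

# Hypothetical sketch of the steering step: the vision encoder's activations
# for an empty image are shifted along a single SAE feature's decoder
# direction before being passed on to the language model. The additive rule
# h' = h + alpha * d_f and all names here are assumptions, not the paper's code.

rng = np.random.default_rng(0)
d_model, n_features = 64, 512

# Toy SAE decoder: one unit-norm direction per learned feature.
W_dec = rng.standard_normal((n_features, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)

def steer(activations: np.ndarray, feature_idx: int, alpha: float) -> np.ndarray:
    """Add alpha times the chosen feature's decoder direction to every
    patch embedding -- the causal intervention on the vision encoder."""
    return activations + alpha * W_dec[feature_idx]

# "Empty image": zero patch activations standing in for the vision encoder's
# output on a blank input (16 patches, toy width).
empty = np.zeros((16, d_model))
steered = steer(empty, feature_idx=42, alpha=8.0)

# The shift lies entirely along the chosen feature's direction, so each
# patch's projection onto that direction equals alpha.
proj = steered @ W_dec[42]
print(np.allclose(proj, 8.0))  # → True
```

In a real pipeline, `steered` would replace the vision tokens fed to the language model, which is then prompted to describe what it "sees"; the generated text serves as the feature's explanation.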