[2602.14111] Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?

arXiv - Machine Learning

Summary

This paper evaluates whether Sparse Autoencoders (SAEs) actually recover meaningful features from neural network activations, finding that they recover few ground-truth features and perform no better than random baselines on interpretability and downstream tasks.

Why It Matters

Understanding the limitations of Sparse Autoencoders is crucial for researchers and practitioners in machine learning, as it challenges the assumption that these models can effectively interpret neural network features. This insight can guide future research and model development.

Key Takeaways

  • SAEs recover only 9% of ground-truth features in a synthetic setting, despite achieving 71% explained variance.
  • SAEs perform similarly to random baselines in interpretability and causal editing tasks.
  • Current SAE models may not reliably decompose neural network mechanisms.

Computer Science > Machine Learning
arXiv:2602.14111 (cs) · Submitted on 15 Feb 2026

Title: Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?
Authors: Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Rogov, Ivan Oseledets, Elena Tutubalina

Abstract: Sparse Autoencoders (SAEs) have emerged as a promising tool for interpreting neural networks by decomposing their activations into sparse sets of human-interpretable features. Recent work has introduced multiple SAE variants and successfully scaled them to frontier models. Despite much excitement, a growing number of negative results in downstream tasks casts doubt on whether SAEs recover meaningful features. To directly investigate this, we perform two complementary evaluations. On a synthetic setup with known ground-truth features, we demonstrate that SAEs recover only $9\%$ of true features despite achieving $71\%$ explained variance, showing that they fail at their core task even when reconstruction is strong. To evaluate SAEs on real activations, we introduce three baselines that constrain SAE feature directions or their activation patterns to random values. Through extensive experiments across multiple SAE architectures, we show that our baselines match fully-trained SAEs in interpretability (0.87 vs 0.90), sparse probing (0.69 vs 0...
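To make the comparison concrete, the kind of random-direction baseline the abstract describes can be sketched as a toy in NumPy. This is an illustrative assumption-laden sketch, not the paper's actual implementation: the dimensions, the top-k encoding rule, and all variable names here are made up for the example. The point is that even with feature directions frozen at random unit vectors, a sparse code can still achieve nontrivial explained variance.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_sae, n = 64, 256, 1000
X = rng.normal(size=(n, d_model))  # stand-in for real model activations

# Random-direction baseline (a sketch, not the paper's code):
# feature directions W_dec are drawn at random and never trained.
W_dec = rng.normal(size=(d_sae, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm features

def encode_topk(X, W_dec, k=16):
    """ReLU projection onto the fixed directions, keeping only the
    k largest activations per sample (a top-k sparsity rule)."""
    acts = np.maximum(X @ W_dec.T, 0.0)
    # zero out everything except each row's k largest activations
    drop = np.argpartition(acts, -k, axis=1)[:, :-k]
    np.put_along_axis(acts, drop, 0.0, axis=1)
    return acts

F = encode_topk(X, W_dec)          # sparse feature activations
X_hat = F @ W_dec                  # linear reconstruction

# Explained variance of the reconstruction: 1 - SSE / total variance
ev = 1.0 - ((X - X_hat) ** 2).sum() / ((X - X.mean(0)) ** 2).sum()
print(f"explained variance with random directions: {ev:.2f}")
```

Evaluating such a baseline alongside trained SAEs on the same interpretability and probing metrics is the sanity check the title refers to: if the trained model does not clearly beat frozen random directions, its learned features carry little extra meaning.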


