[2603.20101] Pitfalls in Evaluating Interpretability Agents
Computer Science > Artificial Intelligence
arXiv:2603.20101 (cs)
[Submitted on 20 Mar 2026]
Title: Pitfalls in Evaluating Interpretability Agents
Authors: Tal Haklay, Nikhil Prakash, Sana Pandey, Antonio Torralba, Aaron Mueller, Jacob Andreas, Tamar Rott Shaham, Yonatan Belinkov
Abstract: Automated interpretability systems aim to reduce the need for human labor and scale analysis to increasingly large models and diverse tasks. Recent efforts toward this goal leverage large language models (LLMs) at increasing levels of autonomy, ranging from fixed one-shot workflows to fully autonomous interpretability agents. This shift creates a corresponding need to scale evaluation approaches to keep pace with both the volume and complexity of generated explanations. We investigate this challenge in the context of automated circuit analysis -- explaining the roles of model components when performing specific tasks. To this end, we build an agentic system in which a research agent iteratively designs experiments and refines hypotheses. When evaluated against human expert explanations across six circuit analysis tasks in the literature, the system appears competitive. However, closer examination reveals several pitfalls of replication-based evaluation: human expert explanations can be subjective or incomplete, outcome-based comparisons obscure the res...