[2602.13483] Finding Highly Interpretable Prompt-Specific Circuits in Language Models
Summary
This paper introduces a method for finding prompt-specific circuits in language models, demonstrating that circuits vary from prompt to prompt even within a fixed task, a finding that refines the task-level assumptions common in mechanistic interpretability.
Why It Matters
Understanding how language models operate at a granular level is crucial for improving their interpretability and reliability. This research shifts focus from task-level analysis to prompt-specific mechanisms, which can lead to better model design and application in real-world scenarios.
Key Takeaways
- Circuits in language models are prompt-specific, not task-specific.
- The ACC++ method extracts cleaner, lower-dimensional causal signals from attention heads in a single forward pass, improving circuit precision by reducing attribution noise.
- Different prompts can induce systematically different mechanisms in models.
- Prompts can be grouped into families with similar circuits for analysis.
- An automated pipeline for interpretability can enhance understanding of model behavior.
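The takeaway about grouping prompts into families with similar circuits suggests a clustering step over per-prompt circuit descriptions. The paper's actual procedure is not given in this summary; the sketch below is an illustrative assumption, representing each prompt's circuit as a set of attention heads and greedily merging circuits whose Jaccard similarity exceeds a threshold. All names and the threshold value are hypothetical.

```python
# Illustrative sketch (NOT the paper's method): cluster prompt-specific
# circuits, each represented as a set of (layer, head) pairs, into
# families by greedy agglomeration on Jaccard similarity.

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two head sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

def cluster_circuits(circuits: dict[str, set], threshold: float = 0.6):
    """Assign each prompt's circuit to the first family whose
    representative head set overlaps by at least `threshold`;
    otherwise start a new family."""
    families: list[list[str]] = []
    reps: list[set] = []
    for prompt, heads in circuits.items():
        for i, rep in enumerate(reps):
            if jaccard(heads, rep) >= threshold:
                families[i].append(prompt)
                reps[i] = rep | heads  # widen the family representative
                break
        else:
            families.append([prompt])
            reps.append(set(heads))
    return families

# Toy example: circuits for three prompts over (layer, head) pairs.
circuits = {
    "p1": {(9, 6), (9, 9), (10, 0)},
    "p2": {(9, 6), (9, 9), (10, 0), (10, 7)},
    "p3": {(0, 1), (3, 0)},
}
print(cluster_circuits(circuits))  # p1 and p2 share a family; p3 is alone
```

A real pipeline would likely use graded attribution scores rather than hard head sets, but the set-based view keeps the family structure easy to inspect.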
Computer Science > Machine Learning
arXiv:2602.13483 (cs)
[Submitted on 13 Feb 2026]
Title: Finding Highly Interpretable Prompt-Specific Circuits in Language Models
Authors: Gabriel Franco, Lucas M. Tassis, Azalea Rohr, Mark Crovella
Abstract: Understanding the internal circuits that language models use to solve tasks remains a central challenge in mechanistic interpretability. Most prior work identifies circuits at the task level by averaging across many prompts, implicitly assuming a single stable mechanism per task. We show that this assumption can obscure a crucial source of structure: circuits are prompt-specific, even within a fixed task. Building on attention causal communication (ACC) (Franco & Crovella, 2025), we introduce ACC++, refinements that extract cleaner, lower-dimensional causal signals inside attention heads from a single forward pass. Like ACC, our approach does not require replacement models (e.g., SAEs) or activation patching; ACC++ further improves circuit precision by reducing attribution noise. Applying ACC++ to indirect object identification (IOI) in GPT-2, Pythia, and Gemma 2, we find there is no single circuit for IOI in any model: different prompt templates induce systematically different mechanisms. Despite this variation, prompts cluster into prompt families with similar circuits, and we propose ...
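The abstract states that ACC++ extracts cleaner, lower-dimensional causal signals inside attention heads from a single forward pass, but does not spell out the procedure here. One generic way to obtain a low-dimensional signal from a head's per-token outputs is a truncated SVD; the NumPy sketch below illustrates that idea on random stand-in data. All shapes, names, and the use of SVD are assumptions for illustration, not the ACC++ algorithm.

```python
import numpy as np

# Illustrative only: reduce one attention head's per-token outputs
# (seq_len x d_head) to a rank-k signal via truncated SVD. This is a
# generic dimensionality reduction, not the ACC++ procedure itself.
rng = np.random.default_rng(0)
seq_len, d_head, k = 12, 64, 3

H = rng.normal(size=(seq_len, d_head))   # stand-in for a head's output
U, S, Vt = np.linalg.svd(H, full_matrices=False)
signal = U[:, :k] * S[:k]                # rank-k per-token trajectories
H_approx = signal @ Vt[:k]               # low-rank reconstruction of H

print(signal.shape)    # (12, 3)
print(H_approx.shape)  # (12, 64)
```

In practice the activations would come from hooks placed on a model's attention heads during a single forward pass, rather than from random data.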