[2412.08686] LatentQA: Teaching LLMs to Decode Activations Into Natural Language
Computer Science > Computation and Language
arXiv:2412.08686 (cs)
[Submitted on 11 Dec 2024 (v1), last revised 23 Mar 2026 (this version, v2)]

Title: LatentQA: Teaching LLMs to Decode Activations Into Natural Language
Authors: Alexander Pan, Lijie Chen, Jacob Steinhardt

Abstract: Top-down transparency typically analyzes language model activations using probes with scalar or single-token outputs, limiting the range of behaviors that can be captured. To alleviate this issue, we develop a more expressive probe that can directly output natural language, performing LatentQA: the task of answering open-ended questions about activations. A key difficulty in developing such a probe is collecting a dataset mapping activations to natural-language descriptions. In response, we propose an approach for generating a dataset of activations and associated question-answer pairs and develop a fine-tuning method for training a decoder LLM on this dataset. We then validate our decoder's fidelity by assessing its ability to read and control model activations. First, we evaluate the decoder on a number of supervised reading tasks with a known answer, such as uncovering hidden system prompts and relational knowledge extraction, and observe that it outperforms competitive probing baselines. Second, we demonstrate that th...
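The pipeline the abstract sketches, pairing a captured activation with an open-ended question so a decoder LLM can answer in natural language, can be illustrated with a minimal numpy toy. Everything here is an assumption for illustration: the hidden size, the tiny vocabulary, and the choice to splice the activation in as a "soft token" ahead of the question embeddings are stand-ins, not the paper's actual fine-tuning mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size (assumption; real LLM hidden sizes are far larger)

# Toy embedding table standing in for the decoder LLM's input embeddings.
vocab = {"what": 0, "is": 1, "the": 2, "persona": 3, "?": 4}
embed = rng.normal(size=(len(vocab), d))

def build_decoder_input(activation, question_tokens):
    """Prepend a captured activation as a 'soft token' before the question
    embeddings, so the decoder can condition its answer on the target
    model's hidden state (illustrative sketch, not the paper's exact method)."""
    q = embed[[vocab[t] for t in question_tokens]]
    return np.vstack([activation[None, :], q])

# Stand-in for a hidden state captured from the target model at some layer.
act = rng.normal(size=d)
seq = build_decoder_input(act, ["what", "is", "the", "persona", "?"])
print(seq.shape)  # (6, 8): one activation slot plus five question tokens
```

In the actual method, a dataset of many such (activation, question, answer) triples would then be used to fine-tune the decoder LLM with an ordinary language-modeling loss on the answer text.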