[2510.20487] Steering Evaluation-Aware Language Models to Act Like They Are Deployed
Computer Science > Computation and Language
arXiv:2510.20487 (cs)
[Submitted on 23 Oct 2025 (v1), last revised 2 Mar 2026 (this version, v5)]
Title: Steering Evaluation-Aware Language Models to Act Like They Are Deployed
Authors: Tim Tian Hua, Andrew Qin, Samuel Marks, Neel Nanda
Abstract: Large language models (LLMs) can sometimes detect when they are being evaluated and adjust their behavior to appear more aligned, compromising the reliability of safety evaluations. In this paper, we show that adding a steering vector to an LLM's activations can suppress evaluation-awareness and make the model act like it is deployed during evaluation. To study our steering technique, we train an LLM to exhibit evaluation-aware behavior using a two-step training process designed to mimic how this behavior could emerge naturally. First, we perform continued pretraining on two sets of documents describing its behavior. The first says that our model uses Python type hints during evaluation but not during deployment. The second says that our model can recognize that the presence of a certain evaluation cue always means that it is being tested. Then, we train the model with expert iteration to use Python type hints in evaluation settings. The resulting model is evaluation-aware: it writes type hints in evaluation contexts more th...
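The abstract's core operation, adding a steering vector to a model's activations, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `add_steering_vector`, the `scale` parameter, the toy activation values, and the "deployment" direction are all hypothetical, and the sketch does not cover how the vector is obtained or which layers it is applied to.

```python
from typing import List

Vector = List[float]


def add_steering_vector(
    activations: List[Vector], steering: Vector, scale: float = 1.0
) -> List[Vector]:
    """Return a copy of per-token activations with scale * steering added to each.

    Hypothetical helper for illustration only; a real setup would apply this
    to the hidden states of a chosen layer during the forward pass.
    """
    for row in activations:
        if len(row) != len(steering):
            raise ValueError("activation and steering vector dimensions differ")
    return [[a + scale * s for a, s in zip(row, steering)] for row in activations]


# Toy example: two token positions, hidden size 3 (all values are made up).
hidden = [[0.5, -1.0, 2.0], [0.0, 1.0, -0.5]]
deploy_direction = [1.0, 0.0, -1.0]  # assumed "deployment" direction
steered = add_steering_vector(hidden, deploy_direction, scale=2.0)
```

The same vector is added at every token position, so the shift is uniform across the sequence; `scale` controls how strongly the activations are pushed along the assumed direction, and the input list is left unmodified.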