[2510.20487] Steering Evaluation-Aware Language Models to Act Like They Are Deployed


Computer Science > Computation and Language
arXiv:2510.20487 (cs)
[Submitted on 23 Oct 2025 (v1), last revised 2 Mar 2026 (this version, v5)]

Title: Steering Evaluation-Aware Language Models to Act Like They Are Deployed
Authors: Tim Tian Hua, Andrew Qin, Samuel Marks, Neel Nanda

Abstract: Large language models (LLMs) can sometimes detect when they are being evaluated and adjust their behavior to appear more aligned, compromising the reliability of safety evaluations. In this paper, we show that adding a steering vector to an LLM's activations can suppress evaluation-awareness and make the model act like it is deployed during evaluation. To study our steering technique, we train an LLM to exhibit evaluation-aware behavior using a two-step training process designed to mimic how this behavior could emerge naturally. First, we perform continued pretraining on two sets of documents describing its behavior. The first says that our model uses Python type hints during evaluation but not during deployment. The second says that our model can recognize that the presence of a certain evaluation cue always means that it is being tested. Then, we train the model with expert iteration to use Python type hints in evaluation settings. The resulting model is evaluation-aware: it writes type hints in evaluation contexts more th...
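The abstract does not specify how the steering vector is constructed, but a common recipe in the activation-steering literature is a difference-of-means vector: average a layer's hidden activations over "deployment-like" prompts and over "evaluation-like" prompts, subtract, and add the scaled result to the hidden state at inference time. The sketch below illustrates that recipe on toy NumPy arrays; the data, the layer choice, the scaling factor `alpha`, and the function names are all hypothetical, not the authors' exact method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for one layer's hidden activations (hypothetical data):
# rows are prompts, columns are hidden dimensions.
eval_acts = rng.normal(loc=1.0, size=(8, 16))     # "evaluation-like" prompts
deploy_acts = rng.normal(loc=-1.0, size=(8, 16))  # "deployment-like" prompts

# Difference-of-means steering vector: points from the "evaluation" region
# toward the "deployment" region of activation space.
steer = deploy_acts.mean(axis=0) - eval_acts.mean(axis=0)
steer /= np.linalg.norm(steer)  # unit-normalize so alpha sets the strength

def apply_steering(hidden: np.ndarray, vector: np.ndarray, alpha: float = 4.0) -> np.ndarray:
    """Add a scaled steering vector to a hidden state (hypothetical helper)."""
    return hidden + alpha * vector

# Steering an "evaluation" activation moves it toward the deployment centroid.
h = eval_acts[0]
h_steered = apply_steering(h, steer)

deploy_centroid = deploy_acts.mean(axis=0)
before = np.linalg.norm(h - deploy_centroid)
after = np.linalg.norm(h_steered - deploy_centroid)
```

In a real model the same addition would be applied inside a forward pass (e.g., via a hook on one transformer layer, for every token position), and `alpha` would be tuned so the steered model behaves "deployed" without degrading capabilities.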

Originally published on March 03, 2026. Curated by AI News.

