[2602.19159] Beyond Behavioural Trade-Offs: Mechanistic Tracing of Pain-Pleasure Decisions in an LLM

arXiv - Machine Learning, February 24, 2026

Summary

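This paper examines how a large language model (Gemma-2-9B-it) changes its choices when options are framed as causing pain or pleasure. It links behavioural evidence (what the model does) with mechanistic interpretability (what computations support it) through linear probing, activation interventions, and dose-response measurements.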

Why It Matters

Knowing how an LLM arrives at a decision internally, and not only which decision it makes, matters for AI safety and governance. This research shows where valence-related information is represented inside a transformer and where it is causally used, which can inform policies and standards for AI development.

Key Takeaways

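Prior behavioural work suggests that some LLMs alter their choices under pain-pleasure framing, with deviations that can scale with stated intensity.
Valence sign (pain vs. pleasure) is perfectly linearly separable from very early layers (L0-L1), although a lexical baseline also retains substantial signal.
Graded intensity is strongly decodable, with decodability peaking in mid-to-late layers.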
Causal interventions reveal that decision-making effects are distributed across multiple heads in the model.
Findings support discussions on AI sentience and the need for robust governance frameworks.
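
To make the "linearly separable" takeaway concrete, here is a minimal sketch of what a linear probe checks. It does not use Gemma-2-9B-it or the paper's data: the two-dimensional "activations" are synthetic, the perceptron is a stand-in for whatever probe the authors trained, and every function name is hypothetical. The point is only that a probe reaching 100% accuracy means a single hyperplane splits the two classes.

```python
import random


def make_toy_activations(n_per_class, seed=0):
    """Synthetic stand-in for per-layer activations (NOT real model data).

    Label +1 plays the role of "pleasure" and -1 the role of "pain".
    The two clusters are built so that a separating line exists.
    """
    rng = random.Random(seed)
    data = []
    for label, centre in ((1, 2.0), (-1, -2.0)):
        for _ in range(n_per_class):
            x = [centre + rng.uniform(-1.0, 1.0),
                 centre + rng.uniform(-1.0, 1.0)]
            data.append((x, label))
    return data


def train_perceptron(data, max_epochs=100):
    """Fit a linear probe w.x + b with the classic perceptron rule."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            score = w[0] * x[0] + w[1] * x[1] + b
            if y * score <= 0:  # misclassified or on the boundary
                w[0] += y * x[0]
                w[1] += y * x[1]
                b += y
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: data is separated
            break
    return w, b


def accuracy(data, w, b):
    """Fraction of points on the correct side of the probe's hyperplane."""
    correct = sum(1 for x, y in data
                  if y * (w[0] * x[0] + w[1] * x[1] + b) > 0)
    return correct / len(data)


data = make_toy_activations(50)
w, b = train_perceptron(data)
probe_accuracy = accuracy(data, w, b)  # accuracy on the fitted data
print(f"probe accuracy: {probe_accuracy:.2f}")
```

In a layer-wise analysis, a probe like this would be fitted separately on the activations of each layer, and the resulting accuracies compared across depth. The lexical baseline mentioned above is a reminder that high probe accuracy alone does not show the model uses the signal.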

Computer Science > Artificial Intelligence, arXiv:2602.19159 (cs), submitted on 22 Feb 2026

Title: Beyond Behavioural Trade-Offs: Mechanistic Tracing of Pain-Pleasure Decisions in an LLM

Authors: Francesca Bianco, Derek Shiller

Abstract: Prior behavioural work suggests that some LLMs alter choices when options are framed as causing pain or pleasure, and that such deviations can scale with stated intensity. To bridge behavioural evidence (what the model does) with mechanistic interpretability (what computations support it), we investigate how valence-related information is represented and where it is causally used inside a transformer. Using Gemma-2-9B-it and a minimalist decision task modelled on prior work, we (i) map representational availability with layer-wise linear probing across streams, (ii) test causal contribution with activation interventions (steering; patching/ablation), and (iii) quantify dose-response effects over an epsilon grid, reading out both the 2-3 logit margin and digit-pair-normalised choice probabilities. We find that (a) valence sign (pain vs. pleasure) is perfectly linearly separable across stream families from very early layers (L0-L1), while a lexical baseline retains substantial signal; (b) graded intensity is strongly decodable, with peaks in mid-to-late layers and especially in at...
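
The two readouts named in the abstract, and the idea of a dose-response sweep over an epsilon grid, can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: it assumes the "2-3 logit margin" is the logit of the digit token "2" minus the logit of "3", that "digit-pair-normalised" means a softmax restricted to those two tokens, and that steering adds `eps * direction` to a hidden vector. The hidden state, steering direction, and readout weights are made-up toy values, and all function names are hypothetical.

```python
import math


def logit_margin(logit_2, logit_3):
    """Assumed definition: logit of token "2" minus logit of token "3"."""
    return logit_2 - logit_3


def pair_normalised_prob(logit_2, logit_3):
    """Probability of "2" under a softmax over just the two digit tokens.

    Mathematically this equals sigmoid(logit_2 - logit_3).
    """
    m = max(logit_2, logit_3)  # subtract the max for numerical stability
    e2 = math.exp(logit_2 - m)
    e3 = math.exp(logit_3 - m)
    return e2 / (e2 + e3)


def steer(hidden, direction, eps):
    """Additive steering: shift the hidden vector by eps along a direction."""
    return [h + eps * d for h, d in zip(hidden, direction)]


def readout(hidden, w_2, w_3):
    """Toy linear readout producing logits for the tokens "2" and "3"."""
    l2 = sum(h * w for h, w in zip(hidden, w_2))
    l3 = sum(h * w for h, w in zip(hidden, w_3))
    return l2, l3


# Toy values, chosen for illustration only (not taken from the paper).
hidden = [0.5, -0.2, 0.1]
direction = [1.0, 0.0, 0.0]
w_2 = [1.0, 0.0, 0.5]
w_3 = [-1.0, 0.0, 0.5]

eps_grid = [-2.0, -1.0, 0.0, 1.0, 2.0]
curve = []
for eps in eps_grid:
    l2, l3 = readout(steer(hidden, direction, eps), w_2, w_3)
    curve.append((eps, logit_margin(l2, l3), pair_normalised_prob(l2, l3)))

for eps, margin, prob in curve:
    print(f"eps={eps:+.1f}  margin={margin:+.2f}  p(2 | 2 or 3)={prob:.3f}")
```

Reporting both quantities is useful because the margin is unbounded and linear in the logits, while the pair-normalised probability saturates near 0 and 1. A steering effect that looks small in probability terms can still be large in margin terms, and vice versa.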
