[2602.19159] Beyond Behavioural Trade-Offs: Mechanistic Tracing of Pain-Pleasure Decisions in an LLM
Summary
This article examines how an LLM (Gemma-2-9B-it) makes choices when options are framed as causing pain or pleasure, linking behavioral evidence with mechanistic interpretability through layer-wise linear probing and activation interventions.
Why It Matters
Understanding the internal mechanisms of LLMs in decision-making processes is crucial for advancing AI safety and governance. This research provides insights into how LLMs process valence-related information, which can inform policies and standards for AI development.
Key Takeaways
- Prior behavioral work suggests that some LLMs alter their choices when options are framed as causing pain or pleasure.
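- Valence sign (pain vs. pleasure) is perfectly linearly separable from very early layers (L0-L1), although a lexical baseline also retains substantial signal.
- Graded intensity is strongly decodable, with decodability peaking in mid-to-late layers.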
- Causal interventions reveal that decision-making effects are distributed across multiple heads in the model.
- Findings support discussions on AI sentience and the need for robust governance frameworks.
Computer Science > Artificial Intelligence
arXiv:2602.19159 (cs)
[Submitted on 22 Feb 2026]
Title: Beyond Behavioural Trade-Offs: Mechanistic Tracing of Pain-Pleasure Decisions in an LLM
Authors: Francesca Bianco, Derek Shiller
Abstract: Prior behavioural work suggests that some LLMs alter choices when options are framed as causing pain or pleasure, and that such deviations can scale with stated intensity. To bridge behavioural evidence (what the model does) with mechanistic interpretability (what computations support it), we investigate how valence-related information is represented and where it is causally used inside a transformer. Using Gemma-2-9B-it and a minimalist decision task modelled on prior work, we (i) map representational availability with layer-wise linear probing across streams, (ii) test causal contribution with activation interventions (steering; patching/ablation), and (iii) quantify dose-response effects over an epsilon grid, reading out both the 2-3 logit margin and digit-pair-normalised choice probabilities. We find that (a) valence sign (pain vs. pleasure) is perfectly linearly separable across stream families from very early layers (L0-L1), while a lexical baseline retains substantial signal; (b) graded intensity is strongly decodable, with peaks in mid-to-late layers and especially in at...
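The two readouts named in the abstract, and the epsilon sweep they are measured over, can be illustrated with a minimal sketch. It assumes the "2-3 logit margin" means the logit of token "2" minus the logit of token "3", and that the digit-pair-normalised probability is the probability of "2" renormalised over that pair alone. The steering rule (adding epsilon times a direction to a hidden vector), the toy vectors, and all function names are hypothetical stand-ins for illustration, not the authors' code or values.

```python
import math


def logit_margin(logit_2: float, logit_3: float) -> float:
    """Assumed definition: logit of token '2' minus logit of token '3'."""
    return logit_2 - logit_3


def pair_normalised_prob(logit_2: float, logit_3: float) -> float:
    """P('2') renormalised over the pair {'2', '3'} only.

    A softmax restricted to two tokens equals a sigmoid of their margin.
    """
    return 1.0 / (1.0 + math.exp(-logit_margin(logit_2, logit_3)))


def steer(hidden, direction, epsilon):
    """Additive steering on a toy vector: h' = h + epsilon * v."""
    return [h + epsilon * v for h, v in zip(hidden, direction)]


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical toy values, not taken from the paper.
hidden = [0.5, -1.0, 0.25]        # stand-in for one residual-stream vector
direction = [1.0, 0.0, 0.0]       # assumed steering direction
unembed_2 = [0.8, 0.1, -0.3]      # assumed readout row for token '2'
unembed_3 = [-0.4, 0.2, 0.5]      # assumed readout row for token '3'
epsilon_grid = [-2.0, -1.0, 0.0, 1.0, 2.0]


def dose_response(grid):
    """Return (epsilon, margin, pair-normalised probability) per grid point."""
    rows = []
    for eps in grid:
        h = steer(hidden, direction, eps)
        l2, l3 = dot(h, unembed_2), dot(h, unembed_3)
        rows.append((eps, logit_margin(l2, l3), pair_normalised_prob(l2, l3)))
    return rows


for eps, margin, prob in dose_response(epsilon_grid):
    print(f"eps={eps:+.1f}  margin={margin:+.3f}  P(2 | 2 or 3)={prob:.3f}")
```

Under these assumptions both readouts carry the same information, since the pair-normalised probability is a monotone function of the margin. The margin is linear in the logits, while the probability saturates near 0 and 1, which is one plausible reason to report both.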