[2602.19159] Beyond Behavioural Trade-Offs: Mechanistic Tracing of Pain-Pleasure Decisions in an LLM

[2602.19159] Beyond Behavioural Trade-Offs: Mechanistic Tracing of Pain-Pleasure Decisions in an LLM

arXiv - Machine Learning 4 min read Article

Summary

This article explores how large language models (LLMs) make decisions based on pain and pleasure, linking behavioral evidence with mechanistic interpretability through a series of experiments.

Why It Matters

Understanding the internal mechanisms of LLMs in decision-making processes is crucial for advancing AI safety and governance. This research provides insights into how LLMs process valence-related information, which can inform policies and standards for AI development.

Key Takeaways

  • LLMs exhibit sensitivity to pain-pleasure framing in decision-making.
  • Valence sign (pain vs. pleasure) is discernible from early layers in LLMs.
  • Graded intensity of decisions peaks in mid-to-late layers of the model.
  • Causal interventions reveal that decision-making effects are distributed across multiple heads in the model.
  • Findings support discussions on AI sentience and the need for robust governance frameworks.

Computer Science > Artificial Intelligence arXiv:2602.19159 (cs) [Submitted on 22 Feb 2026] Title:Beyond Behavioural Trade-Offs: Mechanistic Tracing of Pain-Pleasure Decisions in an LLM Authors:Francesca Bianco, Derek Shiller View a PDF of the paper titled Beyond Behavioural Trade-Offs: Mechanistic Tracing of Pain-Pleasure Decisions in an LLM, by Francesca Bianco and Derek Shiller View PDF Abstract:Prior behavioural work suggests that some LLMs alter choices when options are framed as causing pain or pleasure, and that such deviations can scale with stated intensity. To bridge behavioural evidence (what the model does) with mechanistic interpretability (what computations support it), we investigate how valence-related information is represented and where it is causally used inside a transformer. Using Gemma-2-9B-it and a minimalist decision task modelled on prior work, we (i) map representational availability with layer-wise linear probing across streams, (ii) test causal contribution with activation interventions (steering; patching/ablation), and (iii) quantify dose-response effects over an epsilon grid, reading out both the 2-3 logit margin and digit-pair-normalised choice probabilities. We find that (a) valence sign (pain vs. pleasure) is perfectly linearly separable across stream families from very early layers (L0-L1), while a lexical baseline retains substantial signal; (b) graded intensity is strongly decodable, with peaks in mid-to-late layers and especially in at...

Related Articles

Llms

Nvidia goes all-in on AI agents while Anthropic pulls the plug

TLDR: Nvidia is partnering with 17 major companies to build a platform specifically for enterprise AI agents, basically trying to become ...

Reddit - Artificial Intelligence · 1 min ·
Anthropic says Claude Code subscribers will need to pay extra for OpenClaw usage | TechCrunch
Llms

Anthropic says Claude Code subscribers will need to pay extra for OpenClaw usage | TechCrunch

It’s about to become more expensive for Claude Code subscribers to use Anthropic’s coding assistant with OpenClaw and other third-party t...

TechCrunch - AI · 4 min ·
Llms

I am seeing Claude everywhere

Every single Instagram reel or TikTok I scroll i see people mentioning Claude and glazing it like it’s some kind of master tool that’s be...

Reddit - Artificial Intelligence · 1 min ·
Llms

Claude Opus 4.6 API at 40% below Anthropic pricing – try free before you pay anything

Hey everyone I've set up a self-hosted API gateway using [New-API](QuantumNous/new-ap) to manage and distribute Claude Opus 4.6 access ac...

Reddit - Artificial Intelligence · 1 min ·
More in Llms: This Week Guide Trending

No comments

No comments yet. Be the first to comment!

Stay updated with AI News

Get the latest news, tools, and insights delivered to your inbox.

Daily or weekly digest • Unsubscribe anytime