[2506.20666] Cognitive models can reveal interpretable value trade-offs in language models
Computer Science > Computation and Language
arXiv:2506.20666 (cs)
[Submitted on 25 Jun 2025 (v1), last revised 2 Mar 2026 (this version, v4)]

Title: Cognitive models can reveal interpretable value trade-offs in language models
Authors: Sonia K. Murthy, Rosie Zhao, Jennifer Hu, Sham Kakade, Markus Wulfmeier, Peng Qian, Tomer Ullman

Abstract: Value trade-offs are an integral part of human decision-making and language use; however, current tools for interpreting such dynamic and multi-faceted notions of values in language models are limited. In cognitive science, so-called "cognitive models" provide formal accounts of such trade-offs in humans by modeling the weighting of a speaker's competing utility functions in choosing an action or utterance. Here, we show that a leading cognitive model of polite speech can be used to systematically evaluate alignment-relevant trade-offs in language models via two encompassing settings: degrees of reasoning "effort" and system prompt manipulations in closed-source frontier models, and RL post-training dynamics of open-source models. Our results show that LLMs' behavioral profiles under the cognitive model a) shift predictably when they are prompted to prioritize certain goals, b) are amplified by a small reasoning budget, and c) can be used to diagnose other social ...
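The "weighting of a speaker's competing utility functions" can be sketched as a small illustrative model: a softmax speaker that mixes an informational utility (being truthful) with a social utility (being kind) under a trade-off weight. This is a minimal sketch in the spirit of RSA-style polite-speech models; the utterances, utility values, and the parameter names `phi` and `alpha` are all illustrative assumptions, not the paper's actual implementation.

```python
import math

# Illustrative utterances a speaker might choose when describing
# mediocre work. Utility values below are made up for the sketch.
UTTERANCES = ["terrible", "okay", "amazing"]

# Informational utility: how well the utterance conveys the true state
# (the work was mediocre). Social utility: how kind it sounds.
informational = {"terrible": -2.0, "okay": 1.0, "amazing": -1.5}
social = {"terrible": -2.0, "okay": 0.0, "amazing": 2.0}

def speaker_probs(phi: float, alpha: float = 1.0) -> dict[str, float]:
    """Softmax over utterances under the mixed utility
    phi * informational + (1 - phi) * social, with rationality alpha.

    phi = 1.0 is a purely informative speaker; phi = 0.0 is purely social.
    """
    scores = {
        u: alpha * (phi * informational[u] + (1 - phi) * social[u])
        for u in UTTERANCES
    }
    z = sum(math.exp(s) for s in scores.values())
    return {u: math.exp(s) / z for u, s in scores.items()}

# A fully informative speaker favors the truthful "okay";
# a purely social speaker favors the flattering "amazing".
print(speaker_probs(1.0))
print(speaker_probs(0.0))
```

Fitting a weight like `phi` to a model's observed utterance choices is the kind of interpretable behavioral profile the abstract describes: the estimated weight summarizes where the model sits on the informativeness-kindness trade-off.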