[2506.20666] Cognitive models can reveal interpretable value trade-offs in language models
Computer Science > Computation and Language
arXiv:2506.20666 (cs)
[Submitted on 25 Jun 2025 (v1), last revised 2 Mar 2026 (this version, v4)]

Title: Cognitive models can reveal interpretable value trade-offs in language models
Authors: Sonia K. Murthy, Rosie Zhao, Jennifer Hu, Sham Kakade, Markus Wulfmeier, Peng Qian, Tomer Ullman

Abstract: Value trade-offs are an integral part of human decision-making and language use; however, current tools for interpreting such dynamic and multi-faceted notions of values in language models are limited. In cognitive science, so-called "cognitive models" provide formal accounts of such trade-offs in humans by modeling the weighting of a speaker's competing utility functions in choosing an action or utterance. Here, we show that a leading cognitive model of polite speech can be used to systematically evaluate alignment-relevant trade-offs in language models via two encompassing settings: degrees of reasoning "effort" and system prompt manipulations in closed-source frontier models, and RL post-training dynamics of open-source models. Our results show that LLMs' behavioral profiles under the cognitive model a) shift predictably when they are prompted to prioritize certain goals, b) are amplified by a small reasoning budget, and c) can be used to diagnose other social ...
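The "weighting of a speaker's competing utility functions" can be sketched as a small illustrative model: a softmax speaker that mixes an informational utility (being truthful) with a social utility (being kind) under a trade-off weight. This is a minimal sketch in the spirit of RSA-style polite-speech models; the utterances, utility values, and the parameter names `phi` and `alpha` are all illustrative assumptions, not the paper's actual implementation.

```python
import math

# Illustrative utterances a speaker might choose when describing
# mediocre work. Utility values below are made up for the sketch.
UTTERANCES = ["terrible", "okay", "amazing"]

# Informational utility: how well the utterance conveys the true state
# (the work was mediocre). Social utility: how kind it sounds.
informational = {"terrible": -2.0, "okay": 1.0, "amazing": -1.5}
social = {"terrible": -2.0, "okay": 0.0, "amazing": 2.0}

def speaker_probs(phi: float, alpha: float = 1.0) -> dict[str, float]:
    """Softmax over utterances under the mixed utility
    phi * informational + (1 - phi) * social, with rationality alpha.

    phi = 1.0 is a purely informative speaker; phi = 0.0 is purely social.
    """
    scores = {
        u: alpha * (phi * informational[u] + (1 - phi) * social[u])
        for u in UTTERANCES
    }
    z = sum(math.exp(s) for s in scores.values())
    return {u: math.exp(s) / z for u, s in scores.items()}

# A fully informative speaker favors the truthful "okay";
# a purely social speaker favors the flattering "amazing".
print(speaker_probs(1.0))
print(speaker_probs(0.0))
```

Fitting a weight like `phi` to a model's observed utterance choices is the kind of interpretable behavioral profile the abstract describes: the estimated weight summarizes where the model sits on the informativeness-kindness trade-off.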