[2603.22161] Causal Evidence that Language Models use Confidence to Drive Behavior
Computer Science > Machine Learning
arXiv:2603.22161 (cs)
[Submitted on 23 Mar 2026]
Title: Causal Evidence that Language Models use Confidence to Drive Behavior
Authors: Dharshan Kumaran, Nathaniel Daw, Simon Osindero, Petar Velickovic, Viorica Patraucean

Abstract: Metacognition -- the ability to assess one's own cognitive performance -- is documented across species, with internal confidence estimates serving as a key signal for adaptive behavior. While confidence can be extracted from Large Language Model (LLM) outputs, whether models actively use these signals to regulate behavior remains a fundamental question. We investigate this through a four-phase abstention task. Phase 1 established internal confidence estimates in the absence of an abstention option. Phase 2 revealed that LLMs apply implicit thresholds to these estimates when deciding to answer or abstain. Confidence emerged as the dominant predictor of behavior, with effect sizes an order of magnitude larger than knowledge retrieval accessibility (RAG scores) or surface-level semantic features. Phase 3 provided causal evidence through activation steering: manipulating internal confidence signals correspondingly shifted abstention rates. Finally, Phase 4 demonstrated that models can systematically vary abstention policies based on instructed [...]. These findings i...
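The threshold-and-steering mechanism described in Phases 2 and 3 can be illustrated with a toy numerical sketch (this is not the paper's method; the thresholds, steering offsets, and confidence distribution below are hypothetical stand-ins for the model's internal signals):

```python
import numpy as np

rng = np.random.default_rng(0)

def abstention_rate(confidences, threshold=0.5, steer=0.0):
    """Answer when the (optionally steered) confidence clears the threshold.

    `steer` mimics activation steering: a constant shift applied to the
    internal confidence signal before the implicit threshold is applied.
    Returns the fraction of queries on which the model abstains.
    """
    steered = np.clip(confidences + steer, 0.0, 1.0)
    return float(np.mean(steered < threshold))

# Stand-in for per-query internal confidence estimates (Phase 1).
conf = rng.uniform(0.0, 1.0, size=10_000)

base = abstention_rate(conf)              # implicit threshold, no steering
up   = abstention_rate(conf, steer=0.2)   # steer confidence up
down = abstention_rate(conf, steer=-0.2)  # steer confidence down

# Shifting the confidence signal shifts the abstention rate in the
# corresponding direction, as in the paper's Phase 3 causal result.
print(f"base={base:.2f} up={up:.2f} down={down:.2f}")
```

Under this sketch, steering confidence upward lowers the abstention rate and steering it downward raises it, which is the qualitative pattern the abstract attributes to activation steering.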