[2603.25052] Closing the Confidence-Faithfulness Gap in Large Language Models
Computer Science > Computation and Language
arXiv:2603.25052 (cs)
[Submitted on 26 Mar 2026]

Title: Closing the Confidence-Faithfulness Gap in Large Language Models
Authors: Miranda Muqing Miao, Lyle Ungar

Abstract: Large language models (LLMs) tend to verbalize confidence scores that are largely detached from their actual accuracy, yet the geometric relationships governing this behavior remain poorly understood. In this work, we present a mechanistic interpretability analysis of verbalized confidence, using linear probes and contrastive activation addition (CAA) steering to show that calibration and verbalized confidence signals are encoded linearly but are orthogonal to one another -- a finding consistent across three open-weight models and four datasets. Interestingly, when models are prompted to simultaneously reason through a problem and verbalize a confidence score, the reasoning process disrupts the verbalized confidence direction, exacerbating miscalibration. We term this the "Reasoning Contamination Effect." Leveraging this insight, we introduce a two-stage adaptive steering pipeline that reads the model's internal accuracy estimate and steers verbalized output to match it, substantially improving calibration alignment across all evaluated models.

Subjects: Computation and Language (cs.CL); Artificial ...
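The abstract names two concrete mechanisms: linear probes read from hidden activations, and contrastive activation addition (CAA), which adds a difference-of-means vector back into the residual stream during generation. The sketch below illustrates both under stated assumptions; it is not the paper's code. The model name, layer index, steering coefficient `alpha`, and the contrastive prompt sets are hypothetical placeholders, and the hook assumes the HuggingFace Llama layer layout (`model.model.layers`).

```python
# Minimal sketch (not the paper's implementation) of a linear probe plus
# CAA steering. MODEL, LAYER, and alpha are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"  # assumed open-weight model
LAYER = 16                          # assumed probe/steering layer
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float32)
model.eval()

def hidden_at_layer(prompt: str) -> torch.Tensor:
    """Residual-stream activation of the final token at LAYER."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1]  # shape: (hidden_dim,)

# 1. Linear probe: logistic regression from activations to correctness.
#    X: (n, hidden_dim) activations; y: (n,) floats, 1.0 if answer was correct.
def fit_probe(X: torch.Tensor, y: torch.Tensor, steps: int = 200, lr: float = 1e-2):
    w = torch.zeros(X.shape[1], requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([w, b], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.binary_cross_entropy_with_logits(X @ w + b, y)
        loss.backward()
        opt.step()
    return w.detach(), b.detach()

# 2. CAA vector: mean activation difference between contrastive prompt sets
#    (e.g. high- vs. low-confidence verbalizations; hypothetical contrast set).
def caa_vector(high_conf: list[str], low_conf: list[str]) -> torch.Tensor:
    h_hi = torch.stack([hidden_at_layer(p) for p in high_conf]).mean(0)
    h_lo = torch.stack([hidden_at_layer(p) for p in low_conf]).mean(0)
    return h_hi - h_lo

# 3. Steered generation: add alpha * vec into the residual stream at LAYER.
def steer_and_generate(prompt: str, vec: torch.Tensor, alpha: float = 4.0) -> str:
    def hook(_module, _inputs, output):
        # Llama decoder layers return a tuple whose first element is the
        # hidden states; returning a value from the hook replaces the output.
        hidden = output[0] + alpha * vec.to(output[0].dtype)
        return (hidden,) + output[1:]
    handle = model.model.layers[LAYER].register_forward_hook(hook)
    try:
        ids = tok(prompt, return_tensors="pt")
        out = model.generate(**ids, max_new_tokens=32)
    finally:
        handle.remove()
    return tok.decode(out[0], skip_special_tokens=True)
```

Under the abstract's two-stage design, the probe's correctness estimate from the first stage would plausibly set the sign and magnitude of `alpha` in the second, steering the verbalized confidence toward the model's internal accuracy estimate; the exact coupling is the paper's contribution and is not reproduced here.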