[2602.13540] On Calibration of Large Language Models: From Response To Capability

arXiv - Machine Learning 4 min read Article

Summary

This paper introduces the concept of capability calibration for large language models (LLMs), emphasizing the importance of accurate confidence estimation in practical applications beyond single-response correctness.

Why It Matters

As LLMs are increasingly deployed across domains, knowing how likely a model is to solve a given query is crucial for reliable use. This research addresses a gap in existing calibration methods, proposing a framework that could improve the performance and trustworthiness of LLMs in real-world applications.

Key Takeaways

  • Capability calibration focuses on a model's expected accuracy on a query rather than just response-level confidence.
  • The study highlights the differences between capability calibration and traditional response calibration, both theoretically and empirically.
  • Improved confidence estimation can enhance prediction accuracy and optimize inference budget allocation.
  • The proposed methods establish a foundation for diverse applications in AI and machine learning.
  • This research contributes to the ongoing discourse on LLM reliability and performance metrics.
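The distinction in the first takeaway can be illustrated with a small simulation (a sketch of the general idea, not code from the paper; `sample_response_correct` is a hypothetical stand-in for generating and grading a real LLM response): under stochastic decoding, a single response is a 0/1 draw, while capability is the query's expected accuracy, which can be estimated by sampling many responses.

```python
import random

random.seed(0)

def sample_response_correct(p_solve: float) -> bool:
    """Simulate one stochastic decode: correct with probability p_solve.
    (Hypothetical stand-in for generating and grading an LLM response.)"""
    return random.random() < p_solve

def estimate_capability(p_solve: float, k: int = 32) -> float:
    """Capability = expected accuracy on the query, estimated here by
    sampling k independent responses and taking the pass rate."""
    return sum(sample_response_correct(p_solve) for _ in range(k)) / k

# A query the model solves ~70% of the time under stochastic decoding.
p = 0.7
single = sample_response_correct(p)          # response-level: a 0/1 outcome
capability = estimate_capability(p, k=200)   # capability-level: close to 0.7

print(f"single response correct: {single}")
print(f"estimated capability:    {capability:.2f}")
```

The single-response outcome varies from run to run even though the model's underlying ability on the query is fixed, which is the mismatch the paper attributes to stochastic decoding.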

Computer Science > Computation and Language
arXiv:2602.13540 (cs) [Submitted on 14 Feb 2026]

Title: On Calibration of Large Language Models: From Response To Capability
Authors: Sin-Han Yang, Cheng-Kuang Wu, Chieh-Yen Lin, Yun-Nung Chen, Hung-yi Lee, Shao-Hua Sun

Abstract: Large language models (LLMs) are widely deployed as general-purpose problem solvers, making accurate confidence estimation critical for reliable use. Prior work on LLM calibration largely focuses on response-level confidence, which estimates the correctness of a single generated output. However, this formulation is misaligned with many practical settings where the central question is how likely a model is to solve a query overall. We show that this mismatch results from the stochastic nature of modern LLM decoding, under which single-response correctness fails to reflect underlying model capability. To address this issue, we introduce capability calibration, which targets the model's expected accuracy on a query. We formally distinguish capability calibration from response calibration and show that the two differ both theoretically and empirically. We establish an empirical evaluation setup and study a range of confidence estimation methods. Our results demonstrate that capability-calibrated confi...
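One way to see why the two calibration targets differ is to score a hypothetical oracle estimator (one that outputs each query's true solve rate) against both kinds of targets. Against single-response 0/1 targets, even the oracle incurs irreducible error from decoding noise; against capability targets, it scores perfectly. This is an illustrative sketch using the Brier score, not the paper's evaluation setup:

```python
import random

random.seed(1)

def brier(confidences, targets):
    """Brier score: mean squared error between confidence and target."""
    return sum((c - t) ** 2 for c, t in zip(confidences, targets)) / len(targets)

# Hypothetical pool of queries with known per-query solve rates.
solve_rates = [random.random() for _ in range(5000)]

# Oracle estimator: outputs each query's true solve rate.
conf = solve_rates

# Response-level target: correctness of ONE stochastic sample (0 or 1).
response_targets = [1.0 if random.random() < p else 0.0 for p in solve_rates]

# Capability-level target: the expected accuracy itself.
capability_targets = solve_rates

print(f"Brier vs single-response targets: {brier(conf, response_targets):.3f}")
print(f"Brier vs capability targets:      {brier(conf, capability_targets):.3f}")
```

The nonzero score against response targets comes entirely from per-sample decoding noise, so a perfectly informed estimator can still look miscalibrated under response-level evaluation.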

Related Articles

[2603.16105] Frequency Matters: Fast Model-Agnostic Data Curation for Pruning and Quantization (arXiv - AI)

[2603.09643] MM-tau-p$^2$: Persona-Adaptive Prompting for Robust Multi-Modal Agent Evaluation in Dual-Control Settings (arXiv - AI)

[2603.07339] Agora: Teaching the Skill of Consensus-Finding with AI Personas Grounded in Human Voice (arXiv - AI)

[2602.00185] QUASAR: A Universal Autonomous System for Atomistic Simulation and a Benchmark of Its Capabilities (arXiv - AI)