[2602.13540] On Calibration of Large Language Models: From Response To Capability

arXiv - Machine Learning 4 min read Article

Summary

This paper introduces the concept of capability calibration for large language models (LLMs), emphasizing the importance of accurate confidence estimation in practical applications beyond single-response correctness.

Why It Matters

As LLMs are increasingly deployed across domains, knowing how likely a model is to solve a given query is crucial for reliable use. This research addresses a gap in existing calibration methods, proposing a framework that could improve the performance and trustworthiness of LLMs in real-world applications.

Key Takeaways

  • Capability calibration focuses on a model's expected accuracy on a query rather than just response-level confidence.
  • The study highlights the differences between capability calibration and traditional response calibration, both theoretically and empirically.
  • Improved confidence estimation can enhance prediction accuracy and optimize inference budget allocation.
  • The proposed methods establish a foundation for diverse applications in AI and machine learning.
  • This research contributes to the ongoing discourse on LLM reliability and performance metrics.
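The distinction in the first takeaway can be illustrated with a small simulation (a sketch of the general idea, not code from the paper; `sample_response_correct` is a hypothetical stand-in for generating and grading a real LLM response): under stochastic decoding, a single response is a 0/1 draw, while capability is the query's expected accuracy, which can be estimated by sampling many responses.

```python
import random

random.seed(0)

def sample_response_correct(p_solve: float) -> bool:
    """Simulate one stochastic decode: correct with probability p_solve.
    (Hypothetical stand-in for generating and grading an LLM response.)"""
    return random.random() < p_solve

def estimate_capability(p_solve: float, k: int = 32) -> float:
    """Capability = expected accuracy on the query, estimated here by
    sampling k independent responses and taking the pass rate."""
    return sum(sample_response_correct(p_solve) for _ in range(k)) / k

# A query the model solves ~70% of the time under stochastic decoding.
p = 0.7
single = sample_response_correct(p)          # response-level: a 0/1 outcome
capability = estimate_capability(p, k=200)   # capability-level: close to 0.7

print(f"single response correct: {single}")
print(f"estimated capability:    {capability:.2f}")
```

The single-response outcome varies from run to run even though the model's underlying ability on the query is fixed, which is the mismatch the paper attributes to stochastic decoding.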

Computer Science > Computation and Language
arXiv:2602.13540 (cs) [Submitted on 14 Feb 2026]

Title: On Calibration of Large Language Models: From Response To Capability
Authors: Sin-Han Yang, Cheng-Kuang Wu, Chieh-Yen Lin, Yun-Nung Chen, Hung-yi Lee, Shao-Hua Sun

Abstract: Large language models (LLMs) are widely deployed as general-purpose problem solvers, making accurate confidence estimation critical for reliable use. Prior work on LLM calibration largely focuses on response-level confidence, which estimates the correctness of a single generated output. However, this formulation is misaligned with many practical settings where the central question is how likely a model is to solve a query overall. We show that this mismatch results from the stochastic nature of modern LLM decoding, under which single-response correctness fails to reflect underlying model capability. To address this issue, we introduce capability calibration, which targets the model's expected accuracy on a query. We formally distinguish capability calibration from response calibration and show that the two differ both theoretically and empirically. We establish an empirical evaluation setup and study a range of confidence estimation methods. Our results demonstrate that capability-calibrated confi...
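One way to see why the two calibration targets differ is to score a hypothetical oracle estimator (one that outputs each query's true solve rate) against both kinds of targets. Against single-response 0/1 targets, even the oracle incurs irreducible error from decoding noise; against capability targets, it scores perfectly. This is an illustrative sketch using the Brier score, not the paper's evaluation setup:

```python
import random

random.seed(1)

def brier(confidences, targets):
    """Brier score: mean squared error between confidence and target."""
    return sum((c - t) ** 2 for c, t in zip(confidences, targets)) / len(targets)

# Hypothetical pool of queries with known per-query solve rates.
solve_rates = [random.random() for _ in range(5000)]

# Oracle estimator: outputs each query's true solve rate.
conf = solve_rates

# Response-level target: correctness of ONE stochastic sample (0 or 1).
response_targets = [1.0 if random.random() < p else 0.0 for p in solve_rates]

# Capability-level target: the expected accuracy itself.
capability_targets = solve_rates

print(f"Brier vs single-response targets: {brier(conf, response_targets):.3f}")
print(f"Brier vs capability targets:      {brier(conf, capability_targets):.3f}")
```

The nonzero score against response targets comes entirely from per-sample decoding noise, so a perfectly informed estimator can still look miscalibrated under response-level evaluation.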

Related Articles

[2603.16105] Frequency Matters: Fast Model-Agnostic Data Curation for Pruning and Quantization (arXiv - AI)

[2603.09643] MM-tau-p$^2$: Persona-Adaptive Prompting for Robust Multi-Modal Agent Evaluation in Dual-Control Settings (arXiv - AI)

[2603.07339] Agora: Teaching the Skill of Consensus-Finding with AI Personas Grounded in Human Voice (arXiv - AI)

[2602.00185] QUASAR: A Universal Autonomous System for Atomistic Simulation and a Benchmark of Its Capabilities (arXiv - AI)