[2602.22758] Decomposing Physician Disagreement in HealthBench

arXiv - AI · 3 min

Summary

This paper decomposes physician disagreement in the HealthBench dataset, quantifying how much of the variance in evaluations observable factors can explain and suggesting improvements for medical AI assessment.

Why It Matters

Understanding the sources of disagreement among physicians is crucial for enhancing the reliability of AI in healthcare. This research highlights the structural challenges in medical evaluations and points to actionable strategies for reducing uncertainty, which can lead to better patient outcomes and more effective AI tools.

Key Takeaways

  • Physician disagreement in evaluations is largely structural, with 81.8% of variance unexplained by existing metadata.
  • Reducible uncertainties, such as missing context, significantly increase disagreement odds, while genuine medical ambiguity does not.
  • Improving information clarity in evaluation scenarios could help reduce disagreement among physicians.
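The variance shares discussed above can be estimated with a simple between-group decomposition: for each grouping factor (rubric, physician, case), compare the variance of the group means against the total outcome variance. Below is a minimal sketch on synthetic data; all counts, group sizes, and effect sizes are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic grading records (illustrative only): each row is one
# physician's met/not-met judgement on one (case, rubric) pair.
n = 2000
case = rng.integers(0, 100, n)
rubric = rng.integers(0, 30, n)
physician = rng.integers(0, 15, n)

# Simulate labels where rubric identity carries some real signal,
# while physician and case assignments are pure noise.
rubric_bias = rng.normal(0.0, 0.8, 30)
p = 1.0 / (1.0 + np.exp(-rubric_bias[rubric]))
met = (rng.random(n) < p).astype(float)

def variance_explained(groups, outcome):
    """Between-group share of outcome variance (eta-squared)."""
    total = outcome.var()
    group_means = np.zeros_like(outcome)
    for g in np.unique(groups):
        mask = groups == g
        group_means[mask] = outcome[mask].mean()
    return group_means.var() / total

for name, g in [("rubric", rubric), ("physician", physician), ("case", case)]:
    print(name, round(variance_explained(g, met), 3))
```

In this toy setup the rubric factor, which actually carries signal, should explain a visibly larger share than the noise factors; the paper's point is that in HealthBench even the best observable factors leave most of the variance in a case-level residual.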

Computer Science > Artificial Intelligence
arXiv:2602.22758 (cs) [Submitted on 26 Feb 2026]

Title: Decomposing Physician Disagreement in HealthBench
Authors: Satya Borgohain, Roy Mariathas

Abstract: We decompose physician disagreement in the HealthBench medical AI evaluation dataset to understand where variance resides and what observable features can explain it. Rubric identity accounts for 15.8% of met/not-met label variance but only 3.6-6.9% of disagreement variance; physician identity accounts for just 2.4%. The dominant 81.8% case-level residual is not reduced by HealthBench's metadata labels (z = -0.22, p = 0.83), normative rubric language (pseudo R^2 = 1.2%), medical specialty (0/300 Tukey pairs significant), surface-feature triage (AUC = 0.58), or embeddings (AUC = 0.485). Disagreement follows an inverted-U with completion quality (AUC = 0.689), confirming physicians agree on clearly good or bad outputs but split on borderline cases. Physician-validated uncertainty categories reveal that reducible uncertainty (missing context, ambiguous phrasing) more than doubles disagreement odds (OR = 2.55, p < 10^(-24)), while irreducible uncertainty (genuine medical ambiguity) has no effect (OR = 1.01, p = 0.90), though even the former explains only ~3% of total variance. The agreement ceiling in medical AI evaluation is thus largely structural, but th...
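An odds ratio like the one reported for reducible uncertainty can be computed in miniature from a 2x2 contingency table of (uncertainty flag) x (physicians disagreed). The sketch below uses made-up counts, not the paper's data, and a standard Wald 95% confidence interval on the log odds ratio.

```python
import numpy as np

# Hypothetical 2x2 counts (illustrative only):
# rows = reducible-uncertainty flag (absent / present),
# cols = physicians disagreed (no / yes).
table = np.array([
    [800, 200],   # flag absent:  200/1000 graded pairs show disagreement
    [600, 400],   # flag present: 400/1000 graded pairs show disagreement
])

def odds_ratio(t):
    """Odds of disagreement in the exposed row vs. the reference row."""
    (a, b), (c, d) = t  # a, b = reference group; c, d = exposed group
    return (d / c) / (b / a)

or_hat = odds_ratio(table)

# Wald 95% CI: SE of the log odds ratio is sqrt(sum of 1/cell).
se = np.sqrt((1.0 / table).sum())
lo, hi = np.exp(np.log(or_hat) + np.array([-1.96, 1.96]) * se)
print(round(or_hat, 2), round(lo, 2), round(hi, 2))
```

With these counts the flag roughly doubles the odds of disagreement; the paper's reported OR of 2.55 for reducible uncertainty comes from a logistic regression over its physician-validated categories, not from a raw 2x2 table like this one.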

Related Articles

UMKC Announces New Master of Science in Artificial Intelligence
AI Infrastructure

UMKC announces a new Master of Science in Artificial Intelligence program aimed at addressing workforce demand for AI expertise, set to l...

AI News - General · 4 min ·
Accelerating science with AI and simulations
Machine Learning

MIT Professor Rafael Gómez-Bombarelli discusses the transformative potential of AI in scientific research, emphasizing its role in materi...

AI News - General · 10 min ·
[2601.12910] SciCoQA: Quality Assurance for Scientific Paper--Code Alignment
AI Safety

Abstract page for arXiv paper 2601.12910: SciCoQA: Quality Assurance for Scientific Paper--Code Alignment

arXiv - AI · 3 min ·
[2601.06394] Context Matters: Peer-Aware Student Behavioral Engagement Measurement via VLM Action Parsing and LLM Sequence Classification
LLMs

Abstract page for arXiv paper 2601.06394: Context Matters: Peer-Aware Student Behavioral Engagement Measurement via VLM Action Parsing an...

arXiv - AI · 4 min ·