[2509.24186] Measuring Competency, Not Performance: Item-Aware Evaluation Across Medical Benchmarks
Computer Science > Computation and Language
arXiv:2509.24186 (cs)
[Submitted on 29 Sep 2025 (v1), last revised 6 Apr 2026 (this version, v2)]

Title: Measuring Competency, Not Performance: Item-Aware Evaluation Across Medical Benchmarks
Authors: Zhimeng Luo, Lixin Wu, Adam Frisch, Daqing He

Abstract: Accuracy-based evaluation of Large Language Models (LLMs) measures benchmark-specific performance rather than underlying medical competency: it treats all questions as equally informative, conflates model ability with item characteristics, and thereby produces rankings that vary with benchmark choice. To address this, we introduce MedIRT, a psychometric evaluation framework grounded in Item Response Theory (IRT) that (1) jointly models latent competency and item-level difficulty and discrimination, and (2) includes benchmark integrity validation to ensure items within each topic measure a single, coherent underlying ability. We prospectively evaluate 71 diverse LLMs on a USMLE-aligned benchmark across 11 medical topics. As internal validation, MedIRT correctly predicts held-out LLM responses on unseen questions with 83.3% accuracy. As external validation, IRT-based rankings outperform accuracy-based rankings across 6 independent external medical benchmarks -- including expert preferences, holistic cli...
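The abstract's description of jointly modeling latent competency with item-level difficulty and discrimination matches a two-parameter logistic (2PL) IRT model; the exact parameterization used in the paper is not given here, so the following is only a sketch of the standard 2PL form, with the symbols theta_i, a_j, and b_j introduced for illustration:

\[
P(y_{ij} = 1 \mid \theta_i, a_j, b_j) \;=\; \frac{1}{1 + \exp\bigl(-a_j(\theta_i - b_j)\bigr)}
\]

where \(\theta_i\) is the latent competency of LLM \(i\), \(b_j\) the difficulty of item \(j\), and \(a_j\) its discrimination. Under such a model, the internal validation reported above would amount to predicting held-out responses \(y_{ij}\) from the fitted \(\theta_i\), \(a_j\), and \(b_j\).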