[2603.02222] MedCalc-Bench Doesn't Measure What You Think: A Benchmark Audit and the Case for Open-Book Evaluation


Computer Science > Machine Learning
arXiv:2603.02222 (cs)
[Submitted on 10 Feb 2026]

Title: MedCalc-Bench Doesn't Measure What You Think: A Benchmark Audit and the Case for Open-Book Evaluation
Authors: Artus Krohn-Grimberghe

Abstract: MedCalc-Bench is a widely used benchmark for evaluating LLM performance on clinical calculator tasks, with state-of-the-art direct prompting scores plateauing around 35% on the Verified split (HELM MedHELM leaderboard) and the best published approach, RL with verifiable rewards, reaching 74%. We present three contributions that challenge the benchmark's current framing. First, we conduct a systematic audit of the benchmark's calculator implementations, identifying and fixing over 20 errors, ranging from critical formula inaccuracies to runtime bugs, in a NeurIPS-published dataset. Second, we show that a simple intervention, providing the model with the calculator specification at inference time ("open-book" prompting), raises accuracy from ~52% to 81-85% on GLM-4.6V and GLM-4.7, surpassing all published results including RL-trained systems, without any fine-tuning. Third, we establish an upper bound of 95-97% using GPT-5.2-Thinking, with residual errors attributable primarily to ground-truth issues and dataset ambiguities. Our findings suggest that MedCalc-Bench pr...
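The "open-book" intervention the abstract describes amounts to supplying the calculator's specification in the prompt so the model applies a given formula rather than recalling it from weights. A minimal sketch of what such a prompt construction might look like, assuming an illustrative Cockcroft-Gault specification and a made-up patient note (the paper's actual prompt templates and specifications are not reproduced here):

```python
# Illustrative sketch of "open-book" prompting for a clinical-calculator task.
# The specification text and patient note are hypothetical examples, not taken
# from MedCalc-Bench or the paper.

CREATININE_CLEARANCE_SPEC = """\
Cockcroft-Gault creatinine clearance:
  CrCl (mL/min) = ((140 - age) * weight_kg) / (72 * serum_creatinine_mg_dl)
  Multiply the result by 0.85 for female patients.
"""

def build_open_book_prompt(patient_note: str, question: str, calculator_spec: str) -> str:
    """Assemble a prompt that supplies the calculator specification at
    inference time, so the model computes from the given formula instead of
    relying on memorized medical knowledge."""
    return (
        "You are given a clinical calculator specification. Apply it exactly.\n\n"
        f"=== Calculator specification ===\n{calculator_spec}\n"
        f"=== Patient note ===\n{patient_note}\n\n"
        f"=== Question ===\n{question}\n"
        "Show your substitution into the formula, then state the final value."
    )

note = "A 60-year-old male patient, 72 kg, serum creatinine 1.0 mg/dL."
prompt = build_open_book_prompt(
    note, "What is the patient's creatinine clearance?", CREATININE_CLEARANCE_SPEC
)
print(prompt)
```

The resulting string would then be sent to whatever model is under evaluation; the point of the design is that accuracy gains come purely from information placement at inference time, with no fine-tuning.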

Originally published on March 04, 2026. Curated by AI News.
