[2602.20629] QEDBENCH: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs

arXiv · Machine Learning

Summary

The paper presents QEDBench, a benchmark that measures how well automated LLM-based graders align with human experts when assessing university-level mathematical proofs, revealing systematic biases in current models.

Why It Matters

As AI systems increasingly handle complex evaluation tasks, understanding how closely their judgments track human judgment is crucial. This research quantifies the gaps in existing models and provides a framework for improving AI-based assessment in higher education, which can influence both educational outcomes and AI development.

Key Takeaways

  • QEDBench introduces a dual-rubric system to evaluate AI performance against human experts.
  • Significant biases were found in models like Claude Opus 4.5 and DeepSeek-V3, inflating scores compared to human evaluations.
  • Performance gaps were identified in discrete mathematics, with some models showing marked declines in evaluation scores.
  • The benchmark is publicly available, encouraging further research and development in AI evaluation methods.
  • Understanding these alignment gaps is essential for improving AI's reliability in educational contexts.
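The "alignment gap" the takeaways describe boils down to comparing an LLM judge's rubric scores against human expert scores for the same proofs. A minimal sketch of that comparison is below; the scores and metric names are hypothetical illustrations, not data or methods from the paper:

```python
from statistics import mean

# Hypothetical rubric scores (0-10) for the same six proofs.
# Neither the numbers nor the metrics below come from QEDBench.
human_scores = [6, 4, 8, 3, 7, 5]
judge_scores = [8, 6, 9, 5, 8, 7]  # an LLM judge that tends to over-score

# Mean signed bias: positive values mean the judge inflates scores
# relative to the human experts.
bias = mean(j - h for j, h in zip(judge_scores, human_scores))

# Mean absolute error: total disagreement regardless of direction.
mae = mean(abs(j - h) for j, h in zip(judge_scores, human_scores))

print(f"mean signed bias: {bias:+.2f}")  # prints +1.67: systematic inflation
print(f"mean abs error:   {mae:.2f}")    # prints 1.67
```

A signed bias near zero with a large absolute error would indicate noisy but unbiased grading; the paper's finding of score inflation corresponds to a consistently positive signed bias.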

Computer Science > Machine Learning · arXiv:2602.20629 (cs) · Submitted on 24 Feb 2026

Title: QEDBENCH: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs

Authors: Santiago Gonzalez, Alireza Amiri Bavandpour, Peter Ye, Edward Zhang, Ruslans Aleksejevs, Todor Antić, Polina Baron, Sujeet Bhalerao, Shubhrajit Bhattacharya, Zachary Burton, John Byrne, Hyungjun Choi, Nujhat Ahmed Disha, Koppany István Encz, Yuchen Fang, Robert Joseph George, Ebrahim Ghorbani, Alan Goldfarb, Jing Guo, Meghal Gupta, Stefano Huber, Annika Kanckos, Minjung Kang, Hyun Jong Kim, Dino Lorenzini, Levi Lorenzo, Tianyi Mao, Giovanni Marzenta, Ariane M. Masuda, Lukas Mauth, Ana Mickovic, Andres Miniguano-Trujillo, Antoine Moulin, Wenqi Ni, Tomos Parry, Kevin Ren, Hossein Roodbarani, Mathieu Rundström, Manjil Saikia, Detchat Samart, Rebecca Steiner, Connor Stewart, Dhara Thakkar, Jeffrey Tse, Vasiliki Velona, Yunhai Xiang, Sibel Yalçın, Jun Yan, Ji Zeng, Arman Cohan, Quanquan C. Liu

Abstract: As Large Language Models (LLMs) saturate elementary benchmarks, the research frontier has shifted from generation to the reliability of automated evaluation. We demonstrate that standard "LLM-as-a-Judge" protocols suffer from a systematic Alignment Gap when applied to upper-und...

Related Articles

Anthropic’s Unreleased Claude Mythos Might Be The Most Advanced AI Model Yet

Anthropic is testing an unreleased artificial intelligence (AI) model with capabilities that exceed any system it has previously released...

Anthropic leaks part of Claude Code's internal source code

Claude Code has seen massive adoption over the last year, and its run-rate revenue had swelled to more than $2.5 billion as of February.

Australian government and Anthropic sign MOU for AI safety and research

Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.

Penguin to sue OpenAI over ChatGPT version of German children’s book

Publisher alleges AI research company’s chatbot violated its copyright over Coconut the Little Dragon series