[2602.20629] QEDBENCH: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs
Summary
The paper introduces QEDBench, a benchmark that measures how closely automated evaluators align with human experts when grading university-level mathematical proofs, and it reveals significant biases in current models.
Why It Matters
As AI systems increasingly take on complex evaluation tasks, understanding how well their judgments align with human experts is crucial. This research exposes alignment gaps in existing models and provides a framework for improving automated assessment in higher education, with implications for both educational outcomes and AI development.
Key Takeaways
- QEDBench introduces a dual-rubric system to evaluate AI performance against human experts.
- Significant biases were found in models such as Claude Opus 4.5 and DeepSeek-V3, which inflate scores relative to human evaluations (a toy calculation of this kind of gap is sketched after this list).
- Performance gaps were identified in discrete mathematics, with some models showing marked declines in evaluation scores.
- The benchmark is publicly available, encouraging further research and development in AI evaluation methods.
- Understanding these alignment gaps is essential for improving AI's reliability in educational contexts.
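To make the alignment gap concrete, here is a minimal sketch of how such a gap might be quantified, assuming per-proof scores from both human experts and an LLM judge on a shared rubric scale. This is not the paper's implementation; the scores, scale, and function names are illustrative.

```python
# Minimal sketch, not the paper's implementation: quantify an "alignment
# gap" as the mean signed difference between an LLM judge's scores and
# human expert scores on the same proofs. All values are illustrative.
from statistics import mean

def alignment_gap(human_scores: list[float], model_scores: list[float]) -> float:
    """Mean signed difference (model - human); positive means the judge inflates scores."""
    assert len(human_scores) == len(model_scores), "one score pair per proof"
    return mean(m - h for h, m in zip(human_scores, model_scores))

# Hypothetical rubric scores on a 0-10 scale for five proofs.
human = [6.0, 4.5, 8.0, 3.0, 7.5]
model = [7.5, 6.0, 8.5, 5.0, 8.0]

print(f"alignment gap: {alignment_gap(human, model):+.2f} points")
# -> alignment gap: +1.20 points (the judge systematically over-scores)
```

A signed bias alone can hide disagreements that cancel out, so a fuller evaluation would pair it with per-topic breakdowns (e.g., discrete mathematics, where the digest notes marked declines) and a correlation or agreement measure.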
Computer Science > Machine Learning
arXiv:2602.20629 (cs) [Submitted on 24 Feb 2026]
Title: QEDBENCH: Quantifying the Alignment Gap in Automated Evaluation of University-Level Mathematical Proofs
Authors: Santiago Gonzalez, Alireza Amiri Bavandpour, Peter Ye, Edward Zhang, Ruslans Aleksejevs, Todor Antić, Polina Baron, Sujeet Bhalerao, Shubhrajit Bhattacharya, Zachary Burton, John Byrne, Hyungjun Choi, Nujhat Ahmed Disha, Koppany István Encz, Yuchen Fang, Robert Joseph George, Ebrahim Ghorbani, Alan Goldfarb, Jing Guo, Meghal Gupta, Stefano Huber, Annika Kanckos, Minjung Kang, Hyun Jong Kim, Dino Lorenzini, Levi Lorenzo, Tianyi Mao, Giovanni Marzenta, Ariane M. Masuda, Lukas Mauth, Ana Mickovic, Andres Miniguano-Trujillo, Antoine Moulin, Wenqi Ni, Tomos Parry, Kevin Ren, Hossein Roodbarani, Mathieu Rundström, Manjil Saikia, Detchat Samart, Rebecca Steiner, Connor Stewart, Dhara Thakkar, Jeffrey Tse, Vasiliki Velona, Yunhai Xiang, Sibel Yalçın, Jun Yan, Ji Zeng, Arman Cohan, Quanquan C. Liu
Abstract: As Large Language Models (LLMs) saturate elementary benchmarks, the research frontier has shifted from generation to the reliability of automated evaluation. We demonstrate that standard "LLM-as-a-Judge" protocols suffer from a systematic Alignment Gap when applied to upper-und...