[2510.13888] Reliable Fine-Grained Evaluation of Natural Language Math Proofs
Computer Science > Computation and Language
arXiv:2510.13888 (cs)
[Submitted on 14 Oct 2025 (v1), last revised 1 Mar 2026 (this version, v2)]

Title: Reliable Fine-Grained Evaluation of Natural Language Math Proofs
Authors: Wenjie Ma, Andrei Cojocaru, Neel Kolhe, Bradley Louie, Robin Said Sharif, Haihan Zhang, Vincent Zhuang, Matei Zaharia, Sewon Min

Abstract: Recent advances in large language models (LLMs) for mathematical reasoning have largely focused on tasks with easily verifiable final answers, while generating and verifying natural language math proofs remains an open challenge. We identify the absence of a reliable, fine-grained evaluator for LLM-generated math proofs as a critical gap. To address this, we propose a systematic methodology for developing and validating evaluators that assign fine-grained scores on a 0-7 scale to model-generated math proofs. To enable this study, we introduce ProofBench, the first expert-annotated dataset of fine-grained proof ratings, spanning 145 problems from six major math competitions (USAMO, IMO, Putnam, etc.) and 435 LLM-generated solutions from Gemini-2.5-Pro, o3, and DeepSeek-R1. Using ProofBench as a testbed, we systematically explore the evaluator design space across key axes: the backbone model, input context, instructions, and evaluation workflow. Our analysis delivers Pr...
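The 0-7 fine-grained scoring described in the abstract can be illustrated with a minimal sketch of a deduction-based grader, in the style of competition-math rubrics. Everything here is hypothetical: the names (`Deduction`, `grade_proof`), the deduction magnitudes, and the aggregation rule are illustrative assumptions, not the authors' actual evaluator.

```python
from dataclasses import dataclass

MAX_SCORE = 7  # the paper's fine-grained scale runs 0-7


@dataclass
class Deduction:
    """One judge-identified flaw in a proof (hypothetical schema)."""
    reason: str
    points: int  # points subtracted from the 7-point maximum


def grade_proof(deductions: list[Deduction]) -> int:
    """Aggregate deductions into a 0-7 score.

    Assumed aggregation rule: start from the maximum score and
    subtract each deduction, clamping the result at zero.
    """
    score = MAX_SCORE - sum(d.points for d in deductions)
    return max(0, score)


# Example: a proof with one major gap (-3) and one minor slip (-1)
flaws = [
    Deduction("key lemma asserted without justification", 3),
    Deduction("arithmetic slip in final step", 1),
]
print(grade_proof(flaws))  # -> 3
```

A flawless proof (an empty deduction list) scores the full 7, while deductions totaling more than 7 clamp to 0 rather than going negative.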