[2602.00564] Unmasking Reasoning Processes: A Process-aware Benchmark for Evaluating Structural Mathematical Reasoning in LLMs
Summary
This paper introduces ReasoningMath-Plus, a benchmark designed to evaluate structural mathematical reasoning in large language models (LLMs), addressing the concern that near-saturated scores on existing benchmarks overstate their true reasoning competence.
Why It Matters
As LLMs approach saturation accuracy on established benchmarks, answer-only metrics risk overestimating their reasoning abilities. This study provides a more nuanced evaluation framework that scores the reasoning process itself rather than only the final answer, which is crucial for diagnosing genuine mathematical competence and for advancing AI's application in mathematics.
Key Takeaways
- Existing benchmarks may not accurately reflect LLMs' reasoning capabilities.
- ReasoningMath-Plus introduces 150 problems focusing on structural reasoning.
- The HCRS scoring method reveals lower reasoning robustness than answer-only metrics suggest.
- The study highlights the importance of evaluating reasoning processes in AI.
- A Process Reward Model (PRM) is developed to assess reasoning traces step by step; a minimal scoring sketch follows this list.
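To make the distinction between answer-only and process-aware evaluation concrete, here is a minimal sketch in Python. The function names, the min-aggregation rule, and the per-step scores are illustrative assumptions, not the paper's actual HCRS formula or PRM implementation.

```python
from typing import List

def answer_only_score(predicted: str, gold: str) -> float:
    """Answer-only metric: full credit if the final answer matches,
    regardless of how the model got there."""
    return 1.0 if predicted.strip() == gold.strip() else 0.0

def process_aware_score(step_scores: List[float]) -> float:
    """Aggregate hypothetical per-step PRM scores into one trace score.

    Taking the minimum is one common convention: a single flawed step
    caps the credit the whole reasoning trace can earn.
    """
    return min(step_scores) if step_scores else 0.0

# A trace that reaches the right answer through a shaky middle step.
step_scores = [0.95, 0.30, 0.90]            # hypothetical PRM outputs
print(answer_only_score("42", "42"))        # 1.0 -- looks perfect
print(process_aware_score(step_scores))     # 0.3 -- process reveals the flaw
```

Under this kind of aggregation, a model can score perfectly on answer-only accuracy while its process score stays low, which is exactly the gap a process-aware benchmark is designed to expose.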
Computer Science > Artificial Intelligence
arXiv:2602.00564 (cs)
[Submitted on 31 Jan 2026 (v1), last revised 26 Feb 2026 (this version, v2)]
Title: Unmasking Reasoning Processes: A Process-aware Benchmark for Evaluating Structural Mathematical Reasoning in LLMs
Authors: Xiang Zheng, Weiqi Zhai, Wei Wang, Boyu Yang, Wenbo Li, Ruixiang Luo, Haoxiang Sun, Yucheng Wang, Zhengze Li, Meng Wang, Yuetian Du, Guojie Lin, Yaxuan Wang, Xiaoxiao Xu, Yanhu Mo, Xuan Ren, Hu Wei, Bing Zhao
Abstract: Recent large language models (LLMs) achieve near-saturation accuracy on many established mathematical reasoning benchmarks, raising concerns about their ability to diagnose genuine reasoning competence. This saturation largely stems from the dominance of template-based computation and shallow arithmetic decomposition in existing datasets, which underrepresent reasoning skills such as multi-constraint coordination, constructive logical synthesis, and spatial inference. To address this gap, we introduce ReasoningMath-Plus, a benchmark of 150 carefully curated problems explicitly designed to evaluate structural reasoning. Each problem emphasizes reasoning under interacting constraints, constructive solution formation, or non-trivial structural insight, and is annotated with a minimal reason...