[2602.21265] ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning
Summary
ToolMATH introduces a benchmark for evaluating tool-augmented language models in realistic multi-tool environments, focusing on long-horizon reasoning and error accumulation.
Why It Matters
As AI systems increasingly rely on multi-tool reasoning, ToolMATH provides a structured way to assess their reliability and effectiveness. This benchmark addresses critical failure modes, helping researchers and developers enhance model robustness and improve decision-making processes in complex scenarios.
Key Takeaways
- ToolMATH evaluates language models in multi-tool environments.
- The benchmark reveals that reasoning errors accumulate and affect outcomes.
- Redundant tool lists can amplify small deviations into larger errors.
- Distractor tools may serve as partial substitutes but can mislead models.
- Long-range planning is crucial for effective tool use and decision-making.
Computer Science > Computation and Language
arXiv:2602.21265 (cs) [Submitted on 24 Feb 2026]
Title: ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning
Authors: Hyeonje Choi, Jeongsoo Lee, Hyojun Lee, Jay-Yoon Lee
Abstract: We introduce ToolMATH, a math-grounded benchmark that evaluates tool-augmented language models in realistic multi-tool environments where the output depends on calling schema-specified tools and sustaining multi-step execution. It turns math problems into a controlled, correctness-checkable benchmark with tool sets, enabling systematic evaluation of model reliability under (1) large, overlapping tool catalogs and (2) the absence of the intended capability. ToolMATH provides actionable diagnostic evidence of failure modes in tool-augmented agents, helping identify the control mechanisms required for robustness. ToolMATH contains roughly 8k questions and 12k tools; we also provide an additional hard set, ToolMATH-Hard, with its own questions and tools. Our evaluation reveals that the key failure factor is the inability to reason over long horizons: errors in intermediate results accumulate and constrain later decisions. Tool-list redundancy does not simply add noise; it amplifies small early deviations into irreversible execution drift. The benchmark highlights that...
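The abstract's core mechanics (schema-specified tool calls, multi-step execution, distractor tools, and error accumulation) can be illustrated with a toy episode. This is a minimal sketch, not the paper's actual harness: the tool names, the plan format, and the toy problem below are all illustrative assumptions.

```python
# Toy sketch (NOT the ToolMATH harness) of a correctness-checkable
# multi-tool episode: an agent chains tool calls, and a single wrong
# tool choice (a distractor) propagates into a wrong final answer.

TOOLS = {
    # intended tools
    "add":      lambda a, b: a + b,
    "multiply": lambda a, b: a * b,
    # distractor: a partial substitute for "multiply" that only
    # happens to be correct when b == 2
    "repeat_add": lambda a, b: a + a,
}

def run_episode(plan, inputs):
    """Execute a plan: a list of (tool_name, arg_refs) steps.

    arg_refs index into `state`, which starts as the problem inputs;
    each step appends its result, so later steps consume earlier
    outputs -- this is exactly where intermediate errors accumulate.
    """
    state = list(inputs)
    for tool_name, arg_refs in plan:
        args = [state[i] for i in arg_refs]
        state.append(TOOLS[tool_name](*args))
    return state[-1]

# Toy problem: compute (2 + 3) * 4 = 20.
inputs = [2, 3, 4]
correct_plan = [("add", (0, 1)), ("multiply", (3, 2))]
# The agent picks the distractor at step 2: step 1 is still correct,
# but the final answer drifts and cannot be recovered downstream.
drifted_plan = [("add", (0, 1)), ("repeat_add", (3, 2))]

print(run_episode(correct_plan, inputs))   # 20
print(run_episode(drifted_plan, inputs))   # 10
```

Because every episode ends in a single checkable number, correctness can be scored automatically, which is the property the abstract describes as "controlled, correctness-checkable". Growing `TOOLS` with more overlapping entries mimics the large, redundant catalogs the benchmark varies.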