[2601.07148] Measuring Iterative Temporal Reasoning with Time Puzzles
arXiv:2601.07148 (cs)
[Submitted on 12 Jan 2026 (v1), last revised 23 Mar 2026 (this version, v3)]

Title: Measuring Iterative Temporal Reasoning with Time Puzzles
Authors: Zhengxiang Wang, Zeyu Dong

Abstract: Tool use, such as web search, has become a standard capability even in freely available large language models (LLMs). However, existing benchmarks evaluate temporal reasoning mainly in static, non-tool-using settings, which poorly reflect how LLMs perform temporal reasoning in practice. We introduce Time Puzzles, a constraint-based date inference task for evaluating iterative temporal reasoning with tools. Each puzzle combines factual temporal anchors with (cross-cultural) calendar relations and may admit one or multiple valid dates. The puzzles are algorithmically generated, enabling controlled and continual evaluation. Across 13 LLMs, even the best model (GPT-5) achieves only 55.3% accuracy without tools, despite using easily searchable facts. While web search improves performance, models perform substantially better when constraints are rewritten with explicit dates, removing the need for factual lookup. These results reveal a gap in reliable tool use for iterative temporal reasoning.

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
C...
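
To make the task format concrete, here is a minimal sketch of a constraint-based date puzzle solved by brute-force enumeration. It is an illustration only, not the authors' generator or evaluation code: the names `Constraint`, `candidate_dates`, and `solve`, the example puzzle, and the search window are all assumptions. The factual anchor (the Apollo 11 Moon landing, 1969) is written with its explicit year, which corresponds to the "explicit dates" setting the abstract describes; in the original setting a model would have to look that year up.

```python
from datetime import date, timedelta
from typing import Callable, Iterator, List

# A constraint is any predicate over a calendar date (hypothetical type alias).
Constraint = Callable[[date], bool]


def candidate_dates(start: date, end: date) -> Iterator[date]:
    """Yield every date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def solve(constraints: List[Constraint], start: date, end: date) -> List[date]:
    """Return all dates in [start, end] that satisfy every constraint."""
    return [d for d in candidate_dates(start, end)
            if all(check(d) for check in constraints)]


# Hypothetical puzzle, with the factual anchor already resolved to a year:
#   1. The date is in the year of the Apollo 11 Moon landing (1969).
#   2. The date is in July.
#   3. The date falls on a Sunday (date.weekday() == 6).
puzzle: List[Constraint] = [
    lambda d: d.year == 1969,
    lambda d: d.month == 7,
    lambda d: d.weekday() == 6,
]

solutions = solve(puzzle, date(1960, 1, 1), date(1979, 12, 31))
print(solutions)  # four Sundays: 6, 13, 20 and 27 July 1969

# Adding one more constraint narrows the puzzle to a single valid date.
unique = solve(puzzle + [lambda d: 15 < d.day < 25],
               date(1960, 1, 1), date(1979, 12, 31))
print(unique)  # [datetime.date(1969, 7, 20)]
```

The first puzzle admits several valid dates and the second exactly one, mirroring the abstract's note that a puzzle "may admit one or multiple valid dates". Exhaustive enumeration is chosen here only because it is easy to verify over a small window.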