[2602.13272] TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks
Summary
TemporalBench is a benchmark for evaluating LLM-based agents on time series tasks, with a focus on contextual and event-informed reasoning across multiple domains.
Why It Matters
This benchmark addresses a gap in evaluating the temporal reasoning capabilities of LLMs, highlighting the need for models that not only forecast accurately but also understand and adapt to contextual changes in real-world scenarios. It provides a structured way to assess these abilities, which is crucial for applications in fields like healthcare and energy.
Key Takeaways
- TemporalBench evaluates LLMs on contextual and event-informed tasks.
- The benchmark uses a four-tier taxonomy for comprehensive assessment.
- Strong numerical accuracy does not guarantee robust contextual reasoning.
- Models exhibit fragmented strengths and hidden failure modes.
- The dataset and leaderboard are publicly available for further research.
Computer Science > Artificial Intelligence
arXiv:2602.13272 (cs) [Submitted on 5 Feb 2026]
Title: TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks
Authors: Muyan Weng, Defu Cao, Wei Yang, Yashaswi Sharma, Yan Liu
Abstract: It is unclear whether strong forecasting performance reflects genuine temporal understanding or the ability to reason under contextual and event-driven conditions. We introduce TemporalBench, a multi-domain benchmark designed to evaluate temporal reasoning behavior under progressively richer informational settings. TemporalBench adopts a four-tier task taxonomy that examines historical structure interpretation, context-free forecasting, contextual temporal reasoning, and event-conditioned prediction across four real-world domains: retail, healthcare, energy, and physical systems. By controlling access to future targets and contextual information, the benchmark enables a diagnostic analysis of whether models can correctly interpret temporal patterns, align them with external context, and adapt predictions when conditions change. Extensive baseline experiments show that strong numerical forecasting accuracy does not reliably translate into robust contextual or event-aware temporal reasoning; instea...