[2604.04220] TimeSeek: Temporal Reliability of Agentic Forecasters
arXiv:2604.04220 (cs)
[Submitted on 5 Apr 2026]

Title: TimeSeek: Temporal Reliability of Agentic Forecasters
Authors: Hamza Mostafa, Om Shastri, Dennis Lee

Abstract: We introduce TimeSeek, a benchmark for studying how the reliability of agentic LLM forecasters changes over a prediction market's lifecycle. We evaluate 10 frontier models on 150 CFTC-regulated Kalshi binary markets at five temporal checkpoints, with and without web search, for 15,000 forecasts total. Models are most competitive early in a market's life and on high-uncertainty markets, but much less competitive near resolution and on strong-consensus markets. Web search improves pooled Brier Skill Score (BSS) for every model overall, yet hurts in 12% of model-checkpoint pairs, indicating that retrieval is helpful on average but not uniformly so. Simple two-model ensembles reduce error without surpassing the market overall. These descriptive results motivate time-aware evaluation and selective-deference policies rather than a single market snapshot or a uniform tool-use setting.

Subjects: Artificial Intelligence (cs.AI)
Cite as: arXiv:2604.04220 [cs.AI] (or arXiv:2604.04220v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2604.04220
arXiv-issued DOI via DataCite (pending registra...
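The abstract reports results as a Brier Skill Score (BSS) without defining it on this page. As a rough illustration only, the sketch below uses the conventional definition, BSS = 1 - BS_model / BS_reference, and assumes the reference forecast is the market price at the same checkpoint; the paper's exact pooling and baseline may differ. The function names and the toy probabilities are hypothetical and are not taken from the paper. The last line only checks the abstract's own arithmetic for the forecast count.

```python
from statistics import fmean


def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and equal in length")
    return fmean((p - y) ** 2 for p, y in zip(probs, outcomes))


def brier_skill_score(model_probs, reference_probs, outcomes):
    """Conventional BSS: 1 - BS_model / BS_reference.

    Positive values mean the model beats the reference forecast,
    zero means parity, negative values mean it does worse.
    ASSUMPTION: the reference is the market price; the paper may define it differently.
    """
    bs_ref = brier_score(reference_probs, outcomes)
    if bs_ref == 0:
        raise ValueError("reference Brier score is zero; BSS is undefined")
    return 1.0 - brier_score(model_probs, outcomes) / bs_ref


# Toy data (hypothetical, for illustration only): four resolved binary markets.
outcomes = [1, 0, 1, 0]
market = [0.8, 0.2, 0.6, 0.4]   # assumed market-implied probabilities
model = [0.7, 0.3, 0.7, 0.3]    # assumed model forecasts

print("market Brier score:", brier_score(market, outcomes))
print("model BSS vs market:", brier_skill_score(model, market, outcomes))

# Forecast count stated in the abstract:
# 10 models x 150 markets x 5 checkpoints x 2 settings (with / without search).
n_forecasts = 10 * 150 * 5 * 2
print("total forecasts:", n_forecasts)
```

Under this convention, "without surpassing the market overall" would correspond to a pooled BSS at or below zero against the market baseline.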