[2605.06869] Agentick: A Unified Benchmark for General Sequential Decision-Making Agents
Computer Science > Artificial Intelligence
arXiv:2605.06869 (cs)
[Submitted on 7 May 2026]

Title: Agentick: A Unified Benchmark for General Sequential Decision-Making Agents
Authors: Roger Creus Castanyer, Pablo Samuel Castro, Glen Berseth

Abstract: AI agent research spans a wide spectrum, from RL agents that learn from scratch to foundation model agents that leverage pre-trained knowledge, yet no unified benchmark enables fair comparison across these approaches. We present Agentick, a benchmark for sequential decision-making agents designed to evaluate RL, LLM, VLM, hybrid, and human agents on common ground and to power research on the fundamental challenges of sequential decision-making. Agentick provides 37 procedurally generated tasks across six capability categories, four difficulty levels, and five observation modalities, all exposed through a single Gymnasium-compatible interface. The benchmark ships with a Coding API, oracle reference policies for all tasks, pre-built SFT datasets, a composable agent harness, and a live leaderboard. An evaluation spanning 27 configurations and over 90,000 episodes reveals that no single approach dominates: GPT-5 mini leads overall at 0.309 oracle-normalized score while PPO dominates planning and multi-agent tasks; the reasoning harness multiplies LLM perfo...
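The abstract describes every task as exposed through a single Gymnasium-compatible interface and reports oracle-normalized scores. A minimal sketch of what that contract looks like is below; the toy environment, its observation layout, and the normalization formula are illustrative assumptions, not Agentick's actual API or the paper's stated metric definition.

```python
# Hypothetical stand-in for an Agentick task, following the Gymnasium
# reset/step signature: reset -> (obs, info), step -> 5-tuple.
import random


class ToyAgentickEnv:
    """Minimal environment with the Gymnasium-style interface."""

    def __init__(self, horizon=10, seed=None):
        self.horizon = horizon
        self._rng = random.Random(seed)
        self._t = 0

    def reset(self, seed=None):
        if seed is not None:
            self._rng.seed(seed)
        self._t = 0
        obs = {"step": self._t, "noise": self._rng.random()}
        return obs, {}  # (observation, info)

    def step(self, action):
        self._t += 1
        obs = {"step": self._t, "noise": self._rng.random()}
        reward = 1.0 if action == 0 else 0.0  # placeholder reward scheme
        terminated = self._t >= self.horizon
        truncated = False
        return obs, reward, terminated, truncated, {}


def oracle_normalized(score, random_score, oracle_score):
    """One common normalization (an assumption, not the paper's formula):
    0 = random-policy return, 1 = oracle-policy return."""
    return (score - random_score) / (oracle_score - random_score)


# Standard rollout loop against the Gymnasium-style interface.
env = ToyAgentickEnv(horizon=3, seed=0)
obs, info = env.reset(seed=0)
total, done = 0.0, False
while not done:
    obs, reward, terminated, truncated, info = env.step(0)
    total += reward
    done = terminated or truncated
print(total)  # 3.0 under the always-act-0 policy above
```

Under this normalization, a reported score of 0.309 would mean an agent closes roughly 31% of the gap between a random policy and the oracle reference policy on that task.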