[2602.18998] Benchmark Test-Time Scaling of General LLM Agents
Summary
This paper introduces General AgentBench, a benchmark for evaluating general LLM agents across various domains, revealing performance challenges in realistic settings.
Why It Matters
As LLM agents are increasingly deployed in diverse applications, understanding their performance in general-purpose scenarios is crucial. This research highlights the limitations of current evaluation methods, providing insights for future improvements in LLM capabilities.
Key Takeaways
- General AgentBench offers a unified framework for evaluating LLM agents across multiple skills.
- Performance of LLM agents significantly degrades in general-agent settings compared to domain-specific evaluations.
- Neither sequential nor parallel test-time scaling yields effective performance improvements in practice, owing to fundamental limitations such as a context ceiling in sequential scaling.
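The two scaling strategies studied in the paper can be contrasted with a minimal sketch. The agent interface, function names, and the toy agent below are illustrative assumptions, not the paper's actual implementation: sequential scaling extends a single trajectory turn by turn over a growing history, while parallel scaling samples independent trajectories and aggregates them by majority vote.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical stand-in for one model interaction; the paper's real
# agent interface is not specified here.
AgentStep = Callable[[str, List[str]], str]

def sequential_scaling(task: str, agent_step: AgentStep, max_turns: int) -> str:
    """Sequential scaling: one trajectory, iteratively extended turn by turn.
    Every turn sees the full accumulated history, which is why context
    length becomes the bottleneck at larger turn budgets."""
    history: List[str] = []
    answer = ""
    for _ in range(max_turns):
        answer = agent_step(task, history)
        history.append(answer)
    return answer

def parallel_scaling(task: str, agent_step: AgentStep, n_samples: int) -> str:
    """Parallel scaling: sample n independent trajectories from a fresh
    context and aggregate their answers by majority vote."""
    answers = [agent_step(task, []) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Deterministic toy agent for demonstration only.
def toy_agent(task: str, history: List[str]) -> str:
    return f"draft-{len(history)}"

print(sequential_scaling("demo", toy_agent, max_turns=3))   # -> draft-2
print(parallel_scaling("demo", toy_agent, n_samples=5))     # -> draft-0
```

A real evaluation would replace `toy_agent` with an actual LLM call and a task-specific answer-equivalence check for the vote; the control flow above only shows how the two scaling axes differ.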
Computer Science > Artificial Intelligence
arXiv:2602.18998 (cs)
[Submitted on 22 Feb 2026]
Title: Benchmark Test-Time Scaling of General LLM Agents
Authors: Xiaochuan Li, Ryan Ming, Pranav Setlur, Abhijay Paladugu, Andy Tang, Hao Kang, Shuai Shao, Rong Jin, Chenyan Xiong
Abstract: LLM agents are increasingly expected to function as general-purpose systems capable of resolving open-ended user requests. While existing benchmarks focus on domain-aware environments for developing specialized agents, evaluating general-purpose agents requires more realistic settings that challenge them to operate across multiple skills and tools within a unified environment. We introduce General AgentBench, a benchmark that provides such a unified framework for evaluating general LLM agents across search, coding, reasoning, and tool-use domains. Using General AgentBench, we systematically study test-time scaling behaviors under sequential scaling (iterative interaction) and parallel scaling (sampling multiple trajectories). Evaluation of ten leading LLM agents reveals a substantial performance degradation when moving from domain-specific evaluations to this general-agent setting. Moreover, we find that neither scaling methodology yields effective performance improvements in practice, due to two fundamental limitations: context ceiling in sequential scaling…
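The "context ceiling" the abstract identifies can be made concrete with a back-of-the-envelope sketch. The numbers below are assumptions for illustration (the paper does not publish these figures): once the accumulated turn history exceeds the model's context window, earlier turns must be dropped, so additional sequential turns stop adding usable information.

```python
# Assumed figures for illustration only, not taken from the paper.
CONTEXT_WINDOW = 8192    # model context limit, in tokens
TOKENS_PER_TURN = 1200   # average tokens added per interaction turn

def usable_turns(window: int, per_turn: int) -> int:
    """How many full turns of history fit before the window overflows."""
    return window // per_turn

def effective_history(turns: int, window: int, per_turn: int) -> int:
    """Turns actually visible to the model once the ceiling is hit."""
    return min(turns, usable_turns(window, per_turn))

for turns in (2, 6, 10):
    visible = effective_history(turns, CONTEXT_WINDOW, TOKENS_PER_TURN)
    print(f"{turns} turns requested -> {visible} visible")
# With these assumed numbers, only 6 turns fit: requesting 10 sequential
# turns yields the same visible history as requesting 6.
```

This is why, under sequential scaling, benchmark scores plateau rather than improve past a modest turn budget: the marginal turn either evicts earlier context or is never fully seen by the model.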