[2602.18998] Benchmark Test-Time Scaling of General LLM Agents


Summary

This paper introduces General AgentBench, a benchmark for evaluating general LLM agents across search, coding, reasoning, and tool-use domains, revealing substantial performance degradation in realistic, general-purpose settings.

Why It Matters

As LLM agents are increasingly deployed in diverse applications, understanding their performance in general-purpose scenarios is crucial. This research highlights the limitations of current evaluation methods, providing insights for future improvements in LLM capabilities.

Key Takeaways

  • General AgentBench offers a unified framework for evaluating LLM agents across multiple skills.
  • Performance of LLM agents significantly degrades in general-agent settings compared to domain-specific evaluations.
  • Neither sequential nor parallel test-time scaling effectively improves performance, owing to fundamental limitations such as the context ceiling in sequential scaling.

Computer Science > Artificial Intelligence

arXiv:2602.18998 (cs) [Submitted on 22 Feb 2026]

Title: Benchmark Test-Time Scaling of General LLM Agents
Authors: Xiaochuan Li, Ryan Ming, Pranav Setlur, Abhijay Paladugu, Andy Tang, Hao Kang, Shuai Shao, Rong Jin, Chenyan Xiong

Abstract: LLM agents are increasingly expected to function as general-purpose systems capable of resolving open-ended user requests. While existing benchmarks focus on domain-aware environments for developing specialized agents, evaluating general-purpose agents requires more realistic settings that challenge them to operate across multiple skills and tools within a unified environment. We introduce General AgentBench, a benchmark that provides such a unified framework for evaluating general LLM agents across search, coding, reasoning, and tool-use domains. Using General AgentBench, we systematically study test-time scaling behaviors under sequential scaling (iterative interaction) and parallel scaling (sampling multiple trajectories). Evaluation of ten leading LLM agents reveals a substantial performance degradation when moving from domain-specific evaluations to this general-agent setting. Moreover, we find that neither scaling methodology yields effective performance improvements in practice, due to two fundamental limitations: context ceiling in sequential scaling...
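The two test-time scaling strategies the paper contrasts can be illustrated with a minimal sketch. This is not the paper's implementation: `run_agent` is a hypothetical stand-in for a real LLM agent call, and the step counts and voting scheme are illustrative assumptions. Sequential scaling feeds the prior trajectory back into each new step (where a growing history eventually hits the context ceiling), while parallel scaling samples independent trajectories and aggregates them, here by majority vote.

```python
import random
from collections import Counter

def run_agent(task, history=(), seed=None):
    """Hypothetical stand-in for an LLM agent call.

    A real agent would condition on the task plus the interaction
    history; here we just return a pseudo-random candidate answer.
    """
    rng = random.Random(seed)
    return f"answer-{rng.randint(0, 2)}"

def sequential_scaling(task, steps=4):
    """Sequential scaling: iterative interaction.

    Each step sees the full prior trajectory, so the usable context
    window bounds how far this can scale (the 'context ceiling').
    """
    history = []
    for step in range(steps):
        history.append(run_agent(task, tuple(history), seed=step))
    return history[-1]  # final refined answer

def parallel_scaling(task, n=8):
    """Parallel scaling: sample n independent trajectories,
    then aggregate by majority vote over the final answers."""
    answers = [run_agent(task, seed=i) for i in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

In this toy form, the trade-off is already visible: the sequential loop pays a growing context cost per step, while the parallel loop pays n full trajectories up front and relies on the aggregation rule to recover a good answer.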

