[2602.16485] Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling
Summary
The paper introduces 'Team-of-Thoughts', a multi-agent system (MAS) architecture that improves task performance by coordinating heterogeneous agents through an orchestrator-tool paradigm, dynamically activating the best-suited agents at inference time.
Why It Matters
This research addresses limitations in existing multi-agent systems by enabling dynamic activation of agents based on their expertise, which can significantly improve performance in reasoning and code generation tasks. The findings have implications for developing more efficient AI systems that can adapt to varying tasks and environments.
Key Takeaways
- Team-of-Thoughts architecture utilizes heterogeneous agents to enhance task performance.
- An orchestrator calibration scheme identifies models with superior coordination capabilities.
- Self-assessment protocols allow agents to profile their domain expertise.
- The approach consistently outperforms static, homogeneous configurations across five reasoning and code generation benchmarks.
- It achieves 96.67% accuracy on AIME24 and 72.53% on LiveCodeBench.
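The core dispatch idea described above can be illustrated with a minimal sketch: each tool agent carries a self-assessed proficiency profile over task domains, and the orchestrator activates the agent with the highest reported proficiency for the task at hand. The class names, profile values, and `dispatch` interface below are invented for illustration; the paper does not specify this API.

```python
from dataclasses import dataclass

@dataclass
class ToolAgent:
    name: str
    # Self-assessed proficiency per domain (illustrative values,
    # standing in for the paper's self-assessment protocol).
    profile: dict[str, float]

    def run(self, task: str) -> str:
        # Stand-in for an actual model call.
        return f"{self.name} handled: {task}"

class Orchestrator:
    def __init__(self, agents: list[ToolAgent]):
        self.agents = agents

    def dispatch(self, task: str, domain: str) -> str:
        # Dynamically activate the tool agent whose self-reported
        # proficiency in this domain is highest.
        best = max(self.agents, key=lambda a: a.profile.get(domain, 0.0))
        return best.run(task)

agents = [
    ToolAgent("math-specialist", {"math": 0.90, "code": 0.40}),
    ToolAgent("code-specialist", {"math": 0.50, "code": 0.95}),
]
orch = Orchestrator(agents)
print(orch.dispatch("Solve the competition problem", "math"))
```

In the full system, the profiles would come from the self-assessment protocol and the orchestrator itself would be a model selected by the calibration scheme, rather than a fixed argmax rule.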
Computer Science > Computation and Language — arXiv:2602.16485 (cs), submitted on 18 Feb 2026
Authors: Jeffrey T. H. Wong, Zixi Zhang, Junyi Liu, Yiren Zhao
Abstract: Existing Multi-Agent Systems (MAS) typically rely on static, homogeneous model configurations, limiting their ability to exploit the distinct strengths of differently post-trained models. To address this, we introduce Team-of-Thoughts, a novel MAS architecture that leverages the complementary capabilities of heterogeneous agents via an orchestrator-tool paradigm. Our framework introduces two key mechanisms to optimize performance: (1) an orchestrator calibration scheme that identifies models with superior coordination capabilities, and (2) a self-assessment protocol where tool agents profile their own domain expertise to account for variations in post-training skills. During inference, the orchestrator dynamically activates the most suitable tool agents based on these proficiency profiles. Experiments on five reasoning and code generation benchmarks show that Team-of-Thoughts delivers consistently superior task performance. Notably, on AIME24 and LiveCodeBench, our approach achieves accuracies of 96.67% and 72.53%.