[2602.16246] Toward Scalable Verifiable Reward: Proxy State-Based Evaluation for Multi-turn Tool-Calling LLM Agents
Summary
This paper presents a Proxy State-Based Evaluation framework for assessing multi-turn tool-calling LLM agents, offering a scalable alternative to traditional benchmarks.
Why It Matters
As LLM agents become more prevalent in production, reliable evaluation methods are crucial for their development. This framework addresses the main limitation of benchmarks built on fully deterministic backends, namely their cost to build and iterate, enabling more efficient assessment of agent performance in realistic scenarios.
Key Takeaways
- Proposes a new evaluation framework for LLM agents that avoids deterministic backends.
- Demonstrates high agreement rates between human judges and automated evaluations.
- Offers a scalable approach that adapts to varied user scenarios while keeping simulator hallucination rates low.
Computer Science > Artificial Intelligence
arXiv:2602.16246 (cs) [Submitted on 18 Feb 2026]
Title: Toward Scalable Verifiable Reward: Proxy State-Based Evaluation for Multi-turn Tool-Calling LLM Agents
Authors: Yun-Shiuan Chuang, Chaitanya Kulkarni, Alec Chiu, Avinash Thangali, Zijie Pan, Shivani Shekhar, Yirou Ge, Yixi Li, Uma Kona, Linsey Pang, Prakhar Mehrotra
Abstract: Interactive large language model (LLM) agents operating via multi-turn dialogue and multi-step tool calling are increasingly used in production. Benchmarks for these agents must both reliably compare models and yield on-policy training data. Prior agentic benchmarks (e.g., tau-bench, tau2-bench, AppWorld) rely on fully deterministic backends, which are costly to build and iterate. We propose Proxy State-Based Evaluation, an LLM-driven simulation framework that preserves final state-based evaluation without a deterministic database. Specifically, a scenario specifies the user goal, user/system facts, expected final state, and expected agent behavior, and an LLM state tracker infers a structured proxy state from the full interaction trace. LLM judges then verify goal completion and detect tool/user hallucinations against scenario constraints. Empirically, our benchmark produces stable, model-differentiating...
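The pipeline the abstract describes (scenario spec, proxy-state inference over the trace, judging against scenario constraints) can be sketched as follows. This is a minimal illustration only: the class names, fields, and functions (`Scenario`, `infer_proxy_state`, `judge`) are hypothetical, and the paper's actual state tracker and judges are LLMs prompted over the interaction trace, not the deterministic stand-ins used here.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    # Hypothetical scenario spec mirroring the paper's four ingredients:
    # user goal, user/system facts, expected final state, expected behavior.
    user_goal: str
    facts: dict
    expected_final_state: dict
    expected_behavior: list

def infer_proxy_state(trace: list[dict]) -> dict:
    """Stand-in for the LLM state tracker: fold the tool-call results in
    the interaction trace into a structured proxy state. A real tracker
    would be an LLM inferring this state from the full trace."""
    state: dict = {}
    for step in trace:
        if step.get("type") == "tool_call":
            state.update(step.get("result", {}))
    return state

def judge(scenario: Scenario, trace: list[dict]) -> dict:
    """Stand-in for the LLM judges: check goal completion by comparing
    the proxy state against the expected final state, and flag tool calls
    whose arguments reference values absent from the scenario facts
    (a crude proxy for tool hallucination)."""
    proxy = infer_proxy_state(trace)
    goal_met = all(
        proxy.get(k) == v for k, v in scenario.expected_final_state.items()
    )
    known_values = set(scenario.facts.values())
    hallucinated = [
        step for step in trace
        if step.get("type") == "tool_call"
        and any(v not in known_values for v in step.get("args", {}).values())
    ]
    return {"goal_completed": goal_met, "hallucinated_calls": len(hallucinated)}

# Example: one scenario, one compliant trace.
scenario = Scenario(
    user_goal="cancel the order",
    facts={"order_id": "A1"},
    expected_final_state={"order_status": "cancelled"},
    expected_behavior=["confirm with user before cancelling"],
)
trace = [{
    "type": "tool_call",
    "name": "cancel_order",
    "args": {"order_id": "A1"},
    "result": {"order_status": "cancelled"},
}]
verdict = judge(scenario, trace)
```

The point of the sketch is the evaluation contract, not the implementation: the final state is reconstructed from the trace rather than read from a deterministic database, which is what lets the benchmark scale without building one.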