[2603.04737] Interactive Benchmarks
arXiv:2603.04737 (cs) [Submitted on 5 Mar 2026]

Computer Science > Artificial Intelligence

Title: Interactive Benchmarks
Authors: Baoqing Yue, Zihan Zhu, Yifan Zhang, Jichen Feng, Hufei Yang, Mengdi Wang

Abstract: Standard benchmarks have become increasingly unreliable due to saturation, subjectivity, and poor generalization. We argue that evaluating a model's ability to acquire information actively is essential to assessing its intelligence. We propose Interactive Benchmarks, a unified evaluation paradigm that assesses a model's reasoning ability in an interactive process under budget constraints. We instantiate this framework in two settings: Interactive Proofs, where models interact with a judge to deduce objective truths or answers in logic and mathematics; and Interactive Games, where models reason strategically to maximize long-horizon utilities. Our results show that interactive benchmarks provide a robust and faithful assessment of model intelligence, revealing substantial room for improvement in interactive scenarios. Project page: this https URL

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as: arXiv:2603.04737 [cs.AI] (or arXiv:2603.04737v1 [cs.AI] for this version)
DOI: https://doi.org/10.48550/arXiv.2603.04737 (arXiv-issued DOI via DataCite)
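The abstract describes evaluation as an interactive process in which a model queries a judge under a budget constraint before committing to an answer. The following is a minimal sketch of such a loop, not the paper's implementation: the `model`, `judge`, and `grade` callables and the "ANSWER:" convention are hypothetical placeholders assumed for illustration.

```python
# Minimal sketch (assumed, not from the paper) of an interactive evaluation
# episode under a query budget: the model may ask the judge a limited number
# of questions, then must commit to a final answer, which is graded.

from typing import Callable

def run_interactive_episode(
    model: Callable[[list[tuple[str, str]]], str],  # dialogue history -> next query or "ANSWER: ..."
    judge: Callable[[str], str],                    # query -> judge's reply
    grade: Callable[[str], bool],                   # final answer -> correct or not
    budget: int = 10,                               # maximum number of judge interactions
) -> bool:
    """Return True if the model answers correctly within the interaction budget."""
    history: list[tuple[str, str]] = []
    for _ in range(budget):
        move = model(history)
        if move.startswith("ANSWER:"):
            # The model commits early; remaining budget is unused.
            return grade(move.removeprefix("ANSWER:").strip())
        reply = judge(move)            # spend one unit of budget on a query
        history.append((move, reply))
    return False                       # budget exhausted without a committed answer
```

Under this framing, the score reflects both whether the model reaches the correct answer and whether it acquires the needed information within the allotted interaction budget.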