[2602.15112] ResearchGym: Evaluating Language Model Agents on Real-World AI Research
Summary
ResearchGym introduces a benchmark for evaluating AI agents in real-world research scenarios, revealing significant performance gaps and challenges in reliability.
Why It Matters
This research is crucial as it highlights the limitations of current AI agents in performing complex research tasks, emphasizing the need for better evaluation frameworks. By identifying specific failure modes, it provides insights that can guide future improvements in AI capabilities and reliability.
Key Takeaways
- ResearchGym benchmarks AI agents against real-world research tasks.
- AI agents showed a sharp capability-reliability gap: a GPT-5 agent improved on the provided baselines in only 1 of 15 evaluations (6.7%).
- Identified failure modes include impatience and poor resource management.
- Agents occasionally produce strong results but do so inconsistently, completing only 26.5% of sub-tasks on average.
- The study provides a framework for systematic evaluation of AI in research contexts.
Paper Details
Subject: Computer Science > Artificial Intelligence (cs.AI)
Submitted: 16 February 2026
Authors: Aniketh Garikaparthi, Manasi Patwardhan, Arman Cohan
Abstract: We introduce ResearchGym, a benchmark and execution environment for evaluating AI agents on end-to-end research. To instantiate this, we repurpose five oral and spotlight papers from ICML, ICLR, and ACL. From each paper's repository, we preserve the datasets, evaluation harness, and baseline implementations but withhold the paper's proposed method. This results in five containerized task environments comprising 39 sub-tasks in total. Within each environment, agents must propose novel hypotheses, run experiments, and attempt to surpass strong human baselines on the paper's metrics. In a controlled evaluation of an agent powered by GPT-5, we observe a sharp capability-reliability gap. The agent improves over the provided baselines from the repository in just 1 of 15 evaluations (6.7%) by 11.5%, and completes only 26.5% of sub-tasks on average. We identify recurring long-horizon failure modes, including impatience, poor time and resource management, overconfidence in weak hypotheses, difficulty coordinating parallel experiments, and hard limits from context length. ...
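The headline numbers in the abstract are simple aggregates over per-evaluation records. A minimal sketch of how such metrics could be computed, using a hypothetical record layout (the field names are illustrative, not ResearchGym's actual schema):

```python
# Aggregate metrics over a set of agent evaluations.
# Record layout is hypothetical, not ResearchGym's actual schema.

def success_rate(runs):
    """Fraction of evaluations in which the agent beat the repo baseline."""
    return sum(r["beat_baseline"] for r in runs) / len(runs)

def mean_completion(runs):
    """Average fraction of sub-tasks completed per evaluation."""
    return sum(r["subtasks_done"] / r["subtasks_total"] for r in runs) / len(runs)

# 15 evaluations, exactly one of which improved on the baseline,
# matching the paper's reported 1 of 15 (6.7%). Sub-task counts are made up.
runs = [{"beat_baseline": i == 0, "subtasks_done": 2, "subtasks_total": 8}
        for i in range(15)]

print(f"{success_rate(runs):.1%}")  # → 6.7%
```

This also makes the reliability point concrete: a single strong run barely moves the success rate when the other fourteen fail to beat the baseline.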