[2506.02548] CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale
Computer Science > Cryptography and Security
arXiv:2506.02548 (cs)
[Submitted on 3 Jun 2025 (v1), last revised 24 Mar 2026 (this version, v3)]

Title: CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale
Authors: Zhun Wang, Tianneng Shi, Jingxuan He, Matthew Cai, Jialin Zhang, Dawn Song

Abstract: AI agents have significant potential to reshape cybersecurity, making a thorough assessment of their capabilities critical. However, existing evaluations fall short, because they are based on small-scale benchmarks and only measure static outcomes, failing to capture the full, dynamic range of real-world security challenges. To address these limitations, we introduce CyberGym, a large-scale benchmark featuring 1,507 real-world vulnerabilities across 188 software projects. Adjustable to different vulnerability analysis settings, CyberGym primarily tasks agents with generating a proof-of-concept test that reproduces a vulnerability, given only its text description and the corresponding codebase. Our extensive evaluation highlights that CyberGym effectively differentiates agents' and models' cybersecurity capabilities. Even the top-performing combinations only achieve a ~20% success rate, demonstrating the overall difficulty of CyberGym. Beyond static benchmarking, we show that CyberGym leads to the discovery of...