[2602.05523] Capture the Flags: Family-Based Evaluation of Agentic LLMs via Semantics-Preserving Transformations
Computer Science > Software Engineering
arXiv:2602.05523 (cs)
[Submitted on 5 Feb 2026 (v1), last revised 17 Apr 2026 (this version, v2)]

Title: Capture the Flags: Family-Based Evaluation of Agentic LLMs via Semantics-Preserving Transformations

Authors: Shahin Honarvar, Amber Gorzynski, James Lee-Jones, Harry Coppock, Marek Rei, Joseph Ryan, Alastair F. Donaldson

Abstract: Agentic large language models (LLMs) are increasingly evaluated on cybersecurity tasks using capture-the-flag (CTF) benchmarks, yet existing pointwise benchmarks offer limited insight into agent robustness and generalisation across alternative versions of the source code. We introduce CTF challenge families, whereby a single CTF is used to generate a family of semantically-equivalent challenges via semantics-preserving program transformations, enabling controlled evaluation of robustness while keeping the underlying exploit strategy fixed. We present Evolve-CTF, a tool that generates CTF families from Python challenges using a range of transformations. Using Evolve-CTF to derive families from Cybench and Intercode challenges, we evaluate 13 agentic LLM configurations with tool access. We find that models are remarkably robust to renaming and code insertion, but that composed transformations and deeper obfuscati...
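
To make the idea of a semantics-preserving transformation concrete, the following is a minimal Python sketch of the two transformation kinds the abstract names, renaming and code insertion, applied to a toy challenge. It is not Evolve-CTF's implementation, which the abstract does not describe. The names `RenameIdentifiers`, `insert_dead_code`, `transform`, `run`, and the toy `ORIGINAL` program are all hypothetical and chosen only for illustration. The sketch assumes Python 3.9 or later, where `ast.unparse` is available.

```python
import ast


class RenameIdentifiers(ast.NodeTransformer):
    """Rename identifiers according to a mapping (hypothetical helper).

    Handles variable names, function parameters, and function names only.
    Keyword arguments at call sites and attribute names are left untouched,
    so the mapping must not include names used in those positions.
    """

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

    def visit_FunctionDef(self, node):
        node.name = self.mapping.get(node.name, node.name)
        self.generic_visit(node)  # also rename inside parameters and body
        return node


def insert_dead_code(tree):
    """Prepend a statement that does not affect the values the program computes."""
    dead = ast.parse("_unused = sum(range(3))").body
    tree.body = dead + tree.body
    return tree


def transform(source, mapping):
    """Compose renaming and code insertion; return the new source text."""
    tree = ast.parse(source)
    tree = RenameIdentifiers(mapping).visit(tree)
    tree = insert_dead_code(tree)
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


def run(source):
    """Execute source in a fresh namespace and return that namespace."""
    env = {}
    exec(source, env)
    return env


# Toy stand-in for a challenge script (an assumption, not from any benchmark).
ORIGINAL = '''
def check(guess, key):
    return "".join(chr(ord(c) ^ key) for c in guess)

result = check("abc", 3)
'''

variant = transform(
    ORIGINAL, {"check": "f0", "guess": "a0", "key": "a1", "c": "v0"}
)

# The variant looks different but computes the same value.
assert run(ORIGINAL)["result"] == run(variant)["result"]
```

Applying `transform` with different mappings, or composing it with further transformations, yields several variants of one program, which is the sense in which a single challenge can give rise to a family. Comparing the result on one input, as the final assertion does, is only a sanity check and does not prove equivalence in general.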