[2602.05523] Capture the Flags: Family-Based Evaluation of Agentic LLMs via Semantics-Preserving Transformations

arXiv - AI April 20, 2026 4 min read

Computer Science > Software Engineering
arXiv:2602.05523 (cs)
[Submitted on 5 Feb 2026 (v1), last revised 17 Apr 2026 (this version, v2)]

Title: Capture the Flags: Family-Based Evaluation of Agentic LLMs via Semantics-Preserving Transformations

Authors: Shahin Honarvar, Amber Gorzynski, James Lee-Jones, Harry Coppock, Marek Rei, Joseph Ryan, Alastair F. Donaldson

Abstract: Agentic large language models (LLMs) are increasingly evaluated on cybersecurity tasks using capture-the-flag (CTF) benchmarks, yet existing pointwise benchmarks offer limited insight into agent robustness and generalisation across alternative versions of the source code. We introduce CTF challenge families, whereby a single CTF is used to generate a family of semantically-equivalent challenges via semantics-preserving program transformations, enabling controlled evaluation of robustness while keeping the underlying exploit strategy fixed. We present Evolve-CTF, a tool that generates CTF families from Python challenges using a range of transformations. Using Evolve-CTF to derive families from Cybench and Intercode challenges, we evaluate 13 agentic LLM configurations with tool access. We find that models are remarkably robust to renaming and code insertion, but that composed transformations and deeper obfuscati...
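To make the idea of a semantics-preserving transformation concrete, here is a minimal sketch of the two transformation kinds the abstract names, renaming and code insertion, applied to a toy Python function with the standard-library `ast` module. This is not Evolve-CTF's implementation: the toy `check` function, the `RenameIdentifiers` and `InsertDeadCode` classes, the name mapping, and the `_unused` variable are all hypothetical and chosen only for illustration. It assumes Python 3.9 or later, since it relies on `ast.unparse`.

```python
import ast

# Hypothetical toy "challenge" code; not taken from Cybench or Intercode.
SOURCE = '''
def check(password):
    key = 7
    total = 0
    for ch in password:
        total = (total * 31 + ord(ch) + key) % 1000003
    return total
'''


class RenameIdentifiers(ast.NodeTransformer):
    """Rename selected identifiers according to a fixed mapping."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Handles both reads and writes of plain variable names.
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        # Handles function parameters.
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


class InsertDeadCode(ast.NodeTransformer):
    """Insert an assignment to an unused variable at the top of each function."""

    def visit_FunctionDef(self, node):
        self.generic_visit(node)
        # Assumes the function has no docstring and no variable named _unused.
        dead = ast.parse("_unused = 0").body[0]
        node.body.insert(0, dead)
        return node


def transform(source, mapping):
    """Compose renaming and code insertion, then return the new source text."""
    tree = ast.parse(source)
    tree = RenameIdentifiers(mapping).visit(tree)
    tree = InsertDeadCode().visit(tree)
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


def load_check(source):
    """Execute the source in a fresh namespace and return its check function."""
    namespace = {}
    exec(source, namespace)
    return namespace["check"]


# Hypothetical mapping; only names local to check are renamed.
MAPPING = {"password": "v0", "key": "v1", "total": "v2", "ch": "v3"}

variant = transform(SOURCE, MAPPING)
original_fn = load_check(SOURCE)
variant_fn = load_check(variant)

# The variant differs textually but must agree with the original on every input.
for sample in ["", "a", "flag{demo}"]:
    assert original_fn(sample) == variant_fn(sample)
```

Each distinct mapping or inserted statement yields another textually different variant with the same behaviour, which is the sense in which one challenge can seed a family. The equality check at the end is only a spot test on a few inputs, not a proof of equivalence.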

Originally published on April 20, 2026. Curated by AI News.

