[2602.05523] Capture the Flags: Family-Based Evaluation of Agentic LLMs via Semantics-Preserving Transformations



Computer Science > Software Engineering

arXiv:2602.05523 (cs) [Submitted on 5 Feb 2026 (v1), last revised 17 Apr 2026 (this version, v2)]

Title: Capture the Flags: Family-Based Evaluation of Agentic LLMs via Semantics-Preserving Transformations

Authors: Shahin Honarvar, Amber Gorzynski, James Lee-Jones, Harry Coppock, Marek Rei, Joseph Ryan, Alastair F. Donaldson

Abstract: Agentic large language models (LLMs) are increasingly evaluated on cybersecurity tasks using capture-the-flag (CTF) benchmarks, yet existing pointwise benchmarks offer limited insight into agent robustness and generalisation across alternative versions of the source code. We introduce CTF challenge families, whereby a single CTF is used to generate a family of semantically-equivalent challenges via semantics-preserving program transformations, enabling controlled evaluation of robustness while keeping the underlying exploit strategy fixed. We present Evolve-CTF, a tool that generates CTF families from Python challenges using a range of transformations. Using Evolve-CTF to derive families from Cybench and Intercode challenges, we evaluate 13 agentic LLM configurations with tool access. We find that models are remarkably robust to renaming and code insertion, but that composed transformations and deeper obfuscation...
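To make the idea concrete: a semantics-preserving transformation rewrites a challenge's source so that its observable behaviour (and hence the exploit strategy) is unchanged while its surface form differs. The sketch below is not Evolve-CTF itself, only a minimal illustration of one such transformation (identifier renaming) using Python's standard `ast` module; the `check`/`secret` example program and the `v0` replacement name are invented for illustration.

```python
import ast

class RenameIdentifiers(ast.NodeTransformer):
    """Rename variables per a mapping, leaving program behaviour unchanged."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Rewrite both definitions and uses of mapped names.
        if node.id in self.mapping:
            node.id = self.mapping[node.id]
        return node

# A toy CTF-style checker; the variable name "secret" is a strong hint
# that renaming can remove without changing semantics.
source = """
def check(flag):
    secret = "ctf{example}"
    return flag == secret
"""

tree = ast.parse(source)
tree = RenameIdentifiers({"secret": "v0"}).visit(tree)
print(ast.unparse(tree))  # same behaviour, less descriptive surface form
```

Composing several such passes (renaming, dead-code insertion, control-flow restructuring) over one seed challenge yields a family of semantically-equivalent variants, which is the setting the paper evaluates agents on.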

Originally published on April 20, 2026. Curated by AI News.

