[2603.28013] Kill-Chain Canaries: Stage-Level Tracking of Prompt

[2603.28013] Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers

arXiv - AI March 31, 2026 4 min read

About this article

Abstract page for arXiv paper 2603.28013: Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers

Computer Science > Cryptography and Security arXiv:2603.28013 (cs) [Submitted on 30 Mar 2026] Title:Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers Authors:Haochuan Kevin Wang View a PDF of the paper titled Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers, by Haochuan Kevin Wang View PDF HTML (experimental) Abstract:We present a stage-decomposed analysis of prompt injection attacks against five frontier LLM agents. Prior work measures task-level attack success rate (ASR); we localize the pipeline stage at which each model's defense activates. We instrument every run with a cryptographic canary token (SECRET-[A-F0-9]{8}) tracked through four kill-chain stages -- Exposed, Persisted, Relayed, Executed -- across four attack surfaces and five defense conditions (764 total runs, 428 no-defense attacked). Our central finding is that model safety is determined not by whether adversarial content is seen, but by whether it is propagated across pipeline stages. Concretely: (1) in our evaluation, exposure is 100% for all five models -- the safety gap is entirely downstream; (2) Claude strips injections at write_memory summarization (0/164 ASR), while GPT-4o-mini propagates canaries without loss (53% ASR, 95% CI: 41--65%); (3) DeepSeek exhibits 0% ASR on memory surfaces and 100% ASR on tool-stream surfaces from the same model -- a complete reversal across injection cha...

Originally published on March 31, 2026. Curated by AI News.

Llms

[2604.01473] SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits

Abstract page for arXiv paper 2604.01473: SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits

arXiv - AI · 4 min · about 1 hour ago

Llms

[2603.23682] Assessment Design in the AI Era: A Method for Identifying Items Functioning Differentially for Humans and Chatbots

Abstract page for arXiv paper 2603.23682: Assessment Design in the AI Era: A Method for Identifying Items Functioning Differentially for ...

arXiv - AI · 4 min · about 1 hour ago

Llms

[2601.07422] Two Pathways to Truthfulness: On the Intrinsic Encoding of LLM Hallucinations

Abstract page for arXiv paper 2601.07422: Two Pathways to Truthfulness: On the Intrinsic Encoding of LLM Hallucinations

arXiv - AI · 4 min · about 1 hour ago

Llms

[2603.08486] Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images

Abstract page for arXiv paper 2603.08486: Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images

arXiv - AI · 3 min · about 1 hour ago

[2603.28013] Kill-Chain Canaries: Stage-Level Tracking of Prompt Injection Across Attack Surfaces and Model Safety Tiers

About this article

Related Articles

[2604.01473] SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits

[2603.23682] Assessment Design in the AI Era: A Method for Identifying Items Functioning Differentially for Humans and Chatbots

[2601.07422] Two Pathways to Truthfulness: On the Intrinsic Encoding of LLM Hallucinations

[2603.08486] Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images

No comments

Stay updated with AI News