[2602.11247] Peak + Accumulation: A Proxy-Level Scoring Formula for Multi-Turn LLM Attack Detection
Summary
The paper presents a novel scoring formula, Peak + Accumulation, for detecting multi-turn LLM attack patterns at the proxy layer, addressing a key limitation of existing methods: per-turn scores aggregated by a weighted average converge to the per-turn score regardless of turn count, so persistent multi-turn attacks score no higher than a single suspicious turn.
Why It Matters
As multi-turn prompt injection attacks become more sophisticated, effective detection methods are crucial for maintaining the integrity of AI systems. This research offers an approach that scores entire conversations at the proxy layer, without invoking an LLM, improving recall on attacks distributed across turns while keeping the false positive rate low. That combination is vital for developers and security professionals working with large language models (LLMs).
Key Takeaways
- Introduces a new scoring formula for multi-turn LLM attack detection.
- Achieves 90.8% recall with a low false positive rate of 1.20%.
- Includes a sensitivity analysis of the scoring formula's parameters.
- Releases the scoring algorithm and evaluation tools as open source.
- Highlights why single-turn detection and naive weighted-average aggregation fail against attacks distributed across many turns.
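The headline metrics are mutually consistent with the dataset sizes given in the abstract (588 attacks, 10,066 benign conversations), which can be verified with a few lines of arithmetic:

```python
# Sanity check: from 90.8% recall and a 1.20% false positive rate on the
# reported dataset, the implied precision and F1 should match the paper's
# stated F1 of 85.9%.

attacks, benign = 588, 10_066
recall, fpr = 0.908, 0.0120

tp = recall * attacks          # ~534 attacks caught
fp = fpr * benign              # ~121 benign conversations flagged
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision = {precision:.1%}, F1 = {f1:.1%}")  # F1 = 85.9%
```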
Computer Science > Cryptography and Security
arXiv:2602.11247 (cs), submitted on 11 Feb 2026
Authors: J Alex Corll
Abstract
Multi-turn prompt injection attacks distribute malicious intent across multiple conversation turns, exploiting the assumption that each turn is evaluated independently. While single-turn detection has been extensively studied, no published formula exists for aggregating per-turn pattern scores into a conversation-level risk score at the proxy layer -- without invoking an LLM. We identify a fundamental flaw in the intuitive weighted-average approach: it converges to the per-turn score regardless of turn count, meaning a 20-turn persistent attack scores identically to a single suspicious turn. Drawing on analogies from change-point detection (CUSUM), Bayesian belief updating, and security risk-based alerting, we propose peak + accumulation scoring -- a formula combining peak single-turn risk, persistence ratio, and category diversity. Evaluated on 10,654 multi-turn conversations -- 588 attacks sourced from WildJailbreak adversarial prompts and 10,066 benign conversations from WildChat -- the formula achieves 90.8% recall at a 1.20% false positive rate with an F1 of 85.9%. A sensitivity analysis over the...
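The abstract names the formula's ingredients (peak single-turn risk, a persistence term, category diversity) but not its weights or normalizations, so the sketch below is illustrative: `w_peak`, `w_persist`, `w_div`, the suspicion `threshold`, the persistence `cap`, and the category names are all assumptions, not the published formula. The weighted-average baseline, by contrast, exhibits exactly the flaw the abstract describes.

```python
# Hedged sketch of the two aggregation strategies. Coefficients, threshold,
# cap, and category labels are illustrative assumptions; only the qualitative
# behavior (baseline flaw vs. peak + accumulation) comes from the abstract.

def weighted_average(turn_scores):
    # Flawed baseline: converges to the per-turn score regardless of turn
    # count, so 20 turns at 0.6 score the same as one turn at 0.6.
    return sum(turn_scores) / len(turn_scores)

def peak_plus_accumulation(turn_scores, categories,
                           threshold=0.3, cap=10,
                           w_peak=0.6, w_persist=0.25, w_div=0.15):
    # Peak: worst single-turn risk in the conversation.
    peak = max(turn_scores)
    # Persistence: capped count of suspicious turns, normalized to [0, 1]
    # (the paper's exact "persistence ratio" definition may differ).
    suspicious = sum(s > threshold for s in turn_scores)
    persistence = min(suspicious, cap) / cap
    # Diversity: distinct attack-pattern categories seen, capped at 5.
    diversity = min(len(set(categories)), 5) / 5
    return w_peak * peak + w_persist * persistence + w_div * diversity

single = [0.6]            # one suspicious turn
persistent = [0.6] * 20   # the same intent pressed over 20 turns

# Baseline scores both conversations identically (0.6 each).
print(weighted_average(single), weighted_average(persistent))

# Peak + accumulation ranks the persistent attack strictly higher.
s1 = peak_plus_accumulation(single, ["role_play"])
s2 = peak_plus_accumulation(persistent,
                            ["role_play", "obfuscation", "payload_split"])
print(s1 < s2)  # True
```

The key design point is that persistence and diversity are additive bonuses on top of the peak, so a conversation can never score lower than its worst turn warrants, while sustained or varied probing pushes the conversation-level score upward.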