[2602.14161] When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift
Summary
This paper evaluates the effectiveness of malicious prompt classifiers under true distribution shifts, revealing significant performance overestimations in current benchmarks.
Why It Matters
As LLMs increasingly interact with untrusted data, understanding the limitations of current prompt classifiers is crucial for enhancing AI safety and reliability. This study proposes a new evaluation framework that could lead to better detection of prompt injection attacks, which is vital for secure AI deployment.
Key Takeaways
- Current evaluation methods overestimate classifier performance by up to 8.4 percentage points.
- The proposed Leave-One-Dataset-Out (LODO) evaluation reveals heterogeneous failure modes in classifiers.
- 28% of top features in classifiers are dataset-dependent shortcuts, undermining generalization.
- Existing guardrails for prompt attack detection show low effectiveness, with detection rates between 7-37%.
- LODO-stable features offer more reliable explanations for classifier decisions by filtering out dataset artifacts.
Computer Science > Machine Learning arXiv:2602.14161 (cs) [Submitted on 15 Feb 2026] Title:When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift Authors:Max Fomin View a PDF of the paper titled When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift, by Max Fomin View PDF HTML (experimental) Abstract:Detecting prompt injection and jailbreak attacks is critical for deploying LLM-based agents safely. As agents increasingly process untrusted data from emails, documents, tool outputs, and external APIs, robust attack detection becomes essential. Yet current evaluation practices and production systems have fundamental limitations. We present a comprehensive analysis using a diverse benchmark of 18 datasets spanning harmful requests, jailbreaks, indirect prompt injections, and extraction attacks. We propose Leave-One-Dataset-Out (LODO) evaluation to measure true out-of-distribution generalization, revealing that the standard practice of train-test splits from the same dataset sources severely overestimates performance: aggregate metrics show an 8.4 percentage point AUC inflation, but per-dataset gaps range from 1% to 25% accuracy-exposing heterogeneous failure modes. To understand why classifiers fail to generalize, we analyze Sparse Auto-Encoder (SAE) feature coefficients across LODO folds, finding that 28% of top features are dataset-dependent shortcuts whose class signal depends on specific dataset compos...