Llms Ai Infrastructure Data Science Ai Startups Ai Safety Machine Learning Nlp

[2602.14161] When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift

arXiv - Machine Learning February 17, 2026 4 min read Article

Summary

This paper evaluates the effectiveness of malicious prompt classifiers under true distribution shifts, revealing significant performance overestimations in current benchmarks.

Why It Matters

As LLMs increasingly interact with untrusted data, understanding the limitations of current prompt classifiers is crucial for enhancing AI safety and reliability. This study proposes a new evaluation framework that could lead to better detection of prompt injection attacks, which is vital for secure AI deployment.

Key Takeaways

Current evaluation methods overestimate classifier performance by up to 8.4 percentage points.
The proposed Leave-One-Dataset-Out (LODO) evaluation reveals heterogeneous failure modes in classifiers.
28% of top features in classifiers are dataset-dependent shortcuts, undermining generalization.
Existing guardrails for prompt attack detection show low effectiveness, with detection rates between 7-37%.
LODO-stable features offer more reliable explanations for classifier decisions by filtering out dataset artifacts.

Computer Science > Machine Learning arXiv:2602.14161 (cs) [Submitted on 15 Feb 2026] Title:When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift Authors:Max Fomin View a PDF of the paper titled When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift, by Max Fomin View PDF HTML (experimental) Abstract:Detecting prompt injection and jailbreak attacks is critical for deploying LLM-based agents safely. As agents increasingly process untrusted data from emails, documents, tool outputs, and external APIs, robust attack detection becomes essential. Yet current evaluation practices and production systems have fundamental limitations. We present a comprehensive analysis using a diverse benchmark of 18 datasets spanning harmful requests, jailbreaks, indirect prompt injections, and extraction attacks. We propose Leave-One-Dataset-Out (LODO) evaluation to measure true out-of-distribution generalization, revealing that the standard practice of train-test splits from the same dataset sources severely overestimates performance: aggregate metrics show an 8.4 percentage point AUC inflation, but per-dataset gaps range from 1% to 25% accuracy-exposing heterogeneous failure modes. To understand why classifiers fail to generalize, we analyze Sparse Auto-Encoder (SAE) feature coefficients across LODO folds, finding that 28% of top features are dataset-dependent shortcuts whose class signal depends on specific dataset compos...

Read Original Article

[2602.14161] When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift

Summary

Why It Matters

Key Takeaways

Related Articles

Kept hitting ChatGPT and Claude limits during real work. This is the free setup I ended up using

Is ChatGPT changing the way we think too much already?

Will people continue paying for the plans after the honeymoon is over?

Nvidia goes all-in on AI agents while Anthropic pulls the plug

No comments

Stay updated with AI News