[2602.14161] When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift

arXiv - Machine Learning · 4 min read

Summary

This paper evaluates the effectiveness of malicious prompt classifiers under true distribution shifts, revealing significant performance overestimations in current benchmarks.

Why It Matters

As LLMs increasingly interact with untrusted data, understanding the limitations of current prompt classifiers is crucial for enhancing AI safety and reliability. This study proposes a new evaluation framework that could lead to better detection of prompt injection attacks, which is vital for secure AI deployment.

Key Takeaways

  • Standard same-source train-test splits inflate aggregate AUC by 8.4 percentage points relative to true out-of-distribution performance.
  • The proposed Leave-One-Dataset-Out (LODO) evaluation reveals heterogeneous failure modes in classifiers.
  • 28% of top features in classifiers are dataset-dependent shortcuts, undermining generalization.
  • Existing guardrails for prompt attack detection show low effectiveness, with detection rates between 7% and 37%.
  • LODO-stable features offer more reliable explanations for classifier decisions by filtering out dataset artifacts.

Computer Science > Machine Learning · arXiv:2602.14161 (cs) · Submitted on 15 Feb 2026

Title: When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift

Authors: Max Fomin

Abstract: Detecting prompt injection and jailbreak attacks is critical for deploying LLM-based agents safely. As agents increasingly process untrusted data from emails, documents, tool outputs, and external APIs, robust attack detection becomes essential. Yet current evaluation practices and production systems have fundamental limitations. We present a comprehensive analysis using a diverse benchmark of 18 datasets spanning harmful requests, jailbreaks, indirect prompt injections, and extraction attacks. We propose Leave-One-Dataset-Out (LODO) evaluation to measure true out-of-distribution generalization, revealing that the standard practice of train-test splits from the same dataset sources severely overestimates performance: aggregate metrics show an 8.4 percentage point AUC inflation, but per-dataset gaps range from 1% to 25% accuracy, exposing heterogeneous failure modes. To understand why classifiers fail to generalize, we analyze Sparse Auto-Encoder (SAE) feature coefficients across LODO folds, finding that 28% of top features are dataset-dependent shortcuts whose class signal depends on specific dataset compos...
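The LODO protocol described in the abstract can be sketched in a few lines: for each dataset in the benchmark, train on the union of all the others and evaluate on the held-out one. The toy keyword-based classifier and the two miniature datasets below are illustrative stand-ins, not the paper's actual models or benchmark data; only the evaluation loop reflects the LODO idea.

```python
# Hedged sketch of Leave-One-Dataset-Out (LODO) evaluation.
# The "classifier" and datasets are hypothetical toy examples.

def train(examples):
    # Trivial model: tokens that appear only in malicious prompts.
    malicious_tokens, benign_tokens = set(), set()
    for text, label in examples:
        (malicious_tokens if label else benign_tokens).update(text.lower().split())
    return malicious_tokens - benign_tokens

def predict(model, text):
    # Flag a prompt as malicious if it contains any learned token.
    return int(any(tok in model for tok in text.lower().split()))

def accuracy(model, examples):
    return sum(predict(model, t) == y for t, y in examples) / len(examples)

def lodo(datasets):
    # For each held-out dataset, train on all the others and measure
    # out-of-distribution accuracy on the one never seen in training.
    results = {}
    for held_out, test_set in datasets.items():
        train_set = [ex for name, exs in datasets.items()
                     if name != held_out for ex in exs]
        results[held_out] = accuracy(train(train_set), test_set)
    return results

datasets = {
    "jailbreaks": [("ignore previous instructions", 1),
                   ("what is the weather", 0)],
    "injections": [("system override please comply", 1),
                   ("summarize this email", 0)],
}
print(lodo(datasets))
```

Comparing these per-dataset LODO scores against an ordinary same-source train-test split is what exposes the generalization gap the paper reports: a classifier can look strong in-distribution while its learned features are dataset-specific shortcuts.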

