[2602.13576] Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges


Summary

The paper identifies a vulnerability in LLM-as-judge evaluation pipelines, termed Rubric-Induced Preference Drift (RIPD): seemingly benign rubric edits that pass benchmark validation can still systematically bias a judge's preferences and reduce its accuracy on target domains.

Why It Matters

As LLMs are increasingly used in decision-making, understanding the risks associated with their evaluation rubrics is crucial. This research highlights a significant vulnerability that can affect model performance and reliability, raising concerns about the integrity of AI systems.

Key Takeaways

  • Rubric-Induced Preference Drift (RIPD) can occur even with validated rubric edits.
  • RIPD can lead to systematic biases in LLM judgments, affecting downstream applications.
  • The study shows how seemingly benign rubric changes can reduce a judge's target-domain accuracy by up to 9.5% (helpfulness) and 27.9% (harmlessness).
  • Rubrics serve as a manipulable control interface, posing alignment risks.
  • Understanding these vulnerabilities is essential for improving AI evaluation methods.

Computer Science > Cryptography and Security
arXiv:2602.13576 (cs) [Submitted on 14 Feb 2026]

Title: Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges
Authors: Ruomeng Ding, Yifei Pang, He Sun, Yizhong Wang, Zhiwei Steven Wu, Zhun Deng

Abstract: Evaluation and alignment pipelines for large language models increasingly rely on LLM-based judges, whose behavior is guided by natural-language rubrics and validated on benchmarks. We identify a previously under-recognized vulnerability in this workflow, which we term Rubric-Induced Preference Drift (RIPD). Even when rubric edits pass benchmark validation, they can still produce systematic and directional shifts in a judge's preferences on target domains. Because rubrics serve as a high-level decision interface, such drift can emerge from seemingly natural, criterion-preserving edits and remain difficult to detect through aggregate benchmark metrics or limited spot-checking. We further show this vulnerability can be exploited through rubric-based preference attacks, in which benchmark-compliant rubric edits steer judgments away from a fixed human or trusted reference on target domains, systematically inducing RIPD and reducing target-domain accuracy by up to 9.5% (helpfulness) and 27.9% (harmlessness). When these judgments are used to generate preferenc...
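The drift described in the abstract can be made concrete with a toy measurement: compare a judge's agreement with a trusted reference on a validation benchmark versus a target domain, before and after a rubric edit. The sketch below is entirely synthetic and hypothetical; the stub judges, the `bias_on_target` parameter, and the datasets are stand-ins for illustration, not the paper's actual attack or evaluation setup.

```python
import random

# Hypothetical sketch of measuring Rubric-Induced Preference Drift (RIPD).
# A "judge" maps an item to a preference label; the edited-rubric judge
# flips away from the reference only on target-domain items, so aggregate
# benchmark metrics stay clean while target-domain accuracy degrades.

def make_judge(bias_on_target: float, seed: int = 0):
    """Return a stub judge. `bias_on_target` is the probability that the
    judge disagrees with the trusted reference on target-domain items."""
    rng = random.Random(seed)

    def judge(item):
        if item["domain"] == "target" and rng.random() < bias_on_target:
            return 1 - item["reference"]  # directional flip on target domain
        return item["reference"]

    return judge

def agreement(judge, items):
    """Fraction of items where the judge matches the trusted reference."""
    return sum(judge(it) == it["reference"] for it in items) / len(items)

# Synthetic validation benchmark and target-domain evaluation sets.
benchmark = [{"domain": "benchmark", "reference": i % 2} for i in range(200)]
target    = [{"domain": "target",    "reference": i % 2} for i in range(200)]

baseline = make_judge(bias_on_target=0.0)           # original rubric
edited   = make_judge(bias_on_target=0.25, seed=1)  # "benchmark-compliant" edit

# The edited rubric still passes benchmark validation...
print(f"benchmark agreement: {agreement(baseline, benchmark):.2f} -> "
      f"{agreement(edited, benchmark):.2f}")

# ...but preferences drift directionally on the target domain.
drift = agreement(baseline, target) - agreement(edited, target)
print(f"target-domain drift: {drift:.2f}")
```

The point of the sketch is the detection gap the paper highlights: any audit that only checks the first metric (benchmark agreement) will see no change, while the second metric (agreement on the attacker's target domain) reveals the drift.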

