[2603.03323] Discern Truth from Falsehood: Reducing Over-Refusal via Contrastive Refinement
Computer Science > Computation and Language
arXiv:2603.03323 (cs)
[Submitted on 10 Feb 2026]

Title: Discern Truth from Falsehood: Reducing Over-Refusal via Contrastive Refinement
Authors: Yuxiao Lu, Lin Xu, Yang Sun, Wenjun Li, Jie Shi

Abstract: Large language models (LLMs) aligned for safety often suffer from over-refusal: the tendency to reject seemingly toxic but benign prompts by misclassifying them as toxic. This behavior undermines a model's helpfulness and restricts its usability in sensitive or nuanced contexts. While prior work has proposed mitigation strategies such as data augmentation and activation steering, these approaches often face a trade-off: reducing over-refusal typically degrades the model's ability to reject genuinely harmful content. We argue that this issue arises from the ambiguous influence of toxic and seemingly toxic prompts on the model's learning dynamics. To address it, we introduce a preceding alignment stage, DCR: Discernment via Contrastive Refinement. Both theoretically and empirically, we demonstrate that contrastive refinement improves an LLM's capacity to distinguish truly toxic prompts from superficially toxic ones. Evaluation across diverse benchmarks shows that our method effectively reduces over-refusal while preserving the safety benefits of alignment. Importantly, it a...
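The abstract does not spell out the DCR objective, so the following is only a minimal sketch of what a contrastive refinement loss over prompt representations could look like: it pulls embeddings of same-class prompts together and pushes genuinely toxic prompts away from superficially toxic (benign) ones. The encoder, the margin-based loss form, and all names and hyperparameters below are illustrative assumptions, not the paper's method.

```python
# Illustrative sketch (assumed, not from the paper): a margin-based contrastive
# objective separating embeddings of truly toxic prompts from embeddings of
# benign prompts that merely look toxic.
import torch
import torch.nn.functional as F


def contrastive_discernment_loss(toxic_emb, pseudo_toxic_emb, margin=0.5):
    """toxic_emb:        (N, d) embeddings of genuinely toxic prompts
    pseudo_toxic_emb: (N, d) embeddings of benign, superficially toxic prompts
    """
    toxic = F.normalize(toxic_emb, dim=-1)
    pseudo = F.normalize(pseudo_toxic_emb, dim=-1)

    # Cosine similarity between every toxic / pseudo-toxic pair.
    cross_sim = toxic @ pseudo.T  # (N, N)

    # Within-class similarities, excluding self-pairs on the diagonal.
    n = toxic.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=toxic.device)
    toxic_sim = toxic @ toxic.T
    pseudo_sim = pseudo @ pseudo.T

    # Pull same-class prompts together ...
    pull = (1.0 - toxic_sim[~eye]).mean() + (1.0 - pseudo_sim[~eye]).mean()
    # ... and push the two classes apart by at least the margin.
    push = F.relu(cross_sim + margin).mean()
    return pull + push


if __name__ == "__main__":
    torch.manual_seed(0)
    toxic = torch.randn(8, 768)   # stand-ins for prompt embeddings
    pseudo = torch.randn(8, 768)
    print(contrastive_discernment_loss(toxic, pseudo).item())
```

In this sketch, the pull/push structure is what would give the model a sharper decision boundary between the two prompt classes; how DCR actually refines the representations before alignment is detailed in the full paper, not here.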