[2602.15853] A Lightweight Explainable Guardrail for Prompt Safety


Summary

The paper presents a Lightweight Explainable Guardrail (LEG) method for classifying unsafe prompts in AI systems, utilizing a multi-task learning architecture to enhance both classification and explainability.

Why It Matters

As AI systems increasingly interact through natural-language prompts, ensuring prompt safety is critical. This research introduces an approach that balances classification performance and explainability, addressing key concerns in AI safety and usability.

Key Takeaways

  • LEG method improves classification of unsafe prompts using a multi-task learning approach.
  • The model is smaller yet performs comparably to state-of-the-art methods.
  • Synthetic data generation counteracts biases in large language models (LLMs).
  • A novel loss function enhances explainability and classification accuracy.
  • All models and datasets will be publicly released if accepted.
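The multi-task setup in the takeaways above pairs a prompt-level safe/unsafe classifier with a per-token head that marks the words explaining the decision. The paper does not detail its architecture here, so the following is only an illustrative NumPy sketch of a shared encoder feeding two heads; all weight names and dimensions are assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multitask_forward(token_embs, W_shared, W_prompt, W_token):
    """Toy multi-task guardrail forward pass (hypothetical shapes).

    token_embs: (T, d) embeddings for one prompt's T tokens.
    Returns prompt_probs over {safe, unsafe} and, per token,
    token_probs over {not-explanatory, explanatory}.
    """
    h = np.tanh(token_embs @ W_shared)           # shared encoder layer
    pooled = h.mean(axis=0)                      # prompt-level representation
    prompt_probs = softmax(pooled @ W_prompt)    # safe/unsafe head
    token_probs = softmax(h @ W_token, axis=-1)  # per-token explanation head
    return prompt_probs, token_probs
```

Because both heads share the encoder, gradients from the explanation labels can sharpen the features used for the safety decision, which is the usual motivation for jointly training classification and explanation.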

Computer Science > Computation and Language

arXiv:2602.15853 (cs.CL) · Submitted on 24 Jan 2026

Title: A Lightweight Explainable Guardrail for Prompt Safety
Authors: Md Asiful Islam, Mihai Surdeanu

Abstract: We propose a lightweight explainable guardrail (LEG) method for the classification of unsafe prompts. LEG uses a multi-task learning architecture to jointly learn a prompt classifier and an explanation classifier, where the latter labels the prompt words that explain the overall safe/unsafe decision. LEG is trained using synthetic data for explainability, which is generated using a novel strategy that counteracts the confirmation biases of LLMs. Lastly, LEG's training process uses a novel loss that captures global explanation signals and combines cross-entropy and focal losses with uncertainty-based weighting. LEG obtains equivalent or better performance than the state of the art for both prompt classification and explainability, in-domain and out-of-domain on three datasets, despite a model size considerably smaller than current approaches. If accepted, we will release all models and the annotated dataset publicly.

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2602.15853 [cs.CL] (this version: arXiv:2602.15853v1) · https://doi.org/10.48550/arXiv.2602.15853
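The abstract combines cross-entropy and focal losses with uncertainty-based weighting, but does not spell out the formula. A plausible sketch, assuming homoscedastic task-uncertainty weighting in the style of Kendall et al. (learned log-variances s1, s2, with L = exp(-s1)·L_cls + s1 + exp(-s2)·L_expl + s2), would be:

```python
import numpy as np

def cross_entropy(p, y):
    # p: probability vector for one prompt; y: true class index
    return -np.log(p[y] + 1e-12)

def focal(p, y, gamma=2.0):
    # Focal loss down-weights easy examples via the (1 - p_t)^gamma factor
    pt = p[y]
    return -((1.0 - pt) ** gamma) * np.log(pt + 1e-12)

def uncertainty_weighted_loss(prompt_probs, prompt_label,
                              token_probs, token_labels,
                              log_var_cls, log_var_expl, gamma=2.0):
    """Hypothetical combined loss: CE on the prompt decision plus a
    mean focal loss over per-token explanation labels, each scaled by
    a learned uncertainty term exp(-s) with regularizer +s."""
    l_cls = cross_entropy(prompt_probs, prompt_label)
    l_expl = np.mean([focal(tp, ty, gamma)
                      for tp, ty in zip(token_probs, token_labels)])
    return (np.exp(-log_var_cls) * l_cls + log_var_cls
            + np.exp(-log_var_expl) * l_expl + log_var_expl)
```

The uncertainty terms let training balance the two tasks automatically instead of hand-tuning a fixed mixing weight; the "global explanation signals" mentioned in the abstract are not reproduced here, since their form is not described in this summary.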
