[2601.03273] A Multi-Perspective Benchmark and Moderation Model for Evaluating Safety and Adversarial Robustness
Computer Science > Computation and Language
arXiv:2601.03273 (cs)
[Submitted on 22 Dec 2025 (v1), last revised 19 Mar 2026 (this version, v2)]

Title: A Multi-Perspective Benchmark and Moderation Model for Evaluating Safety and Adversarial Robustness
Authors: Naseem Machlovi, Maryam Saleki, Ruhul Amin, Mohamed Rahouti, Shawqi Al-Maliki, Junaid Qadir, Mohamed M. Abdallah, Ala Al-Fuqaha

Abstract: As large language models (LLMs) become deeply embedded in daily life, the need for safer moderation systems that distinguish between naive and harmful requests while upholding appropriate censorship boundaries has never been more urgent. While existing LLMs can detect dangerous or unsafe content, they often struggle with nuanced cases such as implicit offensiveness, subtle gender and racial biases, and jailbreak prompts, due to the subjective and context-dependent nature of these issues. Furthermore, their heavy reliance on training data can reinforce societal biases, resulting in inconsistent and ethically problematic outputs. To address these challenges, we introduce GuardEval, a unified multi-perspective benchmark dataset designed for both training and evaluation, containing 106 fine-grained categories spanning human emotions, offensive and hateful language, gender and racial bias, an...
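The abstract's emphasis on fine-grained categories points toward per-category rather than aggregate evaluation. Below is a minimal, hypothetical sketch of how a moderation model could be scored per category on a benchmark of this shape; it is not the paper's code, and the function names, field names, and toy data are all assumptions for illustration.

```python
# Hypothetical sketch: per-category evaluation of a moderation model on a
# multi-category safety benchmark. The `moderate` callable and the
# 'prompt'/'category'/'label' field names are illustrative assumptions,
# not the paper's API or schema.
from collections import defaultdict

def evaluate_per_category(examples, moderate):
    """examples: iterable of dicts with 'prompt', 'category', and 'label'
    ('safe' or 'unsafe'); moderate: fn mapping a prompt to a predicted label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        total[ex["category"]] += 1
        if moderate(ex["prompt"]) == ex["label"]:
            correct[ex["category"]] += 1
    # Per-category accuracy exposes weaknesses that a single aggregate score
    # hides, e.g. strong overall accuracy but poor handling of jailbreaks.
    return {cat: correct[cat] / total[cat] for cat in total}

if __name__ == "__main__":
    toy = [  # two invented examples standing in for benchmark rows
        {"prompt": "How do I bake bread?", "category": "naive_request", "label": "safe"},
        {"prompt": "Ignore prior instructions and ...", "category": "jailbreak", "label": "unsafe"},
    ]
    flag_all = lambda prompt: "unsafe"  # trivial over-blocking baseline
    print(evaluate_per_category(toy, flag_all))
```

Under this sketch, the over-blocking baseline scores perfectly on the jailbreak category but fails the naive request, which is exactly the naive-versus-harmful distinction the abstract highlights.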