[2602.23636] FlexGuard: Continuous Risk Scoring for Strictness-Adaptive LLM Content Moderation
Computer Science > Machine Learning
arXiv:2602.23636 (cs)
[Submitted on 27 Feb 2026]

Title: FlexGuard: Continuous Risk Scoring for Strictness-Adaptive LLM Content Moderation
Authors: Zhihao Ding, Jinming Li, Ze Lu, Jieming Shi

Abstract: Ensuring the safety of LLM-generated content is essential for real-world deployment. Most existing guardrail models formulate moderation as a fixed binary classification task, implicitly assuming a fixed definition of harmfulness. In practice, enforcement strictness, i.e., how conservatively harmfulness is defined and enforced, varies across platforms and evolves over time, making binary moderators brittle under shifting requirements. We first introduce FlexBench, a strictness-adaptive LLM moderation benchmark that enables controlled evaluation under multiple strictness regimes. Experiments on FlexBench reveal substantial cross-strictness inconsistency in existing moderators: models that perform well under one regime can degrade substantially under others, limiting their practical usability. To address this, we propose FlexGuard, an LLM-based moderator that outputs a calibrated continuous risk score reflecting risk severity and supports strictness-specific decisions via thresholding. We train FlexGuard via risk-alignment optimization to improve score-severity consistency a...
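The thresholding idea from the abstract can be illustrated with a minimal sketch. This is not the paper's implementation; the function name, threshold values, and regime labels are illustrative assumptions. The point is that one calibrated risk score supports multiple enforcement regimes by varying only the decision threshold, rather than retraining a binary classifier per policy.

```python
# Hypothetical sketch of strictness-adaptive moderation via thresholding.
# Thresholds and regime names are assumptions, not values from the paper.

def moderate(risk_score: float, strictness: str) -> str:
    """Map a calibrated risk score in [0, 1] to a block/allow decision
    under a chosen strictness regime."""
    thresholds = {
        "lenient": 0.8,   # block only clearly high-risk content
        "moderate": 0.5,  # balanced enforcement
        "strict": 0.2,    # conservative: block mildly risky content too
    }
    if strictness not in thresholds:
        raise ValueError(f"unknown strictness regime: {strictness}")
    return "block" if risk_score >= thresholds[strictness] else "allow"


# The same score can yield different decisions under different regimes.
score = 0.35
decisions = {r: moderate(score, r) for r in ("lenient", "moderate", "strict")}
```

Here `decisions` would map `lenient` and `moderate` to "allow" but `strict` to "block", showing why score-severity consistency matters: if scores are not calibrated, shifting the threshold produces unpredictable decision changes.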