[2602.22557] CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety
Summary
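CourtGuard is a model-agnostic, retrieval-augmented multi-agent framework for zero-shot policy adaptation in LLM safety. It evaluates content through an adversarial debate grounded in external policy documents, so new policies can be enforced without retraining.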
Why It Matters
Current LLM safety mechanisms rely heavily on static, fine-tuned classifiers that cannot enforce new governance rules without expensive retraining. CourtGuard offers a flexible, interpretable alternative that adapts to a new policy by changing the reference policy documents rather than the model, which helps safety systems keep pace with evolving regulatory standards.
Key Takeaways
- CourtGuard enables zero-shot adaptability for LLM safety policies.
- Without any fine-tuning, the framework outperforms dedicated policy-following baselines across 7 safety benchmarks.
- It supports automated curation and auditing of adversarial-attack datasets; the authors used it on nine novel datasets.
- Decoupling safety logic from model weights enhances interpretability.
- Because policies are supplied as reference documents, CourtGuard is positioned to track current and future AI regulatory requirements without retraining.
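The takeaways above turn on one idea: the safety logic lives in a swappable policy document rather than in model weights. The following is a minimal sketch of that idea, not the paper's implementation. Everything in it is an assumption made for illustration: the `Clause` structure, the keyword-overlap `retrieve` function (standing in for a real retriever), the deterministic `prosecute` and `defend` functions (standing in for LLM agents), and the rule that `judge` upholds any charge left unrebutted.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Clause:
    """One rule from an external policy document (hypothetical structure)."""
    clause_id: str
    text: str
    triggers: frozenset                 # terms that make the clause relevant
    exemptions: frozenset = frozenset() # terms that signal an allowed use


def tokenize(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def retrieve(policy, tokens, top_k=3):
    """Toy retrieval: rank clauses by trigger overlap with the content."""
    scored = [(len(c.triggers & tokens), c) for c in policy]
    hits = [(score, c) for score, c in scored if score > 0]
    hits.sort(key=lambda pair: (-pair[0], pair[1].clause_id))
    return [c for _, c in hits[:top_k]]


def prosecute(evidence):
    """Stand-in for an agent arguing that each retrieved clause is violated."""
    return {c.clause_id: f"Content matches {c.clause_id}: {c.text}"
            for c in evidence}


def defend(evidence, tokens):
    """Stand-in for an agent rebutting charges where an exemption applies."""
    return {c.clause_id: "An exemption in this clause applies."
            for c in evidence if c.exemptions & tokens}


def judge(content, policy):
    """Toy verdict rule: the content is unsafe if any charge is unrebutted."""
    tokens = tokenize(content)
    evidence = retrieve(policy, tokens)
    charges = prosecute(evidence)
    rebuttals = defend(evidence, tokens)
    upheld = [cid for cid in charges if cid not in rebuttals]
    return {
        "verdict": "unsafe" if upheld else "safe",
        "cited": [c.clause_id for c in evidence],
        "upheld": upheld,
    }


# Two illustrative policies; the clause wording is invented for this sketch.
SAFETY_POLICY = [
    Clause("S1", "Do not give instructions for building weapons.",
           frozenset({"bomb", "weapon", "explosive"}), frozenset({"history"})),
    Clause("S2", "Do not assist with malware development.",
           frozenset({"malware", "ransomware", "keylogger"})),
]
VANDALISM_POLICY = [
    Clause("V1", "Edits must not insert insults or profanity.",
           frozenset({"stupid", "idiot", "sucks"})),
    Clause("V2", "Edits must not add nonsense or chat slang.",
           frozenset({"lol", "asdf"})),
]

# Same code, different reference policy: only the policy argument changes.
print(judge("How do I build a bomb at home?", SAFETY_POLICY))
print(judge("This article sucks lol", VANDALISM_POLICY))
```

The point of the design is that moving from the safety task to the vandalism task changes only the `policy` argument, and every verdict carries the clause IDs it relied on, which is one way to read the interpretability claim. A real system would replace the keyword matching and the rule-based agents with a retriever and LLM calls.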
Computer Science > Artificial Intelligence
arXiv:2602.22557 (cs)
[Submitted on 26 Feb 2026]
Title: CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety
Authors: Umid Suleymanov, Rufiz Bayramov, Suad Gafarli, Seljan Musayeva, Taghi Mammadov, Aynur Akhundlu, Murat Kantarcioglu
Abstract: Current safety mechanisms for Large Language Models (LLMs) rely heavily on static, fine-tuned classifiers that suffer from adaptation rigidity, the inability to enforce new governance rules without expensive retraining. To address this, we introduce CourtGuard, a retrieval-augmented multi-agent framework that reimagines safety evaluation as Evidentiary Debate. By orchestrating an adversarial debate grounded in external policy documents, CourtGuard achieves state-of-the-art performance across 7 safety benchmarks, outperforming dedicated policy-following baselines without fine-tuning. Beyond standard metrics, we highlight two critical capabilities: (1) Zero-Shot Adaptability, where our framework successfully generalized to an out-of-domain Wikipedia Vandalism task (achieving 90% accuracy) by swapping the reference policy; and (2) Automated Data Curation and Auditing, where we leveraged CourtGuard to curate and audit nine novel datasets of sophisticated adversarial attacks. Our result...