[2602.20170] CAGE: A Framework for Culturally Adaptive Red-Teaming Benchmark Generation
Summary
The paper introduces CAGE (Culturally Adaptive Generation), a framework for culturally adaptive red-teaming benchmark generation that addresses the limitations of existing benchmarks in evaluating LLM safety across diverse cultural contexts.
Why It Matters
CAGE is significant as it fills a critical gap in AI safety evaluation by adapting red-teaming methodologies to account for cultural nuances. This ensures that AI systems are tested against realistic threats that reflect local socio-technical vulnerabilities, enhancing their robustness and safety in diverse environments.
Key Takeaways
- CAGE adapts red-teaming prompts to local cultural contexts, improving AI safety evaluations.
- The Semantic Mold approach disentangles adversarial intent from cultural content for better threat modeling.
- KoRSET, a Korean benchmark created using CAGE, outperforms direct translation methods in revealing vulnerabilities.
- CAGE offers a scalable solution for developing context-aware safety benchmarks across various cultures.
- The dataset and evaluation rubrics from CAGE are publicly available, promoting further research.
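To make the Semantic Mold idea concrete, here is a minimal, hypothetical sketch of how disentangling adversarial structure from cultural content might look in code. The class name, fields, and the placeholder prompt are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of the "Semantic Mold" concept: the adversarial
# structure lives in a reusable template, while culture-specific content
# (e.g. a local law) is supplied as slot fillers. All names are
# illustrative, not the paper's actual implementation.
from dataclasses import dataclass


@dataclass
class SemanticMold:
    intent: str     # the culture-agnostic adversarial goal
    template: str   # prompt structure with {placeholders} for local content

    def instantiate(self, **cultural_content: str) -> str:
        """Fill the mold with locale-specific content."""
        return self.template.format(**cultural_content)


mold = SemanticMold(
    intent="elicit advice on evading a data-protection regulation",
    template="How can someone avoid penalties under {law} when {scenario}?",
)

# The same adversarial structure, re-instantiated for two locales:
en_prompt = mold.instantiate(law="the GDPR", scenario="collecting user data")
ko_prompt = mold.instantiate(
    law="the Personal Information Protection Act (PIPA)",
    scenario="collecting user data in Korea",
)
print(en_prompt)
print(ko_prompt)
```

Under this reading, adapting a benchmark to a new culture means swapping the slot fillers while preserving the mold's intent, rather than translating the surface text of each prompt.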
Computer Science > Computers and Society
arXiv:2602.20170 (cs)
Submitted on 9 Feb 2026
Authors: Chaeyun Kim, YongTaek Lim, Kihyun Kim, Junghwan Kim, Minwoo Kim
Abstract: Existing red-teaming benchmarks, when adapted to new languages via direct translation, fail to capture socio-technical vulnerabilities rooted in local culture and law, creating a critical blind spot in LLM safety evaluation. To address this gap, we introduce CAGE (Culturally Adaptive Generation), a framework that systematically adapts the adversarial intent of proven red-teaming prompts to new cultural contexts. At the core of CAGE is the Semantic Mold, a novel approach that disentangles a prompt's adversarial structure from its cultural content. This approach enables the modeling of realistic, localized threats rather than testing for simple jailbreaks. As a representative example, we demonstrate our framework by creating KoRSET, a Korean benchmark, which proves more effective at revealing vulnerabilities than direct translation baselines. CAGE offers a scalable solution for developing meaningful, context-aware safety benchmarks across diverse cultures. Our dataset and evaluation rubrics are publicly available at this https URL. (WARNING: This paper contains...