[2602.13234] Stay in Character, Stay Safe: Dual-Cycle Adversarial Self-Evolution for Safety Role-Playing Agents

[2602.13234] Stay in Character, Stay Safe: Dual-Cycle Adversarial Self-Evolution for Safety Role-Playing Agents

arXiv - AI 4 min read Article

Summary

The paper presents a novel framework, Dual-Cycle Adversarial Self-Evolution, aimed at enhancing the safety and fidelity of role-playing agents in AI, addressing vulnerabilities to jailbreak attacks while maintaining character integrity.

Why It Matters

As AI role-playing agents become more prevalent, ensuring their safety and adherence to persona constraints is crucial. This research provides a new approach to mitigate risks associated with jailbreak attacks, which can compromise the integrity of AI systems. The findings could significantly impact the development of safer AI applications in various domains.

Key Takeaways

  • Introduces a training-free framework for enhancing AI role-playing safety.
  • Utilizes a dual-cycle approach to balance persona fidelity and safety constraints.
  • Demonstrates improved performance against jailbreak attacks across multiple LLMs.

Computer Science > Artificial Intelligence arXiv:2602.13234 (cs) [Submitted on 29 Jan 2026] Title:Stay in Character, Stay Safe: Dual-Cycle Adversarial Self-Evolution for Safety Role-Playing Agents Authors:Mingyang Liao, Yichen Wan, shuchen wu, Chenxi Miao, Xin Shen, Weikang Li, Yang Li, Deguo Xia, Jizhou Huang View a PDF of the paper titled Stay in Character, Stay Safe: Dual-Cycle Adversarial Self-Evolution for Safety Role-Playing Agents, by Mingyang Liao and 8 other authors View PDF HTML (experimental) Abstract:LLM-based role-playing has rapidly improved in fidelity, yet stronger adherence to persona constraints commonly increases vulnerability to jailbreak attacks, especially for risky or negative personas. Most prior work mitigates this issue with training-time solutions (e.g., data curation or alignment-oriented regularization). However, these approaches are costly to maintain as personas and attack strategies evolve, can degrade in-character behavior, and are typically infeasible for frontier closed-weight LLMs. We propose a training-free Dual-Cycle Adversarial Self-Evolution framework with two coupled cycles. A Persona-Targeted Attacker Cycle synthesizes progressively stronger jailbreak prompts, while a Role-Playing Defender Cycle distills observed failures into a hierarchical knowledge base of (i) global safety rules, (ii) persona-grounded constraints, and (iii) safe in-character exemplars. At inference time, the Defender retrieves and composes structured knowledge ...

Related Articles

Llms

Claude on Claude

The Story of Anthropic’s Latest Controversies Regarding the Business of Its Prized Creation… As Told by the Thing Itself. Editor’s note: ...

Reddit - Artificial Intelligence · 1 min ·
Llms

Cut Claude usage by ~85% in a job search pipeline (16k → 900 tokens/app) — here’s what worked

Like many here, I kept running into Claude usage limits when building anything non-trivial. I was working with a job search automation pi...

Reddit - Artificial Intelligence · 1 min ·
Llms

"Authoritarian Parents In Rationalist Clothes": a piece I wrote in December about alignment

Posted today in light of the Claude Mythos model card release. Originally I wrote this for r/ControlProblem but realized it was getting o...

Reddit - Artificial Intelligence · 1 min ·
Llms

AI joins the 8-hour work day as GLM ships 5.1 open source LLM, beating Opus 4.6 and GPT-5.4 on SWE-Bench Pro

AI Tools & Products ·
More in Llms: This Week Guide Trending

No comments

No comments yet. Be the first to comment!

Stay updated with AI News

Get the latest news, tools, and insights delivered to your inbox.

Daily or weekly digest • Unsubscribe anytime