[2603.29038] Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning

arXiv - AI April 01, 2026 3 min read

Computer Science > Cryptography and Security

arXiv:2603.29038 (cs)
[Submitted on 30 Mar 2026]

Title: Trojan-Speak: Bypassing Constitutional Classifiers with No Jailbreak Tax via Adversarial Finetuning

Authors: Bilgehan Sel, Xuanli He, Alwin Peng, Ming Jin, Jerry Wei

Abstract: Fine-tuning APIs offered by major AI providers create new attack surfaces where adversaries can bypass safety measures through targeted fine-tuning. We introduce Trojan-Speak, an adversarial fine-tuning method that bypasses Anthropic's Constitutional Classifiers. Our approach uses curriculum learning combined with GRPO-based hybrid reinforcement learning to teach models a communication protocol that evades LLM-based content classification. Crucially, while prior adversarial fine-tuning approaches report more than 25% capability degradation on reasoning benchmarks, Trojan-Speak incurs less than 5% degradation while achieving 99+% classifier evasion for models with 14B+ parameters. We demonstrate that fine-tuned models can provide detailed responses to expert-level CBRN (Chemical, Biological, Radiological, and Nuclear) queries from Anthropic's Constitutional Classifiers bug-bounty program. Our findings reveal that LLM-based content classifiers alone are insufficient for preventing dangerous information disclosure...

Originally published on April 01, 2026. Curated by AI News.
