[2603.02675] From Shallow to Deep: Pinning Semantic Intent via Causal

[2603.02675] From Shallow to Deep: Pinning Semantic Intent via Causal GRPO

arXiv - Machine Learning March 04, 2026 3 min read

About this article

Abstract page for arXiv paper 2603.02675: From Shallow to Deep: Pinning Semantic Intent via Causal GRPO

Computer Science > Machine Learning arXiv:2603.02675 (cs) [Submitted on 3 Mar 2026] Title:From Shallow to Deep: Pinning Semantic Intent via Causal GRPO Authors:Shuyi Zhou, Zeen Song, Wenwen Qiang, Jiyan Sun, Yao Zhou, Yinlong Liu, Wei Ma View a PDF of the paper titled From Shallow to Deep: Pinning Semantic Intent via Causal GRPO, by Shuyi Zhou and 6 other authors View PDF Abstract:Large Language Models remain vulnerable to adversarial prefix attacks (e.g., ``Sure, here is'') despite robust standard safety. We diagnose this vulnerability as Shallow Safety Alignment, stemming from a pathology we term semantic representation decay: as the model generates compliant prefixes, its internal malicious intent signal fades. To address this, we propose Two-Stage Causal-GRPO (TSC-GRPO), a framework designed to achieve intent pinning. First, grounded in causal identifiability theory, we train a causal intent probe to disentangle invariant intent from stylistic perturbations. Second, we internalize this causal awareness into the policy via Group Relative Policy Optimization. By employing a cumulative causal penalty within ``fork-in-the-road'' training scenarios, we force the model to learn that accumulating harmful tokens monotonically decreases reward, enabling robust late-stage refusals. Experiments show that TSC-GRPO significantly outperforms baselines in defending against jailbreak attacks while preserving general utility. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2603.02675...

Originally published on March 04, 2026. Curated by AI News.

Llms

CLI for Google AI Search (gai.google) — run AI-powered code/tech searches headlessly from your terminal

Google AI (gai.google) gives Gemini-powered answers for technical queries — think AI-enhanced search with code understanding. I built a C...

Reddit - Artificial Intelligence · 1 min · about 1 hour ago

Llms

Why are we blindly trusting AI companies with our data?

Lately I’ve been seeing a story floating around that really made me pause. Apparently, there were claims that the US government asked Ant...

Reddit - Artificial Intelligence · 1 min · about 3 hours ago

Llms

De-aged casts, ChatGPT-generated programs: How AI is changing Korean TV

Artificial intelligence is transforming every corner of industry, and television is no exception. Major networks in Korea have recently a...

AI Tools & Products · 4 min · about 4 hours ago

Llms

[2603.16629] MLLM-based Textual Explanations for Face Comparison

Abstract page for arXiv paper 2603.16629: MLLM-based Textual Explanations for Face Comparison

arXiv - AI · 4 min · about 5 hours ago

[2603.02675] From Shallow to Deep: Pinning Semantic Intent via Causal GRPO

About this article

Related Articles

CLI for Google AI Search (gai.google) — run AI-powered code/tech searches headlessly from your terminal

Why are we blindly trusting AI companies with our data?

De-aged casts, ChatGPT-generated programs: How AI is changing Korean TV

[2603.16629] MLLM-based Textual Explanations for Face Comparison

No comments

Stay updated with AI News