[2604.04385] How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models

Computer Science > Computation and Language
arXiv:2604.04385 (cs)
[Submitted on 6 Apr 2026]

Title: How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models
Authors: Gregory N. Frank

Abstract: We identify a recurring sparse routing mechanism in alignment-trained language models: a gate attention head reads detected content and triggers downstream amplifier heads that boost the signal toward refusal. Using political censorship and safety refusal as natural experiments, we trace this mechanism across 9 models from 6 labs, all validated on corpora of 120 prompt pairs. The gate head passes necessity and sufficiency interchange tests (p < 0.001, permutation null), and core amplifier heads are stable under bootstrap resampling (Jaccard 0.92-1.0). Three same-generation scaling pairs show that routing distributes at scale (ablation up to 17x weaker) while remaining detectable by interchange. By modulating the detection-layer signal, we continuously control policy strength from hard refusal through steering to factual compliance, with routing thresholds that vary by topic. The circuit also reveals a structural separation between intent recognition and policy routing: under cipher encoding, the gate head's routing contribution collapses (78% in Phi-4 at n=120) while the model re...
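The abstract reports that the core amplifier heads are stable under bootstrap resampling, measured by Jaccard similarity, but does not describe the procedure. The following is a minimal sketch of one way such a check could work, not the paper's method. It assumes each prompt pair yields a per-head effect score, and that the "core" set is the top-k heads by mean score. The function names, the head labels, and the synthetic data are all hypothetical.

```python
import random


def jaccard(a, b):
    """Jaccard similarity between two sets (defined as 1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def top_k_heads(scores, k):
    """Return the k head ids with the largest mean score across prompt pairs.

    scores: list of dicts, one per prompt pair, mapping head id -> effect score.
    """
    heads = scores[0].keys()
    mean = {h: sum(s[h] for s in scores) / len(scores) for h in heads}
    return set(sorted(mean, key=mean.get, reverse=True)[:k])


def bootstrap_jaccard(scores, k, n_boot=200, seed=0):
    """Resample prompt pairs with replacement and compare each resampled
    top-k set to the full-sample top-k set. Returns (min, max) Jaccard."""
    rng = random.Random(seed)
    reference = top_k_heads(scores, k)
    sims = []
    for _ in range(n_boot):
        resample = [rng.choice(scores) for _ in scores]
        sims.append(jaccard(reference, top_k_heads(resample, k)))
    return min(sims), max(sims)


# Synthetic data (assumption): 16 hypothetical heads, 3 of which carry a
# strong effect, scored on 120 prompt pairs with small Gaussian noise.
data_rng = random.Random(1)
heads = [f"L{layer}H{head}" for layer in range(4) for head in range(4)]
strong = {"L1H2", "L2H0", "L3H3"}
scores = [
    {h: (1.0 if h in strong else 0.0) + data_rng.gauss(0, 0.1) for h in heads}
    for _ in range(120)
]

lo, hi = bootstrap_jaccard(scores, k=3)
print(f"Jaccard range over bootstrap resamples: {lo:.2f}-{hi:.2f}")
```

A range close to 1.0 means the same heads are selected no matter which prompt pairs happen to be sampled. A wide range would suggest the selected set depends on a few influential prompts.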

Originally published on April 07, 2026. Curated by AI News.
