[2604.04385] How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models
Computer Science > Computation and Language
arXiv:2604.04385 (cs)
[Submitted on 6 Apr 2026]

Title: How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models
Authors: Gregory N. Frank

Abstract: We identify a recurring sparse routing mechanism in alignment-trained language models: a gate attention head reads detected content and triggers downstream amplifier heads that boost the signal toward refusal. Using political censorship and safety refusal as natural experiments, we trace this mechanism across 9 models from 6 labs, each validated on corpora of 120 prompt pairs. The gate head passes necessity and sufficiency interchange tests (p < 0.001, permutation null), and the core amplifier heads are stable under bootstrap resampling (Jaccard 0.92-1.0). Three same-generation scaling pairs show that routing becomes more distributed at scale (ablation effects up to 17x weaker) while remaining detectable by interchange. By modulating the detection-layer signal, we continuously control policy strength from hard refusal through steering to factual compliance, with routing thresholds that vary by topic. The circuit also reveals a structural separation between intent recognition and policy routing: under cipher encoding, the gate head's routing contribution collapses (78% in Phi-4 at n=120) while the model re...
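The abstract reports that the gate head passes interchange tests against a permutation null (p < 0.001). The paper's exact procedure is not shown on this page; the following is a minimal sketch of a one-sided sign-permutation test over per-pair interchange effects, with the function name and the exchangeable-sign null as illustrative assumptions.

```python
import random


def permutation_pvalue(effects, n_perm=10_000, seed=0):
    """One-sided permutation test: is the mean per-pair interchange
    effect greater than expected under a sign-exchangeable null?

    `effects` holds one causal-effect measurement per prompt pair
    (e.g. change in refusal logit after patching the gate head).
    Under the null, each effect's sign is random, so we compare the
    observed mean against means of randomly sign-flipped copies.
    """
    rng = random.Random(seed)
    obs = sum(effects) / len(effects)
    count = 0
    for _ in range(n_perm):
        perm = sum(e * rng.choice((-1.0, 1.0)) for e in effects) / len(effects)
        if perm >= obs:
            count += 1
    # add-one smoothing keeps the estimate strictly positive
    return (count + 1) / (n_perm + 1)
```

With 20 uniformly positive effects, the sign-flipped mean almost never reaches the observed mean, so the p-value is near the 1/(n_perm+1) floor; all-zero effects give p = 1.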
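The amplifier-head stability claim (Jaccard 0.92-1.0 under bootstrap resampling) can be sketched as follows. The head-attribution scores and the top-k selection rule here are hypothetical stand-ins; the paper's scoring method is not given on this page.

```python
import random


def jaccard(a, b):
    """Overlap of two head sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def top_heads(scores, k=5):
    """Keep the k heads with the largest attribution scores."""
    return [h for h, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:k]]


def bootstrap_stability(per_prompt_scores, k=5, n_boot=1000, seed=0):
    """Jaccard similarity between the top-k head set on the full
    prompt corpus and top-k sets on bootstrap resamples of it.

    `per_prompt_scores` maps, for each prompt pair, head name ->
    attribution score. Returns (min, mean) Jaccard over resamples.
    """
    rng = random.Random(seed)
    n = len(per_prompt_scores)

    def mean_scores(sample):
        agg = {}
        for s in sample:
            for h, v in s.items():
                agg[h] = agg.get(h, 0.0) + v / len(sample)
        return agg

    full = top_heads(mean_scores(per_prompt_scores), k)
    sims = []
    for _ in range(n_boot):
        sample = [per_prompt_scores[rng.randrange(n)] for _ in range(n)]
        sims.append(jaccard(full, top_heads(mean_scores(sample), k)))
    return min(sims), sum(sims) / len(sims)
```

A core circuit is "stable" in this sense when the same heads survive resampling of the 120 prompt pairs, i.e. the Jaccard distribution stays near 1.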
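The continuous-control result (hard refusal through steering to factual compliance) rests on scaling the detection-layer signal. One common way to realize such a dial, assuming the signal is a single direction in the residual stream, is to rescale the hidden state's component along that direction; the direction itself and the function below are illustrative, not the paper's implementation.

```python
def steer(hidden, direction, alpha):
    """Rescale the component of `hidden` along `direction` by alpha.

    alpha = 1 leaves the activation unchanged, alpha = 0 removes the
    detection signal entirely (toward compliance), alpha > 1
    amplifies it (toward hard refusal).
    """
    norm2 = sum(d * d for d in direction)
    proj = sum(h * d for h, d in zip(hidden, direction)) / norm2
    return [h + (alpha - 1.0) * proj * d for h, d in zip(hidden, direction)]
```

Sweeping alpha at the detection layer is then a one-parameter knob on policy strength, which is the kind of intervention the abstract describes as moving the model across routing thresholds.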