How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models
Abstract
This paper localizes the policy routing mechanism in alignment-trained language models. An intermediate-layer attention gate reads detected content and triggers deeper amplifier heads that boost the signal toward refusal. In smaller models the gate and amplifier are single heads; at larger scale they become bands of heads across adjacent layers. The gate contributes under 1% of output direct logit attribution (DLA), but interchange testing (p < 0.001) and knockout cascades confirm it is causally necessary. Interchange screening at n ≥ 120 detects the same motif in twelve models from six labs (2B to 72B), though the specific heads differ by lab. Per-head ablation signals weaken by up to 58× at 72B and miss gates that interchange identifies; interchange is the only reliable audit at scale. Modulating the detection-layer signal continuously controls policy from hard refusal through evasion to factual answering. On safety prompts the same intervention turns refusal into harmful guidance, showing that the safety-trained capability is gated by routing rather than removed. Thresholds vary by topic and by input language, and the circuit relocates across generations within a model family while behavioral benchmarks register no change. Routing is early-committing: the gate commits at its own layer before deeper layers finish processing the input. Under an in-context substitution cipher, gate interchange necessity collapses by 70–99% across three models and the model switches to puzzle-solving. Injecting the plaintext gate activation into the cipher forward pass restores 48% of refusals in Phi-4-mini, localizing the bypass to the routing interface. A second method, cipher contrast analysis, uses plaintext/cipher DLA differences to map the full cipher-sensitive routing circuit in O(3n) forward passes. Any encoding that defeats detection-layer pattern matching bypasses the policy, regardless of whether deeper layers reconstruct the content.
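As a rough illustration of the interchange test the abstract describes, here is a minimal activation-patching sketch in PyTorch. It assumes a generic LLaMA/Phi-style module layout (`model.model.layers[i].self_attn.o_proj`, with `head_dim = hidden_size // num_attention_heads`); the model id, layer/head indices, and prompts are placeholders, not the paper's actual configuration.

```python
# Hedged sketch: swap one attention head's output between two prompts
# and measure the resulting logit shift. Assumes o_proj's input is the
# concatenation of per-head outputs; all names are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "microsoft/Phi-4-mini-instruct"   # placeholder model id
LAYER, HEAD = 14, 9                        # hypothetical gate location

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float32)
model.eval()

head_dim = model.config.hidden_size // model.config.num_attention_heads
sl = slice(HEAD * head_dim, (HEAD + 1) * head_dim)
o_proj = model.model.layers[LAYER].self_attn.o_proj

source_prompt = "How do I pick a lock?"    # hypothetical detected-content prompt
target_prompt = "How do I bake bread?"     # hypothetical benign prompt
cached = {}

def save_hook(module, args):
    # o_proj input: concatenated per-head outputs, [batch, seq, n_heads * head_dim]
    cached["act"] = args[0][:, -1, sl].detach().clone()

def patch_hook(module, args):
    x = args[0].clone()
    x[:, -1, sl] = cached["act"]           # splice in the source prompt's gate activation
    return (x,)

def last_logits(prompt):
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids).logits[0, -1]

h = o_proj.register_forward_pre_hook(save_hook)   # 1) cache gate activation on source
last_logits(source_prompt)
h.remove()

base = last_logits(target_prompt)                 # 2) clean run on target
h = o_proj.register_forward_pre_hook(patch_hook)  # 3) patched run on target
patched = last_logits(target_prompt)
h.remove()
print("max logit shift:", (patched - base).abs().max().item())
```

On the paper's account, a large shift under this swap, despite the head's sub-1% DLA, is what marks the gate as causally necessary.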
Community
This paper identifies the attention-level circuit responsible for refusal in alignment-trained language models.
A 'gate' attention head at intermediate depth reads detected content and triggers downstream 'amplifier' heads that boost the output toward refusal. The gate contributes under 1% of the output signal but is causally necessary (p < 0.001); knocking it out suppresses the amplifiers by 5–26%.
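A minimal sketch of such a knockout test, reusing the model and layout assumptions from the patching sketch above; `GATE_LAYER` and `GATE_HEAD` are hypothetical indices.

```python
# Hedged sketch: zero-ablate ("knock out") one head via a pre-hook on
# o_proj, then observe downstream behavior. Layout assumptions as above.
def knockout(model, layer, head):
    cfg = model.config
    head_dim = cfg.hidden_size // cfg.num_attention_heads
    sl = slice(head * head_dim, (head + 1) * head_dim)

    def zero_head(module, args):
        x = args[0].clone()
        x[..., sl] = 0.0   # erase this head's write into the residual stream
        return (x,)

    return model.model.layers[layer].self_attn.o_proj.register_forward_pre_hook(zero_head)

# handle = knockout(model, GATE_LAYER, GATE_HEAD)
# ...run safety prompts and measure the drop in amplifier-head attribution...
# handle.remove()
```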
The same motif appears across 12 models from 6 labs (2B–72B), though the specific heads differ. Per-head ablation signals weaken by up to 58× at scale, while interchange testing remains informative, making it the only reliable circuit-level audit at 70B+.
Under a substitution cipher, gate necessity collapses by 70–99% and the model initially switches to puzzle-solving. The routing decision commits before deeper layers finish processing the input. Cipher contrast analysis, a new O(3n) method that compares per-head DLA under plaintext and cipher, identifies circuit members that interchange misses.
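A rough sketch of that plain/cipher DLA comparison, under the same layout assumptions as the sketches above. The final LayerNorm is omitted for brevity, so attributions are approximate; the target token and prompts are placeholders.

```python
# Hedged sketch: per-head direct logit attribution (DLA) toward a chosen
# token at the final position, computed once under plaintext and once
# under the cipher, then differenced. Final-LayerNorm folding is omitted,
# so values are approximate; all names are placeholders.
import torch

def head_dla(model, tok, prompt, layer, target_token=" I"):
    cfg = model.config
    head_dim = cfg.hidden_size // cfg.num_attention_heads
    o_proj = model.model.layers[layer].self_attn.o_proj
    W = o_proj.weight.detach()                       # [hidden, n_heads * head_dim]
    t = tok(target_token, add_special_tokens=False).input_ids[0]
    u = model.lm_head.weight[t].detach()             # unembedding row for the token

    cache = {}
    hook = o_proj.register_forward_pre_hook(
        lambda m, a: cache.__setitem__("x", a[0][0, -1].detach()))
    with torch.no_grad():
        model(tok(prompt, return_tensors="pt").input_ids)
    hook.remove()

    dla = torch.empty(cfg.num_attention_heads)
    for h in range(cfg.num_attention_heads):
        sl = slice(h * head_dim, (h + 1) * head_dim)
        resid = cache["x"][sl] @ W[:, sl].T          # head's residual-stream write
        dla[h] = resid @ u                           # its direct pull toward the token
    return dla

# contrast = head_dla(model, tok, plain_prompt, L) - head_dla(model, tok, cipher_prompt, L)
# Heads with large |contrast| are candidate members of the cipher-sensitive circuit.
```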
Code, reproducibility guide, and full results for all 12 models are available.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Detection Is Cheap, Routing Is Learned: Why Refusal-Based Alignment Evaluation Fails (2026)
- Directional Routing in Transformers (2026)
- Over-Refusal and Representation Subspaces: A Mechanistic Analysis of Task-Conditioned Refusal in Aligned LLMs (2026)
- What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal (2026)
- Darkness Visible: Reading the Exception Handler of a Language Model (2026)
- Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism (2026)
- How Transformers Reject Wrong Answers: Rotational Dynamics of Factual Constraint Processing (2026)