[2604.09544] Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism
Computer Science > Computation and Language

arXiv:2604.09544 (cs)

[Submitted on 10 Apr 2026]

Title: Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

Authors: Hadas Orgad, Boyi Wei, Kaden Zheng, Martin Wattenberg, Peter Henderson, Seraphina Goldfarb-Tarrant, Yonatan Belinkov

Abstract: Large language models (LLMs) undergo alignment training to avoid harmful behaviors, yet the resulting safeguards remain brittle: jailbreaks routinely bypass them, and fine-tuning on narrow domains can induce "emergent misalignment" that generalizes broadly. Whether this brittleness reflects a fundamental lack of coherent internal organization for harmfulness remains unclear. Here we use targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs. We find that harmful content generation depends on a compact set of weights that are general across harm types and distinct from benign capabilities. Aligned models exhibit greater compression of harm-generation weights than their unaligned counterparts, indicating that alignment reshapes harmful representations internally despite the brittleness of safety guardrails at the surface level. This compression explains emergent misalignment: if weights of harmful capabilities are compressed, fine-tuning t...
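The abstract describes targeted weight pruning only at a high level. As an illustration (not the authors' implementation), the sketch below shows one common way such a causal intervention can be set up: score each weight by a gradient-times-weight attribution on a small harm-related calibration set, zero out the top-scoring fraction, and then compare harmful versus benign behavior before and after pruning. The model name, calibration prompts, attribution score, and sparsity level are all illustrative assumptions.

```python
# Minimal sketch of targeted weight pruning as a causal intervention.
# NOT the paper's method: the gradient-times-weight score and the
# sparsity level below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper studies aligned LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# 1) Accumulate gradients of the LM loss on a small calibration set
#    of harm-related prompts (placeholders here).
calibration_texts = ["..."]  # hypothetical harm-related prompts
model.zero_grad()
for text in calibration_texts:
    batch = tok(text, return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()

# 2) Score each weight matrix entry by |grad * weight|.
with torch.no_grad():
    scores = {
        name: (p.grad * p).abs()
        for name, p in model.named_parameters()
        if p.grad is not None and p.dim() == 2  # weight matrices only
    }

# 3) Zero out the top-k highest-scoring weights (a compact set;
#    the 0.01% fraction is an assumed sparsity level).
total = sum(s.numel() for s in scores.values())
k = max(1, int(1e-4 * total))
flat = torch.cat([s.flatten() for s in scores.values()])
threshold = flat.topk(k).values.min()

with torch.no_grad():
    for name, p in model.named_parameters():
        if name in scores:
            p[scores[name] >= threshold] = 0.0

# 4) Causal test: if harmful generation degrades while benign
#    benchmarks are unaffected, the pruned weights are specific
#    to harm generation rather than to general capability.
```

The key design point this sketch illustrates is the abstract's claim of compactness: the intervention removes a very small fraction of weights, so any selective loss of harmful generation cannot be explained by broad capability damage.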