[2512.15052] SGM: Safety Glasses for Multimodal Large Language Models via Neuron-Level Detoxification
Summary
The paper presents SGM, a novel approach for detoxifying multimodal large language models (MLLMs) by recalibrating toxic neurons, significantly reducing harmful outputs while maintaining fluency.
Why It Matters
As MLLMs become more prevalent, ensuring their safety is crucial. This research addresses the pressing issue of toxicity in AI outputs, providing a method that enhances safety without compromising performance, which is vital for responsible AI deployment.
Key Takeaways
- SGM employs neuron-level detoxification to mitigate toxicity in MLLMs.
- The method reduces harmful output rates from 48.2% to 2.5% while preserving fluency.
- SGM can be combined with existing detoxification techniques for enhanced safety.
- Introduces the MM-TOXIC-QA framework for evaluating multimodal toxicity.
- Provides a low-cost, interpretable solution for safe multimodal generation.
Computer Science > Computation and Language
arXiv:2512.15052 (cs)
[Submitted on 17 Dec 2025 (v1), last revised 13 Feb 2026 (this version, v3)]
Title: SGM: Safety Glasses for Multimodal Large Language Models via Neuron-Level Detoxification
Authors: Hongbo Wang, MaungMaung AprilPyone, Isao Echizen
Abstract: Disclaimer: Samples in this paper may be harmful and cause discomfort. Multimodal large language models (MLLMs) enable multimodal generation but inherit toxic, biased, and NSFW signals from weakly curated pretraining corpora, causing safety risks, especially under adversarial triggers that opaque, training-free detoxification methods struggle to handle. We propose SGM, a white-box neuron-level multimodal intervention that acts like safety glasses for toxic neurons: it selectively recalibrates a small set of toxic expert neurons via expertise-weighted soft suppression, neutralizing harmful cross-modal activations without any parameter updates. We establish MM-TOXIC-QA, a multimodal toxicity evaluation framework, and compare SGM with existing detoxification techniques. Experiments on open-source MLLMs show that SGM mitigates toxicity in standard and adversarial conditions, cutting harmful rates from 48.2% to 2.5% while preserving fluency and multimodal reasoning. SGM is extensible, and...
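The core mechanism the abstract describes, i.e. softly scaling down a small set of identified toxic neurons at inference time without any parameter updates, can be illustrated with a minimal sketch. The function name, the neuron indices, and the expertise weights below are hypothetical stand-ins, not the paper's actual implementation; the paper's own expertise-weighted scheme may differ in how weights are derived and applied.

```python
import numpy as np

def soft_suppress(activations, toxic_idx, expertise, alpha=1.0):
    """Hypothetical sketch of expertise-weighted soft suppression:
    scale each flagged neuron's activation by (1 - alpha * expertise),
    leaving all other neurons untouched (no weights are modified)."""
    out = activations.copy()
    # Clip so a large expertise score never flips a neuron's sign.
    scale = np.clip(1.0 - alpha * expertise, 0.0, 1.0)
    out[toxic_idx] = out[toxic_idx] * scale
    return out

# Toy hidden-layer activations for one token position.
acts = np.array([0.5, -1.2, 2.0, 0.3])
toxic_idx = np.array([1, 2])        # hypothetical toxic-neuron indices
expertise = np.array([0.9, 0.5])    # hypothetical toxicity-expertise weights
print(soft_suppress(acts, toxic_idx, expertise))
# → [ 0.5  -0.12  1.    0.3 ]
```

In a real MLLM this kind of rescaling would typically be applied inside a forward hook on the relevant MLP layers, which is what makes the intervention training-free: only activations are edited, never parameters.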