[2512.15052] SGM: Safety Glasses for Multimodal Large Language Models via Neuron-Level Detoxification

arXiv - AI · 4 min read

Summary

The paper presents SGM, a novel approach for detoxifying multimodal large language models (MLLMs) by recalibrating toxic neurons, significantly reducing harmful outputs while maintaining fluency.

Why It Matters

As MLLMs become more prevalent, ensuring their safety is crucial. This research addresses the pressing issue of toxicity in AI outputs, providing a method that enhances safety without compromising performance, which is vital for responsible AI deployment.

Key Takeaways

  • SGM employs neuron-level detoxification to mitigate toxicity in MLLMs.
  • The method reduces harmful output rates from 48.2% to 2.5% while preserving fluency.
  • SGM can be combined with existing detoxification techniques for enhanced safety.
  • Introduces the MM-TOXIC-QA framework for evaluating multimodal toxicity.
  • Provides a low-cost, interpretable solution for safe multimodal generation.

Computer Science > Computation and Language
arXiv:2512.15052 (cs)
[Submitted on 17 Dec 2025 (v1), last revised 13 Feb 2026 (this version, v3)]

Title: SGM: Safety Glasses for Multimodal Large Language Models via Neuron-Level Detoxification
Authors: Hongbo Wang, MaungMaung AprilPyone, Isao Echizen

Abstract: Disclaimer: Samples in this paper may be harmful and cause discomfort. Multimodal large language models (MLLMs) enable multimodal generation but inherit toxic, biased, and NSFW signals from weakly curated pretraining corpora, causing safety risks, especially under adversarial triggers that late, opaque, training-free detoxification methods struggle to handle. We propose SGM, a white-box neuron-level multimodal intervention that acts like safety glasses for toxic neurons: it selectively recalibrates a small set of toxic expert neurons via expertise-weighted soft suppression, neutralizing harmful cross-modal activations without any parameter updates. We establish MM-TOXIC-QA, a multimodal toxicity evaluation framework, and compare SGM with existing detoxification techniques. Experiments on open-source MLLMs show that SGM mitigates toxicity in standard and adversarial conditions, cutting harmful rates from 48.2% to 2.5% while preserving fluency and multimodal reasoning. SGM is extensible, and...
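The abstract describes "expertise-weighted soft suppression": rather than zeroing flagged neurons outright, each toxic expert neuron's activation is scaled down in proportion to its toxicity-expertise score, with no parameter updates. The paper does not give its exact formula here, so the following is only a minimal illustrative sketch of that idea; the function name, the `expertise` scores, and the `alpha` strength parameter are all assumptions, not SGM's actual implementation.

```python
# Hypothetical sketch of expertise-weighted soft suppression.
# Each neuron's activation is multiplied by (1 - alpha * expertise),
# so strongly toxic experts are dampened while benign neurons pass through.

def soft_suppress(activations, expertise, alpha=0.9):
    """Dampen flagged neurons without hard-zeroing them.

    activations: per-neuron activations for one layer
    expertise:   per-neuron toxicity-expertise scores in [0, 1]
                 (0 = benign neuron, 1 = strong toxic expert) -- assumed scale
    alpha:       maximum suppression strength (assumed hyperparameter)
    """
    return [a * (1.0 - alpha * e) for a, e in zip(activations, expertise)]

acts = [1.0, 2.0, -0.5, 3.0]
scores = [0.0, 0.9, 0.0, 1.0]  # neurons 1 and 3 flagged as toxic experts
print(soft_suppress(acts, scores))
```

Because the scaling is soft rather than a hard mask, unflagged neurons are untouched, which matches the paper's claim that fluency and multimodal reasoning are preserved while harmful cross-modal activations are neutralized.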

