[2604.08846] Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
Computer Science > Machine Learning
arXiv:2604.08846 (cs)
[Submitted on 10 Apr 2026]

Title: Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs
Authors: Jinqi Luo, Jinyu Yang, Tal Neiman, Lei Fan, Bing Yin, Son Tran, Mubarak Shah, René Vidal

Abstract: Multimodal Large Language Models (MLLMs) have been shown to be vulnerable to malicious queries that elicit unsafe responses. Recent work uses prompt engineering, response classification, or finetuning to improve MLLM safety. However, such approaches are often ineffective against evolving malicious patterns, may require rerunning the query, or demand heavy computational resources. Steering the activations of a frozen model at inference time has recently emerged as a flexible and effective alternative. Yet existing steering methods for MLLMs typically handle only a narrow set of safety-related concepts, or struggle to adjust specific concepts without affecting others. To address these challenges, we introduce Dictionary-Aligned Concept Control (DACO), a framework that uses a curated concept dictionary and a Sparse Autoencoder (SAE) to provide granular control over MLLM activations. First, we curate a dictionary of 15,000 multimodal concepts by retrieving over 400,000 caption-image stimuli and summarizing their activations into concept directions...
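The abstract is truncated, so the exact steering rule DACO uses is not specified here. As a minimal sketch of the general idea the abstract describes (summarizing stimulus activations into a concept direction, then adjusting that concept in a frozen model's hidden state without disturbing orthogonal directions), consider the following toy example. The function names, the mean-pooling summary, and the projection-removal steering rule are all illustrative assumptions, not the paper's method:

```python
import numpy as np

rng = np.random.default_rng(0)

def concept_direction(acts: np.ndarray) -> np.ndarray:
    """Summarize stimulus activations (n_stimuli, dim) into a unit concept
    direction. Mean-pooling is an illustrative choice, not DACO's procedure."""
    v = acts.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Suppress a concept by removing alpha times its projection from the
    hidden state: h' = h - alpha * <h, d> d. Components orthogonal to d
    (i.e., other concepts) are left untouched, giving granular control."""
    return hidden - alpha * (hidden @ direction) * direction

# Toy demo: stimuli whose activations share a "concept" along one axis plus noise.
acts = rng.normal(0.0, 0.1, size=(400, 8))
acts[:, 0] += 1.0
d = concept_direction(acts)

h = rng.normal(size=8)                 # a hidden state at inference time
h_steered = steer(h, d, alpha=1.0)
print(abs(h_steered @ d) < 1e-9)       # fully suppressed: orthogonal to d → True
```

With `alpha` between 0 and 1 the concept is attenuated rather than removed, and a negative `alpha` would amplify it; inference-time steering methods typically expose such a knob per concept.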