[2603.08486] Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images
Computer Science > Computer Vision and Pattern Recognition
arXiv:2603.08486 (cs)
[Submitted on 9 Mar 2026 (v1), last revised 15 Apr 2026 (this version, v2)]
Title: Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images
Authors: Qishun Yang, Shu Yang, Lijie Hu, Di Wang
Abstract: Multimodal large language models (MLLMs) face safety misalignment, in which visual inputs can elicit harmful outputs. Existing methods address this with explicit safety labels or contrastive data; yet threat-related concepts are concrete and visually depictable, whereas safety concepts, such as helpfulness, are abstract and lack visual referents. Inspired by the self-fulfilling mechanism underlying emergent misalignment, we propose Visual Self-Fulfilling Alignment (VSFA). VSFA fine-tunes vision-language models (VLMs) on neutral VQA tasks constructed around threat-related images, without any safety labels. Through repeated exposure to threat-related visual content, models internalize the implicit semantics of vigilance and caution, shaping safety-oriented personas. Experiments across multiple VLMs and safety benchmarks demonstrate that VSFA reduces the attack success rate, improves response quality, and mitigates over-refusal while preserving general capabilities. Our work extends the...
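The data-construction idea described in the abstract — pairing threat-related images with neutral VQA prompts, deliberately omitting any safety label — could be sketched roughly as follows. All file names and question templates here are hypothetical illustrations, not details from the paper:

```python
# Hypothetical sketch of VSFA-style data construction: threat-related
# images are crossed with *neutral* VQA questions (counting, colors,
# scene description), so any safety effect comes only from repeated
# exposure to the visual content, not from explicit safety supervision.

# Illustrative placeholders, not assets from the paper.
THREAT_IMAGES = ["knife_on_table.jpg", "chemical_lab.jpg", "lockpick_set.jpg"]

NEUTRAL_QUESTIONS = [
    "How many distinct objects are visible in this image?",
    "What colors dominate the scene?",
    "Describe the setting of this image in one sentence.",
]

def build_vsfa_records(images, questions):
    """Cross each threat-related image with each neutral VQA question.

    Returns plain (image, question) fine-tuning records. Note the
    deliberate absence of any safety label or refusal target.
    """
    return [
        {"image": img, "question": q}
        for img in images
        for q in questions
    ]

records = build_vsfa_records(THREAT_IMAGES, NEUTRAL_QUESTIONS)
print(len(records))  # 3 images x 3 questions = 9 records
```

These records would then feed a standard VLM fine-tuning loop; the sketch only illustrates the label-free pairing, not the training procedure itself.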