[2603.08486] Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images

arXiv - AI 3 min read

About this article

Abstract page for arXiv paper 2603.08486: Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images

Computer Science > Computer Vision and Pattern Recognition
arXiv:2603.08486 (cs)
[Submitted on 9 Mar 2026 (v1), last revised 15 Apr 2026 (this version, v2)]

Title: Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images
Authors: Qishun Yang, Shu Yang, Lijie Hu, Di Wang

Abstract: Multimodal large language models (MLLMs) face safety misalignment, where visual inputs can enable harmful outputs. Existing methods address this with explicit safety labels or contrastive data; yet threat-related concepts are concrete and visually depictable, while safety concepts, such as helpfulness, are abstract and lack visual referents. Inspired by the self-fulfilling mechanism underlying emergent misalignment, we propose Visual Self-Fulfilling Alignment (VSFA). VSFA fine-tunes vision-language models (VLMs) on neutral VQA tasks constructed around threat-related images, without any safety labels. Through repeated exposure to threat-related visual content, models internalize the implicit semantics of vigilance and caution, shaping safety-oriented personas. Experiments across multiple VLMs and safety benchmarks demonstrate that VSFA reduces the attack success rate, improves response quality, and mitigates over-refusal while preserving general capabilities. Our work extends the...

Originally published on April 16, 2026. Curated by AI News.

Related Articles

[2604.01473] SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits
LLMs

Abstract page for arXiv paper 2604.01473: SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits

arXiv - AI · 4 min

[2603.23682] Assessment Design in the AI Era: A Method for Identifying Items Functioning Differentially for Humans and Chatbots
LLMs

Abstract page for arXiv paper 2603.23682: Assessment Design in the AI Era: A Method for Identifying Items Functioning Differentially for ...

arXiv - AI · 4 min

[2601.07422] Two Pathways to Truthfulness: On the Intrinsic Encoding of LLM Hallucinations
LLMs

Abstract page for arXiv paper 2601.07422: Two Pathways to Truthfulness: On the Intrinsic Encoding of LLM Hallucinations

arXiv - AI · 4 min

[2512.22174] BitFlipScope: Scalable Fault Localization and Recovery for Bit-Flip Corruptions in LLMs
LLMs

Abstract page for arXiv paper 2512.22174: BitFlipScope: Scalable Fault Localization and Recovery for Bit-Flip Corruptions in LLMs

arXiv - AI · 4 min