[2604.09253] Mosaic: Multimodal Jailbreak against Closed-Source VLMs via Multi-View Ensemble Optimization
Computer Science > Computer Vision and Pattern Recognition
arXiv:2604.09253 (cs)
[Submitted on 10 Apr 2026]

Title: Mosaic: Multimodal Jailbreak against Closed-Source VLMs via Multi-View Ensemble Optimization
Authors: Yuqin Lan, Gen Li, Yuanze Hu, Weihao Shen, Zhaoxin Fan, Faguo Wu, Xiao Zhang, Laurence T. Yang, Zhiming Zheng

Abstract: Vision-Language Models (VLMs) are powerful but remain vulnerable to multimodal jailbreak attacks. Existing attacks mainly rely on either explicit visual prompt attacks or gradient-based adversarial optimization. While the former is easier to detect, the latter produces subtle perturbations that are less perceptible, but is usually optimized and evaluated under homogeneous open-source surrogate-target settings, leaving its effectiveness on commercial closed-source VLMs under heterogeneous settings unclear. To examine this issue, we study different surrogate-target settings and observe a consistent gap between homogeneous and heterogeneous settings, a phenomenon we term surrogate dependency. Motivated by this finding, we propose Mosaic, a Multi-view ensemble optimization framework for multimodal jailbreak against closed-source VLMs, which alleviates surrogate dependency under heterogeneous surrogate-target settings by reducing over-reliance on any single surrogate...