[2602.15278] Visual Persuasion: What Influences Decisions of Vision-Language Models?
Summary
This article explores how vision-language models (VLMs) make decisions based on image inputs, introducing a framework that analyzes their preferences through controlled choice experiments.
Why It Matters
Understanding the decision-making processes of VLMs is crucial as these models increasingly influence consumer behavior and online interactions. This research provides insights into their visual preferences, which can help identify vulnerabilities and improve governance in AI applications.
Key Takeaways
- VLMs can be influenced by specific visual edits in images.
- A framework for studying VLM decision-making is proposed, utilizing controlled image-based tasks.
- Optimized visual prompts can significantly alter selection probabilities in VLMs.
- The research aids in identifying visual vulnerabilities in AI systems.
- An automatic interpretability pipeline is developed to explain VLM preferences.
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.15278 (cs) [Submitted on 17 Feb 2026]
Authors: Manuel Cherep, Pranav M R, Pattie Maes, Nikhil Singh
Abstract: The web is littered with images, once created for human consumption and now increasingly interpreted by agents using vision-language models (VLMs). These agents make visual decisions at scale, deciding what to click, recommend, or buy. Yet, we know little about the structure of their visual preferences. We introduce a framework for studying this by placing VLMs in controlled image-based choice tasks and systematically perturbing their inputs. Our key idea is to treat the agent's decision function as a latent visual utility that can be inferred through revealed preference: choices between systematically edited images. Starting from common images, such as product photos, we propose methods for visual prompt optimization, adapting text optimization methods to iteratively propose and apply visually plausible modifications using an image generation model (such as in composition, lighting, or background). We then evaluate which edits increase selection probability. Through large-scale experiments on frontier VLMs, we demonstrate that optimized edits significant...
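The optimization loop the abstract describes — propose a plausible visual edit, apply it with an image generation model, and keep it only if the VLM's selection probability rises — can be sketched as a simple hill climb. This is our own illustrative reading, not the authors' implementation: `apply_edit` and `choice_prob` are hypothetical stand-ins for the image-editing model and the VLM's pairwise choice query, implemented here as toy stubs so the sketch runs end to end.

```python
import random

# Candidate modifications of the kind the paper mentions (composition,
# lighting, background); the concrete list here is illustrative only.
EDITS = ["brighter lighting", "clean white background", "closer crop", "warmer tones"]

def apply_edit(image, edit):
    """Stub for the image-generation model: here an 'image' is just the
    list of edits applied so far, so applying an edit appends to it."""
    return image + [edit]

def choice_prob(candidate):
    """Stub for the VLM choice query: probability the candidate image is
    selected over the baseline. A real system would show the image pair
    to a vision-language model; this toy utility rewards each edit."""
    return min(0.99, 0.5 + 0.1 * len(candidate))

def optimize(baseline, steps=3, seed=0):
    """Hill-climb over edits: keep a proposed edit only if it raises the
    VLM's (stubbed) selection probability."""
    rng = random.Random(seed)
    best, best_p = list(baseline), choice_prob(baseline)
    for _ in range(steps):
        candidate = apply_edit(best, rng.choice(EDITS))
        p = choice_prob(candidate)
        if p > best_p:
            best, best_p = candidate, p
    return best, best_p
```

With the toy utility above, each accepted edit raises the selection probability by 0.1 from the 0.5 baseline; in the actual framework, the probability shift would instead be estimated from repeated VLM choices between the edited and original images.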