[2602.15278] Visual Persuasion: What Influences Decisions of Vision-Language Models?

arXiv - AI · 4 min read

Summary

This article explores how vision-language models (VLMs) make decisions based on image inputs, introducing a framework for analyzing their preferences through controlled image-based choice experiments.

Why It Matters

Understanding the decision-making processes of VLMs is crucial as these models increasingly influence consumer behavior and online interactions. This research provides insights into their visual preferences, which can help identify vulnerabilities and improve governance in AI applications.

Key Takeaways

  • VLMs can be influenced by specific visual edits in images.
  • A framework for studying VLM decision-making is proposed, using controlled image-based choice tasks (a rough sketch of the underlying preference model follows this list).
  • Optimized visual prompts can significantly alter selection probabilities in VLMs.
  • The research aids in identifying visual vulnerabilities in AI systems.
  • An automatic interpretability pipeline is developed to explain VLM preferences.
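
The paper's core idea is to treat a VLM's decision function as a latent visual utility revealed through pairwise choices. As a rough, hypothetical illustration of how such a utility could be recovered from A/B trials, here is a minimal Bradley-Terry-style fit in Python; the data format, function names, and update rule are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch: recover a latent utility score per image variant from
# pairwise VLM choices, Bradley-Terry style. Data format and model are
# illustrative assumptions, not the paper's method.
import numpy as np

def fit_utilities(n_items, choices, lr=0.1, epochs=200):
    """choices: list of (winner, loser) index pairs from VLM A/B trials."""
    u = np.zeros(n_items)  # latent utility per image variant
    for _ in range(epochs):
        grad = np.zeros(n_items)
        for w, l in choices:
            # P(w beats l) under Bradley-Terry: sigmoid(u_w - u_l)
            p = 1.0 / (1.0 + np.exp(-(u[w] - u[l])))
            grad[w] += 1.0 - p   # push winner's utility up
            grad[l] -= 1.0 - p   # push loser's utility down
        u += lr * grad
        u -= u.mean()  # utilities are identifiable only up to a constant
    return u

# Example: variant 2 is consistently chosen over variants 0 and 1
print(fit_utilities(3, [(2, 0), (2, 1), (2, 0), (1, 0)]))
```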

Abstract

Computer Science > Computer Vision and Pattern Recognition · arXiv:2602.15278 (cs) · Submitted on 17 Feb 2026

Title: Visual Persuasion: What Influences Decisions of Vision-Language Models?
Authors: Manuel Cherep, Pranav M R, Pattie Maes, Nikhil Singh

The web is littered with images, once created for human consumption and now increasingly interpreted by agents using vision-language models (VLMs). These agents make visual decisions at scale, deciding what to click, recommend, or buy. Yet, we know little about the structure of their visual preferences. We introduce a framework for studying this by placing VLMs in controlled image-based choice tasks and systematically perturbing their inputs. Our key idea is to treat the agent's decision function as a latent visual utility that can be inferred through revealed preference: choices between systematically edited images. Starting from common images, such as product photos, we propose methods for visual prompt optimization, adapting text optimization methods to iteratively propose and apply visually plausible modifications using an image generation model (such as in composition, lighting, or background). We then evaluate which edits increase selection probability. Through large-scale experiments on frontier VLMs, we demonstrate that optimized edits significant...
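
The optimization procedure the abstract describes iteratively proposes plausible edits, applies them with an image generation model, and keeps edits that raise selection probability. Below is a minimal greedy sketch of that loop; `propose_edit`, `apply_edit`, and `selection_probability` are hypothetical placeholders (random stubs here) standing in for the edit proposer, the image generator, and repeated VLM A/B queries, and do not come from the paper.

```python
# Hypothetical sketch of the iterative visual-prompt-optimization loop.
# All function bodies are placeholder stubs for illustration only.
import random

def propose_edit():
    # e.g. visually plausible modifications (composition, lighting, background)
    return random.choice([
        "warmer lighting", "cleaner background", "tighter crop",
        "add soft shadow", "increase contrast",
    ])

def apply_edit(image, edit):
    # placeholder for a call to an image generation/editing model
    return f"{image}+[{edit}]"

def selection_probability(image, competitor, trials=20):
    # placeholder: estimate P(VLM picks `image` over `competitor`) by
    # repeated A/B choice queries; random here purely for illustration
    return sum(random.random() < 0.5 for _ in range(trials)) / trials

def optimize(image, competitor, steps=10):
    best, best_p = image, selection_probability(image, competitor)
    for _ in range(steps):
        candidate = apply_edit(best, propose_edit())
        p = selection_probability(candidate, competitor)
        if p > best_p:  # greedy hill-climbing on revealed preference
            best, best_p = candidate, p
    return best, best_p

print(optimize("product_photo.png", "rival_photo.png"))
```

In practice each probability estimate requires many VLM queries, so a real loop would trade off the number of edit proposals against the number of choice trials per candidate.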

Related Articles

LLMs

[R] Depth-first pruning transfers: GPT-2 → TinyLlama with stable gains and minimal loss

TL;DR: Removing the right layers (instead of shrinking all layers) makes transformer models ~8–12% smaller with only ~6–8% quality loss, ...

Reddit - Machine Learning · 1 min ·
LLMs

Built a training stability monitor that detects instability before your loss curve shows anything — open sourced the core today

Been working on a weight divergence trajectory curvature approach to detecting neural network training instability. Treats weight updates...

Reddit - Artificial Intelligence · 1 min ·
LLMs

This Is Not Hacking. This Is Structured Intelligence.

Watch me demonstrate everything I've been talking about—live, in real time. The Setup: Maestro University AI enrollment system Standard c...

Reddit - Artificial Intelligence · 1 min ·
LLMs

[D] How come Muon is only being used for Transformers?

Muon has quickly been adopted in LLM training, yet we don't see it being talked about in other contexts. Searches for Muon on ConvNets tu...

Reddit - Machine Learning · 1 min ·