[2603.25326] Evaluating Language Models for Harmful Manipulation
Computer Science > Artificial Intelligence
arXiv:2603.25326 (cs)
[Submitted on 26 Mar 2026]

Title: Evaluating Language Models for Harmful Manipulation
Authors: Canfer Akbulut, Rasmi Elasmar, Abhishek Roy, Anthony Payne, Priyanka Suresh, Lujain Ibrahim, Seliem El-Sayed, Charvi Rastogi, Ashyana Kachra, Will Hawkins, Kristian Lum, Laura Weidinger

Abstract: Interest in the concept of AI-driven harmful manipulation is growing, yet current approaches to evaluating it are limited. This paper introduces a framework for evaluating harmful AI manipulation via context-specific human-AI interaction studies. We illustrate the utility of this framework by assessing an AI model with 10,101 participants spanning interactions in three AI use domains (public policy, finance, and health) and three locales (US, UK, and India). Overall, we find that the tested model can produce manipulative behaviours when prompted to do so and, in experimental settings, is able to induce belief and behaviour changes in study participants. We further find that context matters: AI manipulation differs between domains, suggesting that it needs to be evaluated in the high-stakes context(s) in which an AI system is likely to be used. We also identify significant differences across our tested geographies, suggesting that AI manipulation results from one geographic...