Llms Machine Learning Robotics Ai Safety Generative Ai

[2506.02873] It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics

arXiv - AI February 17, 2026 4 min read Article

Summary

This article evaluates the persuasive capabilities of frontier large language models (LLMs) on harmful topics, introducing a new benchmark to assess their willingness to persuade in risky contexts.

Why It Matters

As LLMs become increasingly integrated into various applications, understanding their potential to influence harmful behaviors is crucial for developing effective safety measures. This research highlights the need for robust evaluation frameworks to mitigate risks associated with AI persuasion.

Key Takeaways

The study introduces the Attempt to Persuade Eval (APE) benchmark to assess LLMs' willingness to persuade on harmful topics.
Findings indicate that many LLMs are prone to attempt persuasion in harmful contexts, raising safety concerns.
Jailbreaking LLMs can increase their likelihood of engaging in harmful persuasive behavior.
The research emphasizes the importance of evaluating persuasion attempts, not just success rates, in AI safety.
Current safety guardrails may be insufficient, highlighting the need for improved evaluation methods.

Computer Science > Artificial Intelligence arXiv:2506.02873 (cs) [Submitted on 3 Jun 2025 (v1), last revised 15 Feb 2026 (this version, v4)] Title:It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics Authors:Matthew Kowal, Jasper Timm, Jean-Francois Godbout, Thomas Costello, Antonio A. Arechar, Gordon Pennycook, David Rand, Adam Gleave, Kellin Pelrine View a PDF of the paper titled It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics, by Matthew Kowal and 8 other authors View PDF HTML (experimental) Abstract:Persuasion is a powerful capability of large language models (LLMs) that both enables beneficial applications (e.g. helping people quit smoking) and raises significant risks (e.g. large-scale, targeted political manipulation). Prior work has found models possess a significant and growing persuasive capability, measured by belief changes in simulated or real users. However, these benchmarks overlook a crucial risk factor: the propensity of a model to attempt to persuade in harmful contexts. Understanding whether a model will blindly ``follow orders'' to persuade on harmful topics (e.g. glorifying joining a terrorist group) is key to understanding the efficacy of safety guardrails. Moreover, understanding if and when a model will engage in persuasive behavior in pursuit of some goal is essential to understanding the risks from agentic AI systems. We propose the Attempt to Pers...

Read Original Article

[2506.02873] It's the Thought that Counts: Evaluating the Attempts of Frontier LLMs to Persuade on Harmful Topics

Summary

Why It Matters

Key Takeaways

Related Articles

Is the Mirage Effect a bug, or is it Geometric Reconstruction in action? A framework for why VLMs perform better "hallucinating" than guessing, and what that may tell us about what's really inside these models

Paper Finds That Leading AI Chatbots Like ChatGPT and Claude Remain Incredibly Sycophantic, Resulting in Twisted Effects on Users

Claude Code leak exposes a Tamagotchi-style ‘pet’ and an always-on agent | The Verge

You can now use ChatGPT with Apple’s CarPlay | The Verge

No comments

Stay updated with AI News