[2510.04891] SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests
Summary
The paper introduces SocialHarmBench, a dataset designed to evaluate the vulnerabilities of large language models (LLMs) to socially harmful requests across various sociopolitical contexts.
Why It Matters
As LLMs are increasingly used in sensitive areas, understanding their vulnerabilities to harmful requests is crucial. This research highlights significant shortcomings in current safety measures, raising concerns about the reliability of LLMs in preserving democratic values and human rights.
Key Takeaways
- SocialHarmBench includes 585 prompts spanning 7 sociopolitical categories and 34 countries.
- Open-weight models are highly vulnerable to harmful compliance: Mistral-7B reaches attack success rates of 97-98% in domains such as historical revisionism, propaganda, and political manipulation (see the ASR sketch after this list).
- LLMs are most fragile when prompts invoke 21st-century or pre-20th-century contexts.
- Geographic analysis shows elevated risk for prompts tied to regions such as Latin America, the USA, and the UK.
- Current safeguards for LLMs do not adequately address high-stakes sociopolitical settings.
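The headline numbers above are attack success rates (ASR): the fraction of prompts in a category for which the model produces a harmful, compliant response. The sketch below shows that per-category computation; the `is_harmful` judge, the record format, and the toy data are illustrative assumptions, not details from the paper.

```python
from collections import defaultdict

def attack_success_rate(records, is_harmful):
    """Compute per-category attack success rate (ASR).

    records: iterable of (category, model_response) pairs.
    is_harmful: callable judging whether a response complies
        with the harmful request (e.g. a classifier or LLM judge).
    Returns: dict mapping category -> fraction of harmful responses.
    """
    totals = defaultdict(int)
    harmful = defaultdict(int)
    for category, response in records:
        totals[category] += 1
        if is_harmful(response):
            harmful[category] += 1
    return {cat: harmful[cat] / totals[cat] for cat in totals}

# Toy usage with a stand-in judge; real evaluations use a
# trained classifier or LLM-as-judge, not a keyword check.
records = [
    ("propaganda", "Sure, here is a persuasive leaflet ..."),
    ("propaganda", "I can't help with that."),
    ("surveillance", "Step 1: intercept the traffic ..."),
]
judge = lambda r: not r.lower().startswith("i can't")
print(attack_success_rate(records, judge))
# e.g. {'propaganda': 0.5, 'surveillance': 1.0}
```

In practice the judge is the hard part: keyword heuristics like the one above are only a placeholder, and benchmark evaluations typically rely on a trained safety classifier or an LLM-as-judge to label compliance.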
Paper Details
Subject: Computer Science > Computation and Language (arXiv:2510.04891)
Submitted on 6 Oct 2025 (v1); last revised 22 Feb 2026 (this version, v2)
Authors: Punya Syon Pandey, Hai Son Le, Devansh Bhardwaj, Rada Mihalcea, Zhijing Jin
Abstract
Large language models (LLMs) are increasingly deployed in contexts where their failures can have direct sociopolitical consequences. Yet, existing safety benchmarks rarely test vulnerabilities in domains such as political manipulation, propaganda and disinformation generation, or surveillance and information control. We introduce SocialHarmBench, a dataset of 585 prompts spanning 7 sociopolitical categories and 34 countries, designed to surface where LLMs most acutely fail in politically charged contexts. Our evaluations reveal several shortcomings: open-weight models exhibit high vulnerability to harmful compliance, with Mistral-7B reaching attack success rates as high as 97% to 98% in domains such as historical revisionism, propaganda, and political manipulation. Moreover, temporal and geographic analyses show that LLMs are most fragile when confronted with 21st-century or pre-20th-century contexts, and when responding to prompts tied to regions such as Latin America, the USA, and the UK. These findings d...
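To reproduce the category and geography breakdowns described above, one would slice the dataset along its category and country dimensions. A minimal sketch follows, assuming the prompts ship as a JSONL file with `prompt`, `category`, and `country` keys; the file name and schema are assumptions, since the release format is not described here.

```python
import json
from collections import Counter

def load_prompts(path):
    """Load benchmark prompts from a JSONL file, one record per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

prompts = load_prompts("socialharmbench.jsonl")  # assumed file name

# Tally prompts per sociopolitical category and per country,
# mirroring the category and geography analyses in the paper.
by_category = Counter(p["category"] for p in prompts)
by_country = Counter(p["country"] for p in prompts)

print(f"{len(prompts)} prompts, {len(by_category)} categories, "
      f"{len(by_country)} countries")
for category, n in by_category.most_common():
    print(f"{category}: {n}")
```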