[2510.04891] SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests
Summary
The paper introduces SocialHarmBench, a dataset designed to evaluate the vulnerabilities of large language models (LLMs) to socially harmful requests across various sociopolitical contexts.
Why It Matters
As LLMs are increasingly used in sensitive areas, understanding their vulnerabilities to harmful requests is crucial. This research highlights significant shortcomings in current safety measures, raising concerns about the reliability of LLMs in preserving democratic values and human rights.
Key Takeaways
- SocialHarmBench includes 585 prompts spanning 7 sociopolitical categories and 34 countries.
- Open-weight models are highly vulnerable to harmful compliance: Mistral-7B reaches attack success rates of 97-98% in domains such as historical revisionism, propaganda, and political manipulation (see the ASR sketch after this list).
- LLMs are most fragile when prompts invoke 21st-century or pre-20th-century contexts.
- Geographic analysis shows elevated risk for prompts tied to regions such as Latin America, the USA, and the UK.
- Current safeguards for LLMs do not adequately address high-stakes sociopolitical settings.
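The headline numbers above are attack success rates (ASR): the fraction of prompts in a category for which the model produces a harmful, compliant response. The sketch below shows that per-category computation; the `is_harmful` judge, the record format, and the toy data are illustrative assumptions, not details from the paper.

```python
from collections import defaultdict

def attack_success_rate(records, is_harmful):
    """Compute per-category attack success rate (ASR).

    records: iterable of (category, model_response) pairs.
    is_harmful: callable judging whether a response complies
        with the harmful request (e.g. a classifier or LLM judge).
    Returns: dict mapping category -> fraction of harmful responses.
    """
    totals = defaultdict(int)
    harmful = defaultdict(int)
    for category, response in records:
        totals[category] += 1
        if is_harmful(response):
            harmful[category] += 1
    return {cat: harmful[cat] / totals[cat] for cat in totals}

# Toy usage with a stand-in judge; real evaluations use a
# trained classifier or LLM-as-judge, not a keyword check.
records = [
    ("propaganda", "Sure, here is a persuasive leaflet ..."),
    ("propaganda", "I can't help with that."),
    ("surveillance", "Step 1: intercept the traffic ..."),
]
judge = lambda r: not r.lower().startswith("i can't")
print(attack_success_rate(records, judge))
# e.g. {'propaganda': 0.5, 'surveillance': 1.0}
```

In practice the judge is the hard part: keyword heuristics like the one above are only a placeholder, and benchmark evaluations typically rely on a trained safety classifier or an LLM-as-judge to label compliance.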
Paper Details
Subject: Computer Science > Computation and Language (arXiv:2510.04891)
Submitted on 6 Oct 2025 (v1); last revised 22 Feb 2026 (this version, v2)
Authors: Punya Syon Pandey, Hai Son Le, Devansh Bhardwaj, Rada Mihalcea, Zhijing Jin
Abstract
Large language models (LLMs) are increasingly deployed in contexts where their failures can have direct sociopolitical consequences. Yet, existing safety benchmarks rarely test vulnerabilities in domains such as political manipulation, propaganda and disinformation generation, or surveillance and information control. We introduce SocialHarmBench, a dataset of 585 prompts spanning 7 sociopolitical categories and 34 countries, designed to surface where LLMs most acutely fail in politically charged contexts. Our evaluations reveal several shortcomings: open-weight models exhibit high vulnerability to harmful compliance, with Mistral-7B reaching attack success rates as high as 97% to 98% in domains such as historical revisionism, propaganda, and political manipulation. Moreover, temporal and geographic analyses show that LLMs are most fragile when confronted with 21st-century or pre-20th-century contexts, and when responding to prompts tied to regions such as Latin America, the USA, and the UK. These findings d...
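To reproduce the category and geography breakdowns described above, one would slice the dataset along its category and country dimensions. A minimal sketch follows, assuming the prompts ship as a JSONL file with `prompt`, `category`, and `country` keys; the file name and schema are assumptions, since the release format is not described here.

```python
import json
from collections import Counter

def load_prompts(path):
    """Load benchmark prompts from a JSONL file, one record per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

prompts = load_prompts("socialharmbench.jsonl")  # assumed file name

# Tally prompts per sociopolitical category and per country,
# mirroring the category and geography analyses in the paper.
by_category = Counter(p["category"] for p in prompts)
by_country = Counter(p["country"] for p in prompts)

print(f"{len(prompts)} prompts, {len(by_category)} categories, "
      f"{len(by_country)} countries")
for category, n in by_category.most_common():
    print(f"{category}: {n}")
```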