[2410.10700] LLMs know their vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts
Computer Science > Computation and Language
arXiv:2410.10700 (cs)
[Submitted on 14 Oct 2024 (v1), last revised 26 Mar 2026 (this version, v3)]

Title: LLMs know their vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts
Authors: Qibing Ren, Hao Li, Dongrui Liu, Zhanxu Xie, Xiaoya Lu, Yu Qiao, Lei Sha, Junchi Yan, Lizhuang Ma, Jing Shao

Abstract: Safety concerns in large language models (LLMs) have gained significant attention due to their exposure to potentially harmful data during pre-training. In this paper, we identify a new safety vulnerability in LLMs: their susceptibility to \textit{natural distribution shifts} between attack prompts and original toxic prompts, where seemingly benign prompts, semantically related to harmful content, can bypass safety mechanisms. To explore this issue, we introduce a novel attack method, \textit{ActorBreaker}, which identifies actors related to toxic prompts within the pre-training distribution to craft multi-turn prompts that gradually lead LLMs to reveal unsafe content. ActorBreaker is grounded in Latour's actor-network theory, encompassing both human and non-human actors to capture a broader range of vulnerabilities. Our experimental results demonstrate that ActorBreaker outperforms existing attack methods in terms of diversity, effectiveness...
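The abstract describes ActorBreaker only at a high level: find actors associated with a toxic prompt, then build a multi-turn conversation that drifts from those actors toward the original intent. The sketch below is a guess at the general shape of such a loop, not the paper's actual method; the `chat` callable, the prompt templates, and the helper names are hypothetical assumptions, and a stubbed chat function is used so the example runs on its own with no real model and no harmful content.

```python
# Minimal sketch of an actor-centred multi-turn attack loop, as described in the
# abstract. NOTE: this is NOT the paper's ActorBreaker implementation; `chat` is
# a hypothetical stand-in for any chat-LLM API, and the prompts are illustrative.
from typing import Callable, Dict, List

Message = Dict[str, str]                    # {"role": ..., "content": ...}
ChatFn = Callable[[List[Message]], str]     # hypothetical chat-LLM interface


def find_related_actors(chat: ChatFn, topic: str, n: int = 3) -> List[str]:
    """Ask an attacker-side model for human/non-human actors (people,
    institutions, tools, documents) naturally associated with the topic."""
    reply = chat([{
        "role": "user",
        "content": (f"List {n} actors (people, organizations, or objects) "
                    f"closely associated with the following topic, one per line:\n{topic}"),
    }])
    return [line.strip("-* ").strip() for line in reply.splitlines() if line.strip()][:n]


def actor_based_multi_turn_attack(chat: ChatFn, topic: str) -> List[Message]:
    """Build a multi-turn conversation whose individual turns look benign
    (each centred on one actor) but gradually approach the original topic."""
    actors = find_related_actors(chat, topic)
    history: List[Message] = []
    turns = [f"Tell me about {actor} and their role in this area." for actor in actors]
    turns.append(f"Given everything above, explain in more detail: {topic}")
    for turn in turns:
        history.append({"role": "user", "content": turn})
        history.append({"role": "assistant", "content": chat(history)})
    return history


if __name__ == "__main__":
    # Stub chat function so the sketch runs without any real model.
    def fake_chat(messages: List[Message]) -> str:
        return "stub response"

    transcript = actor_based_multi_turn_attack(fake_chat, "a placeholder research topic")
    print(len(transcript), "messages in transcript")
```

The point of the structure is that each turn, taken alone, sits inside the benign part of the distribution (a question about an actor), even though the conversation as a whole remains semantically tied to the original toxic prompt.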