[2506.07452] When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment
Summary
This paper explores the vulnerabilities of large language models (LLMs) to superficial style alignment, proposing a defense mechanism called SafeStyle to enhance LLM safety against malicious queries.
Why It Matters
Understanding how style patterns can compromise the safety of LLMs is crucial as these models are increasingly deployed in sensitive applications. The findings highlight the need for improved alignment strategies to mitigate risks, ensuring safer AI interactions.
Key Takeaways
- Superficial style alignment can significantly increase LLM vulnerability to jailbreaks.
- ASR (attack success rate) inflation correlates with an LLM's relative attention to style patterns, which also overlap more with its instruction-tuning data when inflation occurs.
- The proposed SafeStyle defense strategy effectively enhances LLM safety across various fine-tuning scenarios.
Computer Science > Machine Learning
arXiv:2506.07452 (cs)
[Submitted on 9 Jun 2025 (v1), last revised 24 Feb 2026 (this version, v3)]
Authors: Yuxin Xiao, Sana Tonekaboni, Walter Gerych, Vinith Suriyakumar, Marzyeh Ghassemi
Abstract: Large language models (LLMs) can be prompted with specific styles (e.g., formatting responses as lists), including in malicious queries. Prior jailbreak research mainly augments these queries with additional string transformations to maximize attack success rate (ASR). However, the impact of style patterns in the original queries that are semantically irrelevant to the malicious intent remains unclear. In this work, we seek to understand whether style patterns compromise LLM safety, how superficial style alignment increases model vulnerability, and how best to mitigate these risks during alignment. We first define ASR inflation as the increase in ASR due to style patterns in existing jailbreak benchmark queries. By evaluating 36 LLMs across seven benchmarks, we find that nearly all models exhibit ASR inflation. Notably, the inflation correlates with an LLM's relative attention to style patterns, which also overlap more with its instruction-tuning data when inflation occurs. We then investigate s...
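The abstract defines ASR inflation as the increase in attack success rate when style patterns are present in benchmark queries. A minimal sketch of that metric, assuming a hypothetical `is_harmful` judge and toy response lists (not the paper's actual evaluation pipeline):

```python
def attack_success_rate(responses, is_harmful):
    """Fraction of model responses judged harmful, i.e., attacks that succeeded."""
    return sum(is_harmful(r) for r in responses) / len(responses)

def asr_inflation(base_responses, styled_responses, is_harmful):
    """ASR inflation: increase in ASR when semantically irrelevant style
    patterns (e.g., list formatting) are added to the same malicious queries."""
    return (attack_success_rate(styled_responses, is_harmful)
            - attack_success_rate(base_responses, is_harmful))

# Toy illustration with a trivial substring judge (hypothetical data):
judge = lambda r: "refuse" not in r
base = ["I refuse", "I refuse", "I refuse", "Here is how..."]
styled = ["1. First...", "I refuse", "Here is how...", "Here is how..."]
print(asr_inflation(base, styled, judge))  # 0.75 - 0.25 = 0.5
```

In practice the judge would be a safety classifier and the response sets would come from paired benchmark queries with and without style patterns; only the difference in judged ASR defines the inflation.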