[2505.14226] Phonetic Perturbations Reveal Tokenizer-Rooted Safety Gaps in LLMs
Computer Science > Computation and Language
arXiv:2505.14226 (cs)
[Submitted on 20 May 2025 (v1), last revised 7 Apr 2026 (this version, v5)]
Title: Phonetic Perturbations Reveal Tokenizer-Rooted Safety Gaps in LLMs
Authors: Darpan Aswal, Siddharth D Jaiswal
Abstract: Safety-aligned LLMs remain vulnerable to digital phenomena like textese that introduce non-canonical perturbations to words but preserve the phonetics. We introduce CMP-RT (code-mixed phonetic perturbations for red-teaming), a novel diagnostic probe that pinpoints tokenization as the root cause of this vulnerability. A mechanistic analysis reveals that phonetic perturbations fragment safety-critical tokens into benign sub-words, suppressing their attribution scores while preserving prompt interpretability -- causing safety mechanisms to fail despite excellent input understanding. We demonstrate that this vulnerability evades standard defenses, persists across modalities and state-of-the-art (SOTA) models including Gemini-3-Pro, and scales through simple supervised fine-tuning (SFT). Furthermore, layer-wise probing shows perturbed and canonical input representations align up to a critical layer depth; enforcing output equivalence robustly recovers the lost representations, providing causal evidence for a structural gap between pre-training and alig...
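As a rough illustration of the fragmentation effect the abstract describes, the sketch below runs a toy greedy longest-match tokenizer over a canonical word and a phonetically similar respelling. The vocabulary, the flagged-token set, the example spelling, and all function names are hypothetical and are not taken from the paper. Production tokenizers (BPE, unigram) behave differently in detail, and CMP-RT itself is not reproduced here.

```python
# Toy sketch only: hypothetical vocabulary and flagged set, not the paper's setup.
TOY_VOCAB = {"attack", "at", "ak", "a", "t", "k"}
FLAGGED_TOKENS = {"attack"}  # stands in for a "safety-critical" token


def greedy_tokenize(word, vocab):
    """Split a word into sub-words by greedy longest-match against vocab."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate piece starting at position i first.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No vocabulary entry matched: fall back to a single character.
            tokens.append(word[i])
            i += 1
    return tokens


def flagged_in(word):
    """Return the flagged tokens that survive tokenization of word."""
    return [t for t in greedy_tokenize(word, TOY_VOCAB) if t in FLAGGED_TOKENS]


canonical_tokens = greedy_tokenize("attack", TOY_VOCAB)
perturbed_tokens = greedy_tokenize("atak", TOY_VOCAB)

print(canonical_tokens, flagged_in("attack"))
print(perturbed_tokens, flagged_in("atak"))
```

In this toy setting the canonical spelling maps to one flagged token, while the respelling splits into two unflagged pieces. Any check keyed on the flagged token therefore sees nothing, even though a reader would recover the same word from the sound.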