[2508.20570] Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP
Summary
The paper presents Dyslexify, a training-free defense against typographic attacks in CLIP models that ablates the attention heads responsible for reading injected text, improving robustness without finetuning while preserving standard accuracy.
Why It Matters
As multi-modal systems become increasingly prevalent, understanding and mitigating vulnerabilities like typographic attacks is crucial for ensuring the reliability and safety of AI applications. Dyslexify offers a promising solution that can be readily implemented in safety-critical environments.
Key Takeaways
- Dyslexify effectively defends CLIP models against typographic attacks by targeting specific attention heads.
- The method improves accuracy on a typographic variant of ImageNet-100 by up to 22.06% without requiring model finetuning.
- Dyslexify reduces standard ImageNet-100 accuracy by less than 1% while substantially enhancing robustness against text manipulation.
- The approach is competitive with existing state-of-the-art defenses, making it a viable option for various applications.
- The release of dyslexic CLIP models provides practical tools for developers working in safety-critical AI domains.
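The core mechanism, selectively zeroing out the contribution of specific attention heads, can be illustrated with a toy multi-head self-attention layer. This is a minimal NumPy sketch, not the paper's actual implementation: the function names, weight shapes, and the choice of zero-ablation (rather than, say, mean-ablation) are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads, ablate_heads=()):
    """Toy multi-head self-attention over a (tokens, dim) input.

    Per-head context vectors listed in `ablate_heads` are zeroed before
    the output projection, removing those heads' contribution entirely
    (hypothetical zero-ablation; the set of heads to ablate would come
    from a causal analysis of the model)."""
    T, d = x.shape
    dh = d // n_heads
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # split into heads: (n_heads, T, dh)
    q = q.reshape(T, n_heads, dh).transpose(1, 0, 2)
    k = k.reshape(T, n_heads, dh).transpose(1, 0, 2)
    v = v.reshape(T, n_heads, dh).transpose(1, 0, 2)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh), axis=-1)
    ctx = attn @ v  # (n_heads, T, dh)
    for h in ablate_heads:
        ctx[h] = 0.0  # ablate this head's output
    return ctx.transpose(1, 0, 2).reshape(T, d) @ Wo

# Usage with random weights: compare full vs. ablated forward passes.
rng = np.random.default_rng(0)
T, d, H = 4, 8, 2
x = rng.normal(size=(T, d))
Wq, Wk, Wv, Wo = [rng.normal(size=(d, d)) for _ in range(4)]
out_full = multi_head_attention(x, Wq, Wk, Wv, Wo, H)
out_ablated = multi_head_attention(x, Wq, Wk, Wv, Wo, H, ablate_heads=(0,))
```

In a real ViT-based CLIP encoder the same effect is typically achieved with forward hooks on the relevant attention modules, leaving all weights untouched, which is what makes this style of defense training-free.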
Computer Science > Computer Vision and Pattern Recognition
arXiv:2508.20570 (cs)
[Submitted on 28 Aug 2025 (v1), last revised 26 Feb 2026 (this version, v2)]
Title: Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP
Authors: Lorenz Hufe, Constantin Venhoff, Erblina Purelku, Maximilian Dreyer, Sebastian Lapuschkin, Wojciech Samek
Abstract: Typographic attacks exploit multi-modal systems by injecting text into images, leading to targeted misclassifications, malicious content generation, and even Vision-Language Model jailbreaks. In this work, we analyze how CLIP vision encoders behave under typographic attacks, locating specialized attention heads in the latter half of the model's layers that causally extract and transmit typographic information to the cls token. Building on these insights, we introduce Dyslexify, a method to defend CLIP models against typographic attacks by selectively ablating a typographic circuit consisting of attention heads. Without requiring finetuning, Dyslexify improves performance by up to 22.06% on a typographic variant of ImageNet-100 while reducing standard ImageNet-100 accuracy by less than 1%, and we demonstrate its utility in a medical foundation model for skin lesion diagnosis. Notably, our training-free approach remains competitive with current state-of-the-art typ...
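The abstract's first step, locating the heads that causally carry typographic information, can be sketched as a per-head ablation scan: ablate one head at a time and rank heads by how much the output moves on text-bearing images versus clean ones. This is a hypothetical sketch, not the paper's procedure; `forward`, the scoring rule, and the toy model in the usage example are all assumptions.

```python
import numpy as np

def head_effect_scan(forward, n_heads, x_typo, x_clean):
    """Rank heads by how much ablating each changes the output on a
    text-bearing input relative to a clean input.

    `forward(x, ablate)` is any callable returning an output embedding
    with the heads in `ablate` removed. Heads whose ablation moves the
    typographic output a lot while barely touching the clean output are
    candidates for the typographic circuit."""
    base_typo = forward(x_typo, ())
    base_clean = forward(x_clean, ())
    scores = []
    for h in range(n_heads):
        d_typo = np.linalg.norm(forward(x_typo, (h,)) - base_typo)
        d_clean = np.linalg.norm(forward(x_clean, (h,)) - base_clean)
        scores.append(d_typo - d_clean)  # high: matters mainly for text
    return np.argsort(scores)[::-1]  # heads, most text-specific first

# Usage on a toy two-head "model": head 1 only reads the last input
# feature, which stands in for a typographic signal.
W = np.zeros((2, 3, 3))
W[0] = np.eye(3)            # head 0: generic visual features
W[1] = np.diag([0., 0., 5.])  # head 1: responds only to the text feature

def toy_forward(x, ablate):
    return sum(W[h] @ x for h in range(2) if h not in ablate)

x_typo = np.array([1., 1., 1.])   # text feature present
x_clean = np.array([1., 1., 0.])  # text feature absent
ranking = head_effect_scan(toy_forward, 2, x_typo, x_clean)
# head 1 ranks first: ablating it only affects the text-bearing input
```

In practice such a scan would be run over batches of clean and typographically attacked images, and the top-ranked heads form the circuit that the defense then ablates permanently.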