[2604.06247] SALLIE: Safeguarding Against Latent Language & Image Exploits
Computer Science > Cryptography and Security

arXiv:2604.06247 (cs)

[Submitted on 6 Apr 2026]

Title: SALLIE: Safeguarding Against Latent Language & Image Exploits

Authors: Guy Azov, Ofer Rivlin, Guy Shtar

Abstract: Large Language Models (LLMs) and Vision-Language Models (VLMs) remain highly vulnerable to textual and visual jailbreaks, as well as prompt injections (arXiv:2307.15043; Greshake et al., 2023; arXiv:2306.13213). Existing defenses often degrade performance through complex input transformations, or treat multimodal threats as isolated problems (arXiv:2309.00614; arXiv:2310.03684; Zhang et al., 2025). To address the critical gap for a unified, modality-agnostic defense that mitigates both textual and visual threats simultaneously, without degrading performance or requiring architectural modifications, we introduce SALLIE (Safeguarding Against Latent Language & Image Exploits), a lightweight runtime detection framework rooted in mechanistic interpretability (Lindsey et al., 2025; Ameisen et al., 2025). By integrating seamlessly into standard token-level fusion pipelines (arXiv:2306.13549), SALLIE extracts robust signals directly from the model's internal activations. At inference time, SALLIE defends via a three-stage architecture: (1) extracting internal residual stream activations, (2) calculating layer-wise maliciousness scores...
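
The abstract is cut off after stage (2), but the two stages it does describe already outline a probe-style detector: capture residual-stream activations at each layer, then score each layer for maliciousness. The sketch below is a minimal, hypothetical PyTorch illustration of that pattern; the toy transformer, per-layer linear probes, mean pooling, and max-over-layers aggregation are all illustrative assumptions, not SALLIE's actual design.

```python
# Hypothetical sketch of residual-stream probing in the spirit of SALLIE's
# stages (1)-(2). The toy model, linear probes, and max aggregation are
# assumptions for illustration, not the paper's implementation.
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h)[0]      # residual stream after attention
        return x + self.mlp(self.ln2(x))   # residual stream after MLP

class ToyModel(nn.Module):
    def __init__(self, d=64, n_layers=4):
        super().__init__()
        self.blocks = nn.ModuleList(ToyBlock(d) for _ in range(n_layers))

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x

def detect(model, probes, x, threshold=0.5):
    """Stage 1: hook each block to capture its residual-stream output.
    Stage 2: score each layer's mean-pooled activation with a linear probe.
    Flag the input if any layer's score exceeds the threshold."""
    acts = []
    hooks = [b.register_forward_hook(lambda m, i, o: acts.append(o.detach()))
             for b in model.blocks]
    with torch.no_grad():
        model(x)
    for h in hooks:
        h.remove()
    # Layer-wise maliciousness scores: one linear probe per layer.
    scores = torch.stack([torch.sigmoid(p(a.mean(dim=1))).squeeze(-1)
                          for p, a in zip(probes, acts)])
    return scores.max(dim=0).values > threshold, scores

d, n_layers = 64, 4
model = ToyModel(d, n_layers).eval()
probes = nn.ModuleList(nn.Linear(d, 1) for _ in range(n_layers))  # trained offline
x = torch.randn(2, 16, d)  # batch of 2 sequences of fused token embeddings
flagged, scores = detect(model, probes, x)
print(flagged, scores.shape)  # per-input verdicts and (n_layers, batch) scores
```

In a real deployment the probes would be trained offline on activations from known benign and malicious inputs, and the hooks would attach to the production LLM/VLM rather than a toy model; what aggregation rule and pooling SALLIE actually uses is not recoverable from the truncated abstract.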