[2604.06247] SALLIE: Safeguarding Against Latent Language & Image Exploits
Computer Science > Cryptography and Security

arXiv:2604.06247 (cs)

[Submitted on 6 Apr 2026]

Title: SALLIE: Safeguarding Against Latent Language & Image Exploits

Authors: Guy Azov, Ofer Rivlin, Guy Shtar

Abstract: Large Language Models (LLMs) and Vision-Language Models (VLMs) remain highly vulnerable to textual and visual jailbreaks, as well as prompt injections (arXiv:2307.15043; Greshake et al., 2023; arXiv:2306.13213). Existing defenses often degrade performance through complex input transformations, or treat multimodal threats as isolated problems (arXiv:2309.00614; arXiv:2310.03684; Zhang et al., 2025). To address the critical gap for a unified, modality-agnostic defense that mitigates both textual and visual threats simultaneously, without degrading performance or requiring architectural modifications, we introduce SALLIE (Safeguarding Against Latent Language & Image Exploits), a lightweight runtime detection framework rooted in mechanistic interpretability (Lindsey et al., 2025; Ameisen et al., 2025). By integrating seamlessly into standard token-level fusion pipelines (arXiv:2306.13549), SALLIE extracts robust signals directly from the model's internal activations. At inference time, SALLIE defends via a three-stage architecture: (1) extracting internal residual stream activations, (2) calculating layer-wise maliciousness scores...
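
The abstract is cut off after stage (2), but the two stages it does describe already outline a probe-style detector: capture residual-stream activations at each layer, then score each layer for maliciousness. The sketch below is a minimal, hypothetical PyTorch illustration of that pattern; the toy transformer, per-layer linear probes, mean pooling, and max-over-layers aggregation are all illustrative assumptions, not SALLIE's actual design.

```python
# Hypothetical sketch of residual-stream probing in the spirit of SALLIE's
# stages (1)-(2). The toy model, linear probes, and max aggregation are
# assumptions for illustration, not the paper's implementation.
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h)[0]      # residual stream after attention
        return x + self.mlp(self.ln2(x))   # residual stream after MLP

class ToyModel(nn.Module):
    def __init__(self, d=64, n_layers=4):
        super().__init__()
        self.blocks = nn.ModuleList(ToyBlock(d) for _ in range(n_layers))

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x

def detect(model, probes, x, threshold=0.5):
    """Stage 1: hook each block to capture its residual-stream output.
    Stage 2: score each layer's mean-pooled activation with a linear probe.
    Flag the input if any layer's score exceeds the threshold."""
    acts = []
    hooks = [b.register_forward_hook(lambda m, i, o: acts.append(o.detach()))
             for b in model.blocks]
    with torch.no_grad():
        model(x)
    for h in hooks:
        h.remove()
    # Layer-wise maliciousness scores: one linear probe per layer.
    scores = torch.stack([torch.sigmoid(p(a.mean(dim=1))).squeeze(-1)
                          for p, a in zip(probes, acts)])
    return scores.max(dim=0).values > threshold, scores

d, n_layers = 64, 4
model = ToyModel(d, n_layers).eval()
probes = nn.ModuleList(nn.Linear(d, 1) for _ in range(n_layers))  # trained offline
x = torch.randn(2, 16, d)  # batch of 2 sequences of fused token embeddings
flagged, scores = detect(model, probes, x)
print(flagged, scores.shape)  # per-input verdicts and (n_layers, batch) scores
```

In a real deployment the probes would be trained offline on activations from known benign and malicious inputs, and the hooks would attach to the production LLM/VLM rather than a toy model; what aggregation rule and pooling SALLIE actually uses is not recoverable from the truncated abstract.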