[2603.25403] Shape and Substance: Dual-Layer Side-Channel Attacks on Local Vision-Language Models
Computer Science > Cryptography and Security

arXiv:2603.25403 (cs) [Submitted on 26 Mar 2026]

Title: Shape and Substance: Dual-Layer Side-Channel Attacks on Local Vision-Language Models

Authors: Eyal Hadad, Mordechai Guri

Abstract: On-device Vision-Language Models (VLMs) promise data privacy via local execution. However, we show that the architectural shift toward Dynamic High-Resolution preprocessing (e.g., AnyRes) introduces an inherent algorithmic side-channel. Unlike static models, dynamic preprocessing decomposes images into a variable number of patches based on their aspect ratio, creating workload-dependent inputs. We demonstrate a dual-layer attack framework against local VLMs. In Tier 1, an unprivileged attacker exploits significant execution-time variations, observed through standard unprivileged OS metrics, to reliably fingerprint the input's geometry. In Tier 2, by profiling Last-Level Cache (LLC) contention, the attacker resolves semantic ambiguity within identical geometries, distinguishing visually dense content (e.g., medical X-rays) from sparse content (e.g., text documents). Evaluating state-of-the-art models such as LLaVA-NeXT and Qwen2-VL, we show that combining these signals enables reliable inference of privacy-sensitive contexts. Finally, we analyze the security engineering trade-o...
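The leakage mechanism the abstract describes can be illustrated with a minimal sketch of AnyRes-style dynamic preprocessing. The grid candidates, function names, and base tile size below are hypothetical simplifications, not the models' actual preprocessing code; the point is only that the number of vision-encoder tiles (and hence the observable workload) is a deterministic function of the input's aspect ratio.

```python
# Hypothetical sketch: dynamic high-resolution preprocessing picks a tiling
# grid from the image's aspect ratio, so tile count varies with geometry.
# Grid candidates and the +1 global thumbnail are illustrative assumptions.

def select_grid(width, height,
                grids=((1, 1), (1, 2), (2, 1), (2, 2), (1, 3), (3, 1))):
    """Pick the candidate grid whose aspect ratio best matches the image."""
    target = width / height
    return min(grids, key=lambda g: abs(g[0] / g[1] - target))

def patch_count(width, height):
    """Total encoder tiles: one per grid cell plus one global thumbnail."""
    cols, rows = select_grid(width, height)
    return cols * rows + 1

# Different geometries produce different workloads an attacker can time:
wide = patch_count(1008, 336)    # grid (3, 1) -> 4 tiles
square = patch_count(672, 672)   # grid (1, 1) -> 2 tiles
```

Because the tile count feeds directly into vision-encoder execution time, a co-resident process that merely times the inference run can distinguish `wide` from `square` inputs, which is the Tier 1 geometry fingerprint; Tier 2 then separates inputs that fall into the same grid.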