[2510.02356] Measuring Physical-World Privacy Awareness of Large Language Models: An Evaluation Benchmark
Summary
This paper presents EAPrivacy, a benchmark for evaluating the physical-world privacy awareness of large language models (LLMs). Its measurements reveal significant shortcomings in how current models handle privacy in dynamic physical environments.
Why It Matters
As LLMs are increasingly integrated into real-world applications, understanding their privacy awareness is crucial. This research highlights the limitations of existing models in balancing task execution with privacy considerations, emphasizing the need for improved alignment in AI systems.
Key Takeaways
- EAPrivacy benchmark assesses LLMs' physical-world privacy awareness.
- Even the top-performing model, Gemini 2.5 Pro, achieves only 59% accuracy in scenarios with changing physical environments.
- Models often prioritize task completion over privacy, disregarding an explicit privacy request in up to 86% of cases.
- Significant misalignment exists between LLMs and social norms regarding privacy.
- The study calls for enhanced alignment strategies for LLMs in real-world applications.
Computer Science > Cryptography and Security
arXiv:2510.02356 (cs)
[Submitted on 27 Sep 2025 (v1), last revised 15 Feb 2026 (this version, v3)]
Authors: Xinjie Shen, Mufei Li, Pan Li
Abstract: The deployment of Large Language Models (LLMs) in embodied agents creates an urgent need to measure their privacy awareness in the physical world. Existing evaluation methods, however, are confined to natural language based scenarios. To bridge this gap, we introduce EAPrivacy, a comprehensive evaluation benchmark designed to quantify the physical-world privacy awareness of LLM-powered agents. EAPrivacy utilizes procedurally generated scenarios across four tiers to test an agent's ability to handle sensitive objects, adapt to changing environments, balance task execution with privacy constraints, and resolve conflicts with social norms. Our measurements reveal a critical deficit in current models. The top-performing model, Gemini 2.5 Pro, achieved only 59% accuracy in scenarios involving changing physical environments. Furthermore, when a task was accompanied by a privacy request, models prioritized completion over the constraint in up to 86% of cases. In high-stakes situations pitting privacy against...
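The abstract describes scoring models on procedurally generated scenarios grouped into four tiers. The paper's actual harness is not shown here, but a minimal sketch of per-tier accuracy aggregation, with hypothetical `ScenarioResult` and `tier_accuracy` names, could look like this:

```python
from dataclasses import dataclass

@dataclass
class ScenarioResult:
    tier: int     # 1-4: sensitive objects, changing environments,
                  # privacy-constrained tasks, social-norm conflicts
    passed: bool  # did the model's response respect the privacy constraint?

def tier_accuracy(results: list[ScenarioResult]) -> dict[int, float]:
    """Aggregate the pass rate for each benchmark tier."""
    totals: dict[int, int] = {}
    passes: dict[int, int] = {}
    for r in results:
        totals[r.tier] = totals.get(r.tier, 0) + 1
        passes[r.tier] = passes.get(r.tier, 0) + int(r.passed)
    return {t: passes[t] / totals[t] for t in totals}

# Example: two Tier-2 (changing environment) runs, one Tier-3 run.
demo = [
    ScenarioResult(tier=2, passed=True),
    ScenarioResult(tier=2, passed=False),
    ScenarioResult(tier=3, passed=True),
]
print(tier_accuracy(demo))  # {2: 0.5, 3: 1.0}
```

Headline numbers such as the 59% figure for changing-environment scenarios would correspond to one tier's entry in such a per-tier accuracy map.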