[2602.22683] SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses
Summary
The paper introduces SUPERGLASSES, a benchmark for evaluating Vision Language Models (VLMs) on AI smart glasses. It addresses the limitations of traditional multimodal datasets and proposes SUPERLENS, a new multimodal agent designed for this setting.
Why It Matters
As AI smart glasses gain popularity, understanding their interaction capabilities through effective benchmarks is crucial. This research highlights the need for task-specific solutions in Visual Question Answering (VQA) scenarios, ensuring better performance and user experience in real-world applications.
Key Takeaways
- SUPERGLASSES is the first benchmark for VLMs tailored for smart glasses, using real-world data.
- The benchmark includes 2,422 egocentric image-question pairs spanning 14 image domains and 8 query categories, enhancing realism in VQA tasks.
- SUPERLENS, a new multimodal agent, outperforms existing models by integrating advanced object detection and web search.
- The study reveals significant performance gaps in current VLMs, emphasizing the need for specialized solutions.
- This research sets a foundation for future advancements in AI applications for wearable technology.
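The takeaways above describe SUPERLENS as a pipeline that first identifies the object of interest and only then retrieves external knowledge. The paper does not publish its implementation here, so the following is a minimal, hypothetical sketch of such a detect-then-search agent loop; `detect_object` and `web_search` are placeholder stubs standing in for a real vision model and a real search API.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Result of the object-identification stage."""
    label: str
    confidence: float


def detect_object(image_path: str) -> Detection:
    # Placeholder: a real agent would run an object detector / VLM here.
    return Detection(label="Eiffel Tower", confidence=0.92)


def web_search(query: str) -> str:
    # Placeholder: a real agent would call a web search API here.
    return f"search results for: {query}"


def answer_query(image_path: str, question: str, min_conf: float = 0.5) -> str:
    """Detect the object of interest first, then ground the answer in retrieval."""
    det = detect_object(image_path)
    if det.confidence < min_conf:
        # Per the paper's framing, retrieval is pointless if identification fails.
        return "could not identify the object of interest"
    evidence = web_search(f"{det.label} {question}")
    return f"[{det.label}] {evidence}"
```

The key design point, as the benchmark emphasizes, is the ordering: external knowledge retrieval is gated on a confident identification of the object the wearer is asking about.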
Computer Science > Computer Vision and Pattern Recognition
arXiv:2602.22683 (cs)
[Submitted on 26 Feb 2026]
Title: SUPERGLASSES: Benchmarking Vision Language Models as Intelligent Agents for AI Smart Glasses
Authors: Zhuohang Jiang, Xu Yuan, Haohao Qu, Shanru Lin, Kanglong Liu, Wenqi Fan, Qing Li
Abstract: The rapid advancement of AI-powered smart glasses, one of the hottest wearable devices, has unlocked new frontiers for multimodal interaction, with Visual Question Answering (VQA) over external knowledge sources emerging as a core application. Existing Vision Language Models (VLMs) adapted to smart glasses are typically trained and evaluated on traditional multimodal datasets; however, these datasets lack the variety and realism needed to reflect smart glasses usage scenarios and diverge from their specific challenges, where accurately identifying the object of interest must precede any external knowledge retrieval. To bridge this gap, we introduce SUPERGLASSES, the first comprehensive VQA benchmark built on real-world data entirely collected by smart glasses devices. SUPERGLASSES comprises 2,422 egocentric image-question pairs spanning 14 image domains and 8 query categories, enriched with full search trajectories and reasoning annotations. We evaluate 26 representative VLMs on this b...