[2505.12254] MMS-VPR: Multimodal Street-Level Visual Place Recognition Dataset and Benchmark
Summary
The MMS-VPR paper introduces a comprehensive multimodal dataset for street-level visual place recognition, addressing gaps in existing datasets by including diverse imagery and video from pedestrian environments in Chengdu, China.
Why It Matters
This research is significant as it expands the scope of visual place recognition datasets, which have largely focused on vehicle-mounted imagery. By incorporating pedestrian-centric data, it enhances the potential for developing robust AI models that can operate effectively in diverse urban environments, particularly in non-Western contexts.
Key Takeaways
- MMS-VPR includes 110,529 images and 2,527 video clips covering 208 locations in pedestrian environments.
- The dataset features comprehensive annotations, including GPS coordinates, timestamps, and semantic textual metadata.
- MMS-VPRlib provides a standardized benchmarking platform for VPR research.
- The dataset aims to improve multimodal modeling by integrating visual, video, and textual data.
- This research addresses the underrepresentation of non-Western urban contexts in existing datasets.
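Benchmarks like MMS-VPRlib typically score place recognition with Recall@K: a query counts as correct if any of its K nearest database descriptors comes from the true place. The sketch below is a minimal, generic implementation of that metric using cosine similarity; the function name, toy descriptors, and labels are illustrative assumptions, not MMS-VPRlib's actual API.

```python
import numpy as np

def recall_at_k(query_desc, db_desc, query_labels, db_labels, ks=(1, 5, 10)):
    """Fraction of queries whose top-K nearest database descriptors
    (by cosine similarity) include at least one image of the true place."""
    q = query_desc / np.linalg.norm(query_desc, axis=1, keepdims=True)
    d = db_desc / np.linalg.norm(db_desc, axis=1, keepdims=True)
    sims = q @ d.T                      # (num_queries, num_db) similarity matrix
    order = np.argsort(-sims, axis=1)   # best matches first
    results = {}
    for k in ks:
        topk_labels = db_labels[order[:, :k]]
        hits = (topk_labels == query_labels[:, None]).any(axis=1)
        results[k] = hits.mean()
    return results

# Toy example: 3 places, 10 random database descriptors each;
# queries are near-duplicates of one database image per place.
rng = np.random.default_rng(0)
db = rng.normal(size=(30, 8))
db_labels = np.repeat(np.arange(3), 10)
queries = db[::10] + 0.01 * rng.normal(size=(3, 8))
print(recall_at_k(queries, db, np.arange(3), db_labels))
```

Because each toy query is a near-duplicate of a database image of the same place, Recall@1 is already perfect here; on real VPR data the gap between Recall@1 and Recall@K is what distinguishes methods.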
Subject: Computer Science > Computer Vision and Pattern Recognition, arXiv:2505.12254 (cs)
Submitted on 18 May 2025 (v1), last revised 17 Feb 2026 (this version, v2)
Title: MMS-VPR: Multimodal Street-Level Visual Place Recognition Dataset and Benchmark
Authors: Yiwei Ou, Xiaobin Ren, Ronggui Sun, Guansong Gao, Kaiqi Zhao, Manfredo Manfredini
Abstract: Existing visual place recognition (VPR) datasets predominantly rely on vehicle-mounted imagery, offer limited multimodal diversity, and underrepresent dense pedestrian street scenes, particularly in non-Western urban contexts. We introduce MMS-VPR, a large-scale multimodal dataset for street-level place recognition in pedestrian-only environments. MMS-VPR comprises 110,529 images and 2,527 video clips across 208 locations in a ~70,800 m² open-air commercial district in Chengdu, China. Field data were collected in 2024, while social media data span seven years (2019-2025), providing both fine-grained temporal granularity and long-term temporal coverage. Each location features comprehensive day-night coverage, multiple viewing angles, and multimodal annotations including GPS coordinates, timestamps, and semantic textual metadata. We further release MMS-VPRlib, a unified benchmarking platform that consolidates commonly used VPR datasets and state-of-the-art methods u...