[2506.00530] CityLens: Evaluating Large Vision-Language Models for Urban Socioeconomic Sensing
Computer Science > Artificial Intelligence
arXiv:2506.00530 (cs)
[Submitted on 31 May 2025 (v1), last revised 1 Mar 2026 (this version, v2)]

Title: CityLens: Evaluating Large Vision-Language Models for Urban Socioeconomic Sensing
Authors: Tianhui Liu, Hetian Pang, Xin Zhang, Tianjian Ouyang, Zhiyuan Zhang, Jie Feng, Yong Li, Pan Hui

Abstract: Understanding urban socioeconomic conditions through visual data is a challenging yet essential task for sustainable urban development and policy planning. In this work, we introduce CityLens, a comprehensive benchmark designed to evaluate the capabilities of Large Vision-Language Models (LVLMs) in predicting socioeconomic indicators from satellite and street view imagery. We construct a multi-modal dataset covering 17 globally distributed cities and spanning six key domains: economy, education, crime, transport, health, and environment, reflecting the multifaceted nature of urban life. Based on this dataset, we define 11 prediction tasks and use three evaluation paradigms: Direct Metric Prediction, Normalized Metric Estimation, and Feature-Based Regression. We benchmark 17 state-of-the-art LVLMs across these tasks. Together, these make CityLens the most extensive socioeconomic benchmark to date in terms of geographic coverage, indicator diversity, and model...
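The abstract names three evaluation paradigms but does not detail them on this page. Feature-Based Regression, as the name suggests, typically means freezing the model, using its image embeddings as features, and fitting a lightweight regressor against the ground-truth indicator. The sketch below illustrates that general pattern only; it is not the CityLens protocol. Synthetic vectors stand in for LVLM embeddings, and the data shapes, indicator, and ridge probe are all assumptions.

# Hypothetical sketch of a feature-based regression evaluation:
# frozen LVLM embeddings -> ridge probe -> R^2 against a
# ground-truth socioeconomic indicator. Synthetic data stands in
# for real imagery, embeddings, and labels.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for per-region LVLM embeddings (e.g., one vector per
# satellite or street-view image of a city region).
n_regions, dim = 500, 768
embeddings = rng.normal(size=(n_regions, dim))

# Stand-in for a ground-truth indicator (e.g., a normalized
# economic metric), generated as a weak linear signal plus noise.
true_weights = rng.normal(size=dim)
indicator = embeddings @ true_weights + rng.normal(scale=5.0, size=n_regions)

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, indicator, test_size=0.2, random_state=0
)

# Lightweight probe: the LVLM stays frozen; only the regressor is fit.
probe = Ridge(alpha=1.0).fit(X_train, y_train)
print(f"R^2 = {r2_score(y_test, probe.predict(X_test)):.3f}")

The appeal of this paradigm is that it measures how much socioeconomic signal the embeddings carry, independent of the model's ability to verbalize a numeric prediction, which is what the Direct Metric Prediction and Normalized Metric Estimation paradigms presumably test.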