Abstract
A high-quality land-use dataset is crucial for constructing a high-performance land-use classification model. Due to the complexity and spatial heterogeneity of land use, the dataset construction process is inefficient and costly. This challenge affects the quality of datasets, consequently impacting the model’s performance. The emerging field of Data-Centric Artificial Intelligence (DCAI) is expected to deliver techniques for dataset optimization, offering a promising solution to the problem. Therefore, this study proposes a data-centric framework named DCAI-CLUD for the construction of land-use datasets. Based on this framework, the accuracy and rate of data labeling are improved by 5.93% and 28.97%, respectively. The Gini index of the dataset and the proportion of samples with non-mixed land-use categories are enhanced by 3.27% and 8.52%, respectively. The overall accuracy (OA) and Kappa of the land-use classification model are improved significantly, by 27.87% and 58.08%, respectively. This study is the first to introduce DCAI into the field of geographic information and remote sensing and verify its effectiveness. The proposed framework can effectively improve the construction efficiency and quality of the dataset and synchronously optimize the model performance. Based on the proposed framework, we constructed a multi-source land-use dataset of major cities in China named CN-MSLU-100K.
HIGHLIGHTS
A framework for optimizing the land-use dataset construction process is proposed.
Filtering and pre-labeling improved the quality and efficiency of data labeling.
The performance of the land-use classification model is enhanced by dataset optimization.
Preconceived results have a subjective impact on the data labelers.
This is the first study to introduce DCAI for land-use classification.
Acknowledgement
We are deeply grateful to Professor Yuan May, Dr. Andreas Züfle, and the anonymous reviewers for their constructive comments and suggestions on our paper. We also extend our sincere thanks to the young volunteers who contributed significantly to this human-computer collaboration project.
Disclosure statement
No conflict of interest exists in the submission of this manuscript, and the manuscript is approved by all authors for publication. I would like to declare on behalf of my co-authors that the work described is original research that has not been published previously and is not under consideration for publication elsewhere, in whole or in part.
Data and codes availability statement
The CN-MSLU-100K dataset cannot be shared publicly due to copyright reasons. However, readers can access the dataset upon request. The CN-MSLU-DEMO dataset is publicly available at http://doi.org/10.6084/m9.figshare.24942510. We have provided a website for full data application and download at https://urbancomp.net/s/cn-mslu-100k-land-use-classification-dataset-at-block-scale-for-multi-source-spatio-temporal-dataen. The ‘human-computer collaborative’ data annotation method mentioned in Section 2.2 runs on a data annotation platform of Alibaba, so the code of the platform cannot be disclosed due to copyright issues. The rest of the code and sample data used to reproduce our work are publicly available at http://doi.org/10.6084/m9.figshare.24942510.
Additional information
Funding
Notes on contributors
Hao Wu
Hao Wu has obtained his master’s degree from China University of Geosciences (Wuhan). He is currently working at the State Grid Corporation of China. His research interests are geospatial big data mining and data-centric urban modeling. He contributed to the methodology, software development, writing – original draft, visualization, writing – review and editing.
Zhangwei Jiang
Zhangwei Jiang is a staff algorithm engineer at Alibaba Group. His research interests are LBS data mining and research & recommendation algorithm. He contributed to the project administration, conceptualization, data curation, investigation, methodology, writing – original draft, writing – review and editing.
Anning Dong
Anning Dong has obtained his master’s degree from China University of Geosciences (Wuhan). He is currently working at the State Administration of Foreign Exchange in China. His research interests are spatiotemporal big data mining and crime geography. He contributed to the methodology, data curation, software development, validation, writing – original draft, writing – review and editing.
Ronghui Gao
Ronghui Gao is a graduate student at China University of Geosciences (Wuhan). His research interests are geospatial big data mining and the interpretability of urban models. He contributed to the methodology, validation, writing – original draft, writing – review and editing.
Xiaoqin Yan
Xiaoqin Yan is currently a Ph.D. student in GIScience at the Institute of Remote Sensing and Geographical Information Systems, Peking University, Beijing. His research interests are spatio-temporal big data computing and social sensing. He contributed to the methodology, data curation, validation, writing – original draft, writing – review and editing.
Zhihui Hu
Zhihui Hu is a graduate student at China University of Geosciences (Wuhan). His research interests are geospatial big data mining, land use classification and trajectory representation learning. He contributed to the methodology, validation, writing – original draft, writing – review and editing.
Fengling Mao
Fengling Mao is an algorithm engineer at Alibaba Group. Her research interests are trajectory pattern mining and spatiotemporal data embedding. She contributed to the methodology, data curation, validation, software development, writing – review and editing.
Hong Liu
Hong Liu is a senior staff algorithm engineer at Alibaba Group. His research interests are data mining and research & recommendation algorithm. He contributed to the conceptualization, investigation, methodology, writing – review and editing.
Pengxuan Li
Pengxuan Li is a senior staff data engineer at Alibaba Group. His research interests are data mining and data science. He contributed to the methodology, validation, software development, writing – review and editing.
Peng Luo
Peng Luo has obtained his Ph.D. from the Chair of Cartography and Visual Analytics at the Technical University of Munich, Germany. He is about to join the Senseable City Lab at the Massachusetts Institute of Technology. His research interests include spatial association modelling, social sensing, and applied artificial intelligence. He contributed to the validation, writing – original draft, writing – review and editing.
Zijin Guo
Zijin Guo has obtained his master’s degree from China University of Geosciences (Wuhan). He is currently working at the Changjiang Water Resources Commission in China. His research interests are trajectory data mining and complex network analysis. He contributed to the validation, writing – original draft, writing – review and editing.
Qingfeng Guan
Qingfeng Guan is a professor at China University of Geosciences (Wuhan). His research interests are high-performance spatial intelligence computation and urban computing. He contributed to the supervision, writing – review and editing.
Yao Yao
Yao Yao is a professor at China University of Geosciences (Wuhan) and a researcher at the University of Tokyo. His research interests are geospatial big data mining, analysis, and computational urban science. He contributed to the supervision, project administration, conceptualization, data curation, investigation, methodology, writing – original draft, visualization, writing – review and editing.