ABSTRACT
Convolutional neural networks (CNNs) have been used for several years to extract buildings from remote sensing images. The Vision Transformer (ViT) has recently demonstrated superior performance over CNNs, thanks to its ability to model long-range dependencies through self-attention. However, most existing ViT models do not enhance shape information for building objects, which results in insufficiently fine-grained segmentation. To address this limitation, we construct an efficient dual-path ViT framework for building segmentation, termed the shape-aware enhancement Vision Transformer (SAEViT). Our approach incorporates a shape-aware enhancement module (SAEM) that perceives and enhances the shape features of buildings using convolutional kernels of multiple shapes. We also introduce multi-pooling channel attention (MPCA) to exploit channel-wise information without squeezing the channel dimension. Furthermore, we propose a progressive aggregation upsampling model (PAUM) in the decoder, which aggregates multilevel features through progressive upsampling and applies the soft-pool algorithm along the channel axis. We evaluate our model on three public building datasets. The experimental results show that SAEViT yields significant improvements across datasets, confirming its efficacy. Compared with several state-of-the-art models, SAEViT achieves better overall performance.
Acknowledgements
This work owes a great deal to the Natural Science Foundation of Xinjiang Uygur Autonomous Region and to all the teachers and students of the research group, whose unwavering support and invaluable assistance were instrumental in shaping the outcome of this research.
Disclosure statement
No potential conflict of interest was reported by the author(s).
Data availability statement
The WHU building dataset is openly accessible at https://study.rsgis.whu.edu.cn/pages/download/building_dataset.html, the Massachusetts building dataset is openly accessible at https://www.cs.toronto.edu/vmnih/data/, and the Inria building dataset is openly accessible at https://project.inria.fr/aerialimagelabeling/.