ABSTRACT
High-resolution remote sensing images (HR-RSIs) exhibit strong dependencies between geospatial objects and their background. Given the complex spatial structures and multiscale objects in HR-RSIs, how fully spatial information is mined directly determines the quality of semantic segmentation. In this paper, we propose a Spatial-specific Transformer with involution for semantic segmentation of HR-RSIs. First, we integrate a spatial-specific involution branch with a self-attention branch to form a Spatial-specific Transformer backbone, which produces multilevel features carrying both global and spatial information without introducing additional parameters. Then, we introduce multiscale feature representation with large-window attention into the Swin Transformer to capture multiscale contextual information. Finally, we add a geospatial feature supplement branch to the semantic segmentation decoder to mitigate the loss of semantic information caused by down-sampling the multiscale features of geospatial objects. Experimental results demonstrate that our method achieves competitive semantic segmentation performance, reaching 87.61% and 80.08% mIoU on the Potsdam and Vaihingen datasets, respectively.
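For readers unfamiliar with involution (the spatially varying kernel operation the backbone builds on), the following is a minimal single-group sketch. The kernel-generation map `gen`, the weight matrix `W_gen`, and all shapes are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def involution(x, kernel_gen, K=3):
    """Minimal involution: per-pixel kernels generated from the input.

    x: (H, W, C) feature map; kernel_gen maps a C-vector to K*K weights.
    Unlike convolution, the kernel varies with spatial position but is
    shared across all channels (a single channel group, for simplicity).
    """
    H, W, C = x.shape
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            w = kernel_gen(x[i, j]).reshape(K, K, 1)  # position-specific kernel
            patch = xp[i:i + K, j:j + K, :]           # K x K neighborhood
            out[i, j] = (w * patch).sum(axis=(0, 1))  # weighted aggregation
    return out

# Toy kernel generation: a fixed linear map from the channel vector to K*K weights.
rng = np.random.default_rng(0)
C, K = 4, 3
W_gen = rng.standard_normal((K * K, C)) * 0.1
gen = lambda v: W_gen @ v

x = rng.standard_normal((8, 8, C)).astype(np.float32)
y = involution(x, gen, K)
print(y.shape)  # (8, 8, 4)
```

Because the kernels are produced from the input itself rather than learned as fixed weights, this operation adds spatial specificity without a bank of extra convolution parameters, which is the property the backbone exploits.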
Disclosure statement
No potential conflict of interest was reported by the author(s).
Data availability statement
The data that support the findings of this study are openly available in public repositories: the Potsdam dataset at https://www2.isprs.org/commissions/comm2/wg4/benchmark/2d-sem-label-potsdam/ and the Vaihingen dataset at https://www2.isprs.org/commissions/comm2/wg4/benchmark/2d-sem-label-vaihingen/.