ABSTRACT
Semantic segmentation of remote sensing images (RSIs) is of great significance for obtaining geospatial object information. Transformers achieve promising results, but multi-head self-attention (MSA) is computationally expensive. We propose an efficient semantic segmentation Transformer (ESST) for RSIs that combines zero-padding position encoding with linear space reduction attention (LSRA). First, to capture coarse-to-fine features of RSIs, a zero-padding position encoding is introduced by adding overlapping patch embedding (OPE) layers and convolutional feed-forward networks (CFFN), which improves the local continuity of features. Then, we replace the attention operation with LSRA to extract multi-level features while reducing the computational cost of the encoder. Finally, we design a lightweight all multi-layer perceptron (all-MLP) head decoder that aggregates the multi-level features into multi-scale representations for semantic segmentation. Experimental results on the Potsdam and Vaihingen datasets demonstrate that our method achieves a favorable trade-off between accuracy and speed for semantic segmentation of RSIs.
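To illustrate the core efficiency idea, the sketch below shows one plausible form of linear space reduction attention: keys and values are average-pooled to a small fixed spatial grid before attention, so the attention matrix grows linearly rather than quadratically with the number of input tokens. This is a minimal PyTorch sketch assuming a pooling-based reduction similar to common linear-attention variants; the module name, pooling size, and layer layout are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn


class LinearSRAttention(nn.Module):
    """Hypothetical sketch of linear space reduction attention (LSRA).

    Keys/values are pooled to a fixed pool_size x pool_size grid before
    attention, so attention cost is linear in the number of input tokens.
    """

    def __init__(self, dim, num_heads=8, pool_size=7):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5

        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)
        self.pool = nn.AdaptiveAvgPool2d(pool_size)  # spatial reduction of K, V
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        # x: (B, N, C) token sequence from an H x W feature map, with N = H * W
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        # Reduce the spatial resolution of keys/values by average pooling.
        x_ = x.transpose(1, 2).reshape(B, C, H, W)
        x_ = self.pool(x_).reshape(B, C, -1).transpose(1, 2)  # (B, P*P, C)
        x_ = self.norm(x_)
        kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, self.head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4)  # each: (B, heads, P*P, head_dim)

        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, heads, N, P*P)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


# Example usage on a 64 x 64 feature map with 128 channels.
tokens = torch.randn(2, 64 * 64, 128)
out = LinearSRAttention(dim=128)(tokens, H=64, W=64)
print(out.shape)  # torch.Size([2, 4096, 128])
```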
Disclosure statement
No potential conflict of interest was reported by the authors.
Data availability statement
The data that support the findings of this study are openly available in public repositories: the Potsdam dataset at https://www2.isprs.org/commissions/comm3/wg4/benchmark/2d-sem-label-potsdam/ and the Vaihingen dataset at https://www2.isprs.org/commissions/comm3/wg4/benchmark/2d-sem-label-vaihingen/.