Research Article

CMPF-UNet: a ConvNeXt multi-scale pyramid fusion U-shaped network for multi-category segmentation of remote sensing images

Article: 2311217 | Received 11 Oct 2023, Accepted 23 Jan 2024, Published online: 14 Feb 2024

Abstract

Most U-shaped convolutional neural network (CNN) methods suffer from insufficient feature extraction and fail to fully utilize global/multi-scale context information, which makes it difficult to distinguish similar objects and shadow-occluded objects in remote sensing images. This article proposes a ConvNeXt multi-scale pyramid fusion U-shaped network (CMPF-UNet). We first propose a novel backbone network based on ConvNeXt to enhance image feature extraction and use ConvNeXt bottleneck blocks to reconstruct the decoder. Furthermore, a Scale-Aware Pyramid Fusion (SAPF) module and a Residual Atrous Spatial Pyramid Pooling (RASPP) module are proposed to dynamically fuse the rich multi-scale context information in high-level features. Finally, multiple Global Pyramid Guidance (GPG) modules are embedded in the network to provide different levels of global context information to the decoder by reconstructing the skip-connections. Experiments on the Vaihingen and Potsdam datasets indicate that the proposed CMPF-UNet achieves more accurate segmentation results.

1. Introduction

Semantic segmentation of remote sensing images plays an important role in many fields such as urban planning, disaster assessment, and military security. Its main task is to extract relevant feature information from collected remote sensing images and classify the target represented by each pixel, thereby forming a segmentation map.

With the rapid development of computer vision technology, researchers have conducted extensive research on the semantic segmentation of high-resolution remote sensing images and accordingly proposed a large number of segmentation methods. These methods are mainly divided into two categories: traditional segmentation methods and deep learning-based segmentation methods (Li et al. Citation2021). Traditional segmentation methods divide an image into regions based on features such as grayscale, color, texture, and shape, so that regions differ from one another while pixels within a region are similar. Classic traditional segmentation methods include those based on thresholds, edges, clustering, and graph theory (Huang et al. Citation2011; Yang et al. Citation2011). However, traditional image segmentation methods are mostly based on traditional remote sensing image classification techniques and rely heavily on manually designed features; thus, they cannot achieve high-precision, fully automated segmentation. Nowadays, owing to their excellent feature extraction capabilities, deep learning methods based on convolutional neural networks (CNNs) (Long et al. Citation2015; Ronneberger et al. Citation2015; Badrinarayanan et al. Citation2017) are considered a promising approach to image semantic segmentation.

CNNs have strong generalization ability and self-learning characteristics in the field of deep learning. Benefiting from the structural features of local connectivity, weight sharing, and down-sampling, they can automatically learn and extract feature information at multiple levels from training images, leading to better segmentation performance than traditional methods. For semantic segmentation, the fully convolutional network (FCN) (Long et al. Citation2015) was an important turning point that brought semantic segmentation into the deep learning era. Subsequently, researchers proposed the U-shaped encoder-decoder networks U-Net (Ronneberger et al. Citation2015) and SegNet (Badrinarayanan et al. Citation2017) and successfully applied them to remote sensing image segmentation. Leveraging the deeper features extracted by the residual network (ResNet) (He et al. Citation2016), ResUNet (Xiao et al. Citation2018) replaced each submodule of U-Net with a residually connected counterpart and achieved good segmentation results. DeeplabV2 (Chen et al. Citation2018) introduced ResNet to enhance the extraction of image feature information, thereby improving the segmentation of target objects. Fast-FCN (Wu et al. Citation2019) used a joint pyramid up-sampling module to replace dilated convolutions and capture contextual information. PSPNet proposed the Pyramid Pooling Module (PPM) to aggregate context information from different regions by adopting pools of different sizes (Zhao et al. Citation2017). DeepLabV3 introduced parallel Atrous Spatial Pyramid Pooling (ASPP) modules with different dilation rates to help fuse low-level and high-level features (Chen et al. Citation2017). DenseASPP (Yang et al. Citation2018) combined ASPP with the dense connections of DenseNet (Huang et al. Citation2017), so that the final output features not only cover more extensive semantic information than ASPP but also encode it in a very dense way. GSANet (Li et al. Citation2022) developed a new graph feature extraction module to obtain the topological structure information of remote sensing images and combine it with rich spectral-spatial information. MGSNet (Wang et al. Citation2023) improved the classification of remote sensing scenes composed of various complex objects and backgrounds through a target-background separation strategy and contrastive regularization. MIFNet (Wang et al. Citation2023) utilized multi-scale interactive information and global dependency fusion modules to extract rich multi-scale information and fuse features of multi-source data. SLA-NET (Zhang et al. Citation2023) integrated spatial convolution and morphological operations into a unified architecture, thereby reducing the information loss incurred as data propagate through deep networks. However, despite the significant results achieved by these networks, insufficient information extraction during the encoding stage can cause shallow details such as edges and contours to gradually deteriorate as the network deepens. Moreover, CNN convolutional operations are inherently local and lack global and multi-scale contextual information, ultimately making it difficult to distinguish ground-object categories with similar materials or under shadow occlusion.
Besides, these models use simple skip-connections to directly combine local information at each stage and ignore the fusion of global context information, which introduces irrelevant clutter and leads to incorrect pixel classification. Although some researchers have alleviated this problem by adding deeper convolutional layers (Ibtehaz and Rahman Citation2020) and self-attention mechanisms (Oktay et al. Citation2018; Fu et al. Citation2019), these approaches typically require a significant amount of computation time and memory to capture global contexts, which reduces efficiency and limits their potential for practical application. In summary, the performance of these semantic segmentation methods is not yet satisfactory, and many shortcomings remain.

In this paper, we aim to further enhance the segmentation performance of remote sensing images while addressing the challenges of distinguishing categories that have similar materials or are obscured by shadows. These challenges arise from the limited ability of U-shaped networks to capture and integrate global/multi-scale context information. The proposed approach seeks to overcome these limitations and improve the accuracy of segmentation results for remote sensing images, especially in scenarios where discriminating between similar materials and handling shadow occlusion are critical. We therefore develop a novel semantic segmentation model for remote sensing images, which introduces a new backbone and several global/multi-scale context information fusion modules. Specifically, we first propose a new backbone network based on ConvNeXt (Liu et al. Citation2022) to enhance image feature extraction, which can retain more shallow details and prevent them from deteriorating gradually. Secondly, Global Pyramid Guidance (GPG) modules are introduced between the encoder and decoder of the network, reconstructing the skip-connections to enhance the transmission and reuse of encoded feature information. Moreover, two pyramid modules, the improved Scale-Aware Pyramid Fusion (SAPF) module and the Residual Atrous Spatial Pyramid Pooling (RASPP) module, are proposed to effectively integrate global and multi-scale context information, making the prediction for each pixel more accurate. Based on the above, we name our method the ConvNeXt Multi-scale Pyramid Fusion U-shaped Network (CMPF-UNet). The proposed CMPF-UNet is evaluated on the Vaihingen and Potsdam datasets provided by the International Society for Photogrammetry and Remote Sensing (ISPRS). Compared with other methods, our proposed method achieves highly competitive performance.

In summary, the main contributions of this paper are as follows:

  1. We propose CMPF-UNet, a U-shaped semantic segmentation network based on ConvNeXt that can accurately segment high-resolution remote sensing images.

  2. A new ConvNeXt backbone network is proposed for remote sensing images with rich semantic features, which is superior to the current popular Vgg and ResNet networks.

  3. The Global Pyramid Guidance (GPG) module and the improved Scale Aware Pyramid Fusion (SAPF) module are embedded in the backbone network to effectively fuse the global and multi-scale context information of images.

  4. We propose the Residual Atrous Spatial Pyramid Pooling (RASPP) module to extract multi-scale features of ground objects. Compared with similar modules such as PPM, ASPP, and DenseASPP, RASPP has fewer parameters and higher segmentation performance.

  5. We evaluate the proposed network by conducting sufficient ablation studies and comparative experiments on the Vaihingen and Potsdam datasets. The superiority of the method is demonstrated in comparison with state-of-the-art approaches.

2. Related works

2.1. Encoder-decoder framework

Since the advent of FCN (Long et al. Citation2015), models based on the encoder-decoder structure have been widely used in remote sensing image semantic segmentation tasks. These models consist of two parts: (a) the encoder, where the spatial dimension of the feature maps is gradually reduced so that longer-range information is more easily captured in the deeper encoder output; and (b) the decoder, where object details and spatial dimensions are gradually recovered (Chen et al. Citation2018). For example, U-Net (Ronneberger et al. Citation2015) introduces a symmetrical ‘encoder-decoder’ design and adds skip-connections from encoder features to the corresponding decoder activations, thereby aggregating more spatial information. Building on U-Net, ResUNet (Xiao et al. Citation2018) replaced each submodule with a residually connected counterpart, forming a new U-shaped network. Similarly, DenseUNet (Guan et al. Citation2019) replaced each sub-module of U-Net with a densely connected form. SegNet (Badrinarayanan et al. Citation2017) reuses the pooling indices from the encoder to perform non-linear up-sampling and learns additional convolutional layers to densify the feature responses. In addition, the Transformer-based TransUNet (Chen et al. Citation2021) and Swin-UNet (Cao et al. Citation2023) have again proven the effectiveness of encoder-decoder models on several semantic segmentation benchmarks. In this paper, we follow the encoder-decoder U-shaped architecture; since extracting richer feature information is key to segmentation network design, we propose a new U-shaped backbone and combine it with global/multi-scale context extraction modules to enhance the network’s feature extraction ability and further improve segmentation performance.

2.2. ConvNeXt

In this article, ConvNeXt (Liu et al. Citation2022), a CNN-based model, is used as the encoder of our proposed network; it offers faster inference and higher accuracy than the recently popular Transformer-based models (Vaswani et al. Citation2017; Liu et al. Citation2021). The ConvNeXt structure starts from the original ResNet (He et al. Citation2016) and is gradually improved by drawing on the design of the Swin Transformer. Specifically, the starting point for ConvNeXt is the ResNet-50 model. First, ResNet-50 is trained with training techniques similar to those used for vision Transformers, obtaining much better results than the original ResNet-50; this serves as the ConvNeXt baseline. The paper then studies a series of design decisions, summarized as (1) macro design, (2) ResNeXt-ify, (3) inverted bottleneck, (4) large kernel size, and (5) various layer-wise micro designs (Liu et al. Citation2022).

Among them, (1) macro design includes changing the stage compute ratio and the stem to ‘patchify’. Based on the baseline, the number of stacked blocks in each stage is adjusted from (3, 4, 6, 3) in ResNet-50 to (3, 3, 9, 3), which also aligns the parameter count with Swin-T. Meanwhile, the ResNet-style stem cell is replaced with a patchify layer implemented as a 4 × 4, stride-4 convolution. (2) The ResNeXt-ify step adopts the idea of ResNeXt (Xie et al. Citation2017), using depthwise convolution and increasing the network width to the same number of channels as Swin-T (from 64 to 96). (3) The bottleneck block in ConvNeXt adopts the inverted bottleneck of MobileNetV2 (Sandler et al. Citation2018), a structure that is wider in the middle and narrower at both ends, which effectively avoids information loss. (4) By comparing experimental results for different kernel sizes, the convolutional kernel size in ConvNeXt was increased from 3 × 3 to 7 × 7. (5) Micro design optimizes various details, including replacing ReLU (Nair and Hinton Citation2010) with GELU (Hendrycks and Gimpel Citation2016), using fewer activation functions and normalization layers, substituting BatchNorm (BN) (Ioffe Citation2017) with Layer Normalization (LN) (Ba et al. Citation2016), and separating the down-sampling layers.
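
To make the block design concrete, the following minimal PyTorch sketch shows a ConvNeXt-style bottleneck block as described above: a 7 × 7 depthwise convolution, Layer Normalization, and an inverted bottleneck (1 × 1 expansion by a factor of 4, GELU, 1 × 1 projection) with a residual connection. The layer-scale and stochastic-depth details of the official implementation are omitted, and the class name is illustrative.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Minimal ConvNeXt-style block: 7x7 depthwise conv -> LayerNorm ->
    inverted bottleneck (1x1 expand x4, GELU, 1x1 project) + residual."""

    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)                    # normalizes the channel dim (channels-last)
        self.pwconv1 = nn.Linear(dim, expansion * dim)   # 1x1 conv expressed as Linear
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(expansion * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shortcut = x
        x = self.dwconv(x)                   # (N, C, H, W)
        x = x.permute(0, 2, 3, 1)            # -> (N, H, W, C) for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)            # back to (N, C, H, W)
        return shortcut + x

if __name__ == "__main__":
    feat = torch.randn(1, 96, 64, 64)
    print(ConvNeXtBlock(96)(feat).shape)     # torch.Size([1, 96, 64, 64])
```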

After the above design decisions, the final ConvNeXt model outperforms the Swin Transformer (Liu et al. Citation2021) in both image classification and detection/segmentation tasks. Table 1 shows the detailed architecture specifications of ResNet-50, Swin-T, and ConvNeXt-T; through observation and comparison, it can be seen that ConvNeXt differs from the other two models in specific ways. Some micro-design changes can be seen by comparing the Swin Transformer, ResNet, and ConvNeXt bottleneck block structures in Figure 1. We therefore construct a new backbone network using ConvNeXt and restructure the decoder using ConvNeXt bottleneck blocks. This not only strengthens the feature extraction ability of the encoding stage so that more detailed content is retained, but also endows the decoder with sufficient features and capacity to recover lost spatial information. In the experimental section, we also compare this backbone with other popular CNN-based backbones such as Vgg16 (Simonyan and Zisserman Citation2014), ResNet, and Res2Net, confirming the excellent performance of the proposed method.

Figure 1. Bottleneck block designs for (a) Swin Transformer; (b) ResNet Block; (c) ConvNeXt Block.


Table 1. Detailed architecture specifications for ResNet-50, ConvNeXt-T and Swin-T.

2.3. Multi-scale feature extraction and fusion

Due to inconsistent scales and imbalanced categories, extracting multi-scale features is crucial for determining object categories. Moreover, multi-scale features provide sufficient semantic and spatial information for subsequent feature fusion. For example, U-Net (Ronneberger et al. Citation2015) only uses simple skip-connections between the encoder and decoder and does not effectively extract and utilize the multi-scale context information within each stage, resulting in the loss of small targets and the appearance of irrelevant clutter in the segmentation results. ResUNet (Xiao et al. Citation2018) therefore replaces each convolution of U-Net with a residually connected form. Res2UNet (Chen et al. Citation2022) replaces the convolution in U-Net with a group of smaller residual convolutions; this mechanism enables the network to learn more features at a finer granularity and enlarges the receptive field of each bottleneck layer. In addition, PSPNet101 (Dowden et al. Citation2021) combines a ResNet101 convolutional neural network with a four-layer pyramid pooling module to improve its ability to obtain global information. DeepLabV3+ (Chen et al. Citation2018) proposes Atrous Spatial Pyramid Pooling (ASPP), where parallel atrous convolution layers with different rates capture multi-scale information. DenseASPP (Yang et al. Citation2018) connects different atrous convolutions with dense connections on the basis of ASPP, thereby achieving a larger range of dilation rates. LSCNet (Wang et al. Citation2023) utilizes two parallel rectangles to achieve a larger receptive field, with the long and short sides of the rectangles simultaneously capturing long-range dependencies and local detailed information. Despite achieving good performance, these models are still unable to exploit sufficient information from all scales. In order to fully utilize multi-scale features, we propose a Residual Atrous Spatial Pyramid Pooling (RASPP) module and an improved Scale-Aware Pyramid Fusion (SAPF) module.

3. Proposed method

3.1. Overview

In this section, we first introduce the overall structure of the proposed CMPF-UNet. We then introduce three important modules in CMPF-UNet, namely the Global Pyramid Guidance (GPG) module, the Residual Atrous Spatial Pyramid Pooling (RASPP) module, and the Scale-Aware Pyramid Fusion (SAPF) module.

3.2. Network structure

The overall architecture of our CMPF-UNet is shown in Figure 2. CMPF-UNet follows the well-established structure of U-Net, in which multiple GPG modules connect the encoder and decoder; the GPG module is introduced in Section 3.3. In particular, we employ a pre-trained ConvNeXt (Liu et al. Citation2022) as the feature extractor to obtain more representative feature maps. For compatibility, the average pooling layer and the fully connected layers are removed. Furthermore, we design the RASPP and SAPF modules to further enhance the ability to extract context information; these two modules are introduced in Sections 3.4 and 3.5, respectively.

Figure 2. Overall architecture of our proposed CMPF-UNet, where the backbone is ConvNeXt and there are three main modules, including GPG, RASPP, and SAPF, where ‘s4’, ‘s2’, and ‘p1’ represent stride 4, 2, and padding 1, respectively. According to the construction rules of ConvNeXt, L1, L2, L3, L4 = (3, 3, 9, 3), while L5, L6, and L7 are set to (3,3,3).


Due to the high resolution of remote sensing images, the original images are processed into patches of size 256 × 256. The processed remote sensing image $F \in \mathbb{R}^{H \times W \times 3}$ is first fed into the ConvNeXt encoder, which progressively compresses the spatial dimensions while expanding the channels to obtain deep features; ConvNeXt is detailed in Section 2.2. The encoder has four feature extraction stages, and the output of stage n is denoted $S_n$, where n = 1, 2, 3, 4. The output resolution of stage n is $H/2^{n+1} \times W/2^{n+1}$ and its channel dimension is $2^{n-1}C_1$, where $C_1 = 96$. At the end of the encoder, the RASPP and SAPF modules are introduced to extract multi-scale context information.
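
As a quick check of the stated shapes, the short snippet below applies the rule above (resolution $H/2^{n+1} \times W/2^{n+1}$, channel width $2^{n-1}C_1$ with $C_1 = 96$) to a 256 × 256 input; the printed values are implied by these formulas rather than taken from the paper's tables.

```python
H = W = 256   # processed patch size
C1 = 96       # ConvNeXt-T stem width

for n in range(1, 5):  # encoder stages S1..S4
    h, w = H // 2 ** (n + 1), W // 2 ** (n + 1)
    c = 2 ** (n - 1) * C1
    print(f"Stage S{n}: {h} x {w} x {c}")
# Stage S1: 64 x 64 x 96
# Stage S2: 32 x 32 x 192
# Stage S3: 16 x 16 x 384
# Stage S4: 8 x 8 x 768
```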

After the above encoding stage, we obtain the feature $F \in \mathbb{R}^{(H/32) \times (W/32) \times 768}$, which is fed into the decoder after a 3 × 3 convolutional layer. As can be seen in Figure 2, the decoder mainly consists of three modules, each of which includes three ConvNeXt bottleneck blocks (Figure 1(c)) and a 2 × 2 deconvolution up-sampling layer. As the feature F passes through an up-sampling layer, its resolution is doubled. To better utilize the detailed information of the shallow feature maps and the semantic information of the deep feature maps, each decoding layer is fused with the feature map output by the GPG module at the corresponding level. After the three decoding modules, the feature gradually expands to $F \in \mathbb{R}^{(H/4) \times (W/4) \times 96}$. Finally, the Conv_last layer at the end of the decoder restores the feature to the original image size. The architecture details of the proposed CMPF-UNet are shown in Table 2, where Conv denotes convolution, Deconv denotes deconvolution, Stride denotes the step size, Padding denotes the boundary padding size, and Classes denotes the number of segmentation categories. The table also lists the name of each layer of the network and the dimensions of its output feature map.
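
A sketch of one such decoder module is given below. The ordering (upsample, fuse the GPG skip feature, then apply three ConvNeXt bottleneck blocks) and the fusion by concatenation followed by a 1 × 1 projection are assumptions made for illustration; the paper specifies only the components, and Table 2 gives the exact layer configuration.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Compact ConvNeXt bottleneck block (same structure as the sketch in Section 2.2)."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, expansion * dim), nn.GELU(),
                                 nn.Linear(expansion * dim, dim))
    def forward(self, x):
        y = self.dw(x).permute(0, 2, 3, 1)              # (N, H, W, C)
        y = self.mlp(self.norm(y)).permute(0, 3, 1, 2)  # back to (N, C, H, W)
        return x + y

class DecoderStage(nn.Module):
    """One decoder stage, sketched: 2x2 transposed-conv upsampling, fusion with the
    GPG skip feature (concat + 1x1 projection is an assumption; the paper only says
    the features are fused), then three ConvNeXt bottleneck blocks."""
    def __init__(self, in_ch, out_ch, skip_ch, num_blocks=3):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.fuse = nn.Conv2d(out_ch + skip_ch, out_ch, kernel_size=1)
        self.blocks = nn.Sequential(*[ConvNeXtBlock(out_ch) for _ in range(num_blocks)])
    def forward(self, x, skip):
        x = self.up(x)
        x = self.fuse(torch.cat([x, skip], dim=1))
        return self.blocks(x)

if __name__ == "__main__":
    deep = torch.randn(1, 768, 8, 8)    # bottleneck feature (H/32)
    skip = torch.randn(1, 384, 16, 16)  # GPG output at the next level (assumed channels)
    print(DecoderStage(768, 384, 384)(deep, skip).shape)  # torch.Size([1, 384, 16, 16])
```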

Table 2. The proposed architecture details of CMPF-UNet.

3.3. Global Pyramid Guidance Module

Our proposed CMPF-UNet employs the GPG module, following Feng et al. (Citation2020), to replace the original skip-connection structure between the encoder and decoder. On the one hand, the feature information learned during the encoding stage may gradually weaken when transmitted to shallower layers. On the other hand, the simple skip-connections in U-shaped networks can introduce irrelevant clutter and lead to incorrect pixel classification. The GPG module is therefore introduced to address these issues; its structure is shown in Figure 3.

Figure 3. Illustration of the Global Pyramid Guidance (GPG) module.


In the GPG module, the original skip-connection of the U-shaped network is reconstructed by combining the feature maps of the current stage with the feature maps of all higher-level stages, which reduces the semantic gap caused by mismatched receptive fields. Figure 3 shows an example of the GPG module at Stage 2. Specifically, the GPG module integrates features from different encoding stages through convolution, up-sampling, and three parallel separable convolutions with different dilation rates. The GPG module at stage k can be summarized by the following formula (the regular convolutions are omitted to simplify the notation):
$$G_k = \mathcal{C}_{i=k}^{4}\Big(\mathrm{DSConv}@2^{\,i-k}\Big(\mathcal{C}_{i=k}^{4}\big({\uparrow_{2^{\,i-k}}}F_i\big)\Big)\Big) \tag{1}$$
where $G_k$ denotes the output of the GPG module inserted at the $k$th stage, $\mathcal{C}$ denotes concatenation, $\mathrm{DSConv}@2^{\,i-k}$ denotes a separable dilated convolution with dilation rate $2^{\,i-k}$, $F_i$ denotes the feature map of the $i$th encoder stage, and ${\uparrow_{2^{\,i-k}}}$ denotes up-sampling with rate $2^{\,i-k}$. By introducing multiple GPG modules between the encoder and decoder, the global semantic information from the high-level stages can be fused.
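
A hedged PyTorch sketch of Equation (1) is given below: the stage-k feature and all higher-stage features are brought to a common resolution, concatenated, and passed through parallel depthwise-separable convolutions with dilation rates $2^{i-k}$, whose outputs are concatenated. The 1 × 1 channel-reduction convolutions, the final 3 × 3 projection, and the channel sizes are assumptions (they stand in for the regular convolutions omitted from the formula).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeparableConv(nn.Module):
    """Depthwise-separable 3x3 convolution with a given dilation rate."""
    def __init__(self, in_ch, out_ch, dilation):
        super().__init__()
        self.dw = nn.Conv2d(in_ch, in_ch, 3, padding=dilation, dilation=dilation, groups=in_ch)
        self.pw = nn.Conv2d(in_ch, out_ch, 1)
    def forward(self, x):
        return self.pw(self.dw(x))

class GPG(nn.Module):
    """Sketch of a Global Pyramid Guidance module for one stage (Eq. 1)."""
    def __init__(self, in_chs, out_ch):
        super().__init__()
        # 1x1 convs squeeze each stage to a common channel width (assumption).
        self.squeeze = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in in_chs])
        rates = [2 ** i for i in range(len(in_chs))]          # dilation rates 2^(i-k)
        total = out_ch * len(in_chs)
        self.branches = nn.ModuleList([SeparableConv(total, out_ch, r) for r in rates])
        self.project = nn.Conv2d(out_ch * len(in_chs), out_ch, 3, padding=1)

    def forward(self, feats):
        # feats[0] is the stage-k feature; feats[1:] come from deeper stages.
        size = feats[0].shape[-2:]
        aligned = [F.interpolate(s(f), size=size, mode="bilinear", align_corners=False)
                   for s, f in zip(self.squeeze, feats)]
        fused = torch.cat(aligned, dim=1)
        return self.project(torch.cat([b(fused) for b in self.branches], dim=1))

if __name__ == "__main__":
    s2 = torch.randn(1, 192, 32, 32)
    s3 = torch.randn(1, 384, 16, 16)
    s4 = torch.randn(1, 768, 8, 8)
    print(GPG([192, 384, 768], 192)([s2, s3, s4]).shape)  # torch.Size([1, 192, 32, 32])
```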

3.4. Residual Atrous Spatial Pyramid Pooling module

In image segmentation tasks, extracting multi-scale features is crucial for determining object categories because of inconsistent scales and imbalanced categories. To address these problems, DeepLabv3+ (Chen et al. Citation2018) proposed the ASPP module with reference to SPP, PPM, and related designs. The ASPP module applies parallel atrous convolutions with different dilation rates to the input feature map to extract multi-scale features. In this paper, CMPF-UNet places a RASPP module at the bottom of the encoder to capture context features over different ranges. To obtain a larger receptive field and better multi-scale feature extraction, the RASPP module employs residual atrous blocks in place of the original atrous convolutions. Figure 4 shows the structure of the RASPP module and the residual block. Following the design philosophy of ConvNeXt, the residual block employs an inverted bottleneck structure and larger kernel-sized convolutions (Liu et al. Citation2022). We experimented with several kernel sizes, including 3, 5, 7, and 9, and observed that the benefit of larger kernels saturates at 7 × 7. This modification improves the segmentation performance of the network while keeping the number of parameters roughly the same. In the RASPP module, the dilation rates of the atrous convolutions in the four residual blocks are 1, 3, 5, and 7, respectively. The RASPP module can be expressed as:
$$y_i = \mathrm{DSConv}@i\,(X) \tag{2}$$
$$Y = \mathcal{C}(y_1, y_3, y_5, y_7) \tag{3}$$
where $X$ denotes the input feature map, $i \in \{1, 3, 5, 7\}$ denotes the dilation rate, $y_i$ denotes the output feature map for dilation rate $i$, $\mathrm{DSConv}@i$ denotes the separable dilated residual block with dilation rate $i$, $\mathcal{C}$ denotes concatenation, and $Y$ denotes the final output of the RASPP module. The use of residual blocks greatly improves the ability to extract multi-scale features.
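
The following sketch illustrates the RASPP design described above under stated assumptions: each residual atrous block uses a 7 × 7 depthwise convolution with the given dilation rate inside an inverted bottleneck, and the four branch outputs (rates 1, 3, 5, 7) are concatenated. The normalization choice, expansion factor, and final 1 × 1 projection are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ResidualAtrousBlock(nn.Module):
    """One residual atrous block: dilated 7x7 depthwise conv inside an
    inverted bottleneck (1x1 expand, GELU, 1x1 project) + residual."""
    def __init__(self, dim, dilation, expansion=4):
        super().__init__()
        pad = 3 * dilation  # keeps the spatial size for a 7x7 kernel
        self.dw = nn.Conv2d(dim, dim, 7, padding=pad, dilation=dilation, groups=dim)
        self.norm = nn.BatchNorm2d(dim)
        self.expand = nn.Conv2d(dim, expansion * dim, 1)
        self.act = nn.GELU()
        self.project = nn.Conv2d(expansion * dim, dim, 1)

    def forward(self, x):
        y = self.project(self.act(self.expand(self.norm(self.dw(x)))))
        return x + y

class RASPP(nn.Module):
    """Sketch of the RASPP module (Eqs. 2-3): four parallel residual atrous
    blocks with dilation rates 1, 3, 5, 7; outputs concatenated and projected."""
    def __init__(self, dim, rates=(1, 3, 5, 7)):
        super().__init__()
        self.blocks = nn.ModuleList([ResidualAtrousBlock(dim, r) for r in rates])
        self.project = nn.Conv2d(dim * len(rates), dim, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.blocks], dim=1))

if __name__ == "__main__":
    x = torch.randn(1, 768, 8, 8)
    print(RASPP(768)(x).shape)   # torch.Size([1, 768, 8, 8])
```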

Figure 4. Illustration of the Residual Atrous Spatial Pyramid pooling (RASPP) module and Residual block.


3.5. Scale-Aware Pyramid Fusion module

As the network deepens, the semantic information of high-level features becomes richer; the question is how to improve segmentation performance by effectively integrating this multi-scale deep semantic information. To this end, a modified SAPF module based on Feng et al. (Citation2020) is proposed; its structure is given in Figure 5. The SAPF module has three parallel dilated convolutions with dilation rates of 1, 2, and 4 to capture information at different scales. These dilated convolutions share weights, which reduces both the number of model parameters and the risk of overfitting. Moreover, a scale-aware module is designed between the dilated convolutions of different scales to effectively fuse multi-scale context information. However, the high information density and blurred boundaries between different regions of remote sensing images increase the difficulty of segmentation. Therefore, a Convolutional Block Attention Module (CBAM) (Woo et al. Citation2018) is added to the original scale-aware module to focus on more detail and boundary information. The modified scale-aware module and the CBAM structure are shown in Figures 6 and 7, respectively.

Figure 5. Illustration of the Scale-Aware Pyramid Fusion (SAPF) module.


Figure 6. Illustration of the modified scale-aware module.


Figure 7. Illustration of channel and spatial attention module of CBAM.


Using the CBAM attention mechanism in the SAPF module makes important information more prominent and filters out other information, similar to the way the human brain focuses on important information. The CBAM module is composed of a channel attention module and a spatial attention module in series, which apply attention to the channel and spatial dimensions of the feature vector, respectively.

Channel attention assigns a weight to each channel of the feature map, increasing the weights of effective channels and suppressing those of invalid channels (Sun et al. Citation2022). The channel attention structure is shown in the upper portion of Figure 7. The input feature map $F \in \mathbb{R}^{H \times W \times C}$ first undergoes average-pooling and max-pooling operations to generate the vectors $F^{C}_{Avg} \in \mathbb{R}^{1 \times 1 \times C}$ and $F^{C}_{Max} \in \mathbb{R}^{1 \times 1 \times C}$, where $C$ denotes the number of channels. The two vectors are then fed into a shared MLP, and the resulting features are merged by element-wise summation. Finally, the channel attention map $M_c \in \mathbb{R}^{C \times 1 \times 1}$ is produced through a sigmoid activation function. The channel attention module is computed as:
$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(W_1(W_0(F^{C}_{Avg})) + W_1(W_0(F^{C}_{Max}))\big) \tag{4}$$
where $\sigma$ denotes the sigmoid activation function, AvgPool and MaxPool denote average pooling and max pooling, MLP denotes the shared multi-layer perceptron, and $W_0 \in \mathbb{R}^{C/r \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r}$ are the weights of the shared MLP.

The spatial attention emphasizes certain regions of the feature map so that they obtain higher responses (Sun et al. Citation2022). Its flow is illustrated in the lower portion of Figure 7. Suppose the feature map weighted by the channel attention module is $F \in \mathbb{R}^{H \times W \times C}$. First, average-pooling and max-pooling along the channel dimension produce the two-dimensional maps $F^{S}_{Avg} \in \mathbb{R}^{H \times W \times 1}$ and $F^{S}_{Max} \in \mathbb{R}^{H \times W \times 1}$, where the superscript $S$ indicates the spatial dimension. The two maps are then concatenated and fused through a convolution operation. Finally, a two-dimensional spatial attention map is generated by the sigmoid activation function:
$$M_s(F) = \sigma\big(f^{7 \times 7}([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)])\big) = \sigma\big(f^{7 \times 7}([F^{S}_{Avg}; F^{S}_{Max}])\big) \tag{5}$$
where $\sigma$ is the sigmoid activation function and $f^{7 \times 7}$ denotes a convolution with a 7 × 7 kernel.

The input feature map $F$ is first weighted by the channel attention module to obtain the feature $F'$; the feature $F'$ is then weighted by the spatial attention module to obtain the feature $F''$. Thus, the feature weighted by the CBAM module is:
$$F' = M_c(F) \otimes F \tag{6}$$
$$F'' = M_s(F') \otimes F' \tag{7}$$
where $\otimes$ denotes element-wise multiplication.
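
For reference, the sketch below implements Equations (4)-(7) as a single CBAM-style module: channel attention from shared-MLP pooling statistics, followed by spatial attention from a 7 × 7 convolution over channel-pooled maps. The reduction ratio r = 16 is an assumption taken from the CBAM paper's default, not a value reported here.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention (Eq. 4) followed by spatial attention (Eq. 5),
    applied sequentially as in Eqs. 6-7."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        # Shared MLP W1(W0(.)) with a C/r bottleneck for channel attention.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // r, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, 1, bias=False),
        )
        # 7x7 convolution over the concatenated avg/max maps for spatial attention.
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel attention: Mc = sigmoid(MLP(avgpool) + MLP(maxpool)).
        avg = x.mean(dim=(2, 3), keepdim=True)
        mx = x.amax(dim=(2, 3), keepdim=True)
        x = x * torch.sigmoid(self.mlp(avg) + self.mlp(mx))
        # Spatial attention: Ms = sigmoid(f7x7([avgpool; maxpool] over channels)).
        avg_s = x.mean(dim=1, keepdim=True)
        max_s = x.amax(dim=1, keepdim=True)
        return x * torch.sigmoid(self.spatial(torch.cat([avg_s, max_s], dim=1)))

if __name__ == "__main__":
    f = torch.randn(2, 96, 32, 32)
    print(CBAM(96)(f).shape)   # torch.Size([2, 96, 32, 32])
```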

The scale-aware module is used to fuse feature maps of different scales in the SAPF module, as shown in Figure 6. The improved scale-aware module first focuses on the more important cues through CBAM, then uses a spatial attention mechanism to dynamically select appropriate scale features and fuse them through self-learning. Through the attention module and a series of convolutions, the two feature maps of different scales, $F_A$ and $F_B$, yield two maps $A, B \in \mathbb{R}^{H \times W}$, where $H$ and $W$ are the height and width of the feature map. Pixel-wise attention maps $A', B' \in \mathbb{R}^{H \times W}$ are then generated by applying a softmax operator over the spatial values:
$$A'_i = \frac{e^{A_i}}{e^{A_i} + e^{B_i}}, \quad B'_i = \frac{e^{B_i}}{e^{A_i} + e^{B_i}}, \quad i = 1, 2, \ldots, H \times W \tag{8}$$

Finally, the fused feature map is obtained through a weighted sum:
$$F_{fusion} = A' \otimes F_A + B' \otimes F_B \tag{9}$$
where the element-wise products between the two scale features and their attention maps yield the fused feature map $F_{fusion}$.

After two cascaded scale-aware modules, the final fused features of the three branches are obtained. A residual connection with a learnable parameter α is then used to produce the output of the SAPF module.
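
The pixel-wise fusion of two branches (Equations (8)-(9)) can be sketched as follows. How the per-pixel score maps A and B are produced from F_A and F_B is only described qualitatively above, so the 1 × 1 reductions used here are assumptions, and the CBAM step of the modified module is omitted for brevity.

```python
import torch
import torch.nn as nn

class ScaleAwareFusion(nn.Module):
    """Fuse two scale branches: reduce each to a single-channel score map,
    apply a pixel-wise softmax over the pair (Eq. 8), then take the weighted
    sum of the input features (Eq. 9)."""
    def __init__(self, channels: int):
        super().__init__()
        self.score_a = nn.Conv2d(channels, 1, kernel_size=1)
        self.score_b = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, fa: torch.Tensor, fb: torch.Tensor) -> torch.Tensor:
        scores = torch.cat([self.score_a(fa), self.score_b(fb)], dim=1)  # (N, 2, H, W)
        weights = torch.softmax(scores, dim=1)                           # A', B'
        a, b = weights[:, 0:1], weights[:, 1:2]
        return a * fa + b * fb                                           # Eq. (9)

if __name__ == "__main__":
    fa, fb = torch.randn(1, 768, 8, 8), torch.randn(1, 768, 8, 8)
    print(ScaleAwareFusion(768)(fa, fb).shape)   # torch.Size([1, 768, 8, 8])
```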

3.6. Loss function

A main problem faced in the Vaihingen and Potsdam datasets is the imbalance in category proportions, which can lead to model training focusing on the larger categories while ‘ignoring’ the smaller ones (Kampffmeyer et al. Citation2018; He et al. Citation2022). Thus, to further optimize the model, the principal loss L adopts a joint loss combining the dice loss (Milletari et al. Citation2016) $L_{Dice}$ and the cross-entropy loss $L_{CE}$ for all segmentation tasks. The formulas are as follows:
$$L_{Dice} = 1 - \frac{2}{N}\sum_{n=1}^{N}\sum_{k=1}^{K}\frac{\hat{y}_k^{(n)}\, y_k^{(n)}}{\hat{y}_k^{(n)} + y_k^{(n)}} \tag{10}$$
$$L_{CE} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{k=1}^{K} y_k^{(n)} \log \hat{y}_k^{(n)} \tag{11}$$
$$L = L_{Dice} + L_{CE} \tag{12}$$
where $N$ is the number of samples, $K$ is the number of categories, $y^{(n)}$ and $\hat{y}^{(n)}$ ($n \in [1, \ldots, N]$) are the one-hot encoding of the true semantic labels and the corresponding softmax output of the network, and $\hat{y}_k^{(n)}$ is the confidence that sample $n$ belongs to category $k$.
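
A minimal sketch of the joint loss is shown below. It uses a standard soft-Dice formulation (per-class sums with a smoothing term) together with PyTorch's cross-entropy, which is a common implementation choice rather than a literal transcription of Equation (10); the smoothing constant and per-class averaging are assumptions.

```python
import torch
import torch.nn.functional as F

def joint_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Joint loss L = L_Dice + L_CE (Eq. 12).
    logits: (N, K, H, W) raw network outputs; target: (N, H, W) integer labels."""
    num_classes = logits.shape[1]
    ce = F.cross_entropy(logits, target)                        # cross-entropy term (Eq. 11)

    probs = torch.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    dims = (0, 2, 3)                                            # sum over batch and pixels
    intersection = (probs * one_hot).sum(dims)
    union = probs.sum(dims) + one_hot.sum(dims)
    dice = 1.0 - (2.0 * intersection + eps) / (union + eps)     # per-class soft Dice loss
    return ce + dice.mean()

if __name__ == "__main__":
    out = torch.randn(2, 6, 64, 64)
    gt = torch.randint(0, 6, (2, 64, 64))
    print(joint_loss(out, gt).item())
```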

For a fair comparison, all methods in this work employ the same loss function for each individual task.

4. Experiment preparation

4.1. Dataset

The proposed architecture is evaluated on the Vaihingen and Potsdam datasets provided by the ISPRS. The Vaihingen dataset was acquired over Vaihingen, Germany, with a ground sampling distance of 9 cm and covers an area of 1.38 square kilometers. It contains 33 true orthophoto (TOP) images of different sizes collected by advanced airborne sensors, of which 16 have corresponding semantic labels. The images are 8-bit TIFF files composed of three bands: near infrared, red, and green. We divided the labeled Vaihingen data into two parts, using images with IDs 30, 32, 34, and 37 for testing and the remaining 12 images for training. Due to the high resolution of the original images and limited equipment, the original data were randomly cropped into patches of size 256 × 256 and augmented, resulting in 10020 patches for training and 1950 patches for testing.

The Potsdam dataset, with 38 images of the same size, was acquired over Potsdam, Germany, with a ground sampling distance of 5 cm and covers an area of 3.42 square kilometers. Each image is available in three channel combinations, namely IR-R-G, R-G-B, and R-G-B-IR; in this work, the three-channel R-G-B images were selected for the experiments. The Potsdam dataset was also divided into two parts, in which images 2_12, 3_12, 4_12, 5_12, 6_12, 7_12, and 7_13 were used for testing and the remaining 31 images for training. Similarly, the images were randomly cropped and augmented to obtain 10563 training patches and 3521 testing patches. The Vaihingen and Potsdam datasets are labeled with 6 categories for semantic segmentation research, including five foreground categories (impervious surface, building, low vegetation, tree, car) and one background category (clutter/background). The ‘clutter/background’ category is ignored in the quantitative evaluation on both datasets.

4.2. Implementation details

In this experiment, all evaluations of the networks were implemented on a single NVIDIA GTX 2080 GPU, with PyTorch 1.10.1 as the software framework. Detailed information about the training parameters can be found in Table 3. During training and testing, all input images to the model were of size 256 × 256. To ensure rapid convergence, we employed the Adam optimizer to train all modules, with the learning rate set to 1e-4. The batch size was set to 4, and the maximum number of iterations was set to 100. Random initialization methods were used to set the model parameters in the experiments. To prevent overfitting, we relied on data augmentation techniques to generate more data; the augmentation methods used in this paper include rotation, translation, noise injection, scaling transformations, and color jitter. These methods effectively expand the training dataset without altering the essential content of the images, thereby enhancing the model’s generalization ability. Moreover, we conducted five independent runs for each experimental configuration of the segmentation model to ensure the stability of the results, and report the average values with standard deviations over the five repeated runs.
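
The training setup described above can be summarized in the short sketch below. The model, dataset, and loss objects are passed in as placeholders for components defined elsewhere and are not part of any released code; only the optimizer, learning rate, and batch size follow the text, and treating the stated "100 iterations" as 100 full training passes is an assumption.

```python
import torch
from torch.utils.data import DataLoader

def train(model, train_dataset, criterion, device="cuda"):
    """Sketch of the training loop: Adam with lr = 1e-4 and batch size 4."""
    loader = DataLoader(train_dataset, batch_size=4, shuffle=True, num_workers=4)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    model.to(device).train()
    for epoch in range(100):            # 100 passes (assumed interpretation of "iterations")
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
```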

Table 3. Training parameter settings.

4.3. Evaluation metrics

Deep learning models are typically evaluated by analyzing their performance on test data. To analyze the segmentation performance after training, three common metrics, overall accuracy (OA), intersection over union (IoU), and F1 score (F1), were employed to evaluate the proposed network.

Among them, overall accuracy (OA), also known as pixel accuracy, is the percentage of pixels in the image that are correctly classified:
$$OA = \frac{TP + TN}{TP + TN + FP + FN} \tag{13}$$
where TP denotes true positives, TN true negatives, FP false positives, and FN false negatives.

Intersection over Union (IoU), also known as the Jaccard index, is the ratio of the intersection to the union of the predicted results and the ground-truth labels for each category:
$$IoU = \frac{TP}{TP + FP + FN} \tag{14}$$

The F1 score (F1) is the harmonic mean of precision and recall:
$$F1 = \frac{2 \times precision \times recall}{precision + recall} \tag{15}$$
$$precision = \frac{TP}{TP + FP} \tag{16}$$
$$recall = \frac{TP}{TP + FN} \tag{17}$$

IoU and F1 are evaluation indicators for a single category, while MIoU and AF1 denote the averages of IoU and F1 over all categories and serve as global evaluation indicators.
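
The metrics above can be computed from a per-class confusion matrix as in the sketch below; the small epsilon terms guard against division by zero and are an implementation detail, not part of Equations (13)-(17).

```python
import numpy as np

def segmentation_metrics(conf: np.ndarray):
    """Compute OA, MIoU and AF1 from a K x K confusion matrix `conf`,
    where conf[i, j] counts pixels of true class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    oa = tp.sum() / conf.sum()                                   # correctly classified / all pixels
    iou = tp / (tp + fp + fn + 1e-12)                            # per-class IoU (Eq. 14)
    precision = tp / (tp + fp + 1e-12)                           # Eq. (16)
    recall = tp / (tp + fn + 1e-12)                              # Eq. (17)
    f1 = 2 * precision * recall / (precision + recall + 1e-12)   # Eq. (15)
    return oa, iou.mean(), f1.mean()                             # OA, MIoU, AF1

if __name__ == "__main__":
    conf = np.array([[50, 2, 1], [3, 40, 2], [1, 1, 30]])
    print(segmentation_metrics(conf))
```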

5. Result and discussion

5.1. Ablation study and module analysis

In this section, the performance of the proposed network structure and its several important modules is evaluated through ablation studies on the Potsdam dataset. All experiments used a U-shaped network with a pre-trained ConvNeXt as the baseline network. To evaluate the model and the proposed modules objectively, the MIoU, AF1, and OA indicators are adopted to quantify segmentation performance. The effect of the loss function on the proposed network is also verified.

5.1.1. Effect of backbone and overall network structure

The results are shown in Table 4. The introduction of our ConvNeXt backbone effectively improves the segmentation performance of the baseline U-Net. Meanwhile, U-Net* (U-Net + GPG + RASPP + SAPF) outperforms the U-Net baseline, demonstrating the effectiveness of the proposed three modules in capturing and integrating context information. To increase comparability, the ConvNeXt backbone is also replaced with classic feature extraction networks such as Vgg, ResNet50, and Res2Net50. As shown in Table 4, all indicators show that CMPF-UNet performs best; ConvNeXt therefore exhibits better feature extraction ability than Vgg, ResNet50, and Res2Net50. In summary, both the backbone and the overall network structure of CMPF-UNet are effective.

Table 4. Comparison of baseline methods and different backbones on the Potsdam dataset.

5.1.2. Effect of GPG module

As shown in Table 5, the GPG module is compared with two other encoder-decoder skip-connection methods to verify its contribution to segmentation performance. One is the simple skip-connection of U-Net (Ronneberger et al. Citation2015), where the encoder and decoder are directly connected. The other is the respath skip-connection of MultiResUNet (Ibtehaz and Rahman Citation2020), in which a chain of convolutional layers carries residual connections. The experimental results show that the GPG module brings substantial improvements in the evaluation indicators compared with the other two methods. For MIoU, the GPG module is 0.75% and 1.37% higher than the other two methods, indicating that parallel branches of separable convolutions with different receptive fields are more conducive to information acquisition. To further verify the effectiveness of the GPG module, the three skip-connection methods are compared through feature map visualization in Figure 8. The GPG module, with its flow of global context information, responds better to the target segmentation. In the first row, the shadow of the ‘Car’ causes misclassification, but the GPG module distinguishes the categories correctly. In the second and third rows, the segmentation errors caused by the material similarity between ‘Building’ and ‘Impervious Surface’ are avoided after using the GPG module. Therefore, the GPG module is chosen as the encoder-decoder skip-connection method for CMPF-UNet, improving the overall segmentation performance.

Figure 8. The segmentation effect of using different skip methods in the CMPF-UNet framework. (a) U-Net; (b) Respath; (c) GPG.


Table 5. Comparison of different skip-connection methods.

5.1.3. Effect of RASPP module

Table 6 verifies the segmentation performance of the proposed RASPP module (Baseline + A), where A denotes the RASPP module. Compared with the baseline, the MIoU, AF1, and OA of the model improve by 0.55%, 0.22%, and 0.31%, respectively, which benefits from the fact that the RASPP module extracts rich multi-scale context information by adopting residual modules with larger kernels. To explore suitable large convolutional kernels, kernel sizes of 3 × 3, 5 × 5, 7 × 7, and 9 × 9 were tested; the results are shown in Table 8. The 7 × 7 kernel performs best on all three evaluation indicators, while the number of network parameters remains roughly the same. Thus, the residual blocks in the RASPP module adopt 7 × 7 depthwise convolutions to obtain a larger receptive field. To further demonstrate the performance of the RASPP module, modules of the same type (PPM (Zhao et al. Citation2017), ASPP (Chen et al. Citation2018), and DenseASPP (Yang et al. Citation2018)) were substituted for the RASPP module in our network for comparison; the results are shown in Table 9. A comprehensive comparison of parameters and segmentation performance shows that our RASPP module not only outperforms the other modules in segmentation accuracy but also has relatively few parameters. In particular, the parameters of the RASPP module are reduced by 75% compared with the original ASPP module, while MIoU, AF1, and OA increase by 0.57%, 0.37%, and 0.44%, respectively. The visual comparison in Figure 9 is more intuitive: the ‘Impervious Surface’, ‘Building’, and ‘Low Vegetation’ categories with similar colors can be effectively distinguished after using the RASPP module.

Figure 9. The segmentation effect of using the RASPP module in the CMPF-UNet framework. (a) PPM; (b) ASPP; (c) DenseASPP; (d) RASPP.


Table 6. Ablation Studies of the proposed method on the Potsdam dataset.

Table 7. Comparison of parameters and speed in Ablation studies.

Table 8. Comparison of different convolutional kernel sizes in RASPP modules.

Table 9. Comparison of parameters and segmentation performance between different modules.

5.1.4. Effect of SAPF module

As shown in Table 6, the MIoU, AF1, and OA of the model increase by 0.41%, 0.26%, and 0.35% when the improved SAPF module (Baseline + B_CBAM) is used independently, where B denotes the SAPF module. This confirms that the improved SAPF module can dynamically fuse multi-scale context information. By comparison, the original SAPF module (Baseline + B_no_CBAM) reduces the IoU of some categories, indicating that the CBAM module helps to focus on more context information. Figure 10 further verifies the performance of the improved SAPF module through visualization results. In the first and second rows, the ‘Car’ and ‘Building’ objects are mistakenly classified as ‘Clutter/background’ because the goods on the truck and the color of the building roof are similar to the ground. Moreover, the ‘Impervious Surface’ in the third row is brighter than the surrounding ‘Impervious Surface’ and is therefore mistakenly classified as ‘Clutter/background’. However, the visualization results in Figure 10(c) show that the improved SAPF module effectively reduces the errors caused by similar appearance and lighting.

Figure 10. The segmentation effect of using SAPF module in the CMPF-UNet framework. (a) Baseline; (b) baseline + SAPF (no CBAM); (c) baseline + SAPF (CBAM).


5.1.5. Effect of loss functions

To address the unbalanced distribution of dataset categories, a loss function ($L_{CE} + L_{Dice}$) composed of the dice loss ($L_{Dice}$) and the cross-entropy loss ($L_{CE}$) is used for all segmentation tasks. An ablation experiment on the influence of the three loss settings is conducted on the Potsdam dataset to verify the effectiveness of the joint loss; the results are shown in Figure 11. Compared with applying only $L_{CE}$, $L_{Dice}$ significantly improves the segmentation of the under-represented ‘Car’ category but weakens the regulation of the other categories. When $L_{CE}$ and $L_{Dice}$ are used together, the IoU of each category achieves the best segmentation result.

Figure 11. Ablation experiment of loss functions on the Potsdam dataset.


In this work, the joint effects between the modules are also explored (Table 6). Compared with the baseline, introducing both the RASPP and SAPF modules further improves the segmentation results, increasing MIoU by 1.08%, AF1 by 0.66%, and OA by 0.8%. In addition, Table 7 shows the parameters and speeds of the different network configurations in the ablation experiment. Although the introduction of the GPG, RASPP, and SAPF modules increases the complexity and computational cost of the model to a certain extent, it also improves the segmentation accuracy, so the performance gain is balanced against the additional computational cost. Moreover, in the proposed modules, depthwise separable convolutions are adopted instead of standard convolutions, and attention mechanisms are introduced to reduce unnecessary computation. We also compared existing modules of the same type in Tables 5 and 9, and the results show that our proposed modules not only have lower parameter counts but also achieve the best segmentation results. Therefore, for high-performance application scenarios it may be reasonable to accept a relatively higher computational cost, while on resource-limited devices it may be necessary to sacrifice some accuracy for faster speed. Table 7 reports the performance under the different model configurations so that a configuration can be selected according to the requirements of the actual application scenario.

In summary, the above ablation experiments and visualization results verify that our proposed network structure and three important modules achieve satisfactory segmentation results, especially for challenging categories with similar materials.

5.2. Comparison with other methods

FCN-8s (Long et al. Citation2015), U-Net (Ronneberger et al. Citation2015), SegNet (Badrinarayanan et al. Citation2017), DeepLabV3+ (Chen et al. Citation2018), ResUNet-50 (Xiao et al. Citation2018), Res2UNet-50 (Chen et al. Citation2022), TransUNet (Chen et al. Citation2021), and Swin-UNet (Cao et al. Citation2023) are introduced as references for comparison to further evaluate the effectiveness of the proposed CMPF-UNet. Among them, the first six methods are based on traditional CNNs, and the last two are based on the Transformer architecture. Specifically, TransUNet is a hybrid of a standard vision Transformer and UNet, and Swin-UNet is a U-shaped network composed solely of pure Swin Transformer blocks.

During the experiments, FCN-8s, ResUNet-50 and DeeplabV3+, and Res2UNet-50 are initialized with Vgg16, ResNet-50, and Res2Net-50 pre-trained weights, respectively, while our CMPF-UNet is initialized with ConvNeXt-T pre-trained weights. Besides, we quantitatively demonstrate the superiority of CMPF-UNet with the IoU, AF1, and OA metrics. To ensure the fairness of the experiments and the validity of the data, all comparative experiments are trained and tested in the same software and hardware environment to avoid the influence of irrelevant variables.

5.2.1. Results of Vaihingen dataset

Table 10 shows the comparison results of all the semantic segmentation methods. The proposed CMPF-UNet outperforms the other methods on all indicators. Among them, Swin-UNet performs the worst on the IoU of every category, which may be due to its insufficient global modeling ability and difficulty in processing semantically rich remote sensing images. In contrast, TransUNet improves segmentation accuracy by 3.65% MIoU, demonstrating the feasibility of combining CNNs and Transformers. However, CMPF-UNet captures more local information than TransUNet, yielding improvements of 4.89%, 3.26%, and 2.51% in MIoU, AF1, and OA, respectively. U-Net only employs simple skip-connections to continuously integrate spatial information from low-level features, which can lead to information loss and optimization difficulties when processing high-resolution remote sensing images. FCN-8s combines Vgg16 pre-trained weights with shallower detail information through multiple up-sampling steps to achieve better segmentation results than U-Net. SegNet has relatively low segmentation accuracy for complex scenes such as remote sensing images and struggles with complex backgrounds, occlusion, and similar situations. In addition, Res2UNet-50 achieves better segmentation performance than ResUNet-50 with its conventional ResNet bottleneck blocks by inserting more levels of residual connections for multi-scale feature extraction in the Res2Net bottleneck block. Our CMPF-UNet outperforms Res2UNet-50 by 0.85%, 0.55%, and 0.65% in MIoU, AF1, and OA, respectively, indicating that Res2UNet-50 is not as effective in global context modeling as our proposed method. Although DeeplabV3+ also captures more global context information by combining ASPP modules with a carefully designed decoder structure, the experimental data show that it is not as good as our CMPF-UNet: compared with DeeplabV3+, CMPF-UNet integrates global/multi-scale information more effectively, increasing MIoU, AF1, and OA by 2.61%, 1.74%, and 1.05%, respectively. In summary, CMPF-UNet achieves the best segmentation performance, with MIoU, AF1, and OA reaching 77.09%, 86.81%, and 87.85%, respectively.

Table 10. Comparison of different other methods on the Vaihingen dataset.

Figure 12 shows the training loss and accuracy curves of all networks on the Vaihingen dataset. The comparison shows intuitively that CMPF-UNet converges faster; moreover, its loss oscillates with a smaller amplitude and remains lower than that of the other networks. Figure 13 shows the visual prediction results of the semantic segmentation methods listed in Table 10. It can be observed that Swin-UNet lacks spatial location information, resulting in many semantic fragments in its segmentation results. Compared with the other methods, the results of CMPF-UNet are closest to the labels, especially for categories with high similarity. For example, in the first and second rows, ‘Tree’ and ‘Low Vegetation’ are connected together and have similar appearances, which makes them difficult to distinguish accurately. In the fourth and last rows, ‘Building’ with shaded areas cannot be accurately identified, and ‘Building’ and ‘Impervious Surface’ with similar materials cannot be distinguished. However, the visualization results in Figure 13 show that our CMPF-UNet overcomes these problems and reduces the errors caused by similar appearance and shadow occlusion. In summary, the proposed CMPF-UNet has better segmentation performance than the listed comparison networks.

Figure 12. Comparison of training loss and accuracy on the Vaihingen dataset.


Figure 13. Examples of segmentation effects on the Vaihingen dataset. (a) FCN; (b) U-Net; (c) SegNet; (d) DeeplabV3+; (e) ResUNet-50; (f) Res2UNet-50; (g) TransUNet; (h)Swin-UNet; (i) CMPF-UNet.


5.2.2. Results of Potsdam dataset

As shown in Table 11, the experimental results of all networks on the Potsdam dataset further demonstrate the effectiveness of the proposed CMPF-UNet, which reaches 80.23% MIoU, 88.86% AF1, and 89.17% OA. Due to the different sizes and data types, the segmentation accuracy on the Potsdam dataset is generally higher than that on the Vaihingen dataset. It is worth noting that Swin-UNet and TransUNet still perform poorly compared with the CNN-based models, consistent with the results on the Vaihingen dataset, which indicates that spatial information extraction is crucial for high-resolution remote sensing images. Although ResUNet-50 employs residual bottleneck modules and pre-trained weights to obtain deeper features, insufficient global information extraction results in poor performance. Res2UNet-50, with its multi-scale residual hierarchical convolution module, and DeepLabV3+, with its ASPP module, perform well among the CNN-based models, which verifies the effectiveness of multi-scale feature extraction for semantically rich remote sensing images. Compared with Res2UNet-50, CMPF-UNet increases MIoU by 1.94%, AF1 by 1.25%, and OA by 1.29%; compared with DeepLabV3+, it increases MIoU by 2.42%, AF1 by 1.57%, and OA by 1.53%. Additionally, the proposed CMPF-UNet is compared with other state-of-the-art CNNs, such as FCN-8s, U-Net, and SegNet, showing improvements in the segmentation accuracy of each category over these comparison networks.

Table 11. Comparison of different other methods on the Potsdam dataset.

Figure 14 shows the training loss and accuracy curves of all networks on the Potsdam dataset. The results show that CMPF-UNet attains the highest accuracy with the lowest losses, further demonstrating its effectiveness. Figure 15 gives the visualization results on the Potsdam dataset. Due to the similar appearance of the categories, the reference methods mistakenly classify the small target ‘Clutter/background’ as other categories (first and fifth rows); as expected, only CMPF-UNet achieves the correct segmentation. In the second and fourth rows, CMPF-UNet also better distinguishes the two categories ‘Low Vegetation’ and ‘Tree’ with similar materials. Moreover, as shown in the last row, the category ‘Building’ is occluded by ‘Tree’, which makes extracting and recognizing its semantic features difficult. Nevertheless, CMPF-UNet can still identify the occluded object by extracting global and multi-scale context information.

Figure 14. Comparison of training loss and accuracy on the Potsdam dataset.


Figure 15. Examples of segmentation effects on the Potsdam dataset. (a) FCN; (b) U-Net; (c) SegNet; (d) DeeplabV3+; (e) ResUNet-50; (f) Res2UNet-50; (g) TransUNet; (h) Swin-UNet; (i) CMPF-UNet.


In the above experiments, the Vaihingen and Potsdam datasets were divided into 6 types of land features, representing mostly urban-area imagery. However, these datasets lack geographical diversity and cannot represent urban landscapes in different regions around the world, especially in terms of architectural style, vegetation types, and urban layout. Moreover, the data may carry seasonal biases, and the model may not adapt to changes across seasons, such as the growth cycle of vegetation or the presence of snow; it may also be affected by limited lighting conditions and climate, as well as insufficient and imbalanced samples. Therefore, to provide a balanced view of the model’s generalization ability, data augmentation techniques (rotation, translation, scaling, and color jitter) were applied to simulate different lighting and weather conditions, helping the model learn more universal features and alleviating the problem of insufficient samples. We also employed partial transfer learning to assist model training, as well as a loss function combining the dice loss and cross-entropy loss to address the issue of imbalanced samples. With these measures, the comparative experiments above demonstrate that our CMPF-UNet is more effective than the other reference networks on the Vaihingen and Potsdam datasets and is not affected by the randomness of an individual dataset.

5.3. Efficiency analysis

In order to conduct a comprehensive performance analysis of the proposed network, Table 12 shows the training speeds and parameter counts of the different networks in the same operating environment. Here, ‘speed’ denotes the number of images processed by the network per second, measured in FPS (frames per second) (Li et al. Citation2021). In terms of computational efficiency, a larger number of parameters generally leads to a lower speed. CMPF-UNet not only employs the deep ConvNeXt backbone but also adds the three proposed modules; it therefore has more parameters than the other comparative networks, and its speeds on the Vaihingen and Potsdam datasets are only 99 FPS and 98 FPS, respectively. Although these two factors may limit the application of CMPF-UNet in certain scenarios (such as small mobile devices), it remains valuable for exploring the role of global and multi-scale features in remote sensing semantic segmentation.

Table 12. Comparison of all model parameters, speed, and accuracy.
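For completeness, parameter counts and inference speed of the kind reported in Table 12 can be estimated as in the hedged sketch below. The input resolution, batch size, and warm-up settings are assumptions chosen for illustration, not the exact benchmarking protocol of this work.

```python
import time
import torch

def count_parameters(model):
    """Total number of trainable parameters of a PyTorch model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

@torch.no_grad()
def measure_fps(model, input_size=(1, 3, 256, 256), warmup=10, iters=100, device="cuda"):
    """Rough FPS estimate; resolution and iteration counts are illustrative
    assumptions rather than the protocol used in the paper."""
    model = model.to(device).eval()
    x = torch.randn(*input_size, device=device)
    for _ in range(warmup):            # warm-up iterations to stabilise timings
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return iters * input_size[0] / (time.time() - start)
```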

6. Conclusions

In this paper, a new deep learning framework for remote sensing image segmentation, called CMPF-UNet, is proposed to address the weakness of U-shaped networks in capturing and fusing global and multi-scale contextual information. The network uses ConvNeXt as the backbone to extract image features, together with the designed GPG, RASPP, and SAPF modules. Specifically, the GPG module combines multi-stage global context information to reconstruct the skip-connections, providing a global information guidance flow for the decoder, while the proposed RASPP and SAPF modules obtain and dynamically fuse rich multi-scale context information from high-level features.

This article verifies the effectiveness and universality of the proposed CMPF-UNet through comprehensive experiments on the Vaihingen and Potsdam remote sensing datasets. The results on the two datasets show that CMPF-UNet effectively reduces errors compared with other methods for categories with similar materials, shadow occlusion, and small target objects. The method is also not tied to a single dataset, which indicates its practicality and universality. In particular, the GPG, RASPP, and SAPF modules are all effective and general, making them easy to introduce into other networks. Furthermore, because the proposed method captures general image features (including edges, textures, and colors), it can also be applied to other image segmentation tasks, such as urban scene segmentation, traffic monitoring image processing, and medical image segmentation; with appropriate training and fine-tuning, it can be transferred to different tasks and application scenarios, and the proposed and improved modules can provide research ideas for other scholars. In future work, we will extend the method to other segmentation fields to improve its generalization ability and design lightweight networks to improve inference efficiency.

Data availability statement

The data provided in this article can be obtained from this website (https://www.isprs.org/education/benchmarks/UrbanSemLab/2d-sem-label-vaihingen.aspx).

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This research was funded by the Jilin Province Science and Technology Development Plan (Grant No. YDZJ202301ZYTS285), the National Natural Science Foundation of China (No. 21606099), the Innovative and Entrepreneurial Talents Foundation of Jilin Province (No. 2023QN31) and the Natural Science Foundation of Jilin Province (No. YDZJ202301ZYTS157).

References

  • Ba JL, Kiros JR, Hinton GE. 2016. Layer normalization. arXiv:1607.06450.
  • Badrinarayanan V, Kendall A, Cipolla R. 2017. Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell. 39(12):2481–2495. doi: 10.1109/TPAMI.2016.2644615.
  • Cao H, Wang Y, Chen J, Jiang D, Zhang X, Tian Q, Wang M. 2023. Swin-unet: unet-like pure transformer for medical image segmentation. Computer Vision–ECCV 2022 Workshops; October 23–27, 2022; Tel Aviv, Israel: Springer.
  • Chen L-C, Zhu Y, Papandreou G, Schroff F, Adam H. 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation. The European Conference on Computer Vision (ECCV); 8–14 September 2018; Munich, Germany.
  • Chen J, Lu Y, Yu Q, Luo X, Adeli E, Wang Y, Lu L, Yuille AL, Zhou YJ. 2021. Transunet: transformers make strong encoders for medical image segmentation. arXiv:2102.04306.
  • Chen L-C, Papandreou G, Kokkinos I, Murphy K, Yuille AL. 2018. Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFS. IEEE Trans Pattern Anal Mach Intell. 40(4):834–848. doi: 10.1109/TPAMI.2017.2699184.
  • Chen L-C, Papandreou G, Schroff F, Adam H. 2017. Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587.
  • Chen F, Wang N, Yu B, Wang L. 2022. Res2-Unet, a new deep architecture for building detection from high spatial resolution images. IEEE J Sel Top Appl Earth Observations Remote Sensing. 15:1494–1501. doi: 10.1109/JSTARS.2022.3146430.
  • Dowden B, De Silva O, Huang W, Oldford D. 2021. Sea ice classification via deep neural network semantic segmentation. IEEE Sensors J. 21(10):11879–11888. doi: 10.1109/JSEN.2020.3031475.
  • Feng S, Zhao H, Shi F, Cheng X, Wang M, Ma Y, Xiang D, Zhu W, Chen X. 2020. CPFNet: context pyramid fusion network for medical image segmentation. IEEE Trans Med Imaging. 39(10):3008–3018. doi: 10.1109/TMI.2020.2983721.
  • Fu J, Liu J, Tian H, Li Y, Bao Y, Fang Z, Lu H. 2019. Dual attention network for scene segmentation. The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 15–20 June 2019; Long Beach, CA, USA.
  • Guan S, Khan AA, Sikdar S, Chitnis PV. 2019. Fully dense UNet for 2-D sparse photoacoustic tomography artifact removal. IEEE J Biomed Health Inform. 24(2):568–576. doi: 10.1109/JBHI.2019.2912935.
  • He K, Zhang X, Ren S, Sun J. 2016. Deep residual learning for image recognition. The IEEE conference on computer vision and pattern recognition; 27–30 June 2016; Las Vegas, NV, USA.
  • He X, Zhou Y, Zhao J, Zhang D, Yao R, Xue Y. 2022. Swin transformer embedding UNet for remote sensing image semantic segmentation. IEEE Trans Geosci Remote Sensing. 60:1–15. doi: 10.1109/TGRS.2022.3144165.
  • Hendrycks D, Gimpel K. 2016. Gaussian error linear units (gelus). arXiv:1606.08415.
  • Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. 2017. Densely connected convolutional networks. The IEEE Conference on Computer Vision and Pattern Recognition.
  • Huang X, Zhang L, Gong W. 2011. Information fusion of aerial images and LIDAR data in urban areas: vector-stacking, re-classification and post-processing approaches. Int J Remote Sens. 32(1):69–84. doi: 10.1080/01431160903439882.
  • Ibtehaz N, Rahman MS. 2020. MultiResUNet: rethinking the U-Net architecture for multimodal biomedical image segmentation. Neural Netw. 121:74–87. doi: 10.1016/j.neunet.2019.08.025.
  • Ioffe S. 2017. Batch renormalization: towards reducing minibatch dependence in batch-normalized models. Adv Neural Info Processing Sys. 30.
  • Kampffmeyer M, Salberg A-B, Jenssen R. 2018. Urban land cover classification with missing data modalities using deep convolutional neural networks. IEEE J Sel Top Appl Earth Observations Remote Sensing. 11(6):1758–1768. doi: 10.1109/JSTARS.2018.2834961.
  • Li H, Qiu K, Chen L, Mei X, Hong L, Tao C. 2021. SCAttNet: semantic segmentation network with spatial and channel attention mechanism for high-resolution remote sensing images. IEEE Geosci Remote Sensing Lett. 18(5):905–909. doi: 10.1109/LGRS.2020.2988294.
  • Li W, Wang J, Gao Y, Zhang M, Tao R, Zhang B. 2022. Graph-feature-enhanced selective assignment network for hyperspectral and multispectral data classification. IEEE Trans Geosci Remote Sensing. 60:1–14. doi: 10.1109/TGRS.2022.3166252.
  • Li X, He H, Li X, Li D, Cheng G, Shi J, Weng L, Tong Y, Lin Z. 2021. Pointflow: flowing semantics through points for aerial image segmentation. The IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021; June 19–25, 2021.
  • Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Lin S, Guo B. 2021. Swin transformer: hierarchical vision transformer using shifted windows. The IEEE/CVF International Conference on Computer Vision; October 2021; Montreal, BC, Canada. p. 11–17.
  • Liu Z, Mao H, Wu C-Y, Feichtenhofer C, Darrell T, Xie S. 2022. A convnet for the 2020s. The IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Long J, Shelhamer E, Darrell T. 2015. Fully convolutional networks for semantic segmentation. IEEE Conference on Computer Vision and Pattern Recognition; 7–12 June 2015; Boston, MA, USA.
  • Milletari F, Navab N, Ahmadi S-A. 2016. V-net: fully convolutional neural networks for volumetric medical image segmentation. The 2016 Fourth International Conference on 3D Vision (3DV); 25–28 October 2016; Stanford, CA, USA: IEEE. doi: 10.1109/3DV.2016.79.
  • Nair V, Hinton GE. 2010. Rectified linear units improve restricted Boltzmann machines. The 27th International Conference on Machine Learning (ICML-10).
  • Oktay O, Schlemper J, Folgoc LL, Lee M, Heinrich M, Misawa K, Mori K, McDonagh S, Hammerla NY, Kainz B. 2018. Attention u-net: learning where to look for the pancreas. arXiv:1804.03999.
  • Ronneberger O, Fischer P, Brox T. 2015. U-net: convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference; October 5–9, 2015; Munich, Germany: Springer.
  • Sandler M, Howard A, Zhu M, Zhmoginov A, Chen L-C. 2018. Mobilenetv2: inverted residuals and linear bottlenecks. The IEEE conference on computer vision and pattern recognition.
  • Simonyan K, Zisserman A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.
  • Sun W, Chen J, Yan L, Lin J, Pang Y, Zhang G. 2022. COVID-19 CT image segmentation method based on swin transformer. Front Physiol. 13:981463. doi: 10.3389/fphys.2022.981463.
  • Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017; December 4–9, 2017; Long Beach, CA, USA.
  • Wang J, Li W, Gao Y, Zhang M, Tao R, Du Q. 2023. Hyperspectral and SAR image classification via multiscale interactive fusion network. IEEE Trans Neural Netw Learn Syst. 34(12):10823–10837. doi: 10.1109/TNNLS.2022.3171572.
  • Wang J, Li W, Zhang M, Chanussot J. 2023. Large kernel sparse ConvNet weighted by multi-frequency attention for remote sensing scene understanding. IEEE Trans Geosci Remote Sensing. 61:1–12. doi: 10.1109/TGRS.2023.3333401.
  • Wang J, Li W, Zhang M, Tao R, Chanussot J. 2023. Remote sensing scene classification via multi-stage self-guided separation network. IEEE Trans Geosci Remote Sensing. 61:1–12. doi: 10.1109/TGRS.2023.3295797.
  • Woo S, Park J, Lee J-Y, Kweon IS. 2018. Cbam: convolutional block attention module. The European conference on computer vision (ECCV), 06 October 2018.
  • Wu H, Zhang J, Huang K, Liang K, Yu Y. 2019. Fastfcn: rethinking dilated convolution in the backbone for semantic segmentation. arXiv:1903.11816.
  • Xiao X, Lian S, Luo Z, Li S. 2018. Weighted Res-UNet for high-quality retina vessel segmentation. 2018 9th International Conference on Information Technology in Medicine and Education (ITME); IEEE. doi: 10.1109/ITME.2018.00080.
  • Xie S, Girshick R, Dollár P, Tu Z, He K. 2017. Aggregated residual transformations for deep neural networks. The IEEE Conference on Computer Vision and Pattern Recognition.
  • Yang M, Yu K, Zhang C, Li Z, Yang K. 2018. Denseaspp for semantic segmentation in street scenes. The IEEE Conference on Computer Vision and Pattern Recognition.
  • Yang Y, Hallman S, Ramanan D, Fowlkes CC. 2011. Layered object models for image segmentation. IEEE Trans Pattern Anal Mach Intell. 34(9):1731–1743. doi: 10.1109/TPAMI.2011.208.
  • Zhang M, Li W, Zhao X, Liu H, Tao R, Du Q. 2023. Morphological transformation and spatial-logical aggregation for tree species classification using hyperspectral imagery. IEEE Trans Geosci Remote Sensing. 61:1–12. doi: 10.1109/TGRS.2022.3233847.
  • Zhao H, Shi J, Qi X, Wang X, Jia J. 2017. Pyramid scene parsing network. The IEEE conference on computer vision and pattern recognition; 21–26 July 2017; Honolulu, HI, USA.