Research Article

A method for building extraction in remote sensing images based on SwinTransformer

Article: 2353113 | Received 06 Dec 2023, Accepted 04 May 2024, Published online: 15 May 2024

ABSTRACT

Remote sensing image building segmentation, which is essential in land use and urban planning, is evolving with advancements in deep learning. Conventional methods using convolutional neural networks face limitations in integrating local and global information and establishing long-range dependencies, resulting in suboptimal segmentation in complex scenarios. This paper proposes LMSwin_PNet, a novel segmentation network that addresses the SwinTransformer encoder's deficiency in local information processing through a local feature extraction module. Additionally, it features a multiscale nonparametric merging attention module to enhance feature-channel correlations. The network also incorporates the pyramid large-kernel convolution module, replacing the traditional 3 × 3 convolution in the decoder with multibranch large-kernel convolution, thereby achieving a large receptive field and detailed information capture. Comparative analyses on three public building datasets demonstrated the model's superior segmentation performance and robustness. The results show that LMSwin_PNet produced outputs closely matching the labels, showing its potential for broader application in remote sensing image segmentation tasks. It achieved an IoU of 72.35% on the Massachusetts Building Dataset, 91.30% on the WHU Building Dataset, and 78.99% on the Inria aerial-image building dataset. The source code will be freely available at https://github.com/ziyanpeng/pzy.

This article is part of the following collections:
Integration of Advanced Machine/Deep Learning Models and GIS

1. Introduction

The study of building extraction from satellite remote sensing images has particular significance as it contributes to the comprehension of building information within these images. It also provides essential data support for population forecasting (Wang, Wang, et al. Citation2022), urban planning (Wang and Qu Citation2022), and agricultural production (Huang et al. Citation2018).

In recent years, convolutional neural networks (CNNs) have gained widespread utilization in remote sensing image analysis, including detection (Wang, Lv, et al. Citation2022), classification (Duan, Duan, and Ding Citation2021), and semantic segmentation (Sun et al. Citation2021), yielding notable results. Unlike conventional machine learning techniques such as support vector machines (Razaque et al. Citation2021), random forests (Linhui, Weipeng, and Huihui Citation2021), and conditional random fields (Wang, Mu, et al. Citation2021), CNN-based approaches excel at capturing complex local contextual details, which in turn strengthens their feature representation and pattern recognition. In 2014, the fully convolutional network (FCN) (Long, Shelhamer, and Darrell Citation2015) replaced fully connected layers with convolutional layers to accomplish image segmentation. However, the simplified FCN decoder leads to coarse-resolution segmentation (Wu et al. Citation2018). Afterward, in 2015, U-Net (Ronneberger, Fischer, and Brox Citation2015) achieved significant success in image segmentation owing to its distinctive skip connections and encoder–decoder architecture. While its encoder extracts hierarchical features by gradually downsampling the spatial resolution of feature maps, the decoder learns more contextual information by progressively restoring the spatial resolution. The encoder–decoder framework has therefore become the standard for remote sensing image segmentation networks. Although CNN segmentation networks with the encoder–decoder structure have achieved commendable segmentation results, they primarily focus on local semantic feature extraction and thus lack the ability to capture global semantic features. Previous research has explored dilated convolutions (Deng, Shi, and Li Citation2021a; Wang et al. Citation2018) to expand the receptive field, but this approach is prone to information loss (Dai et al. Citation2017; Yu, Koltun, and Funkhouser Citation2017) and may fail to capture contextual information from a global perspective.

Obtaining long-range dependencies and a large receptive field proves crucial for distinguishing features, particularly in remote sensing images featuring complex patterns and low-contrast man-made objects. Identifying the objects requiring segmentation solely through global information is challenging. Recently, the vision transformer (ViT) (Vaswani et al. Citation2017) has presented a substantial challenge to CNNs. It has demonstrated superior performance and is gaining prominence in modern vision tasks. Built on the transformer originally proposed for natural language processing, the ViT architecture excels at handling global information dependencies, possessing a global receptive field for extracting more comprehensive semantic information from features. Chen et al. (Citation2021) utilized CNNs to extract features and then employed a transformer for long-term dependency modeling, aiming to address neural network deficiencies in global modeling. Wang, Xie, et al. (Citation2021) introduced the pyramid ViT model for extracting multiscale feature maps, albeit with complexity that scales to some extent with image size. Liu et al. (Citation2021) presented the SwinTransformer model, utilizing a sliding-window strategy to significantly reduce computational complexity and achieve superior performance in cross-window information exchange. Cao et al. (Citation2022) replaced the convolutional encoding and decoding operations in U-Net with SwinTransformer modules, thereby establishing the Swin_UNet model.

Transformer models exhibit nearly flawless performance in visual tasks. However, they also have certain flaws. For instance, they typically rely on pretrained networks to achieve peak performance, which in turn requires a large amount of training data (Kornblith, Shlens, and Le Citation2019; Raffel et al. Citation2020). Consequently, applying transformer models to acquire global information may prove unsuitable for smaller remote sensing datasets (Devlin et al. Citation2018; Yang et al. Citation2019). Notably, the use of large convolutional kernels can yield a substantial receptive field. The principal factor contributing to the performance disparity between CNNs and ViTs lies in the disparity in their receptive field sizes. Large convolutional kernels can expand the receptive field, compensating for the limitations of transformers on smaller datasets. In addition, transformer models lack the capacity to extract local information. In recent research, more transformer architectures have focused on extracting local information. Zhu et al. (Citation2023) proposed a simple and effective dynamic sparse attention with two-level routing to achieve more content-aware, flexible computation allocation. Furthermore, Ren et al. (Citation2022) proposed a new method for addressing the inefficiency of ViT in capturing small objects; it allows ViT to model attention at a mixed scale in each attention layer while preserving fine-grained features. Consequently, extracting local information alongside global information in transformer-based models is crucial for semantic segmentation tasks.

This paper proposes a novel segmentation network designed for building segmentation in remote sensing imagery. It leverages a fusion of transformer and CNN elements, distinct from purely transformer-based networks. In particular, the improved SwinTransformer is used as an encoder for global feature extraction, the LFEM and MNPA modules (defined below) are used for extracting local information, and the PLKC module is employed to obtain multiscale large receptive fields, thereby addressing the limitations of pure transformers on smaller datasets and the loss of detail information in SwinTransformer.

Our main contributions are summarized as follows:

  1. Introduction of a new segmentation network called LMSwin_PNet, representing a hybrid of SwinTransformer and U-Net architectures. This network was successfully applied to three publicly available building datasets;

  2. Proposal of the local feature extraction module (LFEM), which complements the potent global information extraction capability of the SwinTransformer encoder by supplying the local information that SwinTransformer may overlook;

  3. Proposal of the multiscale nonparametric merging attention (MNPA) module. This module enhances features at each layer, with the grouping in MNPA adapting to the feature maps: the greater the number of channels, the more feature map groups are employed. The designed MNPA module collaborates with the LFEM module to concentrate on pertinent building areas;

  4. Proposal of the pyramid large-kernel convolution (PLKC) module, which is characterized by varying convolution kernel sizes. The PLKC block adjusts to the feature map's dimensions: for larger feature maps, smaller convolution kernels are employed within the PLKC block. This pyramid structure, coupled with a large receptive field, allows for the capture of more information, thereby significantly enhancing the network's segmentation representation and performance.

2. Related work

2.1. Semantic segmentation of buildings

Traditional semantic segmentation methods have often been used for the semantic segmentation of buildings. Li et al. (Citation2015) obtained an initial classification of image pixels based on unsupervised segmentation and then developed a new conditional random field formulation to achieve accurate building segmentation. Du, Zhang, and Zhang (Citation2015) learned a random forest classifier from a large number of imbalanced samples with high-dimensional features to perform more precise semantic segmentation of buildings. Although traditional semantic segmentation algorithms are suitable for extracting buildings, they often rely on manual feature extraction and rule design, making it difficult to handle complex scenes and diverse architectural styles.

The development of the CNN also applies to the semantic segmentation of remote sensing images, and CNNs are currently the main technology for semantic segmentation. Liu et al. (Citation2019) proposed a lightweight deep learning model that combines spatial pyramids with the encoder–decoder structure, enhancing segmentation accuracy while maintaining compact model parameters. Deng, Shi, and Li (Citation2021b) used the stable encoder–decoder architecture, combined with a grid-based attention gate and an atrous spatial-pyramid-pooling module, to capture and restore features progressively and effectively. Tian et al. (Citation2021) elevated building segmentation accuracy by incorporating attention mechanisms and a dense feature pyramid into the segmentation network. Yuan, Wang, and Xu (Citation2022) proposed the shift pooling PSPNet, addressing the limitations of pyramid pooling and effectively capturing complete local information for building segmentation. This continuous development of CNN-based feature learning structures continually enhances the efficacy of building extraction from remote sensing images.

In remote sensing images, nonbuilding objects often exhibit complex structures, making it difficult to achieve accurate segmentation without global semantic information. In the past three years, transformer technology has made significant progress in semantic segmentation, with some transformer-based models achieving better performance than CNNs. Renhe, Qian, and Guixu (Citation2023) introduced a novel shunted transformer that captures multiscale information internally while fully establishing global dependencies, enabling the construction of a pure ViT-based U-shaped model for building extraction. However, building extraction methods based on pure transformer models remain relatively few.

Although semantic-segmentation models based on CNNs and transformers have achieved good segmentation results, the former approach falls short in capturing multiscale global features, while the latter can lead to the loss of local information, as depicted in Figure 1. For semantic segmentation, if only local information is modeled, the classification of each pixel will often be ambiguous. The combination of CNN and transformer architectures may therefore be an effective way to improve semantic-segmentation performance. Liegang et al. (Citation2023) used convolution to extract the local features of buildings while extracting global features in the encoder. Xiao et al. (Citation2022) integrated a U-shaped network in a novel manner to achieve the feature-level fusion of local and large-scale semantics. These methods demonstrate that a fused CNN–transformer architecture can achieve global and local feature fusion for buildings.

Figure 1. Explanation of local information and global information. The image on the left is from the WHU Building Dataset, and the image on the right is from Massachusetts Building Dataset, where the local contextual information is modeled using convolution (in yellow). The global contextual information is modeled through the dependency relationship between long-range windows (in red).


2.2. Local feature extraction

In ViT, an image is split into patch tokens, and long-range contextual information is then obtained through self-attention over the token representations. The standard ViT block includes multi-head self-attention (MSA), a multilayer perceptron (MLP), and layer normalization (LN), as shown in Figure 2(a). Compared with the conventional ViT, SwinTransformer is a hierarchical transformer that uses efficient self-attention strategies. As shown in Figure 2(b), the conventional MSA is replaced by the regular window-based MSA (W-MSA) and the shifted-window MSA (SW-MSA). In the SwinTransformer model, although the alternating execution of W-MSA and SW-MSA increases the interaction between information, this approach restricts the focus to local information within each window, thus weakening the global modeling capability of the encoder (Xia et al. Citation2023; Xu et al. Citation2023). Hence, we use the LFEM to recover the additional detailed information that SwinTransformer may inadvertently neglect, as shown in Figure 2(c).
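To make the global–local coupling concrete, the following is a minimal PyTorch sketch of the block in Figure 2(c), assuming that the Swin branch operates on token sequences, that the LFEM/MNPA branch operates on the corresponding 2D feature map, and that the two branches are fused by simple addition; the sub-modules and the fusion rule are placeholders, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class GlobalLocalBlock(nn.Module):
    """Sketch of Figure 2(c): a Swin-style attention branch (global context)
    in parallel with an LFEM + MNPA branch (local context), fused by addition."""

    def __init__(self, swin_block: nn.Module, lfem: nn.Module, mnpa: nn.Module):
        super().__init__()
        self.swin_block = swin_block   # W-MSA / SW-MSA transformer block (tokens -> tokens)
        self.lfem = lfem               # local feature extraction on the 2D feature map
        self.mnpa = mnpa               # nonparametric merging attention on the LFEM output

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (B, H*W, C), the layout used inside SwinTransformer stages
        b, n, c = tokens.shape
        global_out = self.swin_block(tokens)                  # global branch
        fmap = tokens.transpose(1, 2).reshape(b, c, h, w)     # tokens -> (B, C, H, W)
        local_out = self.mnpa(self.lfem(fmap))                # local branch
        local_tokens = local_out.flatten(2).transpose(1, 2)   # back to (B, H*W, C)
        return global_out + local_tokens                      # fuse global and local context
```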

Figure 2. (a) Architecture of a standard transformer block; (b) Two consecutive window-based transformer blocks (W-TB) and shifted window-based transformer blocks (SW-TB); (c) Novel SwinTransformer block with a multiscale nonparametric merging attention (MNPA) module and local feature extraction module (LFEM).


2.3. Nonparametric merging attention module

The concept of grouping feature maps has been a recurring topic of research. In depthwise separable convolutions (Chollet Citation2017; Howard et al. Citation2017), the grouping of feature maps is taken to the extreme, with the number of groups matching the number of channels; each group undergoes an independent convolution, enabling the capture of spatial information within the feature maps. Furthermore, in EPSANet (Zhang et al. Citation2022), the grouping operation is used to capture dependencies between different positions of the feature maps, thus enhancing the interconnection among the channels. Building upon these ideas, we developed the MNPA module, which focuses on the pertinent channel features through the SimAM attention mechanism (Yang et al. Citation2021). The grouping methodology and the number of groups are tailored to the varying channel numbers at different stages of the encoder network. When feature maps contain more channels, the number of groups correspondingly increases and the grouping scheme becomes richer. This architectural design facilitates more efficient interaction among feature channels under differing grouping strategies, ultimately making the fused multiscale features more interpretable and contributing to superior segmentation performance.
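As a point of reference, the sketch below shows how channel grouping is expressed in PyTorch: `groups=1` is a standard convolution, an intermediate value splits the channels into independent groups, and `groups` equal to the channel count gives the depthwise extreme used in depthwise separable convolutions.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)  # a feature map with 64 channels

# Standard convolution: every output channel sees all 64 input channels.
standard = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=1)

# Grouped convolution: channels split into 4 independent groups of 16.
grouped = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=4)

# Depthwise convolution: groups == channels, one spatial filter per channel.
depthwise = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64)

for conv in (standard, grouped, depthwise):
    params = sum(p.numel() for p in conv.parameters())
    print(conv(x).shape, params)  # same output shape, far fewer parameters as groups grow
```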

2.4. Large-kernel convolution

The convolutional operation is a core component of CNNs, and the choice of convolutional kernels is crucial for network performance. Previous studies have widely used 3 × 3 convolutional kernels owing to advantages such as fewer parameters, more nonlinear transformations, and better parameter-sharing effects (Simonyan and Zisserman Citation2014). However, these kernels have a small receptive field, making it difficult to capture global features. Convolutional kernels larger than 5 × 5 are usually referred to as large kernels. In the early days, large kernels were frequently used, such as in AlexNet (Krizhevsky, Sutskever, and Hinton Citation2012). Although large kernels can achieve a larger receptive field, they have drawbacks, such as low efficiency and the overlooking of details, so they have been replaced by 3 × 3 convolutional kernels for a long time. Recently, the resurgence of large kernels has introduced a fresh perspective. RepLKNet (Ding et al. Citation2022) proposed the idea that small kernels can compensate for information loss. The fundamental concept involves the simultaneous use of small kernels alongside large kernels to capture complex details that large kernels may disregard. Empirical results substantiate the superiority of this approach. In RepLKNet, a dimension-reduction operation precedes large-kernel convolution, effectively curtailing the computational burden of the network. Based on these principles, this study proposes the concept of PLKC. Prior to each large-kernel convolution, dimension reduction is applied, and each PLKC module integrates 3 × 3 convolutions in parallel to enhance the extraction of fine-grained features. The pyramid structure aids in amalgamating the receptive fields of convolutional kernels at various scales, thereby enhancing segmentation accuracy. In particular, the configuration of the PLKC module adapts to the feature map's size. Larger feature maps entail the application of smaller convolutional kernels within the PLKC module, thus effectively mitigating the computational demands associated with large-kernel convolution.

3. Method

The novel segmentation network proposed in this research is illustrated in Figure 3. This network, denoted LMSwin_PNet, comprises an enhanced SwinTransformer encoder and a decoder founded on large-kernel convolution. In its entirety, the network accepts remote sensing building images as input and seamlessly produces building segmentation results. The methodology presented in this paper can achieve accurate building segmentation while maintaining network efficiency. The subsequent sections offer comprehensive explanations for each constituent element.

Figure 3. LMSwin_PNet for building segmentation in remote sensing images.


3.1. LFEM

Although global contextual information plays a pivotal role in the semantic segmentation of complex architectural scenes, retaining rich details through local information is equally important. To address this requirement, we devised two parallel branches dedicated to separately extracting global and local contexts. Global contextual information is combined with the local contextual information obtained from the LFEM to yield comprehensive global–local contextual information.

In datasets 1 and 3, most individual images contain multiple small and irregularly spaced buildings, and the spacing between buildings is tight. In contrast, dataset 2 has relatively high image quality; most individual images contain few or even no buildings, individual target buildings occupy a relatively large proportion of the image, and the intervals between buildings are sparse. Recognizing these differences between dataset 1, dataset 2, and dataset 3, we tailored two different methods for extracting local features. Based on the data characteristics of datasets 1 and 3, we use three convolution kernels of sizes 1 × 1, 3 × 3, and 5 × 5 to extract local contextual information. Conversely, for dataset 2, we use four convolution kernels of sizes 1 × 1, 3 × 3, 5 × 5, and 7 × 7. Both methods apply batch normalization before the addition or concatenation operations. Finally, three different pooling methods are applied to balance the features in the feature map. The structural details are shown in Figure 4(a and b).
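As an illustration, a minimal PyTorch sketch of such an LFEM variant is given below, assuming parallel k × k convolution branches with batch normalization whose outputs are summed; the subsequent pooling stage and the exact fusion used in the paper are omitted and remain assumptions.

```python
import torch
import torch.nn as nn

class LFEM(nn.Module):
    """Sketch of the local feature extraction module: parallel convolution
    branches of different kernel sizes, each with batch normalization, whose
    outputs are fused by element-wise addition."""

    def __init__(self, channels: int, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, k, padding=k // 2),
                nn.BatchNorm2d(channels),
            )
            for k in kernel_sizes
        ])
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(sum(branch(x) for branch in self.branches))

# Variant for datasets 1 and 3 versus dataset 2, per the text above.
lfem_13 = LFEM(128, kernel_sizes=(1, 3, 5))
lfem_2 = LFEM(128, kernel_sizes=(1, 3, 5, 7))
```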

Figure 4. (a) The LFEM used in Dataset 1 and Dataset 3; (b) The LFEM used in Dataset 2


3.2. MNPA

For the LFEM-derived features, we employ MNPA for supervision. This supervised grouping of channel features effectively highlights the objects that require detection, enabling the network to adaptively learn the critical aspects of the input data. Our MNPA approach is composed of different cascaded grouping methods. When the feature map has 128 channels, no grouping is applied, as illustrated in Figure 5(a). When the channel count reaches 256, the feature map is divided into 1 and 2 groups, as depicted in Figure 5(b). For a feature map with 512 channels, grouping involves 1, 2, and 4 groups, as shown in Figure 5(c). Finally, for a feature map containing 1,024 channels, grouping includes 1, 2, 4, and 8 groups, as illustrated in Figure 5(d).
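A hedged sketch of this cascaded grouping is shown below; the group counts follow Figure 5, while the per-group attention is injected as a callable (SimAM in the paper, defined later in this subsection), and the stage-by-stage re-merging by concatenation is an assumption rather than the released design.

```python
import torch
import torch.nn as nn

class MNPA(nn.Module):
    """Sketch of the cascaded grouping in Figure 5: the group counts are chosen
    from the channel count (128 -> [1], 256 -> [1, 2], 512 -> [1, 2, 4],
    1024 -> [1, 2, 4, 8]); each unit splits the channels into G groups, applies
    a parameter-free attention to every group, and re-merges the groups."""

    GROUPS = {128: (1,), 256: (1, 2), 512: (1, 2, 4), 1024: (1, 2, 4, 8)}

    def __init__(self, channels: int, attention):
        super().__init__()
        self.group_counts = self.GROUPS.get(channels, (1,))
        self.attention = attention  # e.g. SimAM, see Equations (1)-(5) below

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for g in self.group_counts:                           # cascaded grouping units
            chunks = torch.chunk(x, g, dim=1)                 # split channels into g groups
            x = torch.cat([self.attention(c) for c in chunks], dim=1)
        return x
```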

Figure 5. Cascading of groups corresponding to different feature channel numbers at different stages of MNPA, where G represents the number of groups in the feature map.


The grouped feature map is subsequently input into the SimAM attention mechanism. SimAM operates without generating extra parameters and assigns weight values to individual neurons through an energy function, eliminating the need for complex structures:

$$e_t(w_t,b_t,\mathbf{y},x_i)=\frac{1}{M-1}\sum_{i=1}^{M-1}\bigl(-1-(w_t x_i+b_t)\bigr)^2+\bigl(1-(w_t t+b_t)\bigr)^2+\lambda w_t^2, \quad (1)$$

where $i$ indexes the spatial dimension; $t$ and $x_i$ represent the target neuron and the other neurons in the feature map, respectively; $M$ represents the number of neurons in a channel; $w_t$ and $b_t$ represent the weight and bias of the neuron, respectively; and $\lambda$ is a hyper-parameter, usually set to $10^{-4}$. By calculation and inference, the mean and variance of all neurons, along with the weight $w_t$ and bias $b_t$, can be obtained in closed form. Substituting these analytical solutions back into the energy function yields the minimum energy:

$$e_t^{*}=\frac{4(\sigma^2+\lambda)}{(t-\mu)^2+2\sigma^2+2\lambda}, \quad (2)$$

which quantifies the disparity between the target neuron and the other neurons, where $\sigma^2$ and $\mu$ are defined in Equations (3) and (4), respectively:

$$\sigma^2=\frac{1}{M}\sum_{i=1}^{M}(x_i-\mu)^2, \quad (3)$$

$$\mu=\frac{1}{M}\sum_{i=1}^{M}x_i. \quad (4)$$

In general, a smaller energy value signifies greater importance of the target neuron. To process all channels of the feature map efficiently, we use SimAM to enhance the features according to the energy function, as depicted in Equation (5):

$$\tilde{X}=\operatorname{sigmoid}\left(\frac{1}{E}\right)\odot X, \quad (5)$$

where $E$ groups all $e_t^{*}$ across the channel and spatial dimensions, the sigmoid function limits overly large values in $E$, and $X$ represents the input neurons. As shown in Figure 6, the specific processing flow of SimAM when the number of feature map groups is 4 is as follows:
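For reference, the widely used parameter-free SimAM computation corresponding to Equations (1)–(5) can be written in a few lines of PyTorch; this follows the form published with SimAM and is included here only as an illustrative sketch. The resulting callable can be passed to the MNPA sketch above as its per-group attention.

```python
import torch

def simam(x: torch.Tensor, lam: float = 1e-4) -> torch.Tensor:
    """Parameter-free SimAM attention: compute the inverse of the minimal energy
    e_t* for every neuron and use its sigmoid as a per-neuron weight.
    x has shape (B, C, H, W)."""
    n = x.shape[2] * x.shape[3] - 1                        # M - 1 neurons besides the target
    d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)      # (t - mu)^2 for every neuron
    v = d.sum(dim=(2, 3), keepdim=True) / n                # channel variance sigma^2
    e_inv = d / (4 * (v + lam)) + 0.5                      # equals 1 / e_t* from Equation (2)
    return x * torch.sigmoid(e_inv)                        # Equation (5)
```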

Figure 6. Cascading unit when the number of groups is 4(G = 4), where G represents the number of groups in the feature map, X represents the feature map, C represents the number of feature channels, and SimAMAttention represents the SimAM attention mechanism.


3.3. PLKC

In the decoder, we replace the original 3 × 3 convolution module with the PLKC module, which comprises five distinct structures, PLKC0 to PLKC4. Each PLKC module consists of four branches arranged in parallel, and each branch includes convolution operations, batch normalization, and the ReLU activation function.

Moreover, every convolution with a kernel size of 5 × 5 or larger is paired with a 3 × 3 convolution to supplement fine-grained information. Concurrently, each large-kernel convolution employs 1 × 1 convolutions for dimension reduction and expansion to mitigate computational complexity. We choose convolution kernels of different sizes based on the size of the feature map: when the feature map is small, we use larger convolution kernels in the PLKC module, and when the feature map is large, we use smaller convolution kernels. The design of PLKC thus benefits, in one sense, from large-kernel convolution, effectively improving the prediction accuracy of the model, and, in another sense, reduces the enormous computational burden brought by large-kernel convolution.

The convolution kernels used in PLKC0 include 1 × 1, 3 × 3, and 5 × 5, as illustrated in Figure 7(a). For PLKC1, the convolution kernels include 1 × 1, 3 × 3, 5 × 5, and 7 × 7, as shown in Figure 7(b). In the case of PLKC2, the convolution kernels comprise 3 × 3, 5 × 5, and two 7 × 7 kernels, as depicted in Figure 7(c). PLKC3 employs 3 × 3, 5 × 5, 7 × 7, and 9 × 9 convolution kernels, as presented in Figure 7(d). Finally, PLKC4 utilizes 5 × 5, 7 × 7, 9 × 9, and 11 × 11 convolution kernels, as outlined in Figure 7(e).
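The following PyTorch sketch assembles a PLKC-style block from the kernel lists above, assuming that the branches run in parallel and are summed, that kernels of 5 × 5 or larger are wrapped with 1 × 1 reduction and expansion (a reduction ratio of 4 is assumed), and that smaller branches supply fine-grained detail; it is an illustration, not the released implementation.

```python
import torch
import torch.nn as nn

def conv_bn_relu(cin: int, cout: int, k: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, padding=k // 2), nn.BatchNorm2d(cout), nn.ReLU(inplace=True)
    )

def large_kernel_branch(cin: int, cout: int, k: int, reduce: int = 4) -> nn.Sequential:
    """1x1 reduce -> k x k large kernel -> 1x1 expand, to curb the cost of large kernels."""
    mid = max(cin // reduce, 1)
    return nn.Sequential(
        conv_bn_relu(cin, mid, 1), conv_bn_relu(mid, mid, k), conv_bn_relu(mid, cout, 1)
    )

class PLKC(nn.Module):
    """Sketch of a PLKC block: parallel branches with the kernel sizes listed in
    Figure 7 (e.g. (1, 3, 5) for PLKC0, (5, 7, 9, 11) for PLKC4), summed at the output."""

    def __init__(self, cin: int, cout: int, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList([
            large_kernel_branch(cin, cout, k) if k >= 5 else conv_bn_relu(cin, cout, k)
            for k in kernel_sizes
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return sum(branch(x) for branch in self.branches)
```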

Figure 7. PLKC block (a) PLKC0. (b) PLKC1. (c) PLKC2. (d) PLKC3. (e) PLKC4.


4. Experiment and results

4.1. Datasets

To ensure a fair comparison with existing research, researchers typically perform the evaluation of building segmentation accuracy using publicly available building datasets. In our study, we employed three distinct datasets to assess the model's performance.

Dataset 1 (Mnih Citation2013) is the Massachusetts Building Dataset, which comprises aerial images with R, G, and B bands and corresponding labels. All images in this dataset are 1500 × 1500 pixels with a spatial resolution of 1 m; the dataset has a lower spatial resolution and contains more buildings per image. Dataset 2 (Ji, Wei, and Lu Citation2018) is the WHU Building Dataset, provided by Wuhan University, whose original data come from the New Zealand Land Information Service website. This visible-light dataset includes a total of 187,000 buildings with a ground resolution of 0.3 m. It contains 4,736 training images (130,500 building areas), 1,036 validation images (14,500 building areas), and 2,416 testing images (42,000 building areas), each of size 512 × 512. Dataset 3 (Maggiori et al. Citation2017) is the Inria aerial-image building dataset, which does not split adjacent parts of the same image into training and testing subsets but instead includes different cities in each subset. It covers diverse urban settlements, from densely populated areas to alpine towns, spanning 810 km² with a spatial resolution of 0.3 m. Since only the masks of the training-set images are publicly available, these data were used in this study.

4.2. Evaluation metrics

To examine the building extraction performance of various networks, we employed well-established metrics, including intersection over union (IoU), recall, and F1 score. These metrics were computed from the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). TP indicates that the model correctly identifies building pixels as building pixels, FP indicates that the model incorrectly identifies non-building pixels as building pixels, FN indicates that the model incorrectly identifies building pixels as non-building pixels, and TN indicates that the model correctly identifies non-building pixels as non-building pixels.

IoU measures the degree of overlap between the segmentation results predicted by the model and the annotated ground truth. IoU ranges from 0 to 1, with a larger value indicating greater overlap between the predicted and actual segmentation results and therefore better model performance:

$$\mathrm{IoU}=\frac{TP}{TP+FP+FN}. \quad (6)$$

Precision measures how many of the samples the model predicted as positive are truly positive:

$$\mathrm{Precision}=\frac{TP}{TP+FP}. \quad (7)$$

Recall indicates the proportion of the pixels of a given category that the model successfully predicted relative to the actual number of pixels of that category:

$$\mathrm{Recall}=\frac{TP}{TP+FN}. \quad (8)$$

F1 score is a comprehensive evaluation index that combines precision and recall to measure the predictive performance of the model on positive and negative categories:

$$F1\ \mathrm{score}=\frac{2\times \mathrm{Recall}\times \mathrm{Precision}}{\mathrm{Recall}+\mathrm{Precision}}. \quad (9)$$
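The four metrics can be computed directly from binary prediction and label masks; the short sketch below does so with NumPy (a small epsilon is added to avoid division by zero, which is an implementation choice rather than part of the definitions).

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, target: np.ndarray, eps: float = 1e-8):
    """Compute IoU, precision, recall and F1 from binary masks (1 = building)."""
    pred, target = pred.astype(bool), target.astype(bool)
    tp = np.logical_and(pred, target).sum()
    fp = np.logical_and(pred, ~target).sum()
    fn = np.logical_and(~pred, target).sum()
    iou = tp / (tp + fp + fn + eps)                           # Equation (6)
    precision = tp / (tp + fp + eps)                          # Equation (7)
    recall = tp / (tp + fn + eps)                             # Equation (8)
    f1 = 2 * recall * precision / (recall + precision + eps)  # Equation (9)
    return iou, precision, recall, f1
```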

4.3. Implementation details

Each experiment was run three times with the same training parameters, and the run with the highest accuracy among the three training sessions was reported in the tables. The images in datasets 1 and 2 were cropped to 256 × 256, while the images in dataset 3 were cropped to 512 × 512. In addition, for dataset 1, images with blank areas larger than 5% were excluded. The model was trained for 50 epochs with a batch size of 4. To ensure fairness, we conducted all experiments in the same environment, using the cross-entropy loss function as the training constraint, with unified hyper-parameter settings. The Adam optimizer was used with a learning rate of 0.0001. All experiments were implemented on an NVIDIA GeForce RTX 4090 24-GB GPU.
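For reproducibility, the reported settings translate into a training loop along the following lines; the model and data loader are supplied by the caller, so the names here are placeholders rather than the authors' released code.

```python
import torch
import torch.nn as nn

def train(model: nn.Module, train_loader, device: str = "cuda",
          epochs: int = 50, lr: float = 1e-4) -> nn.Module:
    """Training loop reflecting the reported settings: Adam with a learning rate
    of 1e-4, cross-entropy loss, 50 epochs; the batch size of 4 and the crop
    sizes (256 x 256 or 512 x 512) are set when building the DataLoader."""
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model
```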

4.4. Ablation study

4.4.1. The effectiveness of PLKC, LFEM, and MNPA

In this section, we validate the effectiveness of the LFEM, MNPA, and PLKC modules within the network, as presented in Table 1. Ablation experiments were conducted on datasets 1, 2, and 3. The baseline network was established by removing the LFEM, MNPA, and PLKC modules from our network architecture. Table 1 reports the results obtained by comparing our method with models containing different components. The baseline network exhibited robust segmentation performance, achieving an IoU of 69.89% and an F1 score of 81.11% on dataset 1, an IoU of 90.70% and an F1 score of 93.61% on dataset 2, and an IoU of 78.55% and an F1 score of 85.56% on dataset 3.

Table 1. Results of ablation experiments with our proposed method.

First, we introduced the LFEM + MNPA module to the baseline network, denoting this model as BaseLine + LFEM + MNPA. The results show that on dataset 1, the IoU and F1 score metrics increased by 1.11% and 0.86%, respectively. On dataset 2, there was an improvement of 0.33% in IoU and 0.27% in the F1 score. On dataset 3, there was an improvement of 0.36% in IoU and 0.33% in the F1 score. The LFEM module extracts local features, while MNPA suppresses irrelevant information within the feature channels, thereby mitigating interference from nonprominent objects. The LFEM + MNPA module enhances the global–local information balance by increasing local information features.

Second, we investigated the effectiveness of the PLKC module. We added the PLKC module to the baseline network and called this model BaseLine + PLKC. Compared with the baseline network, the IoU and F1 score increased by 1.97% and 1.25%, respectively, on dataset 1; by 0.39% and 0.31% on dataset 2; and by 0.22% and 0.19% on dataset 3. The PLKC module uses cascaded large-kernel and small-kernel convolutions, enabling the decoder to capture local information while obtaining a large receptive field, compensating for the small receptive field of the traditional 3 × 3 convolution.

Afterward, we investigated the effectiveness of the MNPA module. By adding the LFEM + PLKC modules to the baseline network, denoted BaseLine + LFEM + PLKC, we assessed its effect. The integration of the PLKC and LFEM + MNPA modules into the baseline network led to the proposed model, BaseLine + LFEM + MNPA + PLKC. The results show that on dataset 1, the proposed model improved the IoU and F1 score by 0.76% each. On dataset 2, there was an enhancement of 0.18% in IoU and 0.17% in the F1 score. On dataset 3, there was an enhancement of 0.03% in IoU but a decrease of 0.01% in the F1 score.

Ultimately, the proposed model (BaseLine + LFEM + MNPA + PLKC) delivered significant improvements, achieving an increase in IoU of 2.46% and an increase in the F1 score of 1.82% on dataset 1. On dataset 2, the model demonstrated an improvement of 0.6% in IoU and 0.46% in the F1 score. On dataset 3, the model demonstrated an improvement of 0.44% in IoU and 0.36% in the F1 score. The integration of the PLKC, LFEM, and MNPA modules facilitates more effective interactions between features, contributing to enhanced segmentation performance.

4.4.2. Comparison of different combinations of convolutions in LFEM

For datasets 1, 2, and 3, owing to the differences in their data, the convolution combinations used for extracting local information differ, resulting in different segmentation effects. In our proposed model, we compared distinct convolution combinations tailored to the specific datasets, guided by the evaluation metrics. The convolution combinations corresponding to the different evaluation metrics are outlined in Table 2. For datasets 1 and 3, the 1 × 1, 3 × 3, and 5 × 5 combination achieved higher IoU and F1 scores than the other combinations. However, for dataset 2, this combination did not achieve the best segmentation results; instead, the 1 × 1, 3 × 3, 5 × 5, and 7 × 7 combination achieved higher IoU and F1 scores on dataset 2, although it was not suitable for datasets 1 and 3. Given the distinctive characteristics of the datasets, using different convolution combinations to extract local information for each dataset proved beneficial in achieving enhanced segmentation results.

Table 2. The effect of different combinations of convolutional kernels on segmentation effectiveness.

4.4.3. The impact of the dimensionality-reduction operation in 1 × 1 convolution

Within the PLKC module, using 1 × 1 convolutions to reduce dimensionality effectively reduces the computational complexity associated with large-kernel convolution and accelerates the training of the model. However, the accuracy of the model could be affected by this dimensionality reduction. We refer to the model variant without 1 × 1 dimensionality reduction as Ours + NODR (No Dimension Reduction).

To investigate the influence of 1 × 1 convolution dimensionality reduction on segmentation results, we compared results on all three datasets. As illustrated in Table 3, the effect of omitting the 1 × 1 dimensionality reduction varied across datasets: while it increases model complexity, it tended to increase segmentation accuracy. On dataset 1, it resulted in an increase of 0.27% in IoU and 0.09% in F1 score. On dataset 2, there was an enhancement of 0.19% in IoU and 0.17% in F1 score. On dataset 3, however, it resulted in a decrease of 0.23% in IoU and 0.14% in F1 score.

Table 3. Influence of 1 × 1 convolution on dimension reduction for segmentation results.

4.5. Comparison with state-of-the-art methods

To validate the effectiveness of our proposed method, we conducted a comparative analysis between LMSwin_PNet and seven advanced segmentation methods: U-Net, PSPNet (Zhao et al. Citation2017), Deeplabv3plus (Chen et al. Citation2018), TransUNet, MANet (Li et al. Citation2021), Swin_UNet, and BANet (Graham et al. Citation2021). U-Net, PSPNet, Deeplabv3plus, TransUNet, Swin_UNet, and BANet are representative deep learning models for semantic segmentation; they are widely used in various scenarios and have achieved good segmentation results. MANet is a well-known model in the field of remote sensing segmentation with good performance. U-Net, PSPNet, Deeplabv3plus, and MANet are based on the CNN architecture, while Swin_UNet and BANet are based on the transformer architecture. Hybrid designs that combine transformer and CNN elements, as in LMSwin_PNet and TransUNet, have gradually become a research hotspot in recent years, alleviating the insufficient segmentation accuracy caused by limited sample data. We carried out both quantitative and qualitative evaluations of all these methods. Tables 4–6 present the quantitative evaluation results on datasets 1, 2, and 3, respectively, and we also report the parameter quantities and FLOPs for each model. Notably, our method consistently outperformed the other segmentation methods in terms of IoU and F1 score, showcasing its superior segmentation capabilities.

Table 4. Comparing our method with other advanced segmentation methods on dataset 1.

Table 5. Comparing our method with other advanced segmentation methods on dataset 2.

Table 6. Comparing our method with other advanced segmentation methods on dataset 3.

Further, we compared the segmentation results of the various methods, as illustrated in Figure 8, which shows visual comparisons on datasets 1, 2, and 3. Notably, our method consistently delivered superior segmentation results. When confronted with relatively blurry images in datasets 1 and 3, our method excelled in distinguishing buildings and handling boundary information. Similarly, when dealing with remote sensing images in dataset 2 characterized by similar contrast, our proposed method accurately identified building areas, demonstrating its robustness and effectiveness.

Figure 8. Visual comparison of segmentation results from different methods on datasets 1, 2, and 3. The first and second columns are from dataset 1, the third and fourth columns are from dataset 2, and the fifth column is from dataset 3.


5. Discussion

In the encoder, we introduce a three-branch structure for global–local attention, thereby enabling the extraction of a significant amount of global contextual information while preserving complex local details. Global contextual information is obtained through cascaded SwinTransformer branches, whereas local information is captured via cascaded LFEM modules. The MNPA modules further enhance model interpretability by suppressing irrelevant information output by LFEM. Although the combination of global–local information and local enhancement has been achieved, there remains room for research into more effective global–local information extraction techniques.

In the decoder, we propose a fusion of large-kernel convolutions and multiscale concepts. This approach leverages large-kernel convolutions of various sizes for feature extraction, making full use of the large receptive field to strengthen long-range connections between different features. Different convolution sizes allow for the extraction of multiscale information, compensating for information loss associated with the small receptive field and single scale, as seen in traditional neural network structures. The unique PLKC structure with varying kernel sizes provides a substantial receptive field while adding only a modest computational load. Experimental results validate its effectiveness in remote sensing image segmentation.

Traditional CNN structures are currently being challenged by transformer and other architectures owing to their limited receptive fields. This study not only suggested novel directions for future CNN applications in remote sensing image segmentation but also proposed a method to integrate the advantages of a large receptive field into CNN structures. Experimental findings indicate that combining large-kernel and small-kernel convolutions can enhance model performance. Although large-kernel convolutions introduce additional parameters, the parameter size can be mitigated through dimensionality reduction using 1 × 1 convolutions, albeit with a marginal loss in accuracy. Thus, our future research will focus on more effective ways to combine large- and small-kernel convolutions within CNNs.

6. Conclusions

In this article, we proposed a segmentation network named LMSwin_PNet, which combines the strengths of the CNN and transformer architectures to effectively capture global and local information for building segmentation in remote sensing images. We conducted extensive experiments on the WHU Building Dataset, the Massachusetts Building Dataset, and the Inria aerial-image building dataset, demonstrating the effectiveness of our model. We achieved an IoU of 91.49% on the WHU dataset, an IoU of 72.62% on the Massachusetts dataset, and an IoU of 78.93% on the Inria aerial-image building dataset.

The core structure of LMSwin_PNet is built upon U-Net, SwinTransformer, the LFEM, MNPA, and PLKC. The LFEM plays a crucial role in extracting local information and effectively integrating it with the global information obtained from the original encoder. MNPA enhances the correlation between feature channels without increasing the model's parameters; used alongside the LFEM, it helps the model focus on building details more effectively. The PLKC introduces a novel approach by combining large- and small-kernel convolutions. We designed PLKC structures with different convolution kernel sizes, which expand the receptive field while adding minimal computational complexity. We believe that our proposed network can serve as a dependable computer-aided system for building segmentation in remote sensing images, thereby facilitating accurate identification of building features. In the future, we may address the computational efficiency concerns associated with models such as ViT by exploring lightweight techniques such as knowledge distillation, tensor decomposition, and depthwise separable convolutions, aiming to achieve minimal accuracy loss in lightweight segmentation networks.

Author contributions

W.Z. and X.Z. designed and completed the experiments and wrote the paper. N.H. revised the paper and analyzed the data. Y.X. and T.C. supervised the study. Y.L. and Y.H. guided the process and helped with the writing of the paper. All authors have read and agreed to the published version of the manuscript.

Acknowledgments

We would like to thank the anonymous reviewers for their constructive and valuable suggestions on earlier drafts of this manuscript.

Data availability statement

The data used in this study are from open-source datasets. The datasets can be downloaded from Road and Building Detection Datasets (toronto.edu), https://gpcv.whu.edu.cn/data/building_dataset.html and Download – Inria Aerial Image Labeling Dataset (accessed on 17 January 2024).

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This research was supported by the National Natural Science Foundation of China (grant number 42371441) and the Scientific Innovation Program Project by the Shanghai Committee of Science and Technology (grant number 20dz1206501).

References

  • Cao, Hu, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, and Manning Wang. 2022. “Swin-unet: Unet-Like Pure Transformer for Medical Image Segmentation.” Paper presented at the European Conference on Computer Vision.
  • Chen, Jieneng, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L Yuille, and Yuyin Zhou. 2021. “Transunet: Transformers Make Strong Encoders for Medical Image Segmentation.” arXiv preprint arXiv:2102.04306.
  • Chen, Liang-Chieh, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. 2018. “Encoder-decoder with Atrous Separable Convolution for Semantic Image Segmentation.” Paper presented at the Proceedings of the European Conference on Computer Vision (ECCV).
  • Chollet, François. 2017. “Xception: Deep Learning with Depthwise Separable Convolutions.” Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Dai, Jifeng, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. 2017. “Deformable Convolutional Networks.” Paper presented at the Proceedings of the IEEE International Conference on Computer Vision.
  • Deng, Wenjing, Qian Shi, and Jun Li. 2021a. “Attention-gate-based Encoder–Decoder Network for Automatical Building Extraction.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing. https://doi.org/10.1109/jstars.2021.3058097.
  • Deng, Wenjing, Qian Shi, and Jun Li. 2021b. “Attention-gate-based Encoder–Decoder Network for Automatical Building Extraction.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 14:2611–2620. https://doi.org/10.1109/JSTARS.2021.3058097.
  • Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. “Bert: Pre-training of Deep Bidirectional Transformers for Language Understanding.” arXiv preprint arXiv:1810.04805.
  • Ding, Xiaohan, Xiangyu Zhang, Jungong Han, and Guiguang Ding. 2022. “Scaling up Your Kernels to 31 × 31: Revisiting Large Kernel Design in Cnns.” Paper presented at the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Du, Shihong, Fangli Zhang, and Xiuyuan Zhang. 2015. “Semantic Classification of Urban Buildings Combining VHR Image and GIS Data: An Improved Random Forest Approach.” ISPRS Journal of Photogrammetry and Remote Sensing 105:107–119. https://doi.org/10.1016/j.isprsjprs.2015.03.011.
  • Duan, Meimei, Lijuan Duan, and Bai Yuan Ding. 2021. “High Spatial Resolution Remote Sensing Data Classification Method Based on Spectrum Sharing.” Scientific Programming 2021:1–12. https://doi.org/10.1155/2021/4356957.
  • Graham, Benjamin, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, and Matthijs Douze. 2021. “Levit: A Vision Transformer in Convnet's Clothing for Faster Inference.” Paper presented at the Proceedings of the IEEE/CVF International Conference on Computer Vision.
  • Howard, Andrew G, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. “Mobilenets: Efficient Convolutional Neural Networks for Mobile Vision Applications.” arXiv preprint arXiv:1704.04861.
  • Huang, Yanbo, Zhong-xin Chen, Tao Yu, Xiang-zhi Huang, and Xing-fa Gu. 2018. “Agricultural Remote Sensing big Data: Management and Applications.” Journal of Integrative Agriculture 17 (9): 1915–1931. https://doi.org/10.1016/S2095-3119(17)61859-8.
  • Ji, Shunping, Shiqing Wei, and Meng Lu. 2018. “Fully Convolutional Networks for Multisource Building Extraction from an Open Aerial and Satellite Imagery Data set.” IEEE Transactions on Geoscience and Remote Sensing 57 (1): 574–586.
  • Kornblith, Simon, Jonathon Shlens, and Quoc V Le. 2019. “Do Better Imagenet Models Transfer Better?” Paper presented at the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. 2012. “Imagenet Classification with Deep Convolutional Neural Networks.” Advances in Neural Information Processing Systems 25: 1–8.
  • Li, Er, John Femiani, Shibiao Xu, Xiaopeng Zhang, and Peter Wonka. 2015. “Robust Rooftop Extraction from Visible Band Images Using Higher Order CRF.” IEEE Transactions on Geoscience and Remote Sensing. https://doi.org/10.1109/tgrs.2015.2400462.
  • Li, Rui, Shunyi Zheng, Ce Zhang, Chenxi Duan, Jianlin Su, Libo Wang, and Peter M Atkinson. 2021. “Multiattention Network for Semantic Segmentation of Fine-resolution Remote Sensing Images.” IEEE Transactions on Geoscience and Remote Sensing 60:1–13.
  • Liegang, Xia, Mi Shulin, Zhang Junxia, Luo Jiancheng, Shen Zhanfeng, and Cheng Yubin. 2023. “Dual-stream Feature Extraction Network Based on CNN and Transformer for Building Extraction.” Remote Sensing 15:2689, https://doi.org/10.3390/rs15102689.
  • Linhui, Li, Jing Weipeng, and Wang Huihui. 2021. “Extracting the Forest Type from Remote Sensing Images by Random Forest.” IEEE Sensors Journal 21 (16): 17447–17454. https://doi.org/10.1109/JSEN.2020.3045501.
  • Liu, Yaohui, Lutz Gross, Zhiqiang Li, Xiaoli Li, Xiwei Fan, and Wenhua Qi. 2019. “Automatic Building Extraction on High-resolution Remote Sensing Imagery Using Deep Convolutional Encoder-Decoder with Spatial Pyramid Pooling.” IEEE Access 7:128774–128786. https://doi.org/10.1109/ACCESS.2019.2940527.
  • Liu, Ze, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. “Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows.” Paper presented at the Proceedings of the IEEE/CVF International Conference on Computer Vision.
  • Long, Jonathan, Evan Shelhamer, and Trevor Darrell. 2015. “Fully Convolutional Networks for Semantic Segmentation.” Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Maggiori, E., Y. Tarabalka, G. Charpiat, and P. Alliez. 2017. “Can Semantic Labeling Methods Generalize to any City? The Inria Aerial Image Labeling Benchmark.” Paper presented at the 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 23–28 July 2017.
  • Mnih, Volodymyr. 2013. Machine Learning for Aerial Image Labeling. Ph.D. thesis, Toronto, ON, Canada: University of Toronto. https://www.cs.toronto.edu/~vmnih/docs/Mnih_Volodymyr_PhD_Thesis.pdf.
  • Raffel, Colin, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. “Exploring the Limits of Transfer Learning with a Unified Text-to-text Transformer.” The Journal of Machine Learning Research 21 (1): 5485–5551.
  • Razaque, Abdul, Mohamed Ben Haj Frej, Muder Almi’ani, Munif Alotaibi, and Bandar Alotaibi. 2021. “Improved Support Vector Machine Enabled Radial Basis Function and Linear Variants for Remote Sensing Image Classification.” Sensors 21 (13): 4431, https://doi.org/10.3390/s21134431.
  • Ren, Sucheng, Daquan Zhou, Shengfeng He, Jiashi Feng, and Xinchao Wang. 2022. “Shunted Self-attention via Multi-scale Token Aggregation.” Paper presented at the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Renhe, Zhang, Zhang Qian, and Zhang Guixu. 2023. “SDSC-UNet: Dual Skip Connection ViT-Based U-Shaped Model for Building Extraction.” IEEE Geoscience and Remote Sensing Letters 20:1–5. https://doi.org/10.1109/lgrs.2023.3270303.
  • Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. 2015. “U-net: Convolutional Networks for Biomedical Image Segmentation.” Paper presented at the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5–9, 2015, Proceedings, Part III 18.
  • Simonyan, Karen, and Andrew Zisserman. 2014. “Very Deep Convolutional Networks for Large-scale Image Recognition.” arXiv preprint arXiv:1409.1556.
  • Sun, Shuting, Lin Mu, Lizhe Wang, Peng Liu, Xiaolei Liu, and Yuwei Zhang. 2021. “Semantic Segmentation for Buildings of Large Intra-Class Variation in Remote Sensing Images with O-GAN.” Remote Sensing 13 (3): 475. https://doi.org/10.3390/rs13030475.
  • Tian, Qinglin, Yingjun Zhao, Kai Qin, Yao Li, and Xuejiao Chen. 2021. “Dense Feature Pyramid Fusion Deep Network for Building Segmentation in Remote Sensing Image.” Paper presented at the Seventh Symposium on Novel Photoelectronic Detection Technology and Applications.
  • Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention is All You Need.” Advances in Neural Information Processing Systems 30: 1–7.
  • Wang, Panqu, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, and Garrison Cottrell. 2018. “Understanding Convolution for Semantic Segmentation.” Paper presented at the 2018 IEEE Winter Conference On Applications of Computer Vision (WACV).
  • Wang, Hao, Xiaolei Lv, Kaiyu Zhang, and Bin Guo. 2022. “Building Change Detection Based on 3D Co-Segmentation Using Satellite Stereo Imagery.” Remote Sensing 14 (3): 628. https://doi.org/10.3390/rs14030628.
  • Wang, Shuyang, Xiaodong Mu, Dongfang Yang, Hao He, and Peng Zhao. 2021. “Road Extraction from Remote Sensing Images Using the Inner Convolution Integrated Encoder-decoder Network and Directional Conditional Random Fields.” Remote Sensing 13 (3): 465. https://doi.org/10.3390/rs13030465.
  • Wang, Wei, and Zhiguo Qu. 2022. “Design of Public Building Space in Smart City Based on Big Data.” Journal of Environmental and Public Health 2022:1–10. https://doi.org/10.1155/2022/4733901.
  • Wang, Mengqi, Yinglin Wang, Bozhao Li, Zhongliang Cai, and Mengjun Kang. 2022. “A Population Spatialization Model at the Building Scale Using Random Forest.” Remote Sensing 14 (8): 1811. https://doi.org/10.3390/rs14081811.
  • Wang, Wenhai, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. 2021. “Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction Without Convolutions.” Paper presented at the Proceedings of the IEEE/CVF International Conference on Computer Vision.
  • Wu, Guangming, Xiaowei Shao, Zhiling Guo, Qi Chen, Wei Yuan, Xiaodan Shi, Yongwei Xu, and Ryosuke Shibasaki. 2018. “Automatic Building Segmentation of Aerial Imagery Using Multi-constraint Fully Convolutional Networks.” Remote Sensing 10 (3): 407. https://doi.org/10.3390/rs10030407.
  • Xia, Liegang, Shulin Mi, Junxia Zhang, Jiancheng Luo, Zhanfeng Shen, and Yubin Cheng. 2023. “Dual-stream Feature Extraction Network Based on CNN and Transformer for Building Extraction.” Remote Sensing 15 (10): 2689. https://doi.org/10.3390/rs15102689.
  • Xiao, Xiao, Guo Wenliang, Chen Rui, Hui Yilong, Wang Jianing, and Zhao Hongyu. 2022. “A Swin Transformer-based Encoding Booster Integrated in U-Shaped Network for Building Extraction.” Remote Sensing 14. https://doi.org/10.3390/rs14112611.
  • Xu, Lele, Ye Li, Jinzhong Xu, Yue Zhang, and Lili Guo. 2023. “BCTNet: Bi-branch Cross-fusion Transformer for Building Footprint Extraction.” IEEE Transactions on Geoscience and Remote Sensing 61:1–14.
  • Yang, Zhilin, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. 2019. “Xlnet: Generalized Autoregressive Pretraining for Language Understanding.” Advances in Neural Information Processing Systems 32: 1–9.
  • Yang, Lingxiao, Ru-Yuan Zhang, Lida Li, and Xiaohua Xie. 2021. “SimAM: A Simple, Parameter-Free Attention Module for Convolutional Neural Networks.” In Proceedings of the 38th International Conference on Machine Learning, edited by Meila Marina and Zhang Tong, 11863–11874. Virtual Event. http://proceedings.mlr.press/v139/yang21o.html.: Proceedings of Machine Learning Research: PMLR.
  • Yu, Fisher, Vladlen Koltun, and Thomas Funkhouser. 2017. “Dilated Residual Networks.” Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Yuan, Wei, Jin Wang, and Wenbo Xu. 2022. “Shift Pooling PSPNet: Rethinking Pspnet for Building Extraction in Remote Sensing Images from Entire Local Feature Pooling.” Remote Sensing 14 (19): 4889. https://doi.org/10.3390/rs14194889.
  • Zhang, Hu, Keke Zu, Jian Lu, Yuru Zou, and Deyu Meng. 2022. “EPSANet: An Efficient Pyramid Squeeze Attention Block on Convolutional Neural Network.” Paper presented at the Proceedings of the Asian Conference on Computer Vision.
  • Zhao, Hengshuang, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. 2017. “Pyramid Scene Parsing Network.” Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Zhu, Lei, Xinjiang Wang, Zhanghan Ke, Wayne Zhang, and Rynson WH Lau. 2023. “BiFormer: Vision Transformer with Bi-Level Routing Attention.” Paper presented at the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.