Research Article

Scene classification for remote sensing image of land use and land cover using dual-model architecture with multilevel feature fusion

Article: 2353166 | Received 08 Dec 2023, Accepted 04 May 2024, Published online: 17 May 2024

ABSTRACT

Scene classification for remote sensing images (RSIs) of land use and land cover (LULC) involves identifying discriminative features of interest across different classes. Spurred by the powerful feature extraction capability of Convolutional Neural Networks (CNNs), LULC classification for RSI has developed rapidly in recent years. Although multi-model approaches classify better than single models, how to combine multiple models effectively remains the key to maximizing classification accuracy. This paper therefore proposes a dual-model architecture with multilevel feature fusion, called XE-Net. Specifically, high-, middle-, and low-level features are extracted by Xception and EfficientNetV2, whose weight parameters are initialized from ImageNet via transfer learning. The designed sibling feature fusion algorithm then fuses the three levels of features extracted from the two models level by level, and the proposed multi-scale feature fusion method enhances the fused three-scale features to yield a more discriminative representation. Finally, the discriminative feature is fed into the classifier to obtain the classification results. In extensive experiments, XE-Net attains a maximum average overall accuracy of 96.84% on the RSSCN-7 dataset, and 99.58%, 99.37%, 97.07%, 95.03%, and 95.78% on WHU-19, UCM-21, OPTIMAL-31, NWPU-RESISC45, and AID, respectively, demonstrating our model's superiority.

1. Introduction

Scene classification for RSI of LULC refers to recognizing and classifying feature characteristics on RSI using computers by establishing the relationship between artificially defined classes and image pixels (Ekim and Sertel Citation2021). It is a research hotspot in the intelligent interpretation of high-resolution RSI and an indispensable link in the study of land use and change in earth observation technology (Naushad, Kaur, and Ghaderpour Citation2021). This classification is based on image recognition technology, utilizing automatic classification of RSI by extracting and analyzing the spectral characteristics, texture, shape, and other information on the images. With the continuous development of remote sensing technology and the optimization of computer algorithms, the accuracy and efficiency of LULC RSI classification have been continuously improved, providing a more reliable and effective scientific basis for land use planning, management, and decision-making (Temenos et al. Citation2023).

High-quality image data is important for obtaining high-precision classification models. RSI data mainly originate from satellite, airborne, and unmanned aircraft remote sensing technologies (Wang et al. Citation2023). Due to its wide coverage and short cycle time, satellite remote sensing technology has become the most commonly used and mature RSI data acquisition method. Airborne and UAV remote sensing techniques are also used to acquire LULC data because of the higher resolution of the RSIs they acquire. However, although diversified data acquisition methods bring convenience to the classification for RSI of LULC, they also bring new challenges (Cheng et al. Citation2020). For example, RSIs are usually rich in pixel content, and images of a certain class often contain many features of other classes. Most of these features that differ from the image class have little relevance to its core features and may even affect the model's classification accuracy. As illustrated in Figure 1, the overall scene is labeled 'River', but the pixel content involves the various morphological features of 'River', 'Bridge', 'Forest', 'Parking', 'Meadow', and other classes.

Figure 1. The class 'River' contains many features from other classes.

Meanwhile, RSIs often exhibit inter-class similarity and intra-class variability, which easily confuse classification. For instance, 'Desert' and 'Mountain' have similar color characteristics (Figure 2(a)), and 'Runway' and 'Freeway' are structurally similar (Figure 2(b)). Besides, samples of 'Meadow' differ considerably in color (Figure 2(c)), and samples of 'Industry' show different structural characteristics (Figure 2(d)). Moreover, some feature classes contain many more images than others, producing class imbalance in RSI datasets of LULC, e.g. the AID dataset (Xia et al. Citation2017) depicted in Figure 3. Obviously, all these issues affect the classification level of different classes of RSIs to some extent.

Figure 2. Similarities in the different classes and differences in the same classes.

Figure 3. The sample sizes for the different classes are not balanced in the AID dataset.

Researchers have conducted several studies to reduce the impact of these problems and improve the classification accuracy for RSI of LULC (Dutta and Das Citation2023). Early classification methods mainly relied on handcrafted features, whose low- and mid-level features lack high-level semantic information and suffer from various limitations (Mehmood et al. Citation2022). Recently, with the development of computers, neural networks, optimization algorithms, open-source frameworks, and RSI datasets, deep learning methods have received increasing attention for classifying the RSI of LULC (Adegun, Viriri, and Tapamo Citation2023). Among advanced networks, the Vision Transformer (Dosovitskiy et al. Citation2020) applies the Transformer to image processing and can model global information without the locality constraint of convolution; however, it suffers from high computational complexity and low processing efficiency for large images and for tasks with fine-grained features. Graph Neural Networks (Zhou et al. Citation2020) can model local relationships and global structures in images but impose a high computational complexity and handle image translation and rotation poorly. In contrast, CNNs specialize in image processing, effectively capture local relations and spatial structures, have strong feature extraction ability, and have been widely used in practice. However, CNN-based classification for RSI of LULC still faces three problems (Li et al. Citation2021). First, CNNs require a large number of labeled training samples, while current labeled datasets tend to be small, far from the millions of samples typically required; class imbalance among samples is a further concern. Second, pixel features in RSIs are so diverse and complexly distributed that it is difficult to finely separate similar classes with a single classifier. Third, existing fusion of multiple CNNs is usually too simplistic and may introduce redundancy or cancel out information, weakening the expression of discriminative features.

Thus, to solve the above problems, this paper employs data augmentation to increase the amount and richness of the training data. Furthermore, the proposed method uses a sampling scheme that balances the training-set samples, i.e. approximately equalizes the number of samples per class. Considering the complex content and structure of RSI, we choose two different CNNs to enhance the learning of RSI features. Finally, we design two new fusion algorithms that fully fuse global and local features and prevent the fusion process from producing redundancy or losing information. Specifically, our contribution is reflected in four aspects:

  1. We propose a new dual-model structure to improve the classification accuracy for RSI of LULC by compensating for the deficiencies of a single CNN model in feature extraction. The proposed structure leverages the powerful ability of CNNs to express semantics from RSI data using only a few samples.

  2. We design sibling and multi-scale feature fusion methods to enhance the effect of fusing convolutional features from different structures and to reduce information redundancy and loss. This strategy enriches the deep semantic features of RSIs for LULC.

  3. We evaluated the proposed method on RSSCN-7, WHU-19, UCM-21, and OPTIMAL-31, demonstrating that our approach is effective and advanced.

  4. We discuss the applicability and limitations of our method on the AID dataset with prominent class imbalance and the NWPU-RESISC45 dataset with a large sample size, demonstrating our method’s appealing applicability.

2. Related works

2.1. RSI data pre-processing methods

Data pre-processing for RSI exploits the available data to produce more data with balanced classes without collecting new data. Typical methods mainly include geometric, color, and pixel transformations (Alomar, Aysel, and Cai Citation2023). Geometric transformations commonly involve flipping and rotating, cropping and scaling, and image shifting with edge filling, while color transformations mainly convert images between color spaces such as RGB, HSV, and LAB. Pixel transformations mainly add noise to the original image or fuse images. Choosing appropriate image processing methods for a given RSI classification task is necessary to complete data expansion. The main methods for addressing class-sample imbalance in RSI are data-level sampling (Rendón et al. Citation2020) and loss-function adjustment (Wang et al. Citation2021). Sampling methods include data sampling and class-balanced sampling: data sampling uses up-sampling (adding images) or down-sampling (deleting images) to balance the data, while class-balanced sampling randomly selects samples from one or several classes during training, giving each class a balanced opportunity to participate in the training process. The loss-adjustment method reduces the under-fitting of small-sample classes caused by class imbalance by increasing the weight of small-sample misclassification in the loss function; in essence, it shifts the model's attention toward the small classes.
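As a concrete illustration of the two balancing strategies above, the following minimal PyTorch sketch shows class-balanced sampling via a weighted sampler and, as a commented alternative, loss-function reweighting; names such as make_balanced_sampler and train_labels are illustrative, not taken from the paper.

```python
from collections import Counter
from torch.utils.data import WeightedRandomSampler

def make_balanced_sampler(labels):
    """Class-balanced sampling: every class is drawn with roughly equal
    probability regardless of how many images it contains."""
    counts = Counter(labels)                      # images per class
    weights = [1.0 / counts[y] for y in labels]   # rarer class -> larger weight
    return WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)

# Usage sketch (train_set is any Dataset whose per-image labels are train_labels):
# sampler = make_balanced_sampler(train_labels)
# loader = torch.utils.data.DataLoader(train_set, batch_size=48, sampler=sampler)
#
# Loss-function alternative mentioned above: penalise mistakes on small classes
# more heavily.
# class_weights = torch.tensor([1.0 / counts[c] for c in sorted(counts)])
# criterion = torch.nn.CrossEntropyLoss(weight=class_weights)
```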

2.2. Artificial feature classification methods

These methods rely on manually designing or selecting traditional RSI features (Xia et al. Citation2017). Early methods usually segmented an RSI into small chunks, extracted spectral, textural, or structural information from them, and then used the feature distribution of these chunks for class differentiation. Commonly used algorithms included the scale-invariant feature transform (Tang, Liu, and Xiong Citation2019), local binary patterns (Chen et al. Citation2016), color histograms (Kumar and Saravanan Citation2013), and the GIST descriptor (Oliva and Torralba Citation2006). With the rising demand for high-quality classification, manual features came to contain more complex information, and a common approach is to encode different features into a richer one, such as spatial pyramid matching (Lazebnik, Schmid, and Ponce Citation2006), locality-constrained linear coding (Wang et al. Citation2010), and the vector of locally aggregated descriptors (Jégou et al. Citation2012). These methods realize the scene classification task for RSI given a certain amount of data. However, limited by manual experience and workload, how to extract RSI features efficiently remains a key issue.

2.3. Deep feature classification methods

These methods actively learn RSI features through neural networks, realizing 'end-to-end' scene classification (Thapa et al. Citation2023). However, the features extracted from different neural networks, or from different layers of the same network, differ considerably. When a single network or a single layer of deep features is used alone for classification, the prediction is often not ideal because the semantic information of the image is incomplete. Currently, there are two main classes of fusion-based RSI scene classification methods.

One type is feature-level fusion, i.e. the deep features of RSI are fused by concatenation, summation, or multiplication. For instance, four deep features from ResNet34 have been enhanced by multiplication and addition according to contextual relationships and then used for classification (Hou et al. Citation2023). The original features have also been replaced by multiplying attention coefficients with the features of EfficientNet-B3 (Alhichri et al. Citation2021). Both methods classify better than using a single feature layer but are limited by a single network structure, making it challenging to attain higher accuracy. Therefore, Shen et al. (Citation2021) utilized low-, middle-, and high-level features extracted from ResNet-50 and DenseNet-121, which were attentively augmented and summed element by element to obtain discriminative features. Although this method is better than a single CNN, the fusion is simple and prone to feature redundancy. Besides, Zhang, Tang, and Zhao (Citation2019) utilized CapsNet instead of the fully connected layer of Inception-V3, while Peng et al. (Citation2022) incorporated a graph attention network (GAT) and a graph convolutional network (GCN) into ResNet-50. Although these approaches can improve the classification of CNNs, the results tend to depend on the characteristics of the non-convolutional networks, which are more demanding in terms of input data and usually slow down training.

The other type is decision-level fusion, which reduces the over-fitting and bias that may arise from a single model by combining the predictions of multiple models. For example, Alhichri (Citation2023) fused the prediction probabilities of five CNN models on RSI data. This approach can capitalize on the benefits of multiple models, but the computational and model complexity also increase greatly.

3. Proposed method

3.1. The method of XE-Net

Every deep learning framework has its unique function, and different architectures may present different advantages on the same task. Fusing different frameworks can extend the architecture, change the function, and even adapt to richer and more demanding task requirements. Based on this concept, we propose a dual-model architecture, called XE-Net, to overcome the shortcomings of a single model on LULC classification (Figure 4).

Figure 4. The overall architecture of the XE-Net.

Specifically, we first pre-process the LULC data to fit the input requirements of the architecture: the image is randomly cropped to 256×256, randomly rotated within (−45, 45) degrees, flipped horizontally with a probability of 0.5, and then center-cropped to 224×224. Finally, we convert the shape from (H×W×C) to (C×H×W) and the pixel range from [0, 255] to [0, 1], and standardize the values to a standard normal distribution. Then, we adopt Xception as the upper-branch convolutional feature extractor and EfficientNetV2 as the lower-branch extractor, design a novel sibling fusion strategy to fuse the different features extracted from the dual branches, and design a multi-scale fusion algorithm to fuse the high-level, middle-level, and low-level convolutional features into the discriminative feature. Finally, we input the discriminative feature into the classifier for LULC classification.
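The following torchvision sketch is one possible realization of the pre-processing pipeline described above; the ImageNet normalization statistics are an assumption, since the text only states that pixel values are converted to a standard normal distribution.

```python
from torchvision import transforms

# A possible equivalent of the described pre-processing; the mean/std values
# below are the usual ImageNet statistics and are assumed, not stated in the paper.
preprocess = transforms.Compose([
    transforms.RandomCrop(256),                 # random 256x256 crop
    transforms.RandomRotation(45),              # rotation in (-45, 45) degrees
    transforms.RandomHorizontalFlip(p=0.5),     # horizontal flip with p = 0.5
    transforms.CenterCrop(224),                 # 224x224 crop about the centre
    transforms.ToTensor(),                      # HxWxC [0,255] -> CxHxW [0,1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```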

3.1.1. Branch features extraction

We chose Xception (Chollet Citation2017) as the backbone network for the upper-branch features. Xception adopts the idea of depthwise separable convolution, where the input feature map is first spatially convolved channel by channel (depthwise) and then combined across channels by 1×1 pointwise convolution. This network better captures the spatial information and inter-channel correlation in the input feature map, reduces the number of parameters and the computational complexity, and improves the model's efficiency (Figure 5). Moreover, Xception adopts the idea of the Inception module, which uses multiple convolutional kernels of different sizes to process the input feature maps in parallel and capture feature information at different scales. Such multi-scale processing helps to capture both the details and the global information in the image.

Figure 5. Schematic representation of the core ideas for Xception.
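To make the operation concrete, a minimal PyTorch sketch of a depthwise separable convolution, the building block Xception relies on, is given below; it illustrates the idea only and is not the authors' implementation.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Per-channel spatial (depthwise) convolution followed by a 1x1
    pointwise convolution that mixes information across channels."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=padding, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# e.g. DepthwiseSeparableConv(64, 128)(torch.randn(1, 64, 56, 56)).shape
# -> torch.Size([1, 128, 56, 56])
```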

In contrast, we chose EfficientNetV2 (Tan and Le Citation2021) as the backbone network for extracting the lower-branch features for RSI of LULC. EfficientNetV2 is an improvement and extension of EfficientNet (Tan and Le Citation2019), mainly in three aspects. Specifically, compound scaling is utilized to uniformly scale the network's depth, width, and resolution, increasing the model size while maintaining high efficiency. Stochastic depth randomly discards a portion of the layers during training to reduce over-fitting. The Fused-MBConv and MBConv modules (Figure 6) connect the low-resolution and high-resolution stages to improve the model's performance without adding too many parameters or too much computational complexity.

Figure 6. Diagram of the core modules for EfficientNetV2.

When extracting primary features using the dual models, we imported for each branch the pre-training parameters learned on the ImageNet dataset (Russakovsky et al. Citation2015) to reduce the impact of limited data and to accelerate the training of the overall framework for faster convergence. Through continuous fine-tuning, we extracted the features $F_1^{up}$, $F_2^{up}$, $F_3^{up}$ from the upper branch and $F_1^{low}$, $F_2^{low}$, $F_3^{low}$ from the lower branch, employing the last three layers of convolutional features, which have the richest receptive fields.
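As an illustration of how such pre-trained multilevel features can be obtained in practice, the sketch below uses the timm library with features_only=True; the specific model names ('xception', 'tf_efficientnetv2_s') and stage indices are assumptions about suitable backbones rather than the authors' exact configuration.

```python
import timm
import torch

# Each backbone returns its last three feature stages, initialized from ImageNet;
# the model names below are assumed stand-ins for the paper's Xception and
# EfficientNetV2 branches.
upper = timm.create_model('xception', pretrained=True,
                          features_only=True, out_indices=(2, 3, 4))
lower = timm.create_model('tf_efficientnetv2_s', pretrained=True,
                          features_only=True, out_indices=(2, 3, 4))

x = torch.randn(1, 3, 224, 224)
f_up = upper(x)    # [F1_up, F2_up, F3_up]: progressively deeper features
f_low = lower(x)   # [F1_low, F2_low, F3_low]
```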

3.1.2. Sibling feature fusion

Since $F_1^{up}$, $F_2^{up}$, $F_3^{up}$ and $F_1^{low}$, $F_2^{low}$, $F_3^{low}$ correspond to the same feature levels of the same training samples but are extracted from different networks, fusing them provides richer sample information. Inspired by the attention mechanism (Fan et al. Citation2019), we generate attention weight coefficients for the features using the Sigmoid and Softmax functions, obtain the attention features by multiplying these coefficients element-wise with the input features, and then fuse the attention features from the two branches into new attention features by element-wise summation. Meanwhile, to alleviate the vanishing-gradient problem that fusion may cause, and inspired by the shortcut connection (Seung, Yoo, and Shin Citation2023), the input features are added into the fusion result by element-wise summation, which improves the network's expressive ability and allows it to learn more complex feature representations. Furthermore, by learning identity mappings, the network can adapt to more complex mapping functions, easing optimization and training.

Figure 7 presents the proposed algorithm flow. First, $F_i^{up}$ and $F_i^{low}$ are input to separate $1\times1$ convolutional layers, producing features with the same number of channels:

$$F_{ii}^{up} = \mathrm{Conv}_{1\times1}^{up}(F_i^{up}, \delta^{up}) \qquad (1)$$

$$F_{ii}^{low} = \mathrm{Conv}_{1\times1}^{low}(F_i^{low}, \delta^{low}) \qquad (2)$$

where $\mathrm{Conv}_{1\times1}^{up}$ and $\mathrm{Conv}_{1\times1}^{low}$ denote two-dimensional convolutions with a $1\times1$ kernel whose output channels equal the channels of $F_i^{up}$, and $\delta^{up}$ and $\delta^{low}$ are the weight parameters of the corresponding convolutional layers. Then, $F_{ii}^{up}$ and $F_{ii}^{low}$ are concatenated along the channel dimension and passed through $\mathrm{Conv}_{1\times1}^{c}$, which has twice as many output channels as $\mathrm{Conv}_{1\times1}^{up}$:

$$F_{ii}^{c} = \mathrm{Conv}_{1\times1}^{c}(\mathrm{Concat}(F_{ii}^{up}, F_{ii}^{low}), \delta^{c}) \qquad (3)$$

Next, the Split function slices $F_{ii}^{c}$ into two parts along the channel dimension, which are convolved and then passed through the Sigmoid and Softmax functions in turn to obtain the weight coefficients $w_{ii}^{up}$ and $w_{ii}^{low}$:

$$w_{ii}^{up} = \mathrm{Softmax}(\mathrm{Sigmoid}(\mathrm{Conv}_{1\times1}^{up}(\mathrm{Split}(F_{ii}^{c})[0], \delta^{s}))) \qquad (4)$$

$$w_{ii}^{low} = \mathrm{Softmax}(\mathrm{Sigmoid}(\mathrm{Conv}_{1\times1}^{up}(\mathrm{Split}(F_{ii}^{c})[1], \delta^{s}))) \qquad (5)$$

Finally, $w_{ii}^{up}$ and $w_{ii}^{low}$ are multiplied element-wise with $F_{ii}^{up}$ and $F_{ii}^{low}$, respectively; the two products are summed element-wise and then added element-wise to $F_{ii}^{up}$ and $F_{ii}^{low}$ to obtain $F_i^{s}$ from $F_i^{up}$ and $F_i^{low}$:

$$F_i^{s} = (w_{ii}^{up} \otimes F_{ii}^{up}) \oplus (w_{ii}^{low} \otimes F_{ii}^{low}) \oplus (F_{ii}^{up} \oplus F_{ii}^{low}) \qquad (6)$$

where $\otimes$ denotes element-wise multiplication and $\oplus$ element-wise summation. In this way, $F_1^{s}$, $F_2^{s}$, and $F_3^{s}$ are obtained.

Figure 7. The module of sibling feature fusion.
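A minimal PyTorch sketch of the sibling fusion described by Equations (1)-(6) is given below, assuming the two branch features have already been brought to the same spatial size; the choice of the channel dimension as the softmax axis is an assumption, since the paper does not specify it.

```python
import torch
import torch.nn as nn

class SiblingFusion(nn.Module):
    """Illustrative implementation of Eqs. (1)-(6): align channels, derive
    attention weights from the concatenated features, and fuse with a shortcut."""
    def __init__(self, up_ch, low_ch):
        super().__init__()
        c = up_ch
        self.conv_up = nn.Conv2d(up_ch, c, 1)     # Eq. (1)
        self.conv_low = nn.Conv2d(low_ch, c, 1)   # Eq. (2)
        self.conv_c = nn.Conv2d(2 * c, 2 * c, 1)  # Eq. (3)
        self.conv_s = nn.Conv2d(c, c, 1)          # 1x1 convolution in Eqs. (4)-(5)

    def forward(self, f_up, f_low):
        fu = self.conv_up(f_up)
        fl = self.conv_low(f_low)
        fc = self.conv_c(torch.cat([fu, fl], dim=1))
        a, b = torch.chunk(fc, 2, dim=1)                              # Split
        w_up = torch.softmax(torch.sigmoid(self.conv_s(a)), dim=1)    # Eq. (4)
        w_low = torch.softmax(torch.sigmoid(self.conv_s(b)), dim=1)   # Eq. (5)
        return w_up * fu + w_low * fl + (fu + fl)                     # Eq. (6)
```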

3.1.3. Discriminant feature enhancement

Different layers of convolutional features correspond to different sizes of receptive fields. The convolutional features in the lower layers mainly contain the details and local features of the input image, while the higher layers contain broader contextual information. To utilize the different scale features for RSI of LULC more effectively and obtain rich features, we designed a multi-scale feature enhancement algorithm from high-level to low-level layers to obtain more comprehensive feature representations.

Specifically, as shown in Figure 8, the channel dimensions of $F_1^{s}$, $F_2^{s}$, and $F_3^{s}$ are adjusted to obtain $F_{11}^{s}$, $F_{22}^{s}$, and $F_{33}^{s}$ with equal channel dimensions. In Equation 7, the number of output channels of the 2D convolution is set to 256, BN denotes batch normalization, which keeps the activations from exploding, and ReLU adds nonlinearity to the network and helps avoid vanishing gradients:

$$F_{ii}^{s} = \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_{1\times1}^{i}(F_i^{s}, \delta^{i}))) \qquad (7)$$

Then, $F_{33}^{s}$ is enlarged to twice its original size and added element-wise to $F_{22}^{s}$ to obtain $F_{23}^{s}$:

$$F_{23}^{s} = \mathrm{Conv}_{3\times3}((\mathrm{BI}(F_{33}^{s}) \oplus F_{22}^{s}), \delta^{23}) \qquad (8)$$

where BI denotes bilinear interpolation and $\mathrm{Conv}_{3\times3}$ is a convolution with a $3\times3$ kernel, a stride of 1, and a padding of 1.

Figure 8. The module of discriminant feature enhancement.

Similarly, $F_{23}^{s}$ is enlarged by a factor of 2 and added element-wise to $F_{11}^{s}$ to obtain the discriminative feature $F_{123}^{s}$:

$$F_{123}^{s} = \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}_{3\times3}((\mathrm{BI}(\mathrm{ReLU}(\mathrm{BN}(F_{23}^{s}))) \oplus F_{11}^{s}), \delta^{123}))) \qquad (9)$$
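The following PyTorch sketch illustrates the high-to-low enhancement of Equations (7)-(9), assuming the usual factor-of-two spatial relation between adjacent feature levels; it is illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleEnhance(nn.Module):
    """Top-down fusion of the three sibling-fused features (Eqs. (7)-(9));
    the 256-channel width follows the text."""
    def __init__(self, ch1, ch2, ch3, mid=256):
        super().__init__()
        def reduce(c):  # Eq. (7): 1x1 conv + BN + ReLU to a common width
            return nn.Sequential(nn.Conv2d(c, mid, 1),
                                 nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.reduce1, self.reduce2, self.reduce3 = reduce(ch1), reduce(ch2), reduce(ch3)
        self.conv23 = nn.Conv2d(mid, mid, 3, padding=1)                    # Eq. (8)
        self.bn23 = nn.BatchNorm2d(mid)
        self.conv123 = nn.Sequential(nn.Conv2d(mid, mid, 3, padding=1),    # Eq. (9)
                                     nn.BatchNorm2d(mid), nn.ReLU(inplace=True))

    def forward(self, f1s, f2s, f3s):
        f11, f22, f33 = self.reduce1(f1s), self.reduce2(f2s), self.reduce3(f3s)
        up3 = F.interpolate(f33, scale_factor=2, mode='bilinear', align_corners=False)
        f23 = self.conv23(up3 + f22)                                       # Eq. (8)
        up23 = F.interpolate(torch.relu(self.bn23(f23)), scale_factor=2,
                             mode='bilinear', align_corners=False)
        return self.conv123(up23 + f11)          # discriminative feature F123_s
```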

3.1.4. Multi-class prediction

The cross-entropy loss function is commonly used in deep learning to measure the difference between the model output and the true label and to drive the optimization of the model parameters. This module fits the model by minimizing the cross-entropy loss (Equation 10) with an adaptive optimization algorithm to obtain the best classification model:

$$loss = -\frac{1}{N}\sum_{i}\sum_{m=1}^{M} h_{im}\log(p_{im}) \qquad (10)$$

where $M$ denotes the number of LULC classes, $h_{im}$ is an indicator that equals 1 if the true class of the $i$-th sample is $m$ and 0 otherwise, and $p_{im}$ denotes the predicted probability that the $i$-th sample belongs to class $m$. Specifically, the discriminative feature $F_{123}^{s}$ is passed through adaptive average pooling to obtain the global feature $F_{123}^{pool}$ with a spatial dimension of $1\times1$ and an unchanged number of channels. Then, $F_{123}^{pool}$ is flattened along the column direction and input into the fully connected layer, whose output dimension is set to the number of classes, yielding the predicted class vector $label^{P}$. Next, we measure the deviation between $label^{P}$ and the true class vector $label^{T}$ with the cross-entropy loss and keep updating the network parameters with the Adam algorithm, reducing the deviation until $label^{P}$ and $label^{T}$ are as close as possible. The parameters obtained at this point constitute the mapping relationship between the pre-processed RSIs and the classes.
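A minimal sketch of the prediction head and objective described above follows; the feature width of 256 follows the text, while the class count and names are illustrative placeholders.

```python
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    """Adaptive average pooling + fully connected layer over the
    discriminative feature F123_s."""
    def __init__(self, in_ch=256, num_classes=45):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)        # global feature, spatial size 1x1
        self.fc = nn.Linear(in_ch, num_classes)    # output dimension = class number

    def forward(self, f123):
        g = self.pool(f123).flatten(1)             # flatten along the columns
        return self.fc(g)                          # predicted class vector label_P

# criterion = nn.CrossEntropyLoss()                # Eq. (10)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=5e-4)
```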

3.2. Evaluation metrics

3.2.1. Overall accuracy

Every RSI of LULC has exactly one defined class, and a prediction is correct when it matches that predefined class. Therefore, the most intuitive evaluation metric is the overall accuracy (OA), i.e. the ratio of correctly predicted samples to the total number of samples in a given LULC dataset (Equation 11):

$$OA = \frac{\sum_{i=1}^{N} \mathbb{1}(f(x_i) = y_i)}{N} \qquad (11)$$

where $N$ denotes the total number of samples, $x_i$ is the $i$-th RSI, $y_i$ is its true label, $\mathbb{1}(\cdot)$ is the indicator function, and $f$ denotes the mapping relation obtained with our method.

3.2.2. Confusion matrix

To analyze the specific classifications within each of the predefined classes of LULC and to capture the proportion of samples incorrectly predicted between classes, we plot the confusion matrices (CM) to represent the prediction results per class in detail. The column coordinates of CM represent the predicted class labels of LULC, the row coordinates represent the true class labels of LULC using the proposed method, and the values of the elements in CM represent the probability of accurate prediction.

3.2.3. Kappa coefficient

The consistency between the predicted and actual classification results was further assessed by computing the kappa coefficient (KC) (Equation 12):

$$KC = \left(OA - \frac{\sum_{i}^{M} (N_i^{T} \times N_i^{P})}{M^{2}}\right) \times \frac{M^{2}}{M^{2} - \sum_{i}^{M} (N_i^{T} \times N_i^{P})} \qquad (12)$$

where $OA$ denotes the overall accuracy, $N_i^{T}$ is the true number of samples of the $i$-th class, $N_i^{P}$ is the predicted number of samples of the $i$-th class, and $M$ denotes the number of RSI classes. KC ranges from −1 to 1, with 1 indicating perfect consistency and −1 indicating complete inconsistency.
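For completeness, the sketch below shows how OA and KC can be computed from predicted and true labels; the kappa here is the standard Cohen's kappa evaluated from the confusion matrix (chance agreement normalized by the total sample count), which is how Equation (12) is usually evaluated in practice.

```python
import numpy as np

def overall_accuracy(y_true, y_pred):
    """Eq. (11): fraction of samples whose predicted class matches the label."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def kappa_coefficient(y_true, y_pred, num_classes):
    """Cohen's kappa from the confusion matrix (labels assumed in 0..num_classes-1)."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                                        # rows: true, cols: predicted
    n = cm.sum()
    po = np.trace(cm) / n                                    # observed agreement (= OA)
    pe = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / n ** 2    # chance agreement
    return float((po - pe) / (1 - pe))
```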

4. Experiments

4.1. Datasets

4.1.1. RSSCN-7

The dataset images have a pixel size of 400 × 400 and cover seven classes of LULC, with 400 images per class and 2800 images in total; each class has a sufficient number of samples, but the number of classes is small (Zou et al. Citation2015) (Figure 9(d)).

4.1.2. WHU-19

The dataset images have a pixel size of 600 × 600 and cover 19 classes of LULC, with roughly 50 images per class and 1005 images in total; image quality and annotation accuracy are high, but the class sample sizes are unbalanced (Sheng et al. Citation2012) (Figure 9(c)).

4.1.3. UCM-21

The images are 256 × 256 pixels, and the dataset contains 21 classes of LULC with 100 images per class, totaling 2100 images. The samples are balanced, but some classes have similar features (Yang and Newsam Citation2010) (Figure 9(b)).

4.1.4. OPTIMAL-31

This dataset contains 256 × 256 images and involves 31 classes of LULC, with 60 images per class, totaling 1860 images. It has more classes and rich spatial resolution, but some images are similar across classes (Wang et al. Citation2018) (Figure 9(a)).

Figure 9. Samples from the datasets for LULC: (a) OPTIMAL-31; (b) UCM-21; (c) WHU-19; (d) RSSCN-7.

4.2. Details

4.2.1. Environmental setting

To verify the proposed method's performance, we implemented the framework in Python 3.7 with the PyTorch learning framework, using PyCharm Community Edition 2020.1 on a Windows 10 operating system. We used four datasets to train and fully verify the proposed method. Table 1 reports the specific details.

Table 1. Introduction of environmental setting.

4.2.2. Hyperparameter setting

For comparison and analysis, we set the training-to-test data ratios to commonly used values. The batch size was 48, and each dataset was trained three times for 100 epochs each to compute the mean and standard deviation of the OA. The learning rate was set to 0.0001 and the weight decay coefficient to 0.0005 to speed up convergence while reducing the risk of over-fitting.
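An illustrative PyTorch training configuration matching these hyperparameters (batch size 48, 100 epochs, learning rate 0.0001, weight decay 0.0005, Adam) might look as follows; model, train_loader, and criterion are placeholders rather than the authors' code.

```python
import torch

def train(model, train_loader, criterion, device='cuda', epochs=100):
    """Minimal training loop with the hyperparameters stated in Section 4.2.2."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=5e-4)
    model.to(device).train()
    for epoch in range(epochs):
        for images, labels in train_loader:   # batches of 48 pre-processed RSIs
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)   # cross-entropy, Eq. (10)
            loss.backward()
            optimizer.step()
```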

4.3. Classification experiments

4.3.1. RSSCN-7

To compare our method with the state-of-the-art (SOTA) algorithms, we randomly extracted 20% and 50% of the data from the RSSCN-7 dataset to train the model; the remaining data were used to test the model's accuracy. Table 2 reports the accuracy of different algorithms on the RSSCN-7 test data. When 20% of the data is randomly chosen for training, our method achieves an average OA of 94.58% on the rest of the data; when 50% is used for training, it achieves 96.84%, demonstrating that the proposed method outperforms the traditional handcrafted methods and several improved CNN methods in accuracy.

Table 2. The OA obtained from RSSCN-7 with different methods.

We plot the confusion matrix and calculate the KC to demonstrate our method's prediction capability on the RSSCN-7 test data for each LULC class. Figure 10(a) shows the model's prediction results when 20% of the data is used for training, highlighting that the probability of accurate prediction for all classes is close to 1, with a KC of 0.953. The large inter-class similarity between 'Industry' and 'Parking' leads to the most serious misclassification, and 'Grass', 'Field', and 'Resident' have similar problems. Figure 10(b) presents the predictions obtained with 50% of the data for training, revealing that the probability of accurate prediction for all classes is even closer to 1, with a KC of 0.968. The confusion problem is mitigated, suggesting that more training data strengthens our model's learning ability.

Figure 10. The CM obtained by our method on the RSSCN-7: (a) 2:8; (b) 5:5.

4.3.2. WHU-19

For comparison with the SOTA algorithms, we randomly selected 40% and 60% of the data from WHU-19 to train the model. Table 3 reports the accuracy of the competitor algorithms on the WHU-19 dataset. When 40% of the data is randomly selected for training, our method achieves an average OA of 99.58%; with 60% of the data for training, it attains 98.91%. Compared with the traditional handcrafted methods and some improved CNN methods, the proposed method is slightly better in accuracy.

Table 3. The OA obtained from WHU-19 with different methods.

To further demonstrate the predictions of our method on the WHU-19 test data for each LULC class, we plot the confusion matrix and calculate the KC. Figure 11(a) illustrates the predictions of the model trained with 40% of the data, showing that almost all classes are predicted accurately with a probability of 1, with a KC of 0.997; the inter-class similarity between 'Port' and 'Airport' leads to a misclassification. Figure 11(b) depicts the predictions obtained with 60% of the data for training, revealing that most classes are predicted accurately with a probability of 1, with a KC of 0.989. Only 'Bridge' and 'Residential', as well as 'Farmland' and 'Bridge', show slight mutual misclassification. These results indicate that our model's learning ability remains similar when the difference in training-set size is not large.

Figure 11. The CM obtained by our method on the WHU-RS19: (a) 4:6; (b) 6:4.

4.3.3. UCM-21

For convenience in comparing the developed scheme with the SOTA algorithms, 50% and 80% of the data from UCM-21 were randomly selected to train the model, with the remaining data used to test the models' accuracy. Table 4 presents the models' accuracy, revealing that for 50% of the data randomly selected for training, the average OA of our method is 98.79%, and for 80% it is 99.37%. Hence, the proposed method is slightly more accurate than the traditional handcrafted methods, several improved CNN methods, and even some modified vision transformers (ViTs).

Table 4. The OA obtained from UCM-21 with different methods.

To further demonstrate the prediction results of our method for each LULC class on the UCM-21 test data, we plot the confusion matrix and calculate the KC. Figure 12(a) shows the predictions made by the model trained with 50% of the data, suggesting that all classes are predicted accurately with a probability close to 1, with a KC of 0.982; 'Dense Residential' and 'Building', as well as 'Medium Residential' and 'Mobile Homepark', are occasionally misidentified as each other. Figure 12(b) illustrates the predictions obtained using 80% of the data, where most classes are predicted accurately with a probability of 1, and the KC is 0.995. Only 'Building' and 'Dense Residential', as well as 'Intersection' and 'Overpass', show minor mutual misclassification. These results also validate the contribution of more training data to the learning ability of our model.

Figure 12. The CM obtained by our method on the UCM-21: (a) 5:5; (b) 8:2.

4.3.4. OPTIMAL-31

We randomly selected 80% of the data from OPTIMAL-31 to train the model, and the rest of the data was used to test the model's accuracy and compare it with the SOTA algorithms. Table 5 reports the accuracy of different algorithms on the OPTIMAL-31 test data. When 80% of the data is randomly selected for training, our method achieves an average OA of 97.07%. Compared with the traditional handcrafted methods, some improved CNN methods, and even some modified ViTs, the proposed method demonstrates slightly higher accuracy.

Table 5. The OA obtained from OPTIMAL-31 with different methods.

We plot the confusion matrix and calculate the KC to demonstrate our approach's predictions for each LULC class on the OPTIMAL-31 test data. Figure 13 shows the predictions made by the model trained with 80% of the data, suggesting that most classes are predicted accurately with a probability of 1, with a KC of 0.973. Only 'Buildings' and 'Dense Residential', as well as 'Intersection' and 'Overpass', show small mutual errors. These results also reflect that our method achieves relatively good results in classifying data with many classes.

Figure 13. The CM obtained by our method on the OPTIMAL-31 (8:2).

4.4. Ablation experiments

4.4.1. Various model structures

Aiming to highlight the superiority of the mixed dual-CNN structure, we compare its experimental results with those of the single-branch structures. Specifically, we use only EfficientNetV2 or only Xception as the backbone of the network, import the respective pre-training parameters in advance, and retain the data pre-processing for LULC. For this setup, the highest feature level of the network is chosen as the discriminative feature for the class prediction module. Figure 14 depicts the corresponding OA results after classification, revealing that, compared with a single EfficientNetV2 or Xception network, the proposed architecture has obvious classification advantages on the different datasets and demonstrates more stable classification performance.

Figure 14. The OA obtained from different structures.

4.4.2. Effects of transfer learning

To verify the significant role played by transfer learning in model training, we also trained the model without any pre-trained weights, used the best model obtained on the training set to predict the test set, and recorded the classification accuracy. Figure 15 illustrates the corresponding results, highlighting that, on the training and test sets of the four LULC datasets, the model initialized with pre-trained weights trains more smoothly, fits better, and attains a higher OA than the model trained without pre-training.

Figure 15. Comparison of training and testing process.

4.4.3. Different fusion approaches

To verify the positive effect of our proposed strategy for sibling (peer-level) feature fusion, we built two commonly used fusion approaches, i.e. element-wise addition and channel-wise concatenation. Moreover, to show that our fusion approach is better than augmented attention mechanisms, we replaced the attention feature part with SE (Hu, Shen, and Sun Citation2018) and CBAM (Woo et al. Citation2018), respectively. The rest of the architecture, e.g. data pre-processing and multilevel fusion, was kept unchanged. Figure 16 depicts the experimental results, showing that our proposed fusion scheme has obvious advantages and obtains the highest OA.

Figure 16. The OA obtained from different fusion methods.

4.4.4. Discriminative feature generation

To analyze the positive effect of the different feature fusions in the multi-scale fusion module on the discriminative features, we designed two alternative ways of generating discriminative features, i.e. directly using the high-level feature after sibling fusion, and fusing the high-level with the middle-level feature. All other conditions were preserved, and the generated discriminative features were input into the class prediction module. The corresponding results are presented in Figure 17, revealing that the three-level feature fusion positively affects the OA of the network and verifying the effectiveness and advancement of the multi-scale fusion algorithm.

Figure 17. The OA obtained from different discriminant features.

5. Discussion

5.1. Applicability analysis

The classification experiments above used four remote sensing image datasets with relatively small data volumes and achieved good classification performance. To explore whether the proposed method also performs well on larger and more complex datasets, we chose NWPU-RESISC45 (Cheng, Han, and Lu Citation2017), which has a large data volume, and the AID dataset, which has a prominent class imbalance problem. For convenient comparison with state-of-the-art algorithms, we adopted the commonly used training ratio, i.e. 20% of the samples per class.

Table 6 lists the OA obtained by different methods on the two datasets, revealing that, on NWPU-RESISC45, the best model, trained on 20% of the samples, predicts the remaining 80% with an average OA of 95.03%. Similarly, our method attains an average OA of 95.78% on the AID dataset. Comparing the OA on the two datasets with the state-of-the-art algorithms shows that our method obtains a relatively high accuracy.

Table 6. The OA obtained from NWPU-RESISC45 and AID with different methods.

These experimental results show our method's superiority and its ability, trained on relatively few samples, to achieve a relatively high classification accuracy on datasets with a large sample size. They also show that our method achieves good classification results on data with obvious class imbalance.

5.2. Limitations analysis

To analyze the possible problems of XE-Net in classifying large datasets, we calculated the Precision, Recall, Specificity, and F1-Score for each class of NWPU-RESISC45. Figure 18 presents the results, showing that our method achieves good classification results on 43 classes, with evaluation indexes close to 1. Only the classes 'Church' and 'Palace' are predicted less effectively than the others. To analyze the reasons, we plot the confusion matrix in Figure 19, which shows that 11% of the samples in the 8th class, 'Church', were predicted as the 28th class, 'Palace', and 14% of the 'Palace' samples were predicted as 'Church'.

Figure 18. The evaluation metrics obtained by our method on the NWPU-RESISC45.

Figure 19. The CM obtained by our method on the NWPU-RESISC45.

Similarly, Figure 20 illustrates the evaluation metrics obtained for the 30 classes of the AID dataset, revealing that our method has some problems only in predicting the 17th class 'Park', the 23rd class 'Resort', the 25th class 'School', and the 27th class 'Square'. From the CM in Figure 21, it can be seen that 6% of the 'Park' samples were predicted as 'Resort', 5% of the 'Resort' samples were predicted as 'Park', 4% of the 'School' samples were predicted as 'Resort', 4% of the 'Square' samples were predicted as 'Resort', and 4% of the 'Square' samples were predicted as 'Center'.

Figure 20. The evaluation metrics obtained by our method on the AID.

Figure 21. The CM obtained by our method on the AID.

These problems arise not only from the datasets but also from our model. Our method conveys only scalar information between adjacent layers and cannot capture the position and pose relationships between high-level and low-level features, or the spatial relationships between feature objects. In addition, the pooling layers we use also discard some features, which may contain valuable signatures. Therefore, XE-Net has limitations in recognizing RSIs with complex spatial-relationship features. In future work, we will consider using different neural networks, such as capsule networks, to further compensate for the information lost while learning features.

6. Conclusion

This paper analyzes the problems of CNN-based methods for LULC classification, such as insufficient data volume and the ease with which features become redundant or are lost, as well as the serious impact of class imbalance, inter-class similarity, and intra-class variability in RSIs of LULC on the classification results. To solve these problems, we propose a new two-branch hybrid model that fuses the features of the Xception and EfficientNetV2 branches while maintaining their respective advantages. Specifically, the insufficient data volume is mitigated by data augmentation, and class imbalance is addressed by a sampling method that keeps the number of samples per class approximately equal during each round of learning. Meanwhile, we use transfer learning to accelerate model fitting. Additionally, the newly designed sibling feature fusion module fuses the different features of the two branches, enhancing both local and global modeling, and the newly designed multi-scale fusion module effectively uses high-, middle-, and low-level features, enriching the discriminative features. Finally, the mapping relationship between RSIs of LULC and semantic classes is established using the cross-entropy loss function with the Adam optimizer.

We conducted extensive experiments on RSSCN-7, WHU-19, UCM-21, and OPTIMAL-31, and the results proved the effectiveness and advancement of the proposed model. Meanwhile, XE-Net showed good applicability to the AID and NWPU-RESISC45 datasets. In the future, we will consider combining other networks to further improve the classification accuracy of RSI for LULC.

Acknowledgements

The authors sincerely thank the anonymous reviewers and journal editors for their valuable suggestions to improve the quality of this article.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Correction Statement

This article has been corrected with minor changes. These changes do not impact the academic content of the article.

Additional information

Funding

This work is supported by the Internal Parenting Program [grant no 145AXL250004000X] and National Natural Science Foundation of China [grant no 42071316].

References

  • Adegun, Adekanmi Adeyinka, Serestina Viriri, and Jules-Raymond Tapamo. 2023. “Review of Deep Learning Methods for Remote Sensing Satellite Images Classification: Experimental Survey and Comparative Analysis.” Journal of Big Data 10 (1): 93. https://doi.org/10.1186/s40537-023-00772-x.
  • Alhichri, Haikel. 2023. “RS-DeepSuperLearner: Fusion of CNN Ensemble for Remote Sensing Scene Classification.” Annals of GIS 29 (1): 121–142.
  • Alhichri, Haikel, Asma S. Alswayed, Yakoub Bazi, Nassim Ammour, and Naif A. Alajlan. 2021. “Classification of Remote Sensing Images using EfficientNet-B3 CNN Model with Attention.” IEEE Access 9: 14078–14094.
  • Alomar, Khaled, Halil Ibrahim Aysel, and Xiaohao Cai. 2023. “Data Augmentation in Classification and Segmentation: A Survey and new Strategies.” Journal of Imaging 9 (2): 46. https://doi.org/10.3390/jimaging9020046.
  • Bazi, Yakoub, Laila Bashmal, Mohamad M. Al Rahhal, Reham Al Dayil, and Naif Al Ajlan. 2021. “Vision Transformers for Remote Sensing Image Classification.” Remote Sensing 13 (3): 516. https://doi.org/10.3390/rs13030516.
  • Chen, Chen, Baochang Zhang, Hongjun Su, Wei Li, and Lu Wang. 2016. “Land-use Scene Classification Using Multi-Scale Completed Local Binary Patterns.” Signal, Image and Video Processing 10: 745–752. https://doi.org/10.1007/s11760-015-0804-2.
  • Chollet, François. 2017. “Xception: Deep Learning with Depthwise Separable Convolutions.” Paper Presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Hawaii, July 2017.
  • Cheng, Gong, Junwei Han, and Xiaoqiang Lu. 2017. “Remote Sensing Image Scene Classification: Benchmark and State of the Art.” Proceeding of the IEEE 105 (10): 1865–1883.
  • Cheng, Gong, Xingxing Xie, Junwei Han, Lei Guo, and Gui-Song Xia. 2020. “Remote Sensing Image Scene Classification Meets Deep Learning: Challenges, Methods, Benchmarks, and Opportunities.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 13: 3735–3756.
  • Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, et al. 2020. “An image is worth 16x16 words: Transformers for image recognition at scale.” arXiv preprint arXiv:2010.11929.
  • Dutta, Suparna, and Monidipa Das. 2023. “Remote Sensing Scene Classification under Scarcity of Labelled Samples—A Survey of the State-of-the-Arts.” Computers and Geosciences 171: 105295.
  • Ekim, Burak, and Elif Sertel. 2021. “Deep Neural Network Ensembles for Remote Sensing Land Cover and Land use Classification.” International Journal of Digital Earth 14 (12): 1868–1881.
  • Fan, Runyu, Lizhe Wang, Ruyi Feng, and Yingqian Zhu. 2019. “Attention Based Residual Network for High-Resolution Remote Sensing Imagery Scene Classification.” Paper Presented at the 2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, July 2019.
  • Gao, Yue, Jun Shi, Jun Li, and Ruoyu Wang. 2020. “Remote Sensing Scene Classification with Dual Attention-Aware Network.” Paper Presented at the 2020 IEEE 5th International Conference on Image, Vision and Computing, Beijing, July 2020.
  • Sheng, Guofeng, Wen Yang, Tao Xu, and Hong Sun. 2012. “High-resolution Satellite Scene Classification using a Sparse Coding based Multiple Feature Combination.” International Journal of Remote Sensing 33 (8): 2395–2412.
  • Hou, Yan-E, Kang Yang, Lanxue Dang, and Yang Liu. 2023. “Contextual Spatial-Channel Attention Network for Remote Sensing Scene Classification.” IEEE Geoscience and Remote Sensing Letters 21: 1–5.
  • Hu, Jie, Li Shen, and Gang Sun. 2018. “Squeeze-and-excitation Networks.” Paper Presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Utah, June 2018.
  • Jégou, Hervé, Florent Perronnin, Matthijs Douze, Jorge Sánchez, Patrick Pérez, and Cordelia Schmid. 2012. “Aggregating Local Image Descriptors Into Compact Codes.” IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (9): 1704–1716. https://doi.org/10.1109/TPAMI.2011.235.
  • Kumar, A. Ramesh, and D. Saravanan. 2013. “Content Based Image Retrieval Using Color Histogram.” International Journal of Computer Science and Information Technologies 4 (2): 242–245.
  • Lazebnik, Svetlana, Cordelia Schmid, and Jean Ponce. 2006. “Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories.” Paper Presented at the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New York, USA, June 2006.
  • Li, Zewen, Wenjie Yang, Shouheng Peng, and Fan Liu. 2021. “A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects.” IEEE Transactions on Neural Networks and Learning Systems 33 (12): 6999–7019.
  • Liu, Ze, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. “Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows.” Paper Presented at the Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, Canada, October 2021.
  • Mehmood, Maryam, Ahsan Shahzad, Bushra Zafar, Amsa Shabbir, and Nouman Ali. 2022. “Remote Sensing Image Classification: A Comprehensive Review and Applications.” Mathematical Problems in Engineering 1–24.
  • Naushad, Raoof, Tarunpreet Kaur, and Ebrahim Ghaderpour. 2021. “Deep Transfer Learning for Land use and Land Cover Classification: A Comparative Study.” Sensors 21 (23): 8083. https://doi.org/10.3390/s21238083.
  • Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. “ImageNet Large Scale Visual Recognition Challenge.” International Journal of Computer Vision 115: 211–252.
  • Oliva, Aude, and Antonio Torralba. 2006. “Building the Gist of a Scene: The Role of Global Image Features in Recognition.” Progress in Brain Research 155: 23–36.
  • Peng, Feifei, Wei Lu, Wenxia Tan, Kunlun Qi, Xiaokang Zhang, and Quansheng Zhu. 2022. “Multi-output Network Combining GNN and CNN for Remote Sensing Scene Classification.” Remote Sensing 14 (6): 1478.
  • Rendón, Eréndira, Roberto Alejo, Carlos Castorena, Frank J. Isidro-Ortega, and Everardo E. Granda-Gutiérrez. 2020. “Data Sampling Methods to Deal with the Big Data Multi-class Imbalance Problem.” Applied Sciences 10 (4): 1276.
  • Seung, Park, Cheol-Hwan Yoo, and Yong-Goo Shin. 2023. “Effective Shortcut Technique for Generative Adversarial Networks.” Applied Intelligence 53 (2): 2055–2067.
  • Shen, Junge, Tianwei Yu, Haopeng Yang, Ruxin Wang, and Qi Wang. 2022. “An Attention Cascade Global–Local Network for Remote Sensing Scene Classification.” Remote Sensing 14 (9): 2042. https://doi.org/10.3390/rs14092042.
  • Shen, Junge, Tong Zhang, Yichen Wang, Ruxin Wang, and Min Qi. 2021. “A Dual-Model Architecture with Grouping-Attention-Fusion for Remote Sensing Scene Classification.” Remote Sensing 13 (3): 433. https://doi.org/10.3390/rs13030433.
  • Tan, Mingxing, and Quoc V. Le. 2019. “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.” Paper Presented at the International Conference on Machine Learning, Long Beach, USA, June 2019.
  • Tan, Mingxing, and Quoc V. Le. 2021. “EfficientNetV2: Smaller Models and Faster Training.” Paper Presented at the International Conference on Machine Learning, Online, July 2021.
  • Tang, Xu, Mingteng Li, Jingjing Ma, Xiangrong Zhang, Fang Liu, and Licheng Jiao. 2022. "EMTCAL: Efficient Multiscale Transformer and Cross-level Attention Learning for Remote Sensing Scene Classification." IEEE Transactions on Geoscience and Remote Sensing 60: 1–15.
  • Tang, Xu, Qiushuo Ma, Xiangrong Zhang, Fang Liu, Jingjing Ma, and Licheng Jiao. 2021. "Attention Consistent Network for Remote Sensing Scene Classification." IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 14: 2030–2045.
  • Tang, Guoliang, Zhijing Liu, and Jing Xiong. 2019. “Distinctive Image Features from Illumination and Scale Invariant Keypoints.” Multimedia Tools and Applications 78: 23415–23442.
  • Temenos, Anastasios, Nikos Temenos, Maria Kaselimi, Anastasios Doulamis, and Nikolaos Doulamis. 2023. “Interpretable Deep Learning Framework for Land use and Land Cover Classification in Remote Sensing Using SHAP.” IEEE Geoscience and Remote Sensing Letters 20: 1–5.
  • Thapa, Aakash, Teerayut Horanont, Bipul Neupane, and Jagannath Aryal. 2023. "Deep Learning for Remote Sensing Image Scene Classification: A Review and Meta-Analysis." Remote Sensing 15 (19):4804.
  • Wang, Sheng, Kaiyu Guan, Chenhui Zhang, Qu Zhou, Sibo Wang, Xiaocui Wu, Chongya Jiang, et al. 2023. “Cross-scale Sensing of Field-Level Crop Residue Cover: Integrating Field Photos, Airborne Hyperspectral Imaging, and Satellite Data.” Remote Sensing of Environment 285: 113366. https://doi.org/10.1016/j.rse.2022.113366.
  • Wang, Jinjun, Jianchao Yang, Kai Yu, Fengjun Lv, Thomas Huang, and Yihong Gong. 2010. “Locality-constrained Linear Coding for Image Classification.” Paper Presented at the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, June 2010.
  • Wang, Lituan, Lei Zhang, Xiaofeng Qi, and Zhang Yi. 2021. “Deep Attention-based Imbalanced Image Classification.” IEEE Transactions on Neural Networks and Learning Systems 33 (8): 3320–3330.
  • Wang, Qi, Shaoteng Liu, Jocelyn Chanussot, and Xuelong Li. 2018. “Scene Classification with Recurrent Attention of VHR Remote Sensing Images.” IEEE Transactions on Geoscience and Remote Sensing 57 (2): 1155–1167.
  • Woo, Sanghyun, Jongchan Park, Joon-Young Lee, and In So Kweon. 2018. “Cbam: Convolutional Block Attention Module.” Paper Presented at the Proceedings of the European Conference on Computer Vision, Munich, Germany, September 2018.
  • Xia, Gui-Song, Jingwen Hu, Fan Hu, Baoguang Shi, Xiang Bai, Yanfei Zhong, Liangpei Zhang, and Xiaoqiang Lu. 2017. “AID: A Benchmark Data Set for Performance Evaluation of Aerial Scene Classification.” IEEE Transactions on Geoscience and Remote Sensing 55 (7): 3965–3981.
  • Xu, Kejie, Hong Huang, and Peifang Deng. 2021. "Remote Sensing Image Scene Classification based on Global–local Dual-branch Structure Model." IEEE Geoscience and Remote Sensing Letters 19: 1–5.
  • Yang, Yi, and Shawn Newsam. 2010. “Bag-of-visual-words and Spatial Extensions for Land-use Classification.” Paper Presented at the Proceedings of the 18th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose, California, November 2010.
  • Yu, Yunlong, and Fuxian Liu. 2018. “A two-Stream Deep Fusion Framework for High-Resolution Aerial Scene Classification.” Computational Intelligence and Neuroscience 2018 (4-5): 1–13.
  • Zhao, Zhicheng, Jiaqi Li, Ze Luo, and Jian Li. 2020. “Remote Sensing Image Scene Classification Based on an Enhanced Attention Module.” IEEE Geoscience and Remote Sensing Letters 18 (11): 1926–1930.
  • Zhou, Jie, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun Jie. 2020. “Graph Neural Networks: A Review of Methods and Applications.” AI Open 1: 57–81. https://doi.org/10.1016/j.aiopen.2021.01.001.
  • Zhang, Bin, Yongjun Zhang, and Shugen Wang. 2019. “A Lightweight and Discriminative Model for Remote Sensing Scene Classification with Multidilation Pooling Module.” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 12 (8): 2636–2653.
  • Zhang, Wei, Ping Tang, and Lijun Zhao. 2019. “Remote Sensing Image Scene Classification using CNN-CapsNet.” Remote Sensing 11 (5): 494.
  • Zou, Qin, Lihao Ni, Tong Zhang, and Qian Wang. 2015. “Deep Learning Based Feature Selection for Remote Sensing Scene Classification.” IEEE Geoscience and Remote Sensing Letters 12 (11): 2321–2325. https://doi.org/10.1109/LGRS.2015.2475299.