Research Article

Global heterogeneous graph convolutional network: from coarse to refined land cover and land use segmentation

Article: 2353110 | Received 18 Dec 2023, Accepted 04 May 2024, Published online: 14 May 2024

ABSTRACT

The abundant details embedded in very-high-resolution remote sensing images establish a solid foundation for comprehending the land surface. Simultaneously, as spatial resolution advances, there is a corresponding escalation in the required granularity of land cover and land use (LCLU) categories. The coarse classes identified necessitate further refinement into more detailed categories. For instance, the ‘built-up’ class can be subdivided into specific categories such as squares, stadiums, and airports. These refined LCLU classifications are better equipped to support diverse domains. Nonetheless, most studies simply adopt methods initially designed for coarse LCLU when addressing the challenging refined LCLU segmentation. Few studies have considered the inherent relationships between coarse and refined LCLU, overlooking the potential exploitation of the numerous recently released LCLU products. To better leverage this prior knowledge, we propose the Global Heterogeneous Graph Convolutional Network (GHGCN). The GHGCN introduces a heterogeneous graph and excels in establishing relationships between coarse and refined LCLU, which can extract long-distance dependencies more effectively than convolutional neural networks. Furthermore, the model is trained end-to-end, eliminating the need for presegmentation and accelerating training. GHGCN exhibits competitive performance compared to state-of-the-art models, indicating its effective design in exploiting coarse LCLU data, especially for categories with limited samples. The source code is released at: https://github.com/Liuzhizhiooo/GHGCN.

1. Introduction

1.1. Motivation

Land cover and land use (LCLU) information is essential for comprehending the land surface since it reflects both the physical characteristics and the socioeconomic activities of the ground surface. Therefore, recognizing LCLU supports various research domains, including environmental science (Foley et al. Citation2005; Sterling, Ducharne, and Polcher Citation2013), ecosystem deterioration (Song et al. Citation2020; Vitousek et al. Citation1997), and climate monitoring (Feddema et al. Citation2005; Pielke Citation2005). With the increase in the spatial resolution of remote sensing imagery, there is tremendous demand for refined LCLU segmentation. Refined LCLU maps can better depict the ground surface and benefit land resource management, urban planning, and operation (Zhu et al. Citation2019; Ge et al. Citation2019; Zhang, Du, and Zhang Citation2019).

Most investigations have been conducted to identify coarse LCLU types due to limitations in temporal and spatial resolution (Esch et al. Citation2017; Gong et al. Citation2019; Zhu et al. Citation2022). Coarse LCLU can be defined as a category system containing a few classes that have a relatively straightforward correlation with the physical characteristics of the Earth’s surface (Comber, Brunsdon, and Farmer Citation2012). As a result, coarse LCLU can often be distinguished solely based on spectral features when employing low- or medium-resolution images. In contrast, refined classification provides a detailed and precise depiction of land surface characteristics and human activities. To illustrate, the coarse category ‘built-up’ can be further subdivided into squares, stadiums, and airports. In essence, refined LCLU entails a detailed and specific classification within the coarse LCLU system, wherein coarse LCLU is systematically dissected into more granular subcategories.

However, a persistent semantic gap (Durand et al. Citation2007; Zhao, Zhong, and Zhang Citation2016) between images and refined LCLU poses a formidable challenge for accurate identification. This gap is defined as the gap between the low-level information extracted from images and the high-level semantic interpretation. Moreover, refined LCLU classification can be divided into two steps: first, basic surface recognition, followed by a spatial reclassification that is performed based on the local spatial pattern of coarse LCLU results (Barnsley and Barr Citation1996; Wharton Citation1982). Compared to classifications based solely on images, this two-step approach significantly reduces complexity and aligns more seamlessly with the human cognitive process. Therefore, incorporating coarse LCLU in addition to images is feasible and attractive for bridging this gap (Albert, Rottensteiner, and Heipke Citation2015). The coarse LCLU information involves essential aspects of the land surface and human activities, facilitating the recognition of intricate patterns in refined LCLU. Additionally, a variety of medium- to high-resolution coarse LCLU classification products, such as FROM-GLC10 (Gong et al. Citation2019), Dynamic World (Brown et al. Citation2022), World Cover (Zanaga et al. Citation2022), Land Use/Land Cover Time Series (Karra et al. Citation2021), and Sinolc-1 (Li et al. Citation2023), have been released by the research community. These products provide a crucial data foundation for refined LCLU classification. Consequently, this investigation is significant for enhancing refined LCLU recognition and exploring the utilization of coarse LCLU results.

1.2. Related work

Harnessing the potential of coarse LCLU results to assist in obtaining precise LCLU categories remains a difficult undertaking. The key to this challenge lies in capturing the implicit and intricate relationships between them. Most existing approaches rely on artificially engineered features to express this relationship, and they can be broadly classified into two categories: (1) sliding window-based methods and (2) region-based methods. The former encompasses frequency-based (Eyton Citation1993; Wharton Citation1982) and co-occurrence matrix-based (Barnsley and Barr Citation1996; Van der Kwast et al. Citation2011; Zhang, Du, and Zhang Citation2018) techniques, and they necessitate manual specification of the window size. However, these methods impose limitations by assuming rectangular shapes for ground objects, thereby restricting the field of view for recognizing spatial patterns and scene structures. To overcome this, the use of an adaptively generated region has been proposed to improve spectral homogeneity (Lv et al. Citation2023). In contrast, region-based methods predominantly rely on adjacency graphs (Barnsley and Barr Citation1997; Barnsley and Barr Citation2000; Barr and Barnsley Citation1997; Walde et al. Citation2014), which enable a more comprehensive description of the spatial and structural relationships among ground objects. Nevertheless, implementing graph-based methods requires a preliminary image segmentation process, which segments an image into numerous parcels. This step is time-consuming and requires human involvement.

Recently, deep learning has achieved remarkable success in remote sensing due to its powerful feature extraction capabilities (Zhu et al. Citation2017; Zhang, Zhang, and Du Citation2016), including object detection (Tang et al. Citation2021; Zhao et al. Citation2022), image fusion (Sun et al. Citation2022; Liu et al. Citation2022; Sun et al. Citation2023), and semantic segmentation (Jiao et al. Citation2021; Zhang, Tang, and Zhao Citation2019). Equipped with a powerful framework for local pattern modeling by end-to-end and hierarchical learning, Convolutional Neural Networks (CNNs) have emerged as the most successful architectures for image segmentation. Prominent examples include the FCN (Long, Shelhamer, and Darrell Citation2015), DeepLabV3+ (Chen et al. Citation2018), BiSeNetV2 (Yu et al. Citation2021), and ConvNeXt (Liu et al. Citation2022). Moreover, Graph Convolution Networks (GCNs) (Kipf and Welling Citation2016) are a generalization of CNNs from raster data to graph data, demonstrating impressive performance in capturing intricate relationships. This is attributed to their ability to perform effective long-distance feature extraction. Consequently, CNNs and GCNs can be regarded as prime candidates for improving window-based and region-based methods, respectively. Furthermore, there is a growing trend to integrate both CNNs and GCNs for semantic segmentation tasks, and fusing these two architectures has proven beneficial (Hu et al. Citation2020; Li and Gupta Citation2018; Niu et al. Citation2022). Specifically, CNNs are treated as backbones for local visual feature extraction, upon which subsequent GCNs extract the global semantic features. Therefore, the core of the overall model resides in the GCN module, given its strong association with high-level semantic feature extraction and its pivotal role in comprehending the intricate relationships embedded in images.

Current works often fall short in effectively exploiting the abundant semantic information contained in coarse LCLU maps since they are built on homogeneous graphs (Niu et al. Citation2022; Su et al. Citation2022). As shown in Figure 1(a), nodes from different categories are treated equally. Moreover, all the nodes are globally connected, which hinders the aggregation of distinguishable features for image segmentation. To overcome this, the CDGC (Hu et al. Citation2020) employs a coarse prediction map to construct class-wise graphs, where edges between nodes of different classes are removed. While this approach facilitates the clustering of nodes belonging to the same class, it ignores inter-class relationships. Similarly, a KNN graph has been employed in which only nodes that are close to each other in the feature space are connected (Su et al. Citation2022). Moreover, inter-class and global dependency relations have been further explored (Liu, Schonfeld, and Tang Citation2021). However, dependency reasoning is performed by group weighted convolutions, which cannot fully express the complicated relationships between visual objects.

Figure 1. Comparison of different graph designs. (a) fully connected homogeneous graph; (b) spatially connected homogeneous graph; (c) class-wise connected homogeneous graph; (d) spatially connected heterogeneous graph; (e) fully connected heterogeneous graph.


Compared to globally connected graphs, spatially connected graphs (Hong et al. Citation2020; Liu et al. Citation2020; Wan et al. Citation2019) focus more on capturing the topological relationships among objects, as shown in Figure 1(b). By assuming that spatially adjacent objects tend to share similar properties, the extracted features are less likely to be contaminated by nodes from different categories. A co-occurrence matrix of node types has been employed to introduce prior category distributions and enhance the modeling capability of homogeneous graphs (Cui et al. Citation2021). However, such prior knowledge is globally shared and not sufficiently precise to instruct node-level feature extraction. Furthermore, CHeGCN introduces heterogeneous graphs and incorporates node class information in the feature aggregation (Liu et al. Citation2022), providing a good solution for exploiting coarse LCLU classification, as depicted in Figure 1(d). Nevertheless, spatially connected graphs are applied only after the images are segmented into parcels, which is time-consuming and lacks adaptability. Moreover, because these methods are composed of local connections, they are limited in global feature extraction and cannot be efficiently accelerated by GPUs.

Furthermore, the coarse LCLU result is employed solely in the graph convolution operation, which is implicit and ineffective. Motivated by the incorporation of depth data (Hazirbas et al. Citation2017), coarse LCLU data are believed to offer valuable geometric cues that can mitigate uncertainty even when identifying refined LCLU categories in homogeneous areas. In summary, existing models exhibit the following limitations: (1) ineffective utilization of coarse LCLU results, (2) insufficient integration of the coarse node category into node feature aggregation, and (3) incomplete training in a fully end-to-end manner due to the need for presegmentation.

1.3. Our contribution

To address the above issues, we propose the Global Heterogeneous Graph Convolutional Neural Network (GHGCN), as shown in Figure 2. The competitive performance of GHGCN can be attributed to the following factors: (1) the introduction of a heterogeneous graph that effectively incorporates heterogeneous data (images and coarse LCLU data) for semantic segmentation, (2) the effective exploitation of coarse LCLU data through three different approaches, and (3) end-to-end training of the GHGCN. Inspired by GloRe (Chen et al. Citation2019), the graph convolution operations in heterogeneous graphs are realized using CNNs. Moreover, this approach eliminates the need for the image presegmentation required for graph construction. These properties make the training process end-to-end and well supported by GPUs. Moreover, the issue of indistinguishable features aggregated from nodes of different categories can be addressed by incorporating heterogeneous graphs, where the edge weights are influenced by the classes of the corresponding nodes.

Figure 2. The architecture of the proposed Global Heterogeneous Graph Convolutional Neural Network (GHGCN). Conv, feat, cls, G-prj, and G-reprj are abbreviations for convolutional neural network, feature map, classifier, graph projection, and graph reprojection, respectively. X0, X, Vl, Z, and Y^ correspond to the symbols used in the formulas presented in the methodology section.


The main contributions of this paper are as follows:

  1. The proposed framework can efficiently extract high-level semantic features crucial for the LCLU segmentation task from heterogeneous data. Instead of concatenating the input image and coarse LCLU mask, GHGCN innovatively employs heterogeneous graphs to extract information from both images and coarse LCLU data in a more integrated manner. Moreover, this framework is adaptable to other inference tasks that involve transitioning from coarse to refined results.

  2. We explore various methods for deeply mining prior knowledge, including both direct and implicit approaches. Specifically, GHGCN adopts three strategies to incorporate coarse LCLU data: (1) leveraging spatial cues extracted by an additional CNN branch that takes coarse LCLU data as input, enhancing the identification of spatial patterns and imposing the spatial constraints of the coarse LCLU data, (2) improving the construction of long-distance relationships by globally creating nodes and aggregating features at the node level, and (3) applying category constraints from the class-wise classifier, which significantly simplifies reclassification since the coarse LCLU categories are ensured in advance.

  3. The proposed GHGCN efficiently performs graph convolution in heterogeneous graphs and extracts long-distance relationships effectively. The graph-based GHGCN model is achieved in an end-to-end fashion and can be accelerated well by GPUs. Specifically, the graph nodes are dynamically generated in a data-driven manner without the requirement for presegmentation, enabling flexible adjustments of node creation during the training process. Moreover, the feature aggregation among heterogeneous nodes is implemented by a CNN, which can be supported by GPUs well.

On the dataset with labels of varying granularity, our model outperforms current state-of-the-art segmentation algorithms. Utilizing local patterns extracted from coarse LCLU data, coupled with the relationships established through heterogeneous graphs, significantly improves the performance of fine-grained LCLU classification. Remarkably, the benefits derived from the coarse LCLU results become more evident as the classification system becomes more precise.

2. Methodology

In this section, the architecture of the proposed GHGCN is described in detail. It follows an encoder-decoder structure, as shown in Figure 2. The encoder can be divided into two components: (1) the CNN module and (2) the GCN module. The CNN module contains two CNN branches. The upper branch extracts geometric cues from coarse LCLU data, while the lower branch focuses more on capturing local features from images. The GCN module applies graph convolution to heterogeneous graphs to establish node-level long-distance relationships. Additionally, the decoder is a class-wise classifier composed of multiple classifiers, where pixels of the same coarse LCLU class share a classifier.

2.1. CNN module

The selection of CNNs as the backbone network for semantic segmentation is prevalent owing to their ability to proficiently capture local patterns. Since pretrained models are recommended when samples are limited, ResNet (He et al. Citation2016) pretrained on ImageNet (Krizhevsky, Sutskever, and Hinton Citation2012) is adopted as the CNN backbone. The CNN module consists of two CNN branches, both of which remove the last four convolutional layers of ResNet-50, specifically the conv5_x layer and the classifier. This reduction keeps the number of model parameters within a desirable range and enhances the model’s resistance to overfitting.

Since the pretrained backbone already provides abundant local features prepared for high-level semantic feature extraction, the backbone network in the CNN module is frozen, except for the first convolutional layer in the upper CNN branch: the input coarse LCLU data are converted using one-hot encoding, so the input channel of this first convolutional layer is adjusted to 5. Freezing the CNN backbone reduces training time and dramatically improves the resistance to overfitting. Moreover, the primary focus of our research lies in capturing abstract semantic information from local features, rather than extracting the local features themselves. Subsequently, the outputs from the two branches are concatenated along the feature dimension. In practice, before being fed into the GCN module and the feature fusion module, the dimensions of the stacked features are reduced from 2048 to 512 by a 1×1 convolutional layer.
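To make the dual-branch design concrete, the following minimal sketch builds a CNN module of this kind, assuming a torchvision ResNet-50 truncated after its 1024-channel stage; the exact truncation point, module names, and freezing logic are our assumptions for illustration, not the authors' released code.

```python
# Sketch of the dual-branch CNN module (assumptions: torchvision ResNet-50,
# truncation after the 1024-channel stage, 5 one-hot coarse LCLU channels).
import torch
import torch.nn as nn
from torchvision.models import resnet50

def build_branch(in_channels, freeze=True):
    """ResNet-50 truncated before conv5_x and the classifier (1024-channel output)."""
    backbone = resnet50(pretrained=True)
    if in_channels != 3:
        # Replace the first conv so it accepts the one-hot coarse LCLU mask.
        backbone.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7,
                                   stride=2, padding=3, bias=False)
    branch = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                           backbone.maxpool, backbone.layer1,
                           backbone.layer2, backbone.layer3)
    if freeze:
        for name, p in branch.named_parameters():
            # Keep only the re-initialized first conv of the LCLU branch trainable.
            p.requires_grad = name.startswith("0.") and in_channels != 3
    return branch

image_branch = build_branch(in_channels=3)   # RGB image input
lclu_branch = build_branch(in_channels=5)    # one-hot GID5 mask input
reduce = nn.Conv2d(2048, 512, kernel_size=1)  # 1024 + 1024 -> 512

x_img = torch.randn(2, 3, 256, 256)
x_lclu = torch.randn(2, 5, 256, 256)          # stands in for a one-hot coarse mask
feat = torch.cat([image_branch(x_img), lclu_branch(x_lclu)], dim=1)
X = reduce(feat)                              # (2, 512, 16, 16)
```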

2.2. GCN module

The introduction to the GCN module contains three components: (1) the basic concepts of the GCN, (2) the implementation of the GCN using a CNN, and (3) the global heterogeneous GCN. A graph $G = (V, E)$ is determined by vertices $V$ and edges $E$. In practice, the vertex set $V$ is represented by a vertex feature matrix $V \in \mathbb{R}^{N \times C}$, and the edge set $E$ is denoted by an adjacency matrix $A \in \mathbb{R}^{N \times N}$, where $N$ and $C$ represent the number of nodes and the dimension of the node features, respectively. The $i$-th row of $V$ corresponds to the feature vector $v_i \in \mathbb{R}^{C}$ of node $i$. Node features are obtained through a process called graph projection, where pixels belonging to the same superpixel are assigned to their corresponding node. In graph projection, the grid-structured feature maps $X_0 \in \mathbb{R}^{H \times W \times C}$ are transformed into a graph-structured representation $V$, where node features are calculated as the average signatures of the pixels involved. The edge weight $A_{ij}$ represents the strength of the connection between node $i$ and node $j$, with higher values indicating a closer relationship between these two nodes.

Each pixel as a node often leads to a large graph with intractable computations. To address this issue, many approaches adopt the Simple Linear Iterative Clustering (SLIC) (Achanta et al. Citation2012) algorithm to generate superpixels that serve as nodes. In GHGCN, the nodes are generated automatically and can be adjusted adaptively during the training process. Specifically, a learnable projection matrix $B \in \mathbb{R}^{N \times L}$ is generated through a 1D convolutional layer, with the reshaped output of the CNN module $X \in \mathbb{R}^{L \times C}$ as its input. Here, $L = H \times W$, where $H$ and $W$ represent the height and width of the feature map, respectively. As a result, the projection function $V = f(X)$, as depicted in Equation 1, is formulated to project the feature maps from the coordinate space to the interaction space, enabling more effective reasoning for long-distance relationships:

(1) $V = f(X) = BX = \mathrm{Conv1D}(X)\,X$

After acquiring the node feature matrix $V$, the forward propagation process of a GCN layer can be expressed as:

(2) $V^{l+1} = \sigma(\tilde{A} V^{l} W^{l})$

where $V^{l} \in \mathbb{R}^{N \times C_l}$ and $V^{l+1} \in \mathbb{R}^{N \times C_{l+1}}$ represent the input and output node feature matrices of the $l$-th layer, respectively. The feature dimension of $V^{l+1}$ is determined by the learnable linear transformation matrix $W^{l} \in \mathbb{R}^{C_l \times C_{l+1}}$. The matrix $\tilde{A}$ denotes a symmetric normalized adjacency matrix computed based on $A$. The activation function $\sigma(\cdot)$ is chosen to be the Rectified Linear Unit (ReLU) (Nair and Hinton Citation2010). The feature aggregation process in a GCN layer can be divided into two steps: (1) information diffusion and (2) state update (Chen et al. Citation2019). The former involves aggregating features across nodes, which corresponds to $\tilde{A} V$. The latter is used for transforming the feature dimensions, which corresponds to $(\tilde{A} V) W$. Based on the above analysis, graph convolution can be achieved by employing two 1D convolution layers in different directions, namely channel-wise and node-wise layers. Therefore, for fully connected graphs, Equation 2 can be rewritten as follows:

(3) $V^{l+1} = \mathrm{Conv1D}\big(\mathrm{Conv1D}(V^{l})^{T}\big)^{T}$

In heterogeneous graphs, considering the node type becomes crucial for feature aggregation. Node categories are assigned during graph projection, which transforms feature maps from the coordinate space to the interaction space. The label of a node is determined by the most frequently occurring land cover category among its associated pixels. In contrast to SLIC, the nodes in our model are generated by a data-driven convolutional layer. However, during the early training phase, there is no guarantee that the pixels within a node will be homogeneous. To mitigate this issue, pixels within a node that do not belong to the coarse LCLU class of the node are excluded, which greatly decreases the variance among the pixels located in a node. In the information diffusion step, meta-paths, denoted as $P$, are introduced. A meta-path $P$ describes the relationship between node $i$ and node $j$, where the relationship is determined by the coarse LCLU labels of the node pair $(i, j)$. In fact, a meta-path can be interpreted as a specific type of edge in heterogeneous graphs. As depicted in Figure 1, the edge $e_{ij}$ is visualized with a gradient color determined by the categories of nodes $i$ and $j$. It is worth noting that the meta-path of node pair $(i, j)$ is the same as that of node pair $(j, i)$ since the graphs are undirected. Consequently, the total number of meta-path types is $m(m+1)/2$, where $m$ denotes the number of coarse LCLU classes.
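Before turning to the heterogeneous case, the sketch below illustrates the graph projection of Equation 1 and the fully connected graph convolution of Equation 3, both realized with 1D convolutions as described above; the module names, tensor shapes, and ReLU placement are our assumptions.

```python
# Sketch of Eq. 1 (graph projection) and Eq. 3 (fully connected GCN layer
# realized with two 1D convolutions); shapes and layer names are assumptions.
import torch
import torch.nn as nn

class GraphProjection(nn.Module):
    """Project grid features X (B, C, H, W) onto N node features V (B, N, C)."""
    def __init__(self, channels=512, num_nodes=64):
        super().__init__()
        # 1x1 conv produces the learnable projection matrix B of shape (N, L).
        self.proj = nn.Conv1d(channels, num_nodes, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        x_flat = x.view(b, c, h * w)                  # (B, C, L)
        B_mat = self.proj(x_flat)                     # (B, N, L), projection matrix B
        V = torch.bmm(B_mat, x_flat.transpose(1, 2))  # V = B X, (B, N, C)
        return V, B_mat

class FullyConnectedGCNLayer(nn.Module):
    """Eq. 3: node-wise diffusion followed by channel-wise state update."""
    def __init__(self, num_nodes=64, channels=512):
        super().__init__()
        self.diffuse = nn.Conv1d(num_nodes, num_nodes, kernel_size=1)  # plays the role of A~ V
        self.update = nn.Conv1d(channels, channels, kernel_size=1)     # plays the role of (.) W
        self.relu = nn.ReLU(inplace=True)

    def forward(self, V):                 # V: (B, N, C)
        V = self.diffuse(V)               # mix information across nodes
        V = self.update(V.transpose(1, 2)).transpose(1, 2)  # mix across channels
        return self.relu(V)

proj, gcn = GraphProjection(), FullyConnectedGCNLayer()
X = torch.randn(2, 512, 16, 16)
V, B_mat = proj(X)
V_out = gcn(V)                            # (2, 64, 512)
```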

After the information diffusion step, the node features $\hat{V}^{l}$ in heterogeneous graphs can be formulated as:

(4) $\hat{v}_{i}^{l} = \sum_{j} \big( a_{P}\, \mathrm{Conv1D}_{P}(v_{j}^{l}) + b_{P} \big)$

where $v_{j}^{l} \in \mathbb{R}^{N}$ represents the $j$-th column of $V^{l}$, $\mathrm{Conv1D}_{P}(\cdot)$ denotes the 1D convolutional kernel associated with meta-path $P$, $a_{P}$ and $b_{P}$ are scalar parameters used to adjust the importance of meta-path $P$, and $\hat{v}_{i}^{l} \in \mathbb{R}^{N}$ represents the $i$-th column of $\hat{V}^{l}$. The category of a node pair (i.e. the meta-path $P$) influences the feature aggregation of nodes in heterogeneous graphs in two ways: through the meta-path-based transformation function $\mathrm{Conv1D}_{P}(\cdot)$ and through the meta-path-based aggregation coefficients $a_{P}$ and $b_{P}$. Finally, the computation of the GCN in heterogeneous graphs can be obtained using the following equation:

(5) $V^{l+1} = \mathrm{Conv1D}(\hat{V}^{l})$
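A hedged sketch of this meta-path-aware aggregation follows: one node-wise mixing matrix and one pair of scalars (a_P, b_P) are kept per meta-path, and aggregation is masked to node pairs whose coarse classes form that meta-path. This masking strategy and the parameter shapes are our reading of Equations 4 and 5, not the authors' exact implementation.

```python
# Sketch of Eqs. 4-5: meta-path-specific mixing over the node dimension,
# restricted to node pairs whose coarse classes match each meta-path.
import torch
import torch.nn as nn

class HeteroGraphConv(nn.Module):
    def __init__(self, num_nodes=64, channels=512, num_classes=5):
        super().__init__()
        # Unordered class pairs -> m(m+1)/2 meta-paths.
        self.pairs = [(p, q) for p in range(num_classes) for q in range(p, num_classes)]
        # One node-wise mixing matrix per meta-path (the role of Conv1D_P over nodes).
        self.W = nn.Parameter(0.01 * torch.randn(len(self.pairs), num_nodes, num_nodes))
        self.a = nn.Parameter(torch.ones(len(self.pairs)))
        self.b = nn.Parameter(torch.zeros(len(self.pairs)))
        self.update = nn.Conv1d(channels, channels, kernel_size=1)  # Eq. 5 state update

    def forward(self, V, node_cls):
        # V: (B, N, C) node features; node_cls: (B, N) coarse LCLU class of each node.
        agg = torch.zeros_like(V)
        for k, (p, q) in enumerate(self.pairs):
            is_p = (node_cls == p).float()
            is_q = (node_cls == q).float()
            # pair_mask[b, i, j] = 1 iff the classes of nodes i and j form meta-path (p, q).
            pair_mask = (is_p.unsqueeze(2) * is_q.unsqueeze(1)
                         + is_q.unsqueeze(2) * is_p.unsqueeze(1)).clamp(max=1.0)
            mixed = torch.bmm(pair_mask * self.W[k].unsqueeze(0), V)  # aggregate along P only
            agg = agg + self.a[k] * mixed + self.b[k]
        out = self.update(agg.transpose(1, 2)).transpose(1, 2)        # Eq. 5
        return torch.relu(out)

hgc = HeteroGraphConv()
V = torch.randn(2, 64, 512)
node_cls = torch.randint(0, 5, (2, 64))   # coarse class assigned to each node
V_next = hgc(V, node_cls)                 # (2, 64, 512)
```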

2.3. Feature fusion and classifier

The representations of VHR images extracted from different network architectures exhibit significant variations. CNNs primarily emphasize spectral-spatial features, while GCNs focus on capturing long-distance relationships among nodes. However, relying solely on features provided by a single architecture often limits the ability to achieve optimal results. Consequently, the GCN module is enhanced by incorporating CNN features to achieve better performance. Additionally, the residual nature of the GCN module facilitates gradient propagation and enables easier integration with various CNN backbones. Prior to feature fusion, it is necessary to map the node-level outputs of the GCN module back to the pixel level, which involves an inverse process of graph projection known as graph reprojection. To accomplish this, we reuse the projection matrix $B$, enhancing computational efficiency while ensuring minimal performance loss. The underlying assumption is that the projection matrix $B$ tends to be orthogonal, satisfying $B^{T}B = E$. This assumption enables lossless reversion after the graph projection and graph reprojection of feature maps, i.e. $B^{T}V = B^{T}BX = X$. Consequently, the graph reprojection can be formulated as follows:

(6) $Z = B^{T}V$

where $Z \in \mathbb{R}^{L \times C}$ indicates the reprojected feature maps. During graph reprojection, the node features are assigned to the pixels located within that node. Subsequently, element-wise addition is performed between the reprojected GCN feature maps and the CNN feature maps.
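A small sketch of this reprojection and residual fusion step, under assumed tensor shapes, is given below.

```python
# Sketch of Eq. 6 and the residual fusion: the projection matrix B is reused,
# node features are scattered back to pixels, then added to the CNN features.
import torch

B_mat = torch.randn(2, 64, 256)          # (batch, N nodes, L = H*W pixels)
V = torch.randn(2, 64, 512)              # node features after the GCN module
X = torch.randn(2, 512, 16, 16)          # CNN feature map (L = 16 * 16 = 256)

Z = torch.bmm(B_mat.transpose(1, 2), V)  # Z = B^T V, (batch, L, C)
Z = Z.transpose(1, 2).reshape_as(X)      # back to (batch, C, H, W)
fused = X + Z                            # element-wise residual fusion
```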

The output $Y \in \mathbb{R}^{L}$ is calculated after feeding the fused feature maps into the class-wise classifier, which contains $m$ individual classifiers. Each classifier comprises a 1×1 convolutional layer, a softmax function, and an argmax operation applied to the feature dimension, as shown in Equation 7. This process involves assigning pixels from different categories to their respective classifiers, essentially performing a reclassification of the coarse LCLU data. Finally, the pixel-level segmentation results $\hat{Y} \in \mathbb{R}^{H \times W}$ are obtained by reshaping the output $Y \in \mathbb{R}^{L}$.

(7) $Y = \underset{C}{\arg\max}\big(\mathrm{softmax}\big(\mathrm{Conv1D}(X + Z)\big)\big)$
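The class-wise decoder can be sketched as follows, assuming one 1×1 convolutional head per coarse class and routing each pixel to the head of its coarse class by masking; the head names, shapes, and routing-by-masking are illustrative assumptions.

```python
# Sketch of the class-wise classifier (Eq. 7): one 1x1 conv head per coarse
# class, pixels routed to their head via the coarse LCLU map. Shapes assumed.
import torch
import torch.nn as nn

class ClassWiseClassifier(nn.Module):
    def __init__(self, channels=512, num_coarse=5, num_refined=15):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Conv2d(channels, num_refined, kernel_size=1) for _ in range(num_coarse))

    def forward(self, fused, coarse_mask):
        # fused: (B, C, H, W) fused CNN + GCN features (assumed already upsampled)
        # coarse_mask: (B, H, W) coarse LCLU class index per pixel
        logits = torch.zeros(fused.size(0), self.heads[0].out_channels,
                             *fused.shape[2:], device=fused.device)
        for c, head in enumerate(self.heads):
            sel = (coarse_mask == c).unsqueeze(1).float()   # pixels of coarse class c
            logits = logits + head(fused) * sel             # route pixels to their head
        return logits.softmax(dim=1).argmax(dim=1)          # (B, H, W) refined labels

clf = ClassWiseClassifier()
fused = torch.randn(2, 512, 64, 64)
coarse = torch.randint(0, 5, (2, 64, 64))
pred = clf(fused, coarse)      # refined LCLU prediction per pixel
```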

3. Experiments

In the experiments, our proposed GHGCN model is evaluated on the generated dataset. Eleven state-of-the-art deep learning methods are compared against our method: FCN (Long, Shelhamer, and Darrell Citation2015), DeepLabV3+ (Chen et al. Citation2018), UNet++ (Zhou et al. Citation2018), DANet (Fu et al. Citation2019), OCNet (Yuan et al. Citation2018), GloRe (Chen et al. Citation2019), DGCNet (Zhang et al. Citation2019), BiSeNetV2 (Yu et al. Citation2021), SegFormer (Xie et al. Citation2021), ConvNeXt (Liu et al. Citation2022), and SegNeXt (Guo et al. Citation2022). These models encompass a diverse range of architectures, including CNNs, GCNs, and attention networks. Three metrics, namely, the overall accuracy (OA), kappa coefficient, and mean intersection over union (mIoU), are employed to evaluate the performance of these models. Finally, ablation experiments and visualization results are conducted to demonstrate the benefits of coarse LCLU data and our model design.

3.1. Dataset

We generated datasets based on the Gaofen Image Dataset (GID) (Tong et al. Citation2020), the GID-15 dataset (Yang et al. Citation2022), and the Five-Billion-Pixels dataset (Tong, Xia, and Zhu Citation2023). The GID contains 150 Gaofen-2 (GF-2) images and the associated manual annotations with 5/15 categories. The Five-Billion-Pixel dataset is further completely labeled with 24 classes. For clarity, we refer to the category systems with 5, 15, and 24 types as GID5, GID15, and GID24, respectively. These three category systems are determined with reference to the Chinese Land Use Classification Criteria (GB/T 21010-2017). Our task is to predict the refined LCLU (GID15/GID24) with images and coarse LCLU (GID5) data.

GID5 comprises five primary classes: built-up, forest, farmland, meadow, and water. Unlabeled areas in GID5 are miscellaneous or unclear places that are highly challenging to annotate; the same holds for GID15 and GID24. GID15 expands upon GID5 and includes 15 categories: industrial land, urban residential, rural residential, traffic land, garden land, arbor forest, shrub land, paddy field, irrigated land, dry cropland, natural meadow, artificial meadow, river, lake, and pond. Building upon GID15, GID24 possesses a more complete category system. It retains most of the classes from GID15 while further refining some categories. For example, traffic land is further divided into roads, overpasses, rails, and airports. The hierarchical classification system is described in Figure 3. Notably, snow and bare land in GID24 are excluded due to the absence of annotations for GID5 and GID15.

Figure 3. The hierarchical category systems of GID5, GID15, and GID24 and the boxes are rendered with the corresponding colors. The abbreviations for GID5 are: Farm – farmland, Buil – built-up, Mead – meadow, Wate – water, Fore – forest, Unla – unlabeled area. The abbreviations for GID15 are: Urba – urban residential, Rura – rural residential, Indu – industrial area, Traf – traffic land, Gard – garden land, Arbo – arbor forest, Shru – shrub forest, Padd – paddy field, Irri – irrigated field, Dryc – dry cropland, Natu – natural meadow, Arti – artificial meadow, Rive – river. The abbreviations not mentioned for GID24 are as follows: Stad – stadium, Squa – square, Over – overpass, Rail – railway station, Airp – airport, Fish – fish pond, and Bare – bare land.


In our task, GID5 represents the coarse LCLU types, while GID15 and GID24 represent refined LCLU classes. As depicted in Figure 4, nine GF-2 images are selected. The selection follows specific criteria to ensure dataset diversity: (1) inclusion of all the categories in GID24 (except snow and bare land), (2) considerable distinction between coarse land cover labels and refined land use annotations, and (3) sufficient spatial and temporal variations to ensure generalizability across different geographical contexts and seasons. Table 1 lists the basic information about these images. The GF-2 MSI images have a spatial size of 6800×7200 pixels, a spatial resolution of 4 m, and four bands: blue, green, red, and near-infrared. The first three visible bands are selected. These images, along with the corresponding annotation masks, are simultaneously cropped into patches of 256×256 pixels. Patches with too many unlabeled areas are discarded, yielding a total of 5218 patches. The class distributions of these patches are listed in Table 2. The statistics indicate a category imbalance in this dataset, with eight categories, including airports, parks, and stadiums, each constituting less than 1% of the samples. This category imbalance greatly challenges the ability of models to learn from classes with small sample sizes. These patches are then randomly divided into training, validation, and test sets at a nearly 1:1:2 ratio. Moreover, the category distributions across these three sets remain similar.
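A simple sketch of this patch-generation step is given below; the 50% unlabeled threshold and the unlabeled class index are assumptions, since the exact discard criterion is not stated.

```python
# Sketch of the patch generation: crop images and masks into 256x256 tiles and
# discard tiles dominated by unlabeled pixels (threshold and label id assumed).
import numpy as np

def crop_patches(image, label, size=256, max_unlabeled=0.5, unlabeled_id=0):
    patches = []
    h, w = label.shape
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            lab = label[y:y + size, x:x + size]
            if (lab == unlabeled_id).mean() <= max_unlabeled:
                patches.append((image[:, y:y + size, x:x + size], lab))
    return patches

image = np.random.rand(3, 6800, 7200)             # stand-in for a GF-2 scene
label = np.random.randint(0, 6, (6800, 7200))     # stand-in for GID annotations
patches = crop_patches(image, label)
```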

Figure 4. Nine Gaofen-2 (GF-2) images and their annotations on GID5, GID15, and GID24. The color systems of GID5, GID15, and GID24 are displayed at the bottom.


Table 1. The basic information about the selected GF-2 images, including the product name, date, and longitude and latitude of the left top point of the images.

Table 2. The category distribution of the generated dataset.

3.2. Hyperparameter configuration

The output channel of each branch in the CNN module is 1024 after removing the last four convolutional layers, specifically the conv5_x layer and the classifier. Before feeding to the GCN module, the dimension of the concatenated feature is reduced to 512. Moreover, there is 1 layer with 64 nodes in the GCN module, followed by a batch normalization (Ioffe and Szegedy Citation2015) layer and a ReLU activation layer. We use the cross-entropy loss and the Adam optimizer (Kingma and Ba Citation2014) to train GHGCN for 50 epochs. We employ a ‘poly’ learning rate policy, $lr = lr_{init} \cdot (1 - iter/epochs)^{power}$, where the power equals 0.9 and the initial learning rate is 0.001 with a batch size of 8. All hyperparameters are determined by the performance on the validation dataset, and the same hyperparameter configuration is used for both qualitative and quantitative results. Our code is implemented in Python-3.8 and PyTorch-1.10.0.
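A minimal sketch of this ‘poly’ schedule follows, assuming the rate is decayed once per epoch and using a stand-in module in place of GHGCN.

```python
# Sketch of the 'poly' learning-rate policy with the stated hyperparameters
# (initial lr 0.001, power 0.9, 50 epochs); the per-epoch decay is an assumption.
import torch
import torch.nn as nn

init_lr, power, epochs = 1e-3, 0.9, 50
model = nn.Conv2d(3, 5, kernel_size=1)                    # stand-in for GHGCN
optimizer = torch.optim.Adam(model.parameters(), lr=init_lr)

def poly_lr(epoch):
    # lr = lr_init * (1 - epoch / epochs) ** power
    return init_lr * (1.0 - epoch / epochs) ** power

for epoch in range(epochs):
    for group in optimizer.param_groups:
        group["lr"] = poly_lr(epoch)
    # ... one pass over the training set with cross-entropy loss goes here ...
```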

All compared models are equipped with their corresponding pretrained model parameters, except for UNet++. To ensure a fair comparison, the majority of the selected state-of-the-art models employ the pretrained ResNet-50, with no convolutional layers removed. This ensures that the channel dimension of the feature maps in the encoders of these compared models is 2048, as specified by the CNN module in GHGCN. Moreover, the classifiers of all the compared models are replaced by class-wise classifiers to ensure fairness in the comparison. Furthermore, the compared models and our GHGCN model share the same hyperparameter setup. It is worth noting that all the quantitative metrics are obtained by the model with the maximum OA index on the validation dataset.

3.3. Comparison of classification performance

Table 3 displays the segmentation scores (%) of GHGCN and the compared state-of-the-art models in terms of OA, Kappa, and mIoU. Due to the imbalanced GID category distribution, the mIoU is considered a more comprehensive metric for model performance comparisons. Consequently, the subsequent analysis primarily focuses on the mIoU.

Table 3. Quantitative comparison of different methods.

GHGCN achieves the highest OA, Kappa, and mIoU on both the GID15 and GID24 datasets, surpassing the mIoU of the second-ranked model by approximately 2.8% and 5.8%, respectively. This highlights its effectiveness in capturing intricate semantic features and providing accurate segmentation results. Moreover, GHGCN demonstrates competitive performance with a moderate number of parameters (29.13 M) and computational cost (31.82 G FLOPs), indicating efficient resource utilization compared to some other models with higher complexity. Given the challenge of achieving both lightweight parameters and high accuracy, the GHGCN is a promising choice due to its balanced trade-off between model complexity and performance. Furthermore, the competitive performance of GHGCN on the test dataset demonstrates its ability to generalize across diverse geographical contexts and seasons.

Overall, the performance of the compared models exhibits the following trends: (1) State-of-the-art CNNs, exemplified by BiSeNetV2, outperform state-of-the-art attention models such as SegFormer; attention models in turn surpass graph convolutional models, represented by GloRe; and these, in descending order, outperform conventional convolutional models (e.g. DeepLabV3+) and earlier attention models (e.g. DANet). This trend highlights the advantages of GCN models over traditional CNN models in long-distance relationship capture and emphasizes the room for improvement in graph models compared to state-of-the-art convolutional models. (2) Models with fewer parameters achieve higher accuracy on the GID15 dataset, characterized by fewer categories, while models with a larger parameter count demonstrate superior performance on the more challenging GID24 dataset. For instance, BiSeNetV2 excels on GID15, while ConvNeXt exhibits superior performance on GID24. This observation suggests that lightweight models are better suited for GID15, where they are less prone to overfitting under relatively limited samples.

Specifically, we find that DeepLabV3+ outperforms the FCN and UNet++, which can be attributed to the effectiveness of the Atrous Spatial Pyramid Pooling (ASPP) design. Moreover, UNet++ performs worst among the CNNs since it is trained from scratch, and the disparity between UNet++ and the FCN is further amplified on the GID24 dataset. This suggests that employing pretrained models in scenarios with relatively limited samples can provide considerable benefits. Furthermore, we notice that DANet performs even worse than UNet++, which reveals that attention networks may not be as competitive as CNNs for this task.

Among the GCN models, GloRe surpassed DGCNet on the GID15 dataset but demonstrated slightly inferior performance on GID24 compared to DGCNet, which can be attributed to its smaller parameter count. This highlights the significance of proper graph construction in achieving image segmentation. DGCNet ineffectively generates nodes through pooling, where a node represents a rectangular area. In contrast, GloRe can dynamically adjust the projection matrix, allowing nodes to have arbitrary shapes. This adaptability proves advantageous in effectively capturing long-distance relationships. The GHGCN model further improves the edge weight calculation by incorporating coarse LCLU results, leading to the highest mIoU. Notably, the advantage of GHGCN over GloRe is amplified in GID24, proving the effectiveness of our proposed model.

Furthermore, Table 4 presents the IoU values for all the categories in GID24. GHGCN achieves the optimal or near-optimal IoU in most categories, particularly those characterized by limited samples (comprising less than 1% of the total). For instance, GHGCN achieves significant improvements in the airport category (30% increase), the park category (11% increase), the overpass category (10% increase), and the shrub forest category (7.5% increase). Moreover, in homogeneous built-up areas, the improvements over GloRe in industrial areas (2.56% increase), urban residential areas (1.3% increase), rural residential areas (1.85% increase), and roads (2.06% increase) are also not marginal.

Table 4. Quantitative comparison of different methods in terms of mIoU in GID24.

We further display the qualitative segmentation results of nine test samples in Figure 5. The comparison includes representative CNNs (ConvNeXt), attention models (SegFormer), and GCNs (DGCNet). Overall, the boundaries between different categories are more accurate and smoother in GHGCN and ConvNeXt. For instance, roads and water bodies are more complete in samples 2, 4, and 5. With its shallower convolutional structure, the CNN module retains more spatial details, which facilitates the recognition of narrow objects. Moreover, the use of coarse LCLU data as a spatial constraint among different coarse LCLU categories is beneficial. Even in homogeneous areas, GHGCN shows superiority. The classification results tend to be consistent in homogeneous areas, which is due to long-distance node-level feature aggregation. Moreover, constraining each node’s corresponding area to belong to the same coarse category plays a crucial role in preventing feature confusion among different classes, especially when nodes are dynamically generated.

Figure 5. Visualization comparison of the prediction results obtained by DeepLabV3 (Dplb3), OCNet, GloRe, and GHGCN. The first three rows indicate the images of the RGB bands, the GID5 labels, and the GID24 labels. The last four rows represent the predictions of these four models. Areas with large differences are highlighted with white boxes.


As mentioned above, GHGCN shows significant superiority, especially in categories with limited samples. Specifically, the nine columns in Figure 5 represent the distinctions of these models in segmenting airports, parks, overpasses, shrub forests, paddy fields, stadiums, garden lands, fish ponds, and industrial areas. Additionally, it is observed that both DGCNet and SegFormer achieve similar results and underperform ConvNeXt, which is consistent with the quantitative evaluation.

3.4. Ablation experiments

The improvement in GHGCN performance can be attributed to three key designs: (1) the integration of the CNN branch with coarse LCLU data as input (+LCLUCNN), (2) the employment of the GCN module (+GCN), and (3) the incorporation of the meta-path in feature aggregation (+Meta-path). To demonstrate the effectiveness of these factors, we conducted additional experiments, as presented in Table 5. We developed five comparable models and tested their performance on both the GID15 and GID24 datasets. These five models share the same decoder and training configuration but vary in encoder design. The first model’s encoder comprises a CNN branch with images as input. Building upon this, two additional models were created, one incorporating an extra CNN branch with a coarse LCLU mask as input (+LCLUCNN) and the other incorporating the GCN module. These two models were designed to analyze the individual impacts of the dual-branch CNN and the GCN module. The fourth model combines both. Finally, the meta-path is further incorporated into the GCN module, yielding the proposed GHGCN.

Table 5. Quantitative comparison of ablation models.

Through the ablation study, we observe that all three design factors contribute significantly to the improved performance of our GHGCN model. When none of these factors are applied, our model is equivalent to the FCN model with the last four convolutional layers of ResNet-50 removed. Compared to the performance of the FCN in Table 3, the removal of these layers in our model is advantageous because it preserves more spatial details. When only the LCLUCNN factor is added, our model becomes a double-branch FCN model. This results in improvements of 1.71% and 2.98% in the mIoU on GID15 and GID24, respectively. Moreover, when only the GCN factor is employed, our model degrades to GloRe. The increase in the mIoU over the FCN demonstrates the effectiveness of the GCN module. Notably, we find that the GCN factor has a more significant impact than the LCLUCNN factor, additionally improving the mIoU on GID15 and GID24 by 2.62% and 0.98%, respectively. Furthermore, combining both factors further improves the mIoU. Finally, introducing the meta-path leads to additional performance enhancements on top of the former. These findings demonstrate the positive effects of all these designs on the use of coarse LCLU labels. Furthermore, we observe that the impact of the additional LCLUCNN branch is more pronounced on GID24. This may be attributed to the increased significance of the spatial constraints introduced by the additional branch when dealing with more complex categories. Moreover, the benefits of the meta-path factor are also enhanced on GID24, as it optimizes the aggregation of node features.

3.5. Visualization analysis

In this section, we visualized the following components to gain a better understanding of the mechanism of our GHGCN model: (1) the CNN features generated from the coarse LCLU data, (2) the automatically created nodes in the heterogeneous graph, and (3) the node feature distributions before and after the GCN layer. These visualizations can assist in comprehending how GHGCN efficiently utilizes coarse LCLU data.

We visualize the feature maps of the CNN branch with coarse LCLU data as input in Figure 6 to illustrate how these features work. We select 6 representative features from three layers of samples 1 and 3 as examples. The brighter the pixel, the stronger the feature response. Overall, shallower layers tend to focus more on edge information, which is beneficial for narrow object segmentation and can provide more precise boundaries. Deeper layers tend to generate spatial cues that concentrate more on topological positions. Specifically, the features of the conv4_x layer in sample 1 are concentrated on different topological positions, including the two sides (the 2nd column), the top (the 3rd column), and the center (the 4th column) of the built-up mask. Correspondingly, the features of the conv4_x layer in sample 3 are concentrated on the left side (the 2nd column), the top (the 1st column), and the center (the 6th column) of the built-up mask. These spatial cues facilitate segmentation since topological position can benefit the recognition of refined LCLU. For example, pixels located in the center of the built-up mask are more likely to belong to an urban residential area than those located on the edges.

Figure 6. Visualization of CNN feature maps with coarse LCLU data as input. Each row includes 6 typical features of the same layer. Each image is normalized by its minimum and maximum values, and brighter pixels indicate higher values. The spatial cues manifest from the edge information to the topological locations as the layer deepens.


To present the adaptively generated nodes, the learned projection weights of the nodes are shown in Figure 7. After the weights are produced, the node features are determined by the weighted sum of the pixel features. The projection weights can be viewed as a special kind of presegmentation: they are dynamically created without any spatial restriction and can take any value rather than a simple average. Specifically, we select 4 nodes from samples 1, 2, and 5 in Figure 7 as illustrative examples. The automatically generated nodes can achieve similar and even more flexible results than presegmentation. (1) Each node corresponds to a specific area. For example, node 1 in the first row represents the airport area, node 3 in the second row indicates the park area, and node 4 in the third row corresponds to the water body. (2) The generated nodes can represent areas on a coarser scale. For instance, node 1 in the second row represents the built-up areas, including the industrial, road, overpass, and urban residential areas. (3) The generated nodes are not limited to representing single local areas, which is more flexible than presegmentation-based node production (e.g. SLIC). In the last row, nodes 1 and 2 include two isolated areas that belong to the same categories. These properties offer greater adaptability in capturing global relationships between nodes. Moreover, we find that the projection weights of the same area can be either positive or negative, as for node 1 and node 2 in the last row. In other words, the features of different nodes pointing to the same region may have contrasting values.

Figure 7. Visualization of the adaptively generated nodes and the nodes segmented by Simple Linear Iterative Clustering (SLIC). Each row contains an image, the corresponding GID24 label, nodes of SLIC, and the learned projection weights of 4 nodes. The yellow line in the third column indicates the boundaries of different nodes. In columns four to seven, the color red represents positive values, and the color blue denotes negative values. The darker the color is, the greater the value. The adaptively generated nodes can yield comparable and even more flexible results than the presegmentation output.


Furthermore, we generate nodes via SLIC for comparison, as shown in the third column of Figure 7. The parameters for SLIC, namely the segment number, compactness, and sigma, are set to 128, 10, and 2, respectively. Since SLIC segments images based on spatial-spectral features, the generated nodes possess three characteristics: (1) they are locally homogeneous, with spatially nonadjacent pixels not assigned to the same node, (2) there is no intersection between different nodes, and (3) the pixels within a node are treated equally. In contrast, the dynamically generated nodes of GHGCN break the constraint of spatial adjacency. Moreover, node features are calculated as the sum of pixel features with adaptively determined weights. Therefore, compared to the preprocessing SLIC method, which requires manual determination of segmentation parameters, the end-to-end and adaptive approach of GHGCN is more flexible and powerful.
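For reference, a short sketch of this SLIC baseline with the stated parameters follows, using scikit-image's slic on random data standing in for a GF-2 patch and averaging pixel signatures to form node features.

```python
# Sketch of the SLIC node-generation baseline (128 segments, compactness 10,
# sigma 2); the random patch stands in for an actual GF-2 image crop.
import numpy as np
from skimage.segmentation import slic

patch = np.random.rand(256, 256, 3)                      # stand-in for an RGB patch
segments = slic(patch, n_segments=128, compactness=10, sigma=2)

# Node features as the average signature of the pixels in each superpixel.
node_features = np.stack([patch[segments == s].mean(axis=0)
                          for s in np.unique(segments)])
print(node_features.shape)   # (number of superpixels, 3)
```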

The feature distributions before and after the GCN layer are visualized to demonstrate how the GHGCN model improves the separability of different categories. Principal component analysis (PCA) is used to reduce the feature dimension to two in samples 5 and 4 for illustration. Typically, an ideal feature distribution should ensure that the clusters of different categories are separable: (1) the points of different classes (clusters) should be far apart, and (2) the points of the same class should be tightly distributed. As shown in Figure 8, after the GCN layer, we observe a noticeable improvement in the separability of the feature distributions. The points belonging to the same category form a tighter cluster, whereas points from different classes are distributed farther apart. The tighter intra-class distribution and increased inter-class separation make it easier to distinguish among categories and contribute significantly to the improved performance of our model. For example, in the first row, the mixed clusters are split apart after the application of the GCN layer, allowing for improved discrimination of nodes with similar CNN features. In the second row, the clusters become tighter and more distinguishable, especially for the water and forest regions. This demonstrates the positive effects of the GCN module.
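A sketch of this visualization procedure is given below, using scikit-learn's PCA; the random arrays stand in for the actual node features before and after the GCN layer.

```python
# Sketch of the node-feature visualization: PCA to two dimensions before and
# after the GCN layer, colored by coarse class (random stand-in features).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

V_before = np.random.randn(64, 512)          # node features entering the GCN layer
V_after = np.random.randn(64, 512)           # node features after the GCN layer
node_cls = np.random.randint(0, 5, 64)       # coarse LCLU class of each node

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
for ax, feats, title in zip(axes, [V_before, V_after], ["before GCN", "after GCN"]):
    xy = PCA(n_components=2).fit_transform(feats)
    ax.scatter(xy[:, 0], xy[:, 1], c=node_cls, cmap="tab10", s=15)
    ax.set_title(title)
plt.tight_layout()
plt.show()
```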

Figure 8. Visualization of node feature distributions before and after the GCN layer. Each row indicates an individual sample, and the feature dimension is reduced to two by principal component analysis for better visualization. After the GCN layer, the clusters of different categories are divided further apart while the clusters of the same class are gathered more tightly, which greatly benefits the classification.


4. Discussion

4.1. Going deeper with GCN

Despite showing promise, GCNs typically have shallower architectures than CNNs, mainly due to the vanishing gradient problem, which limits most current GCNs to fewer than 5 layers. Model depth can nonetheless significantly affect performance. To investigate this impact, we conduct experiments on GID24 to evaluate the performance of GHGCN with 1–3 layers, where each layer has 64 hidden units. All other hyperparameters remain unchanged from the previous experiments.

As shown in Table 6, the best model depth is 1, which is consistent with the results of GloRe. This result aligns with our earlier observation that the generated nodes in GHGCN cover global regions and are fully connected, enabling the effective construction of long-distance relationships. In contrast, presegmentation-based graphs can only represent local areas with spatially connected nodes, requiring more layers for long-distance relationship extraction.

Table 6. Quantitative comparison of models with different GCN layer numbers in GID24.

4.2. The influence of node number

The graph size, determined by the number of nodes, plays a crucial role in GCNs. The number of nodes affects the size of the superpixels and has a significant impact on GCN performance. If the number of nodes N is too small, it may result in the oversmoothing of large objects, while too many nodes can lead to noisy representations of tiny objects. Hence, finding an optimal value for the number of nodes is a challenging task, and it often depends on the dataset and task requirements. In our study, we conduct further experiments on GID24 to investigate the influence of the number of nodes on the performance of GHGCN. Specifically, we evaluate the performance of GHGCN with node numbers ranging from 16 to 256, as reported in Table 7, using a single GCN layer. All the hyperparameters are kept consistent with those used in the previous experiments.

Table 7. Quantitative comparison of models with different node numbers in GID24.

The performance variations of the GHGCN models with 32–256 nodes are marginal, indicating that GHGCN is robust to changes in the number of nodes. This robustness can be attributed to the advantages of the node generation method in GHGCN. The adaptive creation of the node projection matrix over the entire image allows flexibility in defining the area corresponding to a node. The correlated area can be of any shape and can even cover multiple isolated regions, as shown in Figure 7. This adaptability enables the model to effectively capture spatial relationships and maintain consistent performance across different node numbers. However, when the node number is decreased to 16, there is a considerable decrease in performance. This is because, with only one GCN layer and too few nodes, the model capability is significantly limited. Therefore, selecting a relatively large value for the number of nodes is recommended.

4.3. The quality of coarse LCLU data

The employed coarse LCLU mask shares the same spatial resolution as the refined LCLU labels to explore the theoretical positive effects of the coarse LCLU data. Moreover, their category systems are consistent. However, in practice, the available coarse LCLU data often fail to meet these criteria. Commonly used coarse LCLU products typically have spatial resolutions of 10 m, 30 m, or even coarser and may exhibit category system mismatches. Therefore, the proposed model may be limited by these two mismatches in practical applications, and further research on both types of mismatch is needed. Furthermore, we have observed that the benefits of the coarse LCLU mask for identifying refined vegetation categories are marginal, as their identification relies more on temporal features than on visual patterns.

4.4. Limitations and future work

The GHGCN model is built upon the features of a pretrained CNN, and multi-scale information is not fully exploited. Multi-scale information has been shown to be beneficial for image segmentation tasks. Techniques such as pyramid pooling and ASPP have demonstrated their effectiveness in capturing contextual information at different scales, leading to more accurate and robust segmentation results. Incorporating multi-scale CNN features into GHGCN could enhance the model’s ability to capture spatial context at different scales, which is particularly valuable when dealing with objects or structures of various sizes. Moreover, creating multi-scale heterogeneous graphs on multi-scale CNN features aligns well with the idea of aggregating information from various levels of spatial detail. In the future, we will investigate the potential of building multi-scale heterogeneous graphs on multi-scale CNN features, which is an interesting and promising direction for further improving segmentation performance.

5. Conclusion

This work highlights the importance of incorporating prior knowledge (i.e. coarse LCLU information) in addition to imagery for refined LCLU classification. The proposed framework, which combines CNNs and GCNs and effectively utilizes coarse LCLU data, shows promising results and contributes to the advancement of image segmentation techniques for remote sensing applications. Specifically, we explore three different approaches to utilizing this information, each of which improves the segmentation results. Notably, incorporating coarse LCLU into the feature aggregation process within heterogeneous graphs proves particularly successful. In the proposed model, graph convolutions on heterogeneous graphs are implemented in an end-to-end manner, which streamlines the training process and facilitates better optimization. Our model not only improves segmentation performance but also demonstrates the potential of harnessing coarse LCLU results effectively. Its design can lead to further advancements in LCLU classification and offers a valuable contribution to the remote sensing community.

Acknowledgments

The authors gratefully acknowledge the free access to the GID dataset (https://x-ytong.github.io/project/GID.html), the GID-15 dataset (https://captain-whu.github.io/HPS-Net/), and the Five-Billion-Pixels dataset (https://x-ytong.github.io/project/Five-Billion-Pixels.html).

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

The data used in this study are available by contacting the authors.

Additional information

Funding

This work was supported by the National Key R&D Program of China under grant number 2021YFB3900503.
