
Crop field extraction from high resolution remote sensing images based on semantic edges and spatial structure map

Article: 2302176 | Received 30 Oct 2023, Accepted 15 Dec 2023, Published online: 24 Jan 2024

Abstract

Crop field boundary extraction from remote sensing images is crucial for supporting agricultural production and planning. In recent years, deep convolutional neural networks (CNNs) have gained significant attention for edge detection tasks. Moreover, transformers have shown superior feature extraction and classification capabilities compared to CNNs due to their self-attention mechanism. We proposed a novel structure that combines full edge extraction with CNNs and enhances connectivity with transformers, consisting of three stages: a) preprocessing the training data; b) training the semantic edge and spatial structure graph models; and c) vectorizing the fusion of semantic edge and spatial structure graph outputs. To cater specifically to high-resolution remote sensing image crop-field boundary extraction, we developed a CNN model called Densification D-LinkNet. Its full-scale skip connections and edge-guided module adapt well to different crop-field boundary features. Additionally, we employed a spatial graph structure generator (Relationformer) based on object detection that directly outputs the structural graph of the crop field boundary. This method relies on good connectivity to repair fragmented edges that may appear in semantic edge detection. Through multiple experiments and comparisons with other edge-detection methods, such as BDCN, DexiNed, PiDiNet, and EDTER, we demonstrated that our proposed method can achieve at least a 9.77% improvement in boundary intersection over union (IoU) and a 2.07% improvement in polygon IoU on two customized datasets. These results indicate the effectiveness and robustness of our approach.

1. Introduction

Acquiring comprehensive information on crop field plots, including land plotting, crop yield calculations, and agro-development planning, is crucial for enhancing agricultural productivity. The traditional method of delineating crop field boundaries is laborious and time-consuming as it involves on-site visits and manual mapping in conjunction with remote sensing images, which fails to reflect land changes due to urban development (Prishchepov et al. Citation2018; Xia et al. Citation2016).

Currently, a common practice for crop field plot extraction is image segmentation based on edge detection, region segmentation, and machine learning. Traditional edge-detection methods rely on low-level local cues to identify changes in curvature, noise, or color; one example is the Canny algorithm (Canny Citation1986), which uses the calculus of variations to optimize a given functional. However, shallow features primarily target areas with evident changes and do not consider contextual semantic information, resulting in a large amount of noise and blurred edges. Region segmentation-based methods primarily aim to address the problem of non-closed boundaries in edge detection; the underlying assumption is that adjacent elements inside a region have similar values (Tremeau and Borel Citation1997). There are two basic operations in region segmentation: merging and splitting (Fan et al. Citation2001). The basic steps of region segmentation (Bins et al. Citation1996) are: a) obtain the initial segmentation result of the image; b) merge similar fragments and divide dissimilar ones; and c) continue splitting until there are no remaining fragments. However, region-based methods depend on well-segmented objects; if the segmentation quality of the image objects is poor, the accuracy of crop-field boundary extraction is reduced.

Convolutional neural networks (CNNs) and other deep learning models have shown tremendous advantages in the field of remote sensing (Volpi and Tuia Citation2018; Wagner and Oppelt Citation2020; Waldner and Diakogiannis Citation2020; Zhang et al. Citation2021). Currently, three main CNN-based approaches are used for crop field detection: semantic segmentation, semantic edge detection, and mixed methods. Semantic segmentation (Ronneberger et al. Citation2015; Badrinarayanan et al. Citation2017; Lin et al. Citation2017; Zhu et al. Citation2017) recognizes crop fields in remote-sensing images, classifies the pixels into different categories, and assigns a unique identifier to each crop field. For example, Chen et al. (Citation2017) combined deep learning and atrous convolutions to achieve crop field segmentation, which later researchers built upon and improved (Firdaus-Nawi et al. Citation2011; Maggiori et al. Citation2017; Peng et al. Citation2017; Fu et al. Citation2019). However, deep convolutional networks are prone to the notorious issues of exploding or vanishing gradients when the depth becomes excessively large, and dilated convolutions may introduce the problem of insufficient resolution. To address these challenges, another well-known structure, U-Net, was introduced (Ronneberger et al. Citation2015). Building on the FCN, it incorporated an upsampling stage and a feature channel fusion strategy for medical image segmentation. The U-Net structure (Diakogiannis et al. Citation2020; Citation2021; Wang et al. Citation2022) has achieved good results in the delineation of objects. Though semantic segmentation usually yields accurate crop field surface information, the resulting crop field boundaries are not satisfactory.

Researchers have employed semantic edge detection methods to acquire precise boundaries (Bertasius et al. Citation2015; Xie and Tu Citation2015; Yang et al. Citation2016; Liu et al. Citation2017; Zhu et al. Citation2017; Wang et al. Citation2019; Su et al. Citation2021). Semantic edge methods generally use multiple convolutional layers, each of which extracts different combinations of features. Bertasius et al. (Citation2015) employed a branched, fully connected subnetwork in conjunction with five convolutional layers; it used low-level cues to detect object features and transformed them into high-level cues. However, detecting high-quality contours using convolutional layers alone is very challenging. Yang et al. (Citation2016) introduced an encoder-decoder-based FCN for foreground object contour detection that suppresses background edges and demonstrated superior performance in generalized edge detection across various tasks. The issue that arose with its deep convolutional layers, specifically six encoders and six decoders, is that the network may struggle to converge when going deeper. Liu et al. (Citation2017) proposed a novel network that uses all CNN features from different layers to perform pixel-wise prediction in an image-to-image fashion and to obtain accurate predictions at different scales. Researchers have also used hierarchical, multi-scale features to improve edge and object detection.

Researchers have shifted their focus to model performance and portability due to the high computational costs associated with the increased number of neurons. In practical applications, fast inference speed is crucial, prompting a need for more efficient models. For instance, Wang et al. (Citation2019) proposed an end-to-end network for detecting occlusion boundaries that performed multiple tasks simultaneously; to address the issue of class imbalance, the network used an attention loss function that assigned different weights to false positives and false negatives. Su et al. (Citation2021) integrated traditional edge detection operations into convolutions using pixel-wise difference convolutions to improve performance. However, for tasks requiring long-range dependencies, convolution alone is evidently insufficient. The incorporation of the attention mechanism of the Vision Transformer, especially on larger datasets, outperforms traditional approaches in detecting edges. Pu et al. (Citation2022) used the overall background and detailed local information of images to extract clear object edges. This method effectively solves the edge extraction problem when objects are occluded, and it can handle extreme boundary and non-boundary class imbalances. These methods show excellent results in segmentation and can identify accurate edge contours of image objects.

Various techniques combining edge detection and segmentation have effectively improved the accuracy and robustness of edge detection, providing strong support for remote sensing image analysis. To tackle complex and fragmented crop field plot extraction, researchers have exploited combined semantic and edge detection methods. Xia et al. (Citation2018) used a method that includes the U-Net and RCF models to extract hard and soft boundaries and combine them to form complete crop field plots; the authors also noted that the method may be less effective when confronted with data exhibiting regional heterogeneity. For large-scale single-field extraction, Waldner and Diakogiannis (Citation2020) used a fully connected U-Net backbone with conditional inference to recognize the plot range, the plot boundaries, and the distance to the nearest plot boundary, and finally synthesized the outputs of the three segmentation tasks. When the distances between plot edges are small, however, the model does not effectively differentiate between edges, resulting in predicted edge widths that are smaller than the actual ones. To address this issue, Long et al. (Citation2022) proposed a new multi-task network, BSiNet, that combines the learning of three tasks: 1) crop field recognition, 2) crop plot boundary prediction, and 3) obtaining distance features, and employs a spatial grouping enhancement module for crop field extraction. Xu et al. (Citation2023) employed a cascade multi-task network integrated with semantic and edge detection, a fixed-edge refinement network featuring local connectivity, and a fusion model to extract crop field boundary information, which greatly aided the recovery of small plots and local spatial topology information.

In edge extraction tasks, the edge strength map obtained through semantic edge detection acts as a confidence measure. When the confidence level dips below a certain threshold, the detected edges tend to break, leading to a significant reduction in edge connectivity and overall accuracy. Compared with pixel-based segmentation, graph-based methods have less noise, higher accuracy, and better connectivity; however, overly smooth lines can result in the loss of numerous details. These two approaches often counterbalance each other. Graph-based methods are available for object extraction in remote sensing imagery; they use an iterative graph-based approach to gradually add vertices and edges to the graph. In road extraction (Bastani et al. Citation2018; Chu et al. Citation2019; He et al. Citation2020; Tan et al. Citation2020; Wei et al. Citation2020), a common method is to output a spatial structure graph directly. Bastani et al. (Citation2018) predicted a starting point from aerial imagery and then iteratively generated a spatial structure graph. However, this iterative process only considers local information and does not capture global information; moreover, when a starting point deviates significantly from the ground truth, the prediction of the next point is biased. Tan et al. (Citation2020) improved the point search with a flexible step-length detection technique that can accurately locate each point. A few researchers have combined the advantages of segmentation-based and graph-based methods (He et al. Citation2020; Citation2020). While these methods excel in road detection, applying them directly to crop field edge detection poses a greater challenge. Unlike roads, crop field edges lack clearly defined key points, and their width varies, making them less tolerant of a fixed approach. Additionally, crop fields exhibit stronger regional heterogeneity.

Recently, transformers have achieved considerable success in computer vision, prompting researchers to explore their applications in vision problems, including image classification (Chen et al. Citation2020; Kolesnikov et al. Citation2021), object detection (Carion et al. Citation2020; Zhu et al. Citation2020), semantic segmentation (Zheng et al. Citation2021), and image processing (Chen et al. Citation2021). Carion et al. (Citation2020) proposed an end-to-end transformer-based object detection method called DETR, which consists of a learnable object sequence and direct set-based prediction. DETR eliminates tedious object detection pipeline components such as NMS in traditional methods and directly predicts objects. Building upon DETR, improved transformer modules have been proposed to address its slow convergence and the limited spatial resolution of the processed image features. Zhu et al. (Citation2020) introduced a novel approach called deformable DETR, which combines the transformer's relationship modeling ability with the sparse spatial sampling of deformable convolution. They developed a deformable attention module that focuses on a small set of key sampling points around a reference and implemented an iterative bounding box refinement mechanism to improve detection performance, which proved to be both simple and effective. Relationformer (Shit et al. Citation2022) couples object detection and relationship detection between objects based on the detection transformer (DETR) and generates a spatial graph structure from the image. The method is unaffected by errors in the current vertex when generating the next vertex, and it clearly improves the ability to capture global semantic information.

Inspired by the generation of graph structure, it is necessary to consider the structural nature of edges to obtain a complete representation of crop field edges. Representing the turning points in crop field edges with nodes and connecting these nodes with edges to form a complete graph is highly beneficial for vectorization. Non-closed faces are usually excluded during vectorization due to a lack of a few correct edges. While this omission might cause only a minor decrease in the evaluation metric score, it can lead to significant errors in practical applications. Directly outputting the graph from the network in graph structure tasks can avoid noise and complex postprocessing of segmentation, which may lead to errors in interpreting the segmented image. To overcome the limitations of semantic edges, in this study, we incorporated the innate advantages of graphs with semantic edges by introducing graphs into the segmentation results. We utilized the excellent connectivity of graphs guiding imperfect segmentation results and reduced the loss of crop field edges during vectorization.

Our contributions are summarized as follows:

  • We designed a CNN model, Densification D-LinkNet (DDLNet), that was tailored to address the complex and dense crop field edge extraction task in high-resolution remote sensing imagery. DDLNet combines contextual semantic information efficiently and achieves feature information fusion across different scales. It also outputs the correct edge information through an edge-guiding module with higher accuracy and lower noise levels.

  • We used graphs to guide the imperfect segmentation results of crop field edges in the previous step or to output correct and complete edge structures in cases where the segmentation results were incorrect. Our approach improved the connectivity of crop field edges and had advantages in the vectorization process, as sometimes a complete crop field area only requires filling in a small continuous edge structure.

2. Methodology

Our multi-level framework is divided into three stages: a) preprocessing of training data; b) training of the semantic edge (DDLNet) and spatial graph structure (Relationformer) models; and c) fusion (Figure 1). In the first stage, the training data were preprocessed, and the training labels were binarized and converted into corrected binary structure points and edge sets. The second stage is divided into two parts: (1) semantic edge model training, in which we used DDLNet for training and prediction, with an edge strength map as the output; and (2) edge graph structure model training using Relationformer. For fusion, we first performed binarization and refinement of the edge strength map based on empirical thresholds, then transformed the refined edges into a graph structure, compared it with the edge patches output by Relationformer, fused discontinuous edge segments, and finally vectorized the output. Sections 2.1.1–2.1.3 detail the two models and the fusion method used.

Figure 1. Crop field boundary extraction flowchart.


2.1. Model architecture and calibration

2.1.1. DDLNet model

DDLNet retains the core elements of D-LinkNet (Zhou et al. Citation2018), which uses an encoder-decoder structure, dilated convolution, and a pretrained encoder, and has been employed for high-resolution remote-sensing road extraction with competitive performance. However, D-LinkNet only utilized long-range skip connections to fuse features of the same scale at corresponding heights, mitigating the information loss introduced by downsampling. It did not adequately express information from multi-scale features, thus neglecting the intricate interplay between objects and edges in the context of high-resolution remote sensing for crop field extraction. This leaves substantial room for improvement. Inspired by DenseNet (Huang et al. Citation2017) and U-Net3+ (Huang et al. Citation2020), we added densified full-scale skip connections, which combine all low-level edge details with decoder features to enhance performance. The structure of DDLNet is shown in Figure 2.

Figure 2. Densification D-LinkNet architecture diagram. The blue matrices are multi-channel feature maps; the encoder architecture on the left is based on ResNet-34, and the decoder architecture on the right uses deconvolution. Each dotted line represents a full-scale skip connection. Each convolutional layer uses the ReLU activation function, except for the output edge prediction, which is activated by the sigmoid function.


During the downsampling process, some details are inevitably lost due to the decrease in image resolution. Although the network can capture low-level information, high-level details are easily lost, weakening the model's ability to handle high-level information and breaking the correlation between high- and low-level information; contextual semantic information is therefore lost during successive downsampling steps. D-LinkNet alleviated this issue using long skip connections that link same-scale features between the encoder and decoder to repair the loss of context information. Furthermore, we merged ideas from DenseNet and U-Net3+ and created a new multi-scale connection module, the full-scale skip connection module. This module not only performs skip connections within the same scale but also efficiently connects features from different scales to enhance cross-scale feature fusion.
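To make the full-scale skip connection concrete, the following is a minimal PyTorch sketch of how features from several encoder/decoder scales could be unified and fused at one decoder stage. The module name, channel handling, and the use of bilinear resizing in both directions are illustrative assumptions rather than the exact DDLNet implementation (U-Net3+, for example, max-pools larger-scale features before concatenation).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FullScaleSkip(nn.Module):
    """Sketch: fuse features from every encoder/decoder scale into one decoder stage."""

    def __init__(self, in_channels_list, out_channels):
        super().__init__()
        # One 3x3 conv per source scale to unify channel dimensions.
        self.convs = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=3, padding=1) for c in in_channels_list]
        )
        self.fuse = nn.Sequential(
            nn.Conv2d(out_channels * len(in_channels_list), out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, feats, target_hw):
        # Resize every source feature map to the target spatial size
        # (bilinear interpolation handles both up- and down-sampling in this sketch).
        resized = [
            F.interpolate(conv(f), size=target_hw, mode="bilinear", align_corners=False)
            for conv, f in zip(self.convs, feats)
        ]
        # Concatenate along the channel axis and fuse with a 3x3 conv block.
        return self.fuse(torch.cat(resized, dim=1))
```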

We employed deep multi-scale supervision to address the challenge of limited receptive fields in convolutional neural networks, enabling distinct feature information at varying scales to play a pivotal role. This approach entails predicting the feature information of each decoder layer (D1, D2, D3, D4, and D5) and using label supervision to calculate a loss for each decoder layer, which substantially enhances the network's ability to learn features across all scales. We upsampled each scale feature, restoring it to the original size for prediction and loss calculation, and used 1 × 1 convolution blocks to adjust the number of feature channels. The final predicted output $\hat{Y}_{final}$ is the fusion of the per-scale predictions $\hat{Y}_i$, with the output weight of each layer being $w_i$. In this experiment, we selected $w_i = 0.2$ for each decoder layer, and the calculation formula is as follows:
(1) $$\hat{Y}_{final} = \sum_{i=1}^{5} w_i \hat{Y}_i$$
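The deep supervision and the weighted fusion of Eq. (1) can be sketched as follows; the module and variable names are hypothetical, and only the structure (a 1 × 1 convolution per decoder output, upsampling to full resolution, sigmoid activation, and equal weights $w_i = 0.2$) follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepSupervisionHead(nn.Module):
    """Per-decoder edge predictions plus the weighted final fusion of Eq. (1)."""

    def __init__(self, decoder_channels, weights=(0.2, 0.2, 0.2, 0.2, 0.2)):
        super().__init__()
        # 1x1 convs reduce each decoder feature map (D1..D5) to a single edge channel.
        self.heads = nn.ModuleList([nn.Conv2d(c, 1, kernel_size=1) for c in decoder_channels])
        self.weights = weights

    def forward(self, decoder_feats, out_hw):
        preds = []
        for head, feat in zip(self.heads, decoder_feats):
            p = F.interpolate(head(feat), size=out_hw, mode="bilinear", align_corners=False)
            preds.append(torch.sigmoid(p))              # per-scale edge probability map
        # Eq. (1): weighted sum of the per-scale predictions.
        fused = sum(w * p for w, p in zip(self.weights, preds))
        return fused, preds                              # each entry of preds is also supervised
```

During training, the loss would be computed both on `fused` and on each element of `preds` against the same edge label, which is the deep multi-scale supervision described above.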

Only exploiting low-level features for semantic edge supervision is insufficient as low-level features from shallow layers preserve rich details while containing significant levels of noise and redundant information, leading to low efficiency and insignificant results. The use of high-level semantic information for semantic edge supervision can mitigate these issues effectively. The last layer of the DDLNet decoder, D1, encompasses both low- and full-scale high-level semantic features from other decoders (D2, D3, D4, and D5), allowing effective semantic edge detection through edge label supervision over the features of the D1 layer.

2.1.2. Edge detection based on spatial structure graph

The DETR, proposed by Carion et al. (Citation2020), demonstrates the potential of set-based object detection using an encoder-decoder transformer architecture. It begins with a CNN backbone to extract a compact feature representation and then reshapes the spatial dimensions of the extracted features into continuous vectors. These serialized features were complemented by positional encoding and transmitted to a transformer encoder. The decoder transforms a small number of embeddings, which are learned positional encodings. Finally, each decoder output embedding is passed to a shared feedforward network to predict its class and bounding box, or no-object class.

DETR generates a fixed-size set of N predictions in a single pass through the decoder, with N set to significantly exceed the number of detectable objects in the image. Following prior work (Stewart et al. Citation2016), the loss function establishes an optimal bipartite matching between the predicted and ground-truth objects via the Hungarian algorithm and then optimizes the object-specific losses. This object matching works similarly to heuristic assignment rules, but DETR only needs to find a one-to-one matching result for the set prediction.

Our work utilized deformable DETR, which features a multi-scale deformable attention module that focuses on a small subset of key sampling points around a reference point rather than on the full spatial extent of the feature maps. Allocating a small set of fixed key points per query mitigates the issues of convergence and feature spatial resolution, making it an effective attention mechanism for feature maps. As an end-to-end object detector, deformable DETR is efficient and converges faster than conventional DETR, thereby opening new possibilities for our study. Given a feature map $f_I \in \mathbb{R}^{C \times H \times W}$, a query element with content feature $f_q$, and a 2-d reference point $x_q$, deformable attention features are obtained using the following formula:
(2) $$\text{DefAttn}(f_q, x_q, f_I) = \sum_{m=1}^{M} W_m \left[ \sum_{k=1}^{K} A_{mqk} \cdot W'_m f_I\!\left(x_q + \Delta x_{mqk}\right) \right]$$
where m indexes the attention head, k indexes the sampled keys, and K and M are the total number of sampled keys (K ≪ HW) and the number of attention heads, respectively. $\Delta x_{mqk}$ and $A_{mqk} \in [0, 1]$ denote the sampling offset and attention weight of the k-th sampling point in the m-th attention head, respectively; both are obtained by linear projection over the query feature $f_q$. As $x_q + \Delta x_{mqk}$ is fractional, bilinear interpolation (Dai et al. Citation2017) is used to compute $f_I(x_q + \Delta x_{mqk})$. Each sampled feature is projected by $W'_m$ and multiplied by the attention weight $A_{mqk}$, and the output projection $W_m$ finally merges all the heads.
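Below is a simplified, single-scale PyTorch sketch of the deformable attention computation in Eq. (2), using `F.grid_sample` for the bilinear interpolation at fractional sampling locations. The layer names, the offset normalization, and the assumption that reference points are given in normalized [0, 1] coordinates are illustrative simplifications of the multi-scale module used in deformable DETR.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    """Single-scale sketch of Eq. (2): each query attends to K sampled points per head."""

    def __init__(self, dim, n_heads=8, n_points=4):
        super().__init__()
        assert dim % n_heads == 0
        self.h, self.k, self.dh = n_heads, n_points, dim // n_heads
        self.offsets = nn.Linear(dim, n_heads * n_points * 2)   # Delta x_mqk
        self.weights = nn.Linear(dim, n_heads * n_points)       # A_mqk (softmax over k)
        self.value_proj = nn.Linear(dim, dim)                   # W'_m (shared, split per head)
        self.out_proj = nn.Linear(dim, dim)                     # W_m (merges the heads)

    def forward(self, query, ref_points, feat):
        # query: (B, Q, C); ref_points: (B, Q, 2) in [0, 1]; feat: (B, C, H, W)
        B, Q, C = query.shape
        H, W = feat.shape[-2:]
        offsets = self.offsets(query).view(B, Q, self.h, self.k, 2)
        attn = self.weights(query).view(B, Q, self.h, self.k).softmax(-1)
        value = self.value_proj(feat.flatten(2).transpose(1, 2))            # (B, HW, C)
        value = value.view(B, H, W, self.h, self.dh).permute(0, 3, 4, 1, 2) # (B, h, dh, H, W)

        # Sampling locations x_q + Delta x_mqk, mapped to grid_sample's [-1, 1] range.
        scale = torch.tensor([W, H], dtype=query.dtype, device=query.device)
        loc = ref_points[:, :, None, None, :] + offsets / scale
        grid = 2.0 * loc - 1.0                                               # (B, Q, h, k, 2)

        out = []
        for m in range(self.h):
            # Bilinear interpolation of the value map at fractional locations.
            sampled = F.grid_sample(value[:, m], grid[:, :, m],
                                    mode="bilinear", align_corners=False)    # (B, dh, Q, k)
            out.append((sampled * attn[:, :, m].unsqueeze(1)).sum(-1))       # weighted sum over k
        out = torch.cat(out, dim=1).transpose(1, 2)                          # (B, Q, C)
        return self.out_proj(out)
```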

We briefly describe the Relationformer architecture and loss function, following the original work (Shit et al. Citation2022). Relationformer consists of four main parts: the CNN backbone, the transformer, the object-detection head, and the relation-detection head. The CNN backbone uses ResNet-101 to extract features $f_I \in \mathbb{R}^{D_f \times \#emb}$, where $D_f$ represents the spatial dimension of the features and $\#emb$ represents the embedding dimension. The transformer uses an encoder-decoder architecture with a deformable attention module that exploits the spatial sparsity of image features to significantly accelerate the convergence of DETR training; the encoder is consistent with the deformable DETR encoder. The decoder takes N + 1 tokens as input, where the first N are object tokens. The object-detection head consists of two parts: a fully connected network (an MLP stack) that outputs the location of each object, and a single-layer classification module for distinguishing objects from non-objects. The input to the relation-detection head is a pair of object tokens together with the shared relation token, processed by a three-layer fully connected network according to formula (3), where r denotes the [rln]-token, $\text{MLP}_{rln}$ is a three-layer fully connected network preceded by layer normalization, and the ordering of the object-token pair $\{o_i, r, o_j\}_{i \neq j}$ determines the direction i→j. The structure is shown in Figure 3.
(3) $$e_{rln}^{ij} = \text{MLP}_{rln}\!\left(\{o_i, r, o_j\}_{i \neq j}\right)$$

Figure 3. Overall architecture and details of Relationformer. Pairs of the N object tokens (obj-tokens) and an additional (N + 1)-th token are used to query the interactions and relationships between objects. The relationship token is defined as the rln-token; objects and the relationships between them are combined to form a shared object-relationship graph.

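As a concrete illustration of Eq. (3), the following is a minimal PyTorch sketch of a relation head that scores an ordered pair of object tokens together with the shared [rln]-token. The layer sizes, the two-class output (relation vs. background), and the helper structure are assumptions for illustration rather than the exact Relationformer head.

```python
import torch
import torch.nn as nn

class RelationHead(nn.Module):
    """Sketch of Eq. (3): score the relation of an ordered object-token pair (o_i, o_j)
    combined with the shared [rln]-token."""

    def __init__(self, dim, n_rel_classes=2):
        super().__init__()
        self.norm = nn.LayerNorm(3 * dim)                 # layer normalization before the MLP
        self.mlp = nn.Sequential(
            nn.Linear(3 * dim, dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, n_rel_classes),                # relation vs. background logits
        )

    def forward(self, obj_tokens, rln_token, pairs):
        # obj_tokens: (N, dim); rln_token: (dim,); pairs: list of ordered (i, j), i != j.
        feats = torch.stack(
            [torch.cat([obj_tokens[i], rln_token, obj_tokens[j]]) for i, j in pairs]
        )
        return self.mlp(self.norm(feats))                 # one e_rln^{ij} row per pair
```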

2.1.3. Fusion

The fusion approach aggregates the semantic and topological information of crop field edges. Specifically, guided by the spatial structure map, it detects discontinuous segments of crop field edges within a certain range and determines their nearest nodes in order to connect them. The fusion process is illustrated in Figure 4 and Algorithm 1.

Figure 4. The specific implementation of edge fusion involves converting the refined edges into a graph and locating the shortest path between the start and end points; if no path is found, new edges are added to the graph. Black circles denote vertices; yellow circles denote start and end points. Blue dotted lines denote the search range, which defaults to 15 pixels (varying with the crop category), and the red line denotes the repaired edge.


Algorithm 1:

Fusion

Input: testing image x; predicted edge strength map xp; binarization threshold t; search range R; sliding-window stride s.

Output: repaired edge map xoutput.

Initialize:

  xb = binarization(xp, t)

  xr = skeletonize(xb)

for each window (xij, xrij) sliced from x and xr with stride s over the width and height of x:

  Gpred = relationformer_predict(xij)

  Gthin = trans_to_graph(xrij)

  for each edge in Gpred:

    n1, n2 = edge.nodes

    m1, m2 = closest_nodes(n1, n2, Gthin, R) if such nodes exist within R,
             else build_nodes(n1, n2, Gthin, R)

    if not has_path(m1, m2, Gthin):

      connect(m1, m2, Gthin)

xoutput = trans_to_edge(Gthin)

return xoutput

The first step in this process is to refine the crop field edges. Initially, the edge strength map with a size of 1024 × 1024 is converted into a binary image by an empirically set threshold of 50. Subsequently, we used the skeleton extraction algorithm (Zhang and Suen Citation1984) to extract a refined binary image of the crop-field edge as the basis for further repair.
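A minimal sketch of this first step is shown below, assuming scikit-image for the thinning; the threshold of 50 follows the text, while the wrapper function itself is illustrative.

```python
import numpy as np
from skimage.morphology import skeletonize

def refine_edges(edge_strength, threshold=50):
    """Binarize an edge-strength map (values 0-255) and thin it to 1-pixel-wide skeletons."""
    binary = edge_strength >= threshold      # empirical threshold from the text
    skeleton = skeletonize(binary)           # Zhang-Suen-style thinning
    return skeleton.astype(np.uint8)
```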

The second step involves generating spatial structural maps using sliding windows. The high-resolution crop field image and predicted image were 1024 × 1024 pixels and 128 × 128 pixels, respectively, with the stride for rows and columns set to 32 pixels (balanced between precision and speed). Subsequently, 32 images of 128 × 128 crop field plots were input into Relationformer to predict the spatial structural map, which was used as a guide for edge repair.
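The sliding-window patch generation described above can be sketched as follows; the function name and generator interface are illustrative.

```python
def sliding_windows(image, size=128, stride=32):
    """Yield (row, col, patch) crops of `size` x `size` pixels with the given stride,
    to be fed to Relationformer for spatial structure map prediction."""
    h, w = image.shape[:2]
    for r in range(0, h - size + 1, stride):
        for c in range(0, w - size + 1, stride):
            yield r, c, image[r:r + size, c:c + size]
```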

Third, we checked the connectivity of the refined edges. The refined crop field edge results were transformed into a list of points and edges, denoted as Gthin, and the spatial structure map was denoted as Gpred. Following this, we traversed each edge in Gpred and searched for the two points in Gthin that were closest to the starting and ending points of the edge, thereby creating new starting and ending points. We determined the connectivity between these new points by checking whether a path exists between them in Gthin; the edge is considered connected if such a path exists, and disconnected otherwise.

The final step is to connect the disjointed segments. In the spatial structure map, when the starting and ending points were located in different connected components, we searched for the two nearest nodes in Gthin that belonged to different connected components and added a bridge between them. It connects the starting and ending points to form a unique path, thereby completing the entire merging process.
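A condensed sketch of the connectivity check and bridging steps using networkx is shown below; the node representation (pixel coordinates), the helper `closest_node`, and the fallback of inserting the predicted nodes themselves when no nearby skeleton node exists are illustrative assumptions that follow Algorithm 1 rather than reproduce the exact implementation.

```python
import networkx as nx

def fuse(G_thin: nx.Graph, G_pred: nx.Graph, search_range: float = 15.0) -> nx.Graph:
    """Repair broken skeleton edges (G_thin) guided by the predicted graph (G_pred).
    Nodes are assumed to be (row, col) pixel coordinates."""

    def closest_node(p, nodes):
        # Nearest G_thin node to point p, but only if it lies within the search range.
        best = min(nodes, key=lambda q: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2, default=None)
        if best is None:
            return None
        dist = ((best[0] - p[0]) ** 2 + (best[1] - p[1]) ** 2) ** 0.5
        return best if dist <= search_range else None

    for n1, n2 in G_pred.edges():
        m1 = closest_node(n1, list(G_thin.nodes())) or n1   # build a new node if none is close
        m2 = closest_node(n2, list(G_thin.nodes())) or n2
        G_thin.add_nodes_from([m1, m2])
        if not nx.has_path(G_thin, m1, m2):                 # disjoint connected components
            G_thin.add_edge(m1, m2)                         # bridge the gap
    return G_thin
```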

2.1.4. Loss function

The loss function is composed of two parts: the class-balanced mean squared error (CMSE) loss for semantic edge detection, and the L1 loss and GIOU loss for object detection to detect the true bounding boxes of predictions. Additionally, we used the Cross-entropy Classification Loss (CLS) to detect the predicted and actual categories.

Before explaining our loss function, it is necessary to introduce a loss widely used in natural-image edge detection, the class-balanced cross-entropy (CBCE) loss of HED, which applies a class-balancing weight β on a per-pixel basis to overcome the heavy imbalance between edges (about 10%) and non-edges (about 90%). The CBCE loss is calculated by the following formula:
(4) $$\mathcal{L}_{cbce} = -\beta \sum_{j \in Y_-} \log\!\left(1 - \hat{Y}_j\right) - (1-\beta) \sum_{j \in Y_+} \log \hat{Y}_j$$
where $\beta = |Y_+|/|Y|$ and $1-\beta = |Y_-|/|Y|$; $|Y_+|$ and $|Y_-|$ denote the edge and non-edge ground-truth pixels, respectively. The edge-map predictions $\hat{Y}_j$ (j ∈ [0, 1, …, H × W]) are computed using the sigmoid function.

Though the HED loss is applied in most edge detection networks such as RCF and BDCN, blurred and rough outputs recur due to significant noise interference. Hence, we proposed the class-balanced mean squared error (CMSE) loss, which incorporates class-balance parameters into the MSE loss, to generate comparatively refined edge intensity maps. The CMSE loss is the class-weighted sum of the squared differences between the predicted and target values of each pixel. The formula for the CMSE loss is as follows:
(5) $$\mathcal{L}_{cmse} = \frac{1}{m}\left(\beta \sum_{j \in Y_-}\left(\hat{Y}_j - Y_j\right)^2 + (1-\beta) \sum_{j \in Y_+}\left(\hat{Y}_j - Y_j\right)^2\right)$$
where m = H × W and $\beta = |Y_+|/|Y|$.
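A sketch of the CMSE loss consistent with Eq. (5) is given below; the tensor shapes and the reduction over the whole batch are assumptions.

```python
import torch

def cmse_loss(pred, target):
    """Class-balanced MSE of Eq. (5); pred and target are tensors in [0, 1], e.g. (B, 1, H, W)."""
    pos = target > 0.5                       # edge pixels, Y+
    neg = ~pos                               # non-edge pixels, Y-
    beta = pos.float().mean()                # |Y+| / |Y|
    sq_err = (pred - target) ** 2
    m = target.numel()
    # Small weight (beta) on the abundant non-edge class, large weight (1 - beta) on edges.
    return (beta * sq_err[neg].sum() + (1.0 - beta) * sq_err[pos].sum()) / m
```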

We describe two objects as having a valid relationship when they share a connection; otherwise, the relationship is classified as background. However, valid relationships are sparse among all possible permutations and combinations of objects, so computing the relationships among all possible object pairs can become exceedingly complex. To mitigate this problem, we randomly sampled three background relationships alongside the valid relationships, yielding a set R of size M, on which we computed the relation loss for relationship detection. The overall loss includes the ℓ1 regression loss ($\mathcal{L}_{reg}$) and the generalized intersection-over-union loss ($\mathcal{L}_{gIoU}$) between the predicted box coordinates $\tilde{v}_{box}$ and the ground-truth box coordinates $v_{box}$ for object detection, the cross-entropy classification loss ($\mathcal{L}_{cls}$) between the predicted class $\tilde{v}_{cls}$ and the ground-truth class $v_{cls}$ for category prediction, and the relationship loss ($\mathcal{L}_{rln}$) between the predicted relation $\tilde{e}_{rln}^{ij}$ and the ground-truth relation $e_{rln}^{ij}$ for relationship detection. The simultaneous object-relationship graph generation loss function is defined as follows:
(6) $$\mathcal{L}_{total} = \sum_{i=1}^{N}\mathbb{1}_{v_{cls}^i}\left[\lambda_{reg}\mathcal{L}_{reg}\!\left(v_{box}^i, \tilde{v}_{box}^i\right) + \lambda_{gIoU}\mathcal{L}_{gIoU}\!\left(v_{box}^i, \tilde{v}_{box}^i\right)\right] + \lambda_{cls}\sum_{i=1}^{N}\mathcal{L}_{cls}\!\left(v_{cls}^i, \tilde{v}_{cls}^i\right) + \lambda_{rln}\sum_{\{i,j\}\in R}\mathcal{L}_{rln}\!\left(e_{rln}^{ij}, \tilde{e}_{rln}^{ij}\right)$$
where $\lambda_{reg}$, $\lambda_{gIoU}$, $\lambda_{cls}$, and $\lambda_{rln}$ are the weights of the individual loss terms.

2.2. Experiments

2.2.1. Datasets

The first dataset came from drone surveys of tobacco planting regions in Chenzhou City, Hunan Province, covering 24 villages. The second dataset featured high-resolution remote sensing images of crop fields from Google Earth in Hangzhou, Huzhou, Jiaxing, Ningbo, Shaoxing, and Zhoushan, all located in Zhejiang Province (Figure 5). The tobacco-planting region dataset comprised 371 high-resolution images, each 1024 × 1024 pixels with a 1 m spatial resolution. The crop field dataset contains 150 high-resolution images, each 1024 × 1024 pixels with a spatial resolution of 0.955 m (Table 1).

Figure 5. Datasets. The Chenzhou dataset covers 24 villages, and the Zhejiang dataset includes 6 cities (Hangzhou, Huzhou, Ningbo, Jiaxing, Zhoushan, and Shaoxing), comprising 55 clips. The images provide a representative sample of the datasets' appearance.


Table 1. Datasets. IS and SR refer to image source and spatial resolution, respectively.

Ground-truth crop field images were obtained through manual delineation in the study area. To facilitate model training, the images and labels were cropped to 512 × 512 pixels for DDLNet edge detection. For the Relationformer training data, we cropped the images into 128 × 128 pixel patches and converted the labels into binary spatial structure datasets consisting of point and edge sets (Figure 6).

Figure 6. Ground truth of two study regions. They are all manually drawn images from our laboratory.


2.2.2. Training and implementation details

Two separate models were built using the PyTorch framework and trained on a single NVIDIA 3090 GPU with 24 GB of memory. The DDLNet model used an initial learning rate of $2\times10^{-4}$, which was updated every 100 iterations. The models were trained for 400 epochs on both the Hunan and Zhejiang datasets, with 80% of the data reserved for training and 20% reserved for testing. The Relationformer model used a lower initial learning rate of $2\times10^{-5}$ and was trained for 100 epochs on both datasets.

2.2.3. Evaluation metrics

To evaluate the accuracy of crop field edge detection, we employed the Optimal Dataset Scale (ODS) and Optimal Image Scale (OIS) proposed by HED (Xie and Tu Citation2015). ODS and OIS employ a standard non-maximum suppression technique to obtain finer edges. We used a uniform threshold value for all images and selected the average F1 score (harmonic mean of precision and recall) of the results as ODS. Additionally, the optimal threshold was selected for each image to maximize the F1 score, and the average F1 score was used as the OIS.
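A simplified sketch of how ODS and OIS could be computed from per-image edge-probability maps, following the averaging described above, is given below; it omits the non-maximum suppression and boundary-matching tolerance used in the standard benchmark protocol.

```python
import numpy as np

def f1(pred_bin, gt):
    """F1 score for one binarized prediction against a binary ground truth."""
    tp = np.logical_and(pred_bin, gt).sum()
    precision = tp / max(pred_bin.sum(), 1)
    recall = tp / max(gt.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)

def ods_ois(preds, gts, thresholds=np.linspace(0.01, 0.99, 99)):
    """ODS: best single threshold shared by all images; OIS: best threshold per image."""
    per_image = np.array([[f1(p >= t, g > 0) for t in thresholds]
                          for p, g in zip(preds, gts)])      # (n_images, n_thresholds)
    ods = per_image.mean(axis=0).max()    # average F1 at the single best dataset threshold
    ois = per_image.max(axis=1).mean()    # average of each image's best F1
    return ods, ois
```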

Further, we validated the accuracy of our post-processing methods for crop field surface extraction using the intersection over union (IoU). The IoU measures the degree of overlap between the predicted and actual regions of an image, with higher values indicating more accurate predictions. When calculating the polygon IoU, each crop field plot served as a reference label, and we selected the predicted plot with the largest intersecting area as the correct prediction, accumulating the IoU of each plot using the formula
(7) $$IoU = \frac{A \cap B}{A \cup B}$$
where A represents the predicted area and B represents the ground truth.

However, the IoU only reflects the accuracy of crop field plot area extraction and cannot reflect the accuracy of crop field edge and internal edge extraction. Therefore, we employed the boundary IoU to evaluate the accuracy of crop field edge detection. The crop field edges were dilated using the dilation function (Exp) with a kernel size (ks) of 5. The calculation formula is as follows:
(8) $$\text{Boundary IoU} = \frac{Exp(A, ks) \cap Exp(B, ks)}{Exp(A, ks) \cup Exp(B, ks)}$$
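A small sketch of Eq. (8) using OpenCV dilation is shown below; the mask dtypes and wrapper function are assumptions.

```python
import cv2
import numpy as np

def boundary_iou(pred_edges, gt_edges, ks=5):
    """Eq. (8): dilate both binary edge masks with a ks x ks kernel, then compute their IoU."""
    kernel = np.ones((ks, ks), np.uint8)
    a = cv2.dilate(pred_edges.astype(np.uint8), kernel) > 0
    b = cv2.dilate(gt_edges.astype(np.uint8), kernel) > 0
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0
```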

3. Results

3.1. Edge

We compared several of the most outstanding edge-detection methods of recent years, including BDCN (He et al. Citation2019), DexiNed (Poma, Riba, and Sappa Citation2020), PiDiNet (Su et al. Citation2021), and EDTER (Pu et al. Citation2022). All methods were trained with the authors' proposed parameters to achieve optimal prediction results. As shown in Table 2, BDCN achieved the highest ODS and OIS scores among all methods for the dataset in Chenzhou City, Hunan Province. Our DDLNet method obtained second-place scores in ODS (67.38%) and OIS (67.43%), with a marginal difference of 0.48% from the top-performing method for both ODS and OIS, and 9.8% and 6.64% from the third-place PiDiNet.

Table 2. Accuracy comparison of different edge detections for crop fields in Chenzhou city.

Both BDCN and DDLNet exhibited better extraction of edges that were unaffected by non-crop fields, resulting in close OIS and ODS scores. However, the edge strength map produced by BDCN had a larger pixel width, which made it difficult to clearly distinguish between two adjacent edges. In contrast, our method produces a smaller pixel width, which better distinguishes between the edges and handles this case more effectively. The scores for PiDiNet (OIS: 57.58%, ODS: 60.79%) were weaker than those of our method because PiDiNet generated more noise and did not distinguish clearly between soil ridges and field edges. EDTER (OIS: 37.84%, ODS: 46.08%) exhibited a lower extraction ability and a large amount of noise at the edges. After repeatedly adjusting the parameters and experimenting, we found that the edge extraction ability of DexiNed was insufficient and its performance on the Chenzhou data was not acceptable. Figure 7 presents a comparison of the different methods.

Figure 7. Comparison of different methods used on the test dataset in Chenzhou city.


Table 3 shows that EDTER achieved the best OIS (70.14%) and ODS (68.88%) scores among all methods on the cultivated land dataset in Northern Zhejiang. In contrast, DDLNet attained lower OIS (66.52%) and ODS (66.51%) scores, which were 3.62% and 2.37% lower than EDTER, respectively.

Table 3. Accuracy comparison of different edge detection methods for crop fields in Northern Zhejiang.

EDTER stood out in crop field extraction tasks, accurately extracting crop field edges despite its tendency to also extract non-cropland edges and soil ridges. PiDiNet and BDCN could also extract edges but suffered from high noise, broken edges, and incorrectly extracted edges. DexiNed extracted a significant number of unnecessary internal edges, which decreased its OIS and ODS scores significantly. DDLNet produced the visually cleanest edges with low noise but suffered from limited feature extraction ability in some images and exhibited more pixel points below the fixed threshold (selected by OIS and ODS in the calculation process) than EDTER, resulting in lower OIS and ODS scores. Figure 8 compares the edges from the different methods.

Figure 8. Comparison of different methods on the Zhejiang test dataset in crop fields.


Edge-detection methods produce non-binary edge-strength maps. Therefore, ODS and OIS select appropriate thresholds for the binarization and computation of the optimal threshold. In essence, ODS sets a uniform threshold for all images, maximizing the F1 score across the entire dataset, while OIS chooses distinct thresholds for individual images, optimizing their F1 score. However, these methods have limitations. They only consider the edge detection capability and do not reflect the performance of crop field extraction or edge-breaking situations. Selecting thresholds that are too high or too low can also result in false or missed detections, affecting the accuracy of performance metrics. These factors are crucial for practical applications because they can affect the actual performance of the model. Therefore, appropriate postprocessing is necessary to demonstrate the capability of the model in practical applications.

3.2. Post-processing

Vectorization is the process of converting the edge information in an image into vector form. In this process, the pixels in the image are transformed into a series of vectors or curves with geometric attributes such as coordinates, length, and width. It is commonly used in map-making, image recognition, and other fields. Accurate edge strength maps are crucial for improving the accuracy and efficiency of vectorization. As shown in Table 4, the vectorization achieved with the DDLNet&Fusion method obtained the best boundary IoU (32.01%) and polygon IoU (61.24%) scores among all methods on the Chenzhou City dataset, demonstrating excellent performance in vectorization tasks. Compared to other methods, DDLNet&Fusion extracts accurate edge information and converts it into a high-quality vector form, thereby improving the accuracy and efficiency of vectorization.

Table 4. Accuracy comparison of post-treatment methods for crop fields in Chenzhou City.

The completeness of edges had the greatest influence on the boundary and polygon IoU scores. Despite post-processing, DexiNed could only detect a limited number of edges, resulting in low boundary IoU (2.89%) and polygon IoU (16.04%) scores. Furthermore, the holes generated during the binarization process significantly affected the boundary IoU scores: EDTER is influenced by its edge strength map, which generates many holes during binarization, leading to a low boundary IoU score (10.69%). However, its polygon IoU (44.13%) was not affected by the accuracy of the edges, and our post-processing successfully extracted most cultivated plots, resulting in a higher score. Moreover, because of the pixel widths of the edges, two adjacent parallel lines are sometimes processed as a single line. The DDLNet edge intensity map had the narrowest pixel width among all models, making it more effective than BDCN at dealing with two parallel lines. Consequently, DDLNet achieved a higher boundary IoU score (32.01%) than BDCN (26.00%). Figure 9 shows a post-processing comparison of the different methods.

Figure 9. Comparison of the fusion results of different methods on the crop field dataset in Chenzhou City.


Table 5 shows that DDLNet&Fusion achieved the best boundary IoU (47.33%) and polygon IoU (64.88%) among all methods for vectorization of the Zhejiang crop fields. Our method outperformed the second-place EDTER in boundary IoU and polygon IoU by 14.54% and 3.59%, respectively.

Table 5. Accuracy of post-processing methods for edge detection in Zhejiang crop field.

For dense crop field extraction, our post-processing method performed well with various edge detection models. It overcame the limitations stemming from low edge intensity, which otherwise leads to losses during binarization, and yielded edges that correspond closely to the actual crop field situation. The edge strength maps generated by BDCN and DexiNed are wider than those generated by our method, and their edge integrity in densely cropped regions was suboptimal, which is reflected in both the boundary IoU and polygon IoU scores compared with our DDLNet. Noise has a significant effect on boundary extraction, and reduced noise is more beneficial for the complete extraction of crop fields. The polygon IoU scores of PiDiNet and EDTER were only marginally lower than DDLNet's, owing to their high edge integrity; however, their output edge strength maps contained unclosed edge pixels that disrupted the fusion process, resulting in boundary accuracy gaps of 15.21% (PiDiNet) and 14.54% (EDTER). The fusion outcomes of the different techniques are presented in Figure 10.

Figure 10. Comparison of the fusion results of different methods in Zhejiang dataset.


3.3. Ablation experiment

To further demonstrate the enhanced performance of our method in semantic edge extraction through the fusion of spatial structure maps, we conducted a comparison with an eight-neighbor tracking algorithm. The underlying principle of the eight-neighbor tracking algorithm involves scanning an image to identify non-zero-pixel points. Starting from each non-zero point, we computed the pixel values in the binarized image and the edge-strength image for the eight surrounding pixels. If the pixel value in one direction of the binarized image exceeded zero, we examined the edge strength pixel values of the three opposite-direction pixels. The pixel with the highest value among the three was selected as the next starting point, and this process was iterated. In our post-processing method, the edge probability map was first binarized using a default threshold of 50. The threshold value was determined based on empirical experience, aiming to strike a balance between broken lines and the desired edge width. Subsequently, the skeleton of the binarized image was extracted, followed by vectorization.
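For comparison, a simplified sketch of such a tracker is shown below; it follows the spirit of the description (move to the strongest neighboring edge pixel and stop when the edge strength drops to zero) but replaces the three-opposite-pixel rule with a simpler choice among unvisited 8-neighbors, so the names and stopping details are assumptions.

```python
import numpy as np

# 8-connected neighbor offsets, clockwise from the top-left.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def trace_from(start, binary, strength, visited):
    """Follow an edge from `start`, at each step moving to the unvisited 8-neighbor
    with the highest edge-strength value; tracking stops when no candidate remains."""
    path, cur = [start], start
    visited.add(start)
    while True:
        candidates = []
        for dr, dc in NEIGHBORS:
            nxt = (cur[0] + dr, cur[1] + dc)
            if (0 <= nxt[0] < binary.shape[0] and 0 <= nxt[1] < binary.shape[1]
                    and nxt not in visited and binary[nxt] > 0 and strength[nxt] > 0):
                candidates.append(nxt)
        if not candidates:
            return path                               # edge strength dropped to zero or dead end
        cur = max(candidates, key=lambda p: strength[p])
        visited.add(cur)
        path.append(cur)
```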

As shown in Table 6, using fusion effectively reduces the broken-line problems during the vectorization of crop fields, achieving the best scores and the closest visual resemblance to reality among all methods. In the tobacco dataset, the fused boundary IoU (32.01%) and polygon IoU (61.24%) achieved the highest scores. In the Zhejiang dataset, the fusion method also obtained the highest scores for boundary IoU (47.33%) and polygon IoU (64.88%).

Table 6. Accuracy comparison of different post-processing methods.

As shown in Figure 11, when Fusion or Trace is not employed, the default post-processing method easily produces broken lines during binarization. Because we discarded broken lines that could not form an area, the resulting boundary and polygon IoU scores were lower than those of Fusion and Trace. Trace relies on the edge-strength image, which means that tracking cannot continue when the edge probability is zero. Moreover, edges with insufficient probability can affect the tracking performance, because edges detected with low probabilities may not be accurate or may be composed of multiple cluttered lines. Therefore, its boundary IoU and polygon IoU scores were higher than those of the default method but lower than those of Fusion. Our method is guided by the correct edge spatial structure map, which decreases the probability of incorrect connections and results in the highest scores.

Figure 11. Comparison of the effects of different post-processing methods on the Zhejiang dataset.


4. Discussion

Convolutional neural networks utilize multi-scale feature fusion to retain more details, but this can limit edge refinement capabilities and result in blurred and disconnected edges. The experimental results demonstrated that our method outperformed other semantic edge models in crop field extraction, even though its OIS and ODS scores were not the best. There were two key reasons for proposing our model: we enhanced the fusion of features at different scales, and we designed the CMSE loss function, which effectively reduced the occurrence of broken and blurred edges. The most advanced models (BDCN, EDTER, PiDiNet, DexiNed) all concentrate on edges in natural images and work efficiently there, but they ignore the connectivity and thickness of edges, which are essential in crop field extraction. For instance, BDCN used five blocks, each followed by a pooling layer to progressively enlarge the receptive field in the subsequent blocks, with a Scale Enhancement Module (SEM) to achieve a better trade-off between efficiency and accuracy. The results showed that large-scale features were easy to learn, but SEM did not appear to work well for extracting small crop field plots; the output was either blurred or too thick. We also tested the transformer's ability to extract crop field edges with EDTER. Different from CNNs, EDTER was designed to capture the long-range global context of the whole image and then extract short-range local cues for fine-grained edges in two stages. EDTER clearly compensated for the CNNs' weakness in learning long-range global context, and the experiments showed excellent performance with the highest OIS and ODS scores on the Zhejiang dataset. However, the results on the Chenzhou dataset did not reveal the same enhanced performance, and ongoing improvements to transformers for edge detection are still necessary. In conclusion, it is essential to highlight that despite the effectiveness of our edge detection model, disconnected edges persisted in the binary classification results, particularly when handling blurred images or images with unclosed edges. This challenge remains a significant concern for semantic edge models and has yet to be fully addressed.

The diversity of the crop field sample had two crucial impacts on our experimental results. Firstly, it posed challenges in detecting and distinguishing different edge features during model training. Secondly, selecting the right threshold value during postprocessing was crucial. In Chenzhou, Hunan Province, where crop field plots had distinct edges but were similar to non-crop fields, setting a high threshold led to more disconnected boundaries. In high-density crop fields in Northern Zhejiang, where plots were closely situated and exhibited diverse shapes, opting for a low threshold resulted in thicker edges and overlapping boundaries. Striking the right balance in threshold selection was vital to accurately capturing boundaries in various crop field samples.

In practical applications, crop field edges containing broken lines cannot be used, so we removed the edges of crop fields that could not be polygonized during post-processing. Removing messy lines can optimize the crop field extraction results; however, because sometimes only a small part of a whole crop field is missing yet the entire edge is removed, this also impairs the accuracy of the post-processed edges compared with the semantic edges. Additionally, the eight-neighbor tracking algorithm is limited by the edge intensity map: if the edge intensity map is blurred, jagged edges may be generated and the edge transitions are not sufficiently smooth; if the edge probability is low, tracking cannot continue and edge disconnection cannot be avoided.

To enhance post-processing extraction, we introduced Relationformer, an effective and highly connected graph structure detection method whose robust connectivity solved the issue of broken lines after semantic edge thresholding. In long-distance edge detection, using clear line segments proved superior to binarization and eight-neighborhood tracking, although not without challenges. The accuracy of output nodes significantly influenced extraction outcomes, especially in areas with subtle changes, leading to potential positional shifts. Dense edges, irregular shapes, and short-distance segments posed difficulties in reflecting edge structure changes. Moreover, combining two edge detection methods sometimes resulted in misaligned edges, mainly due to neglecting edge structure diagram directionality. This approach led to significant errors in certain cases compared to ground-truth data.

The setting of hyperparameters also has an impact on the experimental results. For example, the prediction results of DDLNet typically exhibited the cleanest edges when the number of training epochs was set to 400. Reducing the number of training epochs resulted in more noise and thicker edges, and if the number exceeded 400, the generated edges were often incomplete. Opting for fewer training epochs can help retain more features, but it complicates the evaluation of negative samples during post-processing, and our post-processing algorithm needs further improvement. We opted for HIDDEN_DIM = 256 and DIM_FEEDFORWARD = 128 to accelerate training and prevent overfitting when training Relationformer. The size of OBJ_TOKEN can affect both the accuracy of the prediction results and the training speed across different datasets. After conducting multiple experimental comparisons, we set OBJ_TOKEN = 250 for the Chenzhou dataset to achieve better predictions. Given the substantially larger number of categories and the complex terrain variations in the Zhejiang dataset, we set OBJ_TOKEN = 600 to enhance prediction accuracy. However, this decision came with a trade-off, as it reduces the model training speed and increases GPU memory usage. The specific choice should be based on the complexity of the training data.

The extraction of complete crop field edges is a complex task. The current approach focuses on solving the connectivity problem of crop field edge extraction using semantic edges and graphs. However, this approach has limitations. In subsequent studies, we plan to incorporate the direction of graph structures into our approach and consider the similarity of polygons to avoid outcomes that do not match the visual effects.

5. Conclusion

This paper introduced a multi-level framework that utilizes a combination of CNN and transformers for extracting crop fields from aerial images. It consists of two stages, each targeting specific challenges. In the first stage, we address the issue of discontinuity in detecting semantic crop field edges by employing a novel CNN model called Densification D-LinkNet. This model incorporates full-scale skip connections, deep multi-scale supervision, and semantic edge supervision, thereby enhancing edge-detection capabilities. In the second stage, we leverage both semantic and topological information about the edges. The topological information is utilized to correct any discontinuous edges during the vectorization process, ultimately producing complete and usable crop field vector plots.

Our method outperformed previous approaches on two different experimental datasets by addressing the issue of partially missing edges that result from inaccurate semantic edge detection. Additionally, we introduced topological information into the edge extraction process for the first time, resulting in advanced edge extraction effectiveness.

Acknowledgments

Our experiment was based on two research datasets, both of which are proprietary to our laboratory. All authors express their gratitude to the reviewers and editors for their helpful comments and suggestions.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

The dataset used in the article can be downloaded at https://pan.baidu.com/s/1JWivVLK2RVURm1MgGcQOyg?pwd=dgt8 –dgt8 and https://drive.google.com/drive/folders/1xHla26EzwOC_KiS2FkHcLkJSd7sQbhnk?usp=sharing; Some network-related codes are publicly available at https://github.com/649064287/DDLNet-main.git.

Additional information

Funding

This work was supported in part by the National Key Research and Development Program of China under Grant 2018YFB0505300 and in part by the National Natural Science Foundation of China under Grant 41701472 and 41971375.

References

  • Badrinarayanan V, Kendall A, Cipolla R. 2017. Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell. 39(12):2481–2495. doi:10.1109/TPAMI.2016.2644615.
  • Bastani F, He S, Abbar S, Alizadeh M, Balakrishnan H, Chawla S, Madden S, DeWitt D. 2018. Roadtracer: automatic extraction of road networks from aerial images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Bertasius G, Shi J, Torresani L. 2015. Deepedge: a multi-scale bifurcated deep network for top-down contour detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Bins LS, Fonseca LG, Erthal GJ, Ii FM. 1996. Satellite imagery segmentation: a region growing approach. Simpósio Brasileiro de Sensoriamento Remoto. 8(1996):677–680.
  • Canny J. 1986. A computational approach to edge detection. IEEE Trans Pattern Anal Mach Intell. PAMI-8(6):679–698. doi:10.1109/TPAMI.1986.4767851.
  • Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S. 2020. End-to-end object detection with transformers. Paper presented at the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16.
  • Chen H, Wang Y, Guo T, Xu C, Deng Y, Liu Z, Ma S, Xu C, Xu C, Gao W. 2021. Pre-trained image processing transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Chen L-C, Papandreou G, Kokkinos I, Murphy K, Yuille AL. 2017. Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans Pattern Anal Mach Intell. 40(4):834–848. doi:10.1109/TPAMI.2017.2699184.
  • Chen M, Radford A, Child R, Wu J, Jun H, Luan D, Sutskever I. 2020. Generative pretraining from pixels. In: Daumé Hal, III and Singh Aarti, editors. Proceedings of the 37th International Conference on Machine Learning. Proceedings of Machine Learning Research. p. 1691–1703.
  • Chu H, Li D, Acuna D, Kar A, Shugrina M, Wei X, Liu M-Y, Torralba A, Fidler S. 2019. Neural turtle graphics for modeling city road layouts. In: Proceedings of the IEEE/CVF International Conference on Computer Vision.
  • Dai J, Qi H, Xiong Y, Li Y, Zhang G, Hu H, Wei Y. 2017. Deformable convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision.
  • Diakogiannis FI, Waldner F, Caccetta P. 2021. Looking for change? Roll the dice and demand attention. Remote Sensing. 13(18):3707. doi:10.3390/rs13183707.
  • Diakogiannis FI, Waldner F, Caccetta P, Wu C. 2020. ResUNet-a: a deep learning framework for semantic segmentation of remotely sensed data. ISPRS J Photogrammetry Remote Sens. 162:94–114. doi:10.1016/j.isprsjprs.2020.01.013.
  • Fan J, Yau DK, Elmagarmid AK, Aref WG. 2001. Automatic image segmentation by integrating color-edge extraction and seeded region growing. IEEE Trans Image Process. 10(10):1454–1466. doi:10.1109/83.951532.
  • Firdaus-Nawi M, Noraini O, Sabri MY, Siti-Zahrah A, Zamri-Saad M, Latifah H. 2011. DeepLabv3+: encoder-decoder with atrous separable convolution for semantic image segmentation. Pertanika J Trop Agric Sci. 34(1):137–143.
  • Fu J, Liu J, Tian H, Li Y, Bao Y, Fang Z, Lu H. 2019. Dual attention network for scene segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • He J, Zhang S, Yang M, Shan Y, Huang T. 2019. Bi-directional cascade network for perceptual edge detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • He S, Bastani F, Jagwani S, Alizadeh M, Balakrishnan H, Chawla S, Elshrif MM, Madden S, Sadeghi MA. 2020. Sat2graph: road graph extraction through graph-tensor encoding. Paper presented at the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIV 16.
  • He S, Bastani F, Jagwani S, Park E, Abbar S, Alizadeh M, Balakrishnan H, Chawla S, Madden S, Sadeghi MA. 2020. Roadtagger: robust road attribute inference with graph neural networks. Paper presented at the Proceedings of the AAAI Conference on Artificial Intelligence. doi:10.1609/aaai.v34i07.6730.
  • Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. 2017. Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Huang H, Lin L, Tong R, Hu H, Zhang Q, Iwamoto Y, Han X, Chen Y-W, Wu J. 2020. Unet 3+: a full-scale connected unet for medical image segmentation. Paper presented at the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
  • Kolesnikov A, Dosovitskiy A, Weissenborn D, Heigold G, Uszkoreit J, Beyer L, Minderer M, Dehghani M, Houlsby N, Gelly S. 2021. An image is worth 16x16 words: transformers for image recognition at scale. Paper presented at the International Conference on Learning Representations.
  • Lin G, Milan A, Shen C, Reid I. 2017. Refinenet: multi-path refinement networks for high-resolution semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Liu Y, Cheng M-M, Hu X, Wang K, Bai X. 2017. Richer convolutional features for edge detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Long J, Li M, Wang X, Stein A. 2022. Delineation of agricultural fields using multi-task BsiNet from high-resolution satellite images. Inter J Appl Earth Observ Geoinform. 112:102871. doi:10.1016/j.jag.2022.102871.
  • Maggiori E, Tarabalka Y, Charpiat G, Alliez P. 2017. Convolutional neural networks for large-scale remote-sensing image classification. IEEE Trans Geosci Remote Sensing. 55(2):645–657. doi:10.1109/TGRS.2016.2612821.
  • Peng C, Zhang X, Yu G, Luo G, Sun J. 2017. Large kernel matters–improve semantic segmentation by global convolutional network. Paper Presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Poma XS, Riba E, Sappa A. 2020. Dense extreme inception network: towards a robust cnn model for edge detection. Paper Presented at the Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision.
  • Prishchepov A, Radeloff V, Buchner J, Yin H, Kuemmerle T, Bleyhl B. 2018. Mapping agricultural land abandonment from spatial and temporal segmentation of Landsat time series.
  • Pu M, Huang Y, Liu Y, Guan Q, Ling H. 2022. Edter: edge detection with transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Ronneberger O, Fischer P, Brox T. 2015. U-net: convolutional networks for biomedical image segmentation. Paper presented at the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18.
  • Shit S, Koner R, Wittmann B, Paetzold J, Ezhov I, Li H, Pan J, Sharifzadeh S, Kaissis G, Tresp V. 2022. Relationformer: a unified framework for image-to-graph generation. Paper presented at the Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXVII.
  • Stewart R, Andriluka M, Ng AY. 2016. End-to-end people detection in crowded scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Su Z, Liu W, Yu Z, Hu D, Liao Q, Tian Q, Pietikäinen M, Liu L. 2021. Pixel difference networks for efficient edge detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision.
  • Tan Y-Q, Gao S-H, Li X-Y, Cheng M-M, Ren B. 2020. Vecroad: point-based iterative graph exploration for road graphs extraction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Tremeau A, Borel N. 1997. A region growing and merging algorithm to color segmentation. Pattern Recognition. 30(7):1191–1203. doi:10.1016/S0031-3203(96)00147-1.
  • Volpi M, Tuia D. 2018. Deep multi-task learning for a geographically-regularized semantic segmentation of aerial images. ISPRS J Photogrammetry Remote Sens. 144:48–60. doi:10.1016/j.isprsjprs.2018.06.007.
  • Wagner MP, Oppelt N. 2020. Deep learning and adaptive graph-based growing contours for agricultural field extraction. Remote Sensing. 12(12):1990. doi:10.3390/rs12121990.
  • Waldner F, Diakogiannis FI. 2020. Deep learning on edge: extracting field boundaries from satellite images with a convolutional neural network. Remote Sens Environ. 245:111741. doi:10.1016/j.rse.2020.111741.
  • Wang G, Wang X, Li FW, Liang X. 2019. Doobnet: deep object occlusion boundary detection from an image. Paper presented at the Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part VI 14.
  • Wang S, Waldner F, Lobell DB. 2022. Unlocking large-scale crop field delineation in smallholder farming systems with transfer learning and weak supervision. Remote Sens. 14(22):5738. doi:10.3390/rs14225738.
  • Wei Y, Zhang K, Ji S. 2020. Simultaneous road surface and centerline extraction from large-scale remote sensing images using CNN-based segmentation and tracing. IEEE Trans Geosci Remote Sensing. 58(12):8919–8931. doi:10.1109/TGRS.2020.2991733.
  • Xia L, Luo J, Sun Y, Yang H. 2018. Deep extraction of cropland parcels from very high-resolution remotely sensed imagery. Paper presented at the 2018 7th International Conference on Agro-geoinformatics (Agro-geoinformatics). doi:10.1109/Agro-Geoinformatics.2018.8476002.
  • Xia N, Wang Y, Xu H, Sun Y, Yuan Y, Cheng L, Jiang P, Li M. 2016. Demarcation of prime farmland protection areas around a metropolis based on high-resolution satellite imagery. Sci Rep. 6(1):37634. doi:10.1038/srep37634.
  • Xie S, Tu Z. 2015. Holistically-nested edge detection. In: Proceedings of the IEEE International Conference on Computer Vision.
  • Xu L, Yang P, Yu J, Peng F, Xu J, Song S, Wu Y. 2023. Extraction of cropland field parcels with high resolution remote sensing using multi-task learning. Europ J Remote Sens. 56(1):2181874. doi:10.1080/22797254.2023.2181874.
  • Yang J, Price B, Cohen S, Lee H, Yang M-H. 2016. Object contour detection with a fully convolutional encoder-decoder network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Zhang H, Liu M, Wang Y, Shang J, Liu X, Li B, Song A, Li Q. 2021. Automated delineation of agricultural field boundaries from Sentinel-2 images using recurrent residual U-Net. Inter J Appl Earth Observ Geoinform. 105:102557. doi:10.1016/j.jag.2021.102557.
  • Zhang TY, Suen CY. 1984. A fast parallel algorithm for thinning digital patterns. Commun ACM. 27(3):236–239. doi:10.1145/357994.358023.
  • Zheng S, Lu J, Zhao H, Zhu X, Luo Z, Wang Y, Fu Y, Feng J, Xiang T, Torr PH. 2021. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Zhou L, Zhang C, Wu M. 2018. D-LinkNet: linkNet with pretrained encoder and dilated convolution for high resolution satellite imagery road extraction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops.
  • Zhu XX, Tuia D, Mou L, Xia G-S, Zhang L, Xu F, Fraundorfer F. 2017. Deep learning in remote sensing: A comprehensive review and list of resources. IEEE Geosci Remote Sens Mag. 5(4):8–36. doi:10.1109/MGRS.2017.2762307.
  • Zhu X, Su W, Lu L, Li B, Wang X, Dai J. 2020. Deformable DETR: deformable Transformers for End-to-End Object Detection. Paper presented at the International Conference on Learning Representations.