
Analysis of large-scale UAV images using a multi-scale hierarchical representation

Huai Yu, Jinwang Wang, Yu Bai, Wen Yang & Gui-Song Xia
Pages 33-44 | Received 19 Dec 2016, Accepted 12 Mar 2017, Published online: 16 Jan 2018

Abstract

Unmanned aerial vehicle (UAV)-based imaging systems have many advantages over other platforms, such as high flexibility and low cost in collecting images, which gives them wide application prospects. However, UAV-based acquisition commonly produces very high-resolution and very large-scale images, which poses great challenges for subsequent applications. Therefore, an efficient representation of large-scale UAV images is necessary for extracting the required information in a reasonable time. In this work, we propose a multi-scale hierarchical representation, i.e. the binary partition tree, for analyzing large-scale UAV images. More precisely, we first obtain an initial partition of the images with an oversegmentation algorithm, i.e. simple linear iterative clustering. Next, we merge similar superpixels to build an object-based hierarchical structure by fully considering the spectral and spatial information of the superpixels and their topological relationships. Moreover, objects of interest and an optimal segmentation are obtained using object-based analysis methods on the hierarchical structure. Experimental results on the post-seismic UAV images of the 2013 Ya’an earthquake and on a mosaic of images of the south-west of Munich demonstrate the effectiveness and efficiency of the proposed method.

1. Introduction

1.1. Motivation and objective

Nowadays, unmanned aerial vehicle (UAV)-based imaging systems are used in many remote sensing applications, such as DEM generation (Yang and Chen Citation2015) and object detection (Moranduzzo and Melgani Citation2014). UAVs have several advantages over traditional remote sensing platforms, such as higher flexibility and lower cost in collecting images, higher speed and better security. More importantly, the images are acquired at very high resolution (VHR), providing sufficient detail for the identification and extraction of objects. In addition, the UAV pre-processing procedures, i.e. image stitching (Li, Hui et al. Citation2015; Xu et al. Citation2016; Yu and Yang Citation2016) and 3D reconstruction (Turner, Lucieer, and Christopher Citation2012; Schönberger, Fraundorfer, and Frahm Citation2014), output an extremely large-scale digital orthophoto map (panorama or DOM) with VHR. The panorama or DOM can provide a global and detailed inspection of the investigated area. Thus, applications such as segmentation and object of interest extraction are very important tasks for processing the large-scale VHR panorama or orthoimage built from UAV images. However, the interpretation of very large-scale VHR images remains a great challenge due to the large data volume and semantic complexity. In this paper, we mainly focus on the extraction of objects of interest and the optimal segmentation of large-scale UAV images, which involve several fundamental and essential problems:

(1)

The coarse-to-fine representation of large-scale VHR UAV images. A panorama or orthoimage of UAV images is a unification of multi-scale objects, with large-scale objects at coarse levels and small objects at fine levels. Thus, a multi-scale representation is essential to simultaneously detect large-scale objects and small-scale objects in VHR images. To construct a precise multi-scale representation, both color and texture information can make contributions. Thus, the extraction of image features and the design of hierarchical structure are still the key problems in VHR image interpretation.

(2)

The semantic information mining in VHR images. Based on the hierarchical image representation, several images of the same scene at different resolutions (such as low, medium, high and VHR) are available. However, the extraction of regions of interest and optimal segmentation still face some challenges related to image size, extraction accuracy and computational complexity. In addition, how to describe these semantic regions in the hierarchical structure is also a problem for the information mining task.

1.2. Related work

The afore-mentioned key issues in UAV image processing can generally be summed up as multi-scale representation and information mining. In the past two decades, there have been many valuable works in the literature. For large-scale VHR image representation, a multi-scale hierarchical structure is the general practice. Three main methods are utilized, i.e. the image pyramid, the wavelet transform and hierarchical image partitions. Binaghi et al. analyzed high-resolution scenes through a set of concentric windows and a Gaussian pyramidal resampling approach (Binaghi, Gallo, and Pepe Citation2003). Yang and Newsam (Citation2011) proposed a spatial pyramid co-occurrence to characterize the photometric and geometric aspects of images. Pyramid structures capture both the absolute and the relative spatial arrangements of objects, but the fixed regular shape and size of the analysis window cannot capture semantic differences. Baraldi and Bruzzone (Citation2004) used an almost complete basis for the Gabor wavelet transform of images at selected spatial frequencies, which appeared to be superior to the dyadic multi-scale Gaussian pyramid image decomposition. In effect, the wavelet decomposition low-pass filters the VHR image and represents the multi-scale property by the coefficients in different bands. However, the wavelet decomposition is a decimation of the original image that does not consider the relationships between objects. By fully considering the semantic gap between different objects, some studies use object-based analysis methods to produce hierarchical image partitions. The topographic representation (Monasse and Guichard Citation2000; Luo, Aujol, and Gousseau Citation2009; Xia, Delon, and Gousseau Citation2010) is typically built on the gray-level image and therefore rarely accounts for color differences. In (Salembier and Garrido Citation2000; Bai et al. Citation2015), various types of images, e.g. natural images, hyperspectral images and PolSAR images, were represented by a hierarchical structure, namely, the binary partition tree (BPT), which was constructed based on particular region models and merging criteria. In these methods, the BPT is an effective hierarchical structure for processing and analyzing images: it is theoretically established, easy to implement, and provides effective results. Starting from initial pixels or partitions of the image support, the BPT construction merges the pair of most similar regions at each step, until a single region is obtained. This structure represents homogeneous regions in the UAV data at different detail levels. Among these three hierarchical representation methods, the BPT is a relatively better structure, with both local intrinsic properties and topological relationships well preserved.

Segmentation is an important task in processing and analyzing UAV data, where an image is usually partitioned into several distinct and meaningful regions that fulfill a certain homogeneity criterion. These segments can provide rich information about the regions (color, texture, and geometric structure of the targets and scenes). However, the BPT representation often contains more regions than required, and a segmentation based on the BPT is usually obtained by setting the number of segments or a threshold. In the particular case of UAV data, obtaining an optimal segmentation from the hierarchy is still a difficult task. Several approaches to this problem have been proposed in other domains (Akcay and Aksoy Citation2008; Kiran and Serra Citation2013). The general practice is to use dynamic programming (DP) to minimize a set of energy functions. The original idea of minimizing an energy function over a hierarchy was provided by Breiman et al. (Citation1984). Salembier (Citation2015) proposed to use DP for pruning the BPT. These methods obtain the optimal partition by cutting a set of edges in the BPT structure. DP traverses the tree structure in a bottom-up way to find the globally optimal segmentation. The direct minimizing methods in DP treat the semantic gap between a node and its two descendants as a constant value. However, the semantic gap increases from fine levels to coarse levels. To overcome the limitations of undersegmentation at coarse levels and oversegmentation at fine levels, the uniform entropy slice (UES) (Xu, Whitt, and Corso Citation2013) was proposed to flatten the hierarchy into a single segmentation and to seek a selection of objects that balances the objects’ energy function and their relative level.

For the extraction of semantic objects of interest from the hierarchical structure of VHR images, the general approach is to use feature-based classification or deep learning methods. The feature-based classification methods (Felzenszwalb et al. Citation2010; Yu, Yang et al. Citation2016) use a list of standard features, e.g. color, texture and structure information, together with pooling models, such as part-based models and bag of words, to describe images. The features are then fed into classifiers, such as support vector machines, decision trees and random forests, to obtain the final labels of the objects of interest. To overcome the limitation of handcrafted or shallow learning-based features, deep learning algorithms, especially convolutional neural networks, are used to extract features for object detection (Cheng et al. Citation2016). However, regardless of whether traditional machine learning or deep learning methods are used, a large number of labeled samples is required to train detection models. Besides, special models should be established to maintain scale invariance. For object of interest extraction in large-scale UAV images, the various land cover categories and complex semantic information make such model training methods inefficient and inapplicable. The general practice is to select one or several sample regions and then search for similar objects over the multi-scale representation of the images, as done for example in eCognition.

1.3. Our contributions

Inspired by the excellent work in the afore-mentioned aspects and by our preliminary work in UAV image analysis (Yu, Yan et al. Citation2016), an object-based multi-scale hierarchical structure, the BPT, is utilized in this paper to represent large-scale UAV images and to address the two important topics, i.e. optimal segmentation and extraction of objects of interest. The BPT structure derives from an initial partition obtained by an oversegmentation algorithm, i.e. simple linear iterative clustering (SLIC). During the BPT construction, we fully consider the spectral and spatial information of the superpixels and their topological relationships. Besides, an optimal segmentation is achieved by dynamic programming, which is different from our former work using the uniform homogeneity slice. Moreover, objects of interest are extracted using a structural nearest neighbor search method. Figure 1 shows a brief schematic diagram for the analysis of large-scale UAV images.

Figure 1. A brief schematic diagram for analysis of large-scale UAV images. Source: Li, Tang et al. (Citation2015).


The current paper differs from our conference paper (Yu, Yan et al. Citation2016) in the following three aspects: (1) the optimal segmentation strategy is modified using dynamic programming to find the globally optimal solution; (2) the objects of interest extraction using structural nearest neighbor search based on BPT is proposed; and (3) the extensive new experiments using the modified optimal segmentation and objects of interest extraction strategies are conducted and re-evaluated.

The remainder of this paper is organized as follows. Section 2 presents the object-based BPT construction of large-scale UAV images. Section 3 introduces the optimal segmentation algorithm and the objects of interest extraction method in detail. Some representative experimental results are exhibited in Section 4. Finally, we draw the conclusion of this work in Section 5.

2. Hierarchical image representation

The large data volume and very high resolution make the interpretation of a UAV panorama a great challenge. Object-based image analysis (OBIA) methods not only preserve the useful information in UAV images but also reduce the data volume to be analyzed. In addition, a hierarchical representation based on OBIA is an essential step for VHR image interpretation. This section mainly introduces the BPT representation of the UAV panorama image.

2.1. Superpixels partition

Superpixel segmentation algorithms can be roughly divided into graph-based and gradient-based methods. Considering the computation speed and partition performance, we utilize the SLIC algorithm (Achanta et al. Citation2012) to obtain the initial superpixels. This method can produce consistent superpixels with similar sizes and shapes, and keep image boundary information at the same time. Furthermore, the segmentation results of large-scale UAV images are convenient for subsequent processing.

In this step, the large-scale VHR image with complex boundaries is segmented into many superpixels. Each superpixel is relatively homogeneous, so it is unnecessary to consider the details inside each superpixel. In addition, the superpixel-based description speeds up the subsequent processing while preserving the useful information. According to the principle of SLIC, the region size and regularity of the superpixels can be set empirically.
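As an illustration, the initial partition can be obtained with an off-the-shelf SLIC implementation. The following is a minimal sketch using scikit-image; the file name, number of segments and compactness are hypothetical settings used for demonstration, not the values used in our experiments.

```python
from skimage import io
from skimage.segmentation import slic

# Load a UAV image (the path is a placeholder).
image = io.imread("uav_tile.png")

# SLIC oversegmentation: n_segments and compactness control the approximate
# superpixel size and regularity; for large-scale UAV images they are set
# empirically.
labels = slic(image, n_segments=5000, compactness=10, start_label=0)

print("number of superpixels:", labels.max() + 1)
```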

2.2. Region model and similarity criterion

After the image partition, the description of the superpixels is an important task, which directly relates to the measurement of similarity between superpixels. Because the major difference between superpixels lies in color, we leave out size and shape information. The region model is characterized by color names (van de Weijer et al. Citation2009), which are linguistic color labels based on the assignment of colors in the real world. The color labels, which include 11 basic terms (black, blue, brown, gray, green, orange, pink, purple, red, white, and yellow), are learned from Google images. The learning result is a partition of the color space into 11 regions. To use this color feature, the RGB values of a superpixel are mapped to the color attribute space. The color names of a superpixel are defined as follows:

$CN_R = (cn_1, cn_2, \ldots, cn_{11})$ (1)

$cn_i = \frac{1}{N}\sum_{x \in R} p(cn_i \mid f(x)), \quad i = 1, \ldots, 11$ (2)

where $cn_i$ ($i = 1, \ldots, 11$) is the ith color name, N denotes the number of pixels in region R, $f(x)$ is the three-channel value of pixel x in Lab space, and $p(cn_i \mid f(x))$ denotes the probability of the ith color name given pixel x. Color names are more photometrically invariant than other color features because different shades of a color are mapped to the same color name.

The two important concepts in OBIA are the region model and the similarity measure. The region model is characterized by color names; in other words, the region model of a superpixel is equal to its color names. Furthermore, the union of two superpixels is modeled by the average of their color names. The similarity of regions R1 and R2 is measured by the weighted Euclidean distance (WED)

$d_{WED}(R_1, R_2) = \frac{N_1 N_2}{N_1 + N_2}\,\| CN_{R_1} - CN_{R_2} \|_2$ (3)

where $CN_{R_1}$ and $CN_{R_2}$ denote the region models of R1 and R2, respectively, and $N_1$ and $N_2$ denote the data volumes (numbers of pixels) of R1 and R2, respectively.
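To make the region model and merging criterion concrete, the sketch below computes the 11-dimensional color-name descriptor of a region and the weighted Euclidean distance between two regions, following the reconstructions of Equations (1)–(3) above. The per-pixel color-name probabilities are assumed to come from a precomputed lookup table (e.g. the mapping learned by van de Weijer et al.); the function names are illustrative.

```python
import numpy as np

def region_color_names(pixel_probs):
    """Region model (Eqs. (1)-(2)): average the per-pixel color-name
    probabilities over the N pixels of the region.

    pixel_probs: (N, 11) array whose row for pixel x holds p(cn_i | f(x)).
    """
    return pixel_probs.mean(axis=0)   # 11-dimensional descriptor CN_R

def wed(cn1, n1, cn2, n2):
    """Weighted Euclidean distance between two regions (Eq. (3), as
    reconstructed above): small, similar regions are merged first."""
    return (n1 * n2) / (n1 + n2) * np.linalg.norm(cn1 - cn2)
```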

2.3. BPT construction

Based on the superpixel segmentation, the bottom level of the hierarchical structure is composed of the original superpixels. The BPT structure is constructed by merging superpixels. Every node and every level of the hierarchical BPT structure contains semantic information. The leaves represent the original superpixels and the root represents the entire image. We can reconstruct the tree provided that the parent, siblings, and children of every node are available. To enhance the efficiency of the BPT construction, a priority queue is established over all pairs of neighboring regions. When a new pair of regions enters the queue, its position in the queue is determined by the WED of the two regions. The top of the queue, which is the most similar pair of neighboring regions, is popped out for merging. Note that one region has many neighbors; therefore, once a region has been used to generate a new region, all pairs of regions that contain it are no longer used. The pseudo-code of the fast BPT construction process is depicted in Algorithm 1.

Algorithm 1. Fast BPT construction
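A minimal Python sketch of this priority-queue merging loop is given below; the region adjacency lists, region models and the `wed` similarity from the previous sketch are assumed as inputs, and the data structures are illustrative rather than the exact implementation of Algorithm 1.

```python
import heapq

def build_bpt(models, sizes, adjacency):
    """Sketch of fast BPT construction by iterative binary merging.

    models:    dict region_id -> 11-d color-name descriptor (numpy array)
    sizes:     dict region_id -> number of pixels
    adjacency: dict region_id -> set of neighboring region_ids
    Returns the list of merges as (parent_id, child1_id, child2_id).
    """
    alive = set(models)
    heap = []                                   # priority queue ordered by WED
    for r1, neighbors in adjacency.items():
        for r2 in neighbors:
            if r1 < r2:
                heapq.heappush(heap, (wed(models[r1], sizes[r1],
                                          models[r2], sizes[r2]), r1, r2))
    next_id = max(models) + 1
    merges = []
    while len(alive) > 1 and heap:
        _, r1, r2 = heapq.heappop(heap)
        if r1 not in alive or r2 not in alive:
            continue                            # stale pair: already merged
        parent = next_id
        next_id += 1
        # The union is modeled by the average of the children's color names
        # (Section 2.2) and the sum of their sizes.
        models[parent] = (models[r1] + models[r2]) / 2.0
        sizes[parent] = sizes[r1] + sizes[r2]
        adjacency[parent] = (adjacency[r1] | adjacency[r2]) - {r1, r2}
        alive -= {r1, r2}
        alive.add(parent)
        for nb in adjacency[parent]:
            if nb in alive:
                adjacency[nb].add(parent)       # keep adjacency symmetric
                heapq.heappush(heap, (wed(models[parent], sizes[parent],
                                          models[nb], sizes[nb]), parent, nb))
        merges.append((parent, r1, r2))
    return merges
```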

An example of the BPT hierarchical representation of a UAV image is shown in Figure 2.

Figure 2. The hierarchical representation of a UAV image in different levels. (a) A UAV image; (b) Several levels representation of the image.


3. Object-based analysis of the hierarchical representation

3.1. Optimal segmentation

The hierarchical structure, i.e. the BPT, represents the UAV panorama at multiple spatial scales. Based on this structure, hierarchical segmentation algorithms can analyze images at different scales simultaneously, and their output is a set of regions that captures partitions at different scales. An optimal segmentation based on the hierarchical structure can overcome the limitations of oversegmentation at fine levels and undersegmentation at coarse levels, which means that meaningful multi-scale regions can be partitioned exactly. Based on this idea, we design an optimal segmentation method on the hierarchical structure.

3.1.1. Setting leaf-to-root path and objective function

Based on the BPT construction, we denote the maximum hierarchical level by m, the node set of level $T_i$ by $V_i$, the entire tree by T, and an individual node s at level i by $v_i^s$. The only node at $T_1$ is the root of T. Because not all original superpixels are at the bottom level, we copy the nodes at upper levels to the bottom level. Thus, each level of the BPT corresponds to a partition of the image. During the BPT construction, once two nodes are merged, their parent node is placed at a new level.

A segmentation is a non-overlapping division of an image whose union restores the entire image. Thus, a partition in the hierarchy is a set of nodes satisfying the principle that one and only one node is selected on each leaf-to-root path of the hierarchy. For example, Figure 3 shows all valid tree slices of a particular BPT. Each slice is highlighted by a black curve, and the nodes on the slice are darkened.

Figure 3. All valid tree slices of a particular BPT.


In the following, we formulate the above constraints. Let P denote a p × n binary matrix, where p is the number of leaf nodes in T and n is the total number of nodes in T. Each row of P denotes a leaf-to-root path, as shown in Figure 4(a). If a node is in the path, the value at the corresponding location of P is 1; otherwise, it is 0. The path matrix corresponding to the BPT in Figure 4(a) is shown in Figure 4(b). There are three rows in P, which represent the three leaf-to-root paths in the BPT. For instance, the node sequence (V1V2V5) is the path P2; therefore, the second row of P is (1, 1, 0, 0, 1).

Figure 4. BPT and the corresponding path matrix. (a) BPT. (b) Path matrix.


Because a valid tree slice x consists of one and only one node in each path, the valid tree slice satisfies the following formula:

$P\,x^{\top} = \mathbf{1}_p$ (4)

where $\mathbf{1}_p$ is a p × 1 column vector of ones and x is a 1 × n binary vector. If a node is selected in the partition, the value at the corresponding location of x is set to 1; otherwise, it is set to 0. Thus, any x satisfying Equation (4) provides a possible partition of the BPT, which corresponds to a plausible segmentation of the UAV panorama.
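As a small worked example, assume a BPT in which V1 is the root, V2 and V3 are its children, and V4 and V5 are the children of V2 (a labeling consistent with the path (V1V2V5) above, though not necessarily identical to Figure 4). The constraint of Equation (4) can then be checked numerically:

```python
import numpy as np

# Path matrix P for a BPT with p = 3 leaf-to-root paths and n = 5 nodes
# (columns ordered V1..V5); the second row is the path (V1 V2 V5).
P = np.array([[1, 1, 0, 1, 0],
              [1, 1, 0, 0, 1],
              [1, 0, 1, 0, 0]])

# A candidate tree slice selecting nodes V2 and V3.
x = np.array([0, 1, 1, 0, 0])

# Equation (4): x is a valid slice iff every path contains exactly one
# selected node, i.e. P x^T equals the all-ones vector.
print(P @ x)               # -> [1 1 1]
print(np.all(P @ x == 1))  # -> True: a valid partition
```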

According to the constraint above, there are still many feasible tree slices which are proper segmentations of the image. However, our purpose is to find the optimal partition that is most meaningful. In this paper, we propose a meaningful criterion named minimal heterogeneity, which is defined as follows:

$H(R) = \sum_{i \in R} \| CN_i - CN_R \|_2$ (5)

where R is a node in the BPT which consists of several adjacent superpixels, CNR is the region model, and CNi is the model of ith superpixel. Using this criterion, we can obtain the entire heterogeneity of region R.

Thus, the segmentation objective is to seek a slice that minimizes the overall heterogeneity of the selected nodes,

$x^{*} = \arg\min_{x} \sum_{R \in T} x_R\, H(R)$ (6)

subject to Equation (4). Here, T is the BPT node set of a UAV image.

3.1.2. Solving by dynamic programming

However, it is difficult to solve Equation (6) directly, as this would require enumerating all tree slices; moreover, its minimizer is degenerate, selecting all leaf nodes because their heterogeneities are all zero. We therefore add a penalty term that tends to select nodes at coarse levels,

$x^{*} = \arg\min_{x} \sum_{R \in T} x_R\, \big( H(R) + \lambda \big)$ (7)

where λ is a constant regularization term that encourages the optimization to find partitions with a reduced number of regions. Although nodes at coarser levels have relatively higher heterogeneity than nodes at finer levels, the number of coarser-level nodes is smaller than that at finer levels. For the purpose of optimal segmentation, we use dynamic programming (Salembier and Foucher Citation2016), which starts from the initial superpixels and extracts the optimal partition by minimizing the criterion in Equation (7). In the experiment, we set λ according to the authors’ recommendation.
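The bottom-up pruning can be sketched as the following recursion: for each node, the penalized cost of keeping the node is compared with the best total cost of its two subtrees, which is the standard dynamic programming recursion for cutting a hierarchy. The node container, the heterogeneity table and λ are illustrative names, not the exact implementation used in the experiments.

```python
def optimal_slice(node, heterogeneity, lam):
    """Bottom-up DP over the BPT (penalized criterion of Eq. (7)).

    node: tree node with a .children attribute (two children, or an
          empty list for a leaf)
    heterogeneity: dict mapping node -> H(R) as in Eq. (5)
    lam: constant regularization term (lambda)
    Returns (cost, nodes): the minimal cost of the subtree rooted at
    `node` and the list of nodes forming that optimal cut.
    """
    keep_cost = heterogeneity[node] + lam
    if not node.children:
        return keep_cost, [node]            # a leaf can only be kept
    sub = [optimal_slice(c, heterogeneity, lam) for c in node.children]
    split_cost = sum(cost for cost, _ in sub)
    if keep_cost <= split_cost:
        return keep_cost, [node]            # prune: keep the node itself
    return split_cost, [n for _, nodes in sub for n in nodes]
```

Called on the root, this returns the tree slice that minimizes Equation (7) under the constraint of Equation (4).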

Note that this strategy differs from our former work, which used the uniform homogeneity slice (Yu, Yan et al. Citation2016), for two reasons. The first is the quality of the solution: the quadratic programming formulation may return a local optimum, whereas dynamic programming attains the global optimum. The second is the time cost: quadratic programming is very slow for the optimal segmentation of large-scale UAV images, whereas dynamic programming is much faster in this situation.

3.2. Object of interest extraction

The optimal segmentation gives a global inspection of large-scale UAV images by gathering similar pixels into regions. However, without any annotation, we cannot obtain more details about the semantic regions. The extraction of objects of interest deals with this problem by detecting specific objects in the UAV image. In the multi-scale hierarchical structure, every node corresponds to a region of the UAV image, and its semantics can be measured by specific features. Because the same region model is used at different levels of the BPT, the similarity measurement of multi-scale objects can be unified under the same criteria. Given one or several labeled samples, the extraction of objects of interest based on the BPT can be cast as a search for similar nodes in the BPT.

3.2.1. Searching strategy

The nodes at low levels are contained within the nodes at high levels: if a node is classified into a category, all its descendants are classified into the same category. The relationship between nodes is illustrated in Figure 5. Therefore, the search process starts from the root node in a top-down manner and ends when we find the objects or when there is no child node. If a node is classified into the same category as the given sample, the search path from the root to this node is terminated; otherwise, we continue to compare the similarity between its children and the given sample. From the perspective of the merging operation, if the two child nodes of a node belong to the same category, the node itself must belong to that category. Meanwhile, if a node is different from its parent node, the sibling of this node is also different from the node itself.

Figure 5. The same kind of node searching strategy.


3.2.2. Spatial nearest neighbor definition

To measure the similarity between the nodes in the BPT and the given sample, we need to define the region model and calculate the distance between a node and the sample. Using the region model defined in the BPT construction, we calculate the Euclidean distance as follows:

$D_i = \| CN_i - CN_O \|_2$ (8)

where $CN_i$ is the color-name model of the ith node in the BPT and $CN_O$ is the color-name model of the given object sample. A threshold Th is used to judge whether the ith node is similar to the sample: if the distance is smaller than the given threshold, the ith node is preliminarily accepted as belonging to the same category as the given sample. In addition, if a node belongs to the category while its parent node does not, the distance of the parent node will be much larger than the distance of the current node. This constraint can be formulated as follows:

$D_i^P / D_i > \gamma$ (9)

where $D_i^P$ is the Euclidean distance between the parent node of the ith node and the provided sample. The ratio $D_i^P / D_i$ is used to measure the dissimilarity between the two nodes. The parameter γ is the threshold used to judge whether we accept the preliminary node as belonging to the same category as the given sample.

The similarity between the ith node and the given sample can be measured by the above criteria. If the current node is recognized as belonging to the same category as the given sample, all descendants of the ith node also belong to this category, so the search along this path (from the root to the ith node) can be stopped. However, if the current node does not belong to the same category as the given sample, the search proceeds to its two child nodes. That is to say, the path search ends when we find nodes of the same category or when there are no child nodes left.
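The top-down search can be sketched as a simple recursion; `color_names` holds a node’s region model, `cn_sample` is the model of the labeled sample, and th and gamma correspond to the thresholds Th and γ discussed in Section 4.2 (the attribute and parameter names are illustrative).

```python
import numpy as np

def search_objects(node, cn_sample, th, gamma, parent_dist=None, found=None):
    """Top-down search of BPT nodes similar to a labeled sample.

    A node is accepted when its distance to the sample is below th
    (Eq. (8)) and, unless it is the root, its parent is markedly less
    similar (Eq. (9)); otherwise the search descends to its children.
    """
    if found is None:
        found = []
    dist = np.linalg.norm(node.color_names - cn_sample)            # Eq. (8)
    ratio_ok = parent_dist is None or parent_dist / max(dist, 1e-12) > gamma
    if dist < th and ratio_ok:
        found.append(node)       # all descendants share this label
        return found             # stop searching along this path
    for child in node.children:  # empty list for leaf nodes
        search_objects(child, cn_sample, th, gamma,
                       parent_dist=dist, found=found)
    return found
```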

4. Experimental results and discussion

In this section, we provide experimental results on two large-scale UAV images to demonstrate the effectiveness of the proposed analysis methods. The first image comes from the stitching of post-seismic UAV images of the 2013 Ya’an earthquake captured by a Canon 5D Mark II (Li, Hui et al. Citation2015). The particular location is Yuxi village, Baosheng Town, Lushan County. Figure 6 shows part of the large-scale UAV image of Ya’an. As a stitching of 93 image pieces, the data volume is 10,643 × 18,570 pixels. For convenience, we choose a part of the original panorama (3845 × 1038 pixels) to conduct the segmentation experiment.

Figure 6. The large-scale UAV images of the 2013 Ya’an earthquake. Source: Li, Tang et al. (Citation2015).


The other data-set is a mosaic of UAV images of an area located in the south-west of Munich, Germany. The image sequences were acquired in 2015 using the airborne DLR 3K sensor system (Koch et al. Citation2016) with a resolution of approximately 6 cm. The original size of the panorama is 12,240 × 13,540 pixels, and we select a part of the images to conduct our experiments. The experimental platform consists of an Intel Core i7-4790 CPU and 32 GB RAM, and we use Microsoft Visual Studio 2010, OpenCV 2.4.10 and MATLAB. Figure 7 shows the experimental data-set of Munich (4000 × 3000 pixels).

Figure 7. The experimental large-scale UAV image of Munich. Source: Koch et al. (Citation2016).


4.1. Optimal segmentation results

Observing the details of the data-sets, we can see that the two experimental scenes are very complicated and consist of several different semantic areas. For example, buildings around roads show very similar spectral information, and the shadows of buildings strongly disturb the detection of the obscured regions.

Figure 8 shows the optimal segmentation result of the large-scale UAV image of the 2013 Ya’an earthquake. The red lines are the segmentation boundaries, and all pixels surrounded by a closed red line are gathered as one semantic object. To better visualize the segmentation results, the intensity average (of the RGB channels, respectively) of all pixels in each object is mapped to every pixel inside it; the corresponding result is shown in Figure 9. From an overall perspective, the edge information of objects is well preserved and objects with high contrast, such as roads and buildings, are well segmented. Moreover, similar objects at different scales are segmented as individual regions. These qualitative analyses demonstrate that dynamic programming is applicable to optimal segmentation on the hierarchical image representation. However, in some areas with poor photometric contrast, such as farmland and grassland, the segmented edges may pass through some objects. The reason is that dynamic programming picks up the areas with relatively different intensity; the trees, for example, are distinguished from the grassland owing to their greener color.

Figure 8. The optimal segmentation result of UAV image in the 2013 Ya’an earthquake. Source: Li, Tang et al. (Citation2015).


Figure 9. The intensity average of objects of UAV image in the 2013 Ya’an earthquake. Source: Li, Tang et al. (Citation2015).


Figure 10 shows the optimal segmentation result of the large-scale UAV image of Munich. Because the original image is very large, the red edges are not very clear in Figure 10(a), but it is apparent that no red line passes through any object. To further show the effectiveness of our method, we present the optimal segmentation result with the average intensity in Figure 10(b). Distinct objects, such as buildings and roads, are well segmented, and from a global perspective we can still recognize the scene from the average intensity. Moreover, the segmented objects at different scales demonstrate the robustness of the BPT representation.

Figure 10. The optimal segmentation result of large-scale UAV image of Munich. (a) Segmentation result with red edges. (b) Segmentation result with intensity average. Source: Koch et al. (Citation2016).


The experimental results show the effectiveness of the multi-scale hierarchical representation of large-scale UAV images. In addition, we give a brief time analysis to show the efficiency of our method. Table 1 lists the time costs of the BPT construction and the dynamic programming. The number of superpixels corresponds to the number of leaf nodes in the BPT, and the reported time includes both the BPT construction and the dynamic programming. It takes only 0.650 s for the UAV image of Ya’an with 4508 superpixels and 1.915 s for the UAV image of Munich with 13,390 superpixels. Moreover, the time is approximately proportional to the number of superpixels. In addition, we compare this method with our former optimal segmentation method (Yu, Yan et al. Citation2016) using UES. The qualitative results of the optimal segmentation are almost the same; however, the quadratic programming used in UES to find the optimal solution is very slow for large-scale UAV images. From Table 1, we can observe the superiority of dynamic programming.

Table 1. The efficiency analysis for the two large-scale UAV images.

In this study, the dynamic programming uses a constant value λ as regularization. This parameter actually sets a hard threshold on the semantic gap between a node and its two descendants. However, the semantic gap increases from fine levels to coarse levels, which means that the value of λ actually selects the levels that satisfy the defined semantic gap.

For UAV image interpretation, the optimal segmentation of the UAV panorama image can be used in subsequent tasks, such as image classification, object detection and recognition, and change detection. Figure 11 shows the segmented areas of grassy land and buildings. The building areas in Figure 11 are finely distinguished, which can be used for extracting residential areas.

Figure 11. The distinguished area of grassy land and buildings in the segmentation results. (a) The distinguished area one in Munich. (Source: Li, Tang et al. Citation2015) (b) The distinguished area two in Ya’an. (Source: Koch et al. Citation2016)


4.2. Extraction results of objects of interest

The optimal segmentation results give a global inspection of different kinds of objects with distinct boundaries. However, without any annotation of the regions, the resulting objects have no semantic labels, so we cannot distinguish different objects. In this section, we show the experimental results of specific object extraction. Based on the hierarchical structure, we first manually select a patch and assign it a semantic label, then find similar nodes in the BPT. Since the multi-scale property has been considered in the BPT construction, similar regions can be extracted regardless of the size of the patch. Using the above-mentioned Ya’an data-set, we select a road patch with a size of 30 × 30 pixels; the extraction results are shown in Figures 12 and 13.

Figure 12. The road extraction result with red edges in Ya’an UAV image. Source: Li, Tang et al. (Citation2015).


Figure 13. The road extraction result with intensity average in Ya’an UAV image. Source: Li, Tang et al. (Citation2015).


From the road extraction results, we can observe the good performance of the BPT representation; note that only one labeled sample with a size of 30 × 30 pixels is used in this experiment. We also conduct the object of interest extraction experiment on the UAV image of Munich. From the road extraction results (Figure 14), we can see that the road structure is well distinguished. Although some buildings are wrongly detected as roads, these small mistakes have very little effect on the outline of the roads. In addition, such false positives can be easily eliminated with some prior knowledge, such as line structure and continuity.

Figure 14. Road extraction results of UAV images in Munich. (a) Extraction result with red edge. (b) Extraction result with intensity average. Source: Koch et al. (Citation2016).


During the extraction of objects of interest, two important parameters need to be carefully tuned. The first is the similarity threshold Th between an object and the given sample. This threshold defines the semantic gap between two objects, which varies with the definition of the region model; in our method, we use only color names as the region model, so Th varies little, ranging from 0.005 to 0.010. The second parameter is the threshold γ of the spatial distance gap, which further removes outliers. As it defines the ratio of the distance of the parent of a node to that of the node itself, analogous to nearest neighbor search, it varies from 1 to 2. In our experiments, we set it to a constant value.

5. Conclusions

In this paper, a multi-scale hierarchical structure has been proposed to represent large-scale UAV images. The object-based BPT representation deals with the high-resolution and large-scale properties of UAV images by addressing two practical applications, i.e. optimal segmentation and object of interest extraction. This framework provides substantial possibilities for large-scale VHR image classification, object detection, change detection, etc. The experimental results demonstrate that the superpixel-based BPT representation is an effective hierarchical structure for processing and analyzing large-scale UAV images. Future work will focus on refining the results of object extraction by considering prior knowledge about objects and contextual information.

Funding

This work was supported in part by the National Key Basic Research and Development Program of China [grant number 2013CB733404] and the National Natural Science Foundation of China [grant number 61271401], [grant number 91338113].

Notes on contributors

Huai Yu received BSc degree in Electronic Information Engineering from Wuhan University, Wuhan, China, in 2015 and is currently pursuing PhD degree at the School of Electronic Information, Wuhan University. His research interests include high-resolution image classification and change detection, vision-based SLAM, and obstacle avoidance for unmanned aerial vehicle.

Jinwang Wang received BSc degree in Communication Engineering from Lanzhou University, Lanzhou, China, in 2016 and is currently pursuing MSc degree at the School of Electronic Information, Wuhan University. His research interests include superpixel segmentation for unmanned aerial vehicle aerial image, vision-based tracking algorithm.

Yu Bai received BSc degree in Communication Engineering from South central University for Nationalities, Wuhan, China, in 2013 and is currently pursuing PhD degree at the School of Electronic Information, Wuhan University. His research interests include forest vertical structure parameters estimation, and PolSAR/PolInSAR data processing and interpretation.

Wen Yang received BSc degree in Electronic Apparatus and Surveying Technology, the MSc degree in Computer Application Technology, and PhD degree in Communication and Information System, all from Wuhan University, Wuhan, China, in 1998, 2001, and 2004, respectively. From September 2008 to September 2009, he worked as a visiting scholar with the Apprentissage et Interfaces team of the Laboratoire Jean Kuntzmann in Grenoble, France. From November 2010 to October 2013, he worked as a postdoctoral researcher with the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University. He is currently a full professor with the School of Electronic Information, Wuhan University. His research interests include object detection and recognition, image retrieval and semantic segmentation, multi-sensor information fusion, and change detection.

Gui-Song Xia received the BSc degree in Electronic Engineering and the MSc degree in Signal Processing from Wuhan University, Wuhan, China, in 2005 and 2007, respectively, and the PhD degree in Image Processing and Computer Vision from the CNRS LTCI, TELECOM ParisTech, Paris, France, in 2011. From March 2011, he spent one and a half years as a postdoctoral researcher with the Centre de Recherche en Mathématiques de la Décision (CEREMADE), CNRS, Paris-Dauphine University, Paris. He is currently a full professor with the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University. His research interests include mathematical image and texture modeling, content-based image retrieval, structure from motion, perceptual grouping, and remote sensing image understanding.

References

  • Achanta, R., A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk. 2012. “SLIC Superpixels Compared to State-of-the-Art Superpixel Methods.” IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (11): 2274–2282.10.1109/TPAMI.2012.120
  • Akcay, H. G., and S. Aksoy. 2008. “Automatic Detection of Geospatial Objects Using Multiple Hierarchical Segmentations.” IEEE Transactions on Geoscience and Remote Sensing 46 (7): 2097–2111.10.1109/TGRS.2008.916644
  • Bai, Y., W. Yang, G. S. Xia, and M. Liao. 2015. “A Novel Polarimetric-Texture-structure Descriptor for High-resolution PolSAR Image Classification.” Paper presented at Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, Milan, Italy, July 26–31: 1136–1139.
  • Baraldi, A., and L. Bruzzone. 2004. “Classification of High Spatial Resolution Images by Means of a Gabor Wavelet Decomposition and a Support Vector Machine.” Paper presented at Proceedings of the International Society for Optics and Photonics, Remote Sensing, Maspalomas, Canary Islands, Spain, September 13: 19–29.
  • Binaghi, E., I. Gallo, and M. Pepe. 2003. “A Cognitive Pyramid for Contextual Classification of Remote Sensing Images.” IEEE Transactions on Geoscience and Remote Sensing 41 (12): 2906–2922.10.1109/TGRS.2003.815409
  • Breiman, L. I., J. H. Friedman, R. A. Olshen, and C. J. Stone. 1984. “Classification and Regression Trees (CART).” Biometrics 40 (3): 358.
  • Cheng, G., C. Ma, P. Zhou, X. Yao, and J. Han. 2016. “Scene Classification of High Resolution Remote Sensing Images using Convolutional Neural Networks.” Paper presented at Proceedings of the IEEE International Geoscience and Remote Sensing Symposium, Beijing, China, July 10–15: 767–770.
  • Felzenszwalb, P. F., R. B. Girshick, D. McAllester, and D. Ramanan. 2010. “Object Detection with Discriminatively Trained Part-based Models.” IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (9): 1627–1645.
  • Kiran, B. R., and J. Serra. 2013. “Ground Truth Energies for Hierarchies of Segmentations.” In Mathematical Morphology and its Applications to Signal and Image Processing, edited by C. L. Luengo Hendriks, G. Borgefors, and R. Strand, 123–134. Heidelberg, Berlin: Springer.10.1007/978-3-642-38294-9
  • Koch, T., P. D’Angelo, F. Kurz, F. Fraundorfer, P. Reinartz, and M. Körner. 2016. “The TUM-DLR Multimodal Earth Observation Evaluation Benchmark.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop on Geo-Spatial Computer Vision, Las Vegas, USA, June 26–July 1: 19–26.
  • Li, X., N. Hui, H. Shen, Y. Fu, and L. Zhang. 2015. “A Robust Mosaicking Procedure for High Spatial Resolution Remote Sensing Images.” ISPRS Journal of Photogrammetry and Remote Sensing 109: 108–125.10.1016/j.isprsjprs.2015.09.009
  • Li, S., H. Tang, S. He, Y. Shu, T. Mao, J. Li, and Z. Xu. 2015. “Unsupervised Detection of Earthquake-Triggered Roof-Holes From UAV Images Using Joint Color and Shape Features.” IEEE Geoscience and Remote Sensing Letters 12: 1823–1827.
  • Luo, B., J. F. Aujol, and Y. Gousseau. 2009. “Local Scale Measure from the Topographic Map and Application to Remote Sensing Images.” Siam Journal on Multiscale Modeling and Simulation 8 (1): 1–29.
  • Monasse, P., and F. Guichard. 2000. “Fast Computation of a Contrast-invariant Image Representation.” IEEE Transactions on Image Processing 9 (5): 860–872.10.1109/83.841532
  • Moranduzzo, T., and F. Melgani. 2014. “Detecting Cars in UAV Images with a Catalog-based Approach.” IEEE Transactions on Geoscience and Remote Sensing 52 (10): 6356–6367.10.1109/TGRS.2013.2296351
  • Salembier, P. 2015. “Study of Binary Partition Tree Pruning Techniques for Polarimetric SAR Images.” In Mathematical Morphology and Its Applications to Signal and Image Processing, edited by J. A. Benediktsson, J. Chanussot, L. Najman, and H. Talbot. 51–62. Heidelberg, Berlin: Springer.10.1007/978-3-319-18720-4
  • Salembier, P., and S. Foucher. 2016. “Optimum Graph Cuts for Pruning Binary Partition Trees of Polarimetric SAR Images.” IEEE Transactions on Geoscience and Remote Sensing 54 (9): 5493–5502.10.1109/TGRS.2016.2566581
  • Salembier, P., and L. Garrido. 2000. “Binary Partition Tree as an Efficient Representation for Image Processing, Segmentation, and Information Retrieval.” IEEE Transactions on Image Processing 9 (4): 561–576.10.1109/83.841934
  • Schönberger, J. L., F. Fraundorfer, and J. M. Frahm. 2014. “Structure-from-Motion for MAV Image Sequence Analysis with Photogrammetric Applications.” ISPRS – International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences XL-3 (3): 305–312.10.5194/isprsarchives-XL-3-305-2014
  • Turner, D., A. Lucieer, and W. Christopher. 2012. “An Automated Technique for Generating Georectified Mosaics from Ultra-High Resolution Unmanned Aerial Vehicle (UAV) Imagery, Based on Structure from Motion (SfM) Point Clouds.” Remote Sensing 4 (12): 1392–1410.10.3390/rs4051392
  • van de Weijer, J., C. Schmid, J. Verbeek, and D. Larlus. 2009. “Learning Color Names for Real-world Applications.” IEEE Transactions on Image Processing 18 (7): 1512–1523.10.1109/TIP.2009.2019809
  • Xia, G. S., J. Delon, and Y. Gousseau. 2010. “Shape-based Invariant Texture Indexing.” International Journal of Computer Vision 88 (3): 382–403.10.1007/s11263-009-0312-3
  • Xu, C., S. Whitt, and J. Corso. 2013. “Flattening Supervoxel Hierarchies by the Uniform Entropy Slice.” Paper presented at Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, December 1–8, 2240–2247.
  • Xu, Y., J. Ou, H. He, X. Zhang, and J. Mills. 2016. “Mosaicking of Unmanned Aerial Vehicle Imagery in the Absence of Camera Poses.” Remote Sensing 8 (3): 204.10.3390/rs8030204
  • Yang, B., and C. Chen. 2015. “Automatic Registration of UAV-Borne Sequent Images and LiDAR Data.” ISPRS Journal of Photogrammetry and Remote Sensing 101: 262–274.10.1016/j.isprsjprs.2014.12.025
  • Yang, Y., and S. Newsam. 2011. “Spatial Pyramid Co-occurrence for Image Classification.” Paper presented at Proceedings of the 13th International Conference on Computer Vision (ICCV), Barcelona, Spain, November 6–13: 1465–1472.
  • Yu, H., T. Yan, W. Yang, and H. Zheng. 2016. “An Integrative Object-based Image Analysis Workflow for UAV Images.” ISPRS − International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences XLI-B1: 1085–1091.
  • Yu, H., and W. Yang. 2016. “A Fast Feature Extraction and Matching Algorithm for Unmanned Aerial Vehicle Images.” Journal of Electronics and Information Technology 38 (3): 509–516.