
Pattern recognition of soldier uniforms with dilated convolutions and a modified encoder-decoder neural network architecture


ABSTRACT

In this paper, we study a deep learning (DL)-based multimodal technology for military, surveillance, and defense applications based on a pixel-by-pixel classification of a soldier image dataset. We explore the acquisition of images from a remote tactical-robot to a ground station, where the detection and tracking of soldiers can help the operator take actions or automate the tactical-robot on the battlefield. Soldier detection is achieved by training a convolutional neural network (CNN) to learn the patterns of the soldiers' uniforms. Our CNN learns from the initial dataset and from the actions taken by the operator, in contrast to hard-coded image processing algorithms with fixed rules. Our system attains an accuracy of over 81% in distinguishing the specific soldier uniform from the background. These experimental results support our hypothesis that dilated convolutions can increase segmentation performance when compared with patch-based and fully connected networks.

Introduction

Countless efforts from military and civil defense agencies in the last decades have focused on detecting a known target in a video image Haralick (Citation1979); Haralick, Shanmugam, and Dinstein (Citation1973). The results of this work have traditionally been based on image/video processing techniques. However, with the revolution of artificial intelligence in the last couple of years, classical processing techniques are becoming obsolete Geron (Citation2017); Goodfellow, Bengio, and Courville (Citation2016); Mitchell (Citation1997), in part because of their fixed, hard-coded nature and the near-absence of versatility and reusability of the code from one application to another. Algorithms for basic segmentation such as TextonForests Shotton, Johnson, and Cipolla (Citation2008) and Random Forests Shotton et al. (Citation2011) are limited by their low performance. Patch classification, where every pixel is classified individually using patches, is limited by the requirement of fixed-size images Ciresan, Alessandro Giusti, and Schmidhuber (Citation2012); Shelhamer, Long, and Darrell (Citation2017). Models based on convolutional neural networks (CNNs) have increased segmentation performance on popular segmentation datasets such as MSCOCO Lin et al. (Citation2014) and PASCAL VOC 2012 Everingham et al. (Citation2012). Fully convolutional architectures allow generating segmentations from images of any size. Pre-trained CNNs allow reusing the learned features for new tasks, enabling researchers to develop models faster, with less training data Pan and Yang (Citation2010); Morocho-Cayamcela, Eugenio, and Kwon (Citation2017); Shifat and Jang-Wook (Citation2020).

Even though some pattern recognition techniques have recently been exploited in the classification area, there is no record of using artificial intelligence (AI) techniques to accurately detect the uniforms of soldiers with an image semantic segmentation network. To solve this problem, we propose a segmentation network using two CNNs that together form a semantic pixel classifier. This technique has been shown to generalize to any scenario if the training data are well-defined Maggiori et al. (Citation2017a); Badrinarayanan, Kendall, and Cipolla (Citation2015); Morocho-Cayamcela and Lim (Citation2020); Morocho-Cayamcela, Eugenio, and Lim (Citation2020a). We test our segmentation network against different segmentation techniques from the literature and show that our design outperforms them for the soldier and background classes. Our system attains an accuracy of over 81% in distinguishing the specific soldier uniform from the background of the image.

Model architecture

We use an encoder-decoder structure to exploit the multi-scale features in the dataset and perform dense feature extraction Maggiori et al. (Citation2017b); Badrinarayanan, Kendall, and Cipolla (Citation2015). This is where the encoder-decoder architecture excels, as it compresses the input into a representation of all the relevant information. Our encoder-decoder segmentation network architecture is shown in Figure 1. The encoder stage uses a pre-trained CNN to downscale the images of the soldiers into a feature vector containing dense pixel-location information. The decoder is employed to expand the compressed feature vector back to a categorical matrix with the original input size Morocho-Cayamcela, Eugenio, and Lim (Citation2020a).

Figure 1. Our model architecture employs an encoder-decoder structure. The encoder applies dilated convolution at different scales to encode multi-scale contextual information. The decoder refines the segmentation along boundaries. Morocho-Cayamcela, Eugenio, and Lim (Citation2020a). © 2020 IEEE


The backbone of the encoder is based on the ResNet-101 architecture He et al. (Citation2016), a 101-layer CNN pre-trained on the ImageNet dataset Deng et al. (Citation2009) and built with five convolutional (Conv) modules, each possessing the same number of convolutional layers as the original ResNet-101. The first four convolutional blocks of ResNet-101 are reused, and the last block is adapted with parallel copies to apply dilated spatial pyramid pooling at different scales. Our model then concatenates the extracted features and sends the data to the decoder stage. The multi-scale contextual information encoded by the dilated convolutions makes our architecture robust to changes in the size of objects in the environment. The dilated convolution function is represented as

(1) $y[i] = \sum_{k=1}^{K} x[i + r \cdot k]\, w[k]$

for each position i on the output y, with filter w. The convolution operation is dilated over the input feature map x, where the dilation rate r indicates the step at which the input is sampled. The value of r controls the field of view of the convolution. This operation can be seen as convolving the input x with up-sampled filters, produced by inserting r − 1 zeros between consecutive filter values. Note that the standard convolution is a special case of the dilated convolution, with a value of r = 1. The filter's field of view is regulated by adjusting the value of r. As the sampling rate r increases, the number of weights applied to the effective feature area decreases. Figure 2 illustrates the dilated spatial pyramid pooling (DSPP) process of our system with four parallel functions (a 1×1 convolution, and 3×3 dilated convolutions with r values of 6, 12, and 18).
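As an illustration of (1), the following Python sketch (our addition, not part of the original paper) computes a one-dimensional dilated convolution for a given rate r; the filter taps and input values are arbitrary examples.

```python
import numpy as np

def dilated_conv1d(x, w, r):
    """Dilated 1-D convolution following Eq. (1):
    y[i] = sum_k x[i + r*k] * w[k].
    Positions that would read past the end of x are dropped (no padding)."""
    K = len(w)
    out_len = len(x) - r * (K - 1)
    y = np.empty(out_len)
    for i in range(out_len):
        y[i] = sum(x[i + r * k] * w[k] for k in range(K))
    return y

x = np.arange(10, dtype=float)      # toy input feature map
w = np.array([1.0, 0.0, -1.0])      # 3-tap filter
print(dilated_conv1d(x, w, r=1))    # standard convolution (r = 1)
print(dilated_conv1d(x, w, r=2))    # dilated: samples every 2nd input value
```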

Figure 2. A systematical dilation creates an exponential receptive field growth without losing resolution. The figure presents the dilated convolutions in the proposed architecture with a 3×3 kernel and rates r of 6, 12, and 18. Morocho-Cayamcela, Eugenio, and Lim (Citation2020a) © 2020 IEEE


The features generated in the previous step are then concatenated and passed through an additional convolution and batch normalization before the last 1×1 convolution. The decoder refines the feature responses by adding low-level features from the encoder. A fast bilinear interpolation with an upsampling factor of four is applied before generating the final categorical matrix.
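The following PyTorch sketch is our own illustration of how the parallel DSPP branches, the concatenation followed by a 1×1 convolution and batch normalization, and the four-factor bilinear upsampling could be wired together; it is not the authors' released code, and the channel sizes and feature-map dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedSpatialPyramidPooling(nn.Module):
    """Parallel 1x1 conv and 3x3 dilated convs (rates 6, 12, 18),
    concatenated and fused by a 1x1 conv + batch normalization."""
    def __init__(self, in_ch=2048, out_ch=256):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.branch2 = nn.Conv2d(in_ch, out_ch, 3, padding=6,  dilation=6,  bias=False)
        self.branch3 = nn.Conv2d(in_ch, out_ch, 3, padding=12, dilation=12, bias=False)
        self.branch4 = nn.Conv2d(in_ch, out_ch, 3, padding=18, dilation=18, bias=False)
        self.fuse = nn.Sequential(
            nn.Conv2d(4 * out_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        feats = [self.branch1(x), self.branch2(x), self.branch3(x), self.branch4(x)]
        return self.fuse(torch.cat(feats, dim=1))

# Decoder step: 1x1 conv to class scores, then 4x bilinear upsampling.
dspp = DilatedSpatialPyramidPooling()
classifier = nn.Conv2d(256, 2, kernel_size=1)          # 2 classes: soldier / background
encoder_out = torch.randn(1, 2048, 45, 45)             # assumed ResNet-101 feature map size
scores = classifier(dspp(encoder_out))
scores = F.interpolate(scores, scale_factor=4, mode="bilinear", align_corners=False)
print(scores.shape)                                    # torch.Size([1, 2, 180, 180])
```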

Materials and methods

To build the network, we first created a customized ground-truth database (with the classes "soldier" and "background"). Our system is trained with these ground-truth examples x along with their labels y, such that the CNN model can learn to classify new examples. The initial ground-truth database was then used to generate an image datastore and a pixel-label datastore. From the dataset statistics, 23% of the pixels belonged to the class "soldier," and the remaining 77% of the pixels belonged to the class "background." Ideally, all classes would have the same number of observations. We addressed this class imbalance by weighting the classes by their inverse frequency. A random split of 60% of the images for the training stage and 40% for testing/validation is employed for the analysis. Figure 3 shows a subset of soldier images used as ground truth in our segmentation network.
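The snippet below is a minimal sketch of inverse-frequency class weighting derived from the pixel statistics above (23% soldier, 77% background); the exact weighting scheme used for training may differ.

```python
import numpy as np

# Assumed pixel frequencies from the dataset statistics above.
pixel_freq = {"soldier": 0.23, "background": 0.77}

# Inverse-frequency weighting: rarer classes receive larger weights.
inv_freq = {c: 1.0 / f for c, f in pixel_freq.items()}
total = sum(inv_freq.values())
class_weights = {c: w / total for c, w in inv_freq.items()}  # normalized to sum to 1
print(class_weights)  # soldier ~ 0.77, background ~ 0.23
```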

Figure 3. A small subset of labeled (ground truth) images from our database used to train the artificial intelligence-based image semantic segmentation network


The segmentation network was built using VGG-16 He, Shaoqing, and Jian (Citation2018), a pre-trained CNN, in order to transfer the initially learned weights to our segmentation network. Data augmentation techniques such as random translation and reflection were added to make the network robust to variability in the input data. Figure 5 illustrates the proposed system.
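A possible implementation of the random translation and reflection augmentation is sketched below with torchvision; the translation range is an assumption, and for segmentation the same geometric transform must be applied to the image and its pixel-label mask.

```python
import random
import torchvision.transforms.functional as TF

def augment(image, mask, max_shift=10):
    """Apply the same random reflection and translation to an image
    and its pixel-label mask (required for semantic segmentation)."""
    if random.random() < 0.5:                      # random horizontal reflection
        image, mask = TF.hflip(image), TF.hflip(mask)
    dx = random.randint(-max_shift, max_shift)     # random translation in pixels (assumed range)
    dy = random.randint(-max_shift, max_shift)
    image = TF.affine(image, angle=0, translate=(dx, dy), scale=1.0, shear=0)
    mask  = TF.affine(mask,  angle=0, translate=(dx, dy), scale=1.0, shear=0)
    return image, mask
```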

The model is trained by measuring, at each iteration, how well each predicted pixel matches the corresponding ground-truth pixel. To measure the performance of our model, we employ the difference between the probability distribution of the ground truth and that of the output using pixel-wise cross-entropy. If the predicted probability differs from the ground truth, the loss increases. Our loss function depends on the model parameters, and the objective of our model is to find the parameter values that minimize the cost function. The training set consists of the pairs $(x^{(i)}, y^{(i)})$ for $i = 1, \ldots, m$. We find the weights $\theta = \{\theta^{(1)}, \theta^{(2)}, \theta^{(3)}, \ldots, \theta^{(n)}\}$ that minimize the cost function $J(\theta)$ as follows:

(2) $J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} y_k^{(i)} \log\left(\hat{p}_k^{(i)}\right)$

where $y_k^{(i)}$ is 1 if the target class for the ith training example is k, and 0 otherwise. The gradient vector of this loss function with respect to $\theta^{(k)}$ is

(3) $\nabla_{\theta^{(k)}} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{p}_k^{(i)} - y_k^{(i)}\right) x^{(i)}$

where $x^{(i)}$ contains the feature values of the ith image, and $y_k^{(i)}$ is the desired output for the ith image in class k. Our model uses an iterative steepest-descent process based on partial derivatives to minimize $J(\theta)$ Cauchy (Citation1976). To avoid the vanishing gradient problem, $\theta$ is initialized using Xavier's technique Glorot and Bengio (Citation2010). The decomposition of the cost function as a sum over the example images can be written as the negative conditional log-likelihood

(4) $J(\theta) = \frac{1}{m} \sum_{i=1}^{m} L\left(x^{(i)}, y^{(i)}, \theta\right)$

with L as the per-example loss, $L(x, y, \theta) = -\log p(y \mid x; \theta)$. For these additive loss functions, we estimate

(5) $\nabla_\theta J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta L\left(x^{(i)}, y^{(i)}, \theta\right)$

Extensive computational memory is required to compute (5) over the entire training set. To balance the system memory usage, our model samples a minibatch $B = \{x^{(1)}, \ldots, x^{(m')}\}$ of example images before each iteration. In addition, the minibatch size $m'$ is chosen so that m is a multiple of $m'$, which optimizes computation and memory Robbins and Monro (Citation1951). Using the soldier images from B, the algorithm performs the gradient-descent update as

(6) $g = \frac{1}{m'} \nabla_\theta \sum_{i=1}^{m'} L\left(x^{(i)}, y^{(i)}, \theta\right)$
(7) $\theta \leftarrow \theta - \varepsilon g$

where $\varepsilon$ is the learning rate.
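As a toy illustration of the cross-entropy in (2), its gradient in (3), and one minibatch update per (6) and (7), the following self-contained numpy sketch trains a linear softmax classifier; the shapes and values are arbitrary and do not represent the actual network.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)          # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, onehot):
    """Eq. (2): J = -(1/m) * sum_i sum_k y_k^(i) * log(p_k^(i))."""
    m = probs.shape[0]
    return -np.sum(onehot * np.log(probs + 1e-12)) / m

# Toy minibatch: m_prime examples, n features, K = 2 classes.
rng = np.random.default_rng(0)
m_prime, n, K = 32, 8, 2
X = rng.normal(size=(m_prime, n))
y = rng.integers(0, K, size=m_prime)
Y = np.eye(K)[y]                                  # one-hot targets
theta = rng.normal(scale=0.1, size=(n, K))        # weights of a linear classifier

probs = softmax(X @ theta)
print("loss:", cross_entropy(probs, Y))

grad = X.T @ (probs - Y) / m_prime                # Eqs. (3)/(6): (1/m') * sum (p - y) x
theta -= 0.1 * grad                               # Eq. (7): theta <- theta - eps * g
```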

The oscillations in (6) and (7) can prevent the algorithm from converging, or even cause it to diverge. To avoid these oscillations, the proposed model estimates an exponentially weighted average of past gradients and employs it to update θ. The algorithm uses a learning rate ε, an initial velocity v, and an initial set of parameters θ. The gradient is estimated using (8) at every epoch, v is computed with (9), and θ is updated using (10), as follows:

(8) $g \leftarrow \frac{1}{m'} \nabla_\theta \sum_{i} L\left(f(x^{(i)}; \theta), y^{(i)}\right)$
(9) $v \leftarrow \alpha v - \varepsilon g$
(10) $\theta \leftarrow \theta + v$

with $0 \leq \alpha \leq 1$ controlling the contribution of the previous step. Finally, our model uses maxout regularization and dropout (randomly setting features to zero) to prevent overfitting. A minimal numerical sketch of this momentum update is given below, followed by Algorithm 1, which summarizes the high-level learning process for our image segmentation network.
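The sketch is a generic optimizer implementing (8)-(10) on a toy quadratic objective; the learning rate and momentum values are assumptions.

```python
import numpy as np

def sgd_momentum(theta, grad_fn, eps=0.01, alpha=0.9, n_steps=200):
    """Generic SGD with momentum, Eqs. (8)-(10):
    v <- alpha*v - eps*g,  theta <- theta + v."""
    v = np.zeros_like(theta)
    for _ in range(n_steps):
        g = grad_fn(theta)        # Eq. (8): (minibatch) gradient estimate
        v = alpha * v - eps * g   # Eq. (9): exponentially weighted velocity
        theta = theta + v         # Eq. (10): parameter update
    return theta

# Toy usage: minimize ||theta - 3||^2, whose gradient is 2*(theta - 3).
print(sgd_momentum(np.zeros(2), lambda t: 2.0 * (t - 3.0)))   # converges toward [3., 3.]
```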

Algorithm 1 Parameter Learning and Optimization

Input: m, K, x, y, learning rate ε, momentum parameter α.

Output: Optimal hyperparameter values θ for segmentation.

Initialization:

1: Initialize v to zero.

2: Initialize θ ← Xavier's initialization.

Data acquisition

3: Get soldier images from online server.

Data pre-processing

4: for each soldier image do

5: Resize images to 720×720 pixels.

6: Image augmentation ← random rotation and translation.

7: end for

8: Compute class weighting using the inverse frequency.

Define the cross-entropy cost function.

9: $J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} y_k^{(i)} \log\left(\hat{p}_k^{(i)}\right)$

10: $\nabla_{\theta^{(k)}} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{p}_k^{(i)} - y_k^{(i)}\right) x^{(i)}$

Calculate the steepest descent using partial derivatives.

11: while stopping criterion not met do

12: Sample a minibatch B of $m'$ samples from the training set $\{x^{(1)}, \ldots, x^{(m)}\}$ with corresponding targets.

13: Compute the gradient estimate: $g \leftarrow \frac{1}{m'} \nabla_\theta \sum_{i} L\left(f(x^{(i)}; \theta), y^{(i)}\right)$.

14: Compute the velocity update: $v \leftarrow \alpha v - \varepsilon g$.

15: Apply update: $\theta \leftarrow \theta + v$.

16: end while

17: return θ
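To make Algorithm 1 concrete, the following PyTorch-style sketch shows how its steps could map onto a standard training loop with Xavier initialization, class-weighted pixel-wise cross-entropy, and stochastic gradient descent with momentum; the model, data loader, and hyperparameter values are placeholders rather than the authors' implementation.

```python
import torch
import torch.nn as nn

def train(model, loader, class_weights, eps=0.001, alpha=0.9, epochs=200):
    """Training loop mirroring Algorithm 1 (placeholder hyperparameters)."""
    # Step 2: Xavier initialization for the convolutional layers.
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.xavier_uniform_(m.weight)
    # Steps 8-9: class-weighted pixel-wise cross-entropy.
    criterion = nn.CrossEntropyLoss(weight=class_weights)
    # Steps 11-16: minibatch SGD with momentum.
    optimizer = torch.optim.SGD(model.parameters(), lr=eps, momentum=alpha)
    for _ in range(epochs):
        for images, labels in loader:             # step 12: sample a minibatch
            optimizer.zero_grad()
            loss = criterion(model(images), labels)   # step 9: J(theta)
            loss.backward()                       # step 13: gradient estimate
            optimizer.step()                      # steps 14-15: velocity and update
    return model
```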

Experimental results and simulation

After 200 epochs, the segmentation accuracy of the proposed model is compared against state-of-the-art segmentation techniques from related works. Table 1 shows the accuracy of the two classes under study for each of the image segmentation models. We show that by using transfer learning, combining two CNNs in an encoder-decoder architecture, and employing stochastic gradient descent with momentum as the parameter optimizer, the segmentation accuracy attains 81.49% and 82.64% for the soldier and background classes, respectively. Figure 4 shows a subset of images segmented with our proposal. The images on the left show the segmented pixels overlaid on the image from the test set, and the images on the right show the semantic segmentation network pixel labeling overlaid on the ground truth. The green and magenta regions represent the areas where the segmentation results diverge from the expected ground truth. The visual results confirm the numerical ones.
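For reference, one common way to compute per-class pixel accuracies such as those reported above is sketched below; this is our own illustration, and the evaluation toolbox used by the authors may compute the metric differently.

```python
import numpy as np

def per_class_accuracy(pred, truth, n_classes=2):
    """Per-class pixel accuracy: correctly labeled pixels of a class
    divided by all ground-truth pixels of that class."""
    acc = []
    for c in range(n_classes):
        mask = (truth == c)
        acc.append(float((pred[mask] == c).mean()) if mask.any() else float("nan"))
    return acc  # e.g. [soldier accuracy, background accuracy]
```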

Table 1. Segmentation accuracy obtained with different models*

Figure 4. Results from the test set of images. The images on the left show the labeled pixels overlaid on the original image. The images on the right show the labeled pixels overlaid on the ground truth. The green and magenta regions highlight areas where the segmentation results differ from the expected ground truth


Figure 5. Proposed system architecture. The image dataset and training of the deep learning algorithm are shown on the left side. On the right side, the ground control station takes an action Ai on the environment. The tactical-robot receives a new state Si+1 and a reward Ri+1 (can be positive or negative) based on some policy, and the goal is to find a policy that maximizes the cumulative reward over a finite number of iterations. The green and magenta regions in the resulting images highlight the areas where the segmentation results differ from the expected ground truth


* Convolutional encoder-decoder architecture, optimized with stochastic gradient descent with momentum.

Conclusions

The results obtained from our proposed segmentation network are very promising, with an accuracy of over 80%. The segmentation of the image is by far the most difficult part of a tracking system; with the blobs generated by the proposed segmentation network, we can easily find the center of the target and feed the information back to the camera-moving system to change its position and center the target. The proposed segmentation network can also be re-trained on a different dataset to detect other targets, such as enemy artillery, or friendly soldiers for rescue purposes. The benefit of our proposed technique is that the CNN can find patterns that most image processing algorithms cannot and that are impossible to recognize by the human eye. This technology helps de-camouflage targets through exploratory image analysis. The methodology presented in this paper is not intended to replace any triggering mechanism of the tactical robot, but to help operators make better decisions on the battlefield.

Disclosure Statement

The authors declare that they have no competing interests.

Additional information

Funding

This work was supported by the Ministry of SMEs and Startups, South Korea (S2829065 and S3010704), and by the National Research Foundation of Korea (2020R1A4A101777511).

References
