Research Article

Dynamic activation and enhanced image contour features for object detection

Article: 2155614 | Received 11 Oct 2022, Accepted 01 Dec 2022, Published online: 19 Dec 2022

Abstract

Object detection technology is a popular research direction that is widely used in areas such as autonomous driving and medical diagnosis. At this stage, mobile devices often have limited storage resources for deploying large object detection networks while still needing to meet real-time requirements. This paper proposes a lightweight and efficient object detection model based on YOLOv4. First, the lightweight network GhostNet is used to extract image features and reduce the number of parameters and the computation of the backbone structure; then the AFmodule and the Meta-ACON activation function are combined to enhance the feature extraction capability of the backbone network, which strengthens the model's ability to capture spatial feature information in images; this paper also designs the RL-PAFPN feature fusion structure with the Reslayer module to further improve the model's ability to extract and fuse image features. Compared with other mainstream object detection models, the YOLOv4-Ghost-AMR network proposed in this paper requires less computation and fewer parameters, and its accuracy reaches 86.83%, making it suitable for deployment on mobile devices with limited storage. The proposed model can be applied to medical, traffic and fault detection fields, replacing traditional manual detection, saving manpower and time costs, and achieving high-precision real-time object detection.

1. Introduction

With the development of deep learning technology, object detection based on convolutional neural networks is widely used in the fields of autonomous driving, medical diagnosis and UAVs, with the aim of achieving precise target localisation and classification. Current deep learning-based object detection algorithms fall mainly into two categories: regression-based one-stage algorithms and two-stage algorithms based on the Region Proposal strategy (Hu & Zhai, Citation2019). Classical one-stage object detection algorithms include SSD, YOLO, YOLOv2, YOLOv3, YOLOv4, YOLOv5, YOLOv6 and YOLOv7 (Liu et al., Citation2016; Redmon et al., Citation2016; Redmon & Farhadi, Citation2017, Citation2018; Bochkovskiy et al., Citation2020; Jocher et al., Citation2020; Li et al., Citation2022; Wang, Bochkovskiy, et al., Citation2022). They are trained in an end-to-end manner: a number of candidate boxes are generated directly over the whole image, and category, box size and position are then regressed on these candidates to produce the predicted boxes. One-stage object detection algorithms are therefore fast and can meet real-time task requirements. Commonly used two-stage object detection algorithms include Faster R-CNN, Mask R-CNN and SPP-Net (Gavrilescu et al., Citation2018; He et al., Citation2015, Citation2017). These first generate candidate regions using selective search or similar methods, then extract feature maps at different levels from the candidate boxes with neural networks, and finally classify and localise the targets. Such methods are usually more accurate than one-stage algorithms, but their detection speed is slower, making them less suitable for deployment on mobile devices that require high real-time performance.

At present, mobile devices such as mobile phones, UAVs and driverless cars often have limited storage resources, which cannot deploy large object detection networks (Yu & Zhang, Citation2021). To enable the model to be deployed on mobile devices and perform real-time object detection tasks, this paper proposes a lightweight and efficient object detection model YOLOv4-Ghost-AMR. The innovation points are as follows:

  1. Lightweight improvement of the backbone network. The lightweight network GhostNet is used as the backbone because its Ghost Module splits the convolutional operation of the original backbone into two steps: roughly half of the feature maps are produced by traditional convolution, and a cheap operation then generates the remaining feature maps. Combined with depthwise separable convolution, the proposed model reduces the number of parameters and the computational effort of the backbone network.

  2. Design of the AFmodule structure, which expands the number of channels of the input feature map by four times and improves the detection speed of the model. Section 3 introduces the AFmodule into the GhostNet-based backbone, enabling it to extract richer image edge features and deeper features than YOLOv4.

  3. Combination of the Meta-ACON activation function in the backbone network, which allows the network to autonomously learn different parameters and fit different activation functions, achieving dynamic activation during training.

  4. Design of the Reslayer module, which builds a deep feature fusion structure with residual convolution. This avoids gradient disappearance during network training and improves the feature fusion effect of the model while preserving the integrity of image feature information.

This paper proposes the YOLOv4-Ghost-AMR model, where “A” refers to the feature extraction module AFmodule, “M” to the dynamic activation function Meta-ACON, and “R” to the deep residual convolution module Reslayer. Experimental results on the VOC2007 + 2012 dataset show that YOLOv4-Ghost-AMR is more lightweight, more accurate and faster, making it well suited to object detection tasks on current mobile devices; it is further applied to a Mycobacterium tuberculosis dataset, demonstrating the generalisability and application scalability of the method in this paper.

2. Related work

To meet the demand for high accuracy in object detection, methods proposed in recent years mainly increase the depth of the network, but this significantly increases the computational volume of the network, the number of parameters and the training time, which is unsuitable for real-time object detection tasks. To solve these problems, existing object detection models need lightweight improvements. The main methods for model lightweighting include: (1) improving the backbone feature extraction network by combining lightweight convolutional neural networks (Lan et al., Citation2020); (2) optimising the structure of traditional convolutional operations with group convolution and depthwise separable convolution to further reduce the number of parameters and the computational effort of the model; (3) optimising the network structure with bottleneck structures, small convolutional kernels, etc. Among these, combining lightweight convolutional neural networks is an effective way to significantly reduce the computational cost of a model with minimal impact on its detection performance.

Commonly used lightweight networks include SqueezeNet, MobileNets, ShuffleNet and GhostNet. SqueezeNet uses the Fire Module to replace 3 × 3 convolutions with 1 × 1 convolutions, compressing the number of convolutional channels and drastically reducing the number of parameters and the computation of the network, but the accuracy of the model is reduced (Iandola et al., Citation2016). MobileNetv1 uses depthwise separable convolution to reduce the computation and number of parameters of the model, which significantly improves detection speed but also causes a large loss of feature map information (Howard et al., Citation2017). In response to the problems of MobileNetv1, MobileNetv2 introduces the Linear Bottleneck and Inverted Residual structures to reduce the damage of the ReLU activation function to non-linear features and the information loss caused by the residual process (Sandler et al., Citation2018). MobileNetv3 introduces the SE module into the network, combining the inverted residual structure with the H-swish activation function to establish dependencies between feature channels and selectively extract feature information, which reduces the computational effort of the model (Howard et al., Citation2019). ShuffleNet (Zhang et al., Citation2018) uses pointwise group convolution and a channel shuffle operation to reduce the number of parameters while maintaining accuracy as far as possible. GhostNet proposes the Ghost Module, which divides the traditional convolution operation into two parts: it first acquires the true feature layers of the image by ordinary convolution, then performs a cheap operation on them to acquire the remaining feature maps with a small number of parameters. Studies have shown that GhostNet performs better than previous lightweight networks (Han et al., Citation2020).
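Since depthwise separable convolution recurs throughout these lightweight designs and in the improvements below, a minimal PyTorch sketch may be helpful. The module below is only an illustration of the technique; the class name, layer arrangement and hyperparameters are our assumptions, not taken from any of the cited papers.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Factorises a standard k x k convolution into a per-channel depthwise
    convolution followed by a 1 x 1 pointwise convolution, as popularised
    by MobileNetv1."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3, stride: int = 1):
        super().__init__()
        # groups=in_ch makes each filter see exactly one input channel
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, stride,
                                   padding=k // 2, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```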

Lightweight object detection networks have been applied to unmanned vehicles, medical detection, aerial drones and other scenarios. Li and Li (Citation2022) used YOLOv4 as the base framework, replaced the CSPDarknet53 feature extraction network with the lightweight MobileNetv3, and introduced depthwise separable convolution and coordinate attention mechanisms to reduce the number of model parameters, at the cost of a decrease in model accuracy. Yan et al. (Citation2022) proposed an improved YOLOv5 model that adds the scSE module to the CSPDarknet and CSP modules, shortening detection time and improving detection accuracy, and can effectively identify gangue targets in the dataset. Wang, Hua, et al. (Citation2022) proposed the Light-YOLOv4 smoke detection model, which uses MobileNetv3 as the backbone network and introduces depthwise separable convolution with the BiFPN bi-directional pyramid structure to fuse image features. Wang et al. (Citation2021) proposed the Scaled-YOLOv4 model, which devised a simple and effective strategy for adjusting model size and used an FPN as the feature extraction network to reduce the computational cost and the number of model parameters, though the accuracy of the model dropped significantly. Zhu et al. (Citation2022) proposed the SRDD network, combining the lightweight backbone ResT with dynamic anchor boxes to improve accuracy while reducing the number of model parameters.

These studies show that introducing lightweight networks into an object detection model can reduce computation and parameter counts and improve detection speed, but the lightweight methods of Li and Li (Citation2022), Wang, Hua, et al. (Citation2022) and Wang et al. (Citation2021) cost model accuracy, while Zhu et al. (Citation2022) recovered accuracy only by introducing additional modules alongside the lightweight improvements. To ensure accuracy while lightening the model, this paper first designs a lightweight backbone network and then proposes the AFmodule structure so that the backbone can better extract the edge features of the image, improving training speed and greatly reducing the model parameters. On the other hand, this paper uses Meta-ACON to fit different activation functions and realise dynamic activation of neurons. For the feature fusion module, this paper proposes the Reslayer deep residual convolution structure and designs a new RL-PAFPN pyramid feature fusion structure, which improves the feature fusion ability of the model and further improves its accuracy.

3. Proposed model

To address the fact that the models in the related work above only reduce the number of parameters, this paper proposes YOLOv4-Ghost-AMR, a new model that balances lightness and accuracy, reduces the cost of model training, and is capable of accomplishing highly accurate real-time object detection tasks.

3.1. YOLOv4-Ghost-AMR’s network structure

This paper proposes the lightweight object detection model YOLOv4-Ghost-AMR. First, depthwise separable convolution and the lightweight backbone network GhostNet are used to reduce the number of parameters and computation. Then the AFmodule and the Meta-ACON dynamic activation function are combined to improve the model's ability to extract image edge features. Finally, the RL-PAFPN structure with the Reslayer module is designed, which improves the network's feature extraction and feature fusion capabilities.

The overall structure of the YOLOv4-Ghost-AMR object detection network is shown in Figure 1. During training and testing, the input image first passes through the AFmodule, which halves the width and height of the image and expands the number of channels by four times. The feature layers then pass through several Ghost Modules, in which the Meta-ACON dynamic activation function processes the image feature layers, yielding output feature layers of shape (80, 80, 40), (40, 40, 112) and (20, 20, 160). The feature layers of shape (80, 80, 40) and (40, 40, 112) are adjusted to (80, 80, 256) and (40, 40, 512) respectively by 1 × 1 regular convolutions before being passed into the RL-PAFPN structure; the feature layer of shape (20, 20, 160) is passed into the SPP structure and, after pooling and three convolution operations, also enters the RL-PAFPN structure. Two rounds of upsampling, downsampling and feature fusion are performed within RL-PAFPN, and finally the three output feature layers of RL-PAFPN are passed into the Head module for prediction. GhostNet is described in more detail in Section 3.2.

Figure 1. The overall network architecture design.


3.2. GhostNet

The original backbone network of YOLOv4 contains 53 convolutional layers and uses conventional convolution to extract image features, which generates redundant feature information and increases the number of parameters and the computation. Using a lightweight convolutional neural network as the backbone is an effective way to reduce both; for instance, Yang et al. (Citation2022) use MobileNetV3-Large to improve an LSTM network model for efficient intelligent waste classification. This paper uses the lightweight network GhostNet as the backbone, combining the Ghost Module and Ghost Bottleneck to process image information, which significantly reduces the computation and parameters of the model while maintaining high accuracy and extracting image features efficiently.

The Ghost Module works in two steps, as shown in Figure 2. The Primary Conv uses traditional convolution, normalisation and an activation function to obtain a portion of the true feature layers of the input image, while the Cheap operation uses depthwise separable convolution to linearly process the feature map, obtaining the remaining feature layers with little computation and few parameters; finally, the outputs of the two steps are concatenated to form the output feature map of the Ghost Module.

Figure 2. Ghost module’s structure.

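A minimal PyTorch sketch of this two-step process is given below, modelled on the public GhostNet implementation. The parameter names (such as `rate`) follow the notation used in the cost analysis below, but the exact kernel sizes and layer settings are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Primary Conv produces the 'true' feature layers; a cheap depthwise
    convolution then generates the remaining 'ghost' layers, and the two
    are concatenated along the channel dimension."""
    def __init__(self, in_ch, out_ch, k=1, dw_k=3, rate=2, stride=1, act=True):
        super().__init__()
        init_ch = math.ceil(out_ch / rate)   # channels from the primary conv
        cheap_ch = init_ch * (rate - 1)      # channels from the cheap operation
        self.out_ch = out_ch
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, k, stride, k // 2, bias=False),
            nn.BatchNorm2d(init_ch),
            nn.ReLU(inplace=True) if act else nn.Identity(),
        )
        self.cheap = nn.Sequential(          # depthwise: one filter per channel
            nn.Conv2d(init_ch, cheap_ch, dw_k, 1, dw_k // 2,
                      groups=init_ch, bias=False),
            nn.BatchNorm2d(cheap_ch),
            nn.ReLU(inplace=True) if act else nn.Identity(),
        )

    def forward(self, x):
        primary = self.primary(x)
        ghost = self.cheap(primary)
        # concatenate true and ghost features, trim to the requested width
        return torch.cat([primary, ghost], dim=1)[:, :self.out_ch]
```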

Compared to the traditional convolution process, the Ghost Module requires fewer operations. Assume the input feature map has size $H_1 \times W_1$ with $C_1$ channels, the output feature map has size $H_2 \times W_2$ with $C_2 \times rate$ channels, and the convolution kernel has size $k \times k$. The computational cost of traditional convolution is:

(1) $(C_2 \times rate) \times H_2 \times W_2 \times C_1 \times k \times k$

Assume the Primary Conv in the Ghost Module produces $C_2$ feature layers and each feature layer is mapped to produce $(rate - 1)$ new feature layers, so the Cheap operation obtains $C_2 \times (rate - 1)$ feature layers. With the convolution kernel size unchanged, the computational cost of the Primary Conv in the Ghost Module is:

(2) $C_2 \times H_2 \times W_2 \times C_1 \times k \times k$

The computational cost of the Cheap operation in the Ghost Module is:

(3) $C_2 \times (rate - 1) \times H_2 \times W_2 \times k \times k$

The total computational cost of the Ghost Module is:

(4) $C_2 \times H_2 \times W_2 \times C_1 \times k \times k + C_2 \times (rate - 1) \times H_2 \times W_2 \times k \times k$

The ratio of the computational cost of traditional convolution to that of the Ghost Module is then:

(5) $\dfrac{(C_2 \times rate) \times H_2 \times W_2 \times C_1 \times k \times k}{C_2 \times H_2 \times W_2 \times C_1 \times k \times k + C_2 \times (rate - 1) \times H_2 \times W_2 \times k \times k} = \dfrac{C_1 \times rate}{C_1 + rate - 1}$

Since $C_1 \gg rate$, $\dfrac{C_1 \times rate}{C_1 + rate - 1} \approx rate$, so the Ghost Module requires fewer operations than traditional convolution, which is more conducive to fast extraction of image features.
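As a quick numeric sanity check of ratio (5), the snippet below plugs assumed layer sizes into Eqs. (1)–(4); the chosen numbers are illustrative only.

```python
# Rough per-layer FLOPs following Eqs. (1)-(5).
# Assumed sizes: H2 = W2 = 40, C1 = C2 = 128, k = 3, rate = 2.
H2, W2, C1, C2, k, rate = 40, 40, 128, 128, 3, 2

traditional = (C2 * rate) * H2 * W2 * C1 * k * k    # Eq. (1)
primary     = C2 * H2 * W2 * C1 * k * k             # Eq. (2)
cheap       = C2 * (rate - 1) * H2 * W2 * k * k     # Eq. (3)
ghost_total = primary + cheap                       # Eq. (4)

# Eq. (5): the ratio is close to 'rate' because C1 >> rate
print(traditional / ghost_total)        # ~1.98
print(C1 * rate / (C1 + rate - 1))      # same value from the closed form
```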

In GhostNet, each Ghost Bottleneck contains two Ghost Modules, which adjust the number of channels of the feature map and reduce computation. Batch normalisation and an activation function are applied after the first Ghost Module to acquire the non-linear features of the image, while only batch normalisation is applied after the second Ghost Module to avoid the problem of gradient disappearance.
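Reusing the `GhostModule` sketch above, a stride-1 Ghost Bottleneck can be sketched as follows. The intermediate width `mid_ch` and the shortcut form are our assumptions; GhostNet also has stride-2 variants with a depthwise convolution between the two modules, omitted here for brevity.

```python
class GhostBottleneck(nn.Module):
    """Stride-1 Ghost Bottleneck sketch: the first Ghost Module applies
    BN plus an activation, the second applies BN only (act=False), and a
    shortcut is added around the pair, mirroring the description above."""
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        self.ghost1 = GhostModule(in_ch, mid_ch, act=True)
        self.ghost2 = GhostModule(mid_ch, out_ch, act=False)
        self.shortcut = (nn.Identity() if in_ch == out_ch
                         else nn.Sequential(
                             nn.Conv2d(in_ch, out_ch, 1, bias=False),
                             nn.BatchNorm2d(out_ch)))

    def forward(self, x):
        return self.ghost2(self.ghost1(x)) + self.shortcut(x)
```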

3.3. AFmodule

The first layer of the original YOLOv4 backbone consists of a traditional convolution with a large number of parameters, while the first layer of GhostNet has only a 3 × 3 convolution, which cannot extract the complete edge information of the feature map. This paper therefore proposes the AFmodule to improve the first convolution layer of the backbone, as shown in Figure 3. First, the input image FM is partitioned into n feature layers $(FM_1, FM_2, \ldots, FM_n)$. A slicing operation is then applied to each feature layer to take pixel values at interval 1, so that the number of channels in each feature map layer is expanded to four times the original, converting the spatial (width and height) information of the feature map into channel information. The final step combines the sliced feature maps, which extracts more complete feature map edge information with fewer parameters.

Figure 3. AFmodule’s structure.


The AFmodule simplifies feature extraction in the first layer of the backbone network, with fewer parameters and less computation than conventional convolution.

Assume the input feature map has height $H$, width $W$ and $C$ channels, and the final output feature map has height $H'$, width $W'$ and $C'$ channels. In the traditional two-convolution scheme, the first convolution has a $k \times k$ kernel and outputs a feature map of height $H'$, width $W'$ and $\frac{C'}{2}$ channels, and the second convolution has a $k \times k$ kernel and outputs a feature map of height $H'$, width $W'$ and $C'$ channels.

Without considering the bias, the computational cost of the traditional convolutions is:

$Flops_{con} = k \times k \times C \times H' \times W' \times \frac{C'}{2} + k \times k \times \frac{C'}{2} \times H' \times W' \times C'$

and the parameter count is:

$Params_{con} = k \times k \times C \times \frac{C'}{2} + k \times k \times \frac{C'}{2} \times C'$

The computational cost of the AFmodule is:

$Flops_{AFmodule} = k \times k \times 4C \times H' \times W' \times C'$

and the parameter count is:

$Params_{AFmodule} = k \times k \times 4C \times C'$

The ratio of the computational cost (and, identically, the parameter count) of the traditional convolutions to that of the AFmodule is then:

$K = \dfrac{k \times k \times C \times H' \times W' \times \frac{C'}{2} + k \times k \times \frac{C'}{2} \times H' \times W' \times C'}{k \times k \times 4C \times H' \times W' \times C'} = \dfrac{k \times k \times C \times \frac{C'}{2} + k \times k \times \frac{C'}{2} \times C'}{k \times k \times 4C \times C'} = \dfrac{C + C'}{8C}$

Since $C' \gg C$ in the first layer, $K > 1$, so the AFmodule can effectively reduce the computation and parameter count of the model.
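A quick numeric check of this ratio, under our illustrative assumption that the first layer maps an RGB input ($C = 3$) to $C' = 64$ channels:

```python
# Illustrative check of K = (C + C') / (8C) with assumed sizes:
# C = 3 (RGB input), C' = 64 output channels, k = 3, H' = W' = 320.
C, C_out, k, H, W = 3, 64, 3, 320, 320

flops_trad = (k * k * C * H * W * (C_out // 2)          # first conv: C -> C'/2
              + k * k * (C_out // 2) * H * W * C_out)   # second conv: C'/2 -> C'
flops_af   = k * k * (4 * C) * H * W * C_out            # one conv after 4x slicing

print(flops_trad / flops_af)       # ~2.79: traditional costs ~2.8x more
print((C + C_out) / (8 * C))       # same value from the closed form
```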

As shown in Figure 4, this paper places the AFmodule as the first layer of the model backbone. A slicing operation first converts the width and height information of the feature map into channel information, expanding the number of input channels by four times while retaining the feature map information; the image features are then further extracted using batch normalisation and the SiLU activation. This improves the network's extraction of image edge and non-linear features and improves the accuracy of the model while reducing the number of parameters and the computation of the convolution layer.
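A minimal PyTorch sketch of this slice-then-convolve layer is given below; it is similar in spirit to YOLOv5's Focus layer. The channel counts and kernel size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AFmodule(nn.Module):
    """Sketch of the AFmodule: pixels are sampled at interval 1 into four
    sub-maps, halving H and W while growing channels 4x, after which a
    convolution with BN and SiLU extracts edge features."""
    def __init__(self, in_ch=3, out_ch=16, k=3):
        super().__init__()
        self.conv = nn.Conv2d(4 * in_ch, out_ch, k, 1, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU(inplace=True)

    def forward(self, x):                          # x: (B, C, H, W)
        sliced = torch.cat([x[..., ::2, ::2],      # even rows, even cols
                            x[..., 1::2, ::2],     # odd rows, even cols
                            x[..., ::2, 1::2],     # even rows, odd cols
                            x[..., 1::2, 1::2]],   # odd rows, odd cols
                           dim=1)                  # (B, 4C, H/2, W/2)
        return self.act(self.bn(self.conv(sliced)))
```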

In this paper, comparative experiments were conducted on the YOLO series models to verify the effect of AFmodule.

Figure 4. The backbone network structure after the introduction of AFmodule.


The configuration environment used for model training and testing is an NVIDIA GeForce GTX1070 8G with PyTorch 1.6 and Python 3.8.2. The experimental results are shown in Table 1. After adding the AFmodule to YOLOv3, the FPS of the model roughly doubled and the mAP increased by 0.69%, improving both accuracy and detection speed. With YOLOv4-tiny, using the AFmodule reduces accuracy but raises FPS by 26.88. This is probably because YOLOv4-tiny is already a lightweight model: the first convolutional layer of its backbone has a large number of parameters while the subsequent convolutional layers have few, making it less suited to the AFmodule; nevertheless, in terms of detection speed, introducing the AFmodule in YOLOv4-tiny still significantly improves the FPS. For YOLOv4, the FPS of the backbone roughly doubled and the mAP improved by 0.7% after using the AFmodule. These experiments show that the AFmodule can effectively improve the FPS and accuracy of a model, enabling it to complete object detection tasks quickly and efficiently.

Table 1. Performance comparison of the introduction of AFmodule in each model.

3.4. Meta-ACON

The ReLU activation tends to deactivate some neurons in the network (Dubey et al., Citation2022), which is why variants such as LeakyReLU and parametric ReLU, and activations such as Swish and Mish, were developed. In this paper, Meta-ACON (Ma et al., Citation2021) is used as the activation function in the Ghost Modules of the backbone network; it controls whether the neurons of the network are activated through dynamic hyperparameters, improving the stability of the deep network during training.

The paradigm of the Meta-ACON family of activation functions is obtained by approximating the smooth maximum function. For $\max(x_1, x_2, \ldots, x_n)$, the smooth approximation can be expressed as:

(6) $S_\beta(x_1, x_2, \ldots, x_n) = \dfrac{\sum_{i=1}^{n} x_i e^{\beta x_i}}{\sum_{i=1}^{n} e^{\beta x_i}}$

where $\beta$ is the smoothing factor: when $\beta \to \infty$, $S_\beta \to \max$ (non-linear); when $\beta \to 0$, $S_\beta \to \text{mean}$ (linear). When $n = 2$, let $x_1 = \alpha_m(x)$ and $x_2 = \alpha_n(x)$, and let $\sigma$ denote the general form of the Sigmoid activation function. The paradigm of the Meta-ACON family of activation functions can then be expressed as:

(7) $\begin{aligned} S_\beta(\alpha_m(x), \alpha_n(x)) &= \alpha_m(x) \cdot \frac{e^{\beta \alpha_m(x)}}{e^{\beta \alpha_m(x)} + e^{\beta \alpha_n(x)}} + \alpha_n(x) \cdot \frac{e^{\beta \alpha_n(x)}}{e^{\beta \alpha_m(x)} + e^{\beta \alpha_n(x)}} \\ &= \alpha_m(x) \cdot \frac{1}{1 + e^{-\beta(\alpha_m(x) - \alpha_n(x))}} + \alpha_n(x) \cdot \frac{1}{1 + e^{-\beta(\alpha_n(x) - \alpha_m(x))}} \\ &= \alpha_m(x) \cdot \sigma[\beta(\alpha_m(x) - \alpha_n(x))] + \alpha_n(x) \cdot \sigma[\beta(\alpha_n(x) - \alpha_m(x))] \\ &= (\alpha_m(x) - \alpha_n(x)) \cdot \sigma[\beta(\alpha_m(x) - \alpha_n(x))] + \alpha_n(x) \end{aligned}$

The Meta-ACON family of activation functions can thus approximate a variety of other activation functions.

Assuming $\alpha_m(x) = p_1 x$ and $\alpha_n(x) = p_2 x$, the Meta-ACON activation function $(p_1 - p_2)x \cdot \sigma[\beta(p_1 - p_2)x] + p_2 x$ is obtained, where $p_1$ and $p_2$ are parameters controlling the upper and lower bounds, and $\beta$ is a conversion factor that controls whether the network layer activates its neurons. The function $G(x)$ adjusts the value of $\beta$ such that $\beta = G(x) = \sigma\!\left(\sum_{c=1}^{C} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{c,h,w}\right)$. When $\beta \to \infty$, Meta-ACON $\to \max(p_1 x, p_2 x)$; when $\beta \to 0$, Meta-ACON $\to \text{mean}(p_1 x, p_2 x)$.
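A PyTorch sketch of this activation is given below. It follows the reference implementation of Ma et al. (2021), in which $G(x)$ squeezes the spatial dimensions and passes the result through two learnable 1 × 1 convolutions before the sigmoid; the reduction ratio `r` is an assumption, and the spatial mean differs from the sum in $G(x)$ above only by a constant factor.

```python
import torch
import torch.nn as nn

class MetaACON(nn.Module):
    """Sketch of Meta-ACON:
    f(x) = (p1 - p2)*x * sigmoid(beta*(p1 - p2)*x) + p2*x,
    with beta generated per channel from the input by G(x), so each unit
    can swing between linear (beta -> 0) and non-linear behaviour."""
    def __init__(self, channels, r=16):
        super().__init__()
        hidden = max(channels // r, 1)
        self.p1 = nn.Parameter(torch.randn(1, channels, 1, 1))
        self.p2 = nn.Parameter(torch.randn(1, channels, 1, 1))
        self.fc1 = nn.Conv2d(channels, hidden, 1)
        self.fc2 = nn.Conv2d(hidden, channels, 1)

    def forward(self, x):                             # x: (B, C, H, W)
        pooled = x.mean(dim=(2, 3), keepdim=True)     # squeeze H and W
        beta = torch.sigmoid(self.fc2(self.fc1(pooled)))
        dpx = (self.p1 - self.p2) * x
        return dpx * torch.sigmoid(beta * dpx) + self.p2 * x
```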

It can be seen that with the dynamic change of $\beta$, Meta-ACON can smoothly approximate various activation functions. As shown in Figure 5, Meta-ACON can control whether an activation function is used and adjust its parameters to obtain different activation functions. During training, the conversion factor $\beta$ is continuously optimised and the activation function is learnt autonomously in the convolution layers, which enhances the model's capture of the spatial feature information of the image and improves its detection performance.

Figure 5. Activation method of Meta-ACON.


After using GhostNet to reduce the number of model parameters, the backbone network is further improved by combining the AFmodule and Meta-ACON; the final backbone structure is shown in Table 2, where G-Bneck stands for Ghost Bottleneck. The input image first passes through the AFmodule, which folds the width and height information of the image feature layer into channel information, extracting the edge features of the image with lower computation and fewer parameters. Several Ghost Bottlenecks then extract feature layer information and adjust the number of channels of each feature layer. Within each Ghost Bottleneck, the Meta-ACON activation function extracts the non-linear features of the image, with the conversion factor optimised adaptively to switch between linear and non-linear activation. Avgpool is used to obtain the global contextual feature map information and reduce the network parameters; finally, a 1 × 1 convolution maps the feature map into a feature vector, and a fully connected layer gathers the information of all images for classification.

Table 2. Improved GhostNet’s structure.

3.5. Reslayer

The design of the feature fusion module in an object detection model is crucial; for instance, Yang et al. (Citation2022) proposed a semantic segmentation method based on hierarchical feature fusion that can be applied to the recognition of remote sensing images. The feature fusion module of the original YOLOv4 contains two five-fold convolution blocks, which tend to cause gradient divergence as network depth increases and hinder model training. To solve this problem, this paper proposes the Reslayer, which uses a residual structure to improve the feature fusion module; its structure is shown in Figure 6. The convolutional processing of the input feature map is divided into two branches: the right branch applies only convolution, while the left branch is first processed by convolution and then passed into an n-layer residual network block, which enhances the feature extraction ability of the network without causing gradient divergence. The feature map generated by the n-layer residual block is then concatenated with the feature information of the right branch to expand the channels of the feature map, and finally a convolution layer extracts the image features.

Figure 6. Reslayer’s structure.

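A PyTorch sketch of the two-branch layout just described is given below. The branch widths, the body of the residual block and the default value of n are our illustrative assumptions.

```python
import torch
import torch.nn as nn

def conv_bn_silu(in_ch, out_ch, k=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, 1, k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.SiLU(inplace=True),
    )

class ResBlock(nn.Module):
    """Plain residual unit used inside the deep (left) branch."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(conv_bn_silu(ch, ch, 1),
                                  conv_bn_silu(ch, ch, 3))

    def forward(self, x):
        return x + self.body(x)   # identity shortcut keeps gradients flowing

class Reslayer(nn.Module):
    """Two branches: the right applies one convolution; the left applies a
    convolution followed by n residual blocks. The outputs are concatenated
    to expand the channels and fused by a final convolution."""
    def __init__(self, in_ch, out_ch, n=2):
        super().__init__()
        mid = out_ch // 2
        self.right = conv_bn_silu(in_ch, mid, 1)
        self.left = nn.Sequential(conv_bn_silu(in_ch, mid, 1),
                                  *[ResBlock(mid) for _ in range(n)])
        self.fuse = conv_bn_silu(2 * mid, out_ch, 1)

    def forward(self, x):
        return self.fuse(torch.cat([self.left(x), self.right(x)], dim=1))
```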

The Reslayer can alleviate the problem of gradient disappearance in deep networks, which can be shown as follows. Let $X_i$ denote the output of the network's $i$-th layer, let $F(\cdot)$ denote the residual function, and let $L(\cdot)$ denote the ReLU activation; since the input and output of each residual block are positive, $L$ acts as the identity on them, and it follows that:

$X_{i+1} = L(X_i + F(X_i, W_i)) = X_i + F(X_i, W_i)$

$X_{i+2} = L(X_{i+1} + F(X_{i+1}, W_{i+1})) = X_{i+1} + F(X_{i+1}, W_{i+1}) = X_i + F(X_i, W_i) + F(X_{i+1}, W_{i+1})$

$\vdots$

$X_m = X_i + \sum_{n=i}^{m-1} F(X_n, W_n)$

The gradient update value of a deep layer of the network can then be expressed as:

$D_i = \dfrac{\partial Loss}{\partial X_i} = \dfrac{\partial Loss}{\partial X_m} \cdot \dfrac{\partial X_m}{\partial X_i} = \dfrac{\partial Loss}{\partial X_m} \cdot \left(1 + \dfrac{\partial \sum_{n=i}^{m-1} F(X_n, W_n)}{\partial X_i}\right)$

Since the output of each layer of the network is positive, $D_i > 0$; because of the constant term 1, the gradient will not disappear as the network deepens.

As shown in Figure 7, this paper introduces the Reslayer structure into the feature pyramid of YOLOv4 to design the RL-PAFPN feature fusion structure, improving the upsampling and downsampling process. This reduces the correlation between the semantic information of the feature layers generated by the deep network and the information of the initial feature layers, and also enriches the feature information obtained by upsampling and downsampling, facilitating the fusion of image features at different scales and improving the accuracy of the model on multi-scale targets.

Table 3 shows the number of parameters, the computation and the mAP when the Reslayer module is introduced into the YOLOv3, YOLOv4-tiny and YOLOv4 models. The configuration environment used for model training and testing is an NVIDIA GeForce GTX1070 8G with PyTorch 1.6 and Python 3.8.2. Introducing the Reslayer module into YOLOv3 results in small increases in the number of parameters and the computation, of 0.21M and 0.14M respectively, while the accuracy of the model improves by a more significant 1.32%. Adding the Reslayer module to YOLOv4-tiny increases the number of parameters and the computation by 0.24M and 0.1M respectively, but the mAP increases by 1.42%. Adding the Reslayer module to YOLOv4 increases the number of parameters and the computation by 6.28M and 5.59M respectively, and the mAP increases by 0.79%. The Reslayer module therefore adds only a small number of parameters and little computation, yet effectively improves the accuracy of the model.

Figure 7. RL-PAFPN’s structure.


Table 3. Performance comparison of the introduction of the Reslayer module in each model.

4. Experiments and analysis of results

4.1. Experimental environment and dataset

The experimental dataset in this paper is Pascal VOC2007 + 2012, with 16,551 images covering 20 target classes: person, cat, dog, cow, horse, sheep, bird, car, bus, bicycle, motorbike, boat, plane, train, chair, table, sofa, television, bottle and potted plant (Hu et al., Citation2018). The configuration environment used for model training and testing is an NVIDIA GeForce GTX1070 8G with PyTorch 1.6 and Python 3.8.2.

4.2. Ablation experiments

To verify the optimisation effect of each part of the lightweight object detection method described above, ablation experiments were conducted. As shown in Table 4, “GhostNet” indicates the use of GhostNet as the backbone network, “AFmodule” the introduction of the AFmodule in the first convolutional layer of GhostNet, “Meta-ACON” the addition of the adaptive activation function to the Ghost Module, and “RL-PAFPN” the efficient feature fusion structure designed with the Reslayer. The results show that when only GhostNet is used as the backbone, the number of parameters drops from the original 64.36M to 11.69M, but the accuracy of the model decreases; after the introduction of the AFmodule and Meta-ACON, the accuracy improves by 0.95% and 2.09% respectively, with the AFmodule further reducing the number of parameters; after improving the feature fusion structure with the Reslayer, the accuracy improves by a further 2.18%. Compared with the original YOLOv4 network, the number of parameters is reduced by 46.21M and the accuracy is improved by 2.41%.

Table 4. Comparison of ablation experiments.

4.3. Model comparison experiments and analysis

To verify the performance of the lightweight object detection model proposed in this paper, it was compared with seven one-stage object detection models commonly used in recent years, using four metrics in the same experimental environment: Params size, Model size, GFLOPs and mAP.

As shown in Table 5, the proposed YOLOv4-Ghost-AMR model reduces the number of parameters and the model size by approximately 71% compared with YOLOv4, with about one-fifth of the computation and an accuracy improvement of 2.41%. Compared with the lightweight model YOLOv4-tiny, YOLOv4-Ghost-AMR has more parameters, a larger model size and more computation, but its accuracy is 5.9% higher. Compared with YOLOv5-l, the parameter count is reduced by 29.32M and accuracy improves by 1.13%. Compared with YOLOX-l and YOLOX-m, YOLOv4-Ghost-AMR has fewer parameters and lower computational cost, with accuracy improvements of 0.2% and 5.29% respectively. Compared with CenterNet, YOLOv4-Ghost-AMR has 14.51M fewer parameters, about 56% of the model size, less computation and a 3.88% accuracy improvement. Compared with Efficientdet-d7, it has significantly fewer parameters, less computation and a 3.2% accuracy improvement. In terms of overall performance, the proposed YOLOv4-Ghost-AMR object detection model has fewer parameters and less computation, effectively improving accuracy while remaining lightweight.

  1. The performance of different baseline models with the AFmodule is compared in Table 1. The results show that the AFmodule can effectively improve both the detection speed and the accuracy of a model.

  2. The effect of the Reslayer module is shown in Table 3. The results show that the deep residual convolution Reslayer can effectively improve the detection accuracy of a model while only slightly increasing its number of parameters and computation.

  3. The results of the ablation experiments on the YOLOv4-Ghost-AMR model are shown in Table 4. The GhostNet module effectively reduces the computation and parameter count of the model but lowers its accuracy; after adding the AFmodule, Meta-ACON and Reslayer, the accuracy gradually increases, balancing the accuracy and lightness of the model.

  4. The comparison of the YOLOv4-Ghost-AMR model with other object detection models is shown in Table 5. The proposed model keeps a lightweight design while also taking accuracy into account: it is more accurate than other lightweight models and more lightweight than other high-accuracy object detection models.

Table 5. Comparison of different models.

5. Application

Object detection technology is now used in many aspects of life and plays a major role in the medical field. Tuberculosis is a chronic infectious disease caused by Mycobacterium tuberculosis; it is transmitted through the respiratory tract and is highly contagious, causing serious problems in people's daily lives. The traditional way of detecting Mycobacterium tuberculosis is for a doctor to examine the stained bacteria under a microscope, a complicated procedure prone to errors and omissions due to technical or human factors. At this stage, newer detection methods usually use special equipment to take stained images of Mycobacterium tuberculosis and apply object detection techniques to detect and count the bacteria automatically, combining disease diagnosis with object detection and significantly reducing the time and labour costs of detection. This paper therefore uses the detection of Mycobacterium tuberculosis as an application scenario to verify the effectiveness of the above model.

This paper carries out a comparative experiment to validate the performance of the proposed model on the Mycobacterium tuberculosis dataset (Tuberculosis) (Quinn et al., Citation2016), which consists of 1265 sputum images containing 3734 bacterial instances. The training set contains 1025 images, the validation set 112 images and the test set 128 images; Figure 8 shows a selection of images from the tuberculosis dataset. The software and hardware environments for the experiments are the same as those described in Section 4.

Figure 8. Mycobacterium tuberculosis images from the tuberculosis dataset.


Table 6 shows the performance on the Mycobacterium tuberculosis dataset of the original YOLOv4 model and of YOLOv4 with each improved module, where “GN” corresponds to the GhostNet module, “AF” to the AFmodule and “AN” to the Meta-ACON module. In terms of parameters and computation, the improved model proposed in this paper reduces the number of parameters by about 72% and the computation by 53%, a clear lightweight improvement. In terms of FPS, the proposed model improves on the original YOLOv4 by about 51%, substantially increasing detection speed. In terms of mAP, improving the backbone with the lightweight GhostNet initially lowers accuracy, but after adding the AFmodule, the Meta-ACON activation function and the Reslayer feature fusion module, the accuracy of the model rises to 79.71%.

Table 6. Performance of our model on the tuberculosis dataset.

Figure 9 shows the detection results of the original YOLOv4 model on some of the Mycobacterium tuberculosis images after training for 300 epochs. Both the number of bacteria detected and the accuracy are low, and Table 6 in the previous section also shows that the low FPS of the original YOLOv4 means more time is needed for detection, which is not conducive to deployment in highly time-sensitive medical settings.

Figure 9. Results of the original YOLOv4 test for selected Mycobacterium tuberculosis images.


Figure 10 shows the detection results on the same Mycobacterium tuberculosis images after training the improved model proposed in this paper for 300 epochs. More bacteria are detected and they are identified with higher accuracy. Together with the FPS reported in Table 6, this shows that the proposed model can identify Mycobacterium tuberculosis faster and more accurately, which is beneficial for deployment on highly time-sensitive medical devices.

Figure 10. Results of our model test for selected Mycobacterium tuberculosis images.


The model proposed in this paper can be deployed on microscopes and other medical devices to detect Mycobacterium tuberculosis, enabling rapid and accurate detection of its location and size and significantly reducing the time and labour costs of detection. It can also be applied to other diseases and easily deployed on highly time-sensitive medical devices, which is of great importance and value for the detection of pathogens in the medical field. In future work the model will be applied to other medical datasets for further validation and performance improvement.

6. Conclusion

This paper proposes the lightweight object detection network YOLOv4-Ghost-AMR, which improves detection accuracy while keeping the model lightweight. For the lightweight design, this paper combines GhostNet and the AFmodule to reduce the number of parameters and the computation of the backbone network, which also improves the network's ability to capture image edge features and reduces the information loss of the feature map. The Meta-ACON activation function is then introduced to improve the convolution module of GhostNet used to acquire true feature layers; its adaptively learnt parameters dynamically adjust the activation of the network, obtaining more accurate non-linear feature relationships in images. This paper also designs the RL-PAFPN feature fusion structure, which combines the idea of residual networks and fuses image features with a deeper convolutional structure to further improve the detection performance of the model. Experimental results on the Pascal VOC dataset show that the proposed lightweight object detection model has fewer parameters and less computation than other one-stage object detection models and achieves an accuracy of 86.83% with optimal overall performance. The efficiency and accuracy of the model are further validated on the Tuberculosis dataset, demonstrating that the model proposed in this paper can process large amounts of image data with high accuracy in a short time, facilitating medical applications such as bacteria detection. In future work, more lightweight network design methods will be explored to further reduce the number of parameters and the computation while maintaining accuracy, yielding an object detection model better suited to mobile detection tasks, and the model will be applied to other fields to further improve its detection performance and application value.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work is supported by the National Natural Science Foundation of China [grant number 61602161, 61772180], Hubei Province Science and Technology Support Project [grant number 2020BAB012], The Fundamental Research Funds for the Research Fund of Hubei University of Technology [HBUT: 2021046].

References

  • Bochkovskiy, A., Wang, C. Y., & Liao, H. Y. M. (2020). YOLOv4: Optimal speed and accuracy of object detection. https://doi.org/10.48550/arXiv.2004.10934
  • Dubey, S. R., Singh, S. K., & Chaudhuri, B. B. (2022). Activation functions in deep learning: A comprehensive survey and benchmark. Neurocomputing. https://doi.org/10.1016/j.neucom.2022.06.111
  • Gavrilescu, R., Zet, C., Foalău, C., Skoczylas, M., & Cotovanu, D. (2018, October). Faster R-CNN: An approach to real-time object detection. In 2018 International Conference and Exposition on Electrical and Power Engineering (pp. 165–168). https://doi.org/10.1109/ICEPE.2018.8559776
  • Han, K., Wang, Y., Tian, Q., Guo, J., Xu, C., & Xu, C. (2020). GhostNet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1580–1589). https://doi.org/10.48550/arXiv.1911.11907
  • He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2961–2969). https://doi.org/10.48550/arXiv.1703.06870
  • He, K., Zhang, X., Ren, S., & Sun, J. (2015). Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9), 1904–1916. https://doi.org/10.1109/TPAMI.2015.2389824
  • Howard, A., Sandler, M., Chu, G., Chen, L. C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., & Vasudevan, V. (2019). Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1314–1324).
  • Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., & Adam, H. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications. https://doi.org/10.48550/arXiv.1704.04861
  • Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7132–7141). https://doi.org/10.48550/arXiv.1709.01507
  • Hu, Q., & Zhai, L. (2019). RGB-D image multi-target detection method based on 3D DSF R-CNN. International Journal of Pattern Recognition and Artificial Intelligence, 33(8). https://doi.org/10.1142/S0218001419540260
  • Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., & Keutzer, K. (2016). SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. https://doi.org/10.48550/arXiv.1602.07360
  • Jocher, G., Nishimura, K., Mineeva, T., & Vilariño, R. (2020). YOLOv5. Code repository.
  • Lan, R., Sun, L., Liu, Z., Lu, H., Pang, C., & Luo, X. (2020). MADNet: A fast and lightweight network for single-image super resolution. IEEE Transactions on Cybernetics, 51(3), 1443–1453. https://doi.org/10.1109/TCYB.2020.2970104
  • Li, B., & Li, J. (2022). Methods for landslide detection based on lightweight YOLOv4 convolutional neural network. Earth Science Informatics, 765–775. https://doi.org/10.1007/s12145-022-00764-0
  • Li, C., Li, L., Jiang, H., Weng, K., Geng, Y., Li, L., Ke, Z., Li, Q., Cheng, M., Nie, W., & Wei, X. (2022). YOLOv6: A single-stage object detection framework for industrial applications. https://doi.org/10.48550/arXiv.2209.02976
  • Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C. (2016, October). SSD: Single shot multibox detector. In European Conference on Computer Vision (pp. 21–37). Springer. https://doi.org/10.1007/978-3-319-46448-0_2
  • Ma, N., Zhang, X., Liu, M., & Sun, J. (2021). Activate or not: Learning customized activation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8032–8042). https://doi.org/10.48550/arXiv.2009.04759
  • Quinn, J. A., Nakasi, R., Mugagga, P. K., Byanyima, P., Lubega, W., & Andama, A. (2016, December). Deep convolutional neural networks for microscopy-based point of care diagnostics. In Machine Learning for Healthcare Conference (pp. 271–281). https://doi.org/10.48550/arXiv.1608.02989
  • Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 779–788).
  • Redmon, J., & Farhadi, A. (2017). YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7263–7271). https://doi.org/10.48550/arXiv.1612.08242
  • Redmon, J., & Farhadi, A. (2018). YOLOv3: An incremental improvement. https://doi.org/10.48550/arXiv.1804.02767
  • Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L. C. (2018). Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4510–4520). https://doi.org/10.48550/arXiv.1801.04381
  • Wang, C. Y., Bochkovskiy, A., & Liao, H. Y. M. (2021). Scaled-YOLOv4: Scaling cross stage partial network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 13024–13033). https://doi.org/10.1109/CVPR46437.2021.01283
  • Wang, C. Y., Bochkovskiy, A., & Liao, H. Y. M. (2022). YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. https://doi.org/10.48550/arXiv.2207.02696
  • Wang, Y., Hua, C., Ding, W., & Wu, R. (2022). Real-time detection of flame and smoke using an improved YOLOv4 network. Signal, Image and Video Processing, 16(4), 1109–1116. https://doi.org/10.1007/s11760-021-02060-8
  • Yan, P., Sun, Q., Yin, N., Hua, L., Shang, S., & Zhang, C. (2022). Detection of coal and gangue based on improved YOLOv5.1 which embedded scSE module. Measurement, 188, Article 110530. https://doi.org/10.1016/j.measurement.2021.110530
  • Yang, D., Du, Y., Yao, H., & Bao, L. (2022). Image semantic segmentation with hierarchical feature fusion based on deep neural network. Connection Science, 34(1), 1772–1784. https://doi.org/10.1080/09540091.2022.2082384
  • Yu, J., & Zhang, W. (2021). Face mask wearing detection algorithm based on improved YOLO-v4. Sensors, 21(9), 3263. https://doi.org/10.3390/s21093263
  • Zhang, X., Zhou, X., Lin, M., & Sun, J. (2018). Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6848–6856). https://doi.org/10.48550/arXiv.1707.01083
  • Zhu, Y., Xia, Q., & Jin, W. (2022). SRDD: A lightweight end-to-end object detection with transformer. Connection Science, 34(1), 2448–2465. https://doi.org/10.1080/09540091.2022.2125499