Research Article

IVP-YOLOv5: an intelligent vehicle-pedestrian detection method based on YOLOv5s

Article: 2168254 | Received 29 Sep 2022, Accepted 10 Jan 2023, Published online: 03 Feb 2023

Abstract

Computer vision is now vital in intelligent vehicle environment perception systems. However, real-time detection of small-scale pedestrians in such systems still needs to be improved. This paper proposes an intelligent vehicle-pedestrian detection method based on YOLOv5s, named IVP-YOLOv5, for use in vehicle environment perception systems. Based on the network structure of YOLOv5s, we replace BottleNeck CSP with Ghost-Bottleneck to reduce the complexity of processing feature maps while maintaining good detection performance. To reduce the error between the ground truth box and the predicted box, we apply Alpha-IoU as the bounding box loss function, improving pedestrian detection accuracy and robustness. We also introduce the slicing-aided hyper inference (SAHI) strategy, which enables the lightweight backbone network to capture more detailed pedestrian features by enlarging image pixels. Experiments on the BDD100K dataset show that the proposed IVP-YOLOv5 achieves 67.1% AP and 18.5% APs for pedestrian detection, with only 10.5 GFLOPs and 4.9M parameters.

1. Introduction

Object detection technology plays a crucial role in computer vision due to its versatility in applications such as security systems, pedestrian re-identification, pedestrian tracking, and pedestrian intent prediction. With the development of intelligent vehicles, pedestrian detection has become a key technology in object detection. Moreover, fast and accurate pedestrian detection methods are of great significance for the safety of intelligent vehicles on the road and for the protection of pedestrians (Parekh et al., 2022; Zablocki et al., 2022). In intelligent vehicles, object detection generally relies on lightweight detection methods because of the limited computing power of on-board devices. Lightweight object detection methods have achieved good confidence scores for large- and medium-scale pedestrians in complex traffic road scenes. However, they still produce missed and false detections for small-scale pedestrians at long distances. Thus, how to improve detection accuracy while keeping the method lightweight is an urgent problem (Westhofen et al., 2023).

Up to now, experts have proposed useful methods to improve the detection accuracy and robustness of small objects in complex scenes. Deng et al. (2022) proposed the extended feature pyramid network (EFPN) based on the feature pyramid network (FPN), which enriches detailed features by using super-resolution (SR) features as a new feature transfer module in the FPN to facilitate the detection of small and medium objects. Cai et al. (2021) proposed a new pruning scheme that divides the DNN (deep neural network) weights of a layer into multiple blocks of equal size and prunes the weights within each block to the same shape. Furthermore, they adopted a mobile GPU-CPU collaboration scheme so that detection methods deployed on mobile devices can maintain good detection accuracy and achieve efficient inference speed. Sun et al. (2022) integrated improved spatial pyramid pooling (SPP) layers into the lateral connections of the FPN to better extract fine-grained information from shallow feature maps and improve the object detection accuracy of unmanned aerial vehicles. Ye et al. (2022) proposed the global-local feature enhancement network (GLF-Net). In the feature extraction process, the local feature extraction (LFE) module and the global feature extraction (GFE) module extract the image's local and global features to accomplish stable feature extraction in complex backgrounds and dense scenes. The feature fusion module is responsible for fusing these global and local features to enhance the feature representation capability of the network model, thus improving the detection accuracy of multi-scale objects. For the problem of balancing the detection performance and efficiency of small objects, Yang et al. (2022) proposed a query mechanism that leads to faster inference in feature-pyramid-based object detectors: it predicts the coarse positions of small objects on low-resolution features and guides the high-resolution features to compute the exact results. This approach takes full advantage of the high-resolution feature map and avoids useless computation on large amounts of background information. Tian et al. (2022) proposed a dual inspection mechanism to solve the problem of missed detection of small objects. When a single-stage detector misses an object, the denoising sparse autoencoder (DSAE) module extracts images of the likely object regions as low-dimensional feature vectors. Then, based on the results of the two detections, the instances in the image are ranked to identify the missed objects.

Due to the complex traffic environment faced by intelligent vehicles, the relative positions between objects and the vehicle-mounted camera differ, resulting in different object scales in the image. Some datasets explicitly define objects at different scales. The CityPersons dataset defines objects smaller than 75 px as small objects, objects between 75 and 100 px as medium objects, and objects larger than 100 px as large objects (Lv et al., 2022). The MS COCO dataset defines small objects as smaller than 32 × 32 px, medium objects as between 32 × 32 and 96 × 96 px, and large objects as larger than 96 × 96 px (Li et al., 2022). In traffic sign datasets, objects whose width is less than 20% of the entire image are generally defined as small objects (Wang et al., 2022b).
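To make the MS COCO convention above concrete, the following sketch (our own illustrative helper, not part of any dataset toolkit; the function name and labels are hypothetical) classifies a bounding box by its pixel area.

```python
def coco_scale(width_px: float, height_px: float) -> str:
    """Classify an object by the MS COCO absolute-scale convention.

    Small:  area <  32 * 32 px
    Medium: 32 * 32 px <= area < 96 * 96 px
    Large:  area >= 96 * 96 px
    """
    area = width_px * height_px
    if area < 32 * 32:
        return "small"
    if area < 96 * 96:
        return "medium"
    return "large"


# Example: a 30 x 60 px pedestrian (area 1800 px^2) counts as a medium object.
print(coco_scale(30, 60))
```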

This paper takes pedestrians in complex traffic road scenarios as the research object. In order to solve the problems of false detection, missed detection, and poor detection of small-scale pedestrians in YOLOv5s, while also considering real-time performance, we propose an intelligent vehicle-pedestrian detection method based on YOLOv5s (IVP-YOLOv5). The contributions of this paper are as follows: 1. We apply the Ghost-BottleNeck module, which enables the network model to maintain good detection performance while reducing the computational cost. 2. We use Alpha-IoU as the bounding box loss function to optimise the localisation of pedestrian detection boxes while improving the robustness of pedestrian detection. 3. We apply the slicing-aided hyper inference (SAHI) strategy to significantly improve pedestrian detection performance in complex traffic road scenes.

The rest of this paper is organised as follows: Section 2 outlines related work on object detection and focuses on recent research. Section 3 describes the improvements made to YOLOv5s. Section 4 presents ablation experiments and real traffic road experiments, and deploys the proposed method on an intelligent vehicle for campus pedestrian detection. Finally, Section 5 concludes the study.

2. Related work

2.1. Two-stage approaches

The two-stage approach consists of two main parts: the first decides region proposals by selective search, and the second uses a convolutional neural network (CNN) to extract features from the region proposals and determine classes using classifiers (Zaidi et al., 2022). The R-CNN series of algorithms are classic representatives of this category. The R-CNN algorithm extracts about 2000 bottom-up candidate boxes, transforming the object detection problem into a region classification problem through candidate region extraction (Liu et al., 2021). Fast R-CNN adds a region-of-interest layer to the feature extraction network. The SPP layer over regions of interest is a pyramid layer whose aim is to extract feature maps of the same proportion from candidate regions of different sizes. In addition, Fast R-CNN enables end-to-end training and does not require additional cache space to store features (Arkin et al., 2021). In Faster R-CNN, the region proposal network obtains the candidate regions with the highest probability of containing an object. The parameters of the network model and the extracted features are shared to improve the speed and accuracy of the algorithm (Maity et al., 2021).

In recent years, two-stage object detection methods have made new progress. Cascade R-CNN focuses on optimising different IoU thresholds and uses cascaded regressors with three different IoU thresholds to train network parameters and obtain high-quality detection models (Ahmad et al., 2021). Grid R-CNN (Lu et al., 2019a) uses a grid-guided localisation mechanism for precise object detection. Unlike traditional regression-based methods, it explicitly captures spatial information and enjoys the position-sensitive property of fully convolutional architectures. In order to improve detection efficiency, Grid R-CNN Plus (Lu et al., 2019b) reduces the computational cost of the network model by reducing the size of the feature map in the grid branch and the number of convolutional layers for feature fusion. In addition, the sampling strategy, normalisation method, non-maximum suppression (NMS), and some hyperparameters are further analysed and modified to optimise the performance of the network model. Sparse R-CNN (Sun et al., 2021) provides a set of sparse, fixed-scale features that contain location information and proposal features in the prediction part of object detection. This structure effectively avoids operations such as candidate-object generation, label assignment, and post-processing such as NMS, thus speeding up the convergence of the network model. In order to solve the problem of inaccurate matching between proposal boxes and ground truth boxes in Sparse R-CNN, Dynamic Sparse R-CNN (Hong et al., 2022) uses a dynamic label assignment (DLA) method to assign many positive samples in the training stage and generate better-quality proposal boxes in the subsequent stage. Featurized Query R-CNN (Zhang et al., 2022) uses a region proposal network (RPN)-guided query generation network (QGN) to generate image-aware object queries and object boxes, reducing the number of decoder stages and improving computational efficiency without compromising performance.

2.2. One-stage approaches

The one-stage method directly predicts object locations and labels through a CNN. YOLO is the first one-stage object detection method. It treats object detection as a regression problem and directly uses the entire image as the input to train the model (Bhavya Sree et al., 2021). YOLOv2 sets the resolution of the input image to 448 × 448 (224 × 224 in YOLOv1), using high-resolution feature maps to locate smaller objects. In addition, an anchor box mechanism is introduced, which significantly improves the performance of the object detector (Joseph et al., 2021). YOLOv3 uses Darknet-53 as the feature extraction network and a multi-scale feature map prediction method to give the detector the ability to detect small objects (Hurtik et al., 2022). YOLOv4 changes the ReLU activation function in the Darknet-53 network to the Mish activation function to enhance the feature extraction ability (Dewi et al., 2022). Multi-scale feature maps are processed in the neck using the FPN and a path aggregation network (PAN). Besides, it applies practical techniques such as Mosaic data augmentation and parameter optimisation to significantly improve speed and accuracy. YOLOv5 iteratively adjusts anchor boxes using an adaptive anchor box method. It reduces computational complexity while preserving inference accuracy by applying the CSPNet structure in the backbone network (Zhang et al., 2022). PP-YOLO uses ResNet50-vd as the backbone network and replaces some convolutions with deformable convolutions (Jian & Lang, 2021). A reasonable optimisation strategy is used during training to balance accuracy and speed. YOLOX (Ge et al., 2021) adopts an anchor-free design to reduce manually set parameters. The performance of the detector is improved by integrating techniques such as a decoupled head and advanced label assignment into the model. YOLOR (Wang et al., 2021) applies a network that can complete multiple tasks simultaneously; it learns fused explicit and implicit knowledge, effectively improving model performance at low computational cost. PP-YOLOE (Xu et al., 2022) is an anchor-free detection method. The RepResBlock module is applied in the backbone network and neck so that the model can reduce the computational burden without losing detection accuracy. An efficient task-aligned head (ET-Head) is used in the head part to increase the detection efficiency of the detector. YOLOv7 (Wang et al., 2022a) applies the extended and compound scaling method, reducing the detector parameters by 50% while improving inference speed and detection accuracy. In addition, the detector applies implicit knowledge combined with convolutional feature maps, the EMA (exponential moving average), and other training techniques to improve detection accuracy without increasing inference cost.

3. Improvement methods

3.1. Ghost-BottleNeck module

The feature maps generated by the backbone network usually contain many duplicate and similar feature maps. In order to reduce the computational cost of using convolution to generate feature maps, the Ghost module (Han et al., 2020) is considered for generating feature maps. The structures of the conventional convolution and the Ghost module are shown in Figure 1. The conventional convolution directly performs convolution on the input feature map to obtain the output feature map. The Ghost module performs convolution on the input feature map to generate the original feature maps, whose number of channels is smaller than that of the output feature map. Each channel of the original feature maps is then linearly transformed to obtain the Ghost feature maps, which are stitched together with the original feature maps to obtain the output feature map.

Figure 1. Structures of conventional convolution and Ghost module.


Assuming that the input data is $X \in \mathbb{R}^{c \times h \times w}$, the convolution filter is $f \in \mathbb{R}^{c \times k \times k \times m}$, and the output features with $m$ channels are $Y \in \mathbb{R}^{h' \times w' \times m}$, the conventional convolution operation in the Ghost module (ignoring the bias term) is as follows:
(1) $Y = X * f$
After the conventional convolution, we obtain the original feature maps of $m$ channels. In order to make up the required $n$ feature maps, the original feature map of each channel generates Ghost feature maps separately by linear transformations, where $\Phi$ denotes the linear transformation in Figure 1(b). The Ghost feature maps are calculated as follows:
(2) $y_{ij} = \Phi_{i,j}(y_i), \quad i = 1, \ldots, m, \; j = 1, \ldots, s$
In the formula, $y_i$ is the $i$-th original feature map, $\Phi_{i,j}$ is the $j$-th linear transformation, $y_{ij}$ is the resulting Ghost feature map, and $\Phi_{i,s}$ represents the identity transformation of the $i$-th original feature map. According to the linear transformation of Equation (2), the output of the Ghost module is $n = m \cdot s$ feature maps $Y' = [y_{11}, y_{12}, \ldots, y_{ms}]$.

When the size of the input feature map is $c \times h \times w$, the size of the convolution kernel is $k \times k$, and the size of the output feature map is $c' \times h' \times w'$, the numbers of parameters of the conventional convolution and of the Ghost module are as follows:
(3) $P_{Conv} = c' \times c \times k \times k$
(4) $P_{Ghost} = c \times m \times k \times k + m \times n \times d \times d$
where $m$ is the number of channels of the original feature maps, $n$ is the number of kernels in the linear transformation, $d \times d$ is the size of the linear kernel, and $d \times d \ll k \times k$. The ratio of the parameters of the conventional convolution to those of the Ghost module is:
(5) $R_p = \dfrac{P_{Conv}}{P_{Ghost}} = \dfrac{c' \times c \times k \times k}{c \times m \times k \times k + m \times n \times d \times d} \approx \dfrac{c'}{m}$
Through this analysis, the ratio of the parameters of the conventional convolution to those of the Ghost module is approximately $c'/m$. When the number of channels of the original feature maps is reduced, the number of parameters of the Ghost module becomes smaller than that of the conventional convolution. The number of parameters is minimised if the conventional convolution is skipped and the Ghost feature maps are generated directly by the linear transformation.

The Ghost module differs considerably from current efficient convolution schemes. Firstly, compared with the widely used 1 × 1 pointwise convolution unit (Su et al., 2022), the Ghost module can use a primary convolution with a customised kernel size. Secondly, pointwise convolution is usually used to process information between different channels, and depthwise convolution (Patel et al., 2022) is usually used to process spatial information. In contrast, the Ghost module uses conventional convolution to generate the original feature maps and then uses simple linear transformations to augment the features, which ensures the quality of the feature maps while reducing the computational cost.
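As a concrete illustration of how Equations (1) and (2) can be realised, the following is a minimal PyTorch sketch of a Ghost module (our own illustrative re-implementation under assumed default settings such as ratio = 2, not the official GhostNet or IVP-YOLOv5 code); a conventional convolution produces the m primary feature maps and a cheap depthwise convolution plays the role of the linear transformation Φ.

```python
import math
import torch
import torch.nn as nn


class GhostModule(nn.Module):
    """Minimal Ghost module: primary conv -> cheap linear ops -> concatenation."""

    def __init__(self, in_ch, out_ch, kernel_size=1, ratio=2, dw_size=3, relu=True):
        super().__init__()
        self.out_ch = out_ch
        init_ch = math.ceil(out_ch / ratio)      # m primary channels
        cheap_ch = init_ch * (ratio - 1)         # ghost channels from cheap ops

        # Primary (conventional) convolution producing the original feature maps
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, init_ch, kernel_size, 1, kernel_size // 2, bias=False),
            nn.BatchNorm2d(init_ch),
            nn.ReLU(inplace=True) if relu else nn.Identity(),
        )
        # Cheap operation: a depthwise conv acts as the per-channel transform Phi
        self.cheap = nn.Sequential(
            nn.Conv2d(init_ch, cheap_ch, dw_size, 1, dw_size // 2,
                      groups=init_ch, bias=False),
            nn.BatchNorm2d(cheap_ch),
            nn.ReLU(inplace=True) if relu else nn.Identity(),
        )

    def forward(self, x):
        y = self.primary(x)                 # original feature maps
        ghost = self.cheap(y)               # ghost feature maps via linear transform
        out = torch.cat([y, ghost], dim=1)  # stitch the two groups together
        return out[:, :self.out_ch]         # trim to the requested channel count
```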

Based on the advantages of the Ghost module in computing feature maps, we apply the Ghost-BottleNeck, as shown in Figure 2. The Ghost-BottleNeck is mainly composed of a stack of two Ghost modules; the overall structure is similar to a residual block, and both use skip connections. The first Ghost module acts as a channel-widening layer: it widens the channels of the feature map, thereby increasing the feature dimension. The second Ghost module acts as a channel-preserving layer: it changes the spatial dimension of the feature map while keeping the channels consistent. When the input data passes through the first Ghost module's convolution, normalisation, and ReLU activation (CBR), the number of channels of the output features is reduced, thereby lowering the computational complexity of the convolution. CBR is then performed again and the result is spliced with the shortcut path to enhance the feature information of the feature map. After the second Ghost module outputs its features, partial feature maps of the input are spliced with the output features to enhance the semantic information of the feature maps and maintain channel consistency. Compared with the first Ghost module, the second Ghost module omits the ReLU activation function, for two main reasons: on the one hand, it alleviates the gradient explosion problem caused by backpropagation; on the other hand, after the output of the first Ghost module, the input data distribution of the next layer is inconsistent with that of the previous layer, which reduces the training speed of the network. Therefore, the ReLU activation function is not used in the second half of the module. A minimal code sketch of this structure is given after Figure 2.

Figure 2. Ghost-BottleNeck module.

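Under the same assumptions, the Ghost-BottleNeck described above can be sketched as two stacked Ghost modules with a skip connection (again illustrative only, building on the GhostModule sketch above; a stride-1 block is assumed, the first module uses ReLU, and the second omits it, matching the discussion).

```python
import torch.nn as nn
# Uses the GhostModule sketch defined in the previous code block.


class GhostBottleneck(nn.Module):
    """Stride-1 Ghost-BottleNeck: expand -> project, with a skip connection."""

    def __init__(self, in_ch, hidden_ch, out_ch):
        super().__init__()
        # First Ghost module widens the channels (feature expansion), with ReLU.
        self.expand = GhostModule(in_ch, hidden_ch, relu=True)
        # Second Ghost module projects back to out_ch, without ReLU.
        self.project = GhostModule(hidden_ch, out_ch, relu=False)
        # Shortcut: identity when shapes already match, else a 1x1 projection.
        self.shortcut = (
            nn.Identity() if in_ch == out_ch
            else nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, bias=False),
                               nn.BatchNorm2d(out_ch))
        )

    def forward(self, x):
        return self.project(self.expand(x)) + self.shortcut(x)
```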

Based on the above discussion and analysis, the Ghost-BottleNeck module, built on the Ghost module, can reduce computational complexity. In order to improve the computational efficiency of pedestrian detection, the BottleNeck CSP module is replaced by the Ghost-BottleNeck module. The modified network structure is shown in Figure 3. The backbone network still uses CSPDarknet-53, which has a good feature extraction effect. When processing feature maps, the Ghost-BottleNeck module uses a linear transform instead of convolution to generate a portion of the feature maps, allowing the backbone network not only to reduce the computational cost effectively but also to extract rich feature map information. When dealing with multi-scale feature maps in the neck, we still use the FPN and PAN structures. Ghost-BottleNeck assumes the role of feature enhancement, assists multi-scale feature fusion, and reduces the computational load of multi-scale features. Therefore, embedding the Ghost-BottleNeck module into the backbone network and the neck reduces the computational complexity of the network model while maintaining precision.

Figure 3. Improved YOLOv5s network structure.


3.2. Optimisation of loss function

YOLOv5s uses the GIoU loss function to calculate the loss between the predicted boxes and the ground truth boxes (Zhao et al., 2022). GIoU is calculated as follows:
(6) $GIoU = IoU - \dfrac{|C \setminus (A \cup B)|}{|C|}$
where $IoU$ is the intersection over union of the predicted box and the ground truth box, $A$ is the predicted box, $B$ is the ground truth box, and $C$ is the minimum enclosing box of $A$ and $B$.

The accuracy of the prediction box is critical in object detection. Although GIoU inherits the scale invariance of IoU and can alleviate the vanishing-gradient problem when the prediction box and the ground truth box have no overlapping area (Song et al., 2022), object box localisation is still inaccurate. Moreover, most ground truth boxes are manually labelled, so the effect of label noise on detector performance during training must be considered. Alpha-IoU (He et al., 2021) is therefore used as the bounding box regression loss function.

Alpha-IoU builds on IoU and its variants by applying the Box-Cox transform to the IoU loss and adding power regularisation, yielding the following loss:
(7) $L_{\alpha\text{-}IoU} = \dfrac{1 - IoU^{\alpha}}{\alpha}, \quad \alpha > 0$
Most IoU terms in existing losses, such as $\log(IoU)$, can be derived by tuning the parameters of the Alpha-IoU loss. We can generalise the common IoU-based losses ($L_{IoU}$, $L_{GIoU}$, etc.) by introducing a penalty term into the previous formula:
(8) $L_{\alpha\text{-}IoU} = 1 - IoU^{\alpha_1} + P^{\alpha_2}(B, B^{gt})$
where $\alpha_1 > 0$, $\alpha_2 > 0$, and $P^{\alpha_2}(B, B^{gt})$ denotes any penalty term calculated from $B$ and $B^{gt}$. The value of $\alpha$ is adjustable, and different settings of $\alpha$ yield different properties. When $\alpha > 1$, it helps the model focus more on objects with high IoU. Adjusting $\alpha$ also adapts the gradient reweighting: IoU shifts from up-weighting to down-weighting when $0 < \alpha < 1$ and from down-weighting to up-weighting when $\alpha > 1$, enabling the model to learn at an adaptive speed according to the IoU of the object. Typically, $\alpha = 3$ gives better performance.
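A minimal sketch of the basic loss in Equation (7) is shown below (our own illustrative code for axis-aligned boxes in (x1, y1, x2, y2) format; the penalty-term variant of Equation (8) and the exact integration into the YOLOv5s training loop are omitted).

```python
import torch


def box_iou(pred, target, eps=1e-7):
    """IoU of axis-aligned boxes given as (x1, y1, x2, y2) tensors of shape (N, 4)."""
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    return inter / (area_p + area_t - inter + eps)


def alpha_iou_loss(pred, target, alpha=3.0):
    """Basic Alpha-IoU loss, L = (1 - IoU^alpha) / alpha, averaged over boxes."""
    iou = box_iou(pred, target)
    return ((1.0 - iou.pow(alpha)) / alpha).mean()
```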

3.3. Improvement of the inference process

Many detector studies improve detection performance by changing the DNN structure or increasing the depth of the network (Pal et al., 2021). Although an enlarged detector can achieve an excellent detection effect, it increases the computational complexity of the network model and the inference time. Therefore, the SAHI (Akyon et al., 2022) strategy is used during detector inference to improve pedestrian detection performance while keeping complexity and memory requirements as low as possible.

SAHI draws on the idea of sliding windows and applies it to the image inference process. As shown in Figure 4, SAHI consists of two main parts: full inference and slicing-aided inference. Full inference feeds the entire image into the inference model to detect objects rich in feature information. Slicing-aided inference slices the whole image into M × N overlapping sliced images, resizes the sliced images while maintaining the aspect ratio, and feeds each sliced image into the inference model independently to detect objects. After resizing a sliced image, the detailed features of pedestrians become more pronounced, making it easier for the network model to extract feature information. The results of the full inference and the slicing-aided inference are fed to NMS, which removes low-confidence detection boxes for the same object, and the detections are restored to the original image size.

Figure 4. Slicing Aided Hyper Inference strategies.


When applying the SAHI strategy for inference, the size of the sliced image is essential to the detection results. With other parameters unchanged, smaller slices allow the inference model to capture more detailed features but also produce more false detections. Given the complexity of our application scenes, the entire image is divided into four parts for auxiliary slicing, based on experience, so that detailed features can be extracted without causing too many false detections.
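To make the inference procedure concrete, the sketch below runs full-image inference plus a 2 × 2 grid of overlapping slices and merges all detections with NMS. It is illustrative only: `detector` stands for any callable returning boxes and scores in the pixel coordinates of the image it receives, the overlap value is an assumed parameter, and the slice-resizing step of the full SAHI pipeline is omitted.

```python
import torch
from torchvision.ops import nms


def sliced_inference(image, detector, overlap=0.2, iou_thr=0.5):
    """Full-image inference plus a 2x2 grid of overlapping slices, merged by NMS.

    `image` is a (C, H, W) tensor; `detector(img)` is assumed to return
    (boxes [N, 4] in xyxy pixel coordinates of `img`, scores [N]).
    """
    _, H, W = image.shape
    slice_h = int(H / 2 * (1 + overlap))
    slice_w = int(W / 2 * (1 + overlap))

    all_boxes, all_scores = [], []

    # Full inference on the whole image keeps large, feature-rich objects.
    boxes, scores = detector(image)
    all_boxes.append(boxes)
    all_scores.append(scores)

    # Slicing-aided inference: four overlapping crops, with detections shifted
    # back into the original image coordinate frame.
    for y0 in (0, H - slice_h):
        for x0 in (0, W - slice_w):
            crop = image[:, y0:y0 + slice_h, x0:x0 + slice_w]
            b, s = detector(crop)
            b = b + torch.tensor([x0, y0, x0, y0], dtype=b.dtype)
            all_boxes.append(b)
            all_scores.append(s)

    boxes = torch.cat(all_boxes)
    scores = torch.cat(all_scores)
    keep = nms(boxes, scores, iou_thr)  # drop duplicate, lower-confidence boxes
    return boxes[keep], scores[keep]
```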

4. Experiments

4.1. Environmental setup

The experimental environment of IVP-YOLOv5 mainly includes the hardware, software, and datasets used in the experiments. The experiments use the Baidu BML Codelab training platform. The hardware and software of the computing platform are shown in Table 1.

Table 1. Configuration of the experimental environment of the computing platform.

In order to verify the performance of IVP-YOLOv5, an ablation experiment was performed on the general detection dataset PASCAL VOC (Zhang et al., 2022). Considering that IVP-YOLOv5 needs to be applied to traffic road scenes, it was also tested on the BDD100K (Zhang et al., 2022) and Nuscenes (Caesar et al., 2020) datasets. The split ratio of the training set and test set is 8:2, and the split of the datasets containing only pedestrians is shown in Table 2.

Table 2. Classification results for datasets containing pedestrians.

The scales of pedestrians in the datasets are inconsistent. Therefore, we apply the absolute-scale definition of objects in the MS COCO dataset to classify pedestrians of different scales. The proportions of pedestrians at different scales in the three datasets are shown in Table 3. The images in PASCAL VOC mainly originate from the Internet, and the image resolutions vary, so the absolute-scale measure of the object is appropriate. There are more large- and medium-scale pedestrians than small-scale ones in the PASCAL VOC dataset; the scale of pedestrians is relatively uniform, with no extreme cases of very large or very small pedestrians. The BDD100K and Nuscenes datasets mainly contain real-world traffic road scenes, where pedestrians more often appear at medium and small scales in the images, which is more challenging for the detector (Zhu et al., 2022).

Table 3. Statistics on the proportion of pedestrian objects in different scales in the dataset.

4.2. Ablation experiment

Ablation experiments were carried out on the PASCAL VOC dataset to test the effect of each improvement on the detector. In the experiments, YOLOv5s is denoted as v5s and the Ghost-BottleNeck module as GhB; v5s + GhB + αIoU + SAHI is the IVP-YOLOv5 proposed in this paper. The results of the ablation experiments are shown in Figure 5. In the early stage of training, the AP and precision values already exceed 0.8, mainly because a pre-trained model is used when training v5s. After GhB is added to v5s, the AP and precision values increase to a certain extent compared with v5s, and the loss value decreases slightly and gradually converges, which indicates that adding the GhB module improves the performance of the network model. After αIoU is added to v5s + GhB, the AP and precision increase significantly as the epochs increase, and the loss also becomes smaller and gradually converges, which shows that adding αIoU improves the performance of the network model and makes the loss easier to converge. The SAHI inference strategy is an improvement of the detection inference process and does not involve training of the network model, so it is not reflected in Figure 5.

Figure 5. Ablation experiment results on PASCAL VOC dataset.


The test results on the PASCAL VOC dataset are shown in Table 4. After adding the GhB module to YOLOv5s, the AP is increased by 0.8%, the GFLOPs and the number of parameters are decreased by 5.5 and 2.2M respectively, and the per-image detection time is reduced by 2 ms. The results show that the GhB module, which generates part of the feature maps through linear transformation, can significantly reduce the number of parameters and the computation of the network model. Replacing the bounding box loss function GIoU in v5s + GhB with αIoU increases the AP by 1.6%, mainly because αIoU makes the bounding box localisation of the detector more accurate and enhances its ability to resist noise. After the inference process is optimised using the SAHI strategy, the AP is increased by 3.5%. Since optimising the inference process does not involve changes to the network structure, the GFLOPs and the number of parameters do not change, but the inference time increases by 7 ms, mainly because an extra inference path is added during detection. This reduces inference efficiency to a certain extent but can still meet the real-time requirements of the environment perception system.

Table 4. The PASCAL VOC dataset test results.

The visualisation of the test results is shown in Figure 6. YOLOv5s does not detect the small-scale pedestrian in Figure 6(a) because the feature information of small-scale pedestrians in the image is insufficient. In Figure 6(b), YOLOv5s misses detections of occluded pedestrians and small-scale pedestrians because occlusion leaves insufficient pedestrian features. IVP-YOLOv5 can detect the small-scale pedestrian in Figure 6(a) and the occluded pedestrian in Figure 6(b) because it applies the SAHI strategy, which amplifies the pixel information in the image and makes it easier for the detector to capture features.

Figure 6. The visual image of the PASCAL VOC dataset test.


Based on the ablation experiments and detection results, IVP-YOLOv5 has good detection performance and significantly improves the detection of small-scale and occluded pedestrians. Although its detection speed is lower than that of YOLOv5s, it can still meet the real-time requirements of the environment perception system.

4.3. Experiment with real traffic road datasets

Object detection datasets differ greatly from actual road scenes in terms of image quality, scene environment, objects, etc. Considering that intelligent vehicles face real traffic road environments, the real traffic road scene datasets BDD100K and Nuscenes are used to test and compare different detection methods. Table 5 shows the detection performance of different detection methods on the BDD100K and Nuscenes datasets. The AP and small-scale pedestrian APs of IVP-YOLOv5 are higher than those of YOLOv5s, YOLOX-s, and PP-YOLOE-s, which shows that IVP-YOLOv5 has relatively excellent detection accuracy. In addition, IVP-YOLOv5 also has lower GFLOPs and fewer parameters than the other detectors (YOLOv5s, YOLOX-s, PP-YOLOE-s), which indicates that IVP-YOLOv5 has relatively low computational complexity and power consumption. In summary, IVP-YOLOv5 has good detection accuracy and improves the detection performance of small-scale pedestrians while remaining lightweight. Although IVP-YOLOv5 has a slower single-image detection speed than the other detectors, it can still meet real-time requirements. The reason for the slower single-image detection is that we apply SAHI inference on top of YOLOv5s: before feeding the image into the network model, a slicing-aided step divides the image into multiple sliced images, and the results of the full inference and the multiple sliced-image inferences are computed jointly in the post-processing part, both of which increase the complexity of the whole detection process. The comparison of pedestrian detection results on the BDD100K dataset under the same conditions is shown in Table 6. The AP of IVP-YOLOv5 is higher than that of the other detection methods (SCAM-YOLOv5, RetinaNet, ICBAM-CenterNet), and its GFLOPs and number of parameters are relatively low, which indicates that IVP-YOLOv5 has advanced detection performance.

Table 5. The detection performance of different detection methods on the BDD100K and Nuscenes datasets.

Table 6. The comparison of pedestrian detection results of other detection methods on the BDD100K dataset.

On the BDD100K dataset, compared with IVP-YOLOv5, YOLOv7 has 0.7% higher AP and 0.4% lower APs. On the Nuscenes dataset, YOLOv7 has 0.5% higher AP and 0.2% lower APs than IVP-YOLOv5. This shows that the overall detection accuracy of YOLOv7 is better than that of IVP-YOLOv5, but its detection of small-scale pedestrians is slightly worse. The main reason is that the large- and medium-scale pedestrian features extracted by YOLOv7 through its backbone network are relatively rich, with many detailed features still usable after dimensionality reduction, whereas the backbone network extracts fewer features of small-scale pedestrians, and after dimensionality reduction fewer detailed features remain available. In addition, in the traffic road datasets, medium- and small-scale pedestrians account for most of the total number of pedestrians. Therefore, YOLOv7 shows good detection results for large- and medium-scale pedestrians but struggles with small-scale pedestrian detection. The number of parameters, the GFLOPs, and the per-image detection time of IVP-YOLOv5 are 65.5M, 249.5, and 5 ms lower than those of YOLOv7, which shows that IVP-YOLOv5 has lower computational complexity and power consumption and faster detection. From the application point of view, when the detection performance is similar, lower complexity and power consumption mean higher detection efficiency and a better detector. Since the intelligent vehicle application platform has strict requirements on detection efficiency and computational complexity, IVP-YOLOv5 is better suited than YOLOv7.

A comparison of the detection effect before and after improvement is shown in Figure 7. YOLOv5s does not detect the small-scale pedestrians at a greater distance in Figure 7(a) and has low confidence for the small-scale pedestrians in the backlit scene on the left side of the image (detection confidences of 0.50 and 0.37, respectively). The missed detection of distant small-scale pedestrians is mainly due to insufficient small-scale pedestrian feature information, as well as the fact that the right side of the road ahead is crowded with pedestrians and the occlusion between people blocks some of the pedestrian features. The low confidence of pedestrian detection in the backlit scene is due to the small scale of the pedestrians there and the insufficient lighting, which makes the pedestrian characteristics less obvious. IVP-YOLOv5 can detect the small-scale pedestrians in the distance, and the detection confidence for small-scale pedestrians in the backlit scene is significantly improved (detection confidences of 0.73 and 0.53, respectively). In Figure 7(b), the scene is cloudy; YOLOv5s can detect the motion-blurred pedestrian, but the confidence is only 0.28, and small-scale pedestrians are missed. IVP-YOLOv5 has a confidence of 0.74 for the motion-blurred pedestrian and can detect small-scale pedestrians at long distances. According to the experimental results of IVP-YOLOv5 on the BDD100K and Nuscenes datasets, IVP-YOLOv5 can improve the accuracy of pedestrian detection in traffic road scenes, especially for small-scale pedestrians, and is also robust when detecting motion-blurred pedestrians and pedestrians in backlit scenes.

Figure 7. Comparison chart of detection effect before and after improvement.


IVP-YOLOv5 was tested using video data collected from a campus traffic road scenario to verify the detection effect in a real traffic road scenario. The test results are shown in Figure 8. In Figure 8(a), pedestrians appear in different postures: crouching, standing, and sitting in a car. IVP-YOLOv5 can detect pedestrians with different postures and movements, and the detection confidence for standing pedestrians is significantly higher than that for crouching pedestrians. The pedestrians in Figure 8(b) are obscured by the green belt, but IVP-YOLOv5 is still able to detect them with good confidence scores; the pedestrians in the backlit scene in Figure 8(c) also show good detection results, and pedestrians at different scales in Figure 8(d)-(f) also achieve relatively good detection confidence scores. The test results from the campus traffic road scene data validate that IVP-YOLOv5 detects pedestrians at different scales well and is robust in detecting pedestrians with different postures, pedestrians in backlit scenes, and partially obscured pedestrians.

Figure 8. Real campus traffic road scenario detection results.


4.4. Real vehicle road test

IVP-YOLOv5 is converted into a TensorRT (Chen et al., 2022) inference engine and deployed on an intelligent vehicle experimental platform to verify its detection effect on real roads. Figure 9 shows the intelligent vehicle experimental platform. The platform is equipped with an industrial computer, a human interaction panel available for development, a Hikvision monocular camera, and other hardware facilities. The software environment for the experiment is Ubuntu 18.04, Python 3.6, PyTorch 1.7, CUDA 10.2, and TensorRT 7.1. The test scenario was a campus traffic road, the test duration was about 20 min, and the vehicle speed was about 15 km/h. The vehicle drives according to the campus traffic road rules, and the test results are shown in Figure 10. All images were randomly captured while driving; pedestrians at different scales in the traffic road scenario could be detected, and the real-time frame rate during detection was maintained at 44 FPS, meeting the requirements of an intelligent vehicle environment perception system. According to the pedestrian detection test on campus traffic roads, IVP-YOLOv5 maintains good real-time performance and has excellent detection results. This is not only conducive to improving intelligent vehicle environment perception systems, but also has great significance for research and applications in intelligent monitoring, UAV (unmanned aerial vehicle) detection and identification, industrial inspection, and other fields.
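The deployment path can be sketched roughly as follows (a simplified illustration of one common export route rather than the exact on-vehicle toolchain; the checkpoint name, input resolution, and the assumption that the checkpoint stores a full module are placeholders).

```python
import torch

# Assumed: the checkpoint stores the full trained IVP-YOLOv5 module.
model = torch.load("ivp_yolov5.pt", map_location="cpu").eval()
dummy = torch.zeros(1, 3, 640, 640)  # placeholder input resolution

# Step 1: export the network to ONNX.
torch.onnx.export(model, dummy, "ivp_yolov5.onnx", opset_version=11,
                  input_names=["images"], output_names=["preds"])

# Step 2: build a TensorRT engine from the ONNX file, e.g. with the trtexec
# tool shipped with TensorRT:
#   trtexec --onnx=ivp_yolov5.onnx --saveEngine=ivp_yolov5.engine --fp16
# The resulting engine is then loaded by the on-vehicle inference code.
```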

Figure 9. Intelligent vehicle experiment platform and camera location.


Figure 10. Intelligent vehicle test results.


5. Conclusion

In this paper, we propose a YOLOv5s-based intelligent vehicle-pedestrian detection method (IVP-YOLOv5). The computational complexity of the network model is reduced by replacing the BottleNeck CSP module with the Ghost-BottleNeck module. Meanwhile, Alpha-IoU and SAHI are applied to improve the accuracy of the pedestrian detection method. The experimental results on the BDD100K dataset show that the GFLOPs and the number of parameters of IVP-YOLOv5 are 10.5 and 4.9M, respectively, and the AP of pedestrian detection reaches 67.1%, achieving very competitive performance.

According to the ablation experiment results, applying the SAHI strategy on top of the improved method significantly improves the detection effect but reduces inference efficiency. In future work, we will therefore focus on parameter tuning and optimisation acceleration of SAHI to improve inference efficiency.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

Research supported by the Natural Science Foundation of Hebei Province, China [No.F2021402011].

References

  • Ahmad, M., Ahmed, I., & Jeon, G. (2021). An IoT-enabled real-time overhead view person detection system based on cascade-RCNN and transfer learning. Journal of Real-Time Image Processing, 18(4), 1129–1139. https://doi.org/10.1007/s11554-021-01103-0
  • Akyon, F. C., Altinuc, S. O., & Temizel, A. (2022). Slicing Aided Hyper Inference and Fine-tuning for Small Object Detection. arXiv preprint arXiv:2202.06934.
  • Arkin, E., Yadikar, N., Muhtar, Y., & Ubul, K. (2021). A survey of object detection based on CNN and transformer. 2021 IEEE 2nd international conference on pattern recognition and machine learning (PRML) (pp. 99-108), IEEE.
  • Bhavya Sree, B., Yashwanth Bharadwaj, V., & Neelima, N. (2021). Smart innovation, systems and technologies. In A. Reddy, D. Marla, M. N. Favorskaya & S. C. Satapathy (Eds.), Intelligent manufacturing and energy sustainability, 213 (pp. 475–483). Springer.
  • Caesar, H., Bankiti, V., Lang, A. H., Vora, S., Liong, V. E., Xu, Q., Krishnan, A., Pan, Y., Baldan, J., & Beijbom, O. (2020). Nuscenes: A multimodal dataset for autonomous driving. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11621-11631), CVPR.
  • Cai, Y., Li, H., Yuan, G., Niu, W., Li, Y., Tang, X., Ren, B., & Wang, Y. (2021, May). Yolobile: Real-time object detection on mobile devices via compression-compilation co-design. Proceedings of the AAAI conference on artificial intelligence (Vol. 35, pp. 955-963), AAAI.
  • Chen, Y., Yang, J., Wang, J., Zhou, X., Zou, J., & Li, Y. (2022, September). An improved YOLOv5 real-time detection method for aircraft target detection. 2022 27th international conference on automation and computing (ICAC) (pp. 1-6), IEEE.
  • Deng, C., Wang, M., Liu, L., Liu, Y., & Jiang, Y. (2022). Extended feature pyramid network for small object detection. IEEE Transactions on Multimedia, 24, 1968–1979. https://doi.org/10.1109/TMM.2021.3074273
  • Dewi, C., Chen, R. C., Jiang, X., & Yu, H. (2022). Deep convolutional neural network for enhancing traffic sign recognition developed on Yolo V4. Multimedia Tools and Applications, 81(26), 37821–37845. https://doi.org/10.1007/s11042-022-12962-5
  • Ge, Z., Liu, S., Wang, F., Li, Z., & Sun, J. (2021). Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430.
  • Han, K., Wang, Y., Tian, Q., Guo, J., Xu, C., & Xu, C. (2020). Ghostnet: More features from cheap operations. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 1580-1589), CVPR.
  • Haris, M., & Glowacz, A. (2021). Road object detection: A comparative study of deep learning-based algorithms. Electronics, 10(16), 1932. https://doi.org/10.3390/electronics10161932
  • He, J., Erfani, S., Ma, X., Bailey, J., Chi, Y., & Hua, X. S. (2021). Alpha-IoU: A family of power intersection over union losses for bounding box regression. Advances in Neural Information Processing Systems, 34, 20230–20242.
  • Hong, Q., Liu, F., Li, D., Liu, J., Tian, L., & Shan, Y. (2022). Dynamic Sparse R-CNN. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 4723-4732), CVPR.
  • Hurtik, P., Molek, V., Hula, J., Vajgl, M., Vlasanek, P., & Nejezchleba, T. (2022). Poly-YOLO: Higher speed, more precise detection and instance segmentation for YOLOv3. Neural Computing and Applications, 34(10), 8275–8290. https://doi.org/10.1007/s00521-021-05978-9
  • Jian, W., & Lang, L. (2021, March). Face mask detection based on transfer learning and PP-YOLO. 2021 IEEE 2nd international conference on Big data, artificial intelligence and internet of things engineering (ICBAIE) (pp. 106-109), IEEE.
  • Joseph, E. C., Bamisile, O., Ugochi, N., Zhen, Q., Ilakoze, N., & Ijeoma, C. (2021). Systematic advancement of YOLO object detector for real-time detection of objects. 2021 18th international computer conference on wavelet active media technology and information processing (ICCWAMTIP) (pp. 279-284), IEEE.
  • Li, B., Xiao, C., Wang, L., Wang, Y., Lin, Z., Li, M., An, W., & Guo, Y. (2022a). Dense nested attention network for infrared small target detection. IEEE Transactions on Image Processing, 14(8), 1–1. https://doi.org/10.1109/TIP.2022.3199107
  • Li, G., Fan, W., Xie, H., & Qu, X. (2022b). Detection of road objects based on camera sensors for autonomous driving in various traffic situations. IEEE Sensors Journal, 22(24), 24253–24263. https://doi.org/10.1109/JSEN.2022.3219884
  • Liu, Y., Sun, P., Wergeles, N., & Shang, Y. (2021). A survey and performance evaluation of deep learning methods for small object detection. Expert Systems with Applications, 172, 114602. https://doi.org/10.1016/j.eswa.2021.114602
  • Lu, X., Li, B., Yue, Y., Li, Q., & Yan, J. (2019a). Grid R-CNN. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 7363-7372), CVPR.
  • Lu, X., Li, B., Yue, Y., Li, Q., & Yan, J. (2019b). Grid R-CNN plus: Faster and better. ArXiv, abs/1906.05688.
  • Lv, H., Yan, H., Liu, K., Zhou, Z., & Jing, J. (2022). YOLOv5-ac: Attention mechanism-based lightweight YOLOv5 for track pedestrian detection. Sensors, 22(15), 5903. https://doi.org/10.3390/s22155903
  • Maity, M., Banerjee, S., & Chaudhuri, S. S. (2021). Faster R-CNN and YOLO based vehicle detection: A survey. 2021 5th international conference on computing methodologies and communication (ICCMC) (pp. 1442-1447), IEEE.
  • Pal, S. K., Pramanik, A., Maiti, J., & Mitra, P. (2021). Deep learning in multi-object detection and tracking: State of the art. Applied Intelligence, 51(9), 6400–6429. https://doi.org/10.1007/s10489-021-02293-7
  • Parekh, D., Poddar, N., Rajpurkar, A., Chahal, M., Kumar, N., Joshi, G. P., & Cho, W. (2022). A review on autonomous vehicles: Progress, methods and challenges. Electronics, 11(14), 2162. https://doi.org/10.3390/electronics11142162
  • Patel, H., Prajapati, K., Sarvaiya, A., Upla, K., Raja, K., Ramachandra, R., & Busch, C. (2022). Depthwise convolution for compact object detector in nighttime images. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 379-389), CVPR.
  • Run-Hua, H., Luo, Q., Zhijian, Y., Yang, W., Guotong, L., & Jianfan, L. (2022). SCAM-YOLOv5: Improved YOLOv5 based on spatial and channel attention module. International conference on computer engineering and networks (pp. 1001–1008), Springer.
  • Song, Y., Wang, L., Wang, H., & Li, M. (2022). Lane detection based on IBN deep neural network and attention. Connection Science, 34(1), 2671–2688. https://doi.org/10.1080/09540091.2022.2139352
  • Su, B., Zhang, H., Wu, Z., & Zhou, Z. (2022). FSRDD: An efficient few-shot detector for rare city road damage detection. IEEE Transactions on Intelligent Transportation Systems, 23(12), 24379–24388. https://doi.org/10.1109/TITS.2022.3208188
  • Sun, P., Zhang, R., Jiang, Y., Kong, T., Xu, C., Zhan, W., Tomizuka, M., Li, L., Yuan, Z., Wang, C., & Luo, P. (2021). Sparse R-CNN: End-to-end object detection with learnable proposals. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 14454-14463), CVPR.
  • Sun, W., Dai, L., Zhang, X., Chang, P., & He, X. (2022). RSOD: Real-time small object detection algorithm in UAV-based traffic monitoring. Applied Intelligence, 52(8), 8448–8463. https://doi.org/10.1007/s10489-021-02893-3
  • Tian, G., Liu, J., Zhao, H., & Yang, W. (2022). Small object detection via dual inspection mechanism for UAV visual images. Applied Intelligence, 52(4), 4244–4257. https://doi.org/10.1007/s10489-021-02512-1
  • Wang, C. Y., Bochkovskiy, A., & Liao, H. Y. M. (2022a). YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv preprint arXiv:2207.02696.
  • Wang, C. Y., Yeh, I. H., & Liao, H. Y. M. (2021). You only learn one representation: Unified network for multiple tasks. arXiv preprint arXiv:2105.04206.
  • Wang, J., Chen, Y., Dong, Z., & Gao, M. (2022b). Improved YOLOv5 network for real-time multi-scale traffic sign detection. Neural Computing and Applications, 1–13. https://doi.org/10.1007/s00521-022-08077-5
  • Westhofen, L., Neurohr, C., Koopmann, T., Butz, M., Schütt, B., Utesch, F., Neurohr, B., Gutenkunst, C., & Böde, E. (2023). Criticality metrics for automated driving: A review and suitability analysis of the state of the art. Archives of Computational Methods in Engineering, 1–35. https://doi.org/10.1007/s11831-022-09788-7
  • Xu, S., Wang, X., Lv, W., Chang, Q., Cui, C., Deng, K., Wang, G., Dang, Q., Wei, S., Du, Y., & Lai, B. (2022). PP-YOLOE: An evolved version of YOLO. arXiv preprint arXiv:2203.16250.
  • Yang, C., Huang, Z., & Wang, N. (2022). QueryDet: Cascaded sparse query for accelerating high-resolution small object detection. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 13668-13677), CVPR.
  • Ye, T., Qin, W., Li, Y., Wang, S., Zhang, J., & Zhao, Z. (2022). Dense and small object detection in UAV-vision based on a global-local feature enhanced network. IEEE Transactions on Instrumentation and Measurement, 71, 1–13. https://doi.org/10.1109/TIM.2022.3196319
  • Zablocki, É, Ben-Younes, H., Pérez, P., & Cord, M. (2022). Explainability of deep vision-based autonomous driving systems: Review and challenges. International Journal of Computer Vision, 130(10), 2425–2452. https://doi.org/10.1007/s11263-022-01657-x
  • Zaidi, S. S. A., Ansari, M. S., Aslam, A., Kanwal, N., Asghar, M., & Lee, B. (2022). A survey of modern deep learning based object detection models. Digital Signal Processing, 126, 103514. https://doi.org/10.1016/j.dsp.2022.103514
  • Zhang, H., Xiao, L., Cao, X., & Foroosh, H. (2022a). Multiple adverse weather conditions adaptation for object detection via causal intervention. IEEE Transactions on Pattern Analysis and Machine Intelligence. https://doi.org/10.1109/TPAMI.2022.3166765
  • Zhang, W., Cheng, T., Wang, X., Zhang, Q., & Liu, W. (2022b). Featurized Query R-CNN. arXiv preprint arXiv:2206.06258.
  • Zhang, X., Zhao, C., Luo, H., Zhao, W., Zhong, S., Tang, L., Peng, J., & Fan, J. (2022c). Automatic learning for object detection. Neurocomputing, 484, 260–272. https://doi.org/10.1016/j.neucom.2022.02.012
  • Zhang, Y., Guo, Z., Wu, J., Tian, Y., Tang, H., & Guo, X. (2022d). Real-time vehicle detection based on improved YOLOv5. Sustainability, 14(19), 12274. https://doi.org/10.3390/su141912274
  • Zhao, Y., Shi, Y., & Wang, Z. (2022). The improved YOLOv5 algorithm and its application in small target detection. In international conference on intelligent robotics and applications (pp. 679–688), Springer.
  • Zhu, Y., Xia, Q., & Jian, W. (2022). SRDD: A lightweight end-to-end object detection with transformer. Connection Science, 34(1), 2448–2465. doi:10.1080/09540091.2022.2125499