Research Article

A feature enhancement FCOS algorithm for dynamic traffic object detection

Article: 2321345 | Received 26 Oct 2023, Accepted 15 Feb 2024, Published online: 01 Mar 2024

Abstract

The development of object detection plays an important role in realising fully autonomous driving, and feature extraction is the key step in object detection. Object features of different road traffic participants (RTPs) differ significantly and vary widely in scale, and traditional Convolutional Neural Networks (CNNs) struggle to extract object features efficiently for small targets. To improve feature extraction, an RTP object detection method combining dynamic convolution and feature enhancement is proposed, with the Fully Convolutional One-Stage (FCOS) object detection algorithm as the baseline. First, a dynamic convolution module is designed in the backbone network to identify the features of different objects to the maximum extent. Second, a dual attention module is designed to filter object feature information while reducing the amount of computation. Finally, in the detection part, the feature expression ability of the shallow network is further enhanced by a multi-scale feature fusion module, and the effectiveness of the proposed algorithm is verified on the Cityscapes dataset. The experimental results indicate that mAP increased by 2.3% compared with the baseline. This study can improve the efficiency of RTP detection and contribute to the industrialisation of intelligent connected vehicles.

1. Introduction

In recent years, with the continuous improvement of deep learning techniques, massive efforts have been made in various fields such as computer vision (CV), natural language processing (NLP), risk management, and recommender systems. Object detection, as one of the major tasks in computer vision, plays a vital role in environment perception (Zhou et al., Citation2023), traffic regulation (Yan et al., Citation2023), and safety guarantees for intelligent connected vehicles. Along with continuous improvements in camera performance and computational power, object detection algorithms have developed rapidly.

Current object detection algorithms can be mainly categorised into two types: two-stage methods based on candidate regions and single-stage methods based on regression. Representative two-stage methods include Faster R-CNN (Ren et al., Citation2015; Liu et al., Citation2021), Cascade R-CNN (Cai & Vasconcelos, Citation2021), and D2Det (Cao et al., Citation2020), which first select candidate regions and then classify the target regions, for example with a SoftMax classifier. Single-stage methods include YOLO (Redmon et al., Citation2016), Single Shot MultiBox Detector (SSD) (Liu et al., Citation2016), CenterNet (Rampriya et al., Citation2022), FCOS (Tian et al., Citation2019), and RetinaNet (Lin et al., Citation2017a), which extract target features through end-to-end networks and directly perform bounding box regression and classification, treating object localisation and classification as a single task. Moreover, SSD and YOLO adopted the anchor mechanism first introduced in the Faster R-CNN network, which can effectively handle objects with different scales and aspect ratios, markedly improving the utilisation of the dataset and the effectiveness of feature extraction; the Intersection-over-Union (IOU) is used to distinguish positive from negative samples. Since single-stage methods handle classification and localisation simultaneously, they achieve higher detection efficiency than two-stage methods, although coupling the two tasks can hurt detection accuracy. Nevertheless, with the rapid development of single-stage methods, their accuracy has greatly improved in recent years. Luo et al. (Citation2021) proposed a simplified dual-path feature pyramid network to improve the robustness of the model to multi-scale and partially occluded objects, but it did not perform well under dense, congested road conditions. Cao et al. (Citation2021) proposed a multi-scale fusion structure based on Sparse R-CNN, improving the ResNet backbone to enhance its multi-scale expression ability and introducing an attention mechanism for the target detection task. Song et al. (Citation2020) showed that the localisation and classification tasks of target detection focus on different regions of interest. To detect small traffic targets, Gao et al. (Citation2022) proposed an adaptive attention spatial feature fusion module to fuse the feature maps at each scale, but still used a coupled detection head that did not distinguish the regions of interest of the regression and classification tasks. Ge et al. (Citation2021) used decoupled detection heads in YOLOX to separate the classification and localisation tasks and improve the detection performance of the network.

In addition, the anchor size settings in anchor-based detection algorithms seriously affect detection accuracy and versatility. The anchor-free detection algorithm abandons prior boxes, avoiding the influence of anchor settings on model accuracy while balancing accuracy and speed and improving the versatility of the algorithm (Shi et al., Citation2022). Therefore, anchor-free algorithms have been widely applied in recent years. Law and Deng (Citation2018) proposed CornerNet by imitating the idea of human keypoint detection (Papandreou et al., Citation2018), using a pair of keypoints instead of a predicted target box. However, the ability of CornerNet to obtain global information is weak, and its corner-matching algorithm is complex. Duan et al. (Citation2019) proposed CenterNet, which builds upon CornerNet by adding centre point detection to enhance corner matching and improve recognition accuracy, but at the cost of additional computation. Tian et al. (Citation2019) proposed the fully convolutional one-stage object detection (FCOS) algorithm, which combines object detection with semantic segmentation theory and adopts the anchor-free idea. FCOS generates feature layers for regression and classification using a simple feature pyramid network (FPN) structure. However, there is still room for improvement in detecting small targets with little feature information and low resolution.

In conclusion, several challenges remain for camera perception in on-board or roadside applications: (1) due to variations in camera angles, target distributions, and imaging ranges, the scale of the same type of target varies significantly, and the background is relatively complex; (2) target types in the complex traffic environment are diverse, yet the feature extractor parameters are shared across targets, resulting in high inter-class similarity even though different target types exhibit substantial variations in appearance features; (3) it is difficult to improve the performance of RTP object detectors, particularly in scenarios with small targets, dense target distributions, and occlusions.

To address the abovementioned issues and further enhance detector performance in on-board or roadside settings, an RTP object detection method combining dynamic convolution and feature enhancement fusion is proposed, using FCOS (Tian et al., Citation2019) as the baseline model. Compared with the baseline and existing models, the proposed method achieves a significant improvement in mean average precision (mAP), which is validated through ablation experiments.

For this paper, the main contributions are as follows:

  1. We introduce a dynamic convolution module, the Dy-Conv module, in the backbone network, which incorporates input-dependent convolution kernels into the feature extractor for different object types, thus alleviating the difficulty of distinguishing samples of different classes.

  2. We propose a dual-attention module for filtering object feature information while reducing the amount of computation. The module utilises spatial attention and channel attention to suppress background noise and highlight key features.

  3. We propose a multi-scale feature fusion module to further improve the feature representation of the shallow network, and we validate the effectiveness of the proposed algorithm on the Cityscapes dataset and the MS-COCO benchmark.

2. Literature review

2.1. Single-stage methods based on regression

Single-stage methods based on regression are a type of object detection algorithm that uses a single network to predict both the presence and location of objects in an image. These methods involve regressing bounding box coordinates and class probabilities directly from image pixels. Examples of single-stage methods based on regression include YOLO (You Only Look Once), RetinaNet, and SSD (Single Shot Detector). These methods have the advantage of being fast and efficient, as they only require a single pass through the network to make predictions.

The YOLOv4 network was proposed by Bochkovskiy et al. (Citation2020) as a fast target detection method based on single-stage regression. Gašparović et al. (Citation2022) trained and tested six different deep learning CNN detectors, including five YOLO-based architectures (YOLOv4, YOLOv4-Tiny, CSP-YOLOv4, YOLOv4@Resnet, YOLOv4@DenseNet) and one based on the Faster Region-based CNN (R-CNN) architecture. The results showed that YOLOv4 was superior to the other models in underwater pipeline target detection. RetinaNet, proposed by Lin et al. (Citation2017a), is another target detection method based on single-stage regression. It used a new loss function called Focal Loss, which addresses the class imbalance problem encountered in dense target detection, and the experiments showed excellent results in instance segmentation and human pose estimation. An action trajectory recognition algorithm based on improved EfficientDet was proposed by Liang (Citation2023), adding a spatial attention mechanism to the backbone network; compared with the traditional algorithm, it had higher accuracy and stronger robustness. Li et al. (Citation2023) devised an efficient gated convolutional recurrent network (GCR-Net) with residual learning to dynamically extract dependency patterns of raw genomic sequences in an efficient fusion strategy, successfully improving the performance of TIS prediction. A powerful Vision Transformer-based Generative Adversarial Network (Transformer-GAN) was proposed by Yang et al. (Citation2022a) for enhancing low-light images, and a spatiotemporal context-aware neural model called ACNet was devised by Guo et al. (Citation2022) for Poly(A) signal prediction based on co-occurrence embedding.

In addition, the development of single-stage methods in the field of image restoration has also advanced considerably in recent years (Chen et al., Citation2023a; Yang et al., Citation2023; Chen et al., Citation2023b). An effective image inpainting method using generative adversarial networks was proposed by Chen et al. (Citation2023c), composed of two mutually independent generative adversarial networks. Chen et al. (Citation2023d) proposed an improved two-stage image inpainting network based on a parallel network and contextual attention. However, single-stage detection methods may suffer from lower accuracy compared to two-stage methods, because they use a simpler network architecture and rely solely on regression to predict object locations, without explicitly considering object proposals or using region-based features. Overall, single-stage methods based on regression are a popular choice for real-time object detection applications where speed is critical, but they may not be suitable for tasks that require high-precision localisation or the detection of small objects.

2.2. Two-stage methods based on region proposals

The two-stage method based on region proposals is one of the most commonly used and successful approaches in the field of object detection. This kind of method consists of two stages: the first stage uses an existing algorithm to generate several candidate boxes, namely "region proposals"; the second stage uses a convolutional neural network to classify and locate these candidate boxes. The two-stage method based on region proposals achieves higher accuracy than other methods, but its speed is relatively slow. Therefore, in recent years, there have been many improvements to the method, such as Fast R-CNN, Faster R-CNN, and Mask R-CNN. Fast R-CNN is an improved version of R-CNN: it introduces a Region of Interest Pooling layer on the basis of R-CNN, which avoids separate convolution operations on each candidate box, greatly improving speed while also slightly improving accuracy over R-CNN (Girshick, Citation2015). An object detection framework called Faster R-CNN was proposed by Ren et al. (Citation2015), a method based on two-stage region proposals. It used a sub-network named the Region Proposal Network (RPN) to directly generate candidate boxes, which greatly reduced the computational complexity and achieved real-time detection. In addition, the method introduced the anchor mechanism, which better adapts to target detection at different scales and aspect ratios. He et al. (Citation2017) proposed a target detection and instance segmentation framework named Mask R-CNN, also based on two-stage region proposals. This method added a segmentation head on the basis of Faster R-CNN, which segments the detected objects at the pixel level and improves detection accuracy. These improvements mainly increase the speed and accuracy of the method by optimising the model structure or adopting newer techniques. Chen et al. (Citation2023e) proposed an image restoration method combining Semantic Priors and a Deep Attention Residual Group; it mainly consists of a Semantic Priors Network, a Deep Attention Residual Group, and a Full-scale Skip Connection. A target detection framework called Cascade R-CNN was proposed by Cai and Vasconcelos (Citation2018), also based on two-stage region proposals; it added a cascade of classifiers on top of Faster R-CNN, improving model accuracy through cascaded refinement. A lane detection algorithm based on instance segmentation was proposed by Yang et al. (Citation2022b): based on BiSeNet V2, a two-branch neural network model for lane line image segmentation was designed. In general, the two-stage method based on region proposals has been widely used in the field of target detection and is still developing and improving.

2.3. RTP object detection

Target detection for road traffic participants (RTPs), which primarily include motor vehicles, non-motorised vehicles, and pedestrians, is a crucial aspect of perception for intelligent connected vehicles. Whether the perception module is on-board or roadside, cameras are essential for detecting the surrounding environment and RTPs in traffic scenarios. Therefore, researchers have conducted extensive studies on RTP target detection algorithms for on-board or roadside applications, yielding significant achievements. The SSD (Liu et al., Citation2016) detector utilises multi-scale features to better capture the characteristics of large-scale targets but performs less effectively on small targets. Lin et al. (Citation2017b) proposed the Feature Pyramid Network (FPN) module, which integrated multi-scale features and introduced context-related information to improve model performance on both large and small targets. For small-scale target detection, Liu and Huang (Citation2018) introduced the RefineNet module, which further extracted context-aware multi-scale features and achieved better model performance. Addressing the issue of imbalanced positive and negative samples in detection datasets, Hu et al. (Citation2018) proposed the Focal Loss function. To handle the significant background noise in the target environment, Zhong et al. (Citation2020) employed attention mechanisms to enhance key target features, enabling the feature extractor to capture more effective feature channels or regions and suppress background noise. Vu et al. (Citation2019) proposed the Cascade Region Proposal Network (Cascade RPN) mechanism to address the problem of excessive negative examples at the sampling level and achieve a high Average Recall (AR). Furthermore, Wang and Zhang (Citation2020) introduced the TPNet module, which strengthened the connection between high-level and low-level features, optimising multi-scale object detectors. For target detection in complex traffic environments, Song-tao and Shi-ru (Citation2018) introduced background suppression and spatiotemporal constraints into traditional interest point detection algorithms, reducing interference from irrelevant information and improving the real-time performance of traffic object detection. In Cascade R-CNN, Hai et al. (Citation2020) introduced the FPN module, which integrated shallow and deep feature information and utilised the Generalised Intersection over Union (GIOU) loss to enhance the accuracy of traffic sign detection. Chen et al. (Citation2018) addressed the ineffective differentiation between pedestrians and cyclists in existing detection methods by proposing multiple improvement strategies, including hard example mining, multi-level feature fusion, and the input of multiple object candidate regions, achieving joint detection of pedestrians and cyclists. Liang (Citation2023) improved the EfficientDet algorithm by adding a spatial attention mechanism to the backbone network, which improved accuracy and robustness. However, there is still considerable room for improvement in RTP object detection performance in the context of intelligent transportation. How to adopt attention mechanisms and their variants (Redmon & Farhadi, Citation2018) together with multi-scale information (Yuan et al., Citation2021) for better detector performance remains a topic requiring further research.

3. FCOS object detection algorithm

The network architecture of FCOS is shown in Figure 1. C3, C4, and C5 denote the feature maps of the backbone network, and P3 to P7 are the feature levels used for the final prediction. H × W denotes the height and width of the feature maps, and "/s" (s = 8, 16, … , 128) is the downsampling ratio of the feature maps at each level relative to the input image.

Figure 1. Network architecture of FCOS.


The FCOS (Fully Convolutional One-Stage Object Detection) algorithm utilises the ResNet50 backbone network for feature extraction, as shown in Table 1. It adopts a feature pyramid structure to achieve object classification and localisation, making it a relatively recent object detection algorithm. Compared to the two-stage networks of the Faster R-CNN series, FCOS is a one-stage detection architecture: like other regression-based detectors, it extracts target features through an end-to-end network and directly performs bounding box regression and classification, treating the localisation and classification tasks, which are separate in two-stage methods, as a single task.

Table 1. Specific structure of the feature extraction network.

In comparison to the SSD (Liu et al., Citation2016; Tang et al., Citation2019) and YOLO (Redmon et al., Citation2016) series of object detection algorithms, the FCOS algorithm, although belonging to the one-stage approach like the latter two, eliminates the need for predefined anchor boxes. Instead, it relies solely on the fully convolutional architecture of ResNet, FPN, and the Head modules. This effectively reduces the computational complexity associated with anchor box computations.

In the FCOS network, the Feature Pyramid Network (FPN) scales the feature maps to different resolutions. The shallow feature maps contain more information about small objects, while the deep feature maps preserve more information about larger objects. The FPN structure integrates higher-level semantic information with lower-level positional information, which helps to represent the object features more effectively.

Since FCOS performs object detection in a per-pixel prediction manner, it requires the division of all pixels into positive and negative samples. The multi-level prediction method used in the FCOS network detects objects of different scales in different feature layers, which effectively separates ambiguous samples.

FCOS is a popular anchor-free detection model containing three branches: classification, centerness, and regression. The classification branch employs a point classification loss function, the centerness branch utilises a centerness loss function that measures whether a point lies near the target's centre, and the regression branch handles target location regression.
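As a rough illustration of this three-branch structure (a sketch rather than the authors' exact configuration; the channel width, the use of GroupNorm, and the number of stacked convolutions are assumptions), a minimal per-level FCOS-style head in PyTorch might look as follows:

```python
import torch
import torch.nn as nn


class FCOSHead(nn.Module):
    """Minimal FCOS-style head: classification, centerness, and regression branches."""

    def __init__(self, in_channels=256, num_classes=7, num_convs=4):
        super().__init__()

        def tower():
            layers = []
            for _ in range(num_convs):
                layers += [nn.Conv2d(in_channels, in_channels, 3, padding=1),
                           nn.GroupNorm(32, in_channels),
                           nn.ReLU(inplace=True)]
            return nn.Sequential(*layers)

        self.cls_tower = tower()
        self.reg_tower = tower()
        self.cls_logits = nn.Conv2d(in_channels, num_classes, 3, padding=1)  # per-pixel class scores
        self.centerness = nn.Conv2d(in_channels, 1, 3, padding=1)            # per-pixel centerness
        self.bbox_reg = nn.Conv2d(in_channels, 4, 3, padding=1)              # per-pixel (l, t, r, b)

    def forward(self, feature):
        cls_feat = self.cls_tower(feature)
        reg_feat = self.reg_tower(feature)
        # Distances to the box sides must be non-negative, hence the ReLU on the regression output.
        return (self.cls_logits(cls_feat),
                self.centerness(cls_feat),
                torch.relu(self.bbox_reg(reg_feat)))


# Example: one FPN level with a 256-channel feature map.
cls, ctr, reg = FCOSHead()(torch.randn(1, 256, 100, 128))
```

The same head is shared across all pyramid levels, which is what makes the per-pixel formulation of the losses below straightforward.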

Point classification loss function for the feature map: (1) \( L(p_{x,y}) = \frac{1}{N_{pos}} \sum_{x,y} L_{cls}\left(p_{x,y}, c^{*}_{x,y}\right) \) where \( L_{cls} \) is the focal loss; \( N_{pos} \) is the number of positive samples; \( p_{x,y} \) is the classification score at point (x, y) on the feature map; \( c^{*}_{x,y} \) is the ground-truth class of point (x, y).

Centerness loss function for the feature map: (2) \( L(c_{x,y}) = \frac{1}{N_{pos}} \sum_{x,y} \mathbb{1}_{\{c^{*}_{x,y} > 0\}} L_{cls}\left(c_{x,y}, c^{*}_{x,y}\right) \) where \( L_{cls} \) is the cross-entropy loss; \( c_{x,y} \) is the predicted centerness score at point (x, y) on the feature map; \( c^{*}_{x,y} \) is the ground-truth centerness of point (x, y).

Regression loss function for the feature map: (3) \( L(t_{x,y}) = \frac{1}{N_{pos}} \sum_{x,y} \mathbb{1}_{\{c^{*}_{x,y} > 0\}} L_{reg}\left(t_{x,y}, t^{*}_{x,y}\right) \) where \( L_{reg} \) is the IOU loss; \( t_{x,y} \) is the regression result at point (x, y) on the feature map; \( t^{*}_{x,y} \) is the ground-truth regression target of point (x, y); \( \mathbb{1}_{\{c^{*}_{x,y} > 0\}} \) is the indicator function, equal to 1 if \( c^{*}_{x,y} > 0 \) and 0 otherwise.

The total loss function in FCOS is as follows: (4) \( L(p_{x,y}, t_{x,y}, c_{x,y}) = L(p_{x,y}) + L(t_{x,y}) + L(c_{x,y}) \) In the detection part of FCOS, a Center-ness branch is added to regress the parameters l*, t*, r*, and b*, as shown in Figure 2, in order to calculate the centrality expressed by Center-ness. If a location (x, y) falls into any ground-truth (GT) box, it is treated as a positive sample, and the distances from this point to the four sides of the bounding box, l*, t*, r*, and b* (Figure 2), are regressed as follows: (5) \( l^{*} = x - x_{0}^{(i)}, \quad t^{*} = y - y_{0}^{(i)}, \quad r^{*} = x_{1}^{(i)} - x, \quad b^{*} = y_{1}^{(i)} - y \)

Figure 2. Meanings of the parameters in the Center-ness branch.


The definition of Center-ness is given in Equation (6); it reduces the weight of predicted bounding boxes that are far from the true centre of the object, effectively suppressing low-quality predictions. (6) \( \text{Centerness}^{*} = \sqrt{\frac{\min(l^{*}, r^{*})}{\max(l^{*}, r^{*})} \times \frac{\min(t^{*}, b^{*})}{\max(t^{*}, b^{*})}} \) The FCOS detector, similar to semantic segmentation, innovatively solves the object detection problem through per-pixel prediction. It requires only a single post-processing operation, non-maximum suppression (NMS), to remove low-quality predictions and improve detector performance. The network is simple yet achieves good accuracy. The algorithm in this paper is therefore built as an improvement of FCOS.
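To make Equations (5) and (6) concrete, the following sketch computes the regression targets and the centerness for a single location inside a ground-truth box; it is an illustration of the definitions above, not the training code used in the paper:

```python
import math


def fcos_targets(x, y, box):
    """Regression targets (l*, t*, r*, b*) and centerness for a location (x, y) inside box = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    l, t = x - x0, y - y0            # distances to the left and top sides, Eq. (5)
    r, b = x1 - x, y1 - y            # distances to the right and bottom sides
    centerness = math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))  # Eq. (6)
    return (l, t, r, b), centerness


# A location at the box centre yields centerness 1; an off-centre location yields a lower value.
print(fcos_targets(50, 40, (0, 0, 100, 80)))   # ((50, 40, 50, 40), 1.0)
print(fcos_targets(10, 10, (0, 0, 100, 80)))   # ((10, 10, 90, 70), ~0.13)
```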

4. DF-FCOS algorithm

To address the existing issues, a dynamic traffic object detection method based on dynamic convolution and feature enhancement fusion (referred to as DF-FCOS) is proposed, using FCOS as the baseline. The DF-FCOS algorithm, as shown in Figure 3, consists of three parts: the dynamic convolution module (Dy-Conv), the dual attention module (DAM), and the multi-scale feature fusion module (MSF); all the reported numbers are computed with an 800 × 1024 input.

Figure 3. Framework of the DF-FCOS algorithm.


4.1. Dynamic convolution module (Dy-Conv)

In the FCOS detector network, ResNet50 is used as the backbone for extracting target features. However, for different types of objects, the convolutional kernel parameters in the feature extraction layers are shared among all types of objects. This leads to high inter-class similarity between different types of objects, resulting in a decrease in the representation capability of the model for feature extraction of different objects.

To overcome the high inter-class similarity of the FCOS detector and fully utilise the appearance features of the different object types studied in this paper, a Dy-Conv module (Yang et al., Citation2019) is introduced, in which the convolution kernel parameters vary with the type of object. As shown in Figure 4, this module learns specific convolution kernel parameters for each class of examples by replacing the traditional convolutional layers in the original backbone, yielding a target feature extraction method based on a dynamic convolution mechanism.

Figure 4. Dy-Conv Module.


The conventional convolution, as the basic unit of CNN, assumes that the convolutional kernel parameters are shared among all samples. The convolutional kernel parameters are trained and applied uniformly to all input samples. Assuming that the objects under study are divided into three classes: motor vehicles, non-motor vehicles, and pedestrians, the traditional convolutional kernel represents an authoritative expert Conv in identifying these three domains of motor vehicle (x1), non-motor vehicle (x2), and pedestrian (x3). The input x and the output satisfy the equation: output = Conv(xi). Therefore, when receiving features from different classes of objects in the upper layers, this “absolute fairness” would cause a significant reduction in inter-class differences when representing these classes with the trained convolutional kernel parameters, despite the significant differences in their features.

In the designed Dy-Conv module of this paper, the previously mentioned three domains’ authoritative expert is replaced with the most authoritative experts within each domain: Conv1 for motor vehicle recognition, Conv2 for non-motor vehicle recognition, and Conv3 for pedestrian recognition. In the Dy-Conv module, each convolutional kernel has the same dimension as the traditional static convolutional kernel. The convolutional kernel parameters are obtained by transforming the input, described as: output = Concat(W1Conv1(xi), W2Conv2(xi), W3Conv3(xi)), where Concat represents the concatenation operation of convolutional layers.

The key aspect lies in determining the weight distributions of the three domain experts, W1, W2, and W3, meaning that the weights determined after training vary with the input type. This reflects a consensus among the three experts: the more the input resembles a particular domain, the greater the authority of that expert's decision, as shown in Figure 5.

Figure 5. Dy-Conv Decision Mechanism Illustration.


The introduction of an attention mechanism is crucial in the Dy-Conv module. The feature maps from the upper layers contain not only target feature information but also, potentially, a significant amount of background noise. As the features are compressed through deeper convolutional layers, the amount of target-related crucial information decreases. This is particularly detrimental to small targets, whose subtle features can be directly overshadowed by background noise. Therefore, irrelevant information must be suppressed to enhance the significance of target features, enabling the experts to make more accurate and reliable identifications and thereby update the weight information. The output from the previous layer is passed through a fully connected layer followed by the softmax function for exponential normalisation. The output consists of three-dimensional parameters satisfying the constraint in Equation (7) and can be described by Equation (8): (7) \( \sum_{i=1}^{3} W_{i} = 1 \) (8) \( W_{opt} = \arg_{W}\left(\mathrm{Softmax}\left(FC\left(\mathrm{ReLU}\left(FC\left(\mathrm{GAP}(x_{i})\right)\right)\right) \cdot R\right)\right) \) where R is a matrix of learned routing weights, allowing the global receptive-field context to adapt to local receptive fields and mapping the aggregated input to the n experts.
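A hedged sketch of such a dynamic convolution layer is given below. It follows the CondConv-style routing described above (GAP → FC → ReLU → FC → softmax); the number of experts, the hidden width of the router, and the use of a weighted sum rather than concatenation to combine the expert outputs are assumptions made for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DyConv2d(nn.Module):
    """CondConv-style dynamic convolution: several expert kernels mixed by input-dependent routing weights."""

    def __init__(self, in_channels, out_channels, kernel_size=3, num_experts=3, padding=1):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Conv2d(in_channels, out_channels, kernel_size, padding=padding, bias=False)
            for _ in range(num_experts)
        )
        # Routing function: GAP -> FC -> ReLU -> FC, followed by softmax so the weights sum to 1 (Eq. 7).
        self.router = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(in_channels, max(in_channels // 4, num_experts)),
            nn.ReLU(inplace=True),
            nn.Linear(max(in_channels // 4, num_experts), num_experts),
        )

    def forward(self, x):
        weights = F.softmax(self.router(x), dim=1)                       # (B, num_experts), Eq. (8)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=1)   # (B, E, C_out, H, W)
        weights = weights.view(x.size(0), -1, 1, 1, 1)
        return (weights * expert_outs).sum(dim=1)                        # input-dependent mixture of experts
```

Substituting such a layer for the static 3 × 3 convolutions of the backbone would follow the replacement strategy described in the next paragraph.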

For the ResNet50 framework in the FCOS network, we replace the standard static convolutional layers in the backbone with Dy-Conv modules to establish a Dy-Conv-based target feature extraction method.

4.2. Dual attention mechanism module (DAM)

To better capture small targets and efficiently extract their features, we design a dual attention module for key feature information extraction, as shown in Figure 6. This module consists of two parallel branches: the Channel Attention (CA) module within the left dashed box and the Spatial Attention (SA) module within the right dashed box. The CA module is used to extract crucial channel information, while the SA module captures key spatial regions and suppresses noise. Key information extraction is then performed by reweighting the input feature map with the attention information, amplifying crucial feature information to the maximum extent while suppressing background noise.

Figure 6. DAM Module.


This process can be described by Equation (9): (9) \( F' = F \otimes F_{SA} \otimes F_{CA} \) where \( \otimes \) denotes element-wise multiplication, \( F_{SA} \) represents the spatial attention map, and \( F_{CA} \) represents the channel attention map. Before the multiplication, the outputs of both branches are resized to \( \mathbb{R}^{H \times W \times C} \).

In the CA submodule, we first apply global average pooling (GAP) to average the pixel values of each channel of the feature map, producing a channel descriptor. We then design a multilayer perceptron (MLP), replacing a multilayer CNN, to calculate the importance weight of each channel; the MLP consists of two fully connected (FC) layers with a rectified linear unit (ReLU) between them to enhance the model's nonlinearity. Finally, the sigmoid function generates a one-dimensional attention distribution map. The reduction ratio r is set to 16, and the generated channel attention map is \( F_{CA} \in \mathbb{R}^{C/r \times 1 \times 1} \). This process can be described as \( F_{CA} = \sigma(\mathrm{MLP}(\mathrm{GAP}(F))) \), where \( \sigma \) represents the sigmoid function.

In the SA submodule, we first use a 1 × 1 convolution to compress the channels, reducing the channel dimension and computational complexity. We then use 3 × 3 convolutions to extract spatial information for key regions; the 3 × 3 convolutions maintain the receptive field while reducing the computational load. Finally, the sigmoid function generates a two-dimensional spatial attention map \( F_{SA} \in \mathbb{R}^{H \times W} \). This process can be described by Equation (10): (10) \( F_{SA} = \sigma\left(f_{2}^{3\times3}\left(f_{1}^{3\times3}\left(f_{0}^{3\times3}(F)\right)\right)\right) \) where f represents a convolution operation, the superscript indicates the size of the convolution filter, and the subscript indexes the filter.
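The following PyTorch sketch illustrates one plausible realisation of the DAM described above, with the CA and SA branches fused by element-wise multiplication as in Equation (9); the exact layer counts, activation placement, and the compression ratio used in the spatial branch are assumptions:

```python
import torch
import torch.nn as nn


class DAM(nn.Module):
    """Dual attention sketch: channel attention (GAP + MLP) and spatial attention (stacked convolutions),
    fused with the input feature map by element-wise multiplication as in Eq. (9)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        # Channel attention branch: GAP -> FC -> ReLU -> FC -> sigmoid (1x1 convolutions act as FC layers).
        self.ca = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # Spatial attention branch: 1x1 channel compression, then 3x3 convolutions, then sigmoid.
        self.sa = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.Conv2d(channels // reduction, channels // reduction, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 1, 3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        f_ca = self.ca(x)        # (B, C, 1, 1) channel attention map
        f_sa = self.sa(x)        # (B, 1, H, W) spatial attention map
        return x * f_sa * f_ca   # F' = F ⊗ F_SA ⊗ F_CA, broadcast over the spatial and channel dimensions
```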

4.3. Multi-scale fusion module (MSF)

Furthermore, the original detector extracts features from different layers by upsampling the feature maps of the subsequent (deeper) layer by a factor of 2 and merging them with the feature maps of the previous layer. However, the prediction of small targets relies primarily on shallow features, and as the depth of the feature layers increases, the information of small targets becomes overshadowed by noise. Therefore, to fully utilise the semantic information and texture features in the shallow feature maps, we design an upward-enhanced Multi-Scale Fusion (MSF) module, as shown in Figure 7.

Figure 7. MSF Module.


The MSF module consists of four branches, including not only 1 × 1 convolution, 3 × 3 convolution, and 5 × 5 convolution, but also scale enhancement using 3 × 3 pooling layers and parallel convolution kernels of different sizes. This allows us to capture target features from different receptive fields and fuse multi-scale receptive field features to efficiently utilise the shallow feature maps. First, we merge the multi-scale feature maps obtained from the 3 × 3 convolution, 5 × 5 convolution, and 3 × 3 pooling layers. Then, we use a 1 × 1 convolution to compress the channels. Finally, we sum the aforementioned output with the feature map obtained after a 1 × 1 convolution to generate the enhanced feature map after multi-scale fusion. The MSF module maximises the utilisation of shallow feature maps and enhances the information within them, which is crucial for small object detection.
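As an illustrative sketch of the four-branch structure described above (the choice of max pooling and of concatenation followed by 1 × 1 compression are assumptions, not details stated in the text):

```python
import torch
import torch.nn as nn


class MSF(nn.Module):
    """Multi-scale fusion sketch: parallel 1x1 / 3x3 / 5x5 convolutions plus 3x3 pooling,
    concatenation and 1x1 compression, then a sum with the 1x1 branch."""

    def __init__(self, channels):
        super().__init__()
        self.branch1x1 = nn.Conv2d(channels, channels, 1)
        self.branch3x3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.branch5x5 = nn.Conv2d(channels, channels, 5, padding=2)
        self.branch_pool = nn.MaxPool2d(3, stride=1, padding=1)
        self.compress = nn.Conv2d(3 * channels, channels, 1)  # fuses the three multi-scale branches

    def forward(self, x):
        fused = torch.cat([self.branch3x3(x), self.branch5x5(x), self.branch_pool(x)], dim=1)
        return self.compress(fused) + self.branch1x1(x)       # enhanced shallow feature map
```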

5. Experimental results and analysis

To validate the effectiveness of the proposed improvements to the FCOS algorithm, corresponding experiments for analysis are conducted in this section.

Experiment 1: In the ResNet50 backbone of FCOS, Dy-Conv is used exclusively to replace the traditional Convolution (Conv) for feature extraction.

Experiment 2: The designed feature enhancement methods (DAM combined with MSF) are used for object detection.

Experiment 3: Dy-Conv is used to replace the traditional Convolution (Conv) for feature extraction, while the designed feature enhancement methods are employed for detection.

5.1. Dataset and experimental settings

5.1.1. Dataset

Since the research focus of this paper is on traffic participants, the CityScapes dataset, which consists of urban street-scene data captured from the perspective of vehicles, is selected for training and validation. The dataset contains various sample categories, and the distribution of classes and their quantities is shown in Figure 8. Object detection is focused on the seven common categories in CityScapes: pedestrians, riders, cars, trucks, buses, motorcycles, and bicycles.

Figure 8. Distribution of Sample Categories in CityScapes Dataset.


Although the CityScapes dataset contains eight categories of traffic-related objects, the "train" category is excluded because trains are not commonly encountered in road traffic environments.

Furthermore, due to the limited number of samples for categories such as trucks and buses (less than 500 in the training set and less than 100 in the test set), the evaluation results for these categories may exhibit some instability compared to categories with larger sample sizes. Therefore, the results for these categories should be interpreted with caution.

The CityScapes dataset is commonly used for semantic segmentation. The dataset originally provides annotated images only for semantic segmentation, as shown in Figure 9, and does not include xywh labels for object detection. Therefore, in this study, a Python script is used to preprocess the original dataset by processing the polygon segmentation data provided for each object. Based on the vertices obtained from the convex hull of each polygon, a minimum bounding box that fully covers the object contour is computed. These bounding boxes are then output as localisation labels for object detection.

Figure 9. Illustration of CityScapes Segmentation Dataset Conversion.

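A minimal sketch of such a preprocessing step is shown below, assuming the public Cityscapes *_gtFine_polygons.json format; the field names, the retained label strings, and the (label, x, y, w, h) output layout are assumptions rather than the authors' exact script:

```python
import json

KEEP_LABELS = {"person", "rider", "car", "truck", "bus", "motorcycle", "bicycle"}


def polygons_to_boxes(annotation_path):
    """Convert one Cityscapes polygon annotation file into (label, x, y, w, h) bounding boxes."""
    with open(annotation_path) as f:
        ann = json.load(f)
    boxes = []
    for obj in ann["objects"]:
        if obj["label"] not in KEEP_LABELS:
            continue
        xs = [p[0] for p in obj["polygon"]]
        ys = [p[1] for p in obj["polygon"]]
        # Minimum axis-aligned box covering every polygon vertex.
        x0, y0, x1, y1 = min(xs), min(ys), max(xs), max(ys)
        boxes.append((obj["label"], x0, y0, x1 - x0, y1 - y0))
    return boxes
```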

For this experiment, the evaluation metrics are the Average Precision (AP) and mean Average Precision (mAP) for object detection. AP is reported as AP@[.50:.05:.95], meaning that AP is calculated at IOU thresholds ranging from 50% to 95% in steps of 5%, and the mean of these AP values is taken.
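For clarity, the metric can be illustrated as follows (a sketch of the definition only, not an evaluation implementation): per-class AP values computed at each IoU threshold are averaged over the ten thresholds, and mAP averages the result over classes.

```python
import numpy as np

# AP@[.50:.05:.95]: average the per-class AP over the ten IoU thresholds, then mAP averages over classes.
IOU_THRESHOLDS = np.arange(0.50, 1.00, 0.05)   # 0.50, 0.55, ..., 0.95


def summarise(ap_table):
    """ap_table[class_name] = list of AP values, one per threshold in IOU_THRESHOLDS."""
    per_class_ap = {c: float(np.mean(aps)) for c, aps in ap_table.items()}
    m_ap = float(np.mean(list(per_class_ap.values())))
    return per_class_ap, m_ap
```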

5.1.2. Experimental settings

The software and hardware platform used in this experiment is presented in Table 2.

Table 2. Software and hardware environment of the experimental platform.

Training environment: The improved backbone network was partially pretrained to accelerate convergence and ensure the overall performance of the network. The algorithm was then trained on the CityScapes dataset using the SGD optimiser with a batch size of 16. The training process consisted of 80,000 iterations, and the initial learning rate was set to 0.005. The gradient descent settings at different stages of training are presented in Table 3: at 40,000 iterations, the learning rate was reduced from 0.005 to 0.0005; at 60,000 iterations, it was further reduced from 0.0005 to 0.00005; and training continued until 80,000 iterations.

Table 3. Gradient settings during training process.
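An illustrative PyTorch configuration matching these reported settings is sketched below; the momentum and weight-decay values and the placeholder model are assumptions not stated in the text:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 8, 3)  # placeholder standing in for the DF-FCOS network
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9, weight_decay=1e-4)
# Decay the learning rate by 10x at 40k and 60k iterations (0.005 -> 0.0005 -> 0.00005).
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[40_000, 60_000], gamma=0.1)

for iteration in range(80_000):
    # ... forward pass, loss computation, optimizer.zero_grad() / loss.backward() / optimizer.step() ...
    scheduler.step()        # per-iteration schedule matching the milestones reported in Table 3
```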

5.2. Performance evaluation on cityscapes

Experiment 1: In the FCOS backbone network ResNet50, Dy-Conv was used instead of traditional convolution (Conv) for feature extraction. To fully utilise the appearance features of different types of objects, a target feature extraction method based on dynamic convolution mechanism was proposed by replacing the traditional convolution layers in the original backbone.

The experimental results, as shown in Table 4, indicate an overall improvement of 1.4% in mAP. Specifically, in traffic scenes, the seven object categories achieved improvements of 1.9%, 0.9%, 2.1%, 1.1%, 1.5%, 1.0%, and 0.9%, respectively. This demonstrates that when Dy-Conv is used for feature extraction in the backbone network instead of traditional convolution, the updates of the convolution kernel parameters are more targeted for different types of objects, reducing inter-class similarity and enhancing the network's ability to represent features for different objects.

Table 4. Comparison of detection accuracy before and after algorithm improvement.

Experiment 2: The designed feature enhancement method (combining DAM with MSF) was employed in FCOS for object detection. To better capture small objects and efficiently extract their features, a dual attention module and a multi-scale enhancement module were designed to extract crucial feature information, magnify key feature information to the maximum extent, and suppress background noise.

The experimental results are shown in Table 4, with an overall mAP improvement of 1.2%, indicating that the design of the feature enhancement module is effective. Among all object types, pedestrians, riders, motorcycles, and bicycles showed the greatest improvement, with average detection accuracy increases of 1.8%, 1.6%, 1.5%, and 1.3%, respectively. The improvements for cars, trucks, and buses were relatively smaller, with increases of 0.7%, 0.8%, and 0.7%, respectively. This suggests that the feature enhancement strategy increases the weight of small objects in the feature map. While larger objects can be detected in larger feature maps, the feature representation of smaller objects may be weakened or even ignored. In this experiment, the dual attention mechanism and multi-scale feature enhancement strategy focus on the features of small objects: by suppressing background noise and capturing crucial information, the feature representation of small objects is further improved. This increases the detector's sensitivity to smaller objects, validating the effectiveness of the designed module.

Experiment 3: In the ResNet50 backbone of FCOS, Dy-Conv is used to replace the traditional convolution (Conv) for feature extraction, while the designed feature enhancement method (combining DAM and MSF) is applied for object detection. The improved network not only enhances the feature representation capability of the backbone but also highlights the crucial feature map information fed into the detection module. As shown in Figure 10, the attention map of the DF-FCOS network is converted into a heatmap to reflect the significance of feature map information in the detection module. It can be observed that the features of small and distant objects are preserved and attended to by the network.

Figure 10. Heatmap of attention in the DF-FCOS network.


As shown in Table 4, the DF-FCOS algorithm based on dynamic convolution and feature enhancement achieves an overall mAP improvement of 2.3% over the baseline, indicating a significant enhancement in performance. The original FCOS detector is compared with the improved DF-FCOS detector, and the experimental results are shown in Figure 11, which presents the detection results of the original FCOS detector and the improved DF-FCOS detector in four scenarios (a)–(d).

Figure 11. Comparison of detection results before and after algorithm improvement: (a) Scene 1; (b) Scene 2; (c) Scene 3; (d) Scene 4.


The performance improvement of the enhanced DF-FCOS algorithm is mainly manifested in the detection of dense objects that were not detected by the original algorithm. These include occluded pedestrians, densely placed bicycles, and cars. Additionally, the improved algorithm detects smaller distant objects that were not detected by the original algorithm, particularly vehicles with smaller feature scales in the distance.

Specifically, in Figure 11(a), the original FCOS algorithm failed to detect a small, partially occluded car in the distance. In Figure 11(b), bicycles and cars are densely arranged on the left, with most of the car features occluded; FCOS detected only one bicycle, and a small distant car in the right frame was missed. In Figure 11(c), the motorcycle and its rider occlude the targets in front, and the blue box marks the occluded pedestrian. In Figure 11(d), the dense crowd of pedestrians causes mutual occlusion, and both the pedestrian and car targets are distant and were not detected by FCOS. The improved DF-FCOS algorithm detects many targets missed by FCOS, leading to an improvement in detection performance.

5.3. Comparison with the SOTA

DF-FCOS is compared with other SOTA object detectors on the Cityscapes dataset and the test-dev split of the MS-COCO benchmark, and the results are shown in Table 5. In practice, the inference speed and accuracy of a model vary with the software and hardware; the experimental platform used here is therefore the same as in Table 2. The AP of DF-FCOS on the Cityscapes dataset and the MS-COCO benchmark is 52.6% and 51.4%, respectively. Compared with the other SOTA object detectors, DF-FCOS shows the best performance.

Table 5. Comparison of the speed and accuracy of different object detectors with SOTA.

6. Conclusion

RTP object detection faces challenges of varying target scales, complex backgrounds, and poor feature representation, which result in a high rate of false alarms and poor detection performance for small objects in changing background environments. To address these challenges, this paper designs a framework that incorporates dynamic convolutional feature extraction, dual attention mechanisms, and multi-scale feature enhancement into the FCOS (Fully Convolutional One-Stage) object detection algorithm to improve detection performance for RTP targets.

In the backbone network, a dynamic convolution-based method is designed for effective target feature extraction, which captures features of different objects, reduces inter-class similarity, and improves the representation capability of the target feature extraction network.

In the detection network, a dual attention module is designed that can capture key features from different scales to alleviate the negative impact of background noise, further enhancing the representation capability of the feature extraction network.

Furthermore, to better preserve the features of smaller targets, a multi-scale feature fusion module is designed for addressing the feature maps of small objects, making the detector more sensitive to small targets.

Through the analysis of ablation experiment results, the RTP object detection model designed in this paper shows significant improvements in detection performance compared with the baseline object detection model. The proposed improvement modules further enhance the feature representation capability of the network and improve the feature extraction effectiveness for RTP targets, thereby contributing to the improvement of small object detection performance.

With the industrialisation of intelligent connected vehicles, the development of RTP detection plays an important role in achieving fully automatic driving. By improving the feature extraction process, the proposed method can identify and locate RTPs more accurately and provide reliable perception ability for fully automatic driving systems. The method also improves small target detection by introducing the dynamic convolution and feature enhancement structures, significantly improving the safe driving ability of intelligent vehicles in complex road environments. In addition, the dual attention module reduces the amount of calculation, filters background noise, and processes key features, giving higher real-time performance suitable for large-scale deployment in intelligent vehicle systems. Finally, in the detection part, the multi-scale feature fusion module further enhances the feature expression ability of the shallow network and provides important support for accurate decision-making and behaviour planning of intelligent vehicle systems.

However, this paper also has some limitations. First, the influence of optimising the attention module on object detection performance needs further exploration. Although the module has achieved remarkable gains in feature representation and small target detection, its applicability to specific scenarios or datasets should be verified. Second, the influence of the embedding position of the attention module on detection performance also requires further study: current research covers only some common embedding location strategies, and different datasets, target scales, and scene complexities may respond differently. Future research can therefore develop more embedding location strategies and carry out targeted experiments and analysis to further improve the detection of occluded targets. In addition, a lightweight traffic target detection network for mobile intelligent devices can be designed in the future, reducing the number of model parameters without degrading the performance of the algorithm. Designing a traffic target detection network compatible with multiple tasks in bad weather is also an important direction for subsequent research.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Declaration of Interest

All authors declare that they have no actual or potential competing financial interests or personal relationships.

Additional information

Funding

This work was supported by the National Natural Science Foundation of China [grant number: 52202413], Natural Science Foundation of Jiangxi Province [grant number: 20232BAB214093] and Natural Science Foundation of Jiangsu Province [grant number: BK20220243].

References

  • Bochkovskiy, A., Wang, C. Y., & Liao, H. Y. M. (2020). Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934.
  • Cai, Z., & Vasconcelos, N. (2018). Cascade R-CNN: Delving into high quality object detection. 2018 IEEE/CVF conference on Computer Vision and Pattern Recognition, 18-23 June, 6154–6162.
  • Cai, Z., & Vasconcelos, N. (2021). Cascade R-CNN: High quality object detection and instance segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(5), 1483–1498. https://doi.org/10.1109/TPAMI.2019.2956516
  • Cao, J., Cholakkal, H., Anwer, R. M., Khan, F. S., Pang, Y., & Shao, L. (2020). D2det: Towards high quality object detection and instance segmentation. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14-19 June, 11482–11491.
  • Cao, J., Zhang, J., & Jin, X. (2021). A traffic-sign detection algorithm based on improved sparse R-CNN. IEEE Access, 9, 122774–122788. https://doi.org/10.1109/ACCESS.2021.3109606
  • Chen, W., Xiong, H., Li, K., & Li, X. (2018). Concurrent pedestrian and cyclist detection based on deep neural networks. Automotive Engineering, 40(6), 726–732.
  • Chen, Y., Xia, R., Yang, K., & Zou, K. (2023a). GCAM: Lightweight image inpainting via group convolution and attention mechanism. International Journal of Machine Learning and Cybernetics, 1–11.
  • Chen, Y., Xia, R., Yang, K., & Zou, K. (2023b). MFFN: Image super-resolution via multi-level features fusion network. The Visual Computer, 1–16.
  • Chen, Y., Xia, R., Yang, K., & Zou, K. (2023d). DGCA: High resolution image inpainting via DR-GAN and contextual attention. Multimedia Tools and Applications, 1–21.
  • Chen, Y., Xia, R., Yang, K., & Zou, K. (2023e). DARGS: Image inpainting algorithm via deep attention residuals group and semantics. Journal of King Saud University - Computer and Information Sciences, 35(6), 101567. https://doi.org/10.1016/j.jksuci.2023.101567
  • Chen, Y., Xia, R., Zou, K., & Yang, K. (2023c). RNON: Image inpainting via repair network and optimization network. International Journal of Machine Learning and Cybernetics, 1–17.
  • Duan, K., Bai, S., Xie, L., Qi, H., Huang, Q., & Tian, Q. (2019). Centernet: Keypoint triplets for object detection. 2019 IEEE/CVF international conference on computer vision, Seoul, Korea (South), 27 Oct-2 Nov, 6569–6578.
  • Gao, E., Huang, W., Shi, J., Wang, X., Zheng, J., Du, G., & Tao, Y. (2022). Long-tailed traffic sign detection using attentive fusion and hierarchical group softmax. IEEE Transactions on Intelligent Transportation Systems, 23(12), 24105–24115. https://doi.org/10.1109/TITS.2022.3200737
  • Gašparović, B., Lerga, J., Mauša, G., & Ivašić-Kos, M. (2022). Deep learning approach for objects detection in underwater pipeline images. Applied Artificial Intelligence, 36(1), 2146853. https://doi.org/10.1080/08839514.2022.2146853
  • Ge, Z., Liu, S., Wang, F., Li, Z., & Sun, J. (2021). Yolox: Exceeding yolo series in 2021[J]. arXiv preprint arXiv:2107.08430.
  • Girshick, R. (2015). Fast r-cnn. 2015 IEEE international conference on computer vision, 7-13 Dec, 1440–1448.
  • Guo, Y., Zhou, D., Li, P., Li, C., & Cao, J. (2022). Context-Aware poly (A) signal prediction model via deep spatial–temporal neural networks. IEEE Transactions on Neural Networks and Learning Systems, 1–13.
  • Hai, W., Kuan, W., Yingfeng, C., Ze, L., & Long, C. (2020). Traffic sign recognition based on improved cascade convolution neural network. Automotive Engineering, 42, 1256–1262.
  • He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask r-cnn. 2017 IEEE international conference on computer vision, 22-29 Oct, 2961–2969.
  • Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. 2018 IEEE conference on computer vision and pattern recognition, 18-23 June, 7132–7141.
  • Law, H., & Deng, J. (2018). Cornernet: Detecting objects as paired keypoints. 2018 European conference on computer vision (ECCV), 8-14 Sept, 734–750.
  • Li, W., Guo, Y., Wang, B., & Yang, B. (2023). Learning spatiotemporal embedding with gated convolutional recurrent networks for translation initiation site prediction. Pattern Recognition, 136, 109234. https://doi.org/10.1016/j.patcog.2022.109234
  • Liang, H. (2023). Improved EfficientDet algorithm for basketball players' upper limb movement trajectory recognition. Applied Artificial Intelligence, 37(1), 2225906. https://doi.org/10.1080/08839514.2023.2225906
  • Lin, T. Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017a). Feature pyramid networks for object detection. 2017 IEEE conference on computer vision and pattern recognition, 21-26 July, 2117–2125.
  • Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017b). Focal loss for dense object detection. 2017 IEEE international conference on computer vision, 22-29 Oct, 2980–2988.
  • Liu, S., & Huang, D. (2018). Receptive field block net for accurate and fast object detection. 2018 European conference on computer vision (ECCV), 8-14 Sept, 385–400.
  • Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., & Berg, A. C. (2016). SSD: Single shot multibox detector. Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 Oct, 2016, Proceedings, Part I 14. Springer International Publishing, 21–37.
  • Liu, X., Ghazali, K. H., Han, F., & Mohamed, I. I. (2021). Automatic detection of oil palm tree from UAV images based on the deep learning method. Applied Artificial Intelligence, 35(1), 13–24. https://doi.org/10.1080/08839514.2020.1831226
  • Luo, J., Fang, H., Shao, F., Hu, C., & Meng, F. (2021). Vehicle detection in congested traffic based on simplified weighted dual-path feature pyramid network with guided anchoring. IEEE Access, 9, 53219–53231. https://doi.org/10.1109/ACCESS.2021.3069216
  • Papandreou, G., Zhu, T., Chen, L. C., Gidaris, S., Tompson, J., & Murphy, K. (2018). Personlab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. 2018 European conference on computer vision (ECCV), 8-14 Sept, 269–286.
  • Rampriya, R. S., Suganya, R., Nathan, S., & Perumal, P. S. (2022). A comparative assessment of deep neural network models for detecting obstacles in the real time aerial railway track images. Applied Artificial Intelligence, 36(1), 2018184. https://doi.org/10.1080/08839514.2021.2018184
  • Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. 2016 IEEE conference on computer vision and pattern recognition, 27-30 June, 779–788.
  • Redmon, J., & Farhadi, A. (2018). Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767.
  • Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 91–99.
  • Shi, H., Lai, R., Li, G., & Yu, W. (2022). Visual inspection of surface defects of extreme size based on an advanced FCOS. Applied Artificial Intelligence, 36(1), 2122222. https://doi.org/10.1080/08839514.2022.2122222
  • Song, G., Liu, Y., & Wang, X. (2020). Revisiting the sibling head in object detector. 2020 IEEE/CVF conference on computer vision and pattern recognition, 14-19 June, 11563–11572.
  • Song-tao, D., & Shi-ru, Q. U. (2018). Traffic object detection based on deep learning with region of interest selection. China Journal of Highway and Transport, 31(9), 167–174.
  • Tang, Q., Liu, S., Li, J., Hu, Y. (2019). PosNeg-balanced anchors with aligned features for single-shot object detection. arXiv preprint arXiv:1908.03295.
  • Tian, Z., Shen, C., Chen, H., He, T. (2019). Fcos: Fully convolutional one-stage object detection. 2019 IEEE/CVF international conference on computer vision, 27 Oct-2 Nov, 9627–9636.
  • Vu, T., Jang, H., Pham, T. X., & Yoo, C. D. (2019). Cascade rpn: Delving into high-quality region proposal network with adaptive convolution. Advances in Neural Information Processing Systems, 1432–1442.
  • Wang, K., & Zhang, L. (2020). Single-shot two-pronged detector with rectified iou loss. 2020 28th ACM International Conference on Multimedia, 12 Oct, 1311–1319.
  • Yan, L., Jia, L., Lu, S., Peng, L., & He, Y. (2023). LSTM-based deep learning framework for adaptive identifying eco-driving on intelligent vehicle multivariate time-series data. IET Intelligent Transport Systems, 00, 1–17.
  • Yang, B., Bender, G., Le, Q. V., et al. (2019). Condconv: Conditionally parameterized convolutions for efficient inference. Advances in Neural Information Processing Systems, 1307–1318.
  • Yang, S., Yunpeng, L., & Yu, L. (2022a). Lane detection based on instance segmentation of BiSeNet V2 backbone network. Applied Artificial Intelligence, 36(1), 2085321. https://doi.org/10.1080/08839514.2022.2085321
  • Yang, S., Zhou, D., Cao, J., & Guo, Y. (2022b). Rethinking low-light enhancement via transformer-GAN. IEEE Signal Processing Letters, 29, 1082–1086. https://doi.org/10.1109/LSP.2022.3167331
  • Yang, S., Zhou, D., Cao, J., & Guo, Y. (2023). LightingNet: An integrated learning method for low-light image enhancement. IEEE Transactions on Computational Imaging, 9, 29–42. https://doi.org/10.1109/TCI.2023.3240087
  • Yuan, S., Wang, K., Shan, Y., & Yang, J. (2021). Multi-scale object detection method based on multi-branch parallel dilated convolution. Journal of Computer-Aided Design & Computer Graphics, 33(6), 864–872. https://doi.org/10.3724/SP.J.1089.2021.18537
  • Zhong, Q., Li, C., Zhang, Y., Xie, D., Yang, S., & Pu, S. (2020). Cascade region proposal and global context for deep object detection. Neurocomputing, 395, 170–177. https://doi.org/10.1016/j.neucom.2017.12.070
  • Zhou, T., Liu, W., Zhang, M., & Jia, J. (2023). Optimization of AEB decision system based on unsafe control behavior analysis and improved ABAS algorithm. IEEE Transactions on Intelligent Transportation Systems, 1–14.