Active learning for deep object detection by fully exploiting unlabeled data

Abstract

Object detection is a challenging task that requires a large amount of labeled data to train high-performance models. However, labeling huge amounts of data is expensive, making it difficult to train a good detector with limited labeled data. Existing approaches mitigate this issue via active learning or semi-supervised learning, but there is still room for improvement. In this paper, we propose a novel active learning method for deep object detection that fully exploits unlabeled data by combining the benefits of active learning and semi-supervised learning. Our method first trains an initial model using limited labeled data, then uses self-training and data augmentation strategies to train a semi-supervised model using labeled and unlabeled data. We then select query samples based on informativeness and representativeness from the unlabeled data to further improve the model through semi-supervised training. Experimental results on commonly used object detection datasets demonstrate the effectiveness of our approach, outperforming state-of-the-art methods.

Introduction

As a fundamental task in computer vision, object detection (Carion et al., 2020; Zhu et al., 2021) aims to accurately locate bounding boxes in an image that contain objects of different categories. Recently, deep object detection methods (Girshick et al., 2014; Girshick, 2015; Redmon et al., 2016; Redmon & Farhadi, 2017; W. Liu et al., 2016) have received significant attention due to their wide applications in autonomous driving, video surveillance and face detection. These methods are trained on a large number of fully labeled images annotated with the objects' categories and locations (Cordone et al., 2022; Kim et al., 2019; Shen et al., 2022). However, this kind of supervision is difficult to obtain, as labeling each object in an image requires immense time and manual effort, and the annotation process can be costly. Improving model performance when a large amount of labeled data is not available is therefore a challenging problem in object detection.

In many real-world applications, labeled data is scarce due to the high cost of manual labeling. For example, it takes 7 s to 42 s to precisely label a single object rectangle in object detection. Active learning and semi-supervised learning are the most prominent methods for utilising unlabeled samples to improve model performance. Specifically, active learning selects valuable unlabeled samples according to a query strategy, has them labeled manually, and trains effective models with as few labeled samples as possible. Semi-supervised learning, in contrast, explores the information in unlabeled data and uses both labeled and unlabeled data for training.

Active learning has been widely used in object detection tasks in recent years. For example, Kao et al. (2018) proposed two methods: Localization Tightness with the classification information (LT/C), which is based on the overlapping ratio between the region proposals and the final prediction, and Localization Stability with the classification information (LS + C), which is based on the variation of predicted object locations when input images are corrupted by noise. Sinha et al. (2019) trained an adversarial network to discriminate between unlabeled and labeled samples. Yu et al. (2022) proposed an active learning object detection method, Consistency-based Active Learning for Object Detection (CALD), which queries informative samples based on consistency. These active learning methods improve the model's performance to a certain extent through their query strategies, but they do not make full use of unlabeled data.

Semi-supervised learning, as another way to utilise unlabeled samples, has recently been used in object detection. Jeong et al. (2019) proposed a consistency-based semi-supervised learning algorithm (CSSL) for object detection. Sohn et al. (2020) proposed a semi-supervised object detection framework, STAC, which combines self-training and consistency regularisation based on data augmentations. Y.-C. Liu et al. (2021) used the teacher-student mutual learning scheme Unbiased Teacher (UT) to implement semi-supervised object detection. These methods improve the performance of the model with limited labeled data, but the learning process is very likely to generate a large number of noisy samples, causing the model to learn wrong information. We summarise some of the comparative methods used in the experiments in Table 1.

Table 1. Representative related algorithms. (“Inf.”, “Rep.”, “SL”, and “SSL” are the abbreviations of informativeness, representativeness, supervised learning, and semi-supervised learning, respectively.)

Although many methods have been proposed, most existing algorithms still have drawbacks and do not combine active learning and semi-supervised learning to further improve the performance of the detector. In this work, we observe that the mechanisms of active learning and semi-supervised learning complement each other. They are used to solve similar problems but have different characteristics. In general, active learning can obtain reliable training samples through query strategies to improve model performance, but it needs expert knowledge. Semi-supervised learning does not rely on external knowledge. It can use labeled and unlabeled data for training, but training on misclassified data may damage the performance of the model. Considering the characteristics of these two methods, we combine them in object detection.

Based on the above observations, we propose a new active learning method for deep object detection that fully exploits unlabeled data by using the characteristics of active learning and semi-supervised learning simultaneously. Specifically, in the first stage we use all labeled data to train an initial object detector based on the Faster RCNN model. The second stage is a semi-supervised learning process: we apply strong and weak data augmentations to the unlabeled data and use the initial model to predict on the strongly augmented data and generate pseudo labels. Then we use the weakly augmented data and the corresponding pseudo labels to train a new detector. In the third stage, we run inference on all the unlabeled data with the new detector and select the most informative samples. From this informative subset we then choose the most representative samples as the query samples. Finally, after annotating the selected samples and updating the labeled and unlabeled data, we train the final model through the semi-supervised learning step again.

In order to validate our method, we test on multiple data sets: PASCAL VOC 2007, PASCAL VOC 2012, MS-COCO, PlantDoc, PKLot, Hard Hat Workers, Oxford Pets and Synthetic Fruit. For each data set, we use 5% of the data as the labeled set and the remainder as the unlabeled set; in the active learning part, we select 2.5% of the data from the unlabeled set as query samples each time. In the experiments, we find that our method's mean Average Precision (mAP) is significantly higher than that of other recent methods. The results demonstrate the effectiveness of combining active learning with semi-supervised learning in object detection and better performance than state-of-the-art approaches.

Contributions. The main contributions of this paper are summarised as follows:

We observed and pointed out that there is a complementary relationship between active learning and semi-supervised learning. We can fully use unlabeled data in the query process of active learning, and use the more reliable training samples obtained from the query to complete the semi-supervised training process better.
We propose an object detection method combining the characteristics of active learning and semi-supervised learning. In the semi-supervised learning part, we apply data augmentation and train labeled and unlabeled data based on the consistency principle. For the active learning, we use the informativeness and representativeness principle to query valuable samples from the unlabeled data; after labeling these samples, we train a better detector. To demonstrate the effectiveness of our method, we also compare our method with the latest methods on several popular data sets.

Organisation. The rest of this article is organised as follows:

In Section 1, we introduce the preliminaries of active learning and semi-supervised learning. In Section 2, we describe the detailed steps of our method and present the experimental results in Section 3. Finally, we conclude this article in Section 4.

1 Preliminaries

In this section, we first review active and semi-supervised learning, then describe the complementary mechanisms of active and semi-supervised learning that we observed.

1.1 Brief review of active learning and semi-supervised learning

Active Learning: Active learning (Parvaneh et al., 2022; Yoo & Kweon, 2019; Yuan et al., 2021) aims to achieve the expected performance of the target model with as few labeled samples as possible, thereby significantly reducing the cost of labeling samples. Generally, active learning first selects or generates the most valuable samples through appropriate strategies. Then, experts annotate these samples and add them to the training data set. The core of active learning is the query strategy. Uncertainty sampling is one of the most common strategies. The idea is to select the samples about which the model is most uncertain. It mainly includes three methods. The least confidence strategy uses the highest predicted class score of a sample as its information measure, but ignores the distribution of the remaining class scores. The margin sampling strategy represents the informativeness of a sample by the difference between the two classes with the highest predicted confidence. The entropy strategy considers the probability distribution over all classes.

In order to save the cost of annotation, active learning is gradually being applied in many fields, including object detection (Sener & Savarese, 2018). Variational Adversarial Active Learning (VAAL) (Sinha et al., 2019) trains an adversarial network to discriminate between unlabeled and labeled samples. Yu et al. (2022) proposed an active learning object detection method, CALD, which considers the consistency of both the bounding box and the predicted class distribution under augmentation to overcome the challenges brought by inconsistencies between classification and detection. Kao et al. (2018) proposed Localization Tightness with the classification information (LT/C) and Localization Stability with the classification information (LS + C). LT/C is based on the overlapping ratio between the region proposals and the final prediction, and LS + C is based on the variation of predicted object locations when input images are corrupted by noise. Most of the above-mentioned methods require additional human expertise and do not fully use unlabeled data to train models.

Compared to the existing works in active learning, our method combines informativeness and representativeness as sample query strategies. Informativeness is responsible for querying samples with considerable uncertainty, and representativeness is responsible for querying samples with significant variance. In addition, a semi-supervised learning process is added to the training process to fully use the information of unlabeled data to train a better model.

Semi-supervised Learning: Semi-supervised learning is another way to alleviate the shortage of labeled samples. First, the model is pre-trained with a small number of labeled samples in the target field. Then, without manual labeling, both labeled and unlabeled data are used to train the model. The most commonly used semi-supervised learning methods in deep learning include self-training and consistency regularisation. The basic idea of self-training is to first train a base model on labeled samples, then use the base model to predict on unlabeled samples and keep the predictions with higher confidence as pseudo labels. Finally, the pseudo-labeled samples are combined with the labeled samples, and standard supervised learning is used to retrain the model. The core idea of consistency regularisation is that the output for an input sample should remain consistent even if the sample is perturbed by small noise. A slight perturbation corresponds to a small move in the data distribution space, and closely adjacent points in that space should have the same label.

Recently, semi-supervised learning has been developed and applied in the field of object detection (Berthelot et al., 2019; Miyato et al., 2019; P. Tang et al., 2021; Y. Tang et al., 2016). Jeong et al. (2019) proposed a consistency-based semi-supervised learning algorithm for object detection that can be applied to both single-stage and two-stage detectors. Sohn et al. (2020) proposed a semi-supervised learning framework for object detection that extends state-of-the-art semi-supervised classification methods based on self-training and augmentation-driven consistency regularisation. Y.-C. Liu et al. (2021) proposed a simple yet effective method, Unbiased Teacher, which addresses the pseudo-labeling bias caused by class imbalance in the ground-truth labels and the overfitting caused by the scarcity of labeled data, using a teacher-student mutual learning scheme for semi-supervised object detection.

Most of the above-mentioned methods make full use of unlabeled data for training and do not require manual labeling in the process, but it is very likely to introduce a large number of noise samples during the learning process, resulting in the model learning wrong information. Compared to these works in semi-supervised learning, our method uses data augmentation to make differences in image data and completes semi-supervised training through consistency regularisation and self-training. Additionally, we add the process of active learning. Using the query strategies, valuable samples are queried and labeled manually, and then the data set is updated for training. Reliable query samples reduce the possibility of misclassification during training, ultimately improving the model's performance.

1.2 Complementary mechanisms of active learning and semi-supervised learning

In this subsection, we observe that the mechanisms of active learning and semi-supervised learning are complementary, as shown in Figure 1.

Figure 1. The mechanisms of active learning and semi-supervised learning.

Active learning and semi-supervised learning, as two methods to mitigate the shortage of labeled samples, have gradually attracted the attention of researchers. Although these two methods are used to solve similar problems, there is still a difference in practical applications. One of the advantages of semi-supervised learning over active learning methods is that it can make full use of the information from unlabeled samples and does not require additional manual labeling costs. But at the same time, it also brings some problems. Some misclassified samples during training will cause the model to learn the wrong information. It is worth noting that the active learning and semi-supervised learning methods have similar iterative training processes, with complementary advantages and disadvantages. In our method, we consider adding the semi-supervised model to the active learning process to make full use of unlabeled data while further training with samples queried by active learning to reduce the possibility of misclassification. Therefore, the effective combination of active learning and semi-supervised learning methods will possibly reduce the cost of labeling more significantly.

Many studies have improved models by combining these two methods. McCallum & Nigam (1998) combined a committee-based active learning algorithm with a semi-supervised learning algorithm based on expectation-maximisation (EM) to assign labels to unlabeled data. They used a Bayesian classifier for text classification tasks and applied the EM algorithm to each committee member. Rei (2017) applied active semi-supervised learning to sequence labeling tasks. Gu et al. (2021) proposed an enhanced active learning risk bound, which could improve on an earlier bound by fully considering the unlabeled data. In object detection, active learning combined with semi-supervised learning has gradually attracted attention because of its practical value. Rhee et al. (2017) combined active learning and semi-supervised learning to learn object detection tasks from noisy data. Wang et al. (2017) proposed a switchable sample selection mechanism that determines whether samples are labeled manually or assigned pseudo labels by self-learning, in order to minimise the cost of labeling. Despite these methods, there is still little work on active semi-supervised learning for object detection, which is a new field that remains to be explored.

Compared to the existing works combining active learning and semi-supervised learning, our work focuses on issues related to object detection. Specifically, our method exploits the complementary relationship between active learning and semi-supervised learning. First, in the semi-supervised learning part, we use data augmentation to create differences between unlabeled samples, and generate pseudo labels via consistency regularisation and self-training. In this part, we can fully use unlabeled data in the semi-supervised learning process, which was not possible in previous active learning methods. Then, in the active learning part, our method uses both informative and representative criteria to query samples, which ensures that the queried samples are the most likely to be misclassified and represent the majority of classes. By manually labeling the query samples in this part, we mitigate the problem that previous semi-supervised learning methods may generate misclassified samples, which can lead to poor training results.

2 The proposed approach

In this section, based on the complementary mechanism of active learning and semi-supervised learning, we propose combining them in object detection. Specifically, we first use a small number of labeled samples to train a detector, then train on unlabeled data in a semi-supervised way. After that, we retrain the model after adding the batch of samples selected by the current detector. In semi-supervised learning, we combine data augmentation and pseudo labels to train on labeled and unlabeled data simultaneously. In active learning, we combine informativeness and representativeness as query strategies to select samples for further training. The method consists of three main steps, and the general process is shown in Figure 2.

Figure 2. Overview of our method. (a) Initial step: we use labeled data to train an initial model based on Faster RCNN. (b) Semi-supervised learning step: we use strong and weak data augmentation on the unlabeled data, generate pseudo labels for strong augmentation part using inference, then use weak augmentation part and pseudo labels to train a model. (c) Active learning step: we use a semi-supervised model to predict the unlabeled data, query sample set Q with informative and representative principles, then update the data sets and start a new semi-supervised learning process to train a new model.

2.1 Network structure

In this paper, we choose Faster RCNN as the primary detection model. All experiments are based on Faster RCNN to complete the comparison and verification. Its network structure contains three sub-networks: the basic feature extraction network, the region proposal network (RPN), and the classification regression sub-network, which are introduced separately in the following.

The basic feature extraction network is also called the Head. Its primary function is to extract image features, which are used for generating candidate regions and for extracting and classifying the features of those regions. Common choices for the Head include VGG, ResNet and Xception. In this paper, we use ResNet50 as the Head. ResNet is a deep residual network, and ResNet50 has two basic blocks, namely the Conv Block and the Identity Block. The input and output dimensions of a Conv Block differ, so Conv Blocks cannot be connected in series; their function is to change the dimension of the network. The input and output dimensions of an Identity Block are the same, so Identity Blocks can be connected in series to deepen the network.

The RPN is the key component of Faster RCNN. Its function is to generate a series of region proposal boxes and, if a proposal box contains an object, to perform an initial regression of the object position. The mapping of a proposal box onto the original image is called an anchor. By setting anchors of different scales and areas, $k$ different anchor boxes are obtained, and each anchor is mapped to a 512-dimensional feature vector, which is fed into the classification layer and the regression layer for end-to-end training. The classification layer predicts whether an object exists in each anchor box, yielding $2k$ confidence scores. The regression layer regresses the position of the bounding box, yielding $4k$ coordinate parameters. Finally, the results of the classification and regression layers are combined to obtain the object region proposal boxes.
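As a rough illustration of how $k$ anchors and the corresponding $2k$ scores and $4k$ coordinates arise, the sketch below enumerates anchor boxes centred on one location. It assumes anchors are built from scales and aspect ratios, as in the standard Faster R-CNN formulation; the specific values and the function name `make_anchors` are hypothetical and are not taken from this paper.

```python
import math

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred at (cx, cy).

    Each box has area scale**2 and width/height ratio `ratio` (assumed values).
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * math.sqrt(r)   # width grows with the ratio
            h = s / math.sqrt(r)   # height shrinks so that w * h == s * s
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors(300, 200)
k = len(anchors)
# Per location: k anchors, 2k objectness scores, 4k box coordinates.
print(k, 2 * k, 4 * k)
```

With three scales and three ratios this gives $k = 9$ anchors per location.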

The loss function of the RPN is as follows:

$$L(P_{k}, B_{k}) = \frac{1}{N_{cls}} \sum_{k} L_{cls}(P_{k}, Y_{k}) + \frac{1}{N_{reg}} \sum_{k} Y_{k} L_{reg}(B_{k}, \hat{B_{k}}) \tag{1}$$

where $k$ is the anchor index, $P_{k}$ is the set of predicted probabilities of whether the $k$-th anchor contains an object, and $Y_{k} = \{y_{k}^{m}\}_{m=0}^{1}$ is the corresponding ground-truth set. If the Intersection over Union (IoU) between the anchor and any annotated ground-truth box is greater than 0.7, then $y_{k}^{0} = 0$, $y_{k}^{1} = 1$, and the anchor is a positive sample. If the IoU is less than 0.3, then $y_{k}^{0} = 1$, $y_{k}^{1} = 0$, and the anchor is a negative sample; the remaining anchors are ignored. $B_{k}$ is the corresponding predicted coordinate position, and $\hat{B_{k}}$ is the ground truth of the coordinate.
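The anchor labelling rule above (positive above IoU 0.7, negative below 0.3, ignored otherwise) can be sketched as follows. The box format `(x1, y1, x2, y2)` and the function names are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def label_anchor(anchor, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Return (y0, y1) = (background, object), or None if the anchor is ignored."""
    best = max((iou(anchor, g) for g in gt_boxes), default=0.0)
    if best > pos_thr:
        return (0, 1)   # positive sample
    if best < neg_thr:
        return (1, 0)   # negative sample
    return None         # neither positive nor negative: excluded from the loss
```

For instance, an anchor that covers exactly half of its best-matching ground-truth box has IoU 0.5 and is ignored.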

$L_{cls}$ is the loss function of the classification part. It adopts the cross-entropy loss, whose general form is:

$$L = -\sum_{m} y^{m} \log(P^{m}) \tag{2}$$

$L_{reg}$ is the loss function of the regression part. For each value of the coordinates $(x, y, w, h)$, a smooth L1 loss is used, whose general form is:

$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5 x^{2} & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases} \tag{3}$$

The classification regression sub-network receives the features output by the basic feature network (the Head) and the object proposal regions output by the RPN. It extracts the features of each proposal region through the region of interest pooling layer and then separately performs category classification and regression of the rectangular box position to obtain candidate objects.
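Eqs. (2) and (3) are simple enough to write out directly. The following is a minimal sketch in plain Python; the function names and the small `eps` guard against `log(0)` are additions for illustration.

```python
import math

def cross_entropy(y, p, eps=1e-12):
    """Eq. (2): -sum_m y^m log(p^m) for a one-hot target y and probabilities p."""
    return -sum(ym * math.log(max(pm, eps)) for ym, pm in zip(y, p))

def smooth_l1(x):
    """Eq. (3): quadratic for |x| < 1, linear otherwise."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def box_regression_loss(pred, target):
    """Sum of smooth L1 over the four coordinates (x, y, w, h)."""
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))
```

The two branches of the smooth L1 loss meet at $|x| = 1$, where both give 0.5, so the loss is continuous while growing only linearly for large errors.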

2.2 Initial step

We are given a small labeled data set $D_{l} = \{(x_{i}^{l}, y_{i}^{l})\}_{i=1}^{N_{l}}$ and a large unlabeled data set $D_{u} = \{x_{i}^{u}\}_{i=1}^{N_{u}}$, where $y^{l}$ is the ground truth that includes the locations and object categories in the image $x^{l}$, and $N_{l}$ and $N_{u}$ are the numbers of labeled and unlabeled samples ($N_{l} \ll N_{u}$). Based on the Faster RCNN framework, we use the labeled data $D_{l}$ to train an initial model $M_{0}$.

The loss function of the classification regression network is as follows:

$$L_{s}(P_{k}, B_{k}) = \frac{1}{N_{cls}} \sum_{k} L_{cls}(P_{k}, Y_{k}) + \frac{1}{N_{reg}} \sum_{k} Y_{k} L_{reg}(B_{k}, \hat{B_{k}}) \tag{4}$$

where $P_{k}$ and $B_{k}$ are the predicted category probability and locations, respectively, $k$ is the index of objects, $L_{cls}$ is the cross-entropy loss, $L_{reg}$ is the smooth L1 loss, and $Y_{k}$ is the ground truth.

2.3 Semi-supervised learning step

In this step, we train on both labeled and unlabeled data. First, we apply weak and strong data augmentations to the unlabeled data set $D_{u}$ to obtain $D_{u}^{w}$ and $D_{u}^{s}$. For each image, we apply a colour transformation and a geometric transformation, and then apply a cutout operation to complete the strong data augmentation.
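To make the two augmentation levels concrete, here is a small sketch on a grey-scale image stored as a list of rows. It follows one reading of the text, in which the weak augmentation is a colour plus geometric transform and the strong augmentation adds cutout on top. The specific transforms (brightness scaling, horizontal flip), their parameter ranges and all function names are assumptions.

```python
import random

def adjust_brightness(img, factor):
    """Colour transform: scale every pixel and clip to [0, 255]."""
    return [[min(255, max(0, int(round(v * factor)))) for v in row] for row in img]

def horizontal_flip(img):
    """Geometric transform: mirror each row."""
    return [row[::-1] for row in img]

def cutout(img, size, rng, fill=0):
    """Blank out one size x size square at a random position."""
    h, w = len(img), len(img[0])
    top = rng.randrange(0, h - size + 1)
    left = rng.randrange(0, w - size + 1)
    out = [row[:] for row in img]   # copy so the input is not modified
    for r in range(top, top + size):
        for c in range(left, left + size):
            out[r][c] = fill
    return out

def weak_augment(img, rng):
    """Random flip followed by a mild brightness change (assumed ranges)."""
    if rng.random() < 0.5:
        img = horizontal_flip(img)
    return adjust_brightness(img, rng.uniform(0.8, 1.2))

def strong_augment(img, rng, size=2):
    """Weak augmentation followed by cutout."""
    return cutout(weak_augment(img, rng), size, rng)
```

In a detector, any geometric transform must also be applied to the box coordinates; that bookkeeping is omitted here.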

After that, we run inference with the initial model on the strongly augmented data $D_{u}^{s}$ and generate pseudo labels from the predictions whose confidence scores are greater than 0.7. Then we combine the weakly augmented data $D_{u}^{w}$ with the corresponding pseudo labels to generate new labeled data. Finally, we use the new data to train a new model $M_{1}$.
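The confidence filter that turns predictions into pseudo labels can be sketched as below. The detection format (a list of dicts with `box`, `label` and `score` keys) is a hypothetical structure chosen for illustration; only the 0.7 threshold comes from the text.

```python
def make_pseudo_labels(detections, score_thr=0.7):
    """Keep predicted boxes whose confidence is greater than the threshold.

    `detections` is a list of dicts with keys "box", "label", "score"
    (an assumed format, not the authors' data structure).
    """
    return [
        {"box": d["box"], "label": d["label"]}
        for d in detections
        if d["score"] > score_thr
    ]

predictions = [
    {"box": (10, 10, 50, 60), "label": "dog", "score": 0.92},
    {"box": (30, 40, 80, 90), "label": "cat", "score": 0.41},
]
pseudo = make_pseudo_labels(predictions)  # only the "dog" box survives
```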

The network is trained by jointly minimising two losses as follows:

$$L = L_{s}(P_{k}, B_{k}) + \lambda_{u} L_{u}(P_{k}^{*}, B_{k}^{*}) \tag{5}$$

where $\lambda_{u}$ is a hyperparameter that balances the supervised loss and the unsupervised loss, and $P_{k}^{*}$ and $B_{k}^{*}$ are the predicted category probability and locations for the unlabeled data. Note that in $L_{u}$ we use pseudo labels instead of ground truth. The network is shown in Figure 3.

Figure 3. Network of our method.

For unlabeled data, the central part is the same as Faster RCNN. First, the feature map is extracted through CNN, and the proposal extraction frame is generated through the RPN network, then, the proposal feature maps are extracted through the ROI pooling layer and sent to the subsequent fully connected layer. Finally, the object category and location are obtained through the softmax and regression layers. The difference is that pseudo labels are used when calculating the classification loss instead of ground-truth.

2.4 Active learning step

In this part, we consider both the informativeness and representativeness of query samples. For informativeness, the uncertain samples tend to have a higher amount of information, so we use $M_{1}$ to predict the unlabeled data set $D_{u}$ , and calculate the uncertainty of each image. Here we give three methods to query an uncertain set $Q_{1}$ .

Least Confidence (LC): Least Confidence considers only the classification score of the single category that the model predicts with the highest confidence. If this score is low, the model's prediction for the sample is considered highly uncertain, that is, its confidence is the smallest, so the sample is selected. We select the uncertain sample set by the following formula:

$$x_{LC} = \arg\max_{x \in D_{u}} \max_{k \in [1, \dots, N]} \left(1 - p_{k}^{\hat{y}_{k}}\right) \tag{6}$$

where $\hat{y}_{k} = \arg\max_{n \in [1, \dots, C]} p_{k}^{n}$, $C$ is the number of categories, and $N$ is the number of bounding boxes in an image. We first calculate the uncertainty of all objects in each image, take the largest uncertainty among the $N$ candidate objects as the score of the entire image, and finally sample the image with the largest uncertainty.
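A minimal sketch of the Least Confidence rule in Eq. (6) follows. Each image is represented by a list of per-box class-probability lists; this representation, the function names, and the choice of score 0 for images without detections are assumptions.

```python
def least_confidence(box_probs):
    """Image-level LC score: max over boxes of (1 - highest class probability).

    An image with no predicted boxes gets score 0.0 (an assumed convention).
    """
    return max((1.0 - max(p) for p in box_probs), default=0.0)

def select_least_confident(pool):
    """Return the id of the image with the largest LC score; `pool` maps id -> box_probs."""
    return max(pool, key=lambda image_id: least_confidence(pool[image_id]))

pool = {
    "img_a": [[0.9, 0.1], [0.8, 0.2]],   # both boxes confident
    "img_b": [[0.95, 0.05], [0.6, 0.4]], # one box uncertain
}
chosen = select_least_confident(pool)  # "img_b"
```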

Margin Sample (MS): Unlike the Least Confidence strategy, which only considers the category with the highest predicted classification score, Margin Sample expresses the uncertainty of a sample through the difference between the two highest prediction scores. If this difference is large, the uncertainty is low and the sample is easy to predict. Conversely, if the difference is small, the model has difficulty deciding between the two categories, so the sample is hard to classify accurately and its uncertainty is high. The informativeness of a sample is thus measured by the difference between the two categories with the highest prediction confidence. The formula is defined as follows:

$$x_{MS} = \arg\min_{x \in D_{u}} \min_{k \in [1, \dots, N]} \left(p_{k}^{\hat{y}_{k,1}} - p_{k}^{\hat{y}_{k,2}}\right) \tag{7}$$

where $\hat{y}_{k,1}$ and $\hat{y}_{k,2}$ are the two categories with the highest prediction confidence for the $k$-th box.
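The margin rule in Eq. (7) can be sketched in the same style. The per-image list of per-box probability lists is an assumed representation, each box is assumed to have at least two class probabilities, and images without detections receive the maximal margin 1.0 by convention.

```python
def margin(probs):
    """Difference between the two largest class probabilities of one box."""
    top1, top2 = sorted(probs, reverse=True)[:2]
    return top1 - top2

def margin_score(box_probs):
    """Image-level score: the smallest margin over all predicted boxes."""
    return min((margin(p) for p in box_probs), default=1.0)

def select_smallest_margin(pool):
    """Return the id of the image whose smallest margin is lowest."""
    return min(pool, key=lambda image_id: margin_score(pool[image_id]))

pool = {
    "img_a": [[0.7, 0.2, 0.1]],    # margin 0.5
    "img_b": [[0.45, 0.40, 0.15]], # margin 0.05, hardest to decide
}
chosen = select_smallest_margin(pool)  # "img_b"
```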

Entropy Strategy (ES): As the number of categories in a data set increases, Margin Sample ignores the output distribution of more and more remaining categories. Information entropy is a standard measure of uncertainty in information theory. It measures uncertainty according to the probability distribution over all output categories, and the strategy chooses the sample with the largest entropy. The sampling method is formulated as follows:

$$x_{ES} = \arg\max_{x \in D_{u}} \max_{k \in [1, \dots, N]} \left(-\sum_{n=1}^{C} p_{k}^{n} \log p_{k}^{n}\right) \tag{8}$$

For representativeness, since samples that differ significantly from one another are highly representative, we calculate the similarity of any two images in $Q_{1}$ and select several samples with the lowest similarity as the query set $Q_{2}$.
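The entropy score of Eq. (8) and the two-stage query ($Q_{1}$ by informativeness, then $Q_{2}$ by representativeness) can be sketched together. The text does not specify the similarity measure or how "lowest similarity" is aggregated, so this sketch assumes cosine similarity between per-image feature vectors and ranks each image in $Q_{1}$ by its mean similarity to the others; all names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy with natural log; zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_score(box_probs):
    """Image-level score: largest per-box entropy in one image."""
    return max((entropy(p) for p in box_probs), default=0.0)

def cosine(u, v):
    """Cosine similarity between two feature vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def query(pool_probs, features, n_informative, n_query):
    """Pick Q1 by uncertainty, then Q2 as the Q1 images least similar to the rest of Q1."""
    q1 = sorted(pool_probs, key=lambda i: entropy_score(pool_probs[i]),
                reverse=True)[:n_informative]

    def mean_similarity(i):
        others = [j for j in q1 if j != i]
        return sum(cosine(features[i], features[j]) for j in others) / max(len(others), 1)

    return sorted(q1, key=mean_similarity)[:n_query]
```

Swapping `entropy_score` for a Least Confidence or Margin Sample score changes only the first stage; the representativeness filter is unchanged.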

Finally, we label the samples in $Q_{2}$ and move them from the unlabeled data set to the labeled data set; then we train on the updated data sets with the semi-supervised learning process and obtain the final model $M_{3}$.

In general, for a given small number of labeled data sets and many unlabeled data sets, we perform supervised training on the labeled data based on the Faster RCNN algorithm to obtain an initial detector model. Its performance may not be ideal. Then in the semi-supervised learning part, we first apply strong and weak data augmentation on the unlabeled data set, respectively. For the two parts of data, the operation will be different. For the part of the strong augmentation data, we use the initial detector to predict them and get some prediction results. For these prediction results, we select some results with a confidence level higher than the set threshold and assign corresponding pseudo labels to them. We combine these pseudo labels with the weak augmentation data and train again through Faster RCNN, where the ground-truth part is replaced with pseudo labels. Next is the process of active learning; in this part, we operate on the original unlabeled data set, and predict them with the model obtained by the semi-supervised learning process to obtain some new prediction results. For these results, we first select a part of the samples with higher uncertainty through the informative criterion. These samples are relatively more likely to be mispredicted, so they need to be corrected by humans. In the selected uncertain sample set, we need to select samples with large differences according to the representative principle. Finally, the selected sample part is handed over to manual labeling. The labeled samples will be re-added to the labeled data set and deleted from the unlabeled data set. After that, we start a new round of semi-supervised learning process to retrain the updated two-part data set and obtain the final model.
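The overall procedure summarised above can be outlined as a short driver function. This is only a structural sketch: the training, query and annotation routines are passed in as callables with hypothetical signatures, and which model generates the pseudo labels in the final semi-supervised round is an assumption (here, the model from the previous round).

```python
def alssl_round(labeled, unlabeled, train_supervised, train_semi, query, annotate):
    """One pass through the three steps; all callables are supplied by the caller."""
    m0 = train_supervised(labeled)                 # initial step
    m1 = train_semi(m0, labeled, unlabeled)        # semi-supervised step
    q2 = query(m1, unlabeled)                      # active learning step
    labeled = labeled + [annotate(x) for x in q2]  # human labels for queried images
    unlabeled = [x for x in unlabeled if x not in q2]
    final = train_semi(m1, labeled, unlabeled)     # semi-supervised training again
    return final, labeled, unlabeled
```

Keeping the steps as injected callables makes it easy to swap the query strategy (LC, MS or ES) without touching the loop.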

3 Experiments

In this section, we present the experimental setup, then provide the experimental results and discussions.

3.1 Experimental setup

3.1.1 Design of experiments

In the experiments, to verify the effectiveness of our method, we compare it with several state-of-the-art approaches. In object detection tasks, mean average precision (mAP) is usually used as the criterion for evaluating model quality. By comparing the mAP of models trained with different methods, we aim to verify that our method, which combines semi-supervised learning and active learning, improves detection performance. We also use average recall (AR) as an evaluation criterion, which is the average of the recalls over IoU thresholds from 0.5 to 1.0.

We list the approaches compared in the experiments as follows.

SL: Train the labeled data set in a supervised learning method on the Faster RCNN model.
VAAL: An active learning object detection method that trains an adversarial network to discriminate samples between unlabeled and labeled data.
CALD: Query informative samples based on consistency.
STAC: The STAC semi-supervised object detection framework, which uses the labeled data set and the unlabeled data set for semi-supervised training on each data set.
UT: The Unbiased Teacher semi-supervised object detection framework, which uses the labeled data set and the unlabeled data set for semi-supervised training on the object detection task.
ALSSL: Our proposed method, a combination of semi-supervised learning and active learning. The active learning method is based on the LC, MS, and ES principles, respectively.

3.1.2 Implementation

We use Faster RCNN as the base detection model for all experiments, with the network weights initialised from an ImageNet-pretrained model. For each dataset, we take 5% of the data as the labeled set and use the remaining data as the unlabeled set. For supervised learning, we use only the labeled set for training. For active learning, starting from the initial supervised model, we apply three query strategies to the unlabeled set, select 2.5% of the data to add to the labeled set, and train a new model. In the semi-supervised part, we use the labeled set and the unlabeled set simultaneously for semi-supervised training. In our method, we apply the three active learning query strategies to the semi-supervised training results, select 2.5% of the data from the unlabeled set to add to the labeled set, and perform the semi-supervised training again.
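The labeling budget described above can be made concrete with a small sketch. It assumes that both the 5% initial split and the 2.5% query budget are fractions of the full data set and that the initial split is drawn at random; the function names, the seed, and the data set size of 10,000 images are hypothetical.

```python
import random

def initial_split(image_ids, labeled_fraction=0.05, seed=0):
    """Randomly split ids into an initial labeled set and an unlabeled pool."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    n_labeled = round(len(ids) * labeled_fraction)
    return ids[:n_labeled], ids[n_labeled:]

def query_budget(total_images, query_fraction=0.025):
    """Number of images sent for annotation in one active learning round."""
    return round(total_images * query_fraction)

# Hypothetical data set of 10,000 images.
all_ids = range(10_000)
labeled_ids, unlabeled_ids = initial_split(all_ids)
budget = query_budget(len(all_ids))
print(len(labeled_ids), len(unlabeled_ids), budget)
```

Under these assumptions, 500 images are labeled at the start and each query round adds 250 more.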

3.1.3 Datasets

Table 2 summarises the data sets used in our experiments. For each data set, we take 5% of the data as the labeled set and use the remaining data as the unlabeled set.

Table 2. The datasets used in the experiments.


PASCAL VOC: This data set contains 20 categories and has two versions, 2007 and 2012. VOC 2007 is split into three subsets: 2,501 images for training, 2,510 images for validation, and 4,952 for testing. VOC 2012 is split into 5,717 images for training, 5,823 images for validation, and 5,585 for testing.

MS-COCO: This data set contains 80 categories with challenging aspects, including dense objects and small objects with occlusion. It has 118,287 images for training or validation and 5,000 images for testing.

PlantDoc: PlantDoc is a data set of 2,569 images across 13 plant species and 31 classes (diseased and healthy) for image classification and object detection.

PKLot: The PKLot data set contains 12,416 images of parking lots extracted from surveillance camera frames, and it has three categories.

Hard Hat Workers: This is an object detection data set of workers in workplace settings that require a hard hat. Annotations also include examples of just “person” and “head” for cases where an individual may be present without a hard hat. It consists of 7,041 images with four categories.

Oxford Pets: This data set is a collection of images and annotations labeling various breeds of dogs and cats; it contains three categories and 3,680 images.

Synthetic Fruit: This data set contains 6,000 images generated with the process described in Roboflow's How to Create a Synthetic Dataset tutorial.

3.1.4 Algorithm performance evaluation criterion

Deep object detection algorithms usually use two indicators, precision and recall, to evaluate a model's performance. They are defined as follows: $\mathrm{Precision} = \frac{TP}{TP + FP}$ (9) $\mathrm{Recall} = \frac{TP}{TP + FN}$ (10) Precision is the proportion of true positives among all samples that the object detection model predicts as positive, and Recall is the proportion of true positives among all samples that are actually positive. $TP$ is the number of objects detected correctly (a positive sample detected as a positive example), $FP$ is the number of negative examples detected as objects (actually a negative example, but detected as a positive example), and $FN$ is the number of missed objects (actually a positive sample, but detected as negative).
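As a quick worked example of Equations (9) and (10), the sketch below computes both quantities from raw counts. The function names and the example counts are illustrative assumptions; returning 0.0 for an empty denominator is a convention chosen for this sketch.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct (0.0 if nothing was predicted)."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def recall(tp, fn):
    """Fraction of actual positives that were detected (0.0 if there are no positives)."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

# Hypothetical counts: 8 correct detections, 2 false alarms, 4 missed objects.
p = precision(tp=8, fp=2)
r = recall(tp=8, fn=4)
print(p, r)
```

With these counts the detector is right about 8 of its 10 predictions, yet it finds only 8 of the 12 objects that are present, which is why the two indicators are reported together.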

For a given target category, the average precision (AP) expresses the detection performance of the algorithm and is defined as: $AP = \int_{0}^{1} P(r) \, dr$ (11) where $P$ is the precision and $r$ is the recall. Over $M$ categories, performance is expressed by the mean of the per-category average precision (mAP), which is defined as: $mAP = \frac{1}{M} \sum_{i = 1}^{M} AP_{i}$ (12) Average Recall (AR) is the average of the recalls over IoU thresholds in [0.5, 1.0], which is defined as: $AR = 2 \int_{0.5}^{1} \mathrm{recall}(o) \, do = \frac{2}{n} \sum_{i = 1}^{n} \max (IoU(gt_{i}) - 0.5, 0)$ (13) where $n$ is the number of overlaps between each ground-truth bounding box in an image and the detection result closest to that ground-truth box, which corresponds to maxDets in the COCO metrics. $AR$ is an indicator of how accurately a model localises its detections.
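Equations (11) to (13) can be illustrated with a short numerical sketch. The integral in Equation (11) is approximated here with a simple rectangle rule over the supplied precision-recall points, which is one possible choice and not necessarily the interpolation used by the VOC or COCO evaluation tools. All function names and example values are assumptions.

```python
def average_precision(recalls, precisions):
    """Rectangle-rule approximation of AP = integral of P(r) dr over the given points."""
    ap, prev_recall = 0.0, 0.0
    for r, p in sorted(zip(recalls, precisions)):
        ap += (r - prev_recall) * p  # width of the recall step times its precision
        prev_recall = r
    return ap

def mean_average_precision(ap_per_class):
    """Equation (12): mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)

def average_recall(best_ious):
    """Equation (13): best_ious[i] is the IoU of ground-truth box i with its closest detection."""
    return 2.0 * sum(max(iou - 0.5, 0.0) for iou in best_ious) / len(best_ious)

# Hypothetical precision-recall points for one class, and a second class with AP 0.25.
ap = average_precision(recalls=[0.5, 1.0], precisions=[1.0, 0.5])
m_ap = mean_average_precision([ap, 0.25])
ar = average_recall([1.0, 0.75, 0.4])
print(ap, m_ap, ar)
```

In the AR example, the ground-truth box matched with IoU 0.4 contributes nothing, reflecting that overlaps below 0.5 do not count as successful localisation.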

3.2 Results and discussion

In Figures 4 and 5, we report the performance of our method and compare it with state-of-the-art methods on different data sets. In the tables, we show the comparison of mAP, $AP^{0.5}$ , AR (max = 1) and AR (max = 10) obtained by different methods on different data sets and, in Table 7, the comparison of mAP obtained on VOC2007 for different ratios of annotated data. Here mAP denotes AP at IoU = 0.5:0.95 (the primary challenge metric), $AP^{0.5}$ denotes AP at IoU = 0.5 (the PASCAL VOC metric), AR (max = 1) denotes AR given 1 detection per image, and AR (max = 10) denotes AR given 10 detections per image. The table results are reported with 12.5% of the data labeled.

Figure 4. mAP(0.5:0.95) on data sets.

Figure 5. mAP(0.5) on datasets.

Table 7. Comparison of mAP obtained on VOC2007 in terms of the ratio of annotated data.


Our method's performance is reported in Figures 4 and 5, where we compare it to state-of-the-art methods on different datasets. The tables show a comparison of mAP, $A P^{0.5}$ , AR (max = 1), and AR (max = 10) obtained by different methods on different datasets. Additionally, Table 7 compares the mAP obtained on VOC2007 for different ratios of annotated data. It is noteworthy that the labeled data represents only 12.5% of the total data.

Given the lack of extensive research in deep active learning and semi-supervised learning of object detectors, we primarily compared our method with supervised models, active learning object detection methods, and semi-supervised learning object detection frameworks. To ensure fair comparison, we used 5% of the data as the labeled set and the remaining data as the unlabeled set for all experiments.

For supervised learning, only the labeled set was used for training. For active learning, we applied three query strategies to the unlabeled set based on the initial supervised model, selected 2.5% of the data, and added it to the labeled set to train a new model. This process led to an improvement in the mAP of the model.

In the semi-supervised learning part, we used data augmentation and a consistency criterion to train the model on both the labeled and unlabeled sets simultaneously. We then used active learning, applying three query strategies to the previous results, selecting 2.5% of the data from the unlabeled set, and adding it to the labeled set. Finally, we retrained the semi-supervised model. On VOC 2007, our method improves by 4.6% over the standard method and improves slightly over recent methods such as UT and VAAL.

4 Conclusion

We propose a novel object detection method that combines semi-supervised learning and active learning to save on the high cost of labeling while utilising unlabeled data to enhance detection model performance. By applying data augmentation to unlabeled data and training based on consistency principles, we obtain pseudo labels. For active learning, we query several samples from the unlabeled data based on the informativeness and representativeness principles. After labeling these samples, we train a better detector with the updated data. We have verified the effectiveness of our method through experiments, though improvements can still be made for class-imbalance issues or with new models based on Graph Convolutional Networks.

Our approach is of considerable industrial significance in scenarios where labeled data is scarce or expensive. It can reduce annotation costs, which are a major bottleneck for many applications, including autonomous driving. The method reduces the amount of labeled data required and thus decreases the time and cost needed for development and deployment. To further enhance the applicability of our approach in industrial settings, transfer learning can be investigated: pre-training a model on a large dataset and fine-tuning it on a smaller, task-specific dataset can help improve the performance of the deep object detection model, particularly in scenarios with limited labeled data.

In addition to reducing labeling costs, our approach also has the potential to improve the accuracy and generalisation ability of the deep object detection model. By utilising both labeled and unlabeled data, it can effectively capture the underlying structure of the data and learn more robust and discriminative features.

In industrial settings, the proposed approach can be particularly useful for applications that require continuous updating of the deep object detection model, such as in surveillance systems, robotics, and industrial automation. By reducing the annotation cost and improving the performance of the model, our approach can enable more efficient and accurate object detection, leading to increased productivity and safety.

Further research can be conducted to explore the use of our proposed approach in other computer vision tasks, such as semantic segmentation, instance segmentation, and image classification. By incorporating active learning and semi-supervised learning principles, our approach can potentially enhance the performance of these tasks and reduce annotation costs.

In summary, our approach of combining active learning and semi-supervised learning for object detection is of considerable industrial significance: it has the potential to reduce annotation costs, improve performance, and enable more efficient and accurate object detection in various industrial applications.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., & Raffel, C. (2019). MixMatch: A Holistic Approach to Semi-Supervised Learning (arXiv:1905.02249). arXiv. http://arxiv.org/abs/1905.02249.
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-End Object Detection with Transformers (arXiv:2005.12872). arXiv. http://arxiv.org/abs/2005.12872.
Cordone, L., Miramond, B., & Thierion, P. (2022). Object Detection with Spiking Neural Networks on Automotive Event Data (arXiv:2205.04339). arXiv. http://arxiv.org/abs/2205.04339.
Girshick, R. (2015). Fast R-CNN. 2015 IEEE International Conference on Computer Vision (ICCV), 1440–1448. https://doi.org/10.1109/ICCV.2015.169
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. 2014 IEEE Conference on Computer Vision and Pattern Recognition, 580–587. https://doi.org/10.1109/CVPR.2014.81
Gu, B., Zhai, Z., Deng, C., & Huang, H. (2021). Efficient active learning by querying discriminative and representative samples and fully exploiting unlabeled data. IEEE Transactions on Neural Networks and Learning Systems, 32(9), 4111–4122. https://doi.org/10.1109/TNNLS.2020.3016928
Jeong, J., Lee, S., Kim, J., & Kwak, N. (2019). Consistency-based Semi-supervised Learning for Object detection. NeurIPS. https://proceedings.neurips.cc/paper/2019/hash/d0f4dae80c3d0277922f8371d5827292-Abstract.html.
Kao, C.-C., Lee, T.-Y., Sen, P., & Liu, M.-Y. (2018). Localization-Aware Active Learning for Object Detection (arXiv:1801.05124). arXiv. http://arxiv.org/abs/1801.05124.
Kim, S., Park, S., Na, B., & Yoon, S. (2019). Spiking-YOLO: Spiking Neural Network for Energy-Efficient Object Detection (arXiv:1903.06530). arXiv. http://arxiv.org/abs/1903.06530.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., & Berg, A. C. (2016). SSD: Single Shot MultiBox Detector (Vol. 9905, pp. 21–37). https://doi.org/10.1007/978-3-319-46448-0_2
Liu, Y.-C., Ma, C.-Y., He, Z., Kuo, C.-W., Chen, K., Zhang, P., Wu, B., Kira, Z., & Vajda, P. (2021). Unbiased Teacher for Semi-Supervised Object Detection (arXiv:2102.09480). arXiv. http://arxiv.org/abs/2102.09480.
McCallum, A., & Nigam, K. (1998). Employing EM and Pool-Based Active Learning for Text Classification. ICML. http://www.kamalnigam.com/papers/emactive-icml98.pdf.
Miyato, T., Maeda, S.-I., Koyama, M., & Ishii, S. (2019). Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8), 1979–1993. https://doi.org/10.1109/TPAMI.2018.2858821
Parvaneh, A., Abbasnejad, E., Teney, D., Haffari, R., Hengel, A. v. d., & Shi, J. Q. (2022). Active Learning by Feature Mixing (arXiv:2203.07034). arXiv. http://arxiv.org/abs/2203.07034.
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 779–788. https://doi.org/10.1109/CVPR.2016.91
Redmon, J., & Farhadi, A. (2017). YOLO9000: Better, Faster, Stronger. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 6517–6525. https://doi.org/10.1109/CVPR.2017.690
Rei, M. (2017). Semi-supervised Multitask Learning for Sequence Labeling (arXiv:1704.07156). arXiv. http://arxiv.org/abs/1704.07156.
Ren, S., He, K., Girshick, R., & Sun, J. (2017). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), 1137–1149. https://doi.org/10.1109/TPAMI.2016.2577031
Rhee, P. K., Erdenee, E., Kyun, S. D., Ahmed, M. U., & Jin, S. (2017). Active and semi-supervised learning for object detection with imperfect data. Cognitive Systems Research, 45, 109–123. https://doi.org/10.1016/j.cogsys.2017.05.006
Sener, O., & Savarese, S. (2018). Active Learning for Convolutional Neural Networks: A Core-Set Approach (arXiv:1708.00489). arXiv. http://arxiv.org/abs/1708.00489.
Shen, L., Zhong, M., & Chaojie, Y. (2022, August). Conversion of CNN to SNN for aerospace object detection tasks. In 2022 5th International Conference on Pattern Recognition and Artificial Intelligence (PRAI), IEEE, 919–923. https://doi.org/10.1109/PRAI55851.2022.9904132
Sinha, S., Ebrahimi, S., & Darrell, T. (2019). Variational adversarial active learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 5971–5980. https://doi.org/10.1109/ICCV.2019.00607
Sohn, K., Zhang, Z., Li, C.-L., Zhang, H., Lee, C.-Y., & Pfister, T. (2020). A Simple Semi-Supervised Learning Framework for Object Detection (arXiv:2005.04757). arXiv. http://arxiv.org/abs/2005.04757
Tang, P., Ramaiah, C., Wang, Y., Xu, R., & Xiong, C. (2021). Proposal learning for semi-supervised object detection. 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), 2290–2300. https://doi.org/10.1109/WACV48630.2021.00234
Tang, Y., Wang, J., Gao, B., Dellandrea, E., Gaizauskas, R., & Chen, L. (2016). Large scale semi-supervised object detection using visual and semantic knowledge transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2119–2128. https://doi.org/10.1109/CVPR.2016.233
Wang, K., Zhang, D., Li, Y., Zhang, R., & Lin, L. (2017). Cost-Effective active learning for deep image classification. IEEE Transactions on Circuits and Systems for Video Technology, 27(12), 2591–2600. https://doi.org/10.1109/TCSVT.2016.2589879
Yoo, D., & Kweon, I. S. (2019). Learning Loss for Active Learning (arXiv:1905.03677). arXiv. http://arxiv.org/abs/1905.03677.
Yu, W., Zhu, S., Yang, T., & Chen, C. (2022). Consistency-based Active Learning for Object Detection (arXiv:2103.10374). arXiv. http://arxiv.org/abs/2103.10374.
Yuan, T., Wan, F., Fu, M., Liu, J., Xu, S., Ji, X., & Ye, Q. (2021). Multiple instance active learning for object detection (arXiv:2104.02324). arXiv. http://arxiv.org/abs/2104.02324.
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., & Dai, J. (2021). Deformable DETR: Deformable Transformers for End-to-End Object Detection (arXiv:2010.04159). arXiv. http://arxiv.org/abs/2010.04159.
