Research Article

Monocular 3D object detection with thermodynamic loss and decoupled instance depth

Article: 2316022 | Received 05 Jun 2023, Accepted 02 Feb 2024, Published online: 13 Feb 2024

Abstract

Monocular 3D detection aims to obtain the 3D information of an object from a single image. Mainstream methods mainly use L1 loss or L1-like losses to supervise the instance depth prediction. However, these methods have not achieved satisfactory results. One main reason is that L1 loss or L1-like losses do not accurately reflect the fit between the predicted instance depth and the corresponding ground truth. Another is that the instance depth is hard to learn directly from the RGB image. To solve these problems, a novel thermodynamic loss based on the principle of free energy minimisation and a novel depth decoupling method are proposed in this paper. The proposed method is called the monocular 3D object detection network with thermodynamic loss and decoupled instance depth (TDN). In TDN, the optimisation of the instance depth prediction is regarded as a thermodynamic process, and the thermodynamic loss is therefore designed according to the principle of free energy minimisation. TDN decouples the instance depth into three different depths. By combining the thermodynamic loss and the different types of depths, we obtain the final instance depth.

1. Introduction

3D detection aims to locate and identify objects in 3D space, and is mainly used to help machines perceive and understand the surrounding environment. It is one of the foundations of 3D perception. In order to precisely locate the position of an object in 3D space, some methods use LiDAR sensors (Liu et al., 2023; Shin et al., 2019; Wang et al., 2023; Yan et al., 2018; Yang et al., 2018; Zhou & Tuzel, 2018) or stereo matching (Chen et al., 2018, 2023; Li et al., 2019; Xu & Chen, 2018) to obtain accurate 3D information of objects. Although 3D detection methods based on LiDAR or stereo matching can achieve higher detection accuracy, the sensors used by these methods are expensive and power-consuming, and the overall system is relatively complex. Monocular 3D object detection is a branch of 3D detection. Compared with the above detection methods, the monocular detection method only uses a camera to complete the 3D detection task. Because monocular 3D detection has the advantages of low cost, easy maintenance and strong generalisation ability, it has attracted more and more attention from industry and academia and has become a research hotspot.

Since the camera cannot directly obtain the 3D information of objects, monocular 3D object detection is a challenging task. Depth estimation (Wang et al., 2021a) is the core task of monocular 3D detection. Since the image only provides 2D visual information and lacks direct depth information, it is difficult to accurately estimate the depth of objects. According to the depth estimation strategy adopted, monocular 3D detection methods are mainly divided into methods based on geometric constraints and methods based on pseudo-LiDAR (Jiang et al., 2022; Miao et al., 2023). The geometric constraint-based methods mainly use keypoint regression (Liu et al., 2020), geometry constraints (Lu et al., 2021) and shape constraints (Liu et al., 2021) to estimate the depths of objects. The pseudo-LiDAR-based methods mainly use depth maps to generate pseudo point clouds, and then estimate the depths of objects from the pseudo point cloud data (Chen et al., 2022; Liu et al., 2022). Although these methods have achieved some results, the accuracy of the depths estimated by these methods still cannot meet practical requirements.

Currently, monocular 3D object detection methods mainly use L1 loss or L1-like loss to control the prediction of the instance depth. Since L1 loss or L1-like loss only calculates the distance between the predicted instance depth and the ground truth (GT), it ignores the implicit relationship between the predicted instance depth and GT. Hence, it is difficult for L1 loss or L1-like loss to truly reflect the fitting degree between the predicted instance depth and the ground truth. It is necessary to design a new loss that reflects the actual fitting situation. Since the instance depth on the image is not intuitive, it is difficult for the network to learn effectively. It is necessary to design a new representation of the instance depth to reduce the learning pressure of the network.

The instance depth prediction is a complex optimisation problem and it is difficult to find the optimal solution. Since the network learning situation reflected by L1 loss or L1-like loss is different from the actual situation, it further increases the difficulty of finding the optimal solution, and the optimisation algorithm is more likely to fall into a local optimum. The optimisation process has certain similarities with the thermodynamic process. During the optimisation process, the behaviour of optimising the loss function is similar to the behaviour of reducing the internal energy in a thermodynamic system. The goal of optimisation is to minimise loss, while the goal of reducing internal energy in thermodynamics is to bring the system to an equilibrium state. These two goals are very similar. Therefore, the theories from thermodynamics can be used to guide the optimisation process of the network.

Since the distribution of the predicted instance depth is relatively complex, directly predicting the instance depth places great learning pressure on the network, making accurate prediction difficult. Following the divide-and-conquer principle, the instance depth is decoupled into multiple different depths that are predicted separately; that is, the complex distribution is decomposed into multiple distributions that are learned separately, which reduces the learning pressure of the network and improves the accuracy of its predictions.

Based on the above considerations, this paper proposes a novel thermodynamic loss based on the principle of free energy minimisation and a novel depth decoupling method. The proposed method is called the monocular 3D object detection network with thermodynamic loss and decoupled instance depth (TDN). In TDN, the instance depth optimisation process is analogised to a thermodynamic process and follows the principle of free energy minimisation. On this basis, we propose a novel thermodynamic loss based on this principle to control the instance depth optimisation process. The proposed thermodynamic loss uses free energy to measure the fit between the predicted instance depth and the GT. Compared with L1 loss or L1-like losses, the proposed thermodynamic loss can effectively improve the accuracy of the instance depth prediction. In order to enhance the network's ability to learn the instance depth, TDN adopts a new depth decoupling method that decouples the instance depth into the instance surface depth (surface depth), the surface offset depth (offset depth) and the instance error depth (error depth). TDN uses different branches to predict the corresponding depths. This multi-branch prediction effectively reduces the learning pressure of the network.

We conduct experiments on the KITTI 3D object detection dataset (Geiger et al., 2012). Experimental results demonstrate that TDN can achieve state-of-the-art (SOTA) performance on the monocular 3D object detection and bird's eye view (BEV) tasks.

The contributions of this paper are summarised as follows:

  • We analogise the instance depth optimisation process to a thermodynamic process and introduce the principle of free energy minimisation to design the instance depth loss. We propose a new thermodynamic loss based on the principle of free energy minimisation that allows the network to fit the instance depth more accurately.

  • We propose a new depth decoupling method. Our method decouples the instance depth into three different depths, and the network uses three branches to learn the corresponding depth distributions respectively. Our method effectively reduces the learning pressure of the network.

2. Related works

Monocular 3D object detection aims to predict the 3D bounding boxes of objects from a single image. Recently, monocular 3D object detection has been extensively studied with many results. Monocular 3D object detection can be roughly divided into geometric constraint-based methods and depth-based methods.

2.1. Geometric constraint-based methods

Since it is difficult for monocular 3D object detection to directly predict the instance depth, some methods introduce geometric constraints to predict depth. The geometric priors mainly use the keypoint information of the object or the projection of the object on the image plane to construct the depth information. By introducing geometric prior constraints, the depth prediction is transformed into a regression problem.

Movi-3D (Simonelli et al., 2020) uses geometric information to generate virtual views to assist the detection task, reducing the variation in visual size due to distance. M3D-RPN (Brazil & Liu, 2019) proposes depth-aware convolution and sets novel 3D anchors to alleviate the problem of 3D parameter prediction and improve the accuracy of 3D detection. MonoPair (Chen et al., 2020) obtains more accurate object pose and shape information through the geometric spatial relationships between adjacent object pairs.

However, the above methods rely heavily on feature engineering, which hinders their extension to general scenarios. To solve this problem, several methods have been proposed. MonoCon (Liu et al., 2022) adds auxiliary tasks of learning monocular contexts to training to aid 3D object detection; in these tasks, a set of well-posed projected 2D supervision signals is used for supervised training. M3OD (Li et al., 2022a) obtains multiple depths based on different attributes of the object and selects the optimal depth from them. With the introduction of uncertainty estimation, many current depth prediction methods are uncertainty-based. GupNet (Lu et al., 2021) proposes a geometry uncertainty projection module to reduce the depth error amplification caused by the height error in the geometric projection formula. DID-M3D (Peng et al., 2022a, 2022b) decouples the instance depth into the visual surface depth and the attribute depth; this depth decoupling method improves the accuracy of depth prediction by associating the depths with the visual cues provided by images. DCD (Li et al., 2022b) uses the keypoint information of the object in the 2D image to constrain the relationship between the depth values of different parts of the object in 3D space, and then generates multiple depth candidates to improve the 3D detection accuracy.

2.2. Depth-based methods

Depth-based methods can also be called pseudo-LiDAR methods. The accuracy of the depth map has a direct impact on the 3D detection accuracy, and many approaches use depth maps directly in the framework to enhance 3D object detection accuracy. The depth-based methods (Ma et al., 2020; Reading et al., 2021; Wang et al., 2020) use depth maps to introduce depth information of objects into monocular 3D detection; among these methods, the depth map effectively improves the 3D detection accuracy. Ding et al. (2020) propose the depth-guided dynamic-depthwise-dilated LCN, which overcomes the limitations of conventional 2D convolutions and narrows the gap between the image representation and the 3D point cloud representation. MonoDAJD (Lei et al., 2021) uses a consistency-aware joint detection mechanism to detect objects in the image and in the depth map, and utilises localisation information from the depth detection to optimise the detection results. ROI-10D (Manhardt et al., 2019) uses fused feature maps extracted from the 2D detector and the monocular depth prediction network to regress 3D bounding boxes. UR3D (Shi et al., 2020) proposes a distance-normalised unified representation which reduces the nonlinearity of distance by fusing the images and the estimated depth maps, significantly easing distance learning. Ouyang et al. (2021) propose an adaptive depth-guided instance normalisation layer and a dynamic depth transformation module to estimate the 3D properties of objects and improve the accuracy of depth prediction. Wang et al. (2021b) combine the generated 2D bounding boxes and the estimated depth maps as the input of a 3D detector to achieve 3D detection. CaDDN (Reading et al., 2021) obtains more accurate depth estimation by predicting the category of each object and the depth distribution within each category, and uses the gridded depth prediction to generate high-quality BEV images for 3D detection. Pseudo-Stereo-3D (Chen et al., 2022) constructs a pseudo-stereo image by using the left view to generate a virtual right view, and proposes disparity dynamic convolution to generate the features of the virtual right view; the constructed pseudo-stereo pair is then fed to a stereo 3D detector.

SGM3D (Zhou et al., 2022) utilises 3D features extracted from stereo images to enhance the features learned from monocular images. SGM3D also proposes a multi-granularity domain adaptation module (MG-DA) and an IoU matching-based alignment module (IoU-MA) to enforce the network to mimic stereo representations based on monocular images. Although these depth-based methods can improve the performance of the network, they are relatively complex and require an additional network to generate the depth maps. Hence, these methods usually require more training data and longer inference time.

It can be seen that many geometric constraint-based methods and depth-based methods have been proposed and have achieved impressive results in the monocular 3D detection task. In comparison, TDN introduces thermodynamic theory as a design guide for the instance depth loss and proposes a novel depth decoupling method to improve the learning efficiency of the network, which gives it a great advantage in improving the accuracy of 3D detection.

3. Method

3.1. Overview

For monocular 3D detection, inaccurate depth estimation is the main reason for the decline in detection accuracy. Hence, depth estimation is the most critical issue in monocular 3D detection. Since it is difficult to infer 3D information from a 2D image, improving the accuracy of depth estimation is a challenging task. In order to improve the accuracy of depth prediction, this paper proposes the monocular 3D object detection network with thermodynamic loss and decoupled instance depth (TDN). The core concept of TDN is to use the principle of free energy minimisation to guide depth estimation. Therefore, a novel thermodynamic loss based on the principle of free energy minimisation and a corresponding depth decoupling method are proposed in TDN. The new loss function better reflects the degree of fit between the predicted depth and the GT, and the decoupled depth improves the learning efficiency of the network. These two methods effectively improve the accuracy of depth prediction. The overview of the proposed TDN model is shown in Figure 1.

Figure 1. The framework of TDN.

As shown in Figure 1, the overall network architecture can be divided into two branches: the 2D detection branch and the 3D detection branch. The 2D detection branch is used to generate 2D bounding boxes, and the 3D detection branch uses RoI (regions of interest) features to generate 3D bounding boxes.

The key points of our approach are described in detail as follows.

3.2. 2D detection branch

The 2D detection backbone of TDN is built on CenterNet (Zhou et al., 2019). Like CenterNet, TDN uses DLA-34 (Yu et al., 2018) as the backbone to extract the features of the objects. As shown in Figure 1, TDN includes three detection branches to detect the information of the 2D bounding boxes: (1) the heatmap branch indicates the coarse locations and confidences of the objects in the given image; (2) the offset branch refines the coarse locations into accurate keypoints; (3) the 2D size branch predicts the size of each box. Different from CenterNet, the keypoint in the 2D detection backbone of TDN is defined as the centre of the 3D bounding box's projection on the image, rather than the centre of the 2D bounding box. Since the 2D information of the object needs to be used to recover its 3D information, the 2D information must be strongly correlated with the 3D information. Compared with the centre of the 3D bounding box's projection on the image, the centre of the 2D bounding box does not have a strong relationship with the 3D bounding box, which results in a larger error in the recovered 3D bounding box. The loss functions of the 2D detection are the same as those of CenterNet.

To obtain object-level features, RoIAlign (Ren et al., 2017) is used to crop and resize the RoI features. For monocular depth estimation, the RoI features lack location and size cues (Van Dijk & De Croon, 2019). Therefore, a normalised coordinate map is concatenated with each RoI feature in a channel-wise manner to generate the input for the 3D detection branch. The coordinate map consists of the coordinates formed by projecting the upper-left and lower-right corners of the bounding box generated by CenterNet into 3D space. The main function of the coordinate map is to provide coarse information about the 3D bounding box based on the bounding box generated by CenterNet; this coarse information is beneficial to the subsequent generation of accurate 3D bounding box information. A minimal sketch of this step is given below.
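The snippet below is a minimal, hypothetical sketch of this RoI feature preparation, written with PyTorch and torchvision's `roi_align`. For simplicity it concatenates a normalised pixel-coordinate grid rather than the projected 3D box-corner coordinates used in the paper; the names (`feat`, `boxes_2d`, `roi_size`) are illustrative and not the authors' code.

```python
import torch
from torchvision.ops import roi_align

def build_3d_branch_input(feat, boxes_2d, spatial_scale, roi_size=7):
    # feat: [B, C, H, W] backbone feature map
    # boxes_2d: [K, 5] RoIs as (batch_idx, x1, y1, x2, y2) in image coordinates
    roi_feat = roi_align(feat, boxes_2d, output_size=roi_size,
                         spatial_scale=spatial_scale, aligned=True)            # [K, C, 7, 7]

    # Simplified coordinate map: a per-RoI grid of normalised (u, v) positions,
    # supplying the location/size cues that the cropped RoI features lack.
    num_rois = roi_feat.shape[0]
    ys = torch.linspace(0, 1, roi_size, device=feat.device)
    xs = torch.linspace(0, 1, roi_size, device=feat.device)
    v, u = torch.meshgrid(ys, xs, indexing="ij")
    coord_map = torch.stack([u, v], dim=0).unsqueeze(0).expand(num_rois, -1, -1, -1)

    # Channel-wise concatenation forms the input of the 3D detection branch.
    return torch.cat([roi_feat, coord_map], dim=1)                             # [K, C + 2, 7, 7]
```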

3.3. 3D detection branch

The 3D detection branch is used to predict the 3D bounding boxes of the objects in 3D space. The information of a 3D bounding box includes the 3D centre location $(x_{3d}, y_{3d}, z)$, the dimensions $(h, w, l)$ and the orientation $\theta$, where $z$ is the instance depth, $h$ is the height, $w$ is the width and $l$ is the length of the 3D bounding box. The 3D bounding box is parameterised as $(x_{3d}, y_{3d}, z, h, w, l, \theta)$.

Let $(u_{2d}, v_{2d})$ denote the coordinates of the 2D bounding box generated by the 2D detection branch, and $(u_{3d}, v_{3d})$ denote the coordinates of the centre of the 3D bounding box's projection on the image. The 3D projected centre offset branch is used to predict the offset $(\delta_{3d}^{u}, \delta_{3d}^{v})$ of the centre of the 3D bounding box's projection on the image. $(u_{3d}, v_{3d})$ is formulated as follows:
$$u_{3d} = u_{2d} + \delta_{3d}^{u}, \qquad v_{3d} = v_{2d} + \delta_{3d}^{v} \tag{1}$$

Let $(x_{3d}, y_{3d}, z)^{T}$ denote the coordinates of the centre of the 3D bounding box in the camera coordinate system. The camera intrinsic parameter matrix $K$ is used to project the point $(u_{3d}, v_{3d})$ into the camera coordinate system. The 3D point $(x_{3d}, y_{3d}, z)^{T}$ is defined as:
$$x_{3d} = \frac{(u_{3d} - c_{x})\, z}{f_{x}}, \qquad y_{3d} = \frac{(v_{3d} - c_{y})\, z}{f_{y}}, \qquad z = depth \tag{2}$$
where $(c_{x}, c_{y})$ is the pixel location of the camera centre, and $f_{x}$ and $f_{y}$ represent the horizontal and vertical focal lengths of the camera, respectively.
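As a worked illustration of Equations (1)–(2), the short function below recovers the projected 3D centre from the 2D coordinates plus the predicted offset and back-projects it into the camera frame; variable names are illustrative and the snippet is not the authors' implementation.

```python
def recover_3d_center(u2d, v2d, delta_u, delta_v, depth, fx, fy, cx, cy):
    # Eq. (1): projected 3D centre on the image = 2D coordinates + predicted offset
    u3d = u2d + delta_u
    v3d = v2d + delta_v
    # Eq. (2): back-project to camera coordinates using the intrinsics and the instance depth
    x3d = (u3d - cx) * depth / fx
    y3d = (v3d - cy) * depth / fy
    return x3d, y3d, depth
```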

The loss function $L_{offset3d}$ of the 3D projected centre offset branch is defined as:
$$L_{offset3d} = L_{1}(o, o^{gt}) \tag{3}$$
where $o^{gt}$ is the ground-truth offset, $o$ is the predicted offset, and $L_{1}(\cdot)$ represents the L1 loss with "$\cdot$" indicating the input.

For the orientation, the network predicts the observation angle and uses the multi-bin loss $L_{\theta}$ (Mousavian et al., 2017). The 3D size branch predicts the size of the 3D bounding box. Like SMOKE (Liu et al., 2020), a pre-calculated category-wise average size $[\bar{h}, \bar{w}, \bar{l}]^{T}$ is used to recover the object size, where $\bar{h}$, $\bar{w}$ and $\bar{l}$ are the average height, width and length of all 3D bounding boxes in the whole dataset. In fact, the 3D size branch outputs the residual size offset $[\delta_{h}, \delta_{w}, \delta_{l}]^{T}$. TDN utilises the residual size offset and the pre-calculated category-wise average size to generate the size of the 3D bounding box, which is formulated as follows:
$$[h,\ w,\ l]^{T} = [\bar{h} + \delta_{h},\ \bar{w} + \delta_{w},\ \bar{l} + \delta_{l}]^{T} \tag{4}$$
The loss function $L_{size3d}$ of the 3D size branch is defined as:
$$L_{size3d} = L_{1}(w, w^{gt}) + L_{1}(h, h^{gt}) + L_{1}(l, l^{gt}) \tag{5}$$
where $w^{gt}$, $h^{gt}$ and $l^{gt}$ are the dimensions of the ground-truth 3D box, and $w$, $h$ and $l$ are the dimensions of the predicted 3D box.
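As a small sketch of Equations (4)–(5), the size head can be decoded and supervised as follows (assuming PyTorch tensors; `mean_hwl`, `delta_hwl` and `gt_hwl` are illustrative names, not the authors' code):

```python
import torch.nn.functional as F

def decode_and_supervise_size(mean_hwl, delta_hwl, gt_hwl):
    # mean_hwl:  [K, 3] pre-calculated category-wise average size (h̄, w̄, l̄)
    # delta_hwl: [K, 3] predicted residual size offset (δh, δw, δl)
    # gt_hwl:    [K, 3] ground-truth 3D box size
    pred_hwl = mean_hwl + delta_hwl            # Eq. (4): recovered 3D size
    loss_size3d = F.l1_loss(pred_hwl, gt_hwl)  # Eq. (5): L1 supervision on (h, w, l)
    return pred_hwl, loss_size3d
```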

3.4. Decoupled depth

In the monocular 3D detection task, the instance depth is the depth of the centre point of the 3D bounding box. TDN uses the depth value in the depth map at the centre point coordinates of the 2D bounding box to assist depth estimation. The pixel information in the depth map is expressed as the depth distance from the camera to the surface of the object, which is slightly different from the distance from the camera to the centre of the object. This error will lead to a decrease in the accuracy of depth estimation.

Therefore, inspired by DID-M3D, we propose a new depth decoupling method. In our method, the instance depth is decoupled into three parts: the instance surface depth patch (surface depth) $d_{s}$, the surface offset depth patch (offset depth) $d_{o}$ and the instance error depth patch (error depth) $d_{e}$, with $d_{s}, d_{o}, d_{e} \in \mathbb{R}^{7 \times 7}$. The sum of the surface depth patch and the offset depth patch constitutes the depth value at the centre point coordinates of the 3D bounding box on the depth map. The error depth patch is the error between the depth value at the centre point coordinates of the 3D bounding box and the real depth of the centre point of the 3D bounding box. The decoupled depth is shown in Figure 2. The instance depth patch is defined by:
$$d_{ins} = d_{s} + d_{o} + d_{e} \tag{6}$$
where $d_{ins} \in \mathbb{R}^{7 \times 7}$.

Figure 2. Visualisation of decoupled depth.

Three different detection heads in the 3D detection branch are used to regress these depths. Since the distribution of the instance depth is a coupling of multiple different depth distributions, using only a single detection head would impose greater learning pressure on the network and make it difficult to learn the complex distribution effectively. After decoupling the instance depth, the corresponding pixels in the depth map have a stronger correlation with the decoupled depths, and each detection head only needs to learn a depth with a single distribution, which reduces the learning pressure and improves the learning efficiency of the network.

Since the surface depth is difficult to regress directly, it is discretized. Like CaDDN, linear-increasing discretization (LID; Tang et al., 2021) is used to discretize the surface depth. LID is defined as:
$$d_{c} = d_{min} + \frac{d_{max} - d_{min}}{D(D+1)}\, d_{i}(d_{i} + 1) \tag{7}$$
where $d_{c}$ is the continuous surface depth value, $(d_{min}, d_{max})$ is the full surface depth range to be discretized, $D$ is the number of depth bins and $d_{i}$ is the depth bin index. In fact, LID divides the full surface depth range into $D$ bins, and $d_{c}$ represents the surface depth corresponding to the $i$-th bin. With the LID method, the surface depth prediction is transformed from a regression task into a classification task. Hence, the surface depth head is used to predict the bin to which the surface depth belongs and to output the surface depth corresponding to that bin.

Since each bin contains a range of surface depth values and $d_{c}$ is the numerical representation corresponding to the bin, there is a big difference between the surface depth value obtained by classification and the surface depth value obtained by regression. Therefore, it is necessary to convert the surface depth value obtained by classification into an actual predicted surface depth value. The conversion formula can be expressed as follows:
$$d_{s} = d_{min} + \frac{d_{max} - d_{min}}{D(D+1)}\, d_{c}(d_{c} + 1) \tag{8}$$
According to Equation (8), $d_{c}$ is converted into the actual predicted surface depth $d_{s}$.
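A minimal sketch of the LID mapping in Equations (7)–(8) is given below: one helper assigns a continuous surface depth to a bin index (for the classification target), and the other maps a predicted bin back to a surface depth value. The helper names are illustrative.

```python
import math

def lid_depth_to_bin(d, d_min, d_max, num_bins):
    # Invert Eq. (7): find the bin index i with d ≈ d_min + (d_max - d_min) / (D(D+1)) * i(i+1)
    x = (d - d_min) * num_bins * (num_bins + 1) / (d_max - d_min)   # equals i * (i + 1)
    i = int((-1.0 + math.sqrt(1.0 + 4.0 * x)) / 2.0)                # positive root of i^2 + i - x = 0
    return max(0, min(num_bins - 1, i))

def lid_bin_to_depth(i, d_min, d_max, num_bins):
    # Eq. (7)/(8): surface depth represented by bin i under linear-increasing discretization
    return d_min + (d_max - d_min) / (num_bins * (num_bins + 1)) * i * (i + 1)
```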

The offset depth patch $d_{o}$ refers to the offset of the predicted surface depth patch $d_{s}$ relative to the corresponding depth in the depth map. The surface offset depth patch $d_{o}$ is defined by:
$$d_{o} = d_{depth} - d_{s} \tag{9}$$
where $d_{depth}$ is the depth in the depth map (the depth value is padded to $7 \times 7$).

The error depth patch $d_{e}$ refers to the error between the depth of the centre point of the 3D bounding box and the corresponding depth in the depth map. The error depth patch $d_{e}$ is defined by:
$$d_{e} = d_{gt} - d_{depth} \tag{10}$$
where $d_{gt}$ is the ground-truth instance depth. The offset depth patch head and the error depth patch head are used to directly regress $d_{o}$ and $d_{e}$.
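Putting Equations (6), (9) and (10) together, a hypothetical sketch of how the decoupled depth patches relate to the depth map and the GT instance depth could look as follows (7×7 PyTorch tensors; names are illustrative, not the authors' code):

```python
import torch

def decoupled_depths(d_depth, d_s, d_gt):
    # d_depth: [7, 7] depth-map values padded over the RoI
    # d_s:     [7, 7] surface depth patch from the LID-based head
    # d_gt:    scalar GT instance depth (centre of the 3D bounding box)
    d_o = d_depth - d_s        # Eq. (9): surface offset depth patch
    d_e = d_gt - d_depth       # Eq. (10): instance error depth patch
    d_ins = d_s + d_o + d_e    # Eq. (6): recombined instance depth patch
    return d_o, d_e, d_ins
```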

We employ an uncertainty-aware regression loss to supervise the regression of all depths after decoupling:
$$L_{d} = \frac{2\,|d_{i} - d_{i}^{gt}|}{\sigma_{i}} + \log \sigma_{i} \tag{11}$$
where $d_{i}$ is the $i$-th depth patch, $d_{i}^{gt}$ is the GT depth patch corresponding to the $i$-th depth patch and $\sigma_{i}$ is the uncertainty patch corresponding to the $i$-th depth patch. Let $\sigma_{patch}$ be the uncertainty patch of the instance depth patch ($\sigma_{patch}$ is the sum of the three uncertainty patches corresponding to the three depth patches). In order to obtain the final instance depth, $\sigma_{patch}$ needs to be converted into the confidence patch $p_{patch}$, which is defined by:
$$p_{patch} = \exp(-\sigma_{patch}) \tag{12}$$

The instance depth patch output by TDN is then combined with the corresponding confidence patch to obtain the final instance depth. The final instance depth is defined by:
$$d = \frac{\sum \left(d_{ins} \odot p_{patch}\right)}{\sum p_{patch}} \tag{13}$$

The confidence corresponding to the instance depth is defined by:
$$p_{3d} = \frac{\sum \left(p_{patch} \odot p_{patch}\right)}{\sum p_{patch}} \tag{14}$$
$$score = p_{3d} \times p_{2d} \tag{15}$$
where $p_{2d}$ is the 2D confidence.
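The snippet below is a hedged sketch of Equations (11)–(15) as reconstructed above: the uncertainty-aware loss is applied per decoupled depth patch, the summed uncertainty patch is turned into a confidence patch, and the final instance depth and 3D score follow from a confidence-weighted aggregation. The exact aggregation in Equations (13)–(14) is an assumption based on the reconstructed formulas, and the names are illustrative.

```python
import torch

def uncertainty_depth_loss(d_pred, d_gt, sigma):
    # Eq. (11), applied to each decoupled depth patch; sigma is a positive uncertainty patch
    return (2.0 * (d_pred - d_gt).abs() / sigma + sigma.log()).mean()

def aggregate_instance_depth(d_ins, sigma_patch, p_2d):
    # d_ins, sigma_patch: [7, 7]; sigma_patch is the sum of the three uncertainty patches
    p_patch = torch.exp(-sigma_patch)                    # Eq. (12): confidence patch
    d = (d_ins * p_patch).sum() / p_patch.sum()          # Eq. (13): confidence-weighted instance depth
    p_3d = (p_patch * p_patch).sum() / p_patch.sum()     # Eq. (14): instance depth confidence
    score = p_3d * p_2d                                  # Eq. (15): final detection score
    return d, p_3d, score
```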

3.5. Thermodynamic loss based on the principle of free energy minimisation

Due to the lack of depth information in the detection process, it is difficult for the L1 loss or L1-like losses to accurately reflect the fit of the network, which in turn makes it difficult for the network to accurately predict depth information. In order to solve this problem, TDN analogises the depth prediction to a thermodynamic process and proposes a new thermodynamic loss function based on the principle of free energy minimisation.

In TDN, the optimisation of depth prediction is analogous to a closed system that exchanges energy with its surroundings. In a closed system, the exchange of energy can be expressed as a change in free energy. The free energy $F$ of the closed system is calculated by:
$$F = E - TH \tag{16}$$
where $E$ is the energy of the system, $H$ is the thermodynamic entropy of the system, and $T$ is the temperature. The free energy follows the principle of free energy minimisation, which is briefly introduced as follows:

For a closed system that exchanges heat with the outside while maintaining a constant temperature, the state of the system always changes spontaneously in the direction of decreasing free energy and the system reaches equilibrium when the free energy reaches a minimum.

According to the principle of free energy minimisation, any change in the system can be regarded as the result of competition between E and H, and the temperature T determines the relative weights of the two. This competition mechanism effectively coordinates the conflict between E and H, allowing the system to eventually reach a stable state of low energy and low entropy.

Like a closed system, the optimisation of depth prediction also follows the principle of free energy minimisation. Therefore, the principle of free energy minimisation is used to design the instance depth loss function.

The thermodynamic loss function based on the principle of free energy minimisation is formulated as follows:
$$L_{tl} = E - \frac{T}{\|d - d^{gt}\|_{2}\, s + 1} \tag{17}$$
where $E$ represents the energy in the optimisation process, $T$ is the temperature parameter and is a constant, $d$ is the predicted instance depth, $s$ is the instance depth uncertainty and $d^{gt}$ is the GT depth.

According to the energy-based model (EBM; Joseph et al., 2021), we design a novel energy calculation method, formulated as follows:
$$E = T \log \sum_{i=1}^{n} \exp\!\big(|d^{gt} - d_{i}|\big) \tag{18}$$
where $d_{i}$ belongs to the depth values in the instance depth patch $d_{ins}$. Since the uncertainty $s$ of the instance depth is used to indicate the quality of the instance depth prediction, it plays a role similar to entropy in thermodynamics. The uncertainty $s$ is formulated as follows:
$$s = \mathrm{mean}(\sigma_{patch}) \tag{19}$$

The uncertainty $s$ is used to design the entropy term in the thermodynamic loss. The entropy is formulated as follows:
$$H = \frac{1}{\|d - d^{gt}\|_{2}\, s + 1} \tag{20}$$
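A minimal sketch of the thermodynamic loss, following Equations (17)–(20) as reconstructed above, is shown below. The energy uses the EBM-style log-sum-exp over the instance depth patch, the entropy term uses the mean uncertainty, and the loss takes the free-energy form L_tl = E − T·H. Tensor names are illustrative and this is not the authors' implementation.

```python
import torch

def thermodynamic_loss(d_ins, sigma_patch, d, d_gt, T=3.8918):
    # d_ins:       [7, 7] instance depth patch
    # sigma_patch: [7, 7] summed uncertainty patch
    # d:           aggregated instance depth (scalar tensor), d_gt: GT instance depth
    E = T * torch.log(torch.exp((d_gt - d_ins).abs()).sum())   # Eq. (18): energy over the patch
    s = sigma_patch.mean()                                      # Eq. (19): instance depth uncertainty
    H = 1.0 / ((d - d_gt).abs() * s + 1.0)                      # Eq. (20): entropy term
    return E - T * H                                            # Eq. (17): free-energy form E - T*H
```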

As the predicted instance depth gradually approaches the GT instance depth, the entropy gradually decreases and the energy also gradually decreases. In this case, the thermodynamic loss is also reduced, which shows that the thermodynamic loss obeys the principle of free energy minimisation. Figure 3 shows the relationship between the energy and the entropy.

Figure 3. Visualisation of optimisation process. Darker colours represent higher energy, lighter colours represent lower energy.

As shown in Figure 3, the depth values in the left patch differ widely from each other and are far from the GT values, which shows that the left patch has greater energy. As optimisation proceeds, the patch releases energy to the outside and the entropy gradually decreases. The depth values in the right patch tend to be consistent and close to the GT value, which shows that the right patch has lower energy. This means that the right patch is already in a stable state of low energy and low entropy.

The overall loss of TDN is:
$$L = L_{Heatmap} + L_{offset2d} + L_{size2d} + L_{d} + L_{tl} + L_{size3d} + L_{offset3d} + L_{\theta} \tag{21}$$

Although our approach is inspired by DID-M3D, it differs from DID-M3D significantly. The main differences are summarised as follows:

  • Our method introduces the principle of free energy minimisation from thermodynamics as a guide for the instance depth optimisation process. Under this guidance, our method proposes a thermodynamic loss function based on the principle of free energy minimisation and uses it to optimise the instance depth prediction. In contrast, DID-M3D does not introduce any theory from other disciplines to guide instance depth optimisation and still uses the traditional uncertainty-aware regression loss to optimise the instance depth prediction.

  • Although our method also uses the idea of decoupling, the implementation details are completely different from DID-M3D. Our approach decouples the instance depth into three different depths, while DID-M3D decouples it into two. For the surface depth, our method uses LID to divide the depths of objects in the depth map into bins and then predicts the bin label to obtain the surface depth, whereas DID-M3D directly regresses the surface depth from the depth map. Our method reduces the difficulty of the surface depth regression and relieves the learning pressure of the network. In order to further improve the accuracy of the surface depth prediction, we also design the offset depth to fine-tune the predicted surface depth. In contrast, DID-M3D does not fine-tune the surface depth.

4. Experiment

Experiments are conducted to evaluate the performance of the proposed monocular 3D object detection method on the benchmark dataset. In this section, the experimental setup, dataset and metrics are described, followed by a discussion of the results.

4.1. Implementation details

Our experiments were conducted on an NVIDIA 2080Ti GPU using the PyTorch framework. We train the network for 200 epochs and employ the Hierarchical Task Learning (HTL) training strategy. The batch size is 10 and K in HTL is set to 5. We use the Adam solver (Kingma & Ba, 2015) with an initial learning rate of 1e-5. The learning rate increases to 1e-3 over the first 5 epochs using a linear warm-up strategy and decays at epochs 100 and 150 with rate 0.1. T is a constant and is set to 3.8918.
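The learning-rate schedule described above can be sketched as follows (a stand-in model and training loop; this mirrors the stated settings but is not the authors' training script):

```python
import torch

model = torch.nn.Linear(10, 1)   # placeholder for the detector
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def lr_lambda(epoch):
    if epoch < 5:                # linear warm-up from 1e-5 to 1e-3 over the first 5 epochs
        return 1e-2 + (1.0 - 1e-2) * epoch / 5
    factor = 1.0
    if epoch >= 100:             # decay by 0.1 at epoch 100
        factor *= 0.1
    if epoch >= 150:             # and again at epoch 150
        factor *= 0.1
    return factor

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(200):
    # ... one training epoch (batch size 10, HTL strategy) would run here ...
    scheduler.step()
```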

4.2. Dataset and metrics

We conduct experiments on the KITTI 3D detection dataset. The KITTI dataset is the most commonly used dataset for the monocular 3D object detection task, consisting of 7481 training images and 7518 testing images, and it contains three categories: pedestrian, cyclist and car. The KITTI dataset only exposes the training sample labels; the testing sample labels are not available on the KITTI website and are only used for online evaluation. To conduct ablations, the training samples are divided into 3712 training images and 3769 validation images. This data split is widely adopted by most previous works. We conduct ablation studies based on this split and also report the final results with the model trained on all 7481 images and tested by the KITTI official server. The objects in the KITTI dataset are divided into easy, moderate and hard levels according to the 2D box height, occlusion and truncation levels.

The 40-point interpolated average precision ($AP_{40}$) (Qian et al., 2022) with an IoU (intersection over union) threshold of 0.7 is used as the evaluation metric. For the 3D detection task, the average precision of 3D bounding boxes ($AP_{40}^{3D}$) and the average precision of BEV ($AP_{40}^{BEV}$) are used for comprehensive evaluation. $AP_{40}$ is defined as follows:
$$AP_{40} = \frac{1}{|R|} \sum_{r \in R} p_{interp}(r) \tag{22}$$

where $R$ is the predefined set of recall positions, $r$ is the recall and $p_{interp}(r)$ is the interpolation function.
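For concreteness, a small sketch of the 40-point interpolated AP in Equation (22) is shown below; `recalls` and `precisions` are illustrative NumPy arrays describing a precision-recall curve for one class and difficulty level.

```python
import numpy as np

def ap40(recalls, precisions):
    # 40 equally spaced recall positions r in {1/40, 2/40, ..., 1.0}
    recall_points = np.linspace(1.0 / 40, 1.0, 40)
    ap = 0.0
    for r in recall_points:
        mask = recalls >= r
        # interpolated precision: the best precision achievable at recall >= r
        p_interp = precisions[mask].max() if mask.any() else 0.0
        ap += p_interp
    return ap / 40.0
```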

4.3. Results of car category on the KITTI test set

We compare our approach with some state-of-the-art monocular 3D object detection methods from recent years. These methods are effective and have achieved significant results in monocular 3D object detection. For a fair comparison, we use the implementations provided by the authors to train the selected baseline models.

The comparison results with other state-of-the-art monocular 3D object detection methods are shown in Table 1. The best results are shown in bold. As shown in Table 1, our method achieves superior results on the Car category over previous methods. The 3D detection and BEV detection results of TDN are (25.59, 34.70), (16.45, 22.89) and (13.42, 19.42) on the easy, moderate and hard settings, respectively. Although DID-M3D is slightly higher than our method on the hard setting of 3D detection and BEV detection, TDN gives relative improvements of (1.19, 1.75) and (0.16, 0.13) on the easy and moderate settings of 3D and BEV detection. Compared with MonoJSG (Lian et al., 2022), TDN gives relative improvements of (0.9, 2.11) and (0.31, 1.63) on the easy and moderate settings of 3D detection and BEV detection, and a relative improvement of 1.24 on the hard setting of $AP_{40}^{BEV}$. In contrast, MonoJSG only outperforms TDN by 0.22 points on the hard setting of $AP_{40}^{3D}$. This demonstrates that TDN is promising and comparable with those state-of-the-art models. The thermodynamic loss and the decoupled instance depth can more accurately control depth prediction and effectively improve the accuracy of 3D detection.

Table 1. The results of car category on the KITTI test set.

4.4. Results of car category on the KITTI validation set

The comparison results on the KITTI validation set are shown in Table 2. As can be seen from Table 2, TDN outperforms all baseline methods on all evaluated protocols. Compared to the best competing method DID-M3D, our method improves by (3.12, 4.12), (1.34, 2.07) and (0.41, 1.58) on the easy, moderate and hard settings of 3D detection and BEV detection, respectively. Combined with the results in 3D/BEV detection, it can be concluded that the overall performance of TDN is better than that of the state-of-the-art methods in terms of detection accuracy.

Table 2. The results of car category on the KITTI validation set.

4.5. Effect of each component of TDN

In this section, a set of ablation experiments is conducted to verify the effectiveness of each component of TDN. The components that affect the performance of TDN are mainly the thermodynamic loss and the decoupled instance depth. In these experiments, we show the contribution of these components to the overall performance of TDN and demonstrate that all components are useful for the final results. All ablation studies were performed on the KITTI validation set for the car category, and the main results are summarised in Table 3. In experiment (a), neither the decoupled instance depth nor the thermodynamic loss is used, and the traditional L1 loss is used for depth prediction instead of the thermodynamic loss.

Table 3. Effect of each component on the performance of TDN.

From Table 3, it can be seen that the thermodynamic loss and the decoupled instance depth have a powerful influence on the performance of TDN. Comparing (b) with (a), the decoupled instance depth brings relative improvements of (5.45, 2.83, 2.38) on the three settings of 3D detection and (4.57, 2.74, 1.6) on the three settings of BEV detection. Comparing (c) with (a), the thermodynamic loss brings relative improvements of (4.54, 2.12, 1.02) on the three settings of 3D detection and (3.60, 2.34, 0.99) on the three settings of BEV detection. By comparing settings (b) and (c) against (a), we can see that the decoupled instance depth and the thermodynamic loss effectively and stably improve the overall performance.

In TDN, the decoupled instance depth enables the different branches of the network to learn the corresponding depth distributions, thereby effectively reducing the learning pressure of the network and making training more stable. To further investigate the impact of the decoupled instance depth on network optimisation, we compared the convergence of the decoupled approach and the non-decoupled approach. Figure 4 shows the convergence curves for TDN and for TDN using the coupled instance depth. As shown in Figure 4, the total depth loss using the decoupled instance depth converges slightly faster than the total depth loss using the coupled instance depth. This shows that the decoupled instance depth mainly improves the accuracy of depth prediction and slightly accelerates the convergence of the network.

Figure 4. The convergence curves for TDN and TDN using the coupled instance depth. The x-axis represents the epoch, and the y-axis represents the loss value.

Compared with the traditional L1 loss, the thermodynamic loss investigates the depth prediction process from an energy perspective and can better reflect the distribution of the predicted depth. Hence, the thermodynamic loss can more accurately control the depth prediction process. To further investigate the impact of different losses on network optimisation, we compared the convergence of the thermodynamic loss and the L1 loss. Figure 5 shows the convergence curves for the different losses. As shown in Figure 5, the total depth loss using the thermodynamic loss converges faster than the total depth loss using the L1 loss. The thermodynamic loss significantly accelerates the convergence rate of the total depth loss and reaches the optimal value within the given number of epochs. Consequently, the thermodynamic loss is more suitable for the optimisation of depth prediction than the L1 loss.

Figure 5. The optimisation process of thermodynamic loss and L1 loss. The x-axis represents the epoch, and the y-axis represents the loss value.

4.6. Qualitative results

Figure 6 shows the qualitative results on the KITTI validation set, covering both 3D detection and BEV detection. The detected objects include cars, pedestrians and cyclists. We can observe that for most cases the model predictions are quite precise. Our method achieves high detection performance for both distant objects and slightly occluded objects.

Figure 6. Qualitative results on KITTI validation set. Red represents the GT value, green represents the predicted value.

However, our method has lower depth estimation accuracy for heavily occluded objects and small objects. This is a common dilemma for most monocular works, because in monocular 3D detection the performance of 3D detection is largely determined by the performance of 2D detection.

5. Discussions

In monocular 3D detection, the depth prediction is the core task. Therefore, improving the accuracy of depth prediction can effectively improve the accuracy of 3D detection. In TDN, the proposed thermodynamic loss and the proposed decoupled instance depth are used to improve the accuracy of depth prediction. Experiments show that TDN can effectively improve the detection accuracy.

TDN regards the optimisation process of depth estimation as a process of energy exchange between a closed system and the outside. The proposed thermodynamic loss investigates the depth prediction process from an energy distribution perspective. L1 loss or L1 type loss only calculates the distance between the predicted depth and GT depth, without investigating the difference between the two from a distribution perspective. Since depth prediction is essentially a distribution prediction, the thermodynamic loss can more accurately reflect the actual situation of depth prediction.

Due to the lack of sufficient depth information in monocular imagery, it is difficult to directly regress the instance depth, and there is a certain difference between the depth observed on the image and the actual depth. These circumstances increase the learning pressure of the network and make training unstable. The decoupled instance depth splits the instance depth into three different depths and uses three network branches to predict them respectively. This allows each branch to fit only a single distribution, effectively reducing the learning pressure of the network. The decoupled depth can also highlight the error between the actual depth and the predicted depth in more detail. The ablation experiments have proven that the thermodynamic loss and the decoupled instance depth can effectively improve 3D detection accuracy.

On the commonly used dataset, TDN obtains better results than the other baseline models, which shows that TDN has better detection ability and performs better than some state-of-the-art 3D object detection networks.

6. Conclusion

In this paper, we proposed TDN to address inaccurate depth estimation in monocular 3D object detection. TDN introduces concepts from thermodynamics to design a thermodynamic loss function that controls the depth regression, and proposes a novel depth decoupling method that decouples the instance depth into three different depths. The combination of the thermodynamic loss and the decoupled instance depth investigates depth regression from a new perspective. The experimental results indicate that TDN generates more accurate depth information and achieves higher detection accuracy. Comparisons with some state-of-the-art baseline methods demonstrate that our method is more effective and efficient in terms of detection quality in most cases. The ablation experiments show the effectiveness of each component of the model. In future work, we will further improve the depth prediction mechanism to achieve more accurate detection results.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by the Hubei University of Technology Graduate Research Innovation Project (4306.22019) and by the National Natural Science Foundation of China under Grant No. 61300127.

References

  • Brazil, G., & Liu, X. (2019). M3D-RPN: Monocular 3D region proposal network for object detection. In Proceedings of the IEEE international conference on computer vision (pp. 9286–9295).
  • Chen, X., Kundu, K., Zhu, Y., Ma, H., Fidler, S., & Urtasun, R. (2018). 3D object proposals using stereo imagery for accurate object class detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(5), 1259–1272. https://doi.org/10.1109/TPAMI.2017.2706685
  • Chen, Y., Huang, S., Liu, S., Yu, B., & Jia, J. (2023). Dsgn++: Exploiting visual-spatial relation for stereo-based 3D detectors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4), 4416–4429. https://doi.org/10.1109/TPAMI.2022.3200725
  • Chen, Y., Tai, L., Sun, K., & Li, M. (2020). MonoPair: Monocular 3D object detection using pairwise spatial relationships. In Proceedings of the IEEE computer society conference on computer vision and pattern recognition (pp. 12090–12099).
  • Chen, Y.-N., Dai, H., & Ding, Y. (2022). Pseudo-stereo for monocular 3D object detection in autonomous driving. In Proceedings of the IEEE computer society conference on computer vision and pattern recognition (pp. 877–887).
  • Ding, M., Huo, Y., Yi, H., Wang, Z., Shi, J., Lu, Z., & Luo, P. (2020). Learning depth-guided convolutions for monocular 3d object detection. In Proceedings of the IEEE computer society conference on computer vision and pattern recognition (pp. 11669–11678).
  • Geiger, A., Lenz, P., & Urtasun, R. (2012). Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE computer society conference on computer vision and pattern recognition (pp. 3354–3361).
  • Jiang, C., Wang, G., Miao, Y., & Wang, H. (2022). 3D scene flow estimation on pseudo-LiDAR: bridging the gap on estimating point motion. arXiv. https://doi.org/10.48550/arXiv.2209.13130
  • Joseph, K. J., Khan, S., Khan, F. S., & Balasubramanian, V. N. (2021). Towards open world object detection. In Proceedings of the IEEE computer society conference on computer vision and pattern recognition (pp. 5826–5836).
  • Kingma, D. P., & Ba, J. L. (2015). Adam: A method for stochastic optimization. In 3rd International conference on learning representations, ICLR 2015 - Conference track proceedings.
  • Lei, J., Guo, T., Peng, B., & Yu, C. (2021). Depth-assisted joint detection network for monocular 3D object detection. In Proceedings - International conference on image processing, ICIP (pp. 2204–2208).
  • Li, P., Chen, X., & Shen, S. (2019). Stereo R-CNN based 3D object detection for autonomous driving. In Proceedings of the IEEE computer society conference on computer vision and pattern recognition (pp. 7636–7644).
  • Li, Y., Chen, Y., He, J., & Zhang, Z. (2022a). Densely constrained depth estimator for monocular 3D object detection. In Lecture notes in computer science (including subseries Lecture notes in artificial intelligence and Lecture notes in bioinformatics) (pp. 718–734).
  • Li, Z., Qu, Z., Zhou, Y., Liu, J., Wang, H., & Jiang, L. (2022b). Diversity matters: Fully exploiting depth clues for reliable monocular 3D object detection. In Proceedings of the IEEE computer society conference on computer vision and pattern recognition (pp. 2781–2790).
  • Lian, Q., Li, P., & Chen, X. (2022). Monojsg: Joint semantic and geometric cost volume for monocular 3D object detection. In Proceedings of the IEEE computer society conference on computer vision and pattern recognition (pp. 1060–1069).
  • Liu, W., Zhu, D., Luo, H., & Li, Y. (2023). 3D object detection with fusion point attention mechanism in LiDAR point cloud. Guangzi Xuebao/Acta Photonica Sinica, 52(9), 0912002. https://doi.org/10.3788/gzxb20235209.0912002
  • Liu, X., Xue, N., & Wu, T. (2022). Learning auxiliary monocular contexts helps monocular 3D Object detection. In Proceedings of the 36th AAAI conference on artificial intelligence, AAAI 2022 (pp. 1810–1818). https://doi.org/10.1609/aaai.v36i2.20074
  • Liu, Z., Wu, Z., & Toth, R. (2020). SMOKE: Single-stage monocular 3D object detection via keypoint estimation. In IEEE computer society conference on computer vision and pattern recognition workshops (pp. 4289–4298).
  • Liu, Z., Zhou, D., Lu, F., Fang, J., & Zhang, L. (2021). Autoshape: Real-time shape-aware monocular 3D object detection. In Proceedings of the IEEE international conference on computer vision (pp. 15621–15630).
  • Lu, Y., Ma, X., Yang, L., Zhang, T., Liu, Y., Chu, Q., Yan, J., & Ouyang, W. (2021). Geometry uncertainty projection network for monocular 3D object detection. In Proceedings of the IEEE international conference on computer vision (pp. 3091–3101).
  • Ma, X., Liu, S., Xia, Z., Zhang, H., Zeng, X., & Ouyang, W. (2020). Rethinking pseudo-LiDAR representation. In Lecture notes in computer science (including subseries Lecture notes in artificial intelligence and Lecture notes in bioinformatics) (pp. 311–327).
  • Manhardt, F., Kehl, W., & Gaidon, A. (2019). ROI-10D: Monocular lifting of 2D detection to 6D pose and metric shape. In Proceedings of the IEEE computer society conference on computer vision and pattern recognition (pp. 2064–2073).
  • Miao, Y., Deng, H., Jiang, C., Feng, Z., Wu, X., Wang, G., & Wang, H. (2023). Pseudo-LiDAR for Visual Odometry. IEEE Transactions on Instrumentation and Measurement, 72, 1–9. https://doi.org/10.1109/TIM.2023.3315416
  • Mousavian, A., Anguelov, D., Koecka, J., & Flynn, J. (2017). 3D bounding box estimation using deep learning and geometry. In Proceedings - 30th IEEE conference on computer vision and pattern recognition, CVPR 2017 (pp. 5632–5640).
  • Ouyang, E., Zhang, L., Chen, M., Arnab, A., & Fu, Y. (2021). Dynamic depth fusion and transformation for monocular 3D object detection. In Lecture notes in computer science (including subseries Lecture notes in artificial intelligence and Lecture notes in bioinformatics) (pp. 349–364).
  • Peng, L., Liu, F., Yu, Z., Yan, S., Deng, D., Yang, Z., Liu, H., & Cai, D. (2022a). Lidar point cloud guided monocular 3D object detection. In Lecture notes in computer science (including subseries Lecture notes in artificial intelligence and Lecture notes in bioinformatics) (pp. 123–139).
  • Peng, L., Wu, X., Yang, Z., Liu, H., & Cai, D. (2022b). DID-M3D: Decoupling instance depth for monocular 3D object detection. In Lecture notes in computer science (including subseries Lecture notes in artificial intelligence and Lecture notes in bioinformatics) (pp. 71–88).
  • Qian, R., Lai, X., & Li, X. (2022). 3D object detection for autonomous driving: A survey. Pattern Recognition, 130, 108796. https://doi.org/10.1016/j.patcog.2022.108796
  • Reading, C., Harakeh, A., Chae, J., & Waslander, S. L. (2021). Categorical depth distribution network for monocular 3D object detection. In Proceedings of the IEEE computer society conference on computer vision and pattern recognition (pp. 8551–8560).
  • Ren, S., He, K., Girshick, R., & Sun, J. (2017). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), 1137–1149. https://doi.org/10.1109/TPAMI.2016.2577031
  • Shi, X., Chen, Z., & Kim, T.-K. (2020). Distance-normalized unified representation for monocular 3D object detection. In Lecture notes in computer science (including subseries Lecture notes in artificial intelligence and Lecture notes in bioinformatics) (pp. 91–107).
  • Shin, K., Kwon, Y. P., & Tomizuka, M. (2019). RoarNet: A Robust 3D object detection based on region approximation refinement. In IEEE intelligent vehicles symposium, proceedings (pp. 2510–2515).
  • Simonelli, A., Bulo, S. R., Porzi, L., Ricci, E., & Kontschieder, P. (2020). Towards generalization across depth for monocular 3D object detection. In Lecture notes in computer science (including subseries Lecture notes in artificial intelligence and Lecture notes in bioinformatics) (pp. 767–782).
  • Tang, Y., Dorn, S., & Savani, C. (2021). Center3D: Center-based monocular 3D object detection with joint depth understanding. In Lecture notes in computer science (including subseries Lecture notes in artificial intelligence and Lecture notes in bioinformatics) (pp. 289–302).
  • Van Dijk, T., & De Croon, G. (2019). How do neural networks see depth in single images? In Proceedings of the IEEE international conference on computer vision (pp. 2183–2191).
  • Wang, G., Tian, X., Ding, R., & Wang, H. (2021a). Unsupervised learning of 3D scene flow from monocular camera. In Proceedings - IEEE international conference on robotics and automation (pp. 4325–4331).
  • Wang, L., Zhang, L., Zhu, Y., Zhang, Z., He, T., Li, M., & Xue, X. (2021b). Progressive coordinate transforms for monocular 3D object detection. In Advances in neural information processing systems (pp. 13364–13377).
  • Wang, Q., Li, Z., Zhu, D., & Yang, W. (2023). LiDAR-only 3D object detection based on spatial context. Journal of Visual Communication and Image Representation, 93, 103805. https://doi.org/10.1016/j.jvcir.2023.103805
  • Wang, X., Yin, W., Kong, T., Jiang, Y., Li, L., & Shen, C. (2020). Task-aware monocular depth estimation for 3D object detection. In AAAI 2020 - 34th AAAI conference on artificial intelligence (pp. 12257–12264).
  • Xu, B., & Chen, Z. (2018). Multi-level fusion based 3D object detection from monocular images. In Proceedings of the IEEE computer society conference on computer vision and pattern recognition (pp. 2345–2353).
  • Yan, Y., Mao, Y., & Li, B. (2018). Second: Sparsely embedded convolutional detection. Sensors (Switzerland), 18(10), 3337. https://doi.org/10.3390/s18103337
  • Yang, B., Luo, W., & Urtasun, R. (2018). Pixor: Real-time 3D object detection from point clouds. In Proceedings of the IEEE computer society conference on computer vision and pattern recognition (pp. 7652–7660).
  • Yu, F., Wang, D., Shelhamer, E., & Darrell, T. (2018). Deep layer aggregation. In Proceedings of the IEEE computer society conference on computer vision and pattern recognition (pp. 2403–2412).
  • Zhou, X., Wang, D., & Krahenbuhl, P. (2019). Objects as points. arXiv.
  • Zhou, Y., & Tuzel, O. (2018). Voxelnet: End-to-end learning for point cloud based 3D object detection. In Proceedings of the IEEE computer society conference on computer vision and pattern recognition (pp. 4490–4499).
  • Zhou, Z., Du, L., Ye, X., Zou, Z., Tan, X., Zhang, L., Xue, X., & Feng, J. (2022). Sgm3d: Stereo guided monocular 3D object detection. IEEE Robotics and Automation Letters, 7(4), 10478–10485. https://doi.org/10.1109/LRA.2022.3191849