Research Article

Monocular vision guided deep reinforcement learning UAV systems with representation learning perception

Article: 2183828 | Received 05 Jun 2022, Accepted 17 Feb 2023, Published online: 08 Mar 2023

Abstract

In recent years, numerous studies have applied deep reinforcement learning (DRL) algorithms to vision-guided unmanned aerial systems. However, DRL struggles to train deep networks in an end-to-end manner because of data inefficiency and the lack of direct supervision signals. This paper provides a visual-information dimension reduction scheme based on representation learning as the visual perception module, which reduces the dimensionality of high-dimensional visual information while retaining the features relevant to UAV navigation. Combining such state representation learning with the DRL model can effectively reduce the complexity of the neural network required by DRL. Based on this scheme, we design three motion control models with a monocular camera as the main sensor and train them to control UAVs for obstacle avoidance tasks in a simulated environment. Experiments show that all three models achieve a high obstacle avoidance ability after a certain period of training. In addition, one of them also enables the monocular vision guidance system to avoid obstacles in the blind spot of side vision.

1. Introduction

Vision guided robotic systems (VGR) have become a hot area of research in recent years (Singh et al., Citation2022). Among the methods employed in this research area, machine learning algorithms play an extremely important role. In the field of visual perception, machine learning has achieved remarkable results in obstacle detection (He et al., Citation2021), semantic segmentation (Kumar et al., Citation2021), road detection (Caltagirone et al., Citation2019) and many application scenarios. In the field of path planning, computer vision-based machine learning algorithms have also achieved good results (Elkholy et al., Citation2020). Machine learning also performs well when dealing with vision-based robot control tasks (Xiao et al., Citation2022).

In vision guided robotic systems, visual guidance of unmanned aerial vehicles (UAV) (Lu et al., Citation2018) has become a significant focus of research, due to its complex environmental changes and manipulation difficulties. Among such work, research based on deep learning has achieved excellent results in various practical tasks. For example, Lee et al. (Citation2020) applied a Faster R-CNN model to tree trunk detection, so that a UAV can perform monocular navigation in tree plantations. The authors of (Padhy et al., Citation2018) trained a CNN-based model as a classifier for indoor environments, making UAVs capable of monocular navigation in indoor corridors. Deep reinforcement learning has also produced many excellent results in the field of robot navigation in recent years (Nachum et al., Citation2018; Zhou & Ho, Citation2022). Because UAVs often need to fly in partially or completely unknown environments that cannot be accurately modelled mathematically, technology that enables UAVs to learn paths and movements autonomously is critical. Therefore, the application of deep reinforcement learning to UAV vision guided systems has become a hot research direction (Azar et al., Citation2021). However, DRL has an important limitation when using visual information as the state. The use of deep neural networks in DRL allows the agent to observe the environment directly in high dimensions. Nonetheless, the shortcomings of deep neural networks, namely low sample efficiency, the need for large amounts of training and testing data, and learning instability, are simultaneously inherited. Therefore, when faced with high-dimensional data such as visual input, DRL models are often caught in a dilemma: on the one hand, a large amount of stable data is not available for DRL to quickly fit complex image recognition networks such as Yolo (Redmon et al., Citation2016) and vision transformers (Dosovitskiy et al., Citation2021); on the other hand, a shallow neural network cannot handle the task of extracting effective features from visual information. Some researchers are working to address this dilemma. For example, in (Ota et al., Citation2020), to overcome the poor performance of directly using wide and deep neural networks in reinforcement learning, actor and critic networks with a DenseNet structure are used to deal with high-dimensional states.

One potential solution to the aforementioned issues is the utilisation of representation learning. This method involves the acquisition of low-dimensional representations of visual information, which preserve properties that are essential to the task at hand. These representations are then used to learn optimal policies. Representation learning, as stated in (Bengio et al., Citation2013), is often used as an auxiliary tool in the field of computer vision. It has shown impressive results in areas such as dimensionality reduction, compression, data transformation, and semantic segmentation of visual data (Ruiz-del-Solar et al., Citation2018).

In this paper, we introduce a novel approach for utilising representation learning to perform dimensionality reduction on visual data in order to enhance the performance of deep reinforcement learning for visual navigation tasks. While training the DRL model, we train a convolutional variational autoencoder (VAE) (Pham & Le, Citation2020) with visual data collected in the environment to complete the task of generating depth maps from monocular RGB images. As DRL training progresses, the generation accuracy of Conv-VAE gradually increases. We replace the original RGB image with the latent code generated by the encoder part of the trained Conv-VAE. Using latent code as a state to participate in DRL training can greatly reduce the complexity of the neural network. Based on this method, we have designed three deep reinforcement learning models for UAV obstacle avoidance tasks in a simulated environment: single image model, continuous images model, and short-term memory model.

Experiments with these three models prove that after using VAE for visual dimensionality reduction, the DRL model has efficient convergence ability when faced with visual information, time-series visual information, and time-series vision and IMU fusion information. Experiments with the short-term memory model also prove that the model can make the monocular vision DRL algorithm perceive obstacles in the blind area of the current field of vision.

The key contribution of this paper is the development and evaluation of VAE-based learned state representations for training DRL models on monocular vision guided tasks. Specifically, the study proposes:

  1. Three deep reinforcement learning models for monocular UAV visual obstacle avoidance.

  2. A new idea for end-to-end monocular visual-spatial perception via the short-term memory state DRL model.

  3. A new vision-based navigation algorithm that provides new ideas for robot control applications in indoor and other signal-constrained scenarios.

The rest of this paper is organised as follows: Section 2 describes related works on visual based navigation. Section 3 focuses on the 3 DRL models we have designed in this research. Section 4 explains the experiments and results of our study in detail. Finally, Section 5 summarises the research results along with their limitations and proposes future research plans.

2. Related works

Based on improvements to discrete action space methods such as DQN (Silver et al., Citation2016), Double DQN (Hasselt et al., Citation2016) and Duelling DQN (Wang et al., Citation2016), the emergence of DRL algorithms with an Actor Critic structure (Sutton et al., Citation1999), such as DDPG (Lillicrap et al., Citation2019), TD3 (Fujimoto et al., Citation2018) and SAC (Haarnoja et al., Citation2019), enables DRL algorithms to select actions with arbitrary values in a continuous action space, so that the precision meets the requirements of robot control. However, such DRL algorithms often require a lightweight Actor Critic network to achieve fast and high-quality convergence. Most of the neural networks used in continuous action space DRL are only 3–4 layers deep. The Actor Critic network structures used in the original papers for the three methods mentioned above (DDPG, TD3 and SAC) are shown in Table 1.

Table 1. Comparison of network structures used in 3 different continuous action space DRL methods.

However, when faced with the problem of visual navigation, such a simple network structure is unable to cope with complex visual data because these networks cannot expand enough feature space to extract rich semantic information. To this end, researchers have proposed many solutions.

2.1. Manual dimensionality reduction

When the original version of DQN was used to train agents to play Atari games, researchers tried to manually compress the pixels of the game screen, crop the size of the game screen, and convert three-channel RGB images to single-channel grayscale images. There are many similar image dimensionality reduction methods in robotic visual navigation DRL field. In (Yue et al., Citation2019), the authors first convert the image into a grayscale image and then compress the image, and arrange the four most recent compressed images into a continuous image state. Using the image information processed in this way as the state to train the DQN model enables the robot to avoid two carton obstacles without collision.
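As an illustration of this kind of manual pre-processing, the following is a minimal Python sketch of the grayscale-conversion, downscaling and frame-stacking pipeline described above; the 84×84 target size and the use of OpenCV are our assumptions, not details taken from the cited work.

```python
# Minimal sketch of manual image dimensionality reduction for DRL:
# grayscale conversion, downscaling, and stacking the four most recent frames.
import cv2
import numpy as np
from collections import deque

frames = deque(maxlen=4)  # holds the four most recent processed frames

def preprocess(rgb_frame, size=(84, 84)):
    gray = cv2.cvtColor(rgb_frame, cv2.COLOR_RGB2GRAY)           # 3 channels -> 1 channel
    small = cv2.resize(gray, size, interpolation=cv2.INTER_AREA)  # compress the image
    frames.append(small.astype(np.float32) / 255.0)
    while len(frames) < 4:                                        # pad at episode start
        frames.append(frames[-1])
    return np.stack(frames)                                       # state shape: (4, 84, 84)
```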

There is more than one way of manual state dimensionality reduction. In (Anderson et al., Citation2019), the authors subtly reduced the state space from the current state of every pixel in an image frame to a set of nine one-hot encoded values. The essence of this method is to transfer the image information from continuous state space to discrete state space, so that the model only needs to understand the limited state to train the UAV to complete the simple runway navigation task.

Manually pre-processing images works well for simple tasks. However, in a complex training environment, it is difficult for manual image processing to ensure that the key information in the images is not lost. Therefore, image processing methods based on machine learning are more commonly used in complex visual navigation tasks.

2.2. Transfer learning

As image networks become more and more mature, various CNN networks have achieved impressive performance in computer vision fields such as image classification and object detection. This has also led to the emergence of research on transferring pre-trained image networks to deep reinforcement learning. The feature extraction effect of pre-trained image networks is often much better than that of manually dimensionally reduced images. Therefore, visual navigation work using such methods tends to achieve better performance in more complex environments.

In (Zieliński & Markowska-Kaczmar, Citation2021), the authors pre-trained Yolo for obstacle detection and then transferred the pre-trained network into the A2C (Wu et al., Citation2017) model to train an underwater robot to pass through a gate. In (Devo et al., Citation2020), the authors used multiple ResNet50 networks to form an object localisation network, pre-trained it, and then transplanted it into the IMPALA (Espeholt et al., Citation2018) model. This study successfully trained a ground robot to explore mazes with different structures. Some studies do not use existing image network structures but design their own CNN-based pre-training model for navigation tasks. In (Ma et al., Citation2018), the authors designed a CNN-based saliency detection network and transferred it to an actor-critic (AC) architecture deep reinforcement learning model. The study achieves performance that exceeds the state-of-the-art on UAV obstacle avoidance tasks.

Although transfer learning and DRL combined methods have produced excellent results, the pre-training of visual perception image networks requires a lot of time and relies on a large amount of manually labelled data. This approach essentially discards the inherent advantage of DRL algorithms in that they do not rely on datasets.

2.3. Representation learning for visual perception

Compared with transfer learning, representation learning can directly use the data (camera and other sensor data) collected by the DRL model from the training environment to train itself. An example of training a VAE to represent the input state of a DRL can be found in the work of (Caselles-Dupré et al., Citation2018), where researchers use the generative power of a VAE to remember and reuse past knowledge. This enables DRL agents to learn in continuously changing environments. In the field of robotic control, (Finn et al., Citation2016) proposed a Deep Spatial Autoencoder to generate compact state representations from images, thereby enabling DRL algorithms to train robotic arms for precise manipulation.

The study (Zhou et al., Citation2019) modelled the hippocampus as a visual perception module through a combination of representation learning and self-organising learning. The visual perception module perceives the environment topology and the robot head orientation from the environment image. Hierarchical reinforcement learning (Zhang et al., Citation2019) is used to train the robot to perform navigation tasks through the output of the visual perception module. Research (Shin et al., Citation2020) drives U-net to generate special optical flow topologies from environment images by using the DRL algorithm and the rewards generated during training. The output of the U-net will participate in the training of DRL as the state that determines the action of the DRL algorithm. The authors used this model to give UAVs the ability to avoid simple obstacles in real-world environments.

Microsoft recently used a framework called Cross-modal Variational Autoencoder (CM-VAE) to generate compact representations that tightly bridge the simulation-to-reality gap (Bonatti et al., Citation2020). The perception module of the system compresses the input image into this low-dimensional representation and outputs the position of the gate that the UAV needs to pass through and the velocity of the UAV. Training a supervised model for UAV navigation on a dataset labelled through this representation model achieves human-level performance. This proves that the VAE has great value as a DRL visual perception module. The study (Lange & Riedmiller, Citation2010) also demonstrated the ability of autoencoder models to assist DRL algorithms.

However, representation learning used purely for dimensionality reduction does not seem to retain the advantage of extracting effective features when obstacles have more diverse textures, shapes and depth information. Therefore, in this paper we train a Conv-VAE to generate depth maps from RGB images, thereby filtering out the information in the visual state that is irrelevant to the obstacle avoidance task. Moreover, a short-term memory state model is proposed to improve the VAE + DRL model's perception of depth information.

3. DRL models

3.1. Twin delayed deep deterministic policy gradient algorithm (TD3)

TD3 is a deterministic policy reinforcement learning algorithm suitable for high-dimensional continuous action spaces. Its optimisation objective is simple: maximise Q(St, At), that is, to find for each state St the action At that maximises the score Q of the interaction (St, At) between the agent and the environment. The TD3 algorithm contains four networks: the actor network, the actor target network, the critic network, and the critic target network. The actor network is responsible for the iterative update of the policy network parameters θ and for selecting the current action At according to the current state St. The actor target network is responsible for selecting the next action At+1 based on the next state St+1 sampled from the replay buffer; its parameters θ′ are periodically copied from θ. The critic network is responsible for the iterative update of the value network parameters w and for computing the current Q value Q(St, At). The target value is computed as
(1) y = R + γQ′(St+1, At+1)
where Q′(St+1, At+1), obtained from the critic target network, is named the target Q value. The critic target network parameters w′ are periodically copied from w. The overall structure of TD3 is shown in Figure 1.

Figure 1. TD3.
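As a concrete illustration of Equation (1) and the role of the four networks, below is a minimal PyTorch sketch of the TD3 target computation and critic update; the hyperparameter values, the target policy smoothing noise and the function signatures are our assumptions rather than details taken from this paper.

```python
import torch
import torch.nn.functional as F

def td3_critic_update(batch, actor_target, critic, critic_target,
                      critic_opt, gamma=0.99, policy_noise=0.2, noise_clip=0.5):
    """One critic update step. `batch` holds tensors (state, action, reward, next_state, done)."""
    state, action, reward, next_state, done = batch
    with torch.no_grad():
        # Actor target network selects A_{t+1} for the sampled S_{t+1} (with smoothing noise)
        noise = (torch.randn_like(action) * policy_noise).clamp(-noise_clip, noise_clip)
        next_action = (actor_target(next_state) + noise).clamp(-1.0, 1.0)
        # Critic target network estimates Q'(S_{t+1}, A_{t+1}); TD3 keeps the smaller of two heads
        q1_t, q2_t = critic_target(next_state, next_action)
        y = reward + gamma * (1.0 - done) * torch.min(q1_t, q2_t)   # Equation (1)
    # Regress both Q heads of the current critic toward the target y
    q1, q2 = critic(state, action)
    loss = F.mse_loss(q1, y) + F.mse_loss(q2, y)
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
    return loss.item()
```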

3.2. Single image model

Based on TD3, we designed the first single image model, which also serves as the basic algorithmic framework for all three of our models. We train a VAE to generate depth maps from input RGB images concurrently with TD3 training. We then use the encoder network of the VAE to reduce the monocular RGB image to a latent code. This latent code participates in the training of TD3 as the state St. The current state St is input into the actor network to obtain the action At, and the environment feeds back a reward Rt. After executing At, the UAV reaches a new state St+1. The agent stores the data generated by each interaction in the replay buffer. We then sample (St, At, Rt, St+1) from the replay buffer and use the critic network to evaluate its Q value. This Q value is used to update the actor network in the direction that maximises the Q value. The entire workflow of the single image model is shown in Figure 2.

Figure 2. Single image model.
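To make the workflow concrete, here is a minimal sketch of one interaction step of the single image model, assuming a `vae` with an `encode` method, an `actor` network, an environment wrapper `env` and a `replay_buffer`; these names and interfaces are ours, not the paper's.

```python
import torch

def interaction_step(env, vae, actor, replay_buffer, state):
    """state: 32-dim latent code of the current RGB observation (S_t)."""
    with torch.no_grad():
        action = actor(state.unsqueeze(0)).squeeze(0)            # A_t = actor(S_t)
    rgb_next, reward, done = env.step(action.cpu().numpy())      # UAV executes (vy, vz), gets R_t
    with torch.no_grad():
        mu, _ = vae.encode(rgb_next.unsqueeze(0))                # encoder compresses the next image
        next_state = mu.squeeze(0)                               # latent code used as S_{t+1}
    replay_buffer.add(state, action, reward, next_state, done)   # store transition for TD3 updates
    return next_state, reward, done
```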

The Conv-VAE structure we used is shown in Figure 3. The input is a monocular RGB image of size 178×72 with 3 channels. The structure consists of an encoder and a decoder. The encoder consists of a four-layer convolutional network, and the convolution kernels in each convolutional layer are (4×4). The latent code generated by the encoder is reconstructed into (1024×1×3) data as the input to the decoder. To decode this data into a single-channel 178×72 depth map, the decoder's convolution kernels are (5×7), (6×8), (7×8) and (8×6).

Figure 3. Convolutional variational autoencoder.
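The following PyTorch sketch mirrors the layer counts and kernel sizes of Figure 3; the strides, paddings, channel widths and the final interpolation that aligns the output to the 178×72 depth map are our assumptions, since they are not specified in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvVAE(nn.Module):
    """Sketch of the Conv-VAE of Figure 3: four 4x4 conv layers in the encoder, a 32-dim latent
    code, and a decoder with (5x7), (6x8), (7x8), (8x6) kernels. Strides/paddings/channel widths
    are assumed; the output is resized to the 72x178 depth map."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2), nn.ReLU(),
            nn.Conv2d(128, 256, 4, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        feat = 256 * 2 * 9                                   # feature size for a 3x72x178 input
        self.fc_mu = nn.Linear(feat, latent_dim)
        self.fc_logvar = nn.Linear(feat, latent_dim)
        self.fc_dec = nn.Linear(latent_dim, 1024 * 1 * 3)    # reshaped to (1024, 1, 3)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(1024, 128, (5, 7), stride=2), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, (6, 8), stride=2), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, (7, 8), stride=2), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, (8, 6), stride=2), nn.Sigmoid(),
        )

    def encode(self, rgb):                                   # rgb: (B, 3, 72, 178)
        h = self.encoder(rgb)
        return self.fc_mu(h), self.fc_logvar(h)

    def forward(self, rgb):
        mu, logvar = self.encode(rgb)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)     # reparameterisation trick
        depth = self.decoder(self.fc_dec(z).view(-1, 1024, 1, 3))
        depth = F.interpolate(depth, size=(72, 178))                # match the depth-map size
        return depth, mu, logvar

def vae_loss(depth_pred, depth_true, mu, logvar):
    recon = F.mse_loss(depth_pred, depth_true)                      # depth reconstruction error
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL regulariser
    return recon + kld
```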

The encoder in the Conv-VAE is trained to convert RGB images into 32 latent codes. These 32 variables participate, as the state, in the use and update of the actor and critic networks. The actor network consists of four fully connected layers. Its input is the 32 variables output by the encoder. The output is two values in the range −1 to 1, representing the speed of the UAV in the y direction (left and right) and the z direction (up and down).

The critic network consists of three fully connected layers. Its role is to estimate the Q value obtained when a certain action is selected in a certain state. Its input is therefore the concatenation of state and action, i.e. 34 variables. Furthermore, in order to mitigate the overestimation of the Q value, the TD3 algorithm employs a dual output layer in the critic network. This layer generates two distinct Q values, the smaller of which is used for the critic network update. The structure of the actor and critic networks is shown in Figure 4.

Figure 4. Structure of actor net and critic net in the model.
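A minimal PyTorch sketch of the actor and critic of Figure 4 is given below; the hidden layer widths are assumptions, and the two Q values are implemented as two small heads rather than a literal dual output layer.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Four fully connected layers; input = 32 latent variables, output = (vy, vz) in [-1, 1]."""
    def __init__(self, state_dim=32, action_dim=2, hidden=256):  # hidden width is an assumption
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Three fully connected layers per head; input = state (32) + action (2) = 34 variables."""
    def __init__(self, state_dim=32, action_dim=2, hidden=256):
        super().__init__()
        def q_head():
            return nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )
        self.q1, self.q2 = q_head(), q_head()

    def forward(self, state, action):
        sa = torch.cat([state, action], dim=-1)
        return self.q1(sa), self.q2(sa)   # the smaller Q value is used for the critic update
```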

3.3. Continuous images model

The workflow of the continuous images model is similar to that of the single image model. The difference is that the input to the actor and critic networks is no longer the single latent code converted from a single image. Rather, it is a combination of the latent codes converted from four consecutive RGB images along the route the UAV has travelled, that is, a 2D array of shape 4×32. The entire workflow of the continuous images model is shown in Figure 5.

Figure 5. Continuous images model.

To enable the actor and critic networks to understand the temporal relationship between the latent codes, we first pass the 2D state array through a single-layer LSTM with a hidden size of 8. The output of the LSTM is a 32-length code (four time steps × hidden size 8). The new structure of the actor net and critic net is shown in Figure 6.

Figure 6. Structure of actor net and critic net in the model with LSTM layer.
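Below is a minimal sketch of this LSTM front-end; flattening the four hidden states of size 8 into the 32-length code is our reading of the text, and the remaining details are assumptions.

```python
import torch
import torch.nn as nn

class LSTMStateEncoder(nn.Module):
    """Single-layer LSTM front-end shared by actor and critic in the continuous images model.
    Input: (batch, 4, 32) latent codes; output: 32-length code (4 time steps x hidden size 8)."""
    def __init__(self, latent_dim=32, hidden_size=8, seq_len=4):
        super().__init__()
        self.lstm = nn.LSTM(input_size=latent_dim, hidden_size=hidden_size, batch_first=True)
        self.out_dim = hidden_size * seq_len

    def forward(self, seq):                  # seq: (batch, 4, 32)
        out, _ = self.lstm(seq)              # out: (batch, 4, 8)
        return out.flatten(start_dim=1)      # (batch, 32), fed to the FC layers of Figure 6
```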

3.4. Short-term memory model

Using only four consecutive images as input only enables the agent to understand the temporal relationship between images during learning. However, spatial relationships are also very important in visual navigation tasks. To this end, we design a new state construction method, called the short-term memory state. To explicitly include the position and pose changes between images in the state, we fuse the latent code produced by the encoder with the data produced by the inertial measurement unit (IMU). As a common hardware module of UAVs, the IMU can obtain the orientation, velocity, and acceleration of the UAV in real time. In this model, we use four consecutive frames to form a short-term memory state and obtain 10 pose values from the IMU at the moment the camera captures each frame. The data include the velocity and acceleration of the UAV along the x, y, and z axes of the inertial frame of reference and a quaternion (cos(ω/2), (Ox, Oy, Oz)·sin(ω/2)) representing the orientation, where ω is the rotation angle and (Ox, Oy, Oz) is a three-dimensional vector representing the rotation axis. These 10 variables and the 32 variables compressed by the VAE form a code of length 42 that represents the state of each step. The current state St and the three preceding states are combined into the short-term memory state (St, St−1, St−2, St−3). The complete state construction process is shown in Figure 7.

Figure 7. Short-term memory state construction.
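A minimal sketch of the short-term memory state construction is shown below; the ordering of the 42 values within a step and the newest-first stacking are our assumptions.

```python
from collections import deque
import numpy as np

class ShortTermMemory:
    """Builds the 4 x 42 short-term memory state (S_t, S_t-1, S_t-2, S_t-3) of Figure 7."""
    def __init__(self, history=4, step_len=42):
        self.steps = deque([np.zeros(step_len, dtype=np.float32)] * history, maxlen=history)

    def push(self, latent_code, velocity, acceleration, quaternion):
        # 32 VAE variables + 3 velocities + 3 accelerations + 4 quaternion values = 42
        step = np.concatenate([latent_code, velocity, acceleration, quaternion]).astype(np.float32)
        self.steps.append(step)
        # deque keeps the oldest step first; reverse so that the newest step S_t comes first
        return np.stack(list(self.steps)[::-1])    # shape (4, 42)
```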

The overall workflow of the short-term memory model is shown in Figure 8. It is similar to the continuous images model, except that the states result from the short-term memory state construction process described above. Since the shape of the short-term memory state is 4×42, the input size of the LSTM layer of the actor and critic networks becomes 4×42.

Figure 8. Short-term memory model.

4. Experiments and results

The following experiments are designed to observe the training process of the three models: the single image model, the continuous images model and the short-term memory model. We evaluate the training of the three models in a simulation environment built with AirSim (Shah et al., Citation2017) and test the trained models both in the same simulation environment and in a rearranged environment.

4.1. Environment settings

The specific hardware and software settings of our experiment environment and workstation are shown in Table 2.

Table 2. Environment Settings.

4.2. Simulation environment

We used UE4 and AirSim to construct a simulation environment for UAV flight training and testing. The simulation environment is a corridor 36 metres long, 5 metres wide and 5 metres high. Obstacles of different shapes appear in succession every 4 metres along the corridor, as shown in Figure 9. The UAV takes off from the beginning of the corridor, maintains a forward speed of 1 m/s in the x direction, and is trained by our visual navigation models to control its speed in the y and z directions for obstacle avoidance.

Figure 9. Simulation environment.
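For reference, a hedged sketch of how such a training step can be driven through the AirSim Python client is shown below; the image and collision handling and the step duration are our assumptions and are not taken from the paper.

```python
import airsim

client = airsim.MultirotorClient()
client.confirmConnection()
client.enableApiControl(True)
client.armDisarm(True)
client.takeoffAsync().join()

def fly_step(vy, vz, dt=0.25):
    """Hold 1 m/s forward in x; the policy only controls vy and vz (NED frame, z positive down)."""
    client.moveByVelocityAsync(1.0, float(vy), float(vz), dt).join()
    rgb = client.simGetImages(
        [airsim.ImageRequest("0", airsim.ImageType.Scene, False, False)])[0]
    collided = client.simGetCollisionInfo().has_collided
    return rgb, collided
```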

4.3. Reward shaping

To compare the performance of the three models, we use the same reward function for each model. Our reward function is very simple and falls into two cases, as shown in Table 3. When a collision occurs after the action is taken, a penalty of −2.0 is fed back. When there is no collision after the action is taken, a reward of 0.2 is obtained.

Table 3. Reward function.
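The reward function of Table 3 amounts to the following one-liner (a sketch; the collision flag is assumed to come from the simulator):

```python
def reward(collided: bool) -> float:
    """Table 3: -2.0 when the action leads to a collision, 0.2 otherwise."""
    return -2.0 if collided else 0.2
```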

4.4. Training processes

In the training phase, we used the three models introduced in Section 3 to train the UAV agent. The agent was trained to perform obstacle avoidance tasks from the same starting point in the same simulation environment. According to the reward shaping discussed in Section 4.3, if the agent avoids more and more obstacles as training progresses, the reward will increase with training. To verify that the models maximise the reward, we recorded the average reward of every 20 flights during training. The reward growth trends of the three models are shown in Figure 10.

Figure 10. Reward comparison of single image model, continuous image model and short-term memory model during training.

As can be seen from Figure 10, the rewards obtained by the three models generally rise as training progresses and gradually stabilise within a certain range. After 7,000 updates, the average reward of the single image model has gradually stabilised, but not completely. The average reward of the continuous images model has basically stabilised at around 0. In contrast, the short-term memory model far exceeds the previous two models in terms of both the growth rate of the reward and the final maximum reward level. We believe this is because the IMU data in the short-term memory state accurately represent the position and pose relationships between consecutive monocular visual observations, thereby helping the model to better understand the environment through the state.

In order to more intuitively monitor how the obstacle avoidance ability of the agent improves as the models converge, we also recorded the average moving distance of the last 20 flights during the training process. The results are shown in Figure 11.

Figure 11. Average moving distance comparison of single image model, continuous image model and short-term memory model during training.

The training results show that after 7,000 updates in a 36-metre-long training environment with 9 obstacles, the short-term memory model achieves an average flight distance of about 18 metres. In comparison, the continuous images model achieves an average flight distance of about 14 metres, while the single image model only reaches about 10 metres. This proves that the short-term memory model can indeed significantly improve the obstacle avoidance ability of the agent after training. We think this is because the short-term memory state allows the agent to learn the position and pose of the obstacles it has recently passed, making it easier to avoid side obstacles in the blind spot of its vision. To verify this conjecture, we designed a special experiment during model testing to record the number of times the three models collided with side obstacles over 100 test flights. The details of this experiment are given in Section 4.7.

4.5. Validating the role of VAE

The purpose of this experiment is to verify that using a VAE for visual perception can indeed achieve better performance than directly combining an image network with DRL. We chose the combination of MobileNet V2 (Sandler et al., Citation2018) and TD3 as the control group to compare against the single image model. MobileNet V2 is a lightweight and expressive image recognition network proposed by Google in 2018. It achieves higher or similar mAP and faster training than SSD300 (Liu et al., Citation2016), SSD512 (Liu et al., Citation2016), Yolo V2 and MobileNet V1 (Howard et al., Citation2017) on object detection tasks with fewer parameters. Since MobileNet V2 has a lightweight structure (a relatively shallow and narrow model among image networks) while maintaining good performance, it is, among image networks, the most suitable model to combine with a DRL model. Therefore, we use it as a representative of the DRL + image network approach to compare with our models.

In this experiment we train the MobileNet V2 + TD3 model for 7,000 updates in the same environment with the same reward function and simultaneously record the model's reward and average flight distance. We use the structure of the original MobileNet V2 paper, as shown in Table 4. We directly graft MobileNet V2 onto the first layer of the actor and critic networks of TD3. The output size k of MobileNet is set to 32 so as not to change the original structure of the actor critic networks (whose input size is 32). MobileNet is updated together with the actor and critic networks.

Table 4. MobileNet V2 structure.
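As a sketch of this grafting, the snippet below plugs a torchvision MobileNet V2 with k = 32 output features in front of a TD3 actor head; the head widths are assumptions, and the critic is built analogously with the action concatenated to the 32 features.

```python
import torch.nn as nn
from torchvision.models import mobilenet_v2

class MobileNetActor(nn.Module):
    """MobileNet V2 backbone with k = 32 outputs feeding the original TD3 actor (input size 32)."""
    def __init__(self, action_dim=2, k=32, hidden=256):
        super().__init__()
        self.backbone = mobilenet_v2(num_classes=k)        # trained jointly with the actor
        self.head = nn.Sequential(
            nn.Linear(k, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, image):                              # image: (batch, 3, H, W)
        return self.head(self.backbone(image))
```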

As can be seen from Figures 12 and 13, after 7,000 updates, the reward and average flight distance of the MobileNet V2 + TD3 model still fluctuate greatly, and the model does not appear to converge quickly. At the same time, the maximum average reward and moving distance it achieves are only comparable to the single image model.

Figure 12. Reward comparison of single image model and MobileNet V2 + TD3.

Figure 13. Average moving distance comparison of single image model and MobileNet V2 + TD3.

4.6. Obstacle avoidance test

In order to test the obstacle avoidance ability of the three models after training, we use them to control the agent to fly 100 times in the simulation environment. During these flights, we record the number of times each obstacle is avoided and calculate the average flight distance and the average obstacle avoidance rate over the 100 flights. We define the obstacle avoidance rate as the probability that the UAV, having detected an obstacle, finally avoids it through a series of actions. In the test scene, the obstacles are arranged in the order {O1, O2, …, On, On+1}. If the number of times the UAV avoids obstacle On is tn, and the number of times it avoids obstacle On−1 is tn−1, then the avoidance rate of the UAV for On is Rn = tn / tn−1. We can use Equation (2) to calculate the average obstacle avoidance rate for the UAV flying M times in an environment with N obstacles:
(2) R = ( t1/M + Σ(n=2 to N) tn/tn−1 ) / N
Test results of the three models and the baseline model (MobileNet V2 + TD3) are shown in Tables 5, 6, 7 and 8.

Table 5. Test results of short-term memory model.

Table 6. Test results of continuous images model.

Table 7. Test results of single image model.

Table 8. Test results of MobileNet V2 + TD3 model.
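For clarity, a small sketch of the average obstacle avoidance rate of Equation (2) is given below; the list layout of the counts and the example numbers are our assumptions.

```python
def average_avoidance_rate(avoid_counts, num_flights):
    """avoid_counts[n-1] = t_n, the number of flights that avoided obstacle O_n; M = num_flights.
    Assumes every t_{n-1} > 0 so the ratios in Equation (2) are defined."""
    N = len(avoid_counts)
    rates = [avoid_counts[0] / num_flights]              # R_1 = t_1 / M
    rates += [avoid_counts[n] / avoid_counts[n - 1]      # R_n = t_n / t_{n-1}
              for n in range(1, N)]
    return sum(rates) / N                                # Equation (2)

# Hypothetical example: 9 obstacles, 100 flights
print(average_avoidance_rate([95, 90, 70, 65, 60, 55, 40, 35, 30], 100))
```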

The test results show that the trained short-term memory model achieves an average flight distance of 22.6 metres over 100 flights, whereas the continuous images model and the single image model achieve 19.8 and 14.7 metres, respectively. In terms of the average obstacle avoidance rate, the short-term memory model reaches 86.4%, the continuous images model 81.6% and the single image model 73.8%. The MobileNet V2 + TD3 model can only fly an average of 8.4 metres with an average obstacle avoidance rate of 53.8%. It can be seen that the obstacle avoidance ability of our three models is stronger than that of the traditional image network + DRL method, and the short-term memory model indeed performs better than the other two models. Comparing the avoidance rate of each obstacle, we find that the short-term memory model has a significantly higher avoidance rate at obstacles 3 and 9 than the other two models.

We speculate that this is because, when the UAV passes through such door frame-type obstacles, there are obstacles in the blind areas on both the left and right sides, so the UAV has a greater probability of colliding with a side obstacle. The short-term memory model does increase the UAV's ability to avoid side obstacles.

To verify the generality of the trained models, we rearranged the obstacles of the training environment to construct a new test environment, in which the order of obstacles becomes 7-5-8-9-2-1-6-4-3. As shown in Table 9, the average obstacle avoidance rate and flight distance decreased for all three models, and only the short-term memory model still reaches a 70% obstacle avoidance rate in the new environment. This indicates that our models overfit the training environment to some degree.

Table 9. Test results in rearranged environment.

4.7. Validating perception of side obstacles

To verify the speculation of the previous experiment, we counted the number of times the UAV collided with obstacles in the blind area of the side view in 100 flying tests. When a collision occurs, we consider it a side collision if there is no obstacle close enough to the UAV's front-facing camera.

We can see from Table 10 that the short-term memory model does experience fewer side collisions over 100 flights, which confirms our speculation from the previous experiment.

Table 10. Side collision count in 100 flights of 3 models.

5. Conclusion

In this study, we aim to resolve the dilemma that deep reinforcement learning visual navigation systems rely on complex image networks, while such complex networks are not conducive to the convergence of the DRL model. To this end, we use a method in which a Conv-VAE visual perception module and the DRL model are trained simultaneously. Based on this method, a single image model, a continuous images model and a short-term memory model are proposed. Experiments show that the convergence and performance of these three models are better than those of DRL models that directly use image networks as actor and critic networks. Moreover, the short-term memory state, which includes the spatiotemporal relationship, can improve the ability of the DRL monocular vision navigation model to avoid side obstacles.

However, our study still has certain limitations. First, the single image model and continuous images model are less able to avoid particularly narrow door frame obstacles such as obstacles 3 and 9. This situation is greatly improved by the short-term memory model, which we believe has a stronger ability to perceive side collisions. However, even the short-term memory model did not achieve significantly better obstacle avoidance on obstacles 3 and 9 in the rearranged test environment. This implies that our models overfit the training environment to some degree, which is reflected not only in the short-term memory model's ability to avoid specific obstacles but also in the degraded performance of all three models in the rearranged test environment. In future research, we plan to train in environments that change randomly, so that the models gain more diverse experience with obstacles during training, and to design a more suitable noise scheme for visual navigation tasks. Secondly, the models' performance also drops significantly when facing obstacle 7. To investigate this, we examined the depth maps generated by the Conv-VAE for each obstacle and found that the generated depth map cannot effectively display the outline of obstacle 7. We believe this is because the surface of obstacle 7 is very smooth and highly reflective, so the perception module cannot effectively learn its depth information. In future research, we will focus on learning state representations with better perception of objects of different materials and textures.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • Anderson, W. C., Carey, K., Sturzinger, E. M., & Lowrance, C. J. (2019). Autonomous Navigation via a deep Q network with one-hot image encoding. 2019 IEEE International Symposium on Measurement and Control in Robotics (ISMCR). https://doi.org/10.1109/ismcr47492.2019.8955697
  • Azar, A. T., Koubaa, A., Ali Mohamed, N., Ibrahim, H. A., Ibrahim, Z. F., Kazim, M., Ammar, A., Benjdira, B., Khamis, A. M., Hameed, I. A., & Casalino, G. (2021). Drone deep reinforcement learning: A review. Electronics, 10(9), 999. https://doi.org/10.3390/electronics10090999
  • Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798–1828. https://doi.org/10.1109/tpami.2013.50
  • Bonatti, R., Madaan, R., Vineet, V., Scherer, S., & Kapoor, A. (2020). Learning visuomotor policies for aerial navigation using cross-modal representations. 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). https://doi.org/10.1109/iros45743.2020.9341049
  • Caltagirone, L., Bellone, M., Svensson, L., & Wahde, M. (2019). LIDAR–camera fusion for road detection using fully convolutional neural networks. Robotics and Autonomous Systems, 111, 125–131. https://doi.org/10.1016/j.robot.2018.11.002
  • Caselles-Dupré, H., Garcia-Ortiz, M., & Filliat, D. (2018, December 11). Continual state representation learning for reinforcement learning using Generative Replay. arXiv.org. Retrieved October 25, 2022, from https://arxiv.org/abs/1810.03880
  • Devo, A., Mezzetti, G., Costante, G., Fravolini, M. L., & Valigi, P. (2020). Towards generalization in target-driven visual navigation by using deep reinforcement learning. IEEE Transactions on Robotics, 36(5), 1546–1561. https://doi.org/10.1109/tro.2020.2994002
  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021, June 3). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv.org. Retrieved June 3, 2022, from https://arxiv.org/abs/2010.11929
  • Elkholy, H. A., Azar, A. T., Shahin, A. S., Elsharkawy, O. I., & Ammar, H. H. (2020). Path planning of a self driving vehicle using artificial intelligence techniques and machine vision. Advances in Intelligent Systems and Computing, 1153, 532–542. https://doi.org/10.1007/978-3-030-44289-7_50
  • Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., Legg, S., & Kavukcuoglu, K. (2018, June 28). Impala: Scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv.org. Retrieved June 3, 2022, from https://arxiv.org/abs/1802.01561
  • Finn, C., Tan, X. Y., Duan, Y., Darrell, T., Levine, S., & Abbeel, P. (2016). Deep spatial autoencoders for Visuomotor Learning. 2016 IEEE International Conference on Robotics and Automation (ICRA). https://doi.org/10.1109/icra.2016.7487173
  • Fujimoto, S., van Hoof, H., & Meger, D. (2018, October 22). Addressing function approximation error in actor-critic methods. arXiv.org. Retrieved June 3, 2022, from https://arxiv.org/abs/1802.09477
  • Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., & Levine, S. (2019, January 29). Soft actor-critic algorithms and applications. arXiv.org. Retrieved June 3, 2022, from https://arxiv.org/abs/1812.05905v2
  • Hasselt, H. v., Guez, A., & Silver, D. (2016, February 1). Deep reinforcement learning with double Q-learning: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. Guide Proceedings. Retrieved June 3, 2022, from https://dl.acm.org/doi/10.5555/3016100.3016191
  • He, D., Zou, Z., Chen, Y., Liu, B., Yao, X., & Shan, S. (2021). Obstacle detection of rail transit based on deep learning. Measurement, 176, 109241. https://doi.org/10.1016/j.measurement.2021.109241
  • Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., & Adam, H. (2017, April 17). MobileNets: Efficient convolutional neural networks for Mobile Vision Applications. arXiv.org. Retrieved January 23, 2023, from https://arxiv.org/abs/1704.04861
  • Kumar, V. R., Klingner, M., Yogamani, S., Milz, S., Fingscheidt, T., & Mader, P. (2021). SynDistNet: Self-supervised monocular fisheye camera distance estimation synergised with semantic segmentation for autonomous driving. 2021 IEEE Winter Conference on Applications of Computer Vision (WACV). https://doi.org/10.1109/wacv48630.2021.00011
  • Lange, S., & Riedmiller, M. (2010). Deep auto-encoder neural networks in reinforcement learning. The 2010 International Joint Conference on Neural Networks (IJCNN). https://doi.org/10.1109/ijcnn.2010.5596468
  • Lee, H. Y., Ho, H. W., & Zhou, Y. (2020). Deep learning-based monocular obstacle avoidance for unmanned aerial vehicle navigation in tree plantations. Journal of Intelligent & Robotic Systems, 101, 1. https://doi.org/10.1007/s10846-020-01284-z
  • Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2019, July 5). Continuous control with deep reinforcement learning. arXiv.org. Retrieved June 3, 2022, from https://arxiv.org/abs/1509.02971
  • Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., & Berg, A. C. (2016). SSD: Single shot multibox detector. Computer Vision – ECCV, 2016, 21–37. https://doi.org/10.1007/978-3-319-46448-0_2
  • Lu, Y., Xue, Z., Xia, G.-S., & Zhang, L. (2018). A survey on vision-based UAV navigation. Geo-Spatial Information Science, 21(1), 21–32. https://doi.org/10.1080/10095020.2017.1420509
  • Ma, Z., Wang, C., Niu, Y., Wang, X., & Shen, L. (2018). A saliency-based reinforcement learning approach for a UAV to avoid flying obstacles. Robotics and Autonomous Systems, 100, 108–118. https://doi.org/10.1016/j.robot.2017.10.009
  • Nachum, O., Gu, S., Lee, H., & Levine, S. (2018, October 5). Data-efficient hierarchical reinforcement learning. arXiv.org. Retrieved January 22, 2023, from https://arxiv.org/abs/1805.08296v4
  • Ota, K., Oiki, T., Jha, D. K., Mariyama, T., & Nikovski, D. (2020, June 27). Can increasing input dimensionality improve deep reinforcement learning? arXiv.org. Retrieved August 16, 2022, from https://arxiv.org/abs/2003.01629
  • Padhy, R. P., Verma, S., Ahmad, S., Choudhury, S. K., & Sa, P. K. (2018). Deep neural network for autonomous UAV navigation in indoor corridor environments. Procedia Computer Science, 133, 643–650. https://doi.org/10.1016/j.procs.2018.07.099
  • Pham, D., & Le, T. (2020). Auto-encoding variational Bayes for inferring topics and visualization. Proceedings of the 28th International Conference on Computational Linguistics. https://doi.org/10.18653/v1/2020.coling-main.458
  • Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: Unified, real-time object detection. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). https://doi.org/10.1109/cvpr.2016.91
  • Ruiz-del-Solar, J., Loncomilla, P., & Soto, N. (2018, March 28). A survey on deep learning methods for robot vision. arXiv.org. Retrieved March 2, 2023, from https://arxiv.org/abs/1803.10862
  • Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L.-C. (2018). MobileNetV2: Inverted residuals and linear bottlenecks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. https://doi.org/10.1109/cvpr.2018.00474
  • Shah, S., Dey, D., Lovett, C., & Kapoor, A. (2017). Airsim: High-fidelity visual and physical simulation for autonomous vehicles. Field and Service Robotics, 5, 621–635. https://doi.org/10.1007/978-3-319-67361-5_40
  • Shin, S.-Y., Kang, Y.-W., & Kim, Y.-G. (2020). Reward-driven U-Net training for obstacle avoidance drone. Expert Systems with Applications, 143, 113064. https://doi.org/10.1016/j.eswa.2019.113064
  • Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., & Hassabis, D. (2016). Mastering the game of go with deep neural networks and tree search. Nature, 529(7587), 484–489. https://doi.org/10.1038/nature16961
  • Singh, A., Kalaichelvi, V., & Karthikeyan, R. (2022). A survey on vision guided robotic systems with intelligent control strategies for autonomous tasks. Cogent Engineering, 9(1), https://doi.org/10.1080/23311916.2022.2050020
  • Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (1999, November 1). Policy gradient methods for reinforcement learning with function approximation: Proceedings of the 12th International Conference on Neural Information Processing Systems. Guide Proceedings. Retrieved June 3, 2022, from https://dl.acm.org/doi/10.5555/3009657.3009806
  • Wang, Z., Schaul, T., Hessel, M., Hasselt, H. V., Lanctot, M., & Freitas, N. D. (2016, June 1). Dueling network architectures for Deep Reinforcement Learning: Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48. Guide Proceedings. Retrieved June 3, 2022, from https://dl.acm.org/doi/10.5555/3045390.3045601
  • Wu, Y., Mansimov, E., Liao, S., Radford, A., & Schulman, J. (2017, August 18). OpenAI Baselines: ACKTR & A2c. OpenAI. Retrieved June 3, 2022, from https://openai.com/blog/baselines-acktr-a2c/
  • Xiao, X., Liu, B., Warnell, G., & Stone, P. (2022). Motion planning and control for mobile robot navigation using machine learning: A survey. Autonomous Robots, 46, 569–597. https://doi.org/10.1007/s10514-022-10039-8
  • Yue, P., Xin, J., Zhao, H., Liu, D., Shan, M., & Zhang, J. (2019). Experimental research on deep reinforcement learning in autonomous navigation of Mobile Robot. 2019 14th IEEE Conference on Industrial Electronics and Applications (ICIEA). https://doi.org/10.1109/iciea.2019.8833968
  • Zhang, J., Hao, B., Chen, B., Li, C., Chen, H., & Sun, J. (2019). Hierarchical reinforcement learning for course recommendation in MOOCs. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01), 435–442. https://doi.org/10.1609/aaai.v33i01.3301435
  • Zhou, X., Bai, T., Gao, Y., & Han, Y. (2019). Vision-based robot navigation through combining unsupervised learning and hierarchical reinforcement learning. Sensors, 19(7), 1576. https://doi.org/10.3390/s19071576
  • Zhou, Y., & Ho, H. W. (2022). Online robot guidance and navigation in non-stationary environment with hybrid hierarchical reinforcement learning. Engineering Applications of Artificial Intelligence, 114, 105152. https://doi.org/10.1016/j.engappai.2022.105152
  • Zieliński, P., & Markowska-Kaczmar, U. (2021). 3D robotic navigation using a vision-based deep reinforcement learning model. Applied Soft Computing, 110, 107602. https://doi.org/10.1016/j.asoc.2021.107602