Research Article

A Novel Augmentative Backward Reward Function with Deep Reinforcement Learning for Autonomous UAV Navigation

Article: 2084473 | Received 22 Feb 2022, Accepted 25 May 2022, Published online: 06 Jul 2022
 

ABSTRACT

Autonomous unmanned aerial vehicle (UAV) navigation has recently attracted increasing interest from both academia and industry because of its potential uses in various fields and, especially, the need for social distancing during the pandemic. Many works have adopted deep deterministic policy gradient (DDPG), a deep reinforcement learning (RL) method with experience replay, to control UAV motion, and have achieved high accuracy in static and simplified environments. However, these approaches are still far from ready for real-world adoption, where UAVs must operate under complex and dynamic conditions. We also found that using DDPG alone makes the learning process prone to oscillation and is inefficient for tasks with high-dimensional state-action spaces. Furthermore, the goal reward mechanism in traditional reward functions biases the agent toward states that resemble those in the goal area, which leads to erroneous action selection. To move closer to real-world adoption, we propose a novel method that enables UAVs to handle motion control in realistic environments. The first component of the method is point cloud data (PCD) simplification with a truncated icosahedron structure, which converts a large volume of PCD into a few essential data points. In the second component, we replace the traditional goal reward mechanism with a new mechanism, the Augmentative Backward Reward (ABR) function, which distributes the goal reward to transitions in proportion to their participation. By integrating simplified PCD and ABR, we achieved significantly better results than using the state-of-the-art TD3 alone. In addition, we tested the proposed method on another navigation task, BipedalWalkerHardcore, a testbed for RL, and the results were again better and steadier than those of TD3. These results indicate that the proposed method is robust.
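
The idea of distributing a goal reward backward over an episode's transitions can be illustrated with a minimal sketch. The abstract does not define how "participation" is measured, so the code below assumes, purely for illustration, that a transition's share decays geometrically with its distance from the goal step. The function name `distribute_goal_reward` and the `decay` parameter are hypothetical and are not taken from the article.

```python
from typing import List


def distribute_goal_reward(
    step_rewards: List[float],
    goal_reward: float,
    decay: float = 0.9,
) -> List[float]:
    """Spread a terminal goal reward backward over an episode's transitions.

    Assumption (not from the article): the "participation" of a transition
    is modelled as decay**k, where k is the number of steps between that
    transition and the goal. Weights are normalised to sum to 1, so the
    total reward handed out equals goal_reward.
    """
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay must be in (0, 1]")
    n = len(step_rewards)
    if n == 0:
        return []
    # Raw weights: the final transition gets decay**0 = 1, earlier ones less.
    raw = [decay ** (n - 1 - t) for t in range(n)]
    total = sum(raw)
    # Add each transition's normalised share of the goal reward to its
    # original per-step reward.
    return [r + goal_reward * w / total for r, w in zip(step_rewards, raw)]


if __name__ == "__main__":
    # Three transitions with zero step reward; the episode reached the goal.
    print(distribute_goal_reward([0.0, 0.0, 0.0], goal_reward=10.0, decay=0.5))
```

Under this assumed weighting, the augmented rewards would be written back to the replay buffer once the episode ends, so that earlier transitions also receive part of the goal signal instead of only the final one. The article's actual proportionality rule may differ.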

Disclosure Statement

No potential conflict of interest was reported by the author(s).