Research Article

Deep Recurrent Reinforcement Learning for Intercept Guidance Law under Partial Observability

Article: 2355023 | Received 15 Jul 2023, Accepted 30 Apr 2024, Published online: 16 May 2024

ABSTRACT

The rapid development of hypersonic vehicles poses great challenges to missile defense systems. As successful interception depends heavily on the terminal guidance law, research on guidance laws for intercepting highly maneuvering targets has attracted increasing attention. Artificial intelligence technologies, such as deep reinforcement learning (DRL), have been widely applied to improve the performance of guidance laws. However, existing DRL guidance laws rarely consider the partial observability of onboard sensors, which limits their engineering application. In this paper, a deep recurrent reinforcement learning (DRRL)-based guidance method is investigated to address the intercept guidance problem against maneuvering targets under partial observability. A sequence of previous state observations is utilized as the input of the policy network, and a recurrent layer is introduced into the networks to extract the hidden information behind the temporal sequence to support policy training. The guidance problem is formulated as a partially observable Markov decision process model, and a range-weighted reward function that considers the line-of-sight rate and energy consumption is designed to guarantee convergence of policy training. The effectiveness of the proposed DRRL guidance law is validated by extensive numerical simulations.

Introduction

Missile attack and defense confrontations are integral components of modern warfare and have been extensively researched. Missile penetration capabilities have advanced rapidly in recent decades; the development of hypersonic vehicles, in particular, has greatly enhanced the maneuverability of attackers. Moreover, the growing diversity of maneuvering penetration strategies has increased the complexity and unpredictability of attacker motion. As a result, developing effective terminal guidance laws for interceptors is critical to intercepting highly maneuvering targets successfully and accurately.

A variety of terminal guidance laws have been proposed for intercepting maneuvering targets, such as variants of proportional navigation guidance (PNG) (Ha et al. Citation1990; Yuan and Hsu Citation1995; Zarchan Citation2012), differential game methods (Battistini and Shima Citation2014) and sliding mode control (SMC) (Hu, Han, and Xin Citation2019; Zhang et al. Citation2018). These approaches can hit maneuvering targets precisely when the interceptor has a maneuverability advantage over the attacker. However, this condition is not always satisfied in practice, and once the attacker performs stronger maneuvers, interception performance degrades greatly. Moreover, as the information obtained from onboard sensors is only a partial observation of the system state, an observer is usually employed in these methods to estimate the unobservable states before generating guidance commands, which consumes considerable computing resources and introduces estimation biases.

As a subfield of artificial intelligence, reinforcement learning (RL) (Sutton and Barto Citation1998) and deep reinforcement learning (DRL) algorithms (Fujimoto, Hoof, and Meger Citation2018; Hasselt, Guez, and Silver Citation2016; Hausknecht and Stone Citation2015; Lillicrap et al. Citation2015; Volodymyr et al. Citation2015) are powerful data-driven approaches for addressing complicated decision and control problems. Recently, RL and DRL algorithms have been widely applied in education (Luo and Zhang Citation2023), medicine (Hantous, Rejeb, and Hellali Citation2022) and unmanned vehicles (Chansuparp and Jitkajornwanich Citation2022; Tan Citation2023; Wang et al. Citation2021; Wang, Gao, and Zhang Citation2021). To overcome the shortcomings of traditional guidance methods, RL and DRL algorithms have also been utilized to design guidance laws. Based on the Q-learning algorithm, Zhang, Ao, and Zhang (Citation2020) improved the PNG method with an optimal time-varying proportional coefficient, and He et al. (Citation2020) developed a novel guidance law to intercept maneuvering targets by introducing the concept of zero-effort-miss. Gaudet and Furfaro (Citation2012) proposed an RL guidance approach for the missile homing phase by using a policy optimization algorithm. Using the RL algorithm, Shalumov (Citation2020) developed a cooperative online launch and guidance policy for target – missile – defender engagement, which was shown to be near-optimal. Hong, Kim, and Park (Citation2020) developed a DDPG-based intercept guidance law using the interceptor’s velocity and line-of-sight angular rate to generate guidance commands. Liu et al. (Citation2021) investigated the impact time control guidance problem by estimating the time-to-go via a neural network and then nullifying the impact time error by utilizing the RL approach. Other DRL algorithms, such as model-based DRL (Liang et al. Citation2019; Peng et al. Citation2022), approximate dynamic programming (Zhou and Xu Citation2022) and reinforcement meta-learning (Gaudet, Furfaro, and Linares Citation2020), have also been used to explore intercept guidance laws. Overall, the aforementioned DRL-based guidance laws can address the disadvantages of traditional methods and exhibit excellent performance.

However, the full state of the missile-target engagement system is not always accessible to the missile in practice due to the limited observability of onboard sensors. In existing works, the partial observability problem has not been resolved in the process of guidance law design. Some studies assume that the full system state is available when computing guidance commands (Zhang et al. Citation2018). Other papers directly utilize sensory information (Gaudet, Furfaro, and Linares Citation2020; He, Shin, and Tsourdos Citation2021; Hong, Kim, and Park Citation2020; Wu et al. Citation2022) to approximate guidance laws, but the required maneuverability of the interceptor is much greater than that of the attacker. Furthermore, to handle unobservable states, a neural network-based state estimator has been built as an auxiliary to the agent's decision-making process (Li, Zhu, and Zhao Citation2022), which shares the same shortcomings as traditional estimation methods, such as estimation biases.

As partial observability is one of the primary issues that must be settled before implementing DRL guidance laws in practice, a novel guidance framework based on deep recurrent reinforcement learning (DRRL) is proposed in this paper (see Figure 1). The main contributions are summarized as follows:

  1. A novel deep recurrent reinforcement learning framework is proposed to address the partial observability problem. The historical observation sequence, which consists of several timesteps of previous observations, is utilized as the input of the policy network. A recurrent neural network is introduced to extract the information hidden behind the historical observation sequence to support policy training.

  2. The proposed DRRL algorithm is utilized to address the intercept guidance problem against a maneuvering target under the constraints of partial observability and limited maneuverability. Considering the partial observability of onboard sensors, the guidance problem is first formulated as a partially observable Markov decision process (POMDP) model. An immediate reward function that weights the impacts of the line-of-sight rate and energy consumption is designed to solve the sparse reward problem and improve the training performance.

  3. Compared with traditional methods that use the full system state and perfect target acceleration information, the proposed DRRL guidance law achieves a higher interception probability and smaller miss distance when intercepting highly maneuvering targets. Additionally, the effect of the length of the observation sequence on policy performance is studied through extensive simulations.

Figure 1. The proposed guidance framework based on deep recurrent reinforcement learning.


The remainder of this paper is organized as follows. Section 2 presents the deep recurrent reinforcement learning algorithm for addressing partial observability. Section 3 formulates the planar intercept problem against a maneuvering target under the framework of the partially observable Markov decision process model and presents the DRRL-based guidance law. In Section 4, extensive simulations are performed to validate the performance of the proposed guidance law. Conclusions are given in Section 5.

Deep Recurrent Reinforcement Learning

MDP and POMDP

In model-free reinforcement learning, an agent learns a policy from scratch and completes a task via episodic interactions with the environment. This problem can be abstracted as a Markov decision process (MDP) model described by a 4-tuple $(S, A, P, R)$ (Sutton and Barto Citation1998). At each timestep $t$, the agent chooses an action $a_t \in A$ based on the state $s_t \in S$ observed from the environment. The environment then transitions to the next state $s_{t+1}$ according to the transition function $P$, and a feedback signal known as the reward, $r_t = R(s_t, a_t)$, is conveyed to the agent.

However, instead of the full state of the environment, an incomplete and noisy state observation is usually provided to the agent in practice because of limited observability. One example is missile guidance, in which the onboard sensors measure only the range and angle but not the relative velocity and angular rate, which are indispensable for guidance command generation. The partially observable Markov decision process (POMDP) model describes the partially observable properties of real-world environments by explicitly admitting that the information received by the agent is only a portion of the system state. Formally, a POMDP can be described as a 6-tuple $(S, A, P, R, \Omega, O)$, where $S, A, P, R$ are the state space, action space, transition probability and reward function defined as in the MDP. The agent no longer has access to the full system state; instead, it receives an observation $o \in \Omega$ from the observation space according to the probability distribution $O(o \mid s)$.

Twin Delayed Deep Deterministic Policy Gradient (TD3) Algorithm

The goal of reinforcement learning is to find an optimal behavior strategy for the agent that maximizes the obtained rewards. The policy gradient method directly optimizes the policy, which is usually modeled as a parameterized function $\pi_\phi$, with the optimization objective

(1) $J(\pi) = \mathbb{E}_\pi\left[R_t \mid s_t\right] = \mathbb{E}_\pi\!\left[\sum_{k=1}^{\infty}\gamma^{k-1} r_{t+k} \,\middle|\, s_t\right]$

where $\gamma \in [0,1)$ is a discount factor that determines the priority of short-term rewards.

As shown in Figure 2, the twin delayed deep deterministic policy gradient (TD3) algorithm (Fujimoto, Hoof, and Meger Citation2018) employs a deterministic policy network $\mu_\phi$ (the actor) that maps the state observation directly to the agent's action, and a group of state-action value functions (the critics) consisting of two independent networks $Q_{\theta_1}$ and $Q_{\theta_2}$. Each network has a corresponding target network, denoted $\mu_{\phi'}$, $Q_{\theta'_1}$ and $Q_{\theta'_2}$, with the same structure as its original network. The TD3 algorithm uses a replay buffer $\mathcal{B}$ with finite size $N_B$ to store the transitions sampled from the environment. At each timestep, the algorithm randomly samples $N_b$ transitions from the buffer and updates the parameters of the actor and critics. The target Q value is computed by

Figure 2. The workflow of the TD3 algorithm.


(2) $Q_{tar} = r_t + \gamma \min_{k=1,2} Q_{\theta'_k}\!\big(s_{t+1}, \mu_{\phi'}(s_{t+1}) + \varepsilon\big), \quad \varepsilon \sim \mathrm{clip}\big(\mathcal{N}(0,\sigma), -c, c\big)$

where a small amount of random noise ε is introduced into the target policy to realize smoothing regularization of the training process. Then, the pair of critics is updated to minimize the Q value estimation error:

(3) $L(\theta_k) = \mathbb{E}\big[\big(Q_{\theta_k}(s_t, a_t) - Q_{tar}\big)^2\big], \quad \theta_k^{\,i+1} = \theta_k^{\,i} - \alpha_\theta \nabla_{\theta_k} L(\theta_k), \quad k = 1, 2$

where $\alpha_\theta$ is the learning rate of the critics. The actor network is updated less frequently than the critics; that is, after every $d$ critic updates, the actor is updated to maximize the expected return of the policy:

(4) $\nabla_\phi J(\mu_\phi) = \mathbb{E}\big[\nabla_\phi \mu_\phi(s_t)\, \nabla_a Q_{\theta_1}(s_t, a)\big|_{a=\mu_\phi(s_t)}\big], \quad \phi^{\,i+1} = \phi^{\,i} + \alpha_\phi \nabla_\phi J(\mu_\phi)$

where $\alpha_\phi$ is the learning rate of the actor. The target networks are softly updated by slowly tracking their corresponding networks:

(5) $\phi' \leftarrow \kappa\phi + (1-\kappa)\phi', \quad \theta' \leftarrow \kappa\theta + (1-\kappa)\theta', \quad \kappa \ll 1$

where $\kappa$ is the soft update rate.
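For illustration, the following is a minimal PyTorch sketch of one TD3 update step corresponding to Equations (2)-(5). All network, optimizer and buffer names are placeholders, the actions are assumed to be normalized to [-1, 1], and the sketch is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def td3_update(step, batch, actor, actor_t, critic1, critic2, critic1_t, critic2_t,
               actor_opt, critic_opt, gamma=0.99, sigma=0.2, c=0.5, d=2, kappa=5e-3):
    """One TD3 update following Eqs. (2)-(5); all module/optimizer arguments are placeholders."""
    s, a, r, s_next, done = batch  # tensors sampled from the replay buffer

    with torch.no_grad():
        # Target policy smoothing: clipped Gaussian noise added to the target action, Eq. (2)
        noise = (torch.randn_like(a) * sigma).clamp(-c, c)
        a_next = (actor_t(s_next) + noise).clamp(-1.0, 1.0)  # assumes normalized actions
        # Clipped double-Q target
        q_next = torch.min(critic1_t(s_next, a_next), critic2_t(s_next, a_next))
        q_tar = r + gamma * (1.0 - done) * q_next

    # Critic update, Eq. (3)
    critic_loss = F.mse_loss(critic1(s, a), q_tar) + F.mse_loss(critic2(s, a), q_tar)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Delayed actor update and soft target updates, Eqs. (4)-(5)
    if step % d == 0:
        actor_loss = -critic1(s, actor(s)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()
        for net, net_t in [(actor, actor_t), (critic1, critic1_t), (critic2, critic2_t)]:
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.mul_(1.0 - kappa).add_(kappa * p.data)
```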

Deep Recurrent Reinforcement Learning (DRRL) for POMDP

As mentioned above, the TD3 algorithm is formulated under the MDP framework, in which the whole environment state is accessible to the agent. For practical tasks, however, the whole system state is not always available due to limited measurements, and the resulting partial observability can cause significant performance degradation of the RL policy. To address this problem, we propose a deep recurrent reinforcement learning algorithm based on historical observation sequences. As shown in Figure 3, by combining the previous $m$ timesteps of observations, the current state observation is extended into a temporal sequence

Figure 3. Deep Recurrent Reinforcement Learning.


(6) $y_t = \big(o_{t-m}, \ldots, o_{t-1}, o_t\big)$

Although the unobservable states are not provided explicitly in $y_t$, the rich information hidden behind the historical observation sequence can be extracted via neural networks and substituted for the full system state to support training of the agent's policy.
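As one possible realization of the sequence input $y_t$ in Equation (6), a small history buffer can stack the most recent $m+1$ observations and zero-pad at the start of an episode. This is an illustrative sketch; the class name and the zero-padding choice are assumptions, not the authors' exact implementation.

```python
from collections import deque
import numpy as np

class ObservationHistory:
    """Maintains y_t = (o_{t-m}, ..., o_t) as in Eq. (6), zero-padded before step m."""

    def __init__(self, m, obs_dim):
        self.maxlen = m + 1
        self.obs_dim = obs_dim
        self.buffer = deque(maxlen=self.maxlen)

    def reset(self, first_obs):
        self.buffer.clear()
        # Pad with zeros so the sequence length is fixed from the very first step
        for _ in range(self.maxlen - 1):
            self.buffer.append(np.zeros(self.obs_dim))
        self.buffer.append(np.asarray(first_obs, dtype=float))

    def append(self, obs):
        self.buffer.append(np.asarray(obs, dtype=float))

    def sequence(self):
        # Shape (m + 1, obs_dim), oldest observation first
        return np.stack(self.buffer, axis=0)
```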

To extract and memorize temporal dependencies effectively, we utilize a gated recurrent unit (GRU) (Cho et al. Citation2014), a typical kind of recurrent neural network (RNN). The GRU contains two gates. The reset gate $r_t$ determines whether the previous hidden state should be ignored and is computed by

(7) $r_t = \mathrm{sigmoid}\big(W_r [h_{t-1}, x_t]\big)$

where sigmoid is the logistic sigmoid function, $W_r$ is the weight matrix, and $h_{t-1}$ and $x_t$ are the previous hidden state and the current input, respectively. The update gate $z_t$ determines whether the hidden state should be updated with a new state and is computed as

(8) $z_t = \mathrm{sigmoid}\big(W_z [h_{t-1}, x_t]\big)$

Then, the hidden state $h_t$ is updated by

(9) $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$

where $\tilde{h}_t = \tanh\big(W [r_t \odot h_{t-1}, x_t]\big)$ and tanh is the hyperbolic tangent function. In this way, each GRU can capture the dependencies of temporal sequences over different time scales.

As shown in Figure 3, we adopt a GRU layer in both the actor and critic networks as a preprocessing module that takes the temporal observation sequence as input; its output is then fed into the subsequent fully connected layers so that the hidden information behind the input sequence is extracted effectively as a substitute for the unobservable system states.
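A minimal PyTorch sketch of such a recurrent actor is given below, assuming the hidden-layer sizes reported later in the experiment settings (128, 128, 64); the class and argument names are illustrative and the exact architecture may differ from the authors' networks.

```python
import torch.nn as nn

class RecurrentActor(nn.Module):
    """Maps an observation sequence y_t of shape (batch, m + 1, obs_dim) to a bounded action."""

    def __init__(self, obs_dim, act_dim, act_limit):
        super().__init__()
        self.gru = nn.GRU(input_size=obs_dim, hidden_size=128, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, act_dim), nn.Tanh(),
        )
        self.act_limit = act_limit  # e.g., the maximum interceptor acceleration

    def forward(self, obs_seq, h0=None):
        # The GRU digests the temporal sequence; its last hidden state summarizes it
        _, h_n = self.gru(obs_seq, h0)   # h_n: (1, batch, 128)
        features = h_n.squeeze(0)        # (batch, 128)
        return self.act_limit * self.head(features)
```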

Guidance Law Based Upon DRRL

Problem Formulation

As shown in Figure 4, we consider a two-dimensional engagement, where T and M represent the target and the intercepting missile, respectively. An exo-atmospheric interception is considered, and the effect of gravity is neglected for simplicity. It is assumed that the missile and target each travel at a constant speed $v$ with flight path angle $\theta$ and lateral acceleration $a$. Thus, the dynamics of the missile-target relative motion are modeled as

Figure 4. Engagement geometry.


(10) $\dot{R} = -v_M\cos(\lambda - \theta_M) - v_T\cos(\lambda + \theta_T), \quad R\dot{\lambda} = -v_M\sin(\lambda - \theta_M) + v_T\sin(\lambda + \theta_T), \quad \dot{\theta}_T = a_T/v_T, \quad \dot{\theta}_M = a_M/v_M$

where R denotes the missile-target relative range and λ is the line-of-sight (LOS) angle.

As the acceleration is generated according to the autopilot with a time delay, we model the autopilot dynamics as a first-order time lag system:

(11) $\dot{a} = -\frac{1}{\tau}a + \frac{1}{\tau}a_C$

where $\tau$ is the time constant and $a_C$ is the acceleration command.

Here, we consider two practical periodic target maneuver policies (Zarchan Citation2012): the barrel roll $a_T(t) = a_T^{\max}\sin\!\big[\tfrac{2\pi}{T}(t - t_0)\big]$ and the vertical-S maneuver $a_T(t) = a_T^{\max}\,\mathrm{sign}\big\{\sin\!\big[\tfrac{2\pi}{T}(t - t_0)\big]\big\}$, where $a_T^{\max}$ is the maneuver amplitude denoting the maximum acceleration capability of the target, $T$ is the maneuver period, $t_0$ is the time at which the maneuver starts, and sign is the sign function.
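To make Equations (10), (11) and the two maneuver policies concrete, the sketch below writes the planar engagement derivatives as a single function suitable for a Runge-Kutta integrator. The sign conventions follow the reconstruction of Equation (10) above, applying the first-order autopilot lag of Equation (11) to both acceleration channels is an assumption made for illustration, and all function names are placeholders.

```python
import numpy as np

def engagement_derivatives(state, a_cmd_M, a_cmd_T, vM, vT, tau=0.1):
    """Right-hand side of Eqs. (10)-(11).
    state = [R, lam, theta_T, theta_M, a_T, a_M]; both accelerations are passed
    through a first-order lag here purely for illustration."""
    R, lam, theta_T, theta_M, a_T, a_M = state
    R_dot = -vM * np.cos(lam - theta_M) - vT * np.cos(lam + theta_T)
    lam_dot = (-vM * np.sin(lam - theta_M) + vT * np.sin(lam + theta_T)) / R
    theta_T_dot = a_T / vT
    theta_M_dot = a_M / vM
    a_T_dot = (a_cmd_T - a_T) / tau
    a_M_dot = (a_cmd_M - a_M) / tau
    return np.array([R_dot, lam_dot, theta_T_dot, theta_M_dot, a_T_dot, a_M_dot])

def barrel_roll(t, t0, a_max, period):
    # Sinusoidal lateral acceleration command
    return a_max * np.sin(2.0 * np.pi * (t - t0) / period)

def vertical_s(t, t0, a_max, period):
    # Square-wave (bang-bang) lateral acceleration command
    return a_max * np.sign(np.sin(2.0 * np.pi * (t - t0) / period))
```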

POMDP Model

In this section, considering the partial observability constraint, we formulate the interception problem under the framework of the POMDP model.

Generally, the whole system state is denoted as

(12) $s = [R, \lambda, \dot{R}, \dot{\lambda}]^{\mathrm{T}}$

including the missile-target relative range $R$, the line-of-sight angle $\lambda$, the relative velocity $\dot{R}$ and the LOS angular rate $\dot{\lambda}$. In practice, however, $\dot{R}$ and $\dot{\lambda}$ cannot be measured directly by onboard sensors; therefore, the state observation is written as

(13) $o = [R, \lambda]^{\mathrm{T}}$

As mentioned above, this paper utilizes the historical observation sequence $(o_{t-m}, \ldots, o_{t-1}, o_t)$ as the input of the policy network to generate the guidance command, which is chosen as the agent's action:

(14) $a \in [-a_M^{\max}, a_M^{\max}], \quad a \in \mathbb{R}$

where $a_M^{\max}$ is the maximum acceleration of the interceptor.

The reward function has two components: the inherent terminal reward and the designed immediate reward. When the missile-target relative velocity $\dot{R}$ changes sign from negative to positive (i.e., the range begins to increase), the agent has reached the terminal state and the engagement terminates. According to the terminal miss distance $R_{terminal}$, a terminal reward function can be designed as

(15) $r_{terminal} = \begin{cases} r_0, & R_{terminal} \le R_{hit} \\ 0, & R_{terminal} > R_{hit} \end{cases}$

where $r_0$ is a positive value and $R_{hit}$ is a constant used to judge whether the target has been hit; it is chosen according to the kill radius of the interceptor. When $R_{terminal}$ is less than or equal to $R_{hit}$, the interception is successful and the interceptor is rewarded with $r_0$. In contrast, there is no reward for the off-target case in which $R_{terminal}$ is larger than $R_{hit}$. Furthermore, to improve the learning efficiency, an immediate reward $r_t$ that weights the impacts of the LOS rate and the energy consumption according to the missile-target relative range is designed for the case in which the state changes from $s_t$ to a nonterminal state $s_{t+1}$ via an action $a_t$:

(16) $r_t = (1 - \bar{R}_t)\exp\!\big(-\beta\dot{\lambda}_t^2\big) - \bar{R}_t\,\bar{a}_t^2$

where $\bar{R}_t = R_t / R_{\max}$ is the normalized missile-target relative range at time $t$, $\dot{\lambda}_t$ is the LOS rate, $\bar{a}_t$ is the normalized missile acceleration, and $\beta$ is a constant. The first term in Equation (16) is a Gaussian-like function of $\dot{\lambda}$ that encourages the interceptor to nullify the line-of-sight angular rate. This formulation is inspired by traditional guidance laws: if $\dot{\lambda} = 0$ holds, the missile-target relative velocity points toward the target throughout the engagement, ensuring successful interception of the maneuvering target. Nullifying the LOS rate is effective against a maneuvering target but consumes a considerable amount of energy when tracking the target. Therefore, the range-dependent coefficient $(1 - \bar{R}_t)$ encourages the interceptor to maintain a constant line-of-sight angle $\lambda$ only when the missile and target are sufficiently close. Correspondingly, the second term in Equation (16), weighted by $\bar{R}_t$, discourages unnecessary actions when the missile is far from the target.
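A small sketch of the two reward components, as reconstructed in Equations (15)-(16), is given below; the numerical defaults ($R_{hit}=10$ m, $r_0=20$, $\beta=10^4$) are the values quoted in the experiment settings, and the function names are illustrative.

```python
import numpy as np

def terminal_reward(R_terminal, R_hit=10.0, r0=20.0):
    """Eq. (15): reward r0 only if the terminal miss distance is within the kill radius."""
    return r0 if R_terminal <= R_hit else 0.0

def immediate_reward(R_t, lam_dot_t, a_M_t, R_max, a_M_max, beta=1e4):
    """Eq. (16): range-weighted trade-off between nullifying the LOS rate and saving energy."""
    R_bar = R_t / R_max          # normalized relative range in [0, 1]
    a_bar = a_M_t / a_M_max      # normalized missile acceleration
    # Close range (small R_bar): reward driving the LOS rate toward zero.
    # Long range (large R_bar): penalize unnecessary control effort.
    return (1.0 - R_bar) * np.exp(-beta * lam_dot_t ** 2) - R_bar * a_bar ** 2
```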

Guidance Law Optimization

The detailed training procedure of the proposed DRRL guidance law is listed in Table 1.

Table 1. Training procedure of the proposed DRRL guidance law.

Note that the exploration noise is generated by an Ornstein-Uhlenbeck (OU) process to achieve more efficient exploration, which is written as

(17) $dx_t = \theta(\mu - x_t)\,dt + \sigma\, dW_t$

The OU process combines mean reversion, which pulls the process back toward the mean value $\mu$ at a rate $\theta > 0$, with a Brownian motion $W_t$ scaled by $\sigma > 0$. As depicted in Figure 5, the OU process produces temporally correlated noise that oscillates near 0. For physical control problems with inertia, the gap between two adjacent actions should be small, as the commanded action cannot change abruptly without a delay. Since the intercept guidance problem is such a physical control problem, the OU process is adopted for high-efficiency exploration.
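The following is a discrete-time Euler-Maruyama sketch of the OU exploration noise in Equation (17), using the parameters shown in Figure 5; the step size dt and the initial value are assumptions for illustration.

```python
import numpy as np

class OUNoise:
    """Temporally correlated exploration noise, Eq. (17), discretized with Euler-Maruyama."""

    def __init__(self, mu=0.0, theta=0.15, sigma=0.2, dt=0.1, x0=0.0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = x0

    def sample(self):
        # dx = theta * (mu - x) dt + sigma dW, with dW ~ N(0, dt)
        dx = self.theta * (self.mu - self.x) * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn()
        self.x += dx
        return self.x
```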

Figure 5. Ornstein–Uhlenbeck noise (μ=0,θ=0.15,σ=0.2).

Figure 5. Ornstein–Uhlenbeck noise (μ=0,θ=0.15,σ=0.2).

Furthermore, considering the training instability introduced by the recurrent layer, we draw a random sample from the replay buffer at each update and adopt a bootstrapped random update strategy: each update begins at a random point within a stored episode and proceeds for only one timestep, with the GRU's initial hidden state set to zero at the start of each update (Hausknecht and Stone Citation2015).
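One way to realize this bootstrapped random update is sketched below: pick a random point inside a stored episode, rebuild the zero-padded sequence inputs $y_t$ and $y_{t+1}$ around it, and train on that single transition with the GRU hidden state reset to zero. This is an illustrative sketch of the strategy, not the authors' code, and the episode storage format is an assumption.

```python
import numpy as np

def sample_bootstrapped_transition(episode, m):
    """episode: list of (o, a, r, done) tuples for one stored trajectory (length >= 2).
    Returns (y_t, a_t, r_t, y_{t+1}, done_t) with zero-padded sequences of length m + 1."""
    t = np.random.randint(0, len(episode) - 1)   # random start point within the episode

    def sequence(end):
        obs = [episode[i][0] for i in range(max(0, end - m), end + 1)]
        pad = [np.zeros_like(obs[0])] * (m + 1 - len(obs))
        return np.stack(pad + obs, axis=0)

    o, a, r, done = episode[t]
    return sequence(t), a, r, sequence(t + 1), done
```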

Experiments and Analysis

Environment and Algorithm Settings

The environment is initialized as follows. First, the engagement parameters are uniformly sampled between the minimum and maximum values listed in Table 2. Then, the interceptor is placed on a collision triangle, as shown in Figure 4, with its velocity vector $v_M$ pointing toward the predicted intercept point (PIP) and a leading angle $L = \sin^{-1}\!\big(\tfrac{v_T\sin(\theta_T + \lambda)}{v_M}\big)$. Furthermore, considering potential biases when calculating the PIP in practice, a heading error $\varepsilon$ is introduced into the interceptor's initial flight path angle: $\theta_M = \lambda + L + \varepsilon$.

Table 2. Initial parameters for engagement.
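The collision-triangle initialization described above can be sketched as follows: the leading angle L is computed from the sampled geometry and a random heading error is added to the interceptor's initial flight path angle. The uniform draw of the heading error and the function name are assumptions; the parameter ranges stand in for the values of Table 2.

```python
import numpy as np

def initial_flight_path_angle(lam, theta_T, vT, vM, heading_error_deg_max):
    """Place the interceptor on the collision triangle and perturb it with a heading error."""
    # Leading angle L = asin(vT * sin(theta_T + lam) / vM); assumes vM is large enough
    L = np.arcsin(vT * np.sin(theta_T + lam) / vM)
    # Heading error drawn uniformly within the specified bound (an assumption)
    eps = np.deg2rad(np.random.uniform(-heading_error_deg_max, heading_error_deg_max))
    return lam + L + eps   # theta_M = lam + L + eps
```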

In the training phase, the attacker's maneuver mode is a barrel roll with a maximum acceleration of $a_T^{\max} = 5g$ and a maneuver period of $T = 10$ s. The attacker begins to maneuver at a random time while the missile-target relative range is between 5 km and 20 km. The maximum acceleration of the interceptor is set to $a_M^{\max} = 5g$. The time constant of the autopilot is chosen as $\tau = 0.1$ s. The system state is propagated using the fourth-order Runge-Kutta method with variable step-size integration: a timestep of 0.001 s is used for the final 100 m of guidance, and a timestep of 0.1 s is used otherwise. The number of previous observations in the input sequence is set to $m = 5$. When the terminal state is reached, a terminal miss distance of less than $R_{hit} = 10$ m is defined as a successful interception, and the agent receives a positive reward of $r_0 = 20$ in this case. The constant in the immediate reward function is set to $\beta = 10^4$.

The reward at each step is discounted by the factor $\gamma = 0.99$. The capacity of the experience replay buffer is $N_B = 10^6$, and at each update a batch of $N_b = 512$ samples is randomly drawn and fed into the networks. The exploration noise is generated by an OU process with parameters $\mu = 0$, $\theta = 0.15$ and $\sigma = 0.3$. The target policy noise is generated by a Gaussian process with $\mu = 0$ and $\sigma = 0.2$ and is clipped at a maximum magnitude of 0.5. For both the actor and critic networks, the numbers of hidden units are 128, 128 and 64, and the learning rates are set to $\alpha_\phi = \alpha_\theta = 3 \times 10^{-4}$. The target networks are softly updated at a rate of $\kappa = 5 \times 10^{-3}$. The actor training frequency is selected as $d = 2$.
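For reference, the hyperparameters listed in this subsection can be collected in a single configuration object; the sketch below simply restates the reported values, and the field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class DRRLConfig:
    """Hyperparameters as reported in the experiment settings; field names are illustrative."""
    gamma: float = 0.99            # discount factor
    buffer_size: int = 1_000_000   # replay buffer capacity N_B
    batch_size: int = 512          # minibatch size N_b
    ou_mu: float = 0.0             # OU exploration noise parameters
    ou_theta: float = 0.15
    ou_sigma: float = 0.3
    policy_noise: float = 0.2      # Gaussian target policy smoothing noise
    noise_clip: float = 0.5
    hidden_units: tuple = (128, 128, 64)
    lr_actor: float = 3e-4
    lr_critic: float = 3e-4
    kappa: float = 5e-3            # soft target update rate
    actor_delay: int = 2           # actor trained every d critic updates
    seq_len_m: int = 5             # number of previous observations in the input sequence
```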

Training Results

The proposed DRRL guidance law is trained in the aforementioned stochastic scenario, and an ablation experiment using vanilla DRL is conducted for comparison, in which a single observation is taken as the input and only fully connected layers are used. The learning curves over 60,000 training episodes are shown in Figure 6. In the first 20,000 episodes, the average cumulative reward of DRRL increases gradually; after the 20,000th episode, the curve levels off, indicating convergence of the trained policy. In contrast, the DRL approach exhibits inferior learning efficiency, and its average cumulative reward is much lower than that of DRRL.

Figure 6. Learning curves.


Furthermore, 1000 Monte Carlo simulation tests are performed under different initial conditions, in which the initial parameters vary within the ranges listed in Table 2. Two classic model-based approaches, augmented proportional navigation (APN) and sliding mode control (SMC), are used for comparison; see Equations (18) and (19), respectively.

(18) $a_{APN} = -N_{APN}\dot{R}\dot{\lambda} + \frac{1}{2}N_{APN}\hat{a}_T$
(19) $a_{SMC} = -N_{SMC}\dot{R}\dot{\lambda} + \bar{a}_T\dfrac{\dot{\lambda}}{|\dot{\lambda}| + \delta}$

where $N_{APN}$ and $N_{SMC}$ are the navigation coefficients of APN and SMC, respectively; $\hat{a}_T$ is the estimated target acceleration; $\bar{a}_T$ is the upper bound of the target acceleration; and $\delta$ is the coefficient of the limiter function. It is assumed that perfect information on the target maneuver is provided to APN and SMC in the simulation. The results are shown in Table 3: the proposed DRRL guidance law achieves a hit probability of 97.9%, which is 60% and 30% greater than those of APN and SMC, respectively. With the energy cost defined as $\int_{t_0}^{t_f} a_M^2\, dt$, the DRRL approach also achieves the lowest average energy consumption. Despite the limited observability and maneuverability of the interceptor, the DRRL guidance law realizes higher interception performance with less energy consumption, which is of practical engineering significance.
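The two baselines can be written directly from Equations (18)-(19) as sketched below; the sign convention (closing velocity taken as $-\dot{R}$) follows the reconstruction above, and the default values of the navigation coefficient and limiter parameter are placeholders, not the values used in the paper.

```python
def apn_command(R_dot, lam_dot, a_T_hat, N=4.0):
    """Augmented proportional navigation, Eq. (18); N is a placeholder gain."""
    return -N * R_dot * lam_dot + 0.5 * N * a_T_hat

def smc_command(R_dot, lam_dot, a_T_bound, N=4.0, delta=1e-3):
    """Sliding mode control guidance, Eq. (19), with a limiter in place of the sign function."""
    return -N * R_dot * lam_dot + a_T_bound * lam_dot / (abs(lam_dot) + delta)
```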

Table 3. Test results in the training environment.

In Figure 7, a missile-target engagement is depicted, including the flight trajectories of the missile and target as well as the line-of-sight angular rate, target acceleration and missile acceleration curves. The target starts the barrel roll maneuver at approximately 13 s, and successful interception is achieved by APN, SMC and DRRL, with terminal miss distances of 2.96 m, 1.85 m and 0.86 m, respectively; DRRL thus realizes the most accurate hit on the maneuvering target.

Figure 7. Interception trajectories of the APN, SMC and DRRL methods.


Test Results

In this section, the well-trained DRRL policy is tested in a set of diverse scenarios to verify its adaptability and robustness. The test scenarios are designed as follows.

Scenario 1: the target performs a barrel roll maneuver with the same parameters as in the training scenario.

Scenario 2: the target performs a vertical-S maneuver with the same parameters as in the training scenario.

Scenario 3: the target performs a barrel roll maneuver, and its period is 5 s.

Scenario 4: the target performs a barrel roll maneuver, and its period is 15 s.

Scenario 5: the target performs a barrel roll maneuver with a maximum acceleration of 7 g.

For these five scenarios, 1000 Monte Carlo simulation tests are conducted with maximum heading errors of $HE = 0^{\circ}, 5^{\circ}, 10^{\circ}, 15^{\circ}$ and $20^{\circ}$. The hit probability results for each case are shown in Figure 8. The proposed DRRL guidance law achieves a higher intercept probability than APN and SMC in all situations. Because the target is always at maximum acceleration when performing a vertical-S maneuver, all three methods exhibit an obvious performance degradation compared with the case of intercepting a barrel roll maneuvering target; nevertheless, the hit probability of DRRL remains approximately 30% greater than those of APN and SMC. The results also show that the DRRL guidance law adapts to various uncertainties in the target maneuver, including the maneuver period and the maximum acceleration. In addition, the hit probability of DRRL decreases slowly as the initial heading error increases but remains higher than that of the other two methods, whereas SMC exhibits clear susceptibility to heading error.

Figure 8. Performance comparison of APN, SMC and DRRL in test scenarios.


Discussion on the Observation Sequence

In this section, the sensitivity of the proposed DRRL guidance law to the observation sequence length is studied. The simulation is conducted with m ranging from 1 to 7 in the training scenario (see ), and the other hyperparameters remain unchanged.

The results are depicted in Figure 9 and Table 4. Figure 9 shows that the training efficiency is very low and the average cumulative reward converges to a low value when the sequence input contains only one timestep of previous observation. When $m \ge 2$, there is a significant improvement in training efficiency, indicating that including more information contributes to higher learning efficiency and effectiveness. However, Table 4 shows that the interception probability reaches its highest value of 99.7% at $m = 6$ and does not increase further for $m > 6$, which indicates that no additional information can be extracted from the sequence input even if more previous observations are provided. As a result, it is suggested that 6 timesteps of previous observations be included in the observation sequence; the resulting guidance law balances interception performance and computational complexity well.

Figure 9. Comparison of learning curves when the model is trained with different values of m.


Table 4. Performance comparison with different lengths of observation sequence input.

Conclusion

In this work, a deep recurrent reinforcement learning algorithm is proposed to address the intercept guidance problem against maneuvering targets under constraints of partial observability and acceleration limits. By introducing a historical observation sequence input and a gated recurrent unit into the policy network, the information hidden behind the temporal sequence is extracted and used to support policy training. Numerical simulation results validate that the proposed DRRL guidance law can successfully and effectively intercept highly maneuvering targets when the interceptor has limited observation ability and maneuverability. Compared with the APN and SMC methods, the DRRL approach exhibits superior performance in terms of hit rate, energy consumption and generalization capability. Moreover, a sensitivity study on the observation sequence length shows that the best performance is achieved when 6 timesteps of observations are included in the sequence input. This work provides insights for DRL guidance law design under constraints of partial observability and limited maneuverability.

However, the proposed DRRL guidance law still has some limitations. First, although the learning efficiency of the DRRL method is higher than that of the vanilla DRL approach, it needs to be improved further to meet the requirements of engineering application. In addition, future work may consider more diverse and realistic engagement scenarios, such as three-dimensional engagement and endo-atmospheric interception.

Disclosure Statement

No potential conflict of interest was reported by the author(s).

Data Availability Statement

The authors confirm that the data supporting the findings of this study are available within the article.

Additional information

Funding

This work was supported by the National Natural Science Foundation of China under Grant Nos. 62203349 and 12302061.

References

  • Battistini, S., and T. Shima. 2014. Differential games missile guidance with bearings-only measurements. IEEE Transactions on Aerospace and Electronic Systems 50 (4):2906–19. doi:10.1109/TAES.2014.130366
  • Chansuparp, M., and K. Jitkajornwanich. 2022. A novel augmentative backward reward function with deep reinforcement learning for autonomous UAV navigation. Applied Artificial Intelligence 36 (1):1. doi:10.1080/08839514.2022.2084473
  • Cho, K., B. V. Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 1724–1734, Association for Computational Linguistics. doi:10.3115/v1/D14-1179
  • Fujimoto, S., H. V. Hoof, and D. Meger. 2018. Addressing function approximation error in actor-critic methods. International Conference on Machine Learning, Stockholm, Sweden, 1587–1596. doi:10.48550/arXiv.1802.09477
  • Gaudet, B., and R. Furfaro. 2012. Missile homing-phase guidance law design using reinforcement learning. AIAA Guidance, Navigation, and Control Conference, Minneapolis, Minnesota. doi:10.2514/6.2012-4470
  • Gaudet, B., R. Furfaro, and R. Linares. 2020. Reinforcement learning for angle-only intercept guidance of maneuvering targets. Aerospace Science and Technology 99:105746. doi:10.1016/j.ast.2020.105746
  • Ha, I. J., J. S. Hur, M. S. Ko, and T. L. Song. 1990. Performance analysis of PNG laws for randomly maneuvering targets. IEEE Transactions on Aerospace and Electronic Systems 26 (5):713–21. doi:10.1109/7.102706
  • Hantous, K., L. Rejeb, and R. Hellali. 2022. Detecting physiological needs using deep inverse reinforcement learning. Applied Artificial Intelligence 36 (1):1. doi:10.1080/08839514.2021.2022340
  • Hasselt, H. V., A. Guez, and D. Silver. 2016. Deep reinforcement learning with double Q-learning. Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence 30 (1):2094–100. doi:10.1609/aaai.v30i1.10295
  • Hausknecht, M., and P. Stone. 2015. Deep recurrent Q-learning for partially observable MDPs. AAAI Fall Symposium Series. doi:10.48550/arXiv.1507.06527
  • He, X., Z. Chen, F. Jia, and M. Wu. 2020. Guidance law based on zero effort miss and Q-learning algorithm. Seventh Symposium on Novel Photoelectronic Detection Technology and Application. doi:10.1117/12.2586409
  • He, S., H. Shin, and A. Tsourdos. 2021. Computational missile guidance: A deep reinforcement learning approach. Journal of Aerospace Information Systems 18 (8):1–12. doi:10.2514/1.I010970
  • Hong, D., M. Kim, and S. Park. 2020. Study on reinforcement learning-based missile guidance law. Applied Sciences 10 (18):6567. doi:10.3390/app10186567
  • Hu, Q., T. Han, and M. Xin. 2019. Sliding-mode impact time guidance law design for various target motions. Journal of Guidance, Control, and Dynamics 42 (1):136–48. doi:10.2514/1.G003620
  • Liang, C., W. Wang, Z. Liu, C. Lai, and B. Zhou. 2019. Learning to guide: Guidance law based on deep meta-learning and model predictive path integral control. Institute of Electrical and Electronics Engineers Access 7:47353–65. doi:10.1109/ACCESS.2019.2909579
  • Lillicrap, T. P., J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. 2015. Continuous control with deep reinforcement learning. Computer Science 8 (6):A187. doi:10.48550/arXiv.1509.02971
  • Liu, Z., J. Wang, S. He, H.-S. Shin, and A. Tsourdos. 2021. Learning prediction-correction guidance for impact time control. Aerospace Science and Technology 119:107187. doi:10.1016/j.ast.2021.107187
  • Li, W., Y. Zhu, and D. Zhao. 2022. Missile guidance with assisted deep reinforcement learning for head-on interception of maneuvering target. Complex & Intelligent Systems 8 (2):1205–16. doi:10.1007/S40747-021-00577-6
  • Luo, Y., and D. Zhang. 2023. Wireless network design optimization for computer teaching with deep reinforcement learning application. Applied Artificial Intelligence 37 (1):1. doi:10.1080/08839514.2023.2218169
  • Peng, C., H. Zhang, Y. He, and J. Ma. 2022. State-following-kernel based online reinforcement learning guidance law against maneuvering target. IEEE Transactions on Aerospace and Electronic Systems 58 (6):5784–97. doi:10.1109/TAES.2022.3178770
  • Shalumov, V. 2020. Cooperative online guide-launch-guide policy in a target-missile-defender engagement using deep reinforcement learning. Aerospace Science and Technology 104:105996. doi:10.1016/j.ast.2020.105996
  • Sutton, R., and A. Barto. 1998. Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
  • Tan, J. 2023. A method to plan the path of a robot utilizing deep reinforcement learning and multi-sensory information fusion. Applied Artificial Intelligence 37 (1):1. doi:10.1080/08839514.2023.2224996
  • Volodymyr, M., K. Koray, S. David, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature 518 (7540):529–33. doi:10.1038/nature14236
  • Wang, N., Y. Gao, and X. Zhang. 2021. Data-driven performance-prescribed reinforcement learning control of an unmanned surface vehicle. IEEE Transactions on Neural Networks and Learning Systems 32 (12):5456–67. doi:10.1109/TNNLS.2021.3056444
  • Wang, N., Y. Gao, H. Zhao, and C. K. Ahn. 2021. Reinforcement learning-based optimal tracking control of an unknown unmanned surface vehicle. IEEE Transactions on Neural Networks and Learning Systems 32 (7):3034–45. doi:10.1109/TNNLS.2020.3009214
  • Wu, M., X. He, Z. Qiu, and Z. Chen. 2022. Guidance law of interceptors against a high-speed maneuvering target based on deep Q-Network. Transactions of the Institute of Measurement and Control 44 (7):1373–87. doi:10.1177/01423312211052742
  • Yuan, P. J., and S. C. Hsu. 1995. Solutions of generalized proportional navigation with maneuvering and nonmaneuvering targets. IEEE Transactions on Aerospace and Electronic Systems 31 (1):469–74. doi:10.1109/7.366329
  • Zarchan, P. 2012. Tactical and strategic missile guidance. 6th ed. American Institute of Aeronautics and Astronautics. doi:10.2514/4.868948
  • Zhang, Q., B. Ao, and Q. Zhang. 2020. Reinforcement learning guidance law of Q-learning. Systems Engineering & Electronics 42 (2):414–19. doi:10.3969/j.issn.1001-506X.2020.02.21
  • Zhang, X., G. Liu, C. Yang, and W. Jiang. 2018. Research on air combat maneuver decision-making method based on reinforcement learning. Electronics 7 (11):279. doi:10.3390/electronics7110279
  • Zhang, R., J. Wang, H. Li, Z. Li, and Z. Ding. 2018. Robust finite-time guidance against maneuverable targets with unpredictable evasive strategies. Aerospace Science and Technology 77:534–44. doi:10.1016/j.ast.2018.04.004
  • Zhou, Z., and H. Xu. 2022. Decentralized optimal large scale multi-player pursuit-evasion strategies: A mean field game approach with reinforcement learning. Neurocomputing 484:46–58. doi:10.1016/j.neucom.2021.01.141