Automatika
Journal for Control, Measurement, Electronics, Computing and Communications
Volume 64, 2023 - Issue 3
Regular Papers

Online adaptive optimal tracking control for model-free nonlinear systems via a dynamic neural network

Pages 431-440 | Received 14 Jun 2021, Accepted 13 Jan 2023, Published online: 13 Feb 2023

Abstract

This paper presents an online adaptive approximate solution to the optimal tracking control problem for model-free nonlinear systems. First, a dynamic neural network identifier with properly designed weight updating laws is developed to identify the unknown dynamics. Then an adaptive optimal tracking control policy consisting of two terms is proposed: a steady-state control term that ensures the desired tracking performance at steady state, and an optimal control term that stabilizes the tracking error dynamics optimally. The composite Lyapunov method is used to analyse the stability of the closed-loop system. Two simulation examples are presented to demonstrate the effectiveness of the proposed method.

1. Introduction

The basic idea of classical adaptive control is to update the model parameters and the control law, directly or indirectly, so that the control error is minimized; however, the result is generally not optimal. On the other hand, the main drawback of the classical optimal control approach is that the system dynamics must be precisely known in order to solve the Hamilton-Jacobi-Bellman (HJB) equation in an off-line manner [Citation1]. Hence, by merging ideas from adaptive control and optimal control, the adaptive optimal control approach has been developed over the past decade; surveys of this research can be found in [Citation2–4].

To develop online adaptive optimal control, Werbos [Citation5] introduced the general actor-critic (AC) framework. The critic neural network (NN) approximates the evaluation function, mapping states to an estimate of the value function, whereas the actor NN approximates an optimal control law and generates the actions or control signals. Since then, various adaptive optimal control algorithms have been proposed, both model-based (heuristic dynamic programming, HDP [Citation6], and dual heuristic programming, DHP [Citation7]) and model-free (action-dependent heuristic dynamic programming, ADHDP [Citation8], and Q-learning [Citation9]). However, most of the previous work on adaptive optimal control has focused on discrete-time systems. Extending this research to continuous-time systems poses challenges in proving stability and convergence and in ensuring a model-free online updating law [Citation10].

Discretizing a continuous-time system is generally not accurate, especially for high-dimensional systems, where it prohibits the learning process. Hence, online policy-iteration-based algorithms were proposed to solve the linear [Citation11] and nonlinear [Citation12] continuous-time infinite-horizon optimal control problems, involving synchronous adaptation of both the actor and critic NNs. Furthermore, ref. [Citation10] extended the idea of refs. [Citation11,Citation12] by designing a novel AC-identifier architecture to approximate the HJB equation without knowledge of the system drift dynamics, although knowledge of the input dynamics is still required. The recent research in [Citation13] removes this requirement by using the experience replay technique. Based on ref. [Citation10], a simple identifier-critic structure-based optimal control method is proposed in [Citation14,Citation15], where just a critic NN is used to approximate the solution of the HJB equation and to calculate the optimal control action. In [Citation16], an optimal control method for nonzero-sum differential games of continuous-time nonlinear systems is designed directly from the critic NN instead of the actor-critic dual network, which greatly simplifies the algorithm architecture.

Most existing adaptive optimal control studies focus on regulation problems rather than trajectory tracking problems. Considering both aspects ensures not only trajectory tracking and stabilization but also satisfaction of a prescribed performance index (such as minimization of the trajectory error, fuel consumption, etc.). In [Citation17] a data-based iterative optimal learning control scheme is developed to solve a coal gasification optimal tracking control problem in the discrete-time domain. For continuous-time systems, linear quadratic tracking control of partially unknown systems using reinforcement learning is presented in [Citation18], and a nonlinear approximately optimal trajectory tracking method requiring exact model information is developed in [Citation19]. To relax the requirement of an explicit model, a steady-state control in conjunction with an optimal control for nonlinear continuous-time systems is developed in [Citation20], which stabilizes the error dynamics in an optimal way.

Most of the above-mentioned adaptive optimal control methods are based on affine nonlinear systems. To the best of our knowledge, only [Citation21] addresses the adaptive optimal control of unknown non-affine nonlinear systems in the discrete-time domain, and [Citation22] introduces an adaptive recursive control for model-based non-affine nonlinear continuous systems. The optimal control of an unknown non-affine nonlinear continuous-time system is still a challenging task, which is the motivation of this paper.

The main contributions of this paper are listed as follows.

  • (1) The optimal tracking control of unknown non-affine nonlinear systems based on the critic-identifier architecture is proposed for the first time. The model-free property is achieved by a neural identifier in conjunction with novel updating laws for both the weights and the linear-part matrix, which is usually assumed to be a known Hurwitz matrix in conventional black-box nonlinear system identification.

  • (2) An adaptive optimal tracking control policy consisting of two terms is proposed: a steady-state control term that ensures the desired tracking performance at steady state, and an optimal control term that stabilizes the tracking error dynamics optimally. The online solution of the optimal control term is obtained directly by a single critic NN that approximates the optimal cost function of the HJB equation, instead of the conventional actor-critic dual network, which greatly reduces complexity and saves calculation time. A novel learning law driven by the filtered parameter error is proposed for the critic NN. The stability of the entire closed-loop system is proved by a properly designed composite Lyapunov method.

The organization of the paper is as follows. The problem formulation is given in Section 2. The DNN identifier is designed in Section 3. The optimal control strategy, based on the critic-identifier architecture, is presented in Section 4. Two simulation examples verify the proposed scheme in Section 5, and conclusions are drawn in Section 6.

2. Problem formulation

Consider the following non-affine nonlinear continuous-time system

(1) \(\dot{x}(t) = f(x(t), u(t))\)

where \(x(t) = (x_1(t), x_2(t), \ldots, x_n(t))^T \in \mathbb{R}^n\) is the state vector, \(u(t) = (u_1(t), u_2(t), \ldots, u_m(t))^T \in \mathbb{R}^m\) is the control input vector, and \(f(\cdot)\) is an unknown continuous nonlinear smooth function of \(x(t)\) and \(u(t)\).

The objective of the optimal tracking control problem is to design an optimal controller for (1) that makes the state vector \(x(t)\) track the specified trajectory \(x_r(t)\) while minimizing the infinite-horizon performance cost function

(2) \(V(e(t)) = \int_t^{\infty} r(e(\tau), u_e(e(\tau)))\,d\tau\)

where the tracking error is defined as \(e(t) = x(t) - x_r(t)\), and the utility function, with symmetric positive definite matrices \(Q\) and \(R\), is defined as \(r(e(t), u(t)) = e^T(t) Q e(t) + u^T(t) R u(t)\).

From basic optimal control theory, we define the Hamiltonian of (1) as

(3) \(H(e, u, V) = V_e^T f(x(t), u(t)) + e^T Q e + u^T(t) R u(t)\)

where \(V_e \triangleq \partial V / \partial e\) denotes the partial derivative of the cost function \(V(e(t))\) with respect to \(e(t)\).

The optimal cost function \(V^*(e(t))\) is given by

(4) \(V^*(e(t)) = \min_{u \in \psi(\Omega)} \int_t^{\infty} r(e(\tau), u_e(e(\tau)))\,d\tau\)

and it satisfies the HJB equation

(5) \(\min_{u \in \psi(\Omega)} \left[ V_e^{*T} f(x(t), u(t)) + e^T(t) Q e(t) + u^T(t) R u(t) \right] = 0\)

where the control \(u\) is defined to be admissible for (2) on a compact set \(\Omega \subseteq \mathbb{R}^n\), denoted by \(u \in \psi(\Omega)\).

Theoretically, the optimal control for the nonlinear system (1) can be obtained from Equations (4) and (5). However, it cannot be obtained in practical systems for two reasons. (1) The optimal cost function \(V^*(e(t))\) must be obtained by solving the HJB equation (5), but it is usually difficult to solve this high-order nonlinear partial differential equation (PDE) analytically for general nonlinear systems; moreover, the unknown nonlinear dynamics \(f(\cdot)\) make the HJB equation unavailable in the first place. (2) The optimal control \(u^*(t)\) cannot be derived by solving \(\partial H(e, u, V^*)/\partial u = 0\) due to the unavailability of \(V^*(e(t))\).

In this paper, we develop a critic-identifier architecture to solve the optimal control problem of an unknown non-affine nonlinear continuous-time system; all the learning processes are updated online.

3. Adaptive model-free identifier

We employ the following dynamic neural network (DNN) model to approximate the nonlinear dynamic system (1):

(6) \(\dot{\hat{x}}(t) = A\hat{x}(t) + W_1\sigma(V_1\hat{x}(t)) + W_2\phi(V_2\hat{x}(t))u(t)\)

where \(\hat{x}(t) \in \mathbb{R}^n\) is the state of the DNN, \(W_1 \in \mathbb{R}^{n \times m}\) and \(W_2 \in \mathbb{R}^{m \times n}\) are the output-layer weights, \(V_1\) and \(V_2\) are the hidden-layer weights, \(A \in \mathbb{R}^{n \times n}\) is the matrix of the linear part of the NN, \(u(t) = (u_1(t), \ldots, u_k(t), 0, \ldots, 0)^T \in \mathbb{R}^m\) is the control input, and the activation function \(\sigma(\cdot)\) (as well as \(\phi(\cdot)\)) is the sigmoidal vector function \(\sigma(x) = a/(1 + e^{-bx}) - c\), where \(a\), \(b\) and \(c\) are constants.

Remark

If we define \(W = [W_1, W_2]\) and \(\Xi = [\sigma(V_1\hat{x}(t))^T, (\phi(V_2\hat{x}(t))u(t))^T]^T\), then (6) can be written as \(\dot{\hat{x}}(t) = A\hat{x}(t) + W\Xi\). It has been proved in [Citation23] that a DNN of the form \(\dot{\hat{x}}(t) = A\hat{x}(t) + W\Xi\) can approximate the nonlinear system (1) to any degree of accuracy if the hidden layer is large enough. Here, to simplify the analysis, we consider the simplest structure (i.e. \(m = n\), \(V = I\), \(\phi(\cdot) = I\)).

Then the nonlinear system (1) can be modelled by the DNN as follows:

(7) \(\dot{x}(t) = A^* x(t) + W_1^*\sigma(x(t)) + W_2^* u(t) + \xi_1\)

where \(A^*\), \(W_1^*\), \(W_2^*\) are the nominal unknown matrices, \(W_1^*\) and \(W_2^*\) are bounded as \(W_1^*\Lambda_1^{-1}W_1^{*T} \le \bar{W}_1\) and \(W_2^*\Lambda_2^{-1}W_2^{*T} \le \bar{W}_2\) (\(\Lambda_1\), \(\Lambda_2\) are any positive definite symmetric matrices), and \(\xi_1\) is the modelling error or disturbance, assumed to be bounded.

Assumption

The identification error is defined as \(\Delta x = x(t) - \hat{x}(t)\). The difference of the activation functions, \(\tilde{\sigma} = \sigma(x(t)) - \sigma(\hat{x}(t))\), satisfies the generalized Lipschitz condition \(\tilde{\sigma}^T\Lambda\tilde{\sigma} \le \Delta x^T D \Delta x\), where \(D = D^T > 0\) is a known normalizing matrix.

Then from (6) and (7) we obtain the error dynamics

(8) \(\Delta\dot{x} = A^*\Delta x + \tilde{A}\hat{x}(t) + W_1^*\tilde{\sigma} + \tilde{W}_1\sigma(\hat{x}(t)) + \tilde{W}_2 u(t) + \xi_1\)

where \(\tilde{A} = A^* - A\), \(\tilde{W}_1 = W_1^* - W_1\) and \(\tilde{W}_2 = W_2^* - W_2\).

Lemma

[Citation24]

Let \(A \in \mathbb{R}^{n \times n}\) be a Hurwitz matrix and \(R, Q \in \mathbb{R}^{n \times n}\) with \(R = R^T > 0\), \(Q = Q^T > 0\). If \((A, R^{1/2})\) is controllable, \((A, Q^{1/2})\) is observable and

\(A^T R^{-1} A - Q \ge \tfrac{1}{4}\left(A^T R^{-1} - R^{-1} A\right) R \left(A^T R^{-1} - R^{-1} A\right)^T\)

is satisfied, then the algebraic Riccati equation \(A^T X + X A + X R X + Q = 0\) has a unique positive definite solution \(X = X^T > 0\).

Theorem

Consider the identification scheme (6) for (1) with the following updating laws:

(9) \(\dot{A} = k_1\Delta x\,\hat{x}^T, \quad \dot{W}_1 = k_2\Delta x\,\sigma^T(\hat{x}), \quad \dot{W}_2 = k_3\Delta x\,u^T\)

where \(k_1\), \(k_2\) and \(k_3\) are positive constants. These laws guarantee the following stability properties:

  1. For the precise identifier case, i.e. \(\xi_1 = 0\): \(\hat{W}_{1,2}, \hat{A} \in L_\infty\), \(\Delta x \in L_2 \cap L_\infty\), and \(\lim_{t\to\infty}\Delta x = 0\).

  2. For bounded modelling error and disturbances, i.e. \(\|\xi_1\| \le \bar{\xi}_1\): \(\Delta x, \hat{W}_{1,2}, \hat{A} \in L_\infty\).

Proof:

Consider the Lyapunov function candidate

(10) \(L_I = \Delta x^T P\Delta x + \frac{1}{k_1}\mathrm{tr}\{\tilde{A}^T P\tilde{A}\} + \frac{1}{k_2}\mathrm{tr}\{\tilde{W}_1^T P\tilde{W}_1\} + \frac{1}{k_3}\mathrm{tr}\{\tilde{W}_2^T P\tilde{W}_2\}\)

Differentiating (10) and using (8) yields

(11) \(\dot{L}_I = \Delta x^T(A^{*T}P + PA^*)\Delta x + 2\Delta x^T P\tilde{A}\hat{x} + 2\Delta x^T P\tilde{W}_1\sigma(\hat{x}) + 2\Delta x^T P\tilde{W}_2 u + 2\Delta x^T P W_1^*\tilde{\sigma} + 2\Delta x^T P\xi_1 + \frac{2}{k_1}\mathrm{tr}\{\dot{\tilde{A}}^T P\tilde{A}\} + \frac{2}{k_2}\mathrm{tr}\{\dot{\tilde{W}}_1^T P\tilde{W}_1\} + \frac{2}{k_3}\mathrm{tr}\{\dot{\tilde{W}}_2^T P\tilde{W}_2\}\)

Using the updating laws (9) and the facts \(\dot{\tilde{A}} = -\dot{A}\), \(\dot{\tilde{W}}_{1,2} = -\dot{W}_{1,2}\), (11) becomes

(12) \(\dot{L}_I = \Delta x^T(A^{*T}P + PA^*)\Delta x + 2\Delta x^T P W_1^*\tilde{\sigma} + 2\Delta x^T P\xi_1\)

Using the matrix inequality

(13) \(X^T Y + (X^T Y)^T \le X^T\Lambda^{-1}X + Y^T\Lambda Y\)

where \(X, Y\) are any matrices of compatible dimensions and \(\Lambda\) is any positive definite matrix, together with Assumption 3.1, one obtains

(14) \(2\Delta x^T P W_1^*\tilde{\sigma} \le \Delta x^T P W_1^*\Lambda^{-1}W_1^{*T}P\Delta x + \tilde{\sigma}^T\Lambda\tilde{\sigma} \le \Delta x^T P\bar{W}_1 P\Delta x + \Delta x^T D\Delta x, \qquad 2\Delta x^T P\xi_1 \le \Delta x^T P\Lambda_\xi P\Delta x + \xi_1^T\Lambda_\xi^{-1}\xi_1\)

Substituting (14) into (12) gives

(15) \(\dot{L}_I \le \Delta x^T(A^{*T}P + PA^* + P\bar{W}_1 P + D + Q_0)\Delta x - \Delta x^T Q_0\Delta x + \Delta x^T P\Lambda_\xi P\Delta x + \xi_1^T\Lambda_\xi^{-1}\xi_1\)

By defining \(R = \bar{W}_1\) and \(Q = D + Q_0\), if we select a proper \(Q_0\) such that \(Q\) satisfies the conditions in Lemma 3.1, there exists a matrix \(P\) satisfying \(A^{*T}P + PA^* + PRP + Q = 0\).

Hence (15) becomes

(16) \(\dot{L}_I \le -\Delta x^T Q_0\Delta x + \Delta x^T P\Lambda_\xi P\Delta x + \xi_1^T\Lambda_\xi^{-1}\xi_1\)

Case 1: For the precise identifier case, i.e. \(\xi_1 = 0\), (16) becomes

(17) \(\dot{L}_I \le -\Delta x^T Q_0\Delta x \le -\lambda_{\min}(Q_0)\|\Delta x\|^2 \le 0\)

From (17) we get \(\Delta x, \hat{W}_{1,2}, \hat{A} \in L_\infty\). Furthermore, from the error dynamics (8) we have \(\Delta\dot{x} \in L_\infty\). Integrating (17) on both sides from 0 to \(\infty\) gives \(\int_0^\infty \lambda_{\min}(Q_0)\|\Delta x\|^2\,dt \le L_I(0) - L_I(\infty) < \infty\), which implies \(\Delta x \in L_2\). Since \(\Delta x \in L_2 \cap L_\infty\) and \(\Delta\dot{x} \in L_\infty\), Barbalat's lemma gives \(\lim_{t\to\infty}\Delta x = 0\).

Case 2: For bounded modelling error and disturbances, i.e. \(\|\xi_1\| \le \bar{\xi}_1\), Equation (16) can be represented as

(18) \(\dot{L}_I \le -\Delta x^T Q_0\Delta x + \Delta x^T P\Lambda_\xi P\Delta x + \xi_1^T\Lambda_\xi^{-1}\xi_1 \le -\alpha(\|\Delta x\|) + \beta(\|\xi_1\|)\)

where \(\alpha(\|\Delta x\|) = (\lambda_{\min}(Q_0) - \lambda_{\max}(P\Lambda_\xi P))\|\Delta x\|^2\) and \(\beta(\|\xi_1\|) = \lambda_{\max}(\Lambda_\xi^{-1})\|\xi_1\|^2\).

Since \(\alpha(\cdot)\) and \(\beta(\cdot)\) are class-\(\mathcal{K}\) functions, \(L_I\) is an ISS-Lyapunov function. By Theorem 3.1 in [Citation24], the identification error dynamics (8) are input-to-state stable, which implies \(\Delta x, \hat{W}_{1,2}, \hat{A} \in L_\infty\). This completes the proof of Theorem 3.1.
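To make Theorem 3.1 concrete, the following minimal simulation sketch (Python/NumPy, forward-Euler integration) runs the DNN (6) with the updating laws (9); the plant, gains and probing input are our own illustrative assumptions, not the authors' code.

```python
import numpy as np

def sigma(x, a=2.0, b=2.0, c=0.5):
    """Sigmoidal activation of Section 3: sigma(x) = a / (1 + exp(-b*x)) - c."""
    return a / (1.0 + np.exp(-b * x)) - c

def plant(x, u):
    """Stand-in for the unknown plant f(x, u) of (1); illustrative only."""
    return np.array([
        -x[0] + x[1],
        -0.5 * x[0] - 0.5 * x[1] + (np.cos(2 * x[0]) + 2) * u[0] + np.sin(u[1]),
    ])

n, m = 2, 2
dt, steps = 1e-3, 10_000
k1 = k2 = k3 = 1.0                               # adaptation gains of (9)

x, xh = np.zeros(n), np.zeros(n)                 # plant / identifier states
A = -np.eye(n)                                   # estimate of the linear part
W1, W2 = np.zeros((n, n)), np.zeros((n, m))      # weight estimates

for k in range(steps):
    t = k * dt
    u = np.array([np.sin(0.5 * t), np.cos(0.5 * t)])   # probing (PE-like) input
    dx = x - xh                                        # identification error
    xh_dot = A @ xh + W1 @ sigma(xh) + W2 @ u          # DNN model (6), phi = I
    A += dt * k1 * np.outer(dx, xh)                    # updating laws (9)
    W1 += dt * k2 * np.outer(dx, sigma(xh))
    W2 += dt * k3 * np.outer(dx, u)
    xh += dt * xh_dot
    x += dt * plant(x, u)

print("final identification error:", np.linalg.norm(x - xh))
```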

4. Optimal control design

In this section, the adaptive optimal control is designed based on the DNN identifier. From Section 3, we know that the nonlinear system (1) can be represented by the DNN with the updating law (9) as follows:

(19) \(\dot{x} = A\hat{x} + W_1\sigma(\hat{x}) + W_2 u + \xi_1\)

where the model error \(\xi_1\) is still assumed to be bounded, \(\|\xi_1\| \le \bar{\xi}_1\), and \(\Delta x\) and \(W_{1,2}\) are bounded as in Theorem 3.1.

Then (19) can be further rewritten as

(20) \(\dot{x} = Ax + W_1\sigma(\hat{x}) + W_2 u + \xi_2\)

where \(\xi_2 = \xi_1 + A\hat{x} - Ax = \xi_1 - A\Delta x\). For bounded \(\xi_1\) and \(\Delta x\), \(\xi_2\) is bounded as well, i.e. \(\|\xi_2\| \le \bar{\xi}_2\).

To achieve optimal tracking control, the control action \(u\) is designed as \(u = u_r + u_e\), where \(u_r\) is the steady-state control that maintains the desired tracking performance at steady state, and \(u_e\) is the adaptive optimal control that minimizes the infinite-horizon performance index. \(u_r\) should be designed to compensate for the nonlinear dynamics in (20); hence let

(21) \(u_r = W_2^{+}\left[\dot{x}_r - Ax - W_1\sigma(x) - Ke\right]\)

where \(e = x - x_r\) denotes the state tracking error, \(K\) is the feedback gain and \(W_2^{+}\) denotes the generalized inverse of \(W_2\).
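As an illustration, (21) is a one-line computation once the identifier estimates are available; a minimal sketch in Python/NumPy (the function name and the scalar gain K, as used in Example 5.1, are our own choices):

```python
import numpy as np

def steady_state_control(A, W1, W2, sigma, x, xr, xr_dot, K):
    """Steady-state term (21): u_r = W2^+ (dx_r/dt - A x - W1 sigma(x) - K e)."""
    e = x - xr
    # np.linalg.pinv gives the generalized (Moore-Penrose) inverse W2^+.
    return np.linalg.pinv(W2) @ (xr_dot - A @ x - W1 @ sigma(x) - K * e)
```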

From (20) and (21), the error dynamics become

(22) \(\dot{e} = -Ke + W_2 u_e + \xi_2\)

The tracking problem for (20) is thus transformed into a regulation problem for (22), and the adaptive optimal control \(u_e\) is designed to stabilize (22) optimally. Hence the infinite-horizon performance cost function (2) is rewritten as

(23) \(V(e(t)) = \int_t^{\infty} r(e(\tau), u_e(e(\tau)))\,d\tau\)

where \(r(e, u_e) = e^T Q e + u_e^T R u_e\) is the utility function with the optimal control \(u_e\).

According to the optimal regulator design in [Citation25], an admissible control policy \(u_e\) should be designed to minimize the infinite-horizon cost function (23) associated with (22). The Hamiltonian of (22) is

(24) \(H(e, u_e, V) = V_e^T\left[-Ke + W_2 u_e + \xi_2\right] + e^T Q e + u_e^T R u_e\)

where \(V_e = \partial V(e)/\partial e\) is the partial derivative of the value function with respect to \(e\).

Then we define the optimal cost function as

(25) \(V^*(e(t)) = \min_{u_e \in \psi(\Omega)}\int_t^{\infty} r(e(\tau), u_e(e(\tau)))\,d\tau\)

and it satisfies the following HJB equation:

(26) \(\min_{u_e \in \psi(\Omega)}\left[H(e, u_e, V^*)\right] = 0\)

The optimal control \(u_e^*\) for (22) can be obtained by solving \(\partial H(e, u_e, V^*)/\partial u_e = 0\) from (24):

(27) \(u_e^* = -\tfrac{1}{2}R^{-1}W_2^T\frac{\partial V^*(e)}{\partial e}\)

where \(V^*(e)\) is the solution of the HJB equation (26).

From (27) we see that the optimal control \(u_e^*\) depends on the optimal value function \(V^*(e)\). However, it is difficult to solve the nonlinear partial differential HJB equation (26) for \(V^*(e)\). The usual method is to obtain an approximate solution via a critic NN [Citation4,Citation5,Citation25]. A single-layer NN is used to approximate the optimal value function:

(28) \(V^*(e) = W_3^T\psi(e) + \xi_3\)

with derivative

(29) \(\frac{\partial V^*(e)}{\partial e} = \nabla\psi^T(e)W_3 + \nabla\xi_3\)

where \(W_3 \in \mathbb{R}^{l}\) is the nominal weight vector, \(\psi(e) \in \mathbb{R}^{l}\) is the activation function, \(\xi_3\) is the approximation error, and \(l\) is the number of neurons. \(\nabla\psi(e) = \partial\psi(e)/\partial e\) and \(\nabla\xi_3 = \partial\xi_3/\partial e\) are the partial derivatives of \(\psi(e)\) and \(\xi_3\) with respect to \(e\), respectively.

Assumption

The nominal weight vector \(W_3\), the activation function \(\psi(e)\) and its derivative \(\nabla\psi(e)\) are all bounded, i.e. \(\|W_3\| \le \bar{W}_3\), \(\|\psi(e)\| \le \bar{\psi}_1\), \(\|\nabla\psi(e)\| \le \bar{\psi}_2\) and \(\|\nabla\xi_3\| \le \bar{\psi}_3\).

Then, substituting (29) into (27), one obtains

(30) \(u_e^* = -\tfrac{1}{2}R^{-1}W_2^T\left(\nabla\psi^T(e)W_3 + \nabla\xi_3\right)\)

The critic NN approximation is

(31) \(\hat{V}(e) = \hat{W}_3^T\psi(e)\)

where \(\hat{W}_3\) is the estimate of the nominal \(W_3\).

Then the approximate optimal control is obtained from (30) and (31) as

(32) \(\hat{u}_e = -\tfrac{1}{2}R^{-1}W_2^T\nabla\psi^T(e)\hat{W}_3\)
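The evaluation of (32) is equally direct. The sketch below assumes the quadratic error basis \(\psi = [e_1^2, e_1e_2, e_2^2]^T\) used later in Example 5.1; for any other basis, only grad_psi changes:

```python
import numpy as np

def grad_psi(e):
    """Jacobian of psi(e) = [e1^2, e1*e2, e2^2] w.r.t. e (basis of Example 5.1)."""
    e1, e2 = e
    return np.array([[2 * e1, 0.0],
                     [e2,     e1],
                     [0.0,    2 * e2]])

def optimal_control(e, W3_hat, W2, R):
    """Approximate optimal term (32): u_e = -1/2 R^{-1} W2^T grad_psi(e)^T W3_hat."""
    return -0.5 * np.linalg.solve(R, W2.T @ grad_psi(e).T @ W3_hat)
```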

Remark

Available adaptive optimal control methods are usually based on a dual NN architecture, where a critic NN and an action NN approximate the optimal cost function and the optimal control policy, respectively. The complicated structure and computational burden make practical implementation difficult. In the following, we calculate the optimal control action directly from the critic NN instead of the actor-critic dual network.

Substituting (28) into (24), one obtains

(33) \(0 = W_3^T\nabla\psi(e)\left[-Ke + W_2 u_e\right] + e^T Q e + u_e^T R u_e + \xi_{HJB}\)

where \(\xi_{HJB} = W_3^T\nabla\psi(e)\xi_2 + \nabla\xi_3^T\left[-Ke + W_2 u_e + \xi_2\right]\) is the residual HJB equation error due to the DNN identifier error \(\xi_2\) and the NN approximation error \(\xi_3\).

Then (33) can be written in the general identification form

(34) \(Y = W_3^T X - \xi_{HJB}\)

where \(X = -\nabla\psi(e)\left[-Ke + W_2 u_e\right]\) and \(Y = e^T Q e + u_e^T R u_e\).

According to least-squares learning rules, the estimate of the nominal \(W_3\) would be \(\hat{W}_3 = (XX^T)^{-1}XY^T\) if the residual HJB equation error were zero. However, \(\xi_{HJB}\) is not always zero, and it is also difficult to carry out the subsequent closed-loop stability analysis based on the least-squares method. Inspired by [Citation14,Citation26], we develop a novel robust estimation method for \(W_3\). The following equation is used to identify (34):

(35) \(Y = W_3^T X - \xi_{HJB1}\)

where \(\xi_{HJB1}\) can be regarded as the model error and unknown disturbance.

For (35), a filtered version of \(Y\) is defined as

(36) \(\dot{z} = -\tau z + Y, \quad z(0) = 0\)

where \(\tau > 0\) is a filter constant and \(z\) is an auxiliary variable.

We further define the auxiliary filtered variables \(z_f\), \(Y_f\), \(X_f\) and \(\xi_{HJB1f}\) as

(37) \(\eta\dot{z}_f + z_f = z,\ z_f(0) = 0; \quad \eta\dot{Y}_f + Y_f = Y,\ Y_f(0) = 0; \quad \eta\dot{X}_f + X_f = X,\ X_f(0) = 0; \quad \eta\dot{\xi}_{HJB1f} + \xi_{HJB1f} = \xi_{HJB1},\ \xi_{HJB1f}(0) = 0\)

where \(\eta\) is a filter parameter. It should be noted that the fictitious filtered variable \(\xi_{HJB1f}\) is used only for analysis.

Then we get

(38) \(Y_f = W_3^T X_f - \xi_{HJB1f}\)

(39) \(\dot{z}_f = -\tau z_f + Y_f\)

From the first equation in (37), one obtains

(40) \(\dot{z}_f = (z - z_f)/\eta\)

According to (38), (39) and (40), we have

(41) \((z - z_f)/\eta + \tau z_f = W_3^T X_f - \xi_{HJB1f}\)

Furthermore, we define the auxiliary regression matrix \(E \in \mathbb{R}^{l \times l}\) and vector \(F \in \mathbb{R}^{l}\) as

(42) \(\dot{E}(t) = -\eta E(t) + X_f X_f^T,\ E(0) = 0; \quad \dot{F}(t) = -\eta F(t) + X_f\left[(z - z_f)/\eta + \tau z_f\right],\ F(0) = 0\)

where \(\eta\) is the positive constant defined in (37).

The solution of (42) is

(43) \(E(t) = \int_0^t e^{-\eta(t-r)}X_f(r)X_f^T(r)\,dr, \quad F(t) = \int_0^t e^{-\eta(t-r)}X_f(r)\left[(z(r) - z_f(r))/\eta + \tau z_f(r)\right]dr\)

Finally, we define the vector

(44) \(M = E(t)\hat{W}_3 - F(t)\)

The adaptive law for updating \(\hat{W}_3\) is

(45) \(\dot{\hat{W}}_3 = -\mu M\)

where \(\mu > 0\) is the learning gain.
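Since (36), (37), (42) and (45) are all first-order ODEs, the whole critic update reduces to simple bookkeeping when integrated by forward Euler. The class below is a hedged sketch of that bookkeeping (gains, dimensions and the discretization are illustrative choices, not the authors' code):

```python
import numpy as np

class CriticUpdater:
    """Filtered-regressor update of the critic weights, Eqs. (36)-(45); a sketch."""

    def __init__(self, l, mu=100.0, tau=1.0, eta=0.1, dt=1e-3):
        self.mu, self.tau, self.eta, self.dt = mu, tau, eta, dt
        self.z = 0.0                      # auxiliary variable of (36)
        self.zf = 0.0                     # filtered z, first equation of (37)
        self.Xf = np.zeros(l)             # filtered regressor, (37)
        self.E = np.zeros((l, l))         # auxiliary regression matrix, (42)
        self.F = np.zeros(l)              # auxiliary vector, (42)
        self.W3 = np.zeros(l)             # critic weight estimate

    def step(self, X, Y):
        """One Euler step given the current regressor X and target Y of (34)."""
        dt, tau, eta = self.dt, self.tau, self.eta
        self.z += dt * (-tau * self.z + Y)           # (36)
        self.zf += dt * (self.z - self.zf) / eta     # (37): eta*zf_dot + zf = z
        self.Xf += dt * (X - self.Xf) / eta          # (37)
        rhs = (self.z - self.zf) / eta + tau * self.zf
        self.E += dt * (-eta * self.E + np.outer(self.Xf, self.Xf))  # (42)
        self.F += dt * (-eta * self.F + self.Xf * rhs)               # (42)
        M = self.E @ self.W3 - self.F                # (44)
        self.W3 += dt * (-self.mu * M)               # adaptive law (45)
        return self.W3
```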

Theorem

For system (34) with the updating law (45), the critic weight estimation error \(\tilde{W}_3 = \hat{W}_3 - W_3\) converges to a compact set around zero.

Proof:

The Lyapunov function is selected as

(46) \(L_o = \tfrac{1}{2}\tilde{W}_3^T\mu^{-1}\tilde{W}_3\)

Then, substituting (43) into (44), one obtains

(47) \(M = E(t)\hat{W}_3 - F(t) = E(t)\tilde{W}_3 + \varsigma_f\)

where \(\varsigma_f = \int_0^t e^{-\eta(t-r)}X_f(r)\xi_{HJB1f}(r)\,dr\) is bounded as \(\|\varsigma_f\| \le \bar{\varsigma}_f\).

It can be seen from [Citation26] that persistent excitation (PE) of \(X\) makes the matrix \(E\) defined in (43) positive definite, i.e. \(\lambda_{\min}(E) > \sigma > 0\). Then, since \(\dot{\tilde{W}}_3 = \dot{\hat{W}}_3\), the derivative of (46) is

(48) \(\dot{L}_o = \tilde{W}_3^T\mu^{-1}\dot{\tilde{W}}_3 = -\tilde{W}_3^T E(t)\tilde{W}_3 - \tilde{W}_3^T\varsigma_f \le -\|\tilde{W}_3\|\left(\sigma\|\tilde{W}_3\| - \bar{\varsigma}_f\right)\)

Hence \(\tilde{W}_3\) converges to the compact set \(\Omega_W := \{\tilde{W}_3 : \|\tilde{W}_3\| \le \bar{\varsigma}_f/\sigma\}\).

Theorem

For system (1) with the adaptive optimal control signal given by (21) and (32) and the adaptive laws (9) and (45), the tracking error \(e\) is uniformly ultimately bounded, and the approximate optimal control \(\hat{u}_e\) in (32) converges to a small bound around the ideal optimal solution \(u_e^*\) in (30).

Proof:

Design the composite Lyapunov function \(L = L_I + L_o + L_C\), where \(L_I\) is given by (10) and, from (18), its time derivative satisfies

(49) \(\dot{L}_I \le -(\lambda_{\min}(Q_0) - \lambda_{\max}(P\Lambda_\xi P))\|\Delta x\|^2 + \lambda_{\max}(\Lambda_\xi^{-1})\|\xi_1\|^2\)

\(L_o\) is defined in (46) and, from (48), its derivative satisfies

(50) \(\dot{L}_o = \tilde{W}_3^T\mu^{-1}\dot{\tilde{W}}_3 = -\tilde{W}_3^T E(t)\tilde{W}_3 - \tilde{W}_3^T\varsigma_f \le -\sigma\|\tilde{W}_3\|^2 + \|\tilde{W}_3\|\bar{\varsigma}_f\)

Using the basic inequality \(ab \le a^2/(2\delta) + \delta b^2/2\) with \(\delta > 0\), (50) can be rewritten as

(51) \(\dot{L}_o \le -\left(\sigma - \frac{1}{2\delta}\right)\|\tilde{W}_3\|^2 + \frac{\delta\bar{\varsigma}_f^2}{2}\)

\(L_C\) is defined as

(52) \(L_C = \Gamma e^T e + \kappa V^*(e)\)

where \(V^*(e)\) is the optimal cost function defined in (25) and \(\Gamma, \kappa > 0\) are positive constants.

Substituting (32) into (22), one obtains

(53) \(\dot{e} = -Ke + W_2\left(-\tfrac{1}{2}R^{-1}W_2^T\nabla\psi^T\hat{W}_3\right) + \xi_2 = -Ke - \tfrac{1}{2}W_2R^{-1}W_2^T\nabla\psi^T\tilde{W}_3 + W_2 u_e^* + \tfrac{1}{2}W_2R^{-1}W_2^T\nabla\xi_3 + \xi_2\)

The time derivative of (52) then follows from (28) and (53) as

(54) \(\dot{L}_C = 2\Gamma e^T\dot{e} + \kappa\left(-e^TQe - u_e^{*T}Ru_e^*\right) \le -\left[\Gamma K + \kappa\lambda_{\min}(Q) - \Gamma\left(\|W_2^TR^{-1}W_2\nabla\psi\| + \|W_2R^{-1}W_2^T\| + 2\right)\right]\|e\|^2 + \tfrac{1}{4}\Gamma\|W_2^TR^{-1}W_2\nabla\psi\|\,\|\tilde{W}_3\|^2 - \left[\kappa\lambda_{\min}(R) - \Gamma\|W_2\|^2\right]\|u_e^*\|^2 + \tfrac{1}{2}\Gamma\|W_2R^{-1}W_2^T\|\,\nabla\xi_3^T\nabla\xi_3 + \Gamma\xi_2^T\xi_2\)

From (49), (51) and (54), the time derivative \(\dot{L} = \dot{L}_I + \dot{L}_o + \dot{L}_C\) satisfies

(55) \(\dot{L} \le -(\lambda_{\min}(Q_0) - \lambda_{\max}(P\Lambda_\xi P))\|\Delta x\|^2 - \left[\Gamma K + \kappa\lambda_{\min}(Q) - \Gamma\left(\|W_2^TR^{-1}W_2^T\|(\|\nabla\psi\| + 1) + 2\right)\right]\|e\|^2 - \left[\kappa\lambda_{\min}(R) - \Gamma\|W_2\|^2\right]\|u_e^*\|^2 - \left[\sigma - \frac{1}{2\delta} - \frac{1}{4}\Gamma\|W_2^TR^{-1}W_2\nabla\psi\|\right]\|\tilde{W}_3\|^2 + \lambda_{\max}(\Lambda_\xi^{-1})\|\xi_1\|^2 + \frac{1}{2}\Gamma\|W_2R^{-1}W_2^T\|\,\nabla\xi_3^T\nabla\xi_3 + \Gamma\xi_2^T\xi_2 + \frac{\delta\bar{\varsigma}_f^2}{2}\)

If the parameters are chosen to satisfy

(56) \(\lambda_{\min}(Q_0) > \lambda_{\max}(P\Lambda_\xi P), \quad \Gamma < \frac{4\sigma\delta - 2}{\delta\|W_2^TR^{-1}W_2\nabla\psi\|}, \quad \kappa > \max\left\{\frac{\Gamma\left(\|W_2^TR^{-1}W_2^T\|(\|\nabla\psi\| + 1) + 2\right)}{\lambda_{\min}(Q)}, \frac{\Gamma\|W_2\|^2}{\lambda_{\min}(R)}\right\}\)

then (55) can be further represented as

(57) \(\dot{L} \le -h_1\|\Delta x\|^2 - h_2\|\tilde{W}_3\|^2 - h_3\|e\|^2 + \vartheta\)

where \(h_1 = \lambda_{\min}(Q_0) - \lambda_{\max}(P\Lambda_\xi P)\), \(h_2 = \sigma - \frac{1}{2\delta} - \frac{1}{4}\Gamma\|W_2^TR^{-1}W_2\nabla\psi\|\), \(h_3 = \Gamma K + \kappa\lambda_{\min}(Q) - \Gamma\left(\|W_2^TR^{-1}W_2^T\|(\|\nabla\psi\| + 1) + 2\right)\) and \(\vartheta = \lambda_{\max}(\Lambda_\xi^{-1})\|\xi_1\|^2 + \frac{1}{2}\Gamma\|W_2R^{-1}W_2^T\|\,\nabla\xi_3^T\nabla\xi_3 + \Gamma\xi_2^T\xi_2 + \frac{\delta\bar{\varsigma}_f^2}{2}\) are all positive constants under condition (56).

Then \(\dot{L} < 0\) whenever

(58) \(\|\Delta x\| > \sqrt{\vartheta/h_1}, \quad \|\tilde{W}_3\| > \sqrt{\vartheta/h_2}, \quad \|e\| > \sqrt{\vartheta/h_3}\)

which means the identification error \(\Delta x\), the tracking error \(e\) and the NN weight error \(\tilde{W}_3\) are all uniformly ultimately bounded.

Moreover, we have

(59) \(\hat{u}_e - u_e^* = -\tfrac{1}{2}R^{-1}W_2^T\nabla\psi^T\tilde{W}_3 + \tfrac{1}{2}R^{-1}W_2^T\nabla\xi_3\)

As \(t \to \infty\), the upper bound of (59) is

(60) \(\lim_{t\to\infty}\|\hat{u}_e - u_e^*\| \le \tfrac{1}{2}\|R^{-1}W_2^T\|\left(\|\nabla\psi^T\|\|\tilde{W}_3\| + \bar{\psi}_3\right) \triangleq \zeta\)

where \(\zeta\) depends on the DNN identification approximation error and the critic NN weight error \(\tilde{W}_3\).

The structure diagram of the control scheme is illustrated in Figure 1.

Figure 1. Structural diagram of the control scheme.


A summary of the ADP-based optimal tracking control algorithm is as follows:

  • (1) Select proper activation functions σ(·) and ϕ(·) in Equation (6) and the updating gains k1, k2, k3 in Equation (9) for the identifier. σ(·) is usually selected as the sigmoidal function \(\sigma(x) = a/(1 + e^{-bx}) - c\), where a, b and c are design constants, and ϕ(·) is selected as ϕ(·) = I. The estimates \(\hat{A}\), \(\hat{W}_1\) and \(\hat{W}_2\) are tuned online according to Equation (9), so there is no need to select their initial weight values. Meanwhile, select a proper activation function ψ(·) in Equation (31) and the updating gain μ in Equation (45) for the critic NN; ψ(·) is usually selected as a smooth function consisting of different combinations of the state tracking errors.

  • (2) The inputs/outputs data of an unknown non-affine nonlinear system (1) is used to train the identifier.

  • (3) The adaptive optimal tracking control law, consisting of the steady-state control law in Equation (21) and the optimal control law in Equation (32), is obtained based on the first two steps; a glue-code sketch of one step of the resulting loop is given below.
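For orientation, the function below shows how the pieces sketched in Sections 3 and 4 would fit together in one integration step; ident and critic are hypothetical wrappers around those sketches (identifier states/weights and the CriticUpdater), and none of this is the authors' published code:

```python
import numpy as np

def control_step(x, xr, xr_dot, ident, critic, K, Q, R):
    """One step of the ADP-based tracking loop (hypothetical glue code)."""
    e = x - xr
    # Steps 1-2: identifier supplies (A, W1, W2); critic supplies W3.
    u_r = steady_state_control(ident.A, ident.W1, ident.W2, sigma, x, xr, xr_dot, K)
    u_e = optimal_control(e, critic.W3, ident.W2, R)
    u = u_r + u_e                              # composite law u = u_r + u_e
    # Regressor/target of (34), built from the identified model and critic basis.
    X = -grad_psi(e) @ (-K * e + ident.W2 @ u_e)
    Y = e @ Q @ e + u_e @ R @ u_e
    critic.step(X, Y)                          # critic update law (45)
    ident.step(x, u)                           # identifier update laws (9)
    return u
```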

5. Simulations

We consider the following two examples to illustrate the theoretical results in this section.

Example

Consider the following non-affine nonlinear system:

(61) \(\begin{bmatrix}\dot{x}_1\\ \dot{x}_2\end{bmatrix} = \begin{bmatrix} -x_1 + x_2 \\ -0.5x_1 - 0.5x_2\left(1 - (\cos(2x_1) + 2)^2\right)\end{bmatrix} + \begin{bmatrix} 0 \\ u_1(\cos(2x_1) + 2) + \sin(u_2)\end{bmatrix}\)

The matrices Q and R of the performance index function are chosen as identity matrices. The control objective is to make the states x1 and x2 follow the desired trajectories \(x_{1r} = \sin t\) and \(x_{2r} = \cos t + \sin t\). First, the DNN identifier (6) with the updating law (9) is used to identify the non-affine nonlinear system. The parameters are selected as \(k_1 = k_2 = k_3 = 1\), and the activation function is \(\sigma(x) = 2/(1 + e^{-2x}) - 0.5\).

The identification error is shown in Figure 2; the proposed identifier models the non-affine nonlinear system accurately. Then, with the identified model, the adaptive optimal tracking controller is implemented for the unknown non-affine nonlinear continuous system (61). Define the trajectory errors as \(e_1 = x_1 - x_{1r}\) and \(e_2 = x_2 - x_{2r}\). The activation function of the critic NN is selected as \(\psi = [e_1^2, e_1e_2, e_2^2]^T\). The adaptive gain of the critic NN is selected as μ = 100, and the steady-state control gain is selected as K = 1200. Figures 3 and 4 show the trajectory tracking, and the convergence of the critic NN weights is shown in Figure 5, which demonstrates that the proposed adaptive optimal tracking controller ensures satisfactory tracking performance for an unknown non-affine nonlinear continuous system.
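For readers who wish to reproduce Example 5.1, the plant and reference trajectory amount to a few lines (using the dynamics (61) as reconstructed here; the reference derivative is also needed by the steady-state term (21)):

```python
import numpy as np

def f61(x, u):
    """Non-affine plant (61) of Example 5.1, as reconstructed in this text."""
    g = np.cos(2 * x[0]) + 2
    return np.array([
        -x[0] + x[1],
        -0.5 * x[0] - 0.5 * x[1] * (1 - g**2) + g * u[0] + np.sin(u[1]),
    ])

def reference(t):
    """Desired trajectory x1r = sin(t), x2r = cos(t) + sin(t), and its derivative."""
    xr = np.array([np.sin(t), np.cos(t) + np.sin(t)])
    xr_dot = np.array([np.cos(t), -np.sin(t) + np.cos(t)])
    return xr, xr_dot
```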

Figure 2. State identification error.


Figure 3. State tracking for x1.


Figure 4. State tracking for x2.


Figure 5. Convergence property of the critic NN weights.


Example

The classical 2-DOF single-track vehicle model, shown in Figure 6, is commonly used in AFS/DYC control design [Citation27]. The parameter notation is given in Table 1.

Figure 6. Single-track vehicle model.


Table 1. Description of vehicle model parameters.

The mathematical model of Figure 6, considering the uncertain parameters, is expressed as

(62) \(\dot{x} = (A + \Delta A)x + (B + \Delta B)u + E\delta_f, \qquad y = Cx\)

where \(x = [\beta\ \ \gamma]^T\), β is the side-slip angle and γ is the yaw rate; \(u = [\delta_c\ \ M_c]^T\), where \(\delta_c\) is the active steer angle, \(M_c\) is the corrective yaw moment, and \(\delta_f\) is the driver steer input. The nominal matrices are

\(A = \begin{bmatrix} -\dfrac{2(C_f + C_r)}{m v_x} & -1 - \dfrac{2(C_f l_f - C_r l_r)}{m v_x^2} \\ -\dfrac{2(C_f l_f - C_r l_r)}{I_z} & -\dfrac{2(C_f l_f^2 + C_r l_r^2)}{I_z v_x} \end{bmatrix}, \quad B = \begin{bmatrix} \dfrac{2C_f}{m v_x} & 0 \\ \dfrac{2C_f l_f}{I_z} & \dfrac{1}{I_z} \end{bmatrix}, \quad E = \begin{bmatrix} \dfrac{2C_f}{m v_x} \\ \dfrac{2C_f l_f}{I_z} \end{bmatrix}, \quad C = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\)

and the uncertainties are \(\Delta A = DFE_1\), \(\Delta B = DFE_2\), with \(F = \mathrm{diag}(\rho_f, \rho_r)\) and

\(D = \begin{bmatrix} -\dfrac{2C_f\Delta_f}{m v_x} & -\dfrac{2C_r\Delta_r}{m v_x} \\ -\dfrac{2C_f\Delta_f l_f}{I_z} & \dfrac{2C_r\Delta_r l_r}{I_z} \end{bmatrix}, \quad E_1 = \begin{bmatrix} 1 & \dfrac{l_f}{v_x} \\ 1 & -\dfrac{l_r}{v_x} \end{bmatrix}, \quad E_2 = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}\)

The main objective of vehicle stability control is to design a proper controller so that the actual vehicle yaw rate and side slip follow the desired responses. The reference model is usually selected as

(63) \(\dot{x}_r = A_r x_r + E_r\delta_f\)

where \(x_r = [\beta_r\ \ \gamma_r]^T\), \(A_r = \mathrm{diag}(-1/\tau_\beta,\ -1/\tau_r)\), and

\(E_r = \begin{bmatrix} \dfrac{1 - \dfrac{m l_f}{2(l_f + l_r) l_r C_r}v_x^2}{1 + \dfrac{m}{(l_f + l_r)^2}\left(\dfrac{l_r}{2C_f} - \dfrac{l_f}{2C_r}\right)v_x^2} \\ \dfrac{v_x}{(l_f + l_r)\left(1 + \dfrac{m}{(l_f + l_r)^2}\left(\dfrac{l_r}{2C_f} - \dfrac{l_f}{2C_r}\right)v_x^2\right)} \end{bmatrix}\)

where \(\tau_r\), \(\tau_\beta\) are the design time constants of the yaw rate and side-slip angle, respectively.

It is assumed that the variation and uncertainty of the tire cornering stiffness can be described as

(64) \(C_f = C_{f0}(1 + \Delta_f\rho_f),\ \|\rho_f\| \le 1; \qquad C_r = C_{r0}(1 + \Delta_r\rho_r),\ \|\rho_r\| \le 1\)

where \(C_{f0}\), \(C_{r0}\) and \(C_f\), \(C_r\) are the nominal and actual cornering stiffnesses of the front and rear tires, respectively, \(\Delta_f\), \(\Delta_r\) are the deviation magnitudes, and \(\rho_f\), \(\rho_r\) are perturbations.

Simulation parameters of the vehicle system are selected as m = 1704 kg, Cf = 63224 N/rad, Cr = 84680 N/rad, Iz = 3048 kg m², lf = 1.135 m and lr = 1.555 m. A 28-degree step-steer manoeuvre at an initial speed of 80 km/h is simulated to verify the proposed method. The time-varying parameters Cf and Cr are obtained from (64) by setting Δf, Δr to the constant 0.5 and ρf, ρr to band-limited white noise with amplitude ±0.01. As shown in Figures 7 and 8, the proposed method demonstrates strong robustness and self-adaptive performance, i.e. small tracking errors for the yaw rate and side-slip angle, when encountering time-varying cornering stiffness in the step-steer manoeuvre.
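Under the parameter values above, the nominal matrices of (62) can be assembled directly; the sketch below follows the standard 2-DOF single-track layout and our reconstruction of (62), so the signs should be checked against the original source:

```python
import numpy as np

# Vehicle parameters of Example 5.2
m, Iz = 1704.0, 3048.0             # mass [kg], yaw inertia [kg m^2]
lf, lr = 1.135, 1.555              # CG-to-axle distances [m]
Cf, Cr = 63224.0, 84680.0          # nominal cornering stiffnesses [N/rad]
vx = 80.0 / 3.6                    # longitudinal speed, 80 km/h -> [m/s]

# Nominal 2-DOF single-track model (62): x = [beta, gamma], u = [delta_c, Mc]
A = np.array([
    [-2 * (Cf + Cr) / (m * vx), -1 - 2 * (Cf * lf - Cr * lr) / (m * vx**2)],
    [-2 * (Cf * lf - Cr * lr) / Iz, -2 * (Cf * lf**2 + Cr * lr**2) / (Iz * vx)],
])
B = np.array([
    [2 * Cf / (m * vx), 0.0],
    [2 * Cf * lf / Iz, 1.0 / Iz],
])
E = np.array([2 * Cf / (m * vx), 2 * Cf * lf / Iz])   # driver steer input channel

def perturbed_stiffness(C0, delta, rho):
    """Time-varying cornering stiffness (64): C = C0 (1 + delta * rho), |rho| <= 1."""
    return C0 * (1.0 + delta * rho)
```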

Figure 7. Side-slip angle.


Figure 8. Yaw rate.


To show the identification performance of the proposed algorithm, the root-mean-square (RMS) performance index of the state errors is adopted for comparison:

(65) \(\mathrm{RMS} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} e^2(i)}\)

where n is the number of simulation steps and e(i) is the corresponding state error at the i-th step.
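The index (65) is computed per state over the simulation horizon, e.g.:

```python
import numpy as np

def rms(e):
    """Root-mean-square of a state-error sequence, Eq. (65)."""
    e = np.asarray(e)
    return np.sqrt(np.mean(e**2))
```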

The RMS values of the side slip angle and yaw rate are 0.915 × 10−4 and 3.173 × 10−4, respectively.

6. Conclusions

In this paper, we developed an adaptive optimal controller with a critic-identifier structure to solve the trajectory tracking problem for unknown non-affine nonlinear continuous-time systems. First, a model-free DNN identifier is designed to reconstruct the unknown dynamics. Then, based on the identified model, an adaptive optimal controller is presented, which realizes trajectory tracking and stabilizes the error dynamics optimally. In addition, a critic NN is introduced to approximate the optimal value function, and a novel robust tuning law is established to update the critic NN weights. The stability of the closed-loop system is proved by the Lyapunov approach. Simulation results for two examples verify the validity of the proposed approach.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by the National Natural Science Foundation of China (NSFC) under Grant 62073298, the Key Research and Development Projects of Henan Province in 2022 under Grant 221111240200, and the Special Application for Key Scientific and Technological Project of Henan Province under Grant JDG20220037.

References

  • Powell B. Approximate dynamic programming: solving the curses of dimensionality. New Jersey: Wiley-Blackwell; 2007.
  • Lebedev D, Margellos K, Goulart P. Convexity and feedback in approximate dynamic programming for delivery time slot pricing. IEEE Trans Control Syst Technol. 2022;30(2):893–900.
  • Zhang H, Liu D, Luo Y, et al. Adaptive dynamic programming for control: algorithms and stability. London: Springer; 2013.
  • Lewis FL, Liu D, Editors. Approximate dynamic programming and reinforcement learning for feedback control. Hoboken (NJ): Wiley; 2013.
  • Werbos P. Approximate dynamic programming for real-time control and neural modeling. In: Handbook of intelligent control: neural, fuzzy, and adaptive approaches. New York (NY): Van Nostrand Reinhold; 1992.
  • Miller W, Sutton R, Werbos P. Neural networks for control. Cambridge (MA): MIT Press; 1990.
  • Fairbank M, Alonso E, Prokhorov D. Simple and fast calculation of the second-order gradients for globalized dual heuristic dynamic programming in neural networks. IEEE Trans Neural Netw Learn Syst. 2012;23(10):1671–1676.
  • Zhu L, Modares H, Peen G, et al. Adaptive suboptimal output-feedback control for linear systems using integral reinforcement learning. IEEE Trans Control Syst Technol. 2015;23(1):264–273.
  • Wei Q, Liu D, Shi G. A novel dual iterative q-learning method for optimal battery management in smart residential environments. IEEE Trans Ind Electron. 2015;62(4):2509–2518.
  • Bhasin S, Kamalapurkar R, Johnson M, et al. A novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear systems. Automatica (Oxf). 2013;49(1):82–92.
  • Vrabie D, Pastravanu O, Abu-Khalaf M, et al. Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica (Oxf). 2009;45:477–484.
  • Vamvoudakis K, Lewis F. Online actor critic algorithm to solve the continuous-time infinite horizon optimal control problem. Proc Int Joint Conf Neural Netw. 2009;46:3180–3187.
  • Modares H, Lewis F, Naghibi-Sistani M. Adaptive optimal control of unknown constrained-input systems using policy iteration and neural networks. IEEE Trans Neural Netw Learn Syst. 2013;24(10):1513–1525.
  • Na J, Lv Y, Wu X, et al. Approximate optimal tracking control for continuous-time unknown nonlinear systems. Proceedings of the 33rd Chinese Control Conference; 2014; Nanjing, China. p. 8990–8995.
  • Lv Y, Na J, Yang Q, et al. Online adaptive optimal control for continuous-time nonlinear systems with completely unknown dynamics. Int J Control. 2016;89(1):99–112.
  • Zhang H, Cui L, Luo Y. Near-optimal control for nonzero-sum differential games of continuous-time nonlinear systems using single-network ADP. IEEE Trans Cybern. 2013;43(1):2168–2267.
  • Wei Q, Liu D. Adaptive dynamic programming for optimal tracking control of unknown nonlinear systems with application to coal gasification. IEEE Trans Autom Sci Eng. 2014;11(4):1020–1036.
  • Modares H, Lewis F. Linear quadratic tracking control of partially-unknown continuous-time systems using reinforcement learning. IEEE Trans Autom Control. 2014;59(11):3051–3056.
  • Kamalapurkar R, Dinhb H, Bhasin S, et al. Approximate optimal trajectory tracking for continuous-time nonlinear systems. Automatica (Oxf). 2015;51:40–48.
  • Lv Y, Ren X, Na J. Adaptive optimal tracking controls of unknown multi-input systems based on nonzero-sum game theory. J Franklin Inst. 2019;22(12):2226–2236.
  • Zhang X, Zhang H, Sun Q, et al. Adaptive dynamic programming-based optimal control of unknown nonaffine nonlinear discrete-time systems with proof of convergence. Neurocomputing. 2012;91:48–55.
  • Wang H, Tian Y. Non-affine nonlinear systems adaptive optimal trajectory tracking controller design and application. Stud Inf Control. 2015;24(1):5–11.
  • Li X, Yu W. Dynamic system identification via recurrent multilayer perceptrons. Inf Sci (Ny). 2002;147:45–63.
  • Poznyak A, Yu W, Sanchez E, et al. Nonlinear adaptive trajectory tracking using dynamic neural networks. IEEE Trans Neural Netw. 1999;10(6):1402–1411.
  • Abu-Khalaf M, Lewis F. Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach. Automatica (Oxf). 2005;41(5):779–791.
  • Na J, Yang J, Wu X, et al. Robust adaptive parameter estimation of sinusoidal signals. Automatica (Oxf). 2015;53:376–384.
  • Yang X, Wang Z, Peng W. Coordinated control of AFS and DYC for vehicle handling and stability based on optimal guaranteed cost theory. Veh Syst Dyn. 2009;47(1):57–79.