Automatika
Journal for Control, Measurement, Electronics, Computing and Communications
Volume 64, 2023 - Issue 3
Regular Papers

Online adaptive optimal tracking control for model-free nonlinear systems via a dynamic neural network

Pages 431-440 | Received 14 Jun 2021, Accepted 13 Jan 2023, Published online: 13 Feb 2023

Abstract

This paper presents an online adaptive approximate solution to the optimal tracking control problem for model-free nonlinear systems. First, a dynamic neural network identifier with properly designed weight updating laws is developed to identify the unknown dynamics. Then an adaptive optimal tracking control policy consisting of two terms is proposed: a steady-state control term that ensures the desired tracking performance at steady state, and an optimal control term that stabilizes the tracking error dynamics optimally. The composite Lyapunov method is used to analyse the stability of the closed-loop system. Two simulation examples are presented to demonstrate the effectiveness of the proposed method.

1. Introduction

The basic idea of classical adaptive control is to update the model parameters and the control law, directly or indirectly, so that the control error is minimized; however, the result is generally not optimal. On the other hand, the main drawback of the classical optimal control approach is that the system dynamics must be precisely known in order to solve the Hamilton-Jacobi-Bellman (HJB) equation in an off-line manner [Citation1]. Hence, by merging ideas from adaptive control and optimal control, the adaptive optimal control approach has been developed over the past decade; surveys of this research can be found in [Citation2–4].

To develop online adaptive optimal control, Werbos [Citation5] introduced the general actor-critic (AC) framework. The critic neural network (NN) approximates the evaluation function, mapping states to an estimate of the value function, whereas the actor NN approximates an optimal control law and generates the actions or control signals. Since then, various adaptive optimal control algorithms have been proposed, both model-based (heuristic dynamic programming, HDP [Citation6], and dual heuristic programming, DHP [Citation7]) and model-free (action-dependent heuristic dynamic programming, ADHDP [Citation8], and Q-learning [Citation9]). However, most of the previous work on adaptive optimal control has focused on discrete-time systems. Extending this research to continuous-time systems poses challenges in proving stability and convergence and in ensuring a model-free online updating law [Citation10].

Discretizing a continuous-time system is generally not accurate, especially for high-dimensional systems, where it prohibits the learning process. Hence, online policy-iteration-based algorithms were proposed to solve the linear [Citation11] and nonlinear [Citation12] continuous-time infinite-horizon optimal control problems, involving synchronous adaptation of both the actor and critic NNs. Furthermore, ref. [Citation10] extended the idea of refs. [Citation11,Citation12] by designing a novel AC-identifier architecture to approximate the HJB equation without knowledge of the system drift dynamics, although knowledge of the input dynamics is still required. The recent research in [Citation13] removes this requirement by using the experience replay technique. Based on ref. [Citation10], a simple identifier-critic structure-based optimal control method is proposed in [Citation14,Citation15], where just a critic NN is used to approximate the solution of the HJB equation and to calculate the optimal control action. In [Citation16], an optimal control method for nonzero-sum differential games of continuous-time nonlinear systems is designed directly from the critic NN instead of the actor-critic dual network, which greatly simplifies the algorithm architecture.

Most existing adaptive optimal control studies focus on regulation problems rather than trajectory tracking problems. Considering both aspects ensures not only trajectory tracking and stabilization but also satisfaction of a prescribed performance index (such as minimization of the trajectory error, fuel consumption, etc.). In [Citation17] a data-based iterative optimal learning control scheme is developed to solve a coal gasification optimal tracking control problem in the discrete-time domain. For continuous-time systems, linear quadratic tracking control of partially unknown systems using reinforcement learning is presented in [Citation18], and a nonlinear approximately optimal trajectory tracking method requiring exact model information is developed in [Citation19]. To relax the requirement of an explicit model, a steady-state control in conjunction with an optimal control for nonlinear continuous-time systems is developed in [Citation20], which stabilizes the error dynamics in an optimal way.

Most of the above-mentioned adaptive optimal control methods are based on affine nonlinear systems. To the best of our knowledge, only [Citation21] addresses the adaptive optimal control of unknown non-affine nonlinear systems in the discrete-time domain, and [Citation22] introduces an adaptive recursive control for model-based non-affine nonlinear continuous systems. The optimal control of an unknown non-affine nonlinear continuous-time system is still a challenging task, which is the motivation of this paper.

The main contributions of this paper are listed as follows.

  • (1) The optimal tracking control of unknown non-affine nonlinear systems based on the critic-identifier architecture is proposed for the first time. The model-free property is achieved by a neural identifier in conjunction with novel updating laws for both the weights and the linear-part matrix, which is usually assumed to be a known Hurwitz matrix in conventional black-box nonlinear system identification.

  • (2) An adaptive optimal tracking control policy consisting of two terms is proposed: a steady-state control term that ensures the desired tracking performance at steady state, and an optimal control term that stabilizes the tracking error dynamics optimally. The online solution of the optimal control term is obtained directly by a single critic NN that approximates the optimal cost function of the HJB equation, instead of the conventional actor-critic dual network, which greatly reduces complexity and saves calculation time. A novel learning law driven by the filtered parameter error is proposed for the critic NN. The stability of the entire closed-loop system is proved by a properly designed composite Lyapunov method.

The organization of the paper is as follows. The problem formulation is given in Section 2. The DNN identifier is designed in Section 3. The optimal control strategy, based on the critic-identifier architecture, is presented in Section 4. Two simulation examples verify the proposed scheme in Section 5, and conclusions are drawn in Section 6.

2. Problem formulation

Consider the following non-affine nonlinear continuous-time system

(1) \(\dot{x}(t) = f(x(t), u(t))\)

where \(x(t) = (x_1(t), x_2(t), \ldots, x_n(t))^T \in \mathbb{R}^n\) is the state vector, \(u(t) = (u_1(t), u_2(t), \ldots, u_m(t))^T \in \mathbb{R}^m\) is the control input vector, and \(f(\cdot)\) is an unknown continuous nonlinear smooth function of \(x(t)\) and \(u(t)\).

The objective of the optimal tracking control problem is to design an optimal controller for (1) that makes the state vector \(x(t)\) track the specified trajectory \(x_r(t)\) while minimizing the infinite-horizon performance cost function

(2) \(V(e(t)) = \int_t^{\infty} r(e(\tau), u_e(e(\tau)))\,d\tau\)

where the tracking error is defined as \(e(t) = x(t) - x_r(t)\), and the utility function, with symmetric positive definite matrices \(Q\) and \(R\), is defined as \(r(e(t), u(t)) = e^T(t) Q e(t) + u^T(t) R u(t)\).

From basic optimal control theory, we define the Hamiltonian of (1) as

(3) \(H(e, u, V) = V_e^T f(x(t), u(t)) + e^T Q e + u^T(t) R u(t)\)

where \(V_e \triangleq \partial V / \partial e\) denotes the partial derivative of the cost function \(V(e(t))\) with respect to \(e(t)\).

The optimal cost function \(V^*(e(t))\) is given by

(4) \(V^*(e(t)) = \min_{u \in \psi(\Omega)} \int_t^{\infty} r(e(\tau), u_e(e(\tau)))\,d\tau\)

and it satisfies the HJB equation

(5) \(\min_{u \in \psi(\Omega)} \left[ V_e^{*T} f(x(t), u(t)) + e^T(t) Q e(t) + u^T(t) R u(t) \right] = 0\)

where the control \(u\) is defined to be admissible for (2) on a compact set \(\Omega \subseteq \mathbb{R}^n\), denoted by \(u \in \psi(\Omega)\).

Theoretically, the optimal control for the nonlinear system (1) can be obtained from Equations (4) and (5). However, it cannot be obtained in practical systems for two reasons. (1) The optimal cost function \(V^*(e(t))\) must be obtained by solving the HJB equation (5), but it is usually difficult to solve this high-order nonlinear partial differential equation (PDE) analytically for general nonlinear systems; moreover, the unknown nonlinear dynamics \(f(\cdot)\) make the HJB equation unavailable in the first place. (2) The optimal control \(u^*(t)\) cannot be derived by solving \(\partial H(e, u, V^*)/\partial u = 0\) due to the unavailability of \(V^*(e(t))\).

In this paper, we develop a critic-identifier architecture to solve the optimal control problem of an unknown non-affine nonlinear continuous-time system; all the learning processes are updated online.

3. Adaptive model-free identifier

We employ the following dynamic neural network (DNN) model to approximate the nonlinear dynamic system (1):

(6) \(\dot{\hat{x}}(t) = A\hat{x}(t) + W_1\sigma(V_1\hat{x}(t)) + W_2\phi(V_2\hat{x}(t))u(t)\)

where \(\hat{x}(t) \in \mathbb{R}^n\) is the state of the DNN, \(W_1 \in \mathbb{R}^{n \times m}\) and \(W_2 \in \mathbb{R}^{m \times n}\) are the output-layer weights, \(V_1\) and \(V_2\) are the hidden-layer weights, \(A \in \mathbb{R}^{n \times n}\) is the matrix of the linear part of the NN, \(u(t) = (u_1(t), \ldots, u_k(t), 0, \ldots, 0)^T \in \mathbb{R}^m\) is the control input, and the activation function \(\sigma(\cdot)\) (as well as \(\phi(\cdot)\)) is the sigmoidal vector function \(\sigma(x) = a/(1 + e^{-bx}) - c\), where \(a\), \(b\) and \(c\) are constants.

Remark

If we define \(W = [W_1, W_2]\) and \(\Xi = [\sigma(V_1\hat{x}(t))^T, (\phi(V_2\hat{x}(t))u(t))^T]^T\), then (6) can be written as \(\dot{\hat{x}}(t) = A\hat{x}(t) + W\Xi\). It has been proved in [Citation23] that a DNN of the form \(\dot{\hat{x}}(t) = A\hat{x}(t) + W\Xi\) can approximate the nonlinear system (1) to any degree of accuracy if the hidden layer is large enough. Here, to simplify the analysis, we consider the simplest structure (i.e. \(m = n\), \(V = I\), \(\phi(\cdot) = I\)).

Then the nonlinear system (1) can be modelled by the DNN as follows:

(7) \(\dot{x}(t) = A^* x(t) + W_1^*\sigma(x(t)) + W_2^* u(t) + \xi_1\)

where \(A^*\), \(W_1^*\), \(W_2^*\) are the nominal unknown matrices, \(W_1^*\) and \(W_2^*\) are bounded as \(W_1^*\Lambda_1^{-1}W_1^{*T} \le \bar{W}_1\) and \(W_2^*\Lambda_2^{-1}W_2^{*T} \le \bar{W}_2\) (\(\Lambda_1\), \(\Lambda_2\) are any positive definite symmetric matrices), and \(\xi_1\) is the modelling error or disturbance, assumed to be bounded.

Assumption

The identification error is defined as \(\Delta x = x(t) - \hat{x}(t)\). The difference of the activation functions, \(\tilde{\sigma} = \sigma(x(t)) - \sigma(\hat{x}(t))\), satisfies the generalized Lipschitz condition \(\tilde{\sigma}^T\Lambda\tilde{\sigma} \le \Delta x^T D \Delta x\), where \(D = D^T > 0\) is a known normalizing matrix.

Then from (6) and (7) we obtain the error dynamics

(8) \(\Delta\dot{x} = A^*\Delta x + \tilde{A}\hat{x}(t) + W_1^*\tilde{\sigma} + \tilde{W}_1\sigma(\hat{x}(t)) + \tilde{W}_2 u(t) + \xi_1\)

where \(\tilde{A} = A^* - A\), \(\tilde{W}_1 = W_1^* - W_1\) and \(\tilde{W}_2 = W_2^* - W_2\).

Lemma

[Citation24]

Let \(A \in \mathbb{R}^{n \times n}\) be a Hurwitz matrix and \(R, Q \in \mathbb{R}^{n \times n}\) with \(R = R^T > 0\), \(Q = Q^T > 0\). If \((A, R^{1/2})\) is controllable, \((A, Q^{1/2})\) is observable and

\(A^T R^{-1} A - Q \ge \tfrac{1}{4}\left(A^T R^{-1} - R^{-1} A\right) R \left(A^T R^{-1} - R^{-1} A\right)^T\)

is satisfied, then the algebraic Riccati equation \(A^T X + X A + X R X + Q = 0\) has a unique positive definite solution \(X = X^T > 0\).

Theorem

Consider the identification scheme (6) for (1) with the following updating laws:

(9) \(\dot{A} = k_1\Delta x\,\hat{x}^T, \quad \dot{W}_1 = k_2\Delta x\,\sigma^T(\hat{x}), \quad \dot{W}_2 = k_3\Delta x\,u^T\)

where \(k_1\), \(k_2\) and \(k_3\) are positive constants. These laws guarantee the following stability properties:

  1. For the precise identifier case, i.e. \(\xi_1 = 0\): \(\hat{W}_{1,2}, \hat{A} \in L_\infty\), \(\Delta x \in L_2 \cap L_\infty\), and \(\lim_{t\to\infty}\Delta x = 0\).

  2. For bounded modelling error and disturbances, i.e. \(\|\xi_1\| \le \bar{\xi}_1\): \(\Delta x, \hat{W}_{1,2}, \hat{A} \in L_\infty\).

Proof:

Consider the Lyapunov function candidate

(10) \(L_I = \Delta x^T P\Delta x + \frac{1}{k_1}\mathrm{tr}\{\tilde{A}^T P\tilde{A}\} + \frac{1}{k_2}\mathrm{tr}\{\tilde{W}_1^T P\tilde{W}_1\} + \frac{1}{k_3}\mathrm{tr}\{\tilde{W}_2^T P\tilde{W}_2\}\)

Differentiating (10) and using (8) yields

(11) \(\dot{L}_I = \Delta x^T(A^{*T}P + PA^*)\Delta x + 2\Delta x^T P\tilde{A}\hat{x} + 2\Delta x^T P\tilde{W}_1\sigma(\hat{x}) + 2\Delta x^T P\tilde{W}_2 u + 2\Delta x^T P W_1^*\tilde{\sigma} + 2\Delta x^T P\xi_1 + \frac{2}{k_1}\mathrm{tr}\{\dot{\tilde{A}}^T P\tilde{A}\} + \frac{2}{k_2}\mathrm{tr}\{\dot{\tilde{W}}_1^T P\tilde{W}_1\} + \frac{2}{k_3}\mathrm{tr}\{\dot{\tilde{W}}_2^T P\tilde{W}_2\}\)

Using the updating laws (9) and the facts \(\dot{\tilde{A}} = -\dot{A}\), \(\dot{\tilde{W}}_{1,2} = -\dot{W}_{1,2}\), (11) becomes

(12) \(\dot{L}_I = \Delta x^T(A^{*T}P + PA^*)\Delta x + 2\Delta x^T P W_1^*\tilde{\sigma} + 2\Delta x^T P\xi_1\)

Using the matrix inequality

(13) \(X^T Y + (X^T Y)^T \le X^T\Lambda^{-1}X + Y^T\Lambda Y\)

where \(X, Y\) are any matrices of compatible dimensions and \(\Lambda\) is any positive definite matrix, together with Assumption 3.1, one obtains

(14) \(2\Delta x^T P W_1^*\tilde{\sigma} \le \Delta x^T P W_1^*\Lambda^{-1}W_1^{*T}P\Delta x + \tilde{\sigma}^T\Lambda\tilde{\sigma} \le \Delta x^T P\bar{W}_1 P\Delta x + \Delta x^T D\Delta x, \qquad 2\Delta x^T P\xi_1 \le \Delta x^T P\Lambda_\xi P\Delta x + \xi_1^T\Lambda_\xi^{-1}\xi_1\)

Substituting (14) into (12) gives

(15) \(\dot{L}_I \le \Delta x^T(A^{*T}P + PA^* + P\bar{W}_1 P + D + Q_0)\Delta x - \Delta x^T Q_0\Delta x + \Delta x^T P\Lambda_\xi P\Delta x + \xi_1^T\Lambda_\xi^{-1}\xi_1\)

By defining \(R = \bar{W}_1\) and \(Q = D + Q_0\), if we select a proper \(Q_0\) such that \(Q\) satisfies the conditions in Lemma 3.1, there exists a matrix \(P\) satisfying \(A^{*T}P + PA^* + PRP + Q = 0\).

Hence (15) becomes

(16) \(\dot{L}_I \le -\Delta x^T Q_0\Delta x + \Delta x^T P\Lambda_\xi P\Delta x + \xi_1^T\Lambda_\xi^{-1}\xi_1\)

Case 1: For the precise identifier case, i.e. \(\xi_1 = 0\), (16) becomes

(17) \(\dot{L}_I \le -\Delta x^T Q_0\Delta x \le -\lambda_{\min}(Q_0)\|\Delta x\|^2 \le 0\)

From (17) we get \(\Delta x, \hat{W}_{1,2}, \hat{A} \in L_\infty\). Furthermore, from the error dynamics (8) we have \(\Delta\dot{x} \in L_\infty\). Integrating (17) on both sides from 0 to \(\infty\) gives \(\int_0^\infty \lambda_{\min}(Q_0)\|\Delta x\|^2\,dt \le L_I(0) - L_I(\infty) < \infty\), which implies \(\Delta x \in L_2\). Since \(\Delta x \in L_2 \cap L_\infty\) and \(\Delta\dot{x} \in L_\infty\), Barbalat's lemma gives \(\lim_{t\to\infty}\Delta x = 0\).

Case 2: For bounded modelling error and disturbances, i.e. \(\|\xi_1\| \le \bar{\xi}_1\), Equation (16) can be represented as

(18) \(\dot{L}_I \le -\Delta x^T Q_0\Delta x + \Delta x^T P\Lambda_\xi P\Delta x + \xi_1^T\Lambda_\xi^{-1}\xi_1 \le -\alpha(\|\Delta x\|) + \beta(\|\xi_1\|)\)

where \(\alpha(\|\Delta x\|) = (\lambda_{\min}(Q_0) - \lambda_{\max}(P\Lambda_\xi P))\|\Delta x\|^2\) and \(\beta(\|\xi_1\|) = \lambda_{\max}(\Lambda_\xi^{-1})\|\xi_1\|^2\).

Since \(\alpha(\cdot)\) and \(\beta(\cdot)\) are class-\(\mathcal{K}\) functions, \(L_I\) is an ISS-Lyapunov function. By Theorem 3.1 in [Citation24], the identification error dynamics (8) are input-to-state stable, which implies \(\Delta x, \hat{W}_{1,2}, \hat{A} \in L_\infty\). This completes the proof of Theorem 3.1.
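To make Theorem 3.1 concrete, the following minimal simulation sketch (Python/NumPy, forward-Euler integration) runs the DNN (6) with the updating laws (9); the plant, gains and probing input are our own illustrative assumptions, not the authors' code.

```python
import numpy as np

def sigma(x, a=2.0, b=2.0, c=0.5):
    """Sigmoidal activation of Section 3: sigma(x) = a / (1 + exp(-b*x)) - c."""
    return a / (1.0 + np.exp(-b * x)) - c

def plant(x, u):
    """Stand-in for the unknown plant f(x, u) of (1); illustrative only."""
    return np.array([
        -x[0] + x[1],
        -0.5 * x[0] - 0.5 * x[1] + (np.cos(2 * x[0]) + 2) * u[0] + np.sin(u[1]),
    ])

n, m = 2, 2
dt, steps = 1e-3, 10_000
k1 = k2 = k3 = 1.0                               # adaptation gains of (9)

x, xh = np.zeros(n), np.zeros(n)                 # plant / identifier states
A = -np.eye(n)                                   # estimate of the linear part
W1, W2 = np.zeros((n, n)), np.zeros((n, m))      # weight estimates

for k in range(steps):
    t = k * dt
    u = np.array([np.sin(0.5 * t), np.cos(0.5 * t)])   # probing (PE-like) input
    dx = x - xh                                        # identification error
    xh_dot = A @ xh + W1 @ sigma(xh) + W2 @ u          # DNN model (6), phi = I
    A += dt * k1 * np.outer(dx, xh)                    # updating laws (9)
    W1 += dt * k2 * np.outer(dx, sigma(xh))
    W2 += dt * k3 * np.outer(dx, u)
    xh += dt * xh_dot
    x += dt * plant(x, u)

print("final identification error:", np.linalg.norm(x - xh))
```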

4. Optimal control design

In this section, the adaptive optimal control is designed based on the DNN identifier. From Section 3, we know that the nonlinear system (1) can be represented by the DNN with the updating law (9) as follows:

(19) \(\dot{x} = A\hat{x} + W_1\sigma(\hat{x}) + W_2 u + \xi_1\)

where the model error \(\xi_1\) is still assumed to be bounded, \(\|\xi_1\| \le \bar{\xi}_1\), and \(\Delta x\) and \(W_{1,2}\) are bounded as in Theorem 3.1.

Then (19) can be further rewritten as

(20) \(\dot{x} = Ax + W_1\sigma(\hat{x}) + W_2 u + \xi_2\)

where \(\xi_2 = \xi_1 + A\hat{x} - Ax = \xi_1 - A\Delta x\). For bounded \(\xi_1\) and \(\Delta x\), \(\xi_2\) is bounded as well, i.e. \(\|\xi_2\| \le \bar{\xi}_2\).

To achieve optimal tracking control, the control action \(u\) is designed as \(u = u_r + u_e\), where \(u_r\) is the steady-state control that maintains the desired tracking performance at steady state, and \(u_e\) is the adaptive optimal control that minimizes the infinite-horizon performance index. \(u_r\) should be designed to compensate for the nonlinear dynamics in (20); hence let

(21) \(u_r = W_2^{+}\left[\dot{x}_r - Ax - W_1\sigma(x) - Ke\right]\)

where \(e = x - x_r\) denotes the state tracking error, \(K\) is the feedback gain and \(W_2^{+}\) denotes the generalized inverse of \(W_2\).
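As an illustration, (21) is a one-line computation once the identifier estimates are available; a minimal sketch in Python/NumPy (the function name and the scalar gain K, as used in Example 5.1, are our own choices):

```python
import numpy as np

def steady_state_control(A, W1, W2, sigma, x, xr, xr_dot, K):
    """Steady-state term (21): u_r = W2^+ (dx_r/dt - A x - W1 sigma(x) - K e)."""
    e = x - xr
    # np.linalg.pinv gives the generalized (Moore-Penrose) inverse W2^+.
    return np.linalg.pinv(W2) @ (xr_dot - A @ x - W1 @ sigma(x) - K * e)
```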

From (20) and (21), the error dynamics become

(22) \(\dot{e} = -Ke + W_2 u_e + \xi_2\)

The tracking problem for (20) is thus transformed into a regulation problem for (22), and the adaptive optimal control \(u_e\) is designed to stabilize (22) optimally. Hence the infinite-horizon performance cost function (2) is rewritten as

(23) \(V(e(t)) = \int_t^{\infty} r(e(\tau), u_e(e(\tau)))\,d\tau\)

where \(r(e, u_e) = e^T Q e + u_e^T R u_e\) is the utility function with the optimal control \(u_e\).

According to the optimal regulator design in [Citation25], an admissible control policy \(u_e\) should be designed to minimize the infinite-horizon cost function (23) associated with (22). The Hamiltonian of (22) is

(24) \(H(e, u_e, V) = V_e^T\left[-Ke + W_2 u_e + \xi_2\right] + e^T Q e + u_e^T R u_e\)

where \(V_e = \partial V(e)/\partial e\) is the partial derivative of the value function with respect to \(e\).

Then we define the optimal cost function as

(25) \(V^*(e(t)) = \min_{u_e \in \psi(\Omega)}\int_t^{\infty} r(e(\tau), u_e(e(\tau)))\,d\tau\)

and it satisfies the following HJB equation:

(26) \(\min_{u_e \in \psi(\Omega)}\left[H(e, u_e, V^*)\right] = 0\)

The optimal control \(u_e^*\) for (22) can be obtained by solving \(\partial H(e, u_e, V^*)/\partial u_e = 0\) from (24):

(27) \(u_e^* = -\tfrac{1}{2}R^{-1}W_2^T\frac{\partial V^*(e)}{\partial e}\)

where \(V^*(e)\) is the solution of the HJB equation (26).

From (27) we see that the optimal control \(u_e^*\) depends on the optimal value function \(V^*(e)\). However, it is difficult to solve the nonlinear partial differential HJB equation (26) for \(V^*(e)\). The usual method is to obtain an approximate solution via a critic NN [Citation4,Citation5,Citation25]. A single-layer NN is used to approximate the optimal value function:

(28) \(V^*(e) = W_3^T\psi(e) + \xi_3\)

with derivative

(29) \(\frac{\partial V^*(e)}{\partial e} = \nabla\psi^T(e)W_3 + \nabla\xi_3\)

where \(W_3 \in \mathbb{R}^{l}\) is the nominal weight vector, \(\psi(e) \in \mathbb{R}^{l}\) is the activation function, \(\xi_3\) is the approximation error, and \(l\) is the number of neurons. \(\nabla\psi(e) = \partial\psi(e)/\partial e\) and \(\nabla\xi_3 = \partial\xi_3/\partial e\) are the partial derivatives of \(\psi(e)\) and \(\xi_3\) with respect to \(e\), respectively.

Assumption

The nominal weight vector \(W_3\), the activation function \(\psi(e)\) and its derivative \(\nabla\psi(e)\) are all bounded, i.e. \(\|W_3\| \le \bar{W}_3\), \(\|\psi(e)\| \le \bar{\psi}_1\), \(\|\nabla\psi(e)\| \le \bar{\psi}_2\) and \(\|\nabla\xi_3\| \le \bar{\psi}_3\).

Then, substituting (29) into (27), one obtains

(30) \(u_e^* = -\tfrac{1}{2}R^{-1}W_2^T\left(\nabla\psi^T(e)W_3 + \nabla\xi_3\right)\)

The critic NN approximation is

(31) \(\hat{V}(e) = \hat{W}_3^T\psi(e)\)

where \(\hat{W}_3\) is the estimate of the nominal \(W_3\).

Then the approximate optimal control is obtained from (30) and (31) as

(32) \(\hat{u}_e = -\tfrac{1}{2}R^{-1}W_2^T\nabla\psi^T(e)\hat{W}_3\)
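The evaluation of (32) is equally direct. The sketch below assumes the quadratic error basis \(\psi = [e_1^2, e_1e_2, e_2^2]^T\) used later in Example 5.1; for any other basis, only grad_psi changes:

```python
import numpy as np

def grad_psi(e):
    """Jacobian of psi(e) = [e1^2, e1*e2, e2^2] w.r.t. e (basis of Example 5.1)."""
    e1, e2 = e
    return np.array([[2 * e1, 0.0],
                     [e2,     e1],
                     [0.0,    2 * e2]])

def optimal_control(e, W3_hat, W2, R):
    """Approximate optimal term (32): u_e = -1/2 R^{-1} W2^T grad_psi(e)^T W3_hat."""
    return -0.5 * np.linalg.solve(R, W2.T @ grad_psi(e).T @ W3_hat)
```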

Remark

Available adaptive optimal control methods are usually based on a dual NN architecture, where a critic NN and an action NN approximate the optimal cost function and the optimal control policy, respectively. The complicated structure and computational burden make practical implementation difficult. In the following, we calculate the optimal control action directly from the critic NN instead of the actor-critic dual network.

Substituting (28) into (24), one obtains

(33) \(0 = W_3^T\nabla\psi(e)\left[-Ke + W_2 u_e\right] + e^T Q e + u_e^T R u_e + \xi_{HJB}\)

where \(\xi_{HJB} = W_3^T\nabla\psi(e)\xi_2 + \nabla\xi_3^T\left[-Ke + W_2 u_e + \xi_2\right]\) is the residual HJB equation error due to the DNN identifier error \(\xi_2\) and the NN approximation error \(\xi_3\).

Then (33) can be written in the general identification form

(34) \(Y = W_3^T X - \xi_{HJB}\)

where \(X = -\nabla\psi(e)\left[-Ke + W_2 u_e\right]\) and \(Y = e^T Q e + u_e^T R u_e\).

According to least-squares learning rules, the estimate of the nominal \(W_3\) would be \(\hat{W}_3 = (XX^T)^{-1}XY^T\) if the residual HJB equation error were zero. However, \(\xi_{HJB}\) is not always zero, and it is also difficult to carry out the subsequent closed-loop stability analysis based on the least-squares method. Inspired by [Citation14,Citation26], we develop a novel robust estimation method for \(W_3\). The following equation is used to identify (34):

(35) \(Y = W_3^T X - \xi_{HJB1}\)

where \(\xi_{HJB1}\) can be regarded as the model error and unknown disturbance.

For (35), a filtered version of \(Y\) is defined as

(36) \(\dot{z} = -\tau z + Y, \quad z(0) = 0\)

where \(\tau > 0\) is a filter constant and \(z\) is an auxiliary variable.

We further define the auxiliary filtered variables \(z_f\), \(Y_f\), \(X_f\) and \(\xi_{HJB1f}\) as

(37) \(\eta\dot{z}_f + z_f = z,\ z_f(0) = 0; \quad \eta\dot{Y}_f + Y_f = Y,\ Y_f(0) = 0; \quad \eta\dot{X}_f + X_f = X,\ X_f(0) = 0; \quad \eta\dot{\xi}_{HJB1f} + \xi_{HJB1f} = \xi_{HJB1},\ \xi_{HJB1f}(0) = 0\)

where \(\eta\) is a filter parameter. It should be noted that the fictitious filtered variable \(\xi_{HJB1f}\) is used only for analysis.

Then we get

(38) \(Y_f = W_3^T X_f - \xi_{HJB1f}\)

(39) \(\dot{z}_f = -\tau z_f + Y_f\)

From the first equation in (37), one obtains

(40) \(\dot{z}_f = (z - z_f)/\eta\)

According to (38), (39) and (40), we have

(41) \((z - z_f)/\eta + \tau z_f = W_3^T X_f - \xi_{HJB1f}\)

Furthermore, we define the auxiliary regression matrix \(E \in \mathbb{R}^{l \times l}\) and vector \(F \in \mathbb{R}^{l}\) as

(42) \(\dot{E}(t) = -\eta E(t) + X_f X_f^T,\ E(0) = 0; \quad \dot{F}(t) = -\eta F(t) + X_f\left[(z - z_f)/\eta + \tau z_f\right],\ F(0) = 0\)

where \(\eta\) is the positive constant defined in (37).

The solution of (42) is

(43) \(E(t) = \int_0^t e^{-\eta(t-r)}X_f(r)X_f^T(r)\,dr, \quad F(t) = \int_0^t e^{-\eta(t-r)}X_f(r)\left[(z(r) - z_f(r))/\eta + \tau z_f(r)\right]dr\)

Finally, we define the vector

(44) \(M = E(t)\hat{W}_3 - F(t)\)

The adaptive law for updating \(\hat{W}_3\) is

(45) \(\dot{\hat{W}}_3 = -\mu M\)

where \(\mu > 0\) is the learning gain.
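Since (36), (37), (42) and (45) are all first-order ODEs, the whole critic update reduces to simple bookkeeping when integrated by forward Euler. The class below is a hedged sketch of that bookkeeping (gains, dimensions and the discretization are illustrative choices, not the authors' code):

```python
import numpy as np

class CriticUpdater:
    """Filtered-regressor update of the critic weights, Eqs. (36)-(45); a sketch."""

    def __init__(self, l, mu=100.0, tau=1.0, eta=0.1, dt=1e-3):
        self.mu, self.tau, self.eta, self.dt = mu, tau, eta, dt
        self.z = 0.0                      # auxiliary variable of (36)
        self.zf = 0.0                     # filtered z, first equation of (37)
        self.Xf = np.zeros(l)             # filtered regressor, (37)
        self.E = np.zeros((l, l))         # auxiliary regression matrix, (42)
        self.F = np.zeros(l)              # auxiliary vector, (42)
        self.W3 = np.zeros(l)             # critic weight estimate

    def step(self, X, Y):
        """One Euler step given the current regressor X and target Y of (34)."""
        dt, tau, eta = self.dt, self.tau, self.eta
        self.z += dt * (-tau * self.z + Y)           # (36)
        self.zf += dt * (self.z - self.zf) / eta     # (37): eta*zf_dot + zf = z
        self.Xf += dt * (X - self.Xf) / eta          # (37)
        rhs = (self.z - self.zf) / eta + tau * self.zf
        self.E += dt * (-eta * self.E + np.outer(self.Xf, self.Xf))  # (42)
        self.F += dt * (-eta * self.F + self.Xf * rhs)               # (42)
        M = self.E @ self.W3 - self.F                # (44)
        self.W3 += dt * (-self.mu * M)               # adaptive law (45)
        return self.W3
```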

Theorem

For system (34) with the updating law (45), the critic weight estimation error \(\tilde{W}_3 = \hat{W}_3 - W_3\) converges to a compact set around zero.

Proof:

The Lyapunov function is selected as

(46) \(L_o = \tfrac{1}{2}\tilde{W}_3^T\mu^{-1}\tilde{W}_3\)

Then, substituting (43) into (44), one obtains

(47) \(M = E(t)\hat{W}_3 - F(t) = E(t)\tilde{W}_3 + \varsigma_f\)

where \(\varsigma_f = \int_0^t e^{-\eta(t-r)}X_f(r)\xi_{HJB1f}(r)\,dr\) is bounded as \(\|\varsigma_f\| \le \bar{\varsigma}_f\).

It can be seen from [Citation26] that persistent excitation (PE) of \(X\) makes the matrix \(E\) defined in (43) positive definite, i.e. \(\lambda_{\min}(E) > \sigma > 0\). Then, since \(\dot{\tilde{W}}_3 = \dot{\hat{W}}_3\), the derivative of (46) is

(48) \(\dot{L}_o = \tilde{W}_3^T\mu^{-1}\dot{\tilde{W}}_3 = -\tilde{W}_3^T E(t)\tilde{W}_3 - \tilde{W}_3^T\varsigma_f \le -\|\tilde{W}_3\|\left(\sigma\|\tilde{W}_3\| - \bar{\varsigma}_f\right)\)

Hence \(\tilde{W}_3\) converges to the compact set \(\Omega_W := \{\tilde{W}_3 : \|\tilde{W}_3\| \le \bar{\varsigma}_f/\sigma\}\).

Theorem

For system (1) with the adaptive optimal control signal given by (21) and (32) and the adaptive laws (9) and (45), the tracking error \(e\) is uniformly ultimately bounded, and the approximate optimal control \(\hat{u}_e\) in (32) converges to a small bound around the ideal optimal solution \(u_e^*\) in (30).

Proof:

Design the composite Lyapunov function \(L = L_I + L_o + L_C\), where \(L_I\) is given by (10) and, from (18), its time derivative satisfies

(49) \(\dot{L}_I \le -(\lambda_{\min}(Q_0) - \lambda_{\max}(P\Lambda_\xi P))\|\Delta x\|^2 + \lambda_{\max}(\Lambda_\xi^{-1})\|\xi_1\|^2\)

\(L_o\) is defined in (46) and, from (48), its derivative satisfies

(50) \(\dot{L}_o = \tilde{W}_3^T\mu^{-1}\dot{\tilde{W}}_3 = -\tilde{W}_3^T E(t)\tilde{W}_3 - \tilde{W}_3^T\varsigma_f \le -\sigma\|\tilde{W}_3\|^2 + \|\tilde{W}_3\|\bar{\varsigma}_f\)

Using the basic inequality \(ab \le a^2/(2\delta) + \delta b^2/2\) with \(\delta > 0\), (50) can be rewritten as

(51) \(\dot{L}_o \le -\left(\sigma - \frac{1}{2\delta}\right)\|\tilde{W}_3\|^2 + \frac{\delta\bar{\varsigma}_f^2}{2}\)

\(L_C\) is defined as

(52) \(L_C = \Gamma e^T e + \kappa V^*(e)\)

where \(V^*(e)\) is the optimal cost function defined in (25) and \(\Gamma, \kappa > 0\) are positive constants.

Substituting (32) into (22), one obtains

(53) \(\dot{e} = -Ke + W_2\left(-\tfrac{1}{2}R^{-1}W_2^T\nabla\psi^T\hat{W}_3\right) + \xi_2 = -Ke - \tfrac{1}{2}W_2R^{-1}W_2^T\nabla\psi^T\tilde{W}_3 + W_2 u_e^* + \tfrac{1}{2}W_2R^{-1}W_2^T\nabla\xi_3 + \xi_2\)

The time derivative of (52) then follows from (28) and (53) as

(54) \(\dot{L}_C = 2\Gamma e^T\dot{e} + \kappa\left(-e^TQe - u_e^{*T}Ru_e^*\right) \le -\left[\Gamma K + \kappa\lambda_{\min}(Q) - \Gamma\left(\|W_2^TR^{-1}W_2\nabla\psi\| + \|W_2R^{-1}W_2^T\| + 2\right)\right]\|e\|^2 + \tfrac{1}{4}\Gamma\|W_2^TR^{-1}W_2\nabla\psi\|\,\|\tilde{W}_3\|^2 - \left[\kappa\lambda_{\min}(R) - \Gamma\|W_2\|^2\right]\|u_e^*\|^2 + \tfrac{1}{2}\Gamma\|W_2R^{-1}W_2^T\|\,\nabla\xi_3^T\nabla\xi_3 + \Gamma\xi_2^T\xi_2\)

From (49), (51) and (54), the time derivative \(\dot{L} = \dot{L}_I + \dot{L}_o + \dot{L}_C\) satisfies

(55) \(\dot{L} \le -(\lambda_{\min}(Q_0) - \lambda_{\max}(P\Lambda_\xi P))\|\Delta x\|^2 - \left[\Gamma K + \kappa\lambda_{\min}(Q) - \Gamma\left(\|W_2^TR^{-1}W_2^T\|(\|\nabla\psi\| + 1) + 2\right)\right]\|e\|^2 - \left[\kappa\lambda_{\min}(R) - \Gamma\|W_2\|^2\right]\|u_e^*\|^2 - \left[\sigma - \frac{1}{2\delta} - \frac{1}{4}\Gamma\|W_2^TR^{-1}W_2\nabla\psi\|\right]\|\tilde{W}_3\|^2 + \lambda_{\max}(\Lambda_\xi^{-1})\|\xi_1\|^2 + \frac{1}{2}\Gamma\|W_2R^{-1}W_2^T\|\,\nabla\xi_3^T\nabla\xi_3 + \Gamma\xi_2^T\xi_2 + \frac{\delta\bar{\varsigma}_f^2}{2}\)

If the parameters are chosen to satisfy

(56) \(\lambda_{\min}(Q_0) > \lambda_{\max}(P\Lambda_\xi P), \quad \Gamma < \frac{4\sigma\delta - 2}{\delta\|W_2^TR^{-1}W_2\nabla\psi\|}, \quad \kappa > \max\left\{\frac{\Gamma\left(\|W_2^TR^{-1}W_2^T\|(\|\nabla\psi\| + 1) + 2\right)}{\lambda_{\min}(Q)}, \frac{\Gamma\|W_2\|^2}{\lambda_{\min}(R)}\right\}\)

then (55) can be further represented as

(57) \(\dot{L} \le -h_1\|\Delta x\|^2 - h_2\|\tilde{W}_3\|^2 - h_3\|e\|^2 + \vartheta\)

where \(h_1 = \lambda_{\min}(Q_0) - \lambda_{\max}(P\Lambda_\xi P)\), \(h_2 = \sigma - \frac{1}{2\delta} - \frac{1}{4}\Gamma\|W_2^TR^{-1}W_2\nabla\psi\|\), \(h_3 = \Gamma K + \kappa\lambda_{\min}(Q) - \Gamma\left(\|W_2^TR^{-1}W_2^T\|(\|\nabla\psi\| + 1) + 2\right)\) and \(\vartheta = \lambda_{\max}(\Lambda_\xi^{-1})\|\xi_1\|^2 + \frac{1}{2}\Gamma\|W_2R^{-1}W_2^T\|\,\nabla\xi_3^T\nabla\xi_3 + \Gamma\xi_2^T\xi_2 + \frac{\delta\bar{\varsigma}_f^2}{2}\) are all positive constants under condition (56).

Then \(\dot{L} < 0\) whenever

(58) \(\|\Delta x\| > \sqrt{\vartheta/h_1}, \quad \|\tilde{W}_3\| > \sqrt{\vartheta/h_2}, \quad \|e\| > \sqrt{\vartheta/h_3}\)

which means the identification error \(\Delta x\), the tracking error \(e\) and the NN weight error \(\tilde{W}_3\) are all uniformly ultimately bounded.

Moreover, we have

(59) \(\hat{u}_e - u_e^* = -\tfrac{1}{2}R^{-1}W_2^T\nabla\psi^T\tilde{W}_3 + \tfrac{1}{2}R^{-1}W_2^T\nabla\xi_3\)

As \(t \to \infty\), the upper bound of (59) is

(60) \(\lim_{t\to\infty}\|\hat{u}_e - u_e^*\| \le \tfrac{1}{2}\|R^{-1}W_2^T\|\left(\|\nabla\psi^T\|\|\tilde{W}_3\| + \bar{\psi}_3\right) \triangleq \zeta\)

where \(\zeta\) depends on the DNN identification approximation error and the critic NN weight error \(\tilde{W}_3\).

The structure diagram of the control scheme is illustrated in Figure 1.

Figure 1. Structural diagram of the control scheme.


A summary of the ADP-based optimal tracking control algorithm is as follows:

  • (1) Select proper activation functions σ(·) and ϕ(·) in Equation (6) and the updating gains k1, k2, k3 in Equation (9) for the identifier. σ(·) is usually selected as the sigmoidal function \(\sigma(x) = a/(1 + e^{-bx}) - c\), where a, b and c are design constants, and ϕ(·) is selected as ϕ(·) = I. The estimates \(\hat{A}\), \(\hat{W}_1\) and \(\hat{W}_2\) are tuned online according to Equation (9), so there is no need to select their initial weight values. Meanwhile, select a proper activation function ψ(·) in Equation (31) and the updating gain μ in Equation (45) for the critic NN; ψ(·) is usually selected as a smooth function consisting of different combinations of the state tracking errors.

  • (2) The inputs/outputs data of an unknown non-affine nonlinear system (1) is used to train the identifier.

  • (3) The adaptive optimal tracking control law, consisting of the steady-state control law in Equation (21) and the optimal control law in Equation (32), is obtained based on the first two steps; a glue-code sketch of one step of the resulting loop is given below.
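For orientation, the function below shows how the pieces sketched in Sections 3 and 4 would fit together in one integration step; ident and critic are hypothetical wrappers around those sketches (identifier states/weights and the CriticUpdater), and none of this is the authors' published code:

```python
import numpy as np

def control_step(x, xr, xr_dot, ident, critic, K, Q, R):
    """One step of the ADP-based tracking loop (hypothetical glue code)."""
    e = x - xr
    # Steps 1-2: identifier supplies (A, W1, W2); critic supplies W3.
    u_r = steady_state_control(ident.A, ident.W1, ident.W2, sigma, x, xr, xr_dot, K)
    u_e = optimal_control(e, critic.W3, ident.W2, R)
    u = u_r + u_e                              # composite law u = u_r + u_e
    # Regressor/target of (34), built from the identified model and critic basis.
    X = -grad_psi(e) @ (-K * e + ident.W2 @ u_e)
    Y = e @ Q @ e + u_e @ R @ u_e
    critic.step(X, Y)                          # critic update law (45)
    ident.step(x, u)                           # identifier update laws (9)
    return u
```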

5. Simulations

We consider the following two examples to illustrate the theoretical results in this section.

Example

Consider the following non-affine nonlinear system:

(61) \(\begin{bmatrix}\dot{x}_1\\ \dot{x}_2\end{bmatrix} = \begin{bmatrix} -x_1 + x_2 \\ -0.5x_1 - 0.5x_2\left(1 - (\cos(2x_1) + 2)^2\right)\end{bmatrix} + \begin{bmatrix} 0 \\ u_1(\cos(2x_1) + 2) + \sin(u_2)\end{bmatrix}\)

The matrices Q and R of the performance index function are chosen as identity matrices. The control objective is to make the states x1 and x2 follow the desired trajectories \(x_{1r} = \sin t\) and \(x_{2r} = \cos t + \sin t\). First, the DNN identifier (6) with the updating law (9) is used to identify the non-affine nonlinear system. The parameters are selected as \(k_1 = k_2 = k_3 = 1\), and the activation function is \(\sigma(x) = 2/(1 + e^{-2x}) - 0.5\).

The identification error is shown in Figure 2; the proposed identifier models the non-affine nonlinear system accurately. Then, with the identified model, the adaptive optimal tracking controller is implemented for the unknown non-affine nonlinear continuous system (61). Define the trajectory errors as \(e_1 = x_1 - x_{1r}\) and \(e_2 = x_2 - x_{2r}\). The activation function of the critic NN is selected as \(\psi = [e_1^2, e_1e_2, e_2^2]^T\). The adaptive gain of the critic NN is selected as μ = 100, and the steady-state control gain is selected as K = 1200. Figures 3 and 4 show the trajectory tracking, and the convergence of the critic NN weights is shown in Figure 5, which demonstrates that the proposed adaptive optimal tracking controller ensures satisfactory tracking performance for an unknown non-affine nonlinear continuous system.
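For readers who wish to reproduce Example 5.1, the plant and reference trajectory amount to a few lines (using the dynamics (61) as reconstructed here; the reference derivative is also needed by the steady-state term (21)):

```python
import numpy as np

def f61(x, u):
    """Non-affine plant (61) of Example 5.1, as reconstructed in this text."""
    g = np.cos(2 * x[0]) + 2
    return np.array([
        -x[0] + x[1],
        -0.5 * x[0] - 0.5 * x[1] * (1 - g**2) + g * u[0] + np.sin(u[1]),
    ])

def reference(t):
    """Desired trajectory x1r = sin(t), x2r = cos(t) + sin(t), and its derivative."""
    xr = np.array([np.sin(t), np.cos(t) + np.sin(t)])
    xr_dot = np.array([np.cos(t), -np.sin(t) + np.cos(t)])
    return xr, xr_dot
```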

Figure 2. State identification error.


Figure 3. State tracking for x1.


Figure 4. State tracking for x2.


Figure 5. Convergence property of the critic NN weights.


Example

The classical 2-DOF single-track vehicle model, shown in Figure 6, is commonly used in AFS/DYC control design [Citation27]. The parameter notation is given in Table 1.

Figure 6. Single-track vehicle model.


Table 1. Description of vehicle model parameters.

The mathematical model of Figure 6, considering the uncertain parameters, is expressed as

(62) \(\dot{x} = (A + \Delta A)x + (B + \Delta B)u + E\delta_f, \qquad y = Cx\)

where \(x = [\beta\ \ \gamma]^T\), β is the side-slip angle and γ is the yaw rate; \(u = [\delta_c\ \ M_c]^T\), where \(\delta_c\) is the active steer angle, \(M_c\) is the corrective yaw moment, and \(\delta_f\) is the driver steer input. The nominal matrices are

\(A = \begin{bmatrix} -\dfrac{2(C_f + C_r)}{m v_x} & -1 - \dfrac{2(C_f l_f - C_r l_r)}{m v_x^2} \\ -\dfrac{2(C_f l_f - C_r l_r)}{I_z} & -\dfrac{2(C_f l_f^2 + C_r l_r^2)}{I_z v_x} \end{bmatrix}, \quad B = \begin{bmatrix} \dfrac{2C_f}{m v_x} & 0 \\ \dfrac{2C_f l_f}{I_z} & \dfrac{1}{I_z} \end{bmatrix}, \quad E = \begin{bmatrix} \dfrac{2C_f}{m v_x} \\ \dfrac{2C_f l_f}{I_z} \end{bmatrix}, \quad C = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\)

and the uncertainties are \(\Delta A = DFE_1\), \(\Delta B = DFE_2\), with \(F = \mathrm{diag}(\rho_f, \rho_r)\) and

\(D = \begin{bmatrix} -\dfrac{2C_f\Delta_f}{m v_x} & -\dfrac{2C_r\Delta_r}{m v_x} \\ -\dfrac{2C_f\Delta_f l_f}{I_z} & \dfrac{2C_r\Delta_r l_r}{I_z} \end{bmatrix}, \quad E_1 = \begin{bmatrix} 1 & \dfrac{l_f}{v_x} \\ 1 & -\dfrac{l_r}{v_x} \end{bmatrix}, \quad E_2 = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}\)

The main objective of vehicle stability control is to design a proper controller so that the actual vehicle yaw rate and side slip follow the desired responses. The reference model is usually selected as

(63) \(\dot{x}_r = A_r x_r + E_r\delta_f\)

where \(x_r = [\beta_r\ \ \gamma_r]^T\), \(A_r = \mathrm{diag}(-1/\tau_\beta,\ -1/\tau_r)\), and

\(E_r = \begin{bmatrix} \dfrac{1 - \dfrac{m l_f}{2(l_f + l_r) l_r C_r}v_x^2}{1 + \dfrac{m}{(l_f + l_r)^2}\left(\dfrac{l_r}{2C_f} - \dfrac{l_f}{2C_r}\right)v_x^2} \\ \dfrac{v_x}{(l_f + l_r)\left(1 + \dfrac{m}{(l_f + l_r)^2}\left(\dfrac{l_r}{2C_f} - \dfrac{l_f}{2C_r}\right)v_x^2\right)} \end{bmatrix}\)

where \(\tau_r\), \(\tau_\beta\) are the design time constants of the yaw rate and side-slip angle, respectively.

It is assumed that the variation and uncertainty of the tire cornering stiffness can be described as

(64) \(C_f = C_{f0}(1 + \Delta_f\rho_f),\ \|\rho_f\| \le 1; \qquad C_r = C_{r0}(1 + \Delta_r\rho_r),\ \|\rho_r\| \le 1\)

where \(C_{f0}\), \(C_{r0}\) and \(C_f\), \(C_r\) are the nominal and actual cornering stiffnesses of the front and rear tires, respectively, \(\Delta_f\), \(\Delta_r\) are the deviation magnitudes, and \(\rho_f\), \(\rho_r\) are perturbations.

Simulation parameters of the vehicle system are selected as m = 1704 kg, Cf = 63224 N/rad, Cr = 84680 N/rad, Iz = 3048 kg m², lf = 1.135 m and lr = 1.555 m. A 28-degree step-steer manoeuvre at an initial speed of 80 km/h is simulated to verify the proposed method. The time-varying parameters Cf and Cr are obtained from (64) by setting Δf, Δr to the constant 0.5 and ρf, ρr to band-limited white noise with amplitude ±0.01. As shown in Figures 7 and 8, the proposed method demonstrates strong robustness and self-adaptive performance, i.e. small tracking errors for the yaw rate and side-slip angle, when encountering time-varying cornering stiffness in the step-steer manoeuvre.
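Under the parameter values above, the nominal matrices of (62) can be assembled directly; the sketch below follows the standard 2-DOF single-track layout and our reconstruction of (62), so the signs should be checked against the original source:

```python
import numpy as np

# Vehicle parameters of Example 5.2
m, Iz = 1704.0, 3048.0             # mass [kg], yaw inertia [kg m^2]
lf, lr = 1.135, 1.555              # CG-to-axle distances [m]
Cf, Cr = 63224.0, 84680.0          # nominal cornering stiffnesses [N/rad]
vx = 80.0 / 3.6                    # longitudinal speed, 80 km/h -> [m/s]

# Nominal 2-DOF single-track model (62): x = [beta, gamma], u = [delta_c, Mc]
A = np.array([
    [-2 * (Cf + Cr) / (m * vx), -1 - 2 * (Cf * lf - Cr * lr) / (m * vx**2)],
    [-2 * (Cf * lf - Cr * lr) / Iz, -2 * (Cf * lf**2 + Cr * lr**2) / (Iz * vx)],
])
B = np.array([
    [2 * Cf / (m * vx), 0.0],
    [2 * Cf * lf / Iz, 1.0 / Iz],
])
E = np.array([2 * Cf / (m * vx), 2 * Cf * lf / Iz])   # driver steer input channel

def perturbed_stiffness(C0, delta, rho):
    """Time-varying cornering stiffness (64): C = C0 (1 + delta * rho), |rho| <= 1."""
    return C0 * (1.0 + delta * rho)
```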

Figure 7. Side-slip angle.


Figure 8. Yaw rate.


To show the identification performance of the proposed algorithm, the root-mean-square (RMS) performance index of the state errors is adopted for comparison:

(65) \(\mathrm{RMS} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} e^2(i)}\)

where n is the number of simulation steps and e(i) is the corresponding state error at the i-th step.
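The index (65) is computed per state over the simulation horizon, e.g.:

```python
import numpy as np

def rms(e):
    """Root-mean-square of a state-error sequence, Eq. (65)."""
    e = np.asarray(e)
    return np.sqrt(np.mean(e**2))
```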

The RMS values of the side slip angle and yaw rate are 0.915 × 10−4 and 3.173 × 10−4, respectively.

6. Conclusions

In this paper, we developed an adaptive optimal controller with a critic-identifier structure to solve the trajectory tracking problem for unknown non-affine nonlinear continuous-time systems. First, a model-free DNN identifier is designed to reconstruct the unknown dynamics. Then, based on the identified model, an adaptive optimal controller is presented, which realizes trajectory tracking and stabilizes the error dynamics optimally. In addition, a critic NN is introduced to approximate the optimal value function, and a novel robust tuning law is established to update the critic NN weights. The stability of the closed-loop system is proved by the Lyapunov approach. Simulation results for two examples verify the validity of the proposed approach.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by the National Natural Science Foundation of China (NSFC) under Grant 62073298, the Key Research and Development Projects of Henan Province in 2022 under Grant 221111240200, and the Special Application for Key Scientific and Technological Project of Henan Province under Grant JDG20220037.

References

  • Powell B. Approximate dynamic programming: solving the curses of dimensionality. New Jersey: Wiley-Blackwell; 2007.
  • Lebedev D, Margellos K, Goulart P. Convexity and feedback in approximate dynamic programming for delivery time slot pricing. IEEE Trans Control Syst Technol. 2022;30(2):893–900.
  • Zhang H, Liu D, Luo Y, et al. Adaptive dynamic programming for control: algorithms and stability. London: Springer; 2013.
  • Lewis FL, Liu D, Editors. Approximate dynamic programming and reinforcement learning for feedback control. Hoboken (NJ): Wiley; 2013.
  • Werbos P. Approximate dynamic programming for real-time control and neural modeling. In: Handbook of intelligent control: neural, fuzzy, and adaptive approaches. New York (NY): Van Nostrand Reinhold; 1992.
  • Miller W, Sutton R, Werbos P. Neural networks for control. Cambridge (MA): MIT Press; 1990.
  • Fairbank M, Alonso E, Prokhorov D. Simple and fast calculation of the second-order gradients for globalized dual heuristic dynamic programming in neural networks. IEEE Trans Neural Netw Learn Syst. 2012;23(10):1671–1676.
  • Zhu L, Modares H, Peen G, et al. Adaptive suboptimal output-feedback control for linear systems using integral reinforcement learning. IEEE Trans Control Syst Technol. 2015;23(1):264–273.
  • Wei Q, Liu D, Shi G. A novel dual iterative q-learning method for optimal battery management in smart residential environments. IEEE Trans Ind Electron. 2015;62(4):2509–2518.
  • Bhasin S, Kamalapurkar R, Johnson M, et al. A novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear systems. Automatica (Oxf). 2013;49(1):82–92.
  • Vrabie D, Pastravanu O, Abu-Khalaf M, et al. Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica (Oxf). 2009;45:477–484.
  • Vamvoudakis K, Lewis F. Online actor critic algorithm to solve the continuous-time infinite horizon optimal control problem. Proc Int Joint Conf Neural Netw. 2009;46:3180–3187.
  • Modares H, Lewis F, Naghibi-Sistani M. Adaptive optimal control of unknown constrained-input systems using policy iteration and neural networks. IEEE Trans Neural Netw Learn Syst. 2013;24(10):1513–1525.
  • Na J, Lv Y, Wu X, et al. Approximate optimal tracking control for continuous-time unknown nonlinear systems. Proceedings of the 33rd Chinese Control Conference; 2014; Nanjing, China. p. 8990–8995.
  • Lv Y, Na J, Yang Q, et al. Online adaptive optimal control for continuous-time nonlinear systems with completely unknown dynamics. Int J Control. 2016;89(1):99–112.
  • Zhang H, Cui L, Luo Y. Near-optimal control for nonzero-sum differential games of continuous-time nonlinear systems using single-network ADP. IEEE Trans Cybern. 2013;43(1):2168–2267.
  • Wei Q, Liu D. Adaptive dynamic programming for optimal tracking control of unknown nonlinear systems with application to coal gasification. IEEE Trans Autom Sci Eng. 2014;11(4):1020–1036.
  • Modares H, Lewis F. Linear quadratic tracking control of partially-unknown continuous-time systems using reinforcement learning. IEEE Trans Autom Control. 2014;59(11):3051–3056.
  • Kamalapurkar R, Dinhb H, Bhasin S, et al. Approximate optimal trajectory tracking for continuous-time nonlinear systems. Automatica (Oxf). 2015;51:40–48.
  • Lv Y, Ren X, Na J. Adaptive optimal tracking controls of unknown multi-input systems based on nonzero-sum game theory. J Franklin Inst. 2019;22(12):2226–2236.
  • Zhang X, Zhang H, Sun Q, et al. Adaptive dynamic programming-based optimal control of unknown nonaffine nonlinear discrete-time systems with proof of convergence. Neurocomputing. 2012;91:48–55.
  • Wang H, Tian Y. Non-affine nonlinear systems adaptive optimal trajectory tracking controller design and application. Stud Inf Control. 2015;24(1):5–11.
  • Li X, Yu W. Dynamic system identification via recurrent multilayer perceptrons. Inf Sci (Ny). 2002;147:45–63.
  • Poznyak A, Yu W, Sanchez E, et al. Nonlinear adaptive trajectory tracking using dynamic neural networks. IEEE Trans Neural Netw. 1999;10(6):1402–1411.
  • Abu-Khalaf M, Lewis F. Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach. Automatica (Oxf). 2005;41(5):779–791.
  • Na J, Yang J, Wu X, et al. Robust adaptive parameter estimation of sinusoidal signals. Automatica (Oxf). 2015;53:376–384.
  • Yang X, Wang Z, Peng W. Coordinated control of AFS and DYC for vehicle handling and stability based on optimal guaranteed cost theory. Veh Syst Dyn. 2009;47(1):57–79.