
Output Feedback Controller for a Class of Unknown Nonlinear Discrete Time Systems Using Fuzzy Rules Emulated Networks and Reinforcement Learning

C. Treesatayapun

Abstract

A model-free adaptive control for non-affine discrete time systems is developed by utilising output feedback and action-critic networks. A fuzzy rules emulated network (FREN) is employed as the action network and its multi-input version (MiFREN) is implemented as the critic network. Both networks are constructed using human knowledge in the form of IF–THEN rules according to the controlled plant, and the learning laws are established by reinforcement learning without any off-line learning phase. The theoretical derivation of the convergence of the tracking error and internal signals is provided. A numerical simulation and an experimental system are given to validate the proposed scheme.

1. Introduction

Due to the complexity of modern controlled plants, it is commonly difficult or impossible to establish their mathematical models, especially for discrete time systems [1]. By utilising only the input–output data of the controlled plant, model-free approaches have been developed [2, 3]. On the other hand, the performance of these controllers depends on the quality and quantity of the data [4]. For some engineering applications, it is very difficult to access all state variables, thus output feedback is still the preferable scheme [5, 6]. Furthermore, closed-loop analysis and stability approaches have been proposed [7, 8, 9] to guarantee the performance of controllers. From an engineering point of view, stability analysis, besides closed-loop performance, is only a basic minimum requirement, even for artificial intelligence controllers [10]. Therefore, optimal controllers are more desirable for modern applications [11] or from a nature-inspired point of view [12].

To ensure the closed-loop performance with optimisation of a predefined cost function, schemes based on adaptive dynamic programming have been utilised, but mathematical models have been required for their iterative learning [13, 14]. Within the model-free setting, reinforcement learning (RL) algorithms have been developed to solve optimal control [15, 16] with an estimated solution of the Hamilton–Jacobi–Bellman equation [17, 18]. To mimic the RL process, approaches based on action-critic networks have been derived with artificial neural networks (ANNs), considering the controlled plant as a black box [19, 20]. Nevertheless, even when the mathematical model is unknown, the engineer still has basic human knowledge of the controlled plant, such as 'IF a higher output is required, THEN more control effort should be supplied'. Thus, the controlled plant can be considered as a grey box.

To integrate human knowledge in IF–THEN format into the controller, fuzzy logic systems (FLSs) have been utilised in control applications [21], including optimal control problems [22]. By adding learning ability to FLSs, integrations between FLS and ANN have been developed, such as the fuzzy neural network (FNN) [23] and the fuzzy rules emulated network (FREN) [24, 25]. Thereafter, approaches using FNN and FREN for solving the optimal control problem with RL have been proposed [26, 27], where the controlled plants have been considered as a class of affine systems. On the other hand, the problem of non-affine systems has been studied in Ref. [28] with the approach of critic-action networks, where state feedback has been utilised to gain enough information to tune the ANNs.

In this work, an output feedback model-free controller is proposed for the case in which the control effort is non-affine with respect to the system dynamics. The controller is designed by the action network called FRENa with a set of IF–THEN rules according to the controlled plant. Thereafter, the long-term cost function is estimated by the multi-input version of FREN called MiFRENc, whose IF–THEN rules are established from the general aim of minimising both the tracking error and the control energy. The learning laws are derived with the RL approach to tune all adjustable parameters of FRENa and MiFRENc, aiming to minimise the tracking error and the estimated cost function. Furthermore, the closed-loop analysis is provided by the Lyapunov method to demonstrate the convergence of the tracking error and internal signals.

This paper is organised as follows. Section 2 introduces a class of systems under our investigation and problem formulation. The proposed scheme is introduced in Section 3 including the network architectures with IF–THEN rules of FRENa and MiFRENc and their formulations. The learning laws and closed-loop analysis are derived in Section 4. Section 5 provides the results of the simulation and experimental system.

2. Controlled Plant as a Class of Nonlinear Discrete-Time Systems

In this work, the controlled plant for a class of non-affine discrete time systems is considered as (1) $y(k+1)=f\big(y(k),\ldots,y(k-n_y),u(k),\ldots,u(k-n_u)\big)+d(k)$, where $y(k+1)\in\mathbb{R}$ is the plant's output with respect to the control effort $u(k)\in\mathbb{R}$, $f(\cdot)$ is an unknown nonlinear function, $n_u$ and $n_y$ are unknown system orders and $d(k)$ denotes a bounded disturbance such that $|d(k)|\leq d_M$. For further analysis, the following assumption is expressed according to the unknown nonlinear function $f(\cdot)$ with respect to the control effort $u(k)$.

Assumption 2.1

The derivative of $y(k+1)$ with respect to $u(k)$ exists and is bounded such that (2) $0<g_m\leq\dfrac{\partial y(k+1)}{\partial u(k)}\leq g_M$, where $g_m$ and $g_M$ are positive constants.

Remark 2.2

The condition in (2) indicates that the controlled plant in (1) has a positive control direction. This will assist in setting the IF–THEN rules that relate the change of control effort $\Delta u(k)$ to the change of output $\Delta y(k+1)$.

Referring to condition (2), it is clear that the change of output $\Delta y(k+1)$ with respect to the change of control effort $\Delta u(k)$ can be rewritten as (3) $g_{md}\leq\dfrac{\Delta y(k+1)}{\Delta u(k)}\leq g_{Md}$, where $\Delta u(k)\neq0$ and $g_{md}$ and $g_{Md}$ are constants according to $g_m$ and $g_M$, respectively. This leads to the setting of IF–THEN rules such that

IF Δu(k) is positive-large, THEN Δy(k+1) should be positive-large

or

IF Δu(k) is negative-small, THEN Δy(k+1) should be negative-small.

By utilising those IF–THEN rules, the adaptive controller based on FRENs will be established in the next section.
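
Because only input–output data are available, condition (3) can also be checked empirically before the IF–THEN rules are fixed. The following Python sketch is purely illustrative: the logged data and the bounds g_md, g_Md are assumptions, not values from this paper. It estimates the ratio Δy(k+1)/Δu(k) from samples and verifies the positive control direction.

```python
import numpy as np

# Hypothetical logged input-output data (assumed for illustration only):
# u[i] is the control effort u(k) and y[i] the resulting output y(k+1).
u = np.array([0.0, 0.2, 0.5, 0.9, 1.4, 1.1, 0.7])
y = np.array([0.0, 0.6, 1.5, 2.6, 4.1, 3.2, 2.0])

g_md, g_Md = 0.5, 5.0            # assumed bounds corresponding to condition (3)

du = np.diff(u)                  # Delta u(k)
dy = np.diff(y)                  # Delta y(k+1)
ratio = dy[du != 0] / du[du != 0]

print("Delta y / Delta u samples:", ratio)
print("positive control direction:", bool(np.all(ratio > 0)))
print("within assumed bounds:", bool(np.all((ratio >= g_md) & (ratio <= g_Md))))
```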

3. RL Controller

The proposed controller is illustrated by the block diagram in Figure 1. In this work, the plant is selected as a DC motor current control. Only the armature current is measured as the output $y(k+1)$ (mA), while the control effort $u(k)$ (V) is the voltage fed to the driver unit. Thus, the IF–THEN rules mentioned in Section 2 can be restated according to the physical nature of the plant such that

Figure 1. Closed-loop system architecture.

IF we apply a positive-large change of control voltage [Δu(k)], THEN we should have a positive-large change of armature current [Δy(k+1)].

According to this knowledge, the action network (FRENa) is first established to generate the control effort $u(k)$, where its input is the tracking error $e(k)$ defined as (4) $e(k)=r(k)-y(k)$, with $r(k)$ the desired trajectory. Second, the critic network is designed using MiFRENc to produce the estimated long-term cost function $\hat{L}(k)$ for the controller FRENa. The details of the two networks and their IF–THEN rules are given as follows.

3.1. Controller or Action Network

To utilise the action network, the IF–THEN rules relating the tracking error $e(k)$ to the control effort $u(k)$ are first established. Consider the basic knowledge that a positive-large $e(k)$ means the output $y(k)$ is lacking by a positive-large amount; to compensate, the control effort $u(k)$ clearly needs to be positive-large. In conclusion, we have: IF $e(k)$ is positive-large, THEN $u(k)$ should be positive-large. With seven linguistic levels, this leads to the design of the IF–THEN rules as

IF $e(k)$ is NL, THEN $u(k)$ should be NL,
IF $e(k)$ is NM, THEN $u(k)$ should be NM,
IF $e(k)$ is NS, THEN $u(k)$ should be NS,
IF $e(k)$ is Z, THEN $u(k)$ should be Z,
IF $e(k)$ is PS, THEN $u(k)$ should be PS,
IF $e(k)$ is PM, THEN $u(k)$ should be PM,
IF $e(k)$ is PL, THEN $u(k)$ should be PL,

where the notations of the linguistic variables N, P, L, M, S and Z denote negative, positive, large, medium, small and zero, respectively.

Employing this set of IF–THEN rules, the network architecture of FRENa is illustrated in Figure 2. According to the network architecture in Figure 2 and the function formulation of FREN in Ref. [24], the control effort $u(k)$ is determined by (5) $u(k)=\beta_a^T(k)\phi_a(k)$, where (6) $\phi_a(k)=[\mu_{NL}(e_k)\;\;\mu_{NM}(e_k)\;\;\cdots\;\;\mu_{PL}(e_k)]^T$ and (7) $\beta_a(k)=[\beta_{aNL}(k)\;\;\beta_{aNM}(k)\;\;\cdots\;\;\beta_{aPL}(k)]^T$.
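
A minimal sketch of the FRENa computation in (5)–(7) is given below. The Gaussian membership shapes, their centres and the weights are assumptions for illustration only; the settings actually used in this paper are those of Figure 4 and Table 2.

```python
import numpy as np

# Seven linguistic levels of e(k): NL, NM, NS, Z, PS, PM, PL.
# Gaussian membership functions with assumed centres/width (not the Figure 4 setting).
CENTERS = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
WIDTH = 1.0

def phi_a(e):
    """Membership vector phi_a(k) of FRENa, cf. (6)."""
    return np.exp(-((e - CENTERS) ** 2) / (2.0 * WIDTH ** 2))

def fren_a(e, beta_a):
    """Control effort u(k) = beta_a(k)^T phi_a(k), cf. (5)."""
    return float(beta_a @ phi_a(e))

# Assumed initial weights with the same sign ordering as the IF-THEN rules above.
beta_a = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
print(fren_a(0.5, beta_a))
```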

Figure 2. Action network or controller based on FREN.

Let us consider FRENa as a function estimator of the unknown control effort; thus there exists an ideal control effort $u^*(k)$ with the ideal parameter $\beta_a^*$ such that (8) $u^*(k)=\beta_a^{*T}\phi_a(k)+\varepsilon_a(k)$, where $\varepsilon_a(k)$ is a bounded residual error, $|\varepsilon_a(k)|\leq\varepsilon_{aM}$.

By using the dynamics (1) with the control laws (5) and (8), the tracking error $e(k+1)$ is rearranged as (9) $e(k+1)=r(k+1)-y(k+1)=f(u_k^*)-f(u_k)-d(k)$. Recalling Assumption 2.1 and using the mean value theorem, the error dynamics (9) can be rewritten as (10) $e(k+1)=\dfrac{\partial f(x)}{\partial x}\Big|_{x=u_m(k)}[u^*(k)-u(k)]-d(k)=g(k)[u^*(k)-u(k)]-d(k)$, where (11) $g(k)=\dfrac{\partial f(u_m(k))}{\partial u_m(k)}$, and $u_m(k)\in[\min\{u_k,u_k^*\},\max\{u_k,u_k^*\}]$. Employing the control laws (8) and (5), it yields (12) $e(k+1)=g(k)[\beta_a^*-\beta_a(k)]^T\phi_a(k)+g(k)\varepsilon_a(k)-d(k)$. Let us define $\tilde{\beta}_a(k)=\beta_a^*-\beta_a(k)$, $d_a(k)=g(k)\varepsilon_a(k)-d(k)$ and (13) $\Lambda_a(k)=\tilde{\beta}_a^T(k)\phi_a(k)$, and we obtain (14) $e(k+1)=g(k)\Lambda_a(k)+d_a(k)$. It is worth noting that the tracking error obtained by (14) is a function of $\tilde{\beta}_a(k)$ and the unknown but bounded $d_a(k)$ with $|d_a(k)|\leq d_{aM}$. This relation will be used for the performance analysis afterwards.

3.2. Estimated Cost Function or Critic Network

In this work, the long-term cost function $L(k)$ is employed as an infinite-horizon sum of the tracking error $e(k)$ and the control effort $u(k)$ with the discount factor $\gamma_L$ as (15) $L(k)=\sum_{i=k}^{\infty}\gamma_L^{\,i-k}l(i)$, where (16) $l(k)=pe^2(k)+qu^2(k)$, with $p$ and $q$ positive constants and $0<\gamma_L\leq1$.
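
For reference, the sketch below evaluates the instantaneous cost (16) and a truncated-horizon approximation of the long-term cost (15); the weights p, q, the discount factor and the sample sequences are assumed values, and the finite sum only approximates the infinite horizon.

```python
# Instantaneous cost (16) and a truncated-horizon approximation of the long-term cost (15).
# p, q, gamma_L and the error/effort sequences below are assumed values.
p, q, gamma_L = 1.0, 0.1, 0.95

def l(e, u):
    """Instantaneous cost l(k) = p*e(k)^2 + q*u(k)^2, cf. (16)."""
    return p * e ** 2 + q * u ** 2

def long_term_cost(errors, efforts):
    """Finite-horizon approximation of L(k) = sum_{i>=k} gamma_L^(i-k) * l(i), cf. (15)."""
    return sum(gamma_L ** i * l(e, u) for i, (e, u) in enumerate(zip(errors, efforts)))

print(long_term_cost([1.0, 0.5, 0.2, 0.1], [0.8, 0.6, 0.3, 0.1]))
```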

$L(k)$ in (15) is a function of two input arguments through the quadratic function ($f_x=x^2$) of $e(k)$ and $u(k)$. Thus, an adaptive network MiFRENc is utilised to estimate $L(k)$, as shown in the block diagram in Figure 3. In order to design MiFRENc, the IF–THEN rules are first established in Table 1. Thereafter, the network architecture of MiFRENc is illustrated in Figure 3. By utilising the network in Figure 3 and the results in Ref. [24], the estimated cost function $\hat{L}(k)$ is determined by (17) $\hat{L}(k)=\beta_c^T(k)\phi_c(k)$, where (18) $\beta_c(k)=[\beta_{ZZ}(k)\;\;\beta_{ZS}(k)\;\;\cdots\;\;\beta_{LL}(k)]^T$ and (19) $\phi_c(k)=[\phi_1(k)\;\;\phi_2(k)\;\;\cdots\;\;\phi_9(k)]^T$.

Figure 3. Estimated cost function or critic network.

Table 1. MiFRENc: IF–THEN rules.

Using the universal approximation property of MiFREN [24], there exists an ideal parameter $\beta_c^*$ such that (20) $L(k)=\beta_c^{*T}\phi_c(k)+\varepsilon_c(k)$, where $\varepsilon_c(k)$ is a bounded residual error such that $|\varepsilon_c(k)|\leq\varepsilon_{cM}$. Adding and subtracting $\beta_c^{*T}\phi_c(k)$ on the right-hand side of (17) yields (21) $\hat{L}(k)=\tilde{\beta}_c^T(k)\phi_c(k)+\beta_c^{*T}\phi_c(k)=\Lambda_c(k)+\beta_c^{*T}\phi_c(k)$, where $\tilde{\beta}_c(k)=\beta_c(k)-\beta_c^*$ and $\Lambda_c(k)=\tilde{\beta}_c^T(k)\phi_c(k)$.
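
A minimal sketch of the MiFRENc computation in (17)–(19) follows. It assumes, as suggested by the weight indices in (18), that the nine rule strengths are products of three membership grades (Z, S, L) of |e(k)| and |u(k)|; the membership shapes, centres and weights are illustrative assumptions rather than the Table 1 and Figure 5 settings.

```python
import numpy as np

# Two inputs e(k), u(k) with three membership levels each (Z, S, L) give the nine rule
# strengths phi_1..phi_9 of (19).  The Gaussian shapes, centres and the product t-norm
# are assumptions for illustration; the exact Table 1 / Figure 5 settings differ.
LEVEL_CENTERS = {"Z": 0.0, "S": 1.0, "L": 2.0}
WIDTH = 0.8

def memberships(x):
    """Membership grades of |x| for the levels Z, S and L (assumed Gaussian)."""
    return {name: np.exp(-((abs(x) - c) ** 2) / (2 * WIDTH ** 2))
            for name, c in LEVEL_CENTERS.items()}

def phi_c(e, u):
    """Rule-strength vector phi_c(k), ordered ZZ, ZS, ..., LL as in (18)."""
    me, mu = memberships(e), memberships(u)
    return np.array([me[a] * mu[b] for a in "ZSL" for b in "ZSL"])

def mifren_c(e, u, beta_c):
    """Estimated long-term cost L_hat(k) = beta_c(k)^T phi_c(k), cf. (17)."""
    return float(beta_c @ phi_c(e, u))

beta_c = np.linspace(0.0, 4.0, 9)   # assumed weights: larger cost for larger |e|, |u|
print(mifren_c(0.5, 0.3, beta_c))
```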

In order to improve the performance of FRENa and MiFRENc, the learning laws will be developed and explained in the next section.

4. Learning Algorithms and Performance Analysis

4.1. Action Network Learning Law

Considering the tracking error within $\Lambda_a(k)$ as in (14) and the estimated cost function $\hat{L}(k)$, the error function of the action network is given in this work as (22) $e_a(k)=g(k)\Lambda_a(k)+\dfrac{1}{g(k)}\hat{L}(k)$. Thereafter, the cost function to be minimised is utilised as (23) $E_a(k)=\frac{1}{2}e_a^2(k)$. Applying gradient descent, the tuning law for $\beta_a$ is derived as (24) $\beta_a(k+1)=\beta_a(k)-\eta_a\dfrac{\partial E_a(k)}{\partial\beta_a(k)}$, where $\eta_a$ is the learning rate. By using the chain rule and (13), it yields (25) $\dfrac{\partial E_a(k)}{\partial\beta_a(k)}=\dfrac{\partial E_a(k)}{\partial e_a(k)}\dfrac{\partial e_a(k)}{\partial\Lambda_a(k)}\dfrac{\partial\Lambda_a(k)}{\partial\beta_a(k)}=-e_a(k)g(k)\phi_a(k)$. Recalling (24) with (25) and using $e_a(k)$ in (22), it leads to (26) $\beta_a(k+1)=\beta_a(k)+\eta_a e_a(k)g(k)\phi_a(k)=\beta_a(k)+\eta_a\Big[g(k)\Lambda_a(k)+\dfrac{1}{g(k)}\hat{L}(k)\Big]g(k)\phi_a(k)=\beta_a(k)+\eta_a\big[g(k)\Lambda_a(k)+\hat{L}(k)\big]\phi_a(k)$. By eliminating $d_a(k)$ in (14), the learning law (26) is rewritten as (27) $\beta_a(k+1)=\beta_a(k)+\eta_a\big[e(k+1)+\hat{L}(k)\big]\phi_a(k)$. The final learning law of FRENa given by (27) is a practical one because all quantities required on the right-hand side are available at the time index $k+1$.
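
The practical learning law (27) reduces to a single vector update per sampling step, as in the following sketch; the numerical values are assumptions used purely for illustration.

```python
import numpy as np

def update_action_weights(beta_a, phi_a_k, e_next, L_hat_k, eta_a=0.01):
    """One step of the FRENa learning law (27):
    beta_a(k+1) = beta_a(k) + eta_a * [e(k+1) + L_hat(k)] * phi_a(k)."""
    return beta_a + eta_a * (e_next + L_hat_k) * phi_a_k

# Assumed values for illustration: 7 weights and the membership vector of the current step.
beta_a = np.zeros(7)
phi_a_k = np.array([0.0, 0.0, 0.1, 0.8, 0.1, 0.0, 0.0])
print(update_action_weights(beta_a, phi_a_k, e_next=0.2, L_hat_k=0.05))
```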

4.2. Critic Network Learning Law

In general, the error function of a critic network is constructed from the estimated cost function $\hat{L}(k)$. Therefore, in this work, the error function $e_c(k)$ is given as (28) $e_c(k)=\delta\hat{L}(k)-\hat{L}(k-1)+l(k)$, where $\delta$ is a positive constant. In order to tune $\beta_c$, the cost function $E_c(k)$ is defined as (29) $E_c(k)=\frac{1}{2}e_c^2(k)$. Applying gradient descent to (29) with respect to $\beta_c(k)$, we have (30) $\beta_c(k+1)=\beta_c(k)-\eta_c\dfrac{\partial E_c(k)}{\partial\beta_c(k)}$, where $\eta_c$ is the learning rate. Using the chain rule along $E_c(k)$ in (29), $e_c(k)$ in (28) and $\hat{L}(k)$ in (17), it yields (31) $\dfrac{\partial E_c(k)}{\partial\beta_c(k)}=\dfrac{\partial E_c(k)}{\partial e_c(k)}\dfrac{\partial e_c(k)}{\partial\hat{L}(k)}\dfrac{\partial\hat{L}(k)}{\partial\beta_c(k)}=e_c(k)\,\delta\,\phi_c(k)$. Rewriting (30) with (31), it leads to (32) $\beta_c(k+1)=\beta_c(k)-\eta_c e_c(k)\delta\phi_c(k)=\beta_c(k)-\eta_c\delta\big[l(k)-\hat{L}(k-1)+\delta\hat{L}(k)\big]\phi_c(k)$. Finally, we have a practical tuning law for MiFRENc.
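
Similarly, the critic tuning law (32) can be coded as one update per step, as sketched below with assumed values.

```python
import numpy as np

def update_critic_weights(beta_c, phi_c_k, l_k, L_hat_k, L_hat_prev,
                          eta_c=0.5, delta=0.75):
    """One step of the MiFRENc learning law (32):
    beta_c(k+1) = beta_c(k) - eta_c*delta*[l(k) - L_hat(k-1) + delta*L_hat(k)] * phi_c(k)."""
    e_c = delta * L_hat_k - L_hat_prev + l_k      # error function (28)
    return beta_c - eta_c * delta * e_c * phi_c_k

# Assumed values for illustration: nine critic weights and rule strengths.
beta_c = np.ones(9)
phi_c_k = np.full(9, 1.0 / 9.0)
print(update_critic_weights(beta_c, phi_c_k, l_k=0.3, L_hat_k=1.1, L_hat_prev=1.2))
```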

4.3. Closed-Loop Analysis

In the following theorem, the closed-loop performance of the output feedback controller is demonstrated, showing that the tracking error and internal signals are bounded.

Theorem 4.1

For the non-affine discrete time system described in Section 2, the performance of the closed-loop system configured by the structure of FRENa and MiFRENc in Section 3 is guaranteed in the sense of a bounded tracking error and bounded internal signals when the design parameters are selected as follows: (33) $\frac{1}{2}<\delta\leq1$, (34) $0<\eta_a\leq\dfrac{g_m}{\nu_a g_M^2}$ and (35) $0<\eta_c\leq\dfrac{1}{\nu_c\delta^2}$, where $\nu_a$ and $\nu_c$ are the upper limits of $\|\phi_a(k)\|^2$ and $\|\phi_c(k)\|^2$, respectively.

Proof: The proof is given in the Appendix.

The validation of the proposed control scheme will be presented in the next section for a computer simulation with a non-affine discrete time system and a hardware implementation with the DC motor current control plant.

5. Simulation and Experimental Systems

5.1. Simulation System and Results

The controller developed in this work is first implemented on the nonlinear discrete time system given as (36) $y(k+1)=\sin(y_k)+[5+\cos(y_k u_k)]u(k)$. It is worth mentioning that the mathematical model in (36) is used only to build the simulation. In this test, the desired trajectory is given as (37) $r(k+1)=A_r\sin\Big(\dfrac{\omega_r\pi k}{k_M}\Big)$, where $k_M=500$ is the maximum time index, $A_r=1.0$ and $\omega_r=8$. To satisfy (33), $\delta$ is selected as $\delta=0.75$, and $\nu_a=\nu_c=1.5$. By using this setting and (35), the learning rate of MiFRENc is determined by (38) $0<\eta_c\leq\dfrac{1}{\delta^2\nu_c^2}=\dfrac{1}{0.75^2\cdot1.5^2}=0.7901$. In this case, the learning rate for MiFRENc is selected as $\eta_c=0.5$. To select the learning rate of FRENa, let us choose $g_m$ and $g_M$ as 1 and 6, respectively. By using (34), the learning rate of FRENa is determined by (39) $0<\eta_a\leq\dfrac{g_m}{\nu_a^2 g_M^2}=\dfrac{1}{1.5^2\cdot6^2}=0.0123$. Thus, the learning rate for FRENa is selected as $\eta_a=0.01$.
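
The learning-rate bounds (38) and (39) follow from simple arithmetic, reproduced in the short check below with the values quoted above.

```python
# Learning-rate bounds of the simulation case, reproducing (38) and (39).
delta, nu_a, nu_c = 0.75, 1.5, 1.5
g_m, g_M = 1.0, 6.0

eta_c_max = 1.0 / (delta ** 2 * nu_c ** 2)   # (38): bound for the critic learning rate
eta_a_max = g_m / (nu_a ** 2 * g_M ** 2)     # (39): bound for the action learning rate

print(f"eta_c <= {eta_c_max:.4f}  (selected: 0.5)")
print(f"eta_a <= {eta_a_max:.4f}  (selected: 0.01)")
```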

Figures 4 and 5 illustrate the settings of the membership functions for FRENa and MiFRENc, respectively. The initial setting of the adjustable parameters $\beta(1)$ for FRENa and MiFRENc is given in Table 2.

Figure 4. FRENa membership functions: simulation case.

Figure 5. MiFRENc membership functions: simulation case.

Table 2. Initial setting β(1): simulation system.

Figure 6 displays the tracking performance with plots of both $y(k)$ and $e(k)$, and Figure 7 presents the control effort $u(k)$. The estimated cost function $\hat{L}(k)$ is illustrated in Figure 8. The phase plane trajectory of $u(k)$ and $e(k)$ is depicted in Figure 9 to demonstrate the closed-loop system's behaviour.

Figure 6. Tracking performance y(k) and e(k): simulation system.

Figure 7. Control effort u(k): simulation system.

Figure 8. Estimated cost function $\hat{L}(k)$: simulation system.

Figure 9. u(k) and e(k): simulation system.

5.2. Experimental System and Results

The experimental system is constructed as a DC motor current control. The output $y(k+1)$ is the armature current (mA) and the input $u(k)$ is the control voltage applied to the driver circuit depicted in Figure 1. As in the simulation system, let us select $\delta=0.75$, $\nu_a=\nu_c=1.5$, $g_m=5$ and $g_M=10$. Thus, the learning rate of FRENa is designed as (40) $0<\eta_a\leq\dfrac{g_m}{\nu_a^2 g_M^2}=\dfrac{5}{1.5^2\cdot10^2}=0.0222$. In this case, we select $\eta_a=0.01$. For MiFRENc, we use the same learning rate as in the simulation system, $\eta_c=0.5$, because the network architecture is the same. The desired trajectory is given as (41) $r(k+1)=I_r\sin\Big(\dfrac{\omega_r\pi k}{k_M}\Big)$, where (42) $I_r=\begin{cases}15\ \mathrm{mA} & \text{if } 0\leq k<k_M/2,\\ 30\ \mathrm{mA} & \text{otherwise},\end{cases}$ (43) $\omega_r=\begin{cases}8 & \text{if } 0\leq k<k_M/2,\\ 4 & \text{otherwise},\end{cases}$ and $k_M=2000$.
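
The experimental learning-rate bound (40) and the piecewise reference (41)–(43) can be reproduced by the short sketch below; the printed sample indices are arbitrary choices for illustration.

```python
import numpy as np

# Piecewise reference trajectory of the experiment, cf. (41)-(43), with k_M = 2000,
# and the learning-rate bound (40).
k_M = 2000

def reference(k):
    I_r = 15.0 if k < k_M / 2 else 30.0    # amplitude in mA, cf. (42)
    w_r = 8.0 if k < k_M / 2 else 4.0      # frequency factor, cf. (43)
    return I_r * np.sin(w_r * np.pi * k / k_M)

nu_a, g_m, g_M = 1.5, 5.0, 10.0
print("eta_a <=", g_m / (nu_a ** 2 * g_M ** 2))                 # (40): about 0.0222
print("r(k) samples:", [round(reference(k), 2) for k in (0, 125, 1250, 1875)])
```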

Figures 10 and 11 represent the settings of the membership functions of FRENa and MiFRENc, respectively. All adjustable parameters $\beta(1)$ for FRENa and MiFRENc are initialised as the setting in Table 3.

Figure 10. FRENa membership functions: experimental system.

Figure 11. MiFRENc membership functions: experimental system.

Table 3. Initial setting β(1): experimental system.

Figure 12 displays the motor current $y(k)$ and the tracking error $e(k)$ to demonstrate the performance of the closed-loop system. The maximum absolute value of the tracking error is $|e(k)|_{\max}=48.2936$ mA and the average absolute value of the tracking error at steady state ($k=1500$–$2000$) is 0.4924 mA. Figure 13 shows the control effort $u(k)$. The estimated cost function $\hat{L}(k)$ is illustrated in Figure 14. The phase plane trajectory of $u(k)$ and $e(k)$ is plotted in Figure 15; a large variation is detected because of the back-EMF. In order to evaluate the proposed scheme under the back-EMF condition, a pulse-train trajectory is implemented, with the response displayed in Figure 16. It is clear that the effect of the back-EMF is eliminated within the second pulse (B).

Figure 12. Tracking performance y(k) and e(k): experimental system.

Figure 13. Control effort u(k): experimental system.

Figure 14. Estimated cost function $\hat{L}(k)$: experimental system.

Figure 15. u(k) and e(k): experimental system.

Figure 16. Pulse response: experimental system.

6. Conclusions

A model-free adaptive control for a class of non-affine discrete time systems has been developed by RL. The closed-loop system has been established by output feedback with two adaptive networks, FRENa and MiFRENc. The initial settings of FRENa and MiFRENc have been obtained from human knowledge of the controlled plant in the form of IF–THEN rules. The performance has been enhanced by the learning laws for both FRENa and MiFRENc, while the convergence of the tracking error and internal signals to reasonable compact sets has been guaranteed. Numerical simulation and experimental results have been presented to verify the theoretical analysis.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work has been supported by Fundamental Research Funds for CINVESTAV-IPN and Mexican Research Organization CONACyT [grant number 257253].

Notes on contributors

C. Treesatayapun

C. Treesatayapun received the Ph.D. in electrical engineering from Chiang-Mai University, Thailand, in 2004. He was a production engineer at SAGA Electronics (JRC-NJR) from 1998 to 2000 and was head of the electrical engineering program at North Chiang-Mai University, Thailand, from 2001 to 2007. He is currently a senior researcher at the Department of Robotic and Advanced Manufacturing, Mexican Research Center and Advanced Technology, CINVESTAV-IPN, Saltillo campus, Mexico. His current research interests include automation and robotic system control and optimization, adaptive and learning algorithms, and electric machine drives.

References

  • Hou ZS, Wang Z. From model-based control to data-driven control: survey, classification and perspective. Inf Sci. 2013;235:3–35.
  • Zhu Y, Hou ZS. Data-driven MFAC for a class of discrete-time nonlinear systems with RBFNN. IEEE Trans Neural Netw Learn Syst. 2014;25(5):1013–1020.
  • Wang X, Li X, Wang J, et al. Data-driven model-free adaptive sliding mode control for the multi degree-of-freedom robotic exoskeleton. Inf Sci. 2016;327:246–257.
  • Lin N, Chi R, Huang B. Data-driven recursive least squares methods for non-affined nonlinear discrete-time systems. Appl Math Modell. 2020;81:787–798.
  • Kaldmae A, Kotta U. Input–output linearization of discrete-time systems by dynamic output feedback. Eur J Control. 2014;20:73–78.
  • Treesatayapun C. Data input–output adaptive controller based on IF–THEN rules for a class of non-affine discrete-time systems: the robotic plant. J Intell Fuzzy Syst. 2015;28:661–668.
  • Liu YJ, Tong S. Adaptive NN tracking control of uncertain nonlinear discrete-time systems with nonaffine dead-zone input. IEEE Trans Cybern. 2015;45(3):497–505.
  • Zhang CL, Li JM. Adaptive iterative learning control of non-uniform trajectory tracking for strict feedback nonlinear time-varying systems with unknown control direction. Appl Math Model. 2015;39:2942–2950.
  • Precup RE, Radac MB, Roman RC, et al. Model-free sliding mode control of nonlinear systems: algorithms and experiments. Inf Sci. 2017;381:176–192.
  • Raj R, Mohan BM. Stability analysis of general Takagi–Sugeno fuzzy two-term controllers. Fuzzy Inf Eng. 2018;10(2):196–212.
  • Zhang X, Zhang HG, Sun QY, et al. Adaptive dynamic programming-based optimal control of unknown nonaffine nonlinear discrete-time systems with proof of convergence. Neurocomputing. 2012;35:48–55.
  • Eftekhari M, Zeinalkhani M. Extracting interpretable fuzzy models for nonlinear systems using gradient-based continuous ant colony optimization. Fuzzy Inf Eng. 2013;5(3):255–277.
  • Liu D, Wang D, Yang X. An iterative adaptive dynamic programming algorithm for optimal control of unknown discrete-time nonlinear systems with constrained inputs. Inf Sci. 2013;220(20):331–342.
  • Jiang H, Zhang H. Iterative ADP learning algorithms for discrete-time multi-player games. Artif Intell Rev. 2018;50(1):75–91.
  • Liu D, Wang D, Zhao D, et al. Neural-network-based optimal control for a class of unknown discrete-time nonlinear systems using globalized dual heuristic programming. IEEE Trans Autom Sci Eng. 2012;9(3):628–634.
  • Kiumarsi B, Lewis FL, Modares H, et al. Reinforcement Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics. Automatica. 2014;50(4):1167–1175.
  • Yang Q, Jagannathan S. Reinforcement learning controller design for affine nonlinear discrete-time systems using online approximators. IEEE Trans Syst Man Cybern B Cybern. 2012;42(2):377–390.
  • Ha M, Wang D, Liu D. Event-triggered constrained control with DHP implementation for nonaffine discrete-time systems. Inf Sci. 2020;519:110–123.
  • Xu B, Yang C, Shi Z. Reinforcement learning output feedback NN control using deterministic learning technique. IEEE Trans Neural Netw Learn Syst. 2014;25(3):635–641.
  • Liu YJ, Li S, Tong S, et al. Adaptive reinforcement learning control based on neural approximation for nonlinear discrete-time systems with unknown nonaffine dead-zone input. IEEE Trans Neural Netw Learn Syst. 2019;30(1):295–305.
  • Allam E, Elbab HF, Hady MA, et al. Vibration control of active vehicle suspension system using fuzzy logic algorithm. Fuzzy Inf Eng. 2010;2(4):361–387.
  • Niftiyev AA, Zeynalov CI, Poormanuchehri M. Fuzzy optimal control problem with non–linear functional. Fuzzy Inf Eng. 2011;3(3):311–320.
  • Fei J, Wang T. Adaptive fuzzy-neural-network based on RBFNN control for active power filter. Int J Mach Learn Cybern. 2019;10:1139–1150.
  • Treesatayapun C, Uatrongjit S. Adaptive controller with fuzzy rules emulated structure and its applications. Eng Appl Artif Intell. 2005;18:603–615.
  • Treesatayapun C. Adaptive control based on IF–THEN rules for grasping force regulation with unknown contact mechanism. Robot Comput Integr Manuf. 2014;30:11–18.
  • Abouheaf M, Gueaieb W. Neurofuzzy reinforcement learning control schemes for optimized dynamical performance. 2019 IEEE International Symposium on Robotic and Sensors Environments (ROSE). Ontario, Canada; 2019 June. p. 17–18.
  • Treesatayapun C. Fuzzy-rule emulated networks based on reinforcement learning for nonlinear discrete-time controllers. ISA Trans. 2008;47:362–373.
  • Wei Q, Lewis FL, Sun Q, et al. Discrete-time deterministic q-learning: a novel convergence analysis. IEEE Trans Cybern. 2017;47(5):1224–1237.

Appendix 1.

Proof of Theorem 4.1

Let us refer to the standard Lyapunov function as (A1) $V(k)=V_1(k)+V_2(k)+V_3(k)+V_4(k)=\rho_1 e^2(k)+\dfrac{\rho_2}{\eta_a}\tilde{\beta}_a^T(k)\tilde{\beta}_a(k)+\dfrac{\rho_3}{\eta_c}\tilde{\beta}_c^T(k)\tilde{\beta}_c(k)+\rho_4\Lambda_c^2(k-1)$, where $\rho_1$, $\rho_2$, $\rho_3$ and $\rho_4$ are positive constants satisfying the following conditions: (A2) $\rho_1>\frac{3}{4}p\rho_3$, (A3) $\rho_2>\dfrac{\rho_1 g_M^2+(\rho_3/8)q}{g_m}$, (A4) $\rho_3>\dfrac{\rho_4}{\delta^2}$ and (A5) $\rho_4>\dfrac{\rho_3}{4}$.

Utilising (14), $\Delta V_1(k)$ is obtained as (A6) $\Delta V_1(k)=\rho_1[e^2(k+1)-e^2(k)]=\rho_1\big[[g(k)\Lambda_a(k)+d_a(k)]^2-e^2(k)\big]\leq\rho_1[2g^2(k)\Lambda_a^2(k)+2d_a^2(k)-e^2(k)]\leq-\rho_1 e^2(k)+2\rho_1 g_M^2\Lambda_a^2(k)+2\rho_1 d_{aM}^2$.

Recalling the tuning law in (26), $\Delta V_2(k)$ is expressed as (A7) $\Delta V_2(k)=\dfrac{\rho_2}{\eta_a}\big[\tilde{\beta}_a^T(k+1)\tilde{\beta}_a(k+1)-\tilde{\beta}_a^T(k)\tilde{\beta}_a(k)\big]=-2\rho_2\big[g(k)\Lambda_a(k)+\hat{L}(k)\big]\tilde{\beta}_a^T(k)\phi_a(k)+\rho_2\eta_a\big[g(k)\Lambda_a(k)+\hat{L}(k)\big]^2\phi_a^T(k)\phi_a(k)=-2\rho_2\Lambda_a(k)[g(k)\Lambda_a(k)]-2\rho_2\Lambda_a(k)\hat{L}(k)+\rho_2\eta_a\|\phi_a(k)\|^2\big[g(k)\Lambda_a(k)+\hat{L}(k)\big]^2$. Applying the lower and upper bounds of $g(k)$, it leads to (A8) $\Delta V_2(k)\leq-2\rho_2 g_m\Lambda_a^2(k)-2\rho_2\Lambda_a(k)\hat{L}(k)+\rho_2\eta_a\|\phi_a(k)\|^2 g_M^2\Lambda_a^2(k)+\rho_2\eta_a\|\phi_a(k)\|^2\big[\hat{L}^2(k)+2g(k)\Lambda_a(k)\hat{L}(k)\big]=-\rho_2\Big[g_m\Lambda_a^2(k)+\big(g_m-\eta_a\|\phi_a(k)\|^2 g_M^2\big)\Lambda_a^2(k)+2\Lambda_a(k)\big[1-\eta_a\|\phi_a(k)\|^2 g(k)\big]\hat{L}(k)-\eta_a\|\phi_a(k)\|^2\hat{L}^2(k)\Big]=-\rho_2 g_m\Lambda_a^2(k)-\rho_2\big(g_m-\eta_a\|\phi_a(k)\|^2 g_M^2\big)\Big\|\Lambda_a(k)+\dfrac{\big[1-\eta_a\|\phi_a(k)\|^2 g(k)\big]\hat{L}(k)}{g_m-\eta_a\|\phi_a(k)\|^2 g_M^2}\Big\|^2+\rho_2\dfrac{1-\eta_a\|\phi_a(k)\|^2 g_m}{g_m-\eta_a\|\phi_a(k)\|^2 g_M^2}\hat{L}^2(k)\leq-\rho_2 g_m\Lambda_a^2(k)+\rho_2 g_m\hat{L}^2(k)-\rho_2\big(g_m-\eta_a\|\phi_a(k)\|^2 g_M^2\big)\Big\|\Lambda_a(k)+\dfrac{\big[1-\eta_a\|\phi_a(k)\|^2 g(k)\big]\hat{L}(k)}{g_m-\eta_a\|\phi_a(k)\|^2 g_M^2}\Big\|^2$.

By using the learning law of MiFRENc in (32), $\Delta V_3(k)$ is derived as (A9) $\Delta V_3(k)=\dfrac{\rho_3}{\eta_c}\big[\tilde{\beta}_c^T(k+1)\tilde{\beta}_c(k+1)-\tilde{\beta}_c^T(k)\tilde{\beta}_c(k)\big]=\dfrac{\rho_3}{\eta_c}\big[-2\eta_c\delta e_c(k)\tilde{\beta}_c^T(k)\phi_c(k)+\eta_c^2\delta^2 e_c^2(k)\|\phi_c(k)\|^2\big]=-2\rho_3\delta\Lambda_c(k)e_c(k)+\rho_3\eta_c\delta^2\|\phi_c(k)\|^2 e_c^2(k)$. Recalling $e_c(k)$ in (28) with $\pm\delta L(k)$ and $\pm L(k-1)$ and using (17) and (20), it yields (A10) $e_c(k)=\delta[\hat{L}(k)-L(k)]+\delta L(k)-[\hat{L}(k-1)-L(k-1)]-L(k-1)+l(k)=\delta\big[\beta_c^T(k)\phi_c(k)-\beta_c^{*T}\phi_c(k)-\varepsilon_c(k)\big]+\delta L(k)-L(k-1)+l(k)-\big[\beta_c^T(k-1)\phi_c(k-1)-\beta_c^{*T}\phi_c(k-1)-\varepsilon_c(k-1)\big]=\delta\tilde{\beta}_c^T(k)\phi_c(k)-\tilde{\beta}_c^T(k-1)\phi_c(k-1)+\delta L(k)-L(k-1)+l(k)-\delta\varepsilon_c(k)+\varepsilon_c(k-1)=\delta\Lambda_c(k)-\Lambda_c(k-1)+\delta L(k)-L(k-1)+l(k)-\delta\varepsilon_c(k)+\varepsilon_c(k-1)$ or (A11) $\delta\Lambda_c(k)=e_c(k)-\delta L(k)+\Lambda_c(k-1)+L(k-1)-l(k)+\delta\varepsilon_c(k)-\varepsilon_c(k-1)$.

By using (A11) and (16), (A9) can be derived as (A12) $\Delta V_3(k)=-2\rho_3 e_c(k)\big[e_c(k)-\delta L(k)+\Lambda_c(k-1)+L(k-1)-l(k)+\delta\varepsilon_c(k)-\varepsilon_c(k-1)\big]+\rho_3\eta_c\delta^2\|\phi_c(k)\|^2 e_c^2(k)=-\rho_3\big[1-\eta_c\delta^2\|\phi_c(k)\|^2\big]e_c^2(k)-\rho_3 e_c^2(k)+2\rho_3 e_c(k)\big[\delta L(k)-\Lambda_c(k-1)-L(k-1)+l(k)-\delta\varepsilon_c(k)+\varepsilon_c(k-1)\big]=-\rho_3\big[1-\eta_c\delta^2\|\phi_c(k)\|^2\big]e_c^2(k)-\rho_3\delta^2\Lambda_c^2(k)+\rho_3\big[\delta L(k)-\Lambda_c(k-1)-L(k-1)+l(k)-\delta\varepsilon_c(k)+\varepsilon_c(k-1)\big]^2\leq-\rho_3\big[1-\eta_c\delta^2\|\phi_c(k)\|^2\big]e_c^2(k)-\rho_3\delta^2\Lambda_c^2(k)+\dfrac{\rho_3}{4}\Lambda_c^2(k-1)+\dfrac{\rho_3}{4}l^2(k)+\dfrac{\rho_3}{4}\big[\delta L(k)-L(k-1)\big]^2+\dfrac{\rho_3}{4}\big[\delta\varepsilon_c(k)-\varepsilon_c(k-1)\big]^2\leq-\rho_3\big[1-\eta_c\delta^2\|\phi_c(k)\|^2\big]e_c^2(k)-\rho_3\delta^2\Lambda_c^2(k)+\dfrac{\rho_3}{4}\Lambda_c^2(k-1)+\dfrac{\rho_3}{4}p e^2(k)+\dfrac{\rho_3}{8}q\Lambda_a^2(k)+\dfrac{\rho_3}{8}\|\beta_a^{*T}\phi_a(k)\|^2+\dfrac{\rho_3}{4}\big[\delta L(k)-L(k-1)\big]^2+\rho_3\varepsilon_{cM}^2$.

Finally, $\Delta V_4(k)$ is formulated as (A13) $\Delta V_4(k)=\rho_4\big[\Lambda_c^2(k)-\Lambda_c^2(k-1)\big]$.

Recalling (A6), (A8), (A12) and (A13), $\Delta V(k)$ is rewritten as (A14) $\Delta V(k)\leq-\dfrac{\rho_1}{3}e^2(k)+\rho_1 g_M^2\Lambda_a^2(k)+\rho_1 d_{aM}^2-\rho_2 g_m\Lambda_a^2(k)-\rho_2\big(g_m-\eta_a\|\phi_a(k)\|^2 g_M^2\big)\Big\|\Lambda_a(k)+\dfrac{\big[1-\eta_a\|\phi_a(k)\|^2 g(k)\big]\hat{L}(k)}{g_m-\eta_a\|\phi_a(k)\|^2 g_M^2}\Big\|^2+\rho_2 g_m\hat{L}^2(k)-\rho_3\big[1-\eta_c\delta^2\|\phi_c(k)\|^2\big]e_c^2(k)-\rho_3\delta^2\Lambda_c^2(k)+\dfrac{\rho_3}{4}\Lambda_c^2(k-1)+\dfrac{\rho_3}{4}p e^2(k)+\dfrac{\rho_3}{8}q\Lambda_a^2(k)+\dfrac{\rho_3}{8}\|\beta_a^{*T}\phi_a(k)\|^2+\dfrac{\rho_3}{4}\big[\delta L(k)-L(k-1)\big]^2+\rho_3\varepsilon_{cM}^2+\rho_4\big[\Lambda_c^2(k)-\Lambda_c^2(k-1)\big]\leq-\Big[\dfrac{\rho_1}{3}-\dfrac{\rho_3}{4}p\Big]e^2(k)-\Big[\rho_2 g_m-\rho_1 g_M^2-\dfrac{\rho_3}{8}q\Big]\Lambda_a^2(k)-\big[\rho_3\delta^2-\rho_4\big]\Lambda_c^2(k)-\Big[\rho_4-\dfrac{\rho_3}{4}\Big]\Lambda_c^2(k-1)-\rho_3\big[1-\eta_c\delta^2\|\phi_c(k)\|^2\big]e_c^2(k)+V_M\leq-V_e e^2(k)-V_a\Lambda_a^2(k)-V_{c0}\Lambda_c^2(k)-V_{c1}\Lambda_c^2(k-1)-V_c e_c^2(k)+V_M$, where (A15) $V_M\equiv\rho_1 d_{aM}^2+\rho_3\varepsilon_{cM}^2+\dfrac{\rho_3}{8}\beta_{aM}^2+\Big[\dfrac{\rho_3}{8}(\gamma-1)^2+\rho_2 g_m\Big]L_M^2-\rho_2\big[g_m-\eta_a\|\phi_a(k)\|^2 g_M^2\big]\Big\|\Lambda_a(k)+\dfrac{\big[1-\eta_a\|\phi_a(k)\|^2 g(k)\big]\hat{L}(k)}{g_m-\eta_a\|\phi_a(k)\|^2 g_M^2}\Big\|^2$, (A16) $V_e=\dfrac{\rho_1}{3}-\dfrac{\rho_3}{4}p$, (A17) $V_a=\rho_2 g_m-\rho_1 g_M^2-\dfrac{\rho_3}{8}q$, (A18) $V_{c0}=\rho_3\delta^2-\rho_4$, (A19) $V_{c1}=\rho_4-\dfrac{\rho_3}{4}$ and (A20) $V_c=\rho_3\big[1-\eta_c\delta^2\|\phi_c(k)\|^2\big]$. Here $L_M$ and $\beta_{aM}$ denote upper bounds of the long-term cost and of $\|\beta_a^{*T}\phi_a(k)\|$, respectively. According to the conditions in (A2)–(A5), $V_e$, $V_a$, $V_{c0}$ and $V_{c1}$ are always positive.

Furthermore, by the setting of the membership functions of FRENa and MiFRENc, the upper limits exist such that (A21) $0<\|\phi_a(k)\|^2\leq\nu_a$ and (A22) $0<\|\phi_c(k)\|^2\leq\nu_c$. Combining these with (34) and (35), it leads to (A23) $g_m-\eta_a\|\phi_a(k)\|^2 g_M^2>0$ and (A24) $V_M\leq\rho_1 d_{aM}^2+\rho_3\varepsilon_{cM}^2+\dfrac{\rho_3}{8}\beta_{aM}^2+\Big[\dfrac{\rho_3}{8}(\gamma-1)^2+\rho_2 g_m\Big]L_M^2$. By these means, we have (A25) $|e(k)|\leq\Omega_e\triangleq\sqrt{\dfrac{V_M}{V_e}}$, (A26) $|\Lambda_a(k)|\leq\Omega_a\triangleq\sqrt{\dfrac{V_M}{V_a}}$ and (A27) $|\Lambda_c(k)|\leq\Omega_c\triangleq\sqrt{\dfrac{V_M}{V_{c0}}}$. The proof is completed here.