Research Article

Robustness improvement of optimal control in terms of RBFNN with empirical model reduction and transfer learning

Received 06 Mar 2023, Accepted 05 Mar 2024, Published online: 21 Mar 2024

Abstract

This paper proposes a method to compute solutions of optimal controls for dynamic systems in terms of radial basis function neural networks (RBFNN) with Gaussian neurons. The RBFNN is used to compute the value function from Hamilton–Jacobi–Bellman equation with the policy iteration. The concept of dominant system is introduced to create initial coefficients of the neural networks to stabilise unstable systems and guarantee the convergence of policy iteration. Model reduction and transfer learning techniques are introduced to improve robustness of RBFNN optimal control, and reduce computational time. Numerical and experimental results show that the resulting optimal control has excellent control performance in stabilisation and trajectory tracking, and much-improved robustness to disturbances and model uncertainties even when the system responses move to a domain of the state space that is larger than the domain where the neural networks are trained.

1. Introduction

Optimal controls are commonly determined with the help of the Hamilton–Jacobi–Bellman (HJB) equation. The HJB equation is a nonlinear partial differential equation for the value function of the optimal control problem. There have been many studies on the solution of the HJB equation in the mathematics and controls community. However, it remains a technical challenge to obtain analytical solutions of the HJB equation for nonlinear control problems. This paper presents a method to obtain approximate solutions of the HJB equation for nonlinear control systems. The method consists of radial basis function neural networks, reduced order modelling and transfer learning.

Crandall and Lions proposed the viscosity solution of the HJB equation in Crandall and Lions (Citation1983), and showed that the HJB equation can admit non-smooth solutions. A sequential alternating least squares method to solve the high dimensional linear HJB equation was presented in Stefansson and Leong (Citation2016). In addition to numerical methods such as viscosity solutions, finite differences and finite elements, neural networks are another promising approach that has garnered recent interest (Greif, Citation2017).

Neural networks have been extensively studied for solving partial differential equations (PDEs) (Han et al., Citation2018; Lagaris et al., Citation2000; Sirignano & Spiliopoulos, Citation2018). Neural networks with different activation functions have also been applied to the HJB equation (Munos et al., Citation1999; T. Nakamura-Zimmerer et al., Citation2021). The HJB equation with action constraints (Abu-Khalaf & Lewis, Citation2005; Yang et al., Citation2014) and fixed final time has been studied with the help of neural networks with polynomial neurons (Cheng et al., Citation2007a, Citation2007b). The least squares algorithm with a time-domain discounting factor and tanh(x) neurons was studied in Tassa and Erez (Citation2007) to improve the smoothness and convergence of neural network solutions. The activation function tanh(x) was also used to solve high dimensional HJB equations in T. Nakamura-Zimmerer et al. (Citation2020). Rectified linear neurons were used to approximate the value function with a grid-free approach in Jiang et al. (Citation2016). Effects of different types of neural networks on the solution of the Hamilton–Jacobi–Bellman equation are studied in Yang et al. (Citation2014).

The radial basis function neural networks (RBFNN) were used to compute the HJB solution along the response trajectory (D. Zhang et al., Citation2016). Normalised Gaussian networks were adopted to develop a reinforcement learning framework that obtains the global value function for continuous time systems while training the networks along different trajectories (Doya, Citation2000). Actor-critic reinforcement learning together with the RBFNN was also applied to obtain the optimal control for a discrete-time system (D. Zhang et al., Citation2016).

Recently, the interest in solving the HJB equation in high dimensional state space has been growing (Darbon & Osher, Citation2016; Yegorov & Dower, Citation2021). A neural network with sigmoid activation functions was proposed to compute the viability sets of nonlinear continuous systems in Djeridane and Lygeros (Citation2006). The tanh activation function was used to develop a causality-free neural network that solves high dimensional HJB equations with the co-state vector as part of the solution (T. Nakamura-Zimmerer et al., Citation2020; T. E. Nakamura-Zimmerer, Citation2022). Two neural network architectures based on min-plus algebra and the min-pooling activation function are introduced in Darbon et al. (Citation2021) to solve high dimensional optimal control problems with specified terminal cost. Indeed, neural networks can obtain excellent approximations of the value function from the solution of the HJB equation leading to the optimal policy. Among all the neural network structures and activation functions, polynomial activation functions are adopted in most cases to solve optimal control problems in reinforcement learning (Abu-Khalaf & Lewis, Citation2005; Cheng et al., Citation2007a, Citation2007b; Liu et al., Citation2014; Luo et al., Citation2014; Modares & Lewis, Citation2014; Modares et al., Citation2014; Yang et al., Citation2014).

While the solutions of the HJB equation have been extensively studied, fewer investigations have focussed on the performance of optimal controls obtained in terms of the RBFNN. This paper presents an approach based on radial basis function neural networks with Gaussian activation functions to solve the HJB equation for various optimal control problems and to demonstrate the robustness of the optimal controls with the model reduction and transfer learning technique in machine learning. This study addresses the robustness improvement of the optimal controls obtained with the RBFNN, model reduction and transfer learning.

There are many studies on radial basis function neural networks. An adaptive extended Kalman filter (AEKF) was developed for estimating the weights, centres and widths of the RBFNN (Medagam & Pourboghrat, Citation2009). It has been shown that the RBFNN with one hidden layer and the same smoothing factor in each kernel is broad enough for universal approximation (Park & Sandberg, Citation1991). The RBFNN have been extensively applied to regression and classification problems. An excellent review of this method can be found in Wu et al. (Citation2012). The RBFNN have been shown to have advantages in terms of design, generality and tolerance to noise compared to other neural network structures (Yu et al., Citation2011). It has also been found that the RBFNN have an advantage over conventional sigmoid neural networks (NN) because the n-dimensional Gaussian function is well understood from probability theory, Kalman filtering, etc. (Lewis et al., Citation1998). An application of the RBFNN in control design can be found in W. Zhang et al. (Citation2022) to approximate the error model in vibration suppression. These advantages motivate us to further explore the potential of the RBFNN method to solve the HJB equation for nonlinear control systems.

Many mechanical systems with multi-degrees of freedom are underactuated by design. The dynamics of these systems live in a relatively high dimensional state space. It turns out to be highly advantageous to discover a reduced order model of the system, in which the limited control efforts can be focussed to stabilise the system and achieve improved tracking performance. The reduced order model also decreases numerical complexities in solving the HJB equation for nonlinear control systems. For this reason, it is important to find a reduced order model of the system before applying the RBFNN method to solve the HJB equation.

The balanced truncation model reduction (BTMR) method is a popular algorithm in the controls community to reduce the dimension of the dynamic system with the help of principal component analysis and singular value decomposition of the Hankel matrix constructed with the controllability and observability gramians. It has received much attention from the research community in the past decades (Gugercin & Antoulas, Citation2004; Pernebo & Silverman, Citation1982). The balanced model reduction has been applied to control design for high dimensional systems with thousands of degrees of freedom (Huang & Kramer, Citation2020; Wu et al., Citation2020). The controllability and observability gramians for nonlinear control systems can be computed from input-output data, leading to the so-called empirical gramians (Himpe, Citation2018). In this paper, the BTMR algorithms with both model-based and empirical gramians are considered to improve the robustness of the RBFNN optimal control and to reduce the computational complexity.

Since the mathematical model of nonlinear control systems can never perfectly capture the dynamics of the real system, the RBFNN control based on the mathematical model may need fine tuning when it is applied to the physical system in an experimental setting. This is where the transfer learning technique of machine learning comes in. After certain experimental data is obtained, the neural networks can be partially retrained in order to further improve the performance of the control.

The objective of this paper is to propose an efficient, practical and useful RBFNN algorithm to solve the HJB equation in optimal control. We demonstrate the RBFNN optimal control design with a linear rotary pendulum system and several nonlinear systems. In particular, the robustness of the RBFNN optimal control is explored and reported in terms of stabilisation and tracking of the pendulum and other nonlinear systems. The traditional LQR control and the optimal control obtained with the polynomial neural networks are used as benchmarks to compare with the proposed RBFNN optimal control design.

The remainder of the paper is outlined as follows. Section 2 reviews the optimal control formulation in terms of the HJB equation. Section 3 presents the RBFNN method to solve the nonlinear HJB equation. Section 4 discusses the balanced truncation algorithm and the RBFNN optimal control with the reduced order model. Section 5 describes the transfer learning step used to retrain the networks with experimental data. Section 6 presents numerical simulations and experimental results of the RBFNN optimal control for linear systems. Section 7 presents numerical simulations of the RBFNN optimal control for nonlinear systems. Section 8 concludes the paper.

2. The Hamilton–Jacobi–Bellman equation

Consider a nonlinear time-invariant dynamic system governed by the state equation
(1) ẋ(t) = f(x) + g(x)u,  x(t_0) = x_0,
where t_0 is the initial time, x_0 is the initial state, x ∈ R^{n×1} and u ∈ R^{m×1}. f(·) and g(·) are nonlinear functions of their arguments. Define a target set at a terminal time t = T as
(2) φ(x(T), T) = 0,
where φ ∈ R^{p×1}. φ defines a set where the terminal state x(T) must settle at time T.

The optimal control problem amounts to finding a control u in an admissible set U ⊂ R^{m×1} that drives the system from the initial state x_0 to the target set φ(x(T), T) such that the following performance index J is minimised while the state equation is satisfied:
(3) J = φ(x(T), T) + ∫_{t_0}^{T} L(x(τ), u(τ), τ) dτ,
where L(x(t), u(t), t) > 0 is called the Lagrange function.

Define a value function along the optimal trajectory based on the performance index as
(4) V(x(t), t) = φ(x(T), T) + ∫_{t}^{T} L(x(τ), u(τ), τ) dτ = min_u [ φ(x(T), T) + ∫_{t}^{T} L(x, u, τ) dτ ].
By definition, the value function has the following properties:
(5) V(x(T), T) = φ(x(T), T),  V(x(t_0), t_0) = J.
Taking the derivative of the value function with respect to time t based on the definition in Equation (4) gives
(6) dV(x(t), t)/dt = −L(x(t), u(t), t).

Remark 2.1

A remark on the role of the value function is in order. By definition, we have
(7) V(x(t), t) ≥ 0,  dV(x(t), t)/dt ≤ 0.
Hence, V(x(t), t) is a Lyapunov function. The existence of the Lyapunov function implies that the system under the optimal control u(t) is stable.

By treating V(x(t), t) as a multi-variable function of the state and time, we can also apply the chain rule of differentiation along the optimal path to obtain the Hamilton–Jacobi–Bellman (HJB) equation as
(8) −∂V(x(t), t)/∂t = L(x(t), u(t), t) + (∂V(x(t), t)/∂x)^T (f(x(t)) + g(x(t))u(t)) = H(x(t), u(t), ∂V(x(t), t)/∂x, t) = min_u H(x(t), u(t), ∂V(x(t), t)/∂x, t),
where H is the Hamiltonian function defined as
(9) H(x(t), u(t), λ(t), t) = L(x(t), u(t), t) + λ^T(t)[f(x(t)) + g(x(t))u(t)],
and λ(t) is a vector of Lagrange multipliers.

Equation (8) is a partial differential equation defined in the state space governing the value function. According to the definition of the value function, we have a terminal condition for the HJB equation, V(x(T), T) = φ(x(T), T). When there are constraints imposed on the states, boundary conditions can also be imposed on the HJB equation.

Consider a special case
(10) L(x, u) = (1/2)[x^T(t) Q x(t) + u^T(t) R u(t)],
where Q is a positive semi-definite symmetric matrix and R is a positive definite symmetric matrix. The optimal control for this case is expressed in terms of the value function as
(11) u(t) = −R^{-1} g^T(x(t)) ∂V(x(t), t)/∂x.
When the system is nonlinear, it is difficult to obtain V(x(t), t) analytically. This study focuses on the RBFNN with Gaussian activation functions for obtaining the value function.

3. Neural networks solution of value function

In the following, we shall focus on the special time-invariant case when the terminal time T → ∞, the terminal cost φ(x(T), T) ≡ 0, and the value function is only a function of the state, i.e. V = V(x).

The value function can be expressed in terms of radial basis function neural networks with Gaussian activation functions,
(12) V(x, w(k)) = ∑_{j=1}^{NG} w_j(k) g(x, μ_j, Σ_j) + C ≡ w^T(k) h(x) + C ≡ h^T(x) w(k) + C,
where NG is the number of Gaussian functions g(x, μ_j, Σ_j) and C is an arbitrary integration constant to guarantee that the value function is non-negative.
(13) g(x, μ_j, Σ_j) = ∏_{i=1}^{n} (1/(√(2π) σ_{j,i})) exp[ −(x_i − μ_{j,i})² / (2σ_{j,i}²) ]
(14)  ≡ ∏_{i=1}^{n} g_j(x_i, μ_{j,i}, σ_{j,i}).
The vectors and matrices in the above equations are defined as
(15) w(k) = [w_1(k), w_2(k), …, w_{NG}(k)]^T ∈ R^{NG×1},
(16) h(x) = [g(x, μ_1, Σ_1), g(x, μ_2, Σ_2), …, g(x, μ_{NG}, Σ_{NG})]^T ∈ R^{NG×1},
(17) μ_j = [μ_{j,i}] ∈ R^{NG×n},
(18) Σ_j = [σ_{j,i}] ∈ R^{NG×n},
where k is the iteration index of the policy iteration algorithm introduced next. In general, the elements of the mean matrix μ_j and covariance matrix Σ_j can be trainable coefficients of the neural networks.
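As an illustration, a minimal numerical sketch of how the Gaussian kernels in Equations (13)–(14) and the value-function expansion in Equation (12) can be evaluated is given below. The function names and array shapes are our own choices and are not taken from the paper's implementation.

```python
import numpy as np

def gaussian_kernels(x, mu, sigma):
    """Evaluate the NG Gaussian neurons g(x, mu_j, Sigma_j) of Equations (13)-(14).

    x     : (n,)     state vector
    mu    : (NG, n)  means of the Gaussian neurons
    sigma : (NG, n)  standard deviations (diagonal covariance)
    Returns h(x) with shape (NG,), cf. Equation (16).
    """
    z = (x - mu) / sigma                                  # (NG, n)
    coef = 1.0 / (np.sqrt(2.0 * np.pi) * sigma)           # normalising factors
    return np.prod(coef * np.exp(-0.5 * z**2), axis=1)    # product over state dimensions

def value_function(x, w, mu, sigma, C=0.0):
    """V(x, w) = w^T h(x) + C, Equation (12)."""
    return w @ gaussian_kernels(x, mu, sigma) + C
```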

3.1. Implementation of RBFNN

In this study, the Gaussian functions are adopted as kernels to approximate an unknown function, as is commonly done in statistics and mesh-free finite elements (J.-S. Chen et al., Citation2017; Silverman, Citation2018; Sun & Hsu, Citation1988). That is to say, a finite domain D of interest in the state space can be divided into grids, and the grid coordinates are taken as the means μ_j. Furthermore, the covariance matrix Σ_j is taken to be diagonal and the same constant for all Gaussian functions. It is noteworthy that with these choices of the mean and covariance matrix, the RBFNN can indeed deliver sufficiently accurate solutions of nonlinear partial differential equations (Wang et al., Citation2022a, Citation2022b, Citation2023). It is shown in this paper that the optimal control for nonlinear dynamic systems can also be obtained in this manner, which allows us to focus on the issues in the control design.
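The sketch below illustrates this grid-based placement of the neuron means with a constant diagonal covariance. The helper name is our own; the constant standard deviation 2.6667 mirrors the value reported later for the reduced order pendulum model in Section 6.

```python
import numpy as np
from itertools import product

def grid_means(bounds, n_per_dim):
    """Place the Gaussian means on a uniform grid over a finite domain D.

    bounds    : list of (low, high) intervals, one per state dimension
    n_per_dim : number of grid points per dimension
    Returns mu of shape (n_per_dim**n, n) and the grid spacing per dimension.
    """
    axes = [np.linspace(lo, hi, n_per_dim) for lo, hi in bounds]
    mu = np.array(list(product(*axes)))
    spacing = np.array([(hi - lo) / (n_per_dim - 1) for lo, hi in bounds])
    return mu, spacing

# Example: a 3D reduced-order state, means on [-4, 4]^3 with 4^3 = 64 neurons
mu, h = grid_means([(-4.0, 4.0)] * 3, 4)
sigma = np.full_like(mu, 2.6667)   # constant standard deviation, as in Section 6
```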

Future efforts will treat both μj and Σj as trainable coefficients. The effect of varying means and standard deviations on the optimal control solution will be studied in a separate paper.

In the following discussions, we drop the superscript * denoting optimal solutions for brevity.

3.2. Policy iteration (PI) algorithm

Recall the optimal control and consider the RBFNN solution of the value function in Equation (12). At the kth step of iterations, the control is given as
(19) u(x, k) = −R^{-1} g^T(x) ∂V(x, w(k))/∂x,
where u(x, k) and V(x, w(k)) denote the control and value function at the kth step of iterations. u(x, k = 0) is chosen to be a stabilising initial control. The value function is updated with the following equation,
(20) (∂V(x, w(k+1))/∂x)^T (f(x) + g(x)u(x, k)) + L(x, u(x, k)) = 0.
According to Theorem 4 of reference Saridis and Lee (Citation1979), the pair {u(x, k), V(x, w(k))} determined by Equations (19) and (20) leads to a convergent sequence such that
(21) V(x, w(k)) ≥ V(x, w(k+1)),  lim_{k→∞} V(x, w(k)) = V*(x),  lim_{k→∞} u(x, k) = u*(x) = −R^{-1} g^T(x) ∂V*(x)/∂x,
where V*(x) denotes the true optimal value function and u*(x) is the true optimal control. Starting from a random value function and iteratively updating the new value function with the greedy policy is called value iteration or successive approximation (Saridis & Lee, Citation1979). Policy iteration, instead, iteratively updates the greedy policy towards the optimal solution. Since the control u(x, k) and the value function are both iteratively updated during the calculation of the RBFNN, it is hard to classify the current approach strictly as value iteration or policy iteration. In this paper, it is treated as policy iteration. The proof of convergence of a similar algorithm for saturated controls was presented by Abu-Khalaf and Lewis (Citation2005) and Meyn (Citation2022). Many examples reported in the literature also confirm the convergence of the successive approximation algorithm (Tassa & Erez, Citation2007). We have found that the choice of an initial stabilising control policy u_0 is important to put the PI algorithm on the right track to converge.

Next, the updating equation is presented in matrix form. Considering the expressions of the Gaussian functions in Equations (14)–(18), the partial derivative is expressed as
(22) ∂V(x, w(k))/∂x_i = −∑_{j=1}^{NG} ((x_i − μ_{j,i})/σ_{j,i}²) w_j(k) g(x, μ_j, Σ_j) ≡ ∑_{j=1}^{NG} d_{i,j}(x) w_j(k),
where
(23) d_{i,j}(x) = −((x_i − μ_{j,i})/σ_{j,i}²) g(x, μ_j, Σ_j).
Hence, the gradient of the value function is
(24) ∂V(x, w(k))/∂x = D(x) w(k) ∈ R^{n×1},
where D = [d_{i,j}] ∈ R^{n×NG} with d_{i,j} given in Equation (23). The control can now be written in matrix form as
(25) u(x, k) = −R^{-1} g^T(x) D(x) w(k).
Equation (20) in terms of these matrices reads
(26) S^T(x) w(k+1) + L(x, u(x, k)) = 0,
where S ∈ R^{NG×1} and
(27) S^T(x) = [f(x) − g(x) R^{-1} g^T(x) D(x) w(k)]^T D(x).
Define the error of the HJB equation due to the neural network approximation as
(28) e(x, w(k+1)) = S^T(x) w(k+1) + L(x, u(x, k)).
An integrated squared error in the state space R^{n×1} is defined as
(29) J_HJB(w(k+1)) = ∫_{R^{n×1}} (1/2) e²(x, w(k+1)) dx.

3.3. Sampling method for integration

Computation of J_HJB(w(k+1)) involves integration in a high dimensional space and can be intensive. Therefore, instead of the integration in Equation (29), we can uniformly sample a large number of points x_s ∈ D ⊂ R^{n×1} to compute an approximate value of J_HJB(w(k+1)). A sampled value J_s of J_HJB is obtained as
(30) J_s(w(k+1)) = (1/2) ∑_{s=1}^{Ns} e²(x_s, w(k+1)) = (1/2) ∑_{s=1}^{Ns} [S^T(x_s) w(k+1) + L(x_s, u(x_s, k))]²,
where Ns is the number of sampled points x_s in the domain D. Define the following matrix and vectors as
(31) A_g(k) = A_g^T(k) = ∑_{s=1}^{Ns} S(x_s) S^T(x_s) ∈ R^{NG×NG},
(32) b(k) = ∑_{s=1}^{Ns} L(x_s, u(x_s, k)) S(x_s) ∈ R^{NG×1},
(33) d(k) = (1/2) ∑_{s=1}^{Ns} L²(x_s, u(x_s, k)) ∈ R.
Later in the examples, the domain D is referred to as the sampling region or training region.

The sampled performance index reads
(34) J_s(w(k+1)) = (1/2) w^T(k+1) A_g(k) w(k+1) + w^T(k+1) b(k) + d(k).
The optimal coefficient vector at each iteration step can be obtained as
(35) w(k+1) = −A_g^{-1}(k) b(k),
assuming that the inverse A_g^{-1}(k) exists. It has been found that when Ns > NG, the inverse usually exists (Wang et al., Citation2022a, Citation2022b, Citation2023). The mean values and standard deviations of the Gaussian functions are fixed in this paper. The state space is uniformly sampled along each dimension with a fixed grid size, and the standard deviation is set to be between four and five times the size of the fixed grid.
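A minimal sketch of one policy-iteration update built from Equations (22)–(35) is given below. It reuses the gaussian_kernels helper sketched in Section 3, assumes the quadratic Lagrangian of Equation (10), and is an illustration of the algorithm rather than the authors' code.

```python
import numpy as np

def policy_iteration_step(w, xs, f, g, mu, sigma, Q, R):
    """One update w(k) -> w(k+1) following Equations (30)-(35).

    w    : (NG,)    current RBFNN weights w(k)
    xs   : (Ns, n)  sampled states in the training region D
    f, g : callables returning f(x) with shape (n,) and g(x) with shape (n, m)
    Q, R : weighting matrices of the quadratic Lagrangian, Equation (10)
    """
    R = np.atleast_2d(R)
    R_inv = np.linalg.inv(R)
    NG = w.size
    Ag = np.zeros((NG, NG))
    b = np.zeros(NG)
    for x in xs:
        h = gaussian_kernels(x, mu, sigma)                 # (NG,)
        D = -((x[:, None] - mu.T) / sigma.T**2) * h        # (n, NG), Equation (23)
        u = -R_inv @ g(x).T @ D @ w                        # current policy, Equation (25)
        S = D.T @ (f(x) + g(x) @ u)                        # (NG,), Equation (27)
        L = 0.5 * (x @ Q @ x + u @ R @ u)                  # Lagrangian, Equation (10)
        Ag += np.outer(S, S)                               # Equation (31)
        b += L * S                                         # Equation (32)
    return -np.linalg.solve(Ag, b)                         # w(k+1), Equation (35)
```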

3.4. Dominant system

If the policy iteration algorithm starts with an initial estimate of the value function in terms of the RBFNN with randomly generated coefficients, the resulting control can push an unstable system onto a trajectory leading to instability. Hence, it is recommended to begin the policy iterations of an unstable system with a stabilising initial control, which ensures system stability through the iteration process. We introduce the concept of the dominant system to create the stabilising initial control. The dominant system is defined as an unstable linear system whose poles have larger positive real parts than those of the original system. An LQR control can be found to stabilise the dominant system and is used as the initial control to start the policy iteration. There are different ways to design the initial control, such as supervised-learning-based control (Borovykh et al., Citation2022) and unconstrained LQR control (Abu-Khalaf & Lewis, Citation2005). It should be noted that for stable systems, randomly generated initial coefficients w(1) for the RBFNN always make the policy iteration convergent.
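A minimal sketch of generating such a stabilising initial control as an LQR gain for a dominant linear system is shown below; the function name and the use of SciPy's Riccati solver are our own choices.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

def initial_stabilising_gain(A_dom, B, Q, R):
    """LQR gain K for a dominant linear system; u(x, k=0) = -K x starts the
    policy iteration from a stabilising policy."""
    R = np.atleast_2d(R)
    P = solve_continuous_are(A_dom, B, Q, R)
    return np.linalg.solve(R, B.T @ P)   # K = R^{-1} B^T P
```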

4. Reduced order model

It has been numerically found that the reduced order model shows better robustness to disturbances and parameter uncertainties compared to the original system (Zhao & Sun, Citation2022). We take advantage of the reduced order model to better focus the control effort to stabilise the system and to reduce the dimension of the state space where the HJB equation applies. This approach will subsequently improve the numerical performance of the RBFNN when applied to find the value function from the HJB equation in high dimensional space.

There are different model order reduction methods. In this study, the balanced truncation model reduction technique is considered, which can be readily extended to become data-driven. This is one of the reasons for us to adopt this technique.

4.1. Balanced truncation

This section presents the procedure of the balanced truncation method, as it was described by Laub et al. (Citation1987). Before exploring the details about the balanced truncation algorithm, it is beneficial to provide a brief introduction about the concept of controllability and observability in control theory. Controllability measures the ability of the control to drive the system from an initial state to any desired state in a finite time. Observability refers to the ability to estimate the full state of the system from output measurements. Controllability and observability play important roles in the balanced truncation algorithm, as elaborated in the subsequent discussion.

Consider a linear time invariant (LTI) dynamical system in the state space,
(36) ẋ(t) = Ax(t) + Bu(t),  y(t) = Cx(t).
The balanced truncation reduced order model is realised by introducing a transformation T to the state x(t) such that x = T x_b, which leads to a so-called balanced form of the system. In the following, the process of finding the transformation T to obtain the desired balanced system is presented.

Let W_c and W_o be the controllability and observability gramians of the stable subsystem. Consider two continuous time Lyapunov equations,
(37) A W_c + W_c A^T + B B^T = 0,  A^T W_o + W_o A + C^T C = 0.
Since A is stable, the Lyapunov equations have unique symmetric positive definite solutions W_c and W_o ∈ R^{ns×ns}, which are the controllability and observability gramians of the system (Gawronski & Juang, Citation1990; Phillips et al., Citation2022). The controllability and observability gramians can also be defined as
(38) W_c = ∫_0^∞ e^{At} B B^T e^{A^T t} dt,
(39) W_o = ∫_0^∞ e^{A^T t} C^T C e^{At} dt.
After obtaining W_c and W_o, the transformation T is computed through the following steps.

  1. Evaluate the Cholesky decompositions of the gramians W_c and W_o,
(40) W_c = R R^T,  W_o = Q Q^T.

  2. Evaluate the singular value decomposition of the Hankel matrix defined as H = Q^T R,
(41) H = U_H Ω V_H^T.

  3. The transformation T is obtained as
(42) T = R V_H Ω^{-1/2},  T^{-1} = Ω^{-1/2} U_H^T Q^T,
where Ω = diag(ω_1, ω_2, …, ω_ns) and ω_1 ≥ ω_2 ≥ ⋯ ≥ ω_ns ≥ 0. The ω_i are called the Hankel singular values of the matrix H.

Define the transformed gramians as W̄_c = T^{-1} W_c T^{-T} and W̄_o = T^T W_o T = Ω. The singular values Ω can be partitioned as
(43) Ω = [Ω_1, 0; 0, Ω_2],
where Ω_1 = diag(ω_1, ω_2, …, ω_r) and Ω_2 = diag(ω_{r+1}, ω_{r+2}, …, ω_ns). Ω_2 will be truncated to obtain the reduced order system. Hence, in the transformed state x_b = T^{-1} x, we neglect the components x_{b,i} for i = r+1, r+2, …, ns. r < ns denotes the number of states to be included in the reduced order model of the stable subsystem.

Let x_r(t) ∈ R^{r×1} denote the truncated state vector, A_r ∈ R^{r×r} the corresponding state matrix, B_r ∈ R^{r×m} the reduced order control influence matrix and C_r ∈ R^{p×r} the reduced order output matrix. The reduced order system output y_r ∈ R^{p×1} has the same dimension as the original output of the system. It can be proved that y_r is approximately equal to the original output y. The reduced order system is given as
(44) ẋ_r(t) = A_r x_r(t) + B_r u(t),  y_r(t) = C_r x_r(t),
where A_r = T_r^{-1} A T_r, B_r = T_r^{-1} B, C_r = C T_r and x_r = T_r^{-1} x. T_r ∈ R^{ns×r} is the part of the transformation T that remains after the columns corresponding to the truncated states are removed, and T_r^{-1} denotes the corresponding rows of T^{-1}. The X_0-balanced truncation method presented in Heinkenschloss et al. (Citation2011) is adopted to include the effects of non-zero initial conditions for the LTI system. Details of the application of the balanced truncation algorithm with non-zero initial conditions for a high-dimensional LTI system can be found in Zhao and Sun (Citation2022).
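The steps above can be summarised in the following numerical sketch of model-based balanced truncation for a stable LTI system; it relies on SciPy's Lyapunov solver and uses our own naming, so it should be read as an illustration rather than the paper's implementation.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, cholesky, svd

def balanced_truncation(A, B, C, r):
    """Balanced truncation of a stable LTI system, Equations (37)-(44)."""
    # Gramians from the two Lyapunov equations (37)
    Wc = solve_continuous_lyapunov(A, -B @ B.T)
    Wo = solve_continuous_lyapunov(A.T, -C.T @ C)
    # Step 1: Cholesky factors Wc = R R^T, Wo = Q Q^T, Equation (40)
    Rf = cholesky(Wc, lower=True)
    Qf = cholesky(Wo, lower=True)
    # Step 2: SVD of the Hankel matrix H = Q^T R, Equation (41)
    U, s, Vt = svd(Qf.T @ Rf)
    # Step 3: balancing transformation, Equation (42)
    T = Rf @ Vt.T @ np.diag(s ** -0.5)
    T_inv = np.diag(s ** -0.5) @ U.T @ Qf.T
    # Keep the r dominant Hankel singular values, Equations (43)-(44)
    Tr, Tr_inv = T[:, :r], T_inv[:r, :]
    Ar, Br, Cr = Tr_inv @ A @ Tr, Tr_inv @ B, C @ Tr
    return Ar, Br, Cr, Tr, Tr_inv
```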

4.2. Empirical balanced truncation

The controllability and observability gramians used in the balanced truncation method are defined for linear systems and are usually model-based. For nonlinear systems without a detailed mathematical model, the controllability and observability gramians can be estimated from input-output data. These are called empirical gramians. The empirical gramians yield exactly the same balanced truncation transformations for stable linear systems. However, it has been found that the reduced order model from the empirical gramians contains more accurate information than the traditional model-based balanced truncation (Grundel et al., Citation2019; Singh & Hahn, Citation2005). The empirical controllability and observability gramians are defined as
(45) Ŵ_c = ∑_{i=1}^{p} ∑_{m=1}^{s} ∑_{l=1}^{r} (1/(r s c_m²)) ∫_0^∞ Φ^{ilm}(t) dt,
(46) Ŵ_o = ∑_{m=1}^{s} ∑_{l=1}^{r} (1/(r s c_m²)) ∫_0^∞ T_l Ψ^{lm}(t) T_l^T dt,
where
(47) Φ^{ilm}(t) = (x^{ilm}(t) − x̄^{ilm})(x^{ilm}(t) − x̄^{ilm})^T,  Ψ_{ij}^{lm}(t) = (y^{ilm}(t) − ȳ^{ilm})^T (y^{jlm}(t) − ȳ^{jlm}),  T_i^T T_i = I, T_i ∈ R^{n×n}, i = 1, …, r,  c_i > 0, c_i ∈ R, i = 1, …, s,
and x^{ilm}(t) is the state response to an impulsive input u(t) = c_m T_l e_i δ(t), x̄^{ilm} is the average of the state response, y^{ilm}(t) is the output of the system starting from the initial condition x_0 = c_m T_l e_i, and the e_i are unit vectors. The same calculation procedure described by Equations (40)–(42) can be applied to obtain the transformation matrix T for the reduced order model in Equation (44).
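A simplified, data-driven sketch of the empirical controllability gramian of Equation (45) is given below, using a single scale c and only the identity rotation; the simulate interface is a placeholder assumption for a black-box simulator or experiment, not part of the paper.

```python
import numpy as np

def empirical_controllability_gramian(simulate, n, m, dt, c=1.0):
    """Estimate Wc from impulse-response data, a simplified form of Equation (45)
    with one scale c and the identity rotation T_l = I.

    simulate(u_fun) must return the state trajectory X of shape (n_steps, n)
    of the system driven by u(t) from a zero initial state.
    """
    Wc = np.zeros((n, n))
    for i in range(m):
        e_i = np.zeros(m)
        e_i[i] = c
        # impulsive input u(t) = c e_i delta(t), realised as a one-step pulse
        u_fun = lambda t, e=e_i: e / dt if t < dt else np.zeros(m)
        X = simulate(u_fun)
        Xc = X - X.mean(axis=0)                 # centre the trajectory
        Wc += (Xc.T @ Xc) * dt / c**2           # time quadrature of Phi(t)
    return Wc
```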

4.3. Control based on reduced order model

Applying the RBFNN solution to the control design of the reduced order model, the optimal policy for stabilisation is obtained as
(48) u(t) = −R_r^{-1} B_r^T D_g(x_r(t)) w_r,
where the subscript r indicates the reduced order model.

For linear systems, we can also explicitly write a control to track a reference trajectory x_ref^r(t) as
(49) u(t) = −R_r^{-1} B_r^T (D_g(x_r(t)) − D_g(x_ref^r(t))) w_r.
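A sketch of evaluating the reduced-order tracking control of Equation (49) at run time is given below; it reuses the gaussian_kernels helper from Section 3, and the projection of the full state onto the reduced coordinates follows the notation of Section 4.

```python
import numpy as np

def rbfnn_tracking_control(x, x_ref, Tr_inv, Br, Rr, wr, mu, sigma):
    """Tracking control of Equation (49) for the reduced order linear model."""
    def Dg(xr):
        # Gradient matrix D(x_r) built from Equation (23) for the reduced state
        h = gaussian_kernels(xr, mu, sigma)
        return -((xr[:, None] - mu.T) / sigma.T**2) * h
    xr, xr_ref = Tr_inv @ x, Tr_inv @ x_ref          # project onto reduced coordinates
    Rr_inv = np.linalg.inv(np.atleast_2d(Rr))
    return -Rr_inv @ Br.T @ (Dg(xr) - Dg(xr_ref)) @ wr
```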

5. Transfer learning

When the RBFNN control presented in the previous section is well trained using the data simulated with a sufficiently accurate model of the dynamic system, it exhibits good control performance. When there are model uncertainties, unknown hardware gains and external disturbances, the real time performance of the model-based RBFNN control may deteriorate. To improve the robustness of the RBFNN control in real time, we make use of the transfer learning concept to retrain the neural networks with the experimental data.

Transfer learning is a machine learning technique for fine tuning network weights (Guo et al., Citation2019) to improve performance when new data become available. It is common to freeze the weights in the hidden layers of the neural networks and retrain the weights in the output layer with the new data. In this study, we have only one hidden layer whose weights are retrained with the experimental data. The retraining is done by feeding the new data into Equation (30) and updating the network coefficients according to Equation (35) during the policy iteration. The system response x(t) collected from experiments is used to update the system dynamics and the performance index in the retraining, while parts of the system dynamics parameters remain model-based. The idea is to make use of the model of the system and introduce the experimental data to update the performance of the optimal controls. The concept of transfer learning is compatible with this approach, allowing knowledge to be transferred from one domain (model-based) to another (experimental).
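A minimal sketch of this retraining step is given below; it reuses the policy_iteration_step function sketched in Section 3, and the interface (array shapes, number of retraining iterations) is an illustrative assumption.

```python
import numpy as np

def retrain_with_experiment(wr, x_measured, Tr_inv, f_r, g_r, mu, sigma, Qr, Rr,
                            n_iters=5):
    """Transfer-learning style retraining of the output-layer weights.

    x_measured : (n_samples, n_full) states recorded on the hardware
    Tr_inv     : projection onto the reduced-order coordinates
    f_r, g_r   : model-based reduced-order dynamics, kept fixed during retraining
    """
    xr_exp = x_measured @ Tr_inv.T          # map experimental data to reduced coordinates
    for _ in range(n_iters):
        wr = policy_iteration_step(wr, xr_exp, f_r, g_r, mu, sigma, Qr, Rr)
    return wr
```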

The retrained RBFNN control is experimentally evaluated and compared with the original model-based RBFNN control. A flowchart in Figure 1 summarises the RBFNN control design procedure discussed in the previous sections.

Figure 1. Flowchart of the RBFNN optimal control algorithm with model reduction and transfer learning for linear systems.


6. RBFNN control for linear systems

The rotary pendulum is a benchmark dynamic system to study control performance. It is taken as a linear example for the RBFNN control validation due to its compatibility with numerical and experimental evaluation. The Quanser-Servo2 inverted pendulum system in Figure 2 is the hardware used to validate the performance of the algorithm. The system has two degrees of freedom. The corresponding state space is four dimensional, which makes the computation of the RBFNN control from the HJB equation more involved. Note that the RBFNN control is first applied to a simple 2D system to validate its performance on linear systems, as shown in the Appendix.

Figure 2. Quanser-Servo2 Inverted Pendulum system hardware setup (Quanser, Citation2022).


6.1. Numerical simulations

The linearised model of the rotary pendulum at the equilibrium point is given by
(50) J_r θ̈ + m_p l r α̈ = τ − b_r θ̇,
(51) J_p α̈ + m_p l r θ̈ + m_p g l α = −b_p α̇,
where θ is the angle of the rotary arm, α is the angle of the pendulum, and τ is the applied torque at the base of the rotary arm. J_r = m_r r²/3 is the moment of inertia of the rotary arm with respect to its axis of rotation, J_p = m_p L_p²/3 is the moment of inertia of the pendulum link relative to its axis of rotation, b_r and b_p are the viscous damping coefficients of the rotary arm and the pendulum, respectively, m_r is the mass of the rotary arm, m_p is the mass of the pendulum, L_p is the length of the pendulum and l = L_p/2. The nominal values of the rotary pendulum parameters provided by Quanser are listed in Table 1.

Table 1. Parameters of the rotary pendulum system.

The state equations of the linear model in Equations (50) and (51) with the nominal values of the parameters are given by
(52) ẋ(t) = [0, 0, 1.0000, 0; 0, 0, 0, 1.00; 0, 152.0057, 12.2542, 0.5005; 0, 264.3080, 12.1117, 0.8702] x(t) + [0; 0; 50.6372; 50.0484] τ,
(53) y(t) = [1, 0, 0, 0; 0, 1, 0, 0] x(t),
where x = [θ, α, θ̇, α̇]^T is the state vector of the rotary pendulum. The outputs are the angles of the rotary arm θ and the pendulum α. In simulations, it is assumed that all the system states are available for full state feedback control. In experiments, the derivatives of the angle measurements are obtained through a low pass filter (Quanser, Citation2022),
(54) H(s) = 50s/(s + 50).
Model-based balanced truncation and empirical balanced truncation are both applied to the Quanser inverted pendulum system to obtain a reduced order model. The empirical balanced truncation is designed for stable systems with steady state impulse responses. However, the empirical balanced truncation has been shown to be a suitable method for systems with unstable modes and non-homogeneous initial conditions due to its data-driven nature (Himpe, Citation2018). In support of this finding, numerical simulations have demonstrated the applicability of empirical balanced truncation to the Quanser inverted pendulum within a short simulation time t = 1.5 s in this study. In this case, due to the instability of the system, the reduced order model obtained from empirical balanced truncation is different from that of the model-based balanced truncation. Details of the reduced order model matrices from model-based and empirical gramians can be found in Table 2.

Table 2. Summary of reduced order model matrices from model-based gramians and empirical gramians.

The Hankel singular values of the rotary pendulum from the model-based gramians are shown in Table 3. The first two rows represent the unstable modes of the system, whose singular values are infinite. The last two rows represent the stable modes of the system with finite singular values. Table 3 suggests that the first three eigen-modes dominate the system. Hence, the order of the reduced order model is chosen to be r = 3 to approximate the original model for both balanced truncation and empirical balanced truncation.

Table 3. Hankel singular values of the rotary pendulum.

Figure 3 shows the relative output error e_m of the two reduced order models compared to the output of the original system in Equations (52) and (53) when the same input u_t(t) = e^{-100(t-0.2)²} is applied. e_m is defined as
(55) e_m = ∑_{i=1}^{ns} |y(i) − y_r(i)|² / ∑_{i=1}^{ns} |y(i)|²,
where ns is the number of integration steps and i is the discrete time index corresponding to the physical time t = iΔt. Δt is the integration time step, or the sample time in experiments. The relative output error of the empirical balanced truncation is less than that of the model-based balanced truncation, as seen in Figure 3. This indicates that the empirical gramians contain more accurate information than the model-based gramians.
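For completeness, the relative output error of Equation (55) can be computed as in the short sketch below (our own helper, not the paper's code).

```python
import numpy as np

def relative_output_error(y, y_r):
    """Relative output error e_m of Equation (55); y and y_r have shape (ns, p)."""
    return np.sum(np.abs(y - y_r) ** 2) / np.sum(np.abs(y) ** 2)
```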

Figure 3. Top: Relative output errors of the reduced order model by the balanced truncation and empirical balanced truncation. Below: The input signal.


The LQR control of the original system is taken as a benchmark to compare with the performance of the RBFNN controls designed with the reduced order models. The Q and R matrices for the LQR control of the original system and the reduced order models are given by
(56) Q = diag(5, 1, 1, 1),  Q_bt = diag(100, 1, 1),  Q_ebt = diag(100, 100, 30),  R = R_bt = R_ebt = 1.
The sampling region for the RBFNN method with the reduced order model is chosen as X_s = [−1, 1] × [−1, 1] × [−1, 1]. The means of the Gaussian neurons are uniformly distributed in the region X_g = [−4, 4] × [−4, 4] × [−4, 4]. The standard deviations of the Gaussian neurons are σ_1 = σ_2 = σ_3 = 2.6667. The number of neurons is NG = 4³ = 64 and the number of sampling points is Ns = 2³ NG. w_r denotes the trainable weights of the RBFNN for the reduced order model.

We first consider tracking control. The reference signal is chosen as
(57) x_ref = [θ_r, 0, 0, 0]^T.
The reference in the reduced order model space then reads
(58) x_ref^r = T_r^{-1} x_ref.
The optimal control is given by Equation (49). A root mean square (RMS) error is defined to quantify the performance of the tracking control,
(59) e_RMS = sqrt( (1/ns) ∑_{i=1}^{ns} (θ(i) − θ_r(i))² ).
We have chosen a time duration of 50 s and an integration step Δt = 0.001 s. Hence, the number of integration steps is ns = 50,000.

Consider a square wave reference. Figure 4 shows the performance of the controls under consideration. From this figure, it is clear that the RBFNN optimal control is able to track the desired trajectory with the different reduced order models and shows similar results compared to the LQR control of the original system. The RMS tracking errors of the RBFNN controls based on the model-based BT and empirical BT reduced order models and of the LQR are 3.13, 3.16 and 3.71, respectively. Figures 5 and 6 show the comparisons of the empirical RBFNN with the LQR controls designed with the empirical BT and the model-based BT.

Figure 4. Tracking the square wave of the rotary arm in simulations. In the legend, ‘LQR’ denotes the LQR control designed with the original model; ‘RBFNN’ denotes the RBFNN control designed with the model-based BT; ‘Empirical RBFNN’ denotes the RBFNN control designed with the empirical BT.


Figure 5. Comparisons of tracking response θ(t) of the rotary arm in simulations. In the legend, ‘RBFNN_Empirical’ denotes the RBFNN control designed with the empirical BT; ‘LQR_Empirical BT’ denotes the LQR control designed with the empirical BT; ‘LQR_Model-based BT’ denotes the LQR control designed with the model-based BT.



Figure 6. Comparisons of tracking response α(t) of the rotary arm in simulations. Legends are the same as in Figure 5.

6.2. A summary

The rotary pendulum model is inherently unstable. Finding an initial stabilising control that enables policy iteration convergence in the RBFNN control design using the original model is a challenging task. We have discovered that it is much easier to find an initial stabiliser for the reduced order system to make the policy iteration converge. This is an interesting phenomenon that deserves further investigation.

The RBFNN controls are pre-trained and can be implemented in the hardware. In the following, experimental data of the RBFNN controls are collected from the Quanser-Servo2 inverted pendulum and used to further improve the performance of these controls with the transfer learning technique.

6.3. Experimental validation

Two control experiments, pendulum balancing and rotary arm trajectory tracking, are carried out on the Quanser-Servo2 inverted pendulum system with the RBFNN optimal controllers. All the experiments are done with the Quanser HIL interface and MATLAB Simulink. The sample time is set to Δt = 0.002 s. The tests run for 60 s. The closed-loop responses tracking the square wave reference signal for the rotary arm are shown in Figure 7. Both RBFNN optimal controls deliver better tracking performance than the LQR control.


Figure 7. The closed-loop tracking responses θ(t) of the rotary arm of Quanser-Servo2. Legends are the same as in Figure 4.

Experimental data have been obtained to improve the control performance by retraining the neural networks. The transient responses of the system have been eliminated, and the remaining data are restricted to the time interval t ∈ [3, 12] s for the model-based BT design and t ∈ [3, 10] s for the empirical BT design. The effect of retraining on the balancing control is shown in Figure 8. It is clear that retraining the neural networks with experimental data benefits the control designed with the empirical BT the most.


Figure 8. The closed-loop responses θ(t) of the rotary arm under various controls for balancing the inverted pendulum of Quanser-Servo2. Top: Responses before retraining. Bottom: Responses after retraining. Legends are the same as in Figure 4.

Figure 9 shows the effect of retraining on the tracking control. Compared to Figure 7, the improvement in tracking the angle θ(t) is obvious. Moreover, the suppression of many spiky responses of the angle α(t) in Figure 10 is strong evidence of the benefits of retraining.


Figure 9. The closed-loop tracking response θ(t) of the rotary arm of Quanser-Servo2. Legends are the same as in Figure 4.


Figure 10. The closed-loop response α(t) of the pendulum in the rotary arm tracking control of Quanser-Servo2. Legends are the same as in Figure 4.

A measure of control effort is defined in terms of the integrated absolute control voltage v(t), which is given by
(60) u_effort = ∑_{i=1}^{ns} |v(i)|.
Table 4 lists the RMS tracking errors and the control effort with a consideration of the effects of retraining. It is worth noting that the RBFNN control designed with the empirical BT has the smallest RMS error.

Table 4. Summary of control performance for LQR, RBFNN, empirical RBFNN, retrained RBFNN and retrained empirical RBFNN.

The robustness of the balancing control to disturbances is also investigated through experimental evaluation. A torque disturbance d(t) is injected at t ∈ [5, 5.5] s by introducing a voltage pulse to the motor. Figure 11 shows that the retrained empirical RBFNN has superb robustness to the disturbance compared to the other two controls.


Figure 11. Robustness comparisons of all the controls under consideration. Top: The closed-loop angle response θ(t) of the rotary arm in balancing control of Quanser-Servo2. Bottom: Disturbance d(t). Legends are the same as in Figure 4.

7. Nonlinear controls

The RBFNN method is now applied to obtain optimal controls for nonlinear systems and to study the robustness of the RBFNN control with respect to model uncertainty.

7.1. A second order system

Neural networks with polynomial activation functions (Poly-NN) have been applied to find solutions of the HJB equation. The following has been a popular example in the literature (L. Chen et al., Citation2023; Du et al., Citation2024; Lin et al., Citation2024; Luo et al., Citation2014; Modares & Lewis, Citation2014; Qin et al., Citation2023; Yang et al., Citation2014). In the following, the performance of the optimal control expressed in terms of the RBFNN with Gaussian neurons is compared with that of the polynomial neural networks (Poly-NN). Consider a second order nonlinear system,
(61) ẋ_1 = x_1 + x_2 − x_1(x_1² + x_2²),  ẋ_2 = −x_1 + x_2 − x_2(x_1² + x_2²) + u.
We take the Poly-NN reported in Abu-Khalaf and Lewis (Citation2005) to compare with the proposed RBFNN control. No control constraints are imposed in the comparison. Both neural networks are trained in the sampling region X_s = [−1, 1] × [−1, 1] and applied to two regions X_s1 = [−1, 1] × [−1, 1] and X_s2 = [−2, 2] × [−2, 2] to check their ability to provide good control performance beyond the training region. This is a way to study the generalisation of the neural networks.

The Poly-NN has 24 neurons, while the RBFNN has NG = 5 × 5 = 25 neurons. The number of sampling points is Ns = 2² NG for both neural networks. The standard deviations of the Gaussian neurons are σ_1 = σ_2 = 0.75. To investigate the robustness of the controls, a disturbance d(t), shown in Figure 12, is applied to each state of the system:
(62) d(t) = 1 + g(t) for 10 ≤ t ≤ 11;  5 + g(t) for 20 ≤ t ≤ 21;  10 + g(t) for 30 ≤ t ≤ 31;  g(t) otherwise,
where g(t) is Gaussian white noise. Its signal-to-noise ratio (SNR) is 50 dB.
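A sketch of the disturbance signal of Equation (62) is shown below; the noise standard deviation used to realise the 50 dB SNR is an illustrative assumption.

```python
import numpy as np

_rng = np.random.default_rng(0)

def disturbance(t, noise_std=0.03):
    """Piecewise disturbance d(t) of Equation (62) with additive white noise g(t)."""
    if 10.0 <= t <= 11.0:
        base = 1.0
    elif 20.0 <= t <= 21.0:
        base = 5.0
    elif 30.0 <= t <= 31.0:
        base = 10.0
    else:
        base = 0.0
    return base + _rng.normal(scale=noise_std)
```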

Figure 12. Disturbances to the second order nonlinear system.


The closed-loop responses are shown in Figure 13. Within the training region X_s = [−1, 1] × [−1, 1], the RBFNN, Poly-NN and LQR controls have about the same performance when the system starts from the same initial condition x_0 = [0.5, 0.5]^T and is subject to the disturbance d(t). Note that the LQR control is designed based on the linearised system. When the disturbance becomes stronger, the RBFNN control can still stabilise the system with reasonable effort. However, the Poly-NN control performs poorly.

Figure 13. Performance comparison of RBFNN, Poly-NN and LQR controls for the nonlinear system in Equation (61). Top: The control u(t). Middle: The response x1(t). Bottom: The response x2(t).


This is due to the well-known fact that polynomials generate large extrapolation errors in regression applications. We compare both controls in the training region X_s1 and in a larger region X_s2 in Figure 14. As seen from the figure, the Poly-NN control grows unbounded quickly outside the training region and can no longer stabilise the system, while the RBFNN control remains bounded and can still stabilise the system in this larger region. As a matter of fact, the RBFNN control remains bounded and approaches zero at locations far away from the training region, because all the Gaussian neurons with means in the training region decay exponentially.


Figure 14. Comparison of spatial distribution of RBFNN and Poly-NN optimal controls u as a function of the state x. Left: The control u(x) plotted in the training region Xs1∈[−1,1]×[−1,1]. Right: The control u(x) plotted beyond the training region into the larger region Xs2∈[−2,2]×[−2,2].

7.2. Duffing system

Consider the Duffing system as another example to test the robustness of the RBFNN controller with respect to a changing parameter,
(63) [ẋ_1; ẋ_2] = [0, 1; 0, −0.1] [x_1; x_2] + [0; 1] u + [0; β] x_1³.
The sampling region is X_s = [−2, 2] × [−8, 8]. Corresponding to the sampling region, the means of the Gaussian neurons are uniformly distributed in the region X_g = [−3, 3] × [−9, 9].

We take β = 2 as the nominal value and allow β to vary in a known range 1 ≤ β ≤ 6. Both the RBFNN and LQR controls are designed with the nominal value of β. Note that the LQR control is designed for the system linearised at β = 2. The performance index J of the LQR design is defined as
(64) J(x, u) = (1/2) ∑_{k=0}^{ns} [x^T(k) Q x(k) + R u²(k)],
where k is the discrete time index and ns is the number of integration steps of the simulation. The matrices are given by
(65) Q = diag(10, 10),  R = 1.
We have chosen ns = 1000. The integration time step is Δt = 0.01 s. We simulate the closed-loop system starting from the same initial condition x_0 = [0.7, 0.7]^T.
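The closed-loop simulation and the accumulation of J in Equation (64) can be sketched as below; a simple forward-Euler integrator is used here for illustration and need not match the integrator used in the paper.

```python
import numpy as np

def closed_loop_cost(dynamics, control, x0, Q, R, dt=0.01, ns=1000):
    """Simulate x_dot = dynamics(x, u) with u = control(x) and accumulate the
    performance index J of Equation (64)."""
    x = np.asarray(x0, dtype=float)
    J = 0.0
    for _ in range(ns):
        u = float(control(x))
        J += 0.5 * (x @ Q @ x + R * u ** 2)
        x = x + dt * dynamics(x, u)     # forward-Euler step
    return J
```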

The index J is used to evaluate the performance of both the RBFNN and LQR controls. Figure 15 shows the performance index J as a function of β for both the RBFNN and LQR controls. When the actual value of β is less than the nominal value 2, the linearised model of the system remains valid. As a result, the LQR control is slightly better than the RBFNN control. However, when β > 2, the linearised model at β = 2 becomes less accurate, resulting in poor performance of the LQR control compared with the RBFNN control. When β reaches a critical value β_cr, marked by the vertical lines in the figure, the closed-loop system becomes unstable and the performance index J grows unbounded. For the LQR control, β_cr = 4.1579, while for the RBFNN control, β_cr = 4.5263. This is a quantitative measure of the robustness to model uncertainty.

Figure 15. Robustness of the RBFNN and LQR controls with respect to the model uncertainty β. The vertical dashed lines mark the critical value of β, beyond which the closed-loop system becomes unstable.


Finally, we should point out that the RBFNN with one hidden layer can find optimal controls with good performance and robustness for nonlinear dynamic systems. The design of the RBFNN itself is not optimised in this study. However, there are studies on how to design the RBFNN in the literature, such as the ErrCor algorithm (Yu et al., Citation2011), the hierarchical growing strategy (GGAP algorithm) (Sundararajan et al., Citation1999), the MRAN algorithm (Kadirkamanathan & Niranjan, Citation1993), the resource allocation algorithm (Platt, Citation1991) and adaptive RBF algorithms (Alwardi et al., Citation2012, Citation2013; González-Casanova et al., Citation2009). These algorithms can be considered for the optimal design of the RBFNN architecture in future studies.

Remark 7.1

This paper proposes a method to compute the solutions of nonlinear optimal control expressed in terms of the RBFNN with Gaussian activation functions for multi-degree-of-freedom dynamic systems. Combined with the model reduction and transfer learning algorithms, the computational efficiency and the control performance are significantly improved. Extensive numerical simulations and experimental studies show that the RBFNN offers excellent control performance and much-improved robustness to disturbances and model uncertainties. It is worth mentioning that the RBFNN control delivers the same performance as the LQR control for the LTI system. Our previous work (Zhao & Sun, Citation2022) indicates that the BT reduction for the LTI system leads to significant performance improvements compared to the LQR control. For nonlinear systems, such as the Duffing system in Section 7.2, the nonlinear optimal control solution from the RBFNN shows improved robustness compared to the LQR control designed for the linearised system.

Remark 7.2

Our early work (Gholami et al., Citation2023) indicates that other neural networks, such as the multi-layer perceptron, are not as suitable for solving the HJB equation. In that paper (Gholami et al., Citation2023), we adopted a two-layer perceptron neural network with hyperbolic tangent functions to approximate the derivative of the value function in the HJB equation. However, it is very time-consuming to solve for the optimal control. Additionally, the structure of the multi-layer perceptron does not lend itself to the explicit expression of the optimal coefficients in Equation (35), which is one of the reasons why the proposed RBFNN method is efficient and accurate.

8. Conclusion

This paper has presented an RBFNN optimal control approach with Gaussian neurons. Instead of analytically solving the HJB equation, the RBFNN computes the optimal control solution with high efficiency. The RBFNN optimal controls are obtained off-line in the state space for both linear and nonlinear dynamic systems. Extensive simulations and experimental studies have been presented and suggest that the RBFNN optimal control can deliver excellent performance and robustness. We have also implemented model reduction techniques to reduce the computational burden of the RBFNN design when solving HJB equations in high dimensional state space, and explored the transfer learning concept to update the coefficients of the neural networks with new experimental data, leading to better control performance in experiments. The extensive numerical simulations and experimental validations demonstrate that neural networks with one hidden layer have significant potential for different electro-mechanical systems, particularly high-dimensional nonlinear dynamical systems. The work also demonstrates that, with a limited number of neurons, the RBFNN can be successfully implemented on micro-controller hardware. Moreover, a comparative analysis between the RBFNN and other similar machine learning algorithms, such as the LANN-SVD algorithm, can be an interesting research topic for the future.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This material is based upon work supported by the AI Research Institutes program supported by NSF and USDA-NIFA under the AI Institute: Agricultural AI for Transforming Workforce and Decision Support (AgAID) [award number 2021-67021-35344]. Partial support to the work was from a National Natural Science Foundation of China [grant number 11972070].

References

  • Abu-Khalaf, M., & Lewis, F. L. (2005). Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach. Automatica, 41(5), 779–791. https://doi.org/10.1016/j.automatica.2004.11.034
  • Alwardi, H., Wang, S., & Jennings, L. S. (2013). An adaptive domain decomposition method for the Hamilton–Jacobi–Bellman equation. Journal of Global Optimization, 56(4), 1361–1373. https://doi.org/10.1007/s10898-012-9850-2
  • Alwardi, H., Wang, S., Jennings, L. S., & Richardson, S. (2012). An adaptive least-squares collocation radial basis function method for the HJB equation. Journal of Global Optimization, 52(2), 305–322. https://doi.org/10.1007/s10898-011-9667-4
  • Borovykh, A., Kalise, D., Laignelet, A., & Parpas, P. (2022). Data-driven initialization of deep learning solvers for Hamilton–Jacobi–Bellman PDEs. IFAC-PapersOnLine, 55(30), 168–173. https://doi.org/10.1016/j.ifacol.2022.11.047
  • Chen, J.-S., Hillman, M., & Chi, S.-W. (2017). Meshfree methods: Progress made after 20 years. Journal of Engineering Mechanics, 143(4), 04017001. https://doi.org/10.1061/(ASCE)EM.1943-7889.0001176
  • Chen, L., Dong, C., & Dai, S.-L. (2023). Adaptive optimal consensus control of multiagent systems with unknown dynamics and disturbances via reinforcement learning. IEEE Transactions on Artificial Intelligence. https://doi.org/10.1109/TAI.2023.3324895
  • Cheng, T., Lewis, F. L., & Abu-Khalaf, M. (2007a). A neural network solution for fixed-final time optimal control of nonlinear systems. Automatica, 43(3), 482–490. https://doi.org/10.1016/j.automatica.2006.09.021
  • Cheng, T., Lewis, F. L., & Abu-Khalaf, M. (2007b). Fixed-final-time-constrained optimal control of nonlinear systems using neural network HJB approach. IEEE Transactions on Neural Networks, 18(6), 1725–1737. https://doi.org/10.1109/TNN.2007.905848
  • Crandall, M. G., & Lions, P.-L. (1983). Viscosity solutions of Hamilton–Jacobi equations. Transactions of the American Mathematical Society, 277(1), 1–42. https://doi.org/10.1090/tran/1983-277-01
  • Darbon, J., Dower, P. M., & Meng, T. (2021). Neural network architectures using min plus algebra for solving certain high dimensional optimal control problems and Hamilton–Jacobi PDEs. Preprint. arXiv:2105.03336.
  • Darbon, J., & Osher, S. (2016). Algorithms for overcoming the curse of dimensionality for certain Hamilton–Jacobi equations arising in control theory and elsewhere. Research in the Mathematical Sciences, 3(1), 1–26. https://doi.org/10.1186/s40687-016-0068-7
  • Djeridane, B., & Lygeros, J. (2006). Neural approximation of PDE solutions: An application to reachability computations. In Proceedings of the 45th IEEE Conference on Decision and Control (pp. 3034–3039). San Diego, CA, USA: IEEE.
  • Doya, K. (2000). Reinforcement learning in continuous time and space. Neural Computation, 12(1), 219–245. https://doi.org/10.1162/089976600300015961
  • Du, Y., Jiang, B., & Ma, Y. (2024). Adaptive optimal sliding-mode fault-tolerant control for nonlinear systems with disturbances and estimation errors. Complex & Intelligent Systems, 10, 1087–1101.
  • Gawronski, W., & Juang, J.-N. (1990). Model reduction in limited time and frequency intervals. International Journal of Systems Science, 21(2), 349–376. https://doi.org/10.1080/00207729008910366
  • Gholami, A., Sun, J.-Q., & Ehsani, R. (2023). Neural networks based optimal tracking control of a delta robot with unknown dynamics. International Journal of Control, Automation and Systems, 21(10), 3382–3390. https://doi.org/10.1007/s12555-022-0745-9
  • González-Casanova, P., Muñoz-Gómez, J. A., & Rodríguez-Gómez, G. (2009). Node adaptive domain decomposition method by radial basis functions. Numerical Methods for Partial Differential Equations: An International Journal, 25(6), 1482–1501. https://doi.org/10.1002/num.v25:6
  • Greif, C. (2017). Numerical methods for Hamilton–Jacobi–Bellman equations [PhD thesis]. University of Wisconsin–Milwaukee.
  • Grundel, S., Himpe, C., & Saak, J. (2019). On empirical system gramians. PAMM, 19(1), e201900006. https://doi.org/10.1002/pamm.v19.1
  • Gugercin, S., & Antoulas, A. C. (2004). A survey of model reduction by balanced truncation and some new results. International Journal of Control, 77(8), 748–766. https://doi.org/10.1080/00207170410001713448
  • Guo, Y., Shi, H., Kumar, A., Grauman, K., Rosing, T., & Feris, R. (2019). Spottune: Transfer learning through adaptive fine-tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4805–4814). Long Beach, CA, USA.
  • Han, J., Jentzen, A., & E, W. (2018). Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences, 115(34), 8505–8510. https://doi.org/10.1073/pnas.1718942115
  • Heinkenschloss, M., Reis, T., & Antoulas, A. C. (2011). Balanced truncation model reduction for systems with inhomogeneous initial conditions. Automatica, 47(3), 559–564. https://doi.org/10.1016/j.automatica.2010.12.002
  • Himpe, C. (2018). Emgr – the empirical gramian framework. Algorithms, 11(7), 91. https://doi.org/10.3390/a11070091
  • Huang, Y., & Kramer, B. (2020). Balanced reduced-order models for iterative nonlinear control of large-scale systems. IEEE Control Systems Letters, 5(5), 1699–1704. https://doi.org/10.1109/LCSYS.7782633
  • Jiang, F., Chou, G., Chen, M., & Tomlin, C. J. (2016). Using neural networks to compute approximate and guaranteed feasible Hamilton–Jacobi–Bellman PDE solutions. Preprint. arXiv:1611.03158.
  • Kadirkamanathan, V., & Niranjan, M. (1993). A function estimation approach to sequential learning with neural networks. Neural Computation, 5(6), 954–975. https://doi.org/10.1162/neco.1993.5.6.954
  • Lagaris, I. E., Likas, A. C., & Papageorgiou, D. G. (2000). Neural-network methods for boundary value problems with irregular boundaries. IEEE Transactions on Neural Networks, 11(5), 1041–1049. https://doi.org/10.1109/72.870037
  • Laub, A., Heath, M., Paige, C., & Ward, R. (1987). Computation of system balancing transformations and other applications of simultaneous diagonalization algorithms. IEEE Transactions on Automatic Control, 32(2), 115–122. https://doi.org/10.1109/TAC.1987.1104549
  • Lewis, F., Jagannathan, S., & Yesildirak, A. (1998). Neural network control of robot manipulators and non-linear systems. CRC Press.
  • Lin, J., Zhao, B., Liu, D., & Wang, Y. (2024). Dynamic compensator-based near-optimal control for unknown nonaffine systems via integral reinforcement learning. Neurocomputing, 564, 126973. https://doi.org/10.1016/j.neucom.2023.126973
  • Liu, D., Li, H., & Wang, D. (2014). Online synchronous approximate optimal learning algorithm for multi-player non-zero-sum games with unknown dynamics. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 44(8), 1015–1027. https://doi.org/10.1109/TSMC.2013.2295351
  • Luo, B., Wu, H.-N., Huang, T., & Liu, D. (2014). Data-based approximate policy iteration for affine nonlinear continuous-time optimal control design. Automatica, 50(12), 3281–3290. https://doi.org/10.1016/j.automatica.2014.10.056
  • Medagam, P. V., & Pourboghrat, F. (2009). Optimal control of nonlinear systems using RBF neural network and adaptive extended Kalman filter. In American Control Conference (pp. 355–360). St. Louis, MO, USA: IEEE.
  • Meyn, S. (2022). Control systems and reinforcement learning. Cambridge University Press.
  • Modares, H., & Lewis, F. L. (2014). Optimal tracking control of nonlinear partially-unknown constrained-input systems using integral reinforcement learning. Automatica, 50(7), 1780–1792. https://doi.org/10.1016/j.automatica.2014.05.011
  • Modares, H., Lewis, F. L., & Naghibi-Sistani, M.-B. (2014). Integral reinforcement learning and experience replay for adaptive optimal control of partially-unknown constrained-input continuous-time systems. Automatica, 50(1), 193–202. https://doi.org/10.1016/j.automatica.2013.09.043
  • Munos, R., Baird, L. C., & Moore, A. W. (1999). Gradient descent approaches to neural-net-based solutions of the Hamilton–Jacobi–Bellman equation. In Proceedings of International Joint Conference on Neural Networks (Vol. 3, pp. 2152–2157). Washington, DC, USA: IEEE.
  • Nakamura-Zimmerer, T., Gong, Q., & Kang, W. (2020). A causality-free neural network method for high-dimensional Hamilton–Jacobi–Bellman equations. In 2020 American Control Conference (pp. 787–793). Denver, CO, USA: IEEE.
  • Nakamura-Zimmerer, T., Gong, Q., & Kang, W. (2021). Adaptive deep learning for high-dimensional Hamilton–Jacobi–Bellman equations. SIAM Journal on Scientific Computing, 43(2), A1221–A1247. https://doi.org/10.1137/19M1288802
  • Nakamura-Zimmerer, T. E. (2022). A deep learning framework for optimal feedback control of high-dimensional nonlinear systems [Thesis]. UC Santa Cruz.
  • Park, J., & Sandberg, I. W. (1991). Universal approximation using radial-basis-function networks. Neural Computation, 3(2), 246–257. https://doi.org/10.1162/neco.1991.3.2.246
  • Pernebo, L., & Silverman, L. (1982). Model reduction via balanced state space representations. IEEE Transactions on Automatic Control, 27(2), 382–387. https://doi.org/10.1109/TAC.1982.1102945
  • Phillips, J., Daniel, L., & Silveira, L. M. (2002). Guaranteed passive balancing transformations for model order reduction. In Proceedings of the 39th Annual Design Automation Conference (pp. 52–57). New Orleans, LA, USA.
  • Platt, J. (1991). A resource-allocating network for function interpolation. Neural Computation, 3(2), 213–225. https://doi.org/10.1162/neco.1991.3.2.213
  • Qin, C., Qiao, X., Wang, J., Zhang, D., Hou, Y., & Hu, S. (2023). Barrier-critic adaptive robust control of nonzero-sum differential games for uncertain nonlinear systems with state constraints. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 54, 50–63.
  • Quanser (2022). QUBE-Servo2. https://www.quanser.com/products/qube-servo-2/
  • Saridis, G. N., & Lee, C.-S. G. (1979). An approximation theory of optimal control for trainable manipulators. IEEE Transactions on Systems, Man, and Cybernetics, 9(3), 152–159. https://doi.org/10.1109/TSMC.1979.4310171
  • Silverman, B. (2018). Density estimation for statistics and data analysis. CRC Press.
  • Singh, A. K., & Hahn, J. (2005). On the use of empirical gramians for controllability and observability analysis. In Proceedings of the American Control Conference (pp. 140–141). Portland, OR, USA: IEEE.
  • Sirignano, J., & Spiliopoulos, K. (2018). DGM: A deep learning algorithm for solving partial differential equations. Journal of Computational Physics, 375, 1339–1364. https://doi.org/10.1016/j.jcp.2018.08.029
  • Stefansson, E., & Leong, Y. P. (2016). Sequential alternating least squares for solving high dimensional linear Hamilton–Jacobi–Bellman equation. In Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 3757–3764). Daejeon, Korea (South).
  • Sun, J. Q., & Hsu, C. S. (1988). A statistical study of generalized cell mapping. Journal of Applied Mechanics, 55, 694–701. https://doi.org/10.1115/1.3125851
  • Sundararajan, N., Saratchandran, P., & Lu, Y. W. (1999). Radial basis function neural networks with sequential learning: MRAN and its applications (Vol. 11). World Scientific.
  • Tassa, Y., & Erez, T. (2007). Least squares solutions of the HJB equation with neural network value-function approximators. IEEE Transactions on Neural Networks, 18(4), 1031–1041. https://doi.org/10.1109/TNN.2007.899249
  • Wang, X., Jiang, J., Hong, L., & Sun, J.-Q. (2022a). First-passage problem in random vibrations with radial basis function neural networks. Journal of Vibration and Acoustics, 144(5), 051014. https://doi.org/10.1115/1.4054437
  • Wang, X., Jiang, J., Hong, L., & Sun, J.-Q. (2022b). Stochastic bifurcations and transient dynamics of probability responses with radial basis function neural networks. International Journal of Non-Linear Mechanics, 147, 104244. https://doi.org/10.1016/j.ijnonlinmec.2022.104244
  • Wang, X., Jiang, J., Hong, L., Zhao, A., & Sun, J.-Q. (2023). Radial basis function neural networks solution for stationary probability density function of nonlinear stochastic systems. Probabilistic Engineering Mechanics, 71, 103408. https://doi.org/10.1016/j.probengmech.2022.103408
  • Wu, Y., Hamroun, B., Le Gorrec, Y., & Maschke, B. (2020). Reduced order LQG control design for infinite dimensional port Hamiltonian systems. IEEE Transactions on Automatic Control, 66(2), 865–871. https://doi.org/10.1109/TAC.9
  • Wu, Y., Wang, H., Zhang, B., & Du, K.-L. (2012). Using radial basis function networks for function approximation and classification. ISRN Applied Mathematics, 2012, 1–34. https://doi.org/10.5402/2012/324194
  • Yang, X., Liu, D., & Wang, D. (2014). Reinforcement learning for adaptive optimal control of unknown continuous-time nonlinear systems with input constraints. International Journal of Control, 87(3), 553–566. https://doi.org/10.1080/00207179.2013.848292
  • Yegorov, I., & Dower, P. M. (2021). Perspectives on characteristics based curse-of-dimensionality-free numerical approaches for solving Hamilton–Jacobi equations. Applied Mathematics & Optimization, 83(1), 1–49. https://doi.org/10.1007/s00245-018-9509-6
  • Yu, H., Xie, T., Paszczyñski, S., & Wilamowski, B. M. (2011). Advantages of radial basis function networks for dynamic system design. IEEE Transactions on Industrial Electronics, 58(12), 5438–5450. https://doi.org/10.1109/TIE.2011.2164773
  • Zhang, D., Liu, W., Qin, C., & Chen, H. (2016). Adaptive RBF neural-networks control for discrete nonlinear systems based on data. In Proceedings of the 12th World Congress on Intelligent Control and Automation (pp. 2580–2585). Guilin, China: IEEE.
  • Zhang, W., Shen, J., Ye, X., & Zhou, S. (2022). Error model-oriented vibration suppression control of free-floating space robot with flexible joints based on adaptive neural network. Engineering Applications of Artificial Intelligence, 114, 105028. https://doi.org/10.1016/j.engappai.2022.105028
  • Zhao, A., & Sun, J.-Q. (2022). Control for stability improvement of high-speed train bogie with a balanced truncation reduced order model. Vehicle System Dynamics, 60(12), 4343–4363. https://doi.org/10.1080/00423114.2021.2025408

Appendix. A linear 2D system

In this section, we adopt a linear 2D system as an example to validate the performance of the RBFNN control presented in Sections 2 and 3. Note that the model reduction algorithm is not applied, since the system is of low dimension and its computational cost is already low. The linear 2D system is defined as
\begin{equation}
\begin{bmatrix} \dot{x}_1 \\ \dot{x}_2 \end{bmatrix}
= \begin{bmatrix} 0 & 1 \\ -\dfrac{k}{m} & -\dfrac{c}{m} \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
+ \begin{bmatrix} 0 \\ 1 \end{bmatrix} u, \tag{A1}
\end{equation}
where m = 1, c = 2 and k = 1. The region over which the Gaussian neurons are distributed is chosen as \(X_1^g \times X_2^g = [-2, 2] \times [-6, 6]\), and the region from which points are sampled for integration is chosen as \(X_1^s \times X_2^s = [-1, 1] \times [-3, 3]\). The number of neurons is \(N_G \times N_G = 9 \times 9\) and the number of sampling points is \(2N_G \times 2N_G\). The weighting matrices are chosen as
\begin{equation}
Q = \begin{bmatrix} 10 & 0 \\ 0 & 10 \end{bmatrix}, \quad R = 1. \tag{A2}
\end{equation}
The results are shown in Figure A1. From Figure A1, we can see that the RBFNN control recovers exactly the same optimal solution as the LQR for the linear 2D system.
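To make the appendix example concrete, the following Python sketch sets up the LQR benchmark for system (A1) with the weights in (A2), lays out the neuron and sampling grids described above, and simulates the closed loop from x(0) = [1, 1]^T. This is a minimal sketch under stated assumptions, not the authors' implementation: the RBFNN value-function computation and policy iteration are omitted, and the SciPy Riccati solution is used only to reproduce the LQR comparison curve of Figure A1.

```python
import numpy as np
from scipy.linalg import solve_continuous_are
from scipy.integrate import solve_ivp

# Reconstructed mass-spring-damper system of Equation (A1): m = 1, c = 2, k = 1.
m, c, k = 1.0, 2.0, 1.0
A = np.array([[0.0, 1.0],
              [-k / m, -c / m]])
B = np.array([[0.0],
              [1.0]])

# Quadratic cost weights of Equation (A2).
Q = np.diag([10.0, 10.0])
R = np.array([[1.0]])

# LQR benchmark: solve the continuous algebraic Riccati equation for P
# and form the state-feedback gain K = R^{-1} B^T P.
P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)

# Gaussian neuron centres on a 9 x 9 grid over [-2, 2] x [-6, 6] and
# 18 x 18 integration samples over [-1, 1] x [-3, 3], as stated in the appendix.
NG = 9
centres = np.array(np.meshgrid(np.linspace(-2, 2, NG),
                               np.linspace(-6, 6, NG))).reshape(2, -1).T
samples = np.array(np.meshgrid(np.linspace(-1, 1, 2 * NG),
                               np.linspace(-3, 3, 2 * NG))).reshape(2, -1).T

# Closed-loop LQR simulation from x(0) = [1, 1]^T for comparison with Figure A1.
def closed_loop(t, x):
    u = -K @ x
    return (A @ x + B @ u).ravel()

sol = solve_ivp(closed_loop, (0.0, 10.0), [1.0, 1.0], max_step=0.01)
print("LQR gain K =", K)
```

The RBFNN controller trained on the 9 x 9 neuron grid would be compared against the trajectories stored in `sol.y`; for this linear system the two controls coincide, as Figure A1 shows.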

Figure A1. Comparison of RBFNN and LQR control performances for the linear 2D system. Top: Control u(t). Middle: Response x1(t). Bottom: Response x2(t). The initial condition of the system is x(0) = [1, 1]^T.
