Research Article

Personalized control system via reinforcement learning: maximizing utility based on user ratings

Tomotaka Nii & Masaki Inoue
Pages 18-26 | Received 08 Sep 2022, Accepted 27 Dec 2022, Published online: 21 Jan 2023

Abstract

In this paper, we address the design of personalized control systems, which pursue individual objectives defined for each user. To this end, a reinforcement learning problem is formulated in which an individual objective function is estimated from the user's rating of his/her current control system, and the corresponding optimal controller is updated. The novelty of the problem setting lies in the modelling of the user rating. The rating is modelled as a quantization of the user utility gained from the control system, defined by the value of the objective function evaluated at the user's control experience. We propose an estimation algorithm to update the control law. Through a numerical experiment, we show that the proposed algorithm realizes a personalized control system.

1. Introduction

Robust design has been a basic design concept of control systems: common control systems are designed for various system-users such that the systems operate stably regardless of differences between user environments. Beyond robustness, “personalization” can be an advanced design concept of control systems: an individual control system is designed for each system-user such that the system pursues high control performance while improving user utility [Citation1,Citation2]. In this paper, we address the personalization of control systems, in particular, a design methodology for control systems equipped with a function of adaptation that improves user utility.

As studied in other research fields, a key to personalization is the modelling of system-users [Citation3,Citation4]. For example, in [Citation5,Citation6], data on manual driving is used to model driver intent. In these previous works, user modelling is based on measured data of the user's actions on the control system. In this paper, by contrast, the data on the actions is not available; instead, the result of a user rating of the control system is available. The user rating is collected for every control operation and is used to estimate the private objective function of the user, and the optimal controller is updated based on the estimated objective function. By repeating the controller update, we aim at maximizing the user utility gained from the control system.

As an application of the presented personalized control system, let us imagine an automatic driving system. The system drives automatically, and the user rates the driving control system based on his/her comfort, for example, once a month. The control system accesses the result of the user rating to estimate his/her preference, modelled by a parameterized objective function. Then, the implemented control law is updated based on the estimated objective function.

In general, the problem of improving system performance by learning techniques is called reinforcement learning (RL), which has been addressed extensively in the literature; see e.g. [Citation7] and the references therein. RL has been applied to the design of feedback control systems [Citation8–11]. In [Citation8,Citation9], continuous control problems are addressed, unlike standard RL problems, and the control law is updated directly. Furthermore, in [Citation10,Citation11], RL is combined with model predictive control, and the objective function is tuned to update the optimal control law indirectly. The main difference between this paper and the literature is the assumption on the objective function and/or reward. In the literature, the objective function is designable, while in this paper, the function is not designable: it is pre-defined but hidden by the system-user.

The rest of this paper is organized as follows. In Section 2, the models of control systems and system-user are given, and the problem of the control system update is formulated. In Section 3, we propose an algorithm of estimating the user objective function, which plays a central role in personalization. In Section 4, we give a convergence analysis of the proposed algorithm. In Section 5, we present a numerical experiment using the proposed algorithm. In Section 6, the conclusion is given.

2. Problem formulation

2.1. Model of control systems

We consider a control system that is composed of a plant system and a controller, which are denoted by $P$ and $K$, respectively. The plant system is modelled by the following discrete-time state space equation
$$P:\quad x(k+1) = h(x(k), u(k)),$$
where $k \in \mathbb{N}_+$ is the discrete time, and $x \in \mathbb{R}^n$ and $u \in \mathbb{R}^m$ are the state and control input, respectively. The symbol $h:\mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$ represents a function.

The controller is designed based on the estimate of an individual objective function, which is private and is defined for each user. Let $J$ and $\hat{J}$ represent the objective function and its estimate, respectively. Then, the control law is described by the following optimization problem
$$K:\quad \begin{cases} \min_{u}\ \hat{J}(x, u) \\ \text{s.t.}\ x(k+1) = h(x(k), u(k)),\ \forall k,\\ \phantom{\text{s.t.}}\ (x(k), u(k)) \in \mathcal{X} \times \mathcal{U},\ \forall k, \end{cases}$$
where $x$ and $u$ are stacked vectors composed of the sequences of the state and input, respectively, i.e.
$$x = [x(k+1), x(k+2), \ldots, x(k+N)],\qquad u = [u(k), u(k+1), \ldots, u(k+N-1)],$$
and $\mathcal{X}$ and $\mathcal{U}$ are the state and input constraints, respectively. In the following discussion, $\{x^{(d)}, u^{(d)}\}$ represents the measured data on $x$ and $u$, obtained in a control experiment and called the experience of the control system.

In addition to $P$ and $K$, we should note that a system-user participates in the control system. His/her objective function is modelled by
$$J(x, u) = f_0(x, u) + \sum_{i=1}^{\ell} q_i f_i(x, u), \tag{1}$$
where $f_i(x, u): \mathbb{R}^{nN} \times \mathbb{R}^{mN} \to \mathbb{R}_+$, $i \in \{0, 1, \ldots, \ell\}$ are non-negative functions defined by a system-designer and $q_i > 0$, $i \in \{1, 2, \ldots, \ell\}$ are weighting parameters to be estimated. As in (1), the user's objective is parameterized by $q_i$, $i \in \{1, 2, \ldots, \ell\}$. The system-user has his/her objective function (1) in mind, and the weighting parameters $q_i$, $i \in \{1, 2, \ldots, \ell\}$ are not directly accessible for the control system update. Instead, the user rating, which includes information on $q_i$, $i \in \{1, 2, \ldots, \ell\}$, is available for the update.
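To make these ingredients concrete, the following Python sketch illustrates one way the controller $K$ and the parameterized objective (1) could be realized numerically. The plant $h$, the basis functions $f_0, f_1, f_2$, the horizon, and the input bound are hypothetical placeholders rather than quantities taken from this paper, and the state constraint $\mathcal{X}$ is omitted for brevity.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical plant dynamics h(x, u); the paper leaves h general.
def h(x, u):
    A = np.array([[1.0, 0.1], [0.0, 1.0]])
    B = np.array([[0.0], [0.1]])
    return A @ x + B @ u

# Hypothetical non-negative basis functions f0, f1, f2 on stacked trajectories.
def f0(x_seq, u_seq):
    return float(np.sum(u_seq ** 2))          # control effort

def f1(x_seq, u_seq):
    return float(np.sum(x_seq[:, 0] ** 2))    # regulation of the first state

def f2(x_seq, u_seq):
    return float(np.sum(x_seq[:, 1] ** 2))    # regulation of the second state

def rollout(x0, u_seq):
    """Simulate x(k+1) = h(x(k), u(k)) over the whole input sequence."""
    xs, x = [], x0
    for u in u_seq:
        x = h(x, u)
        xs.append(x)
    return np.array(xs)

def J_estimated(x0, u_flat, q_hat, N, m):
    """Estimated objective Jhat = f0 + qhat_1 f1 + qhat_2 f2, as in (1)."""
    u_seq = u_flat.reshape(N, m)
    x_seq = rollout(x0, u_seq)
    return f0(x_seq, u_seq) + q_hat[0] * f1(x_seq, u_seq) + q_hat[1] * f2(x_seq, u_seq)

def controller_K(x0, q_hat, N=10, m=1, u_max=1.0):
    """Controller K: minimize the estimated objective over the input sequence,
    with a box input constraint U = [-u_max, u_max]."""
    u0 = np.zeros(N * m)
    bounds = [(-u_max, u_max)] * (N * m)
    res = minimize(lambda u: J_estimated(x0, u, q_hat, N, m), u0,
                   method="SLSQP", bounds=bounds)
    return res.x.reshape(N, m)

u_opt = controller_K(x0=np.array([1.0, 0.0]), q_hat=np.array([1.0, 1.0]))
print(u_opt[0])   # first input of the optimized sequence
```

The only structural assumptions carried over from the text are that $K$ optimizes the estimated objective subject to the plant dynamics and input constraints, and that the objective is an affine combination of known basis functions with unknown weights.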

In this paper, we aim at maximizing the user's utility achieved in the control system $(P, K)$ by updating $K$. To this end, the individual and private objective function $J$ is estimated, and the control law $K$ is updated based on the estimate $\hat{J}$. In other words, such “personalization” of the control system is achieved by an accurate estimate of $q_i$, $i \in \{1, 2, \ldots, \ell\}$. The estimation of $J$, i.e. that of $q_i$, $i \in \{1, 2, \ldots, \ell\}$, is based on the user rating without accessing the bare data $\{x^{(d)}, u^{(d)}\}$, unlike the conventional works [Citation10,Citation11].

2.2. Model of system-user

In the problem setting, we explicitly take into account the presence of a system-user. The user rates the control system $(P, K)$ based on the utility determined by his/her experience, denoted by $J(x^{(d)}, u^{(d)})$. An example of the user rating is realized by a questionnaire: the user gives an $m$-grade evaluation based on his/her satisfaction with the control system, as illustrated in Figure 1.

Figure 1. An example of user rating: five-grade evaluation is given to the control system, such as Excellent/Very Good/Good/Average/Poor.


Recall that in the control system $(P, K)$, the controller $K$ pursues performance in the sense of the estimated objective function $\hat{J}$, parameterized by $\hat{q}_i$, $i \in \{1, 2, \ldots, \ell\}$, to give the experience $\{x^{(d)}, u^{(d)}\}$ to the user. There exists a gap between the estimated utility $\hat{J}(x^{(d)}, u^{(d)})$ and the true utility $J(x^{(d)}, u^{(d)})$. We assume that the user rating depends on the gap: the user rates the control system high (low) if the gap is small (large).

To model the user rating, we define the gap in the utility gained from the experience $\{x^{(d)}, u^{(d)}\}$ as
$$E_J(x^{(d)}, u^{(d)}) = \frac{\hat{J}(x^{(d)}, u^{(d)}) - J(x^{(d)}, u^{(d)})}{J(x^{(d)}, u^{(d)})}. \tag{2}$$
Based on (2), the user rating $r$ is modelled by a piecewise constant function as
$$r = \begin{cases} M_1 & \text{if } |E_J| \in [0, a_1] \\ M_2 & \text{if } |E_J| \in (a_1, a_2] \\ \quad\vdots & \\ M_{m-1} & \text{if } |E_J| \in (a_{m-2}, a_{m-1}] \\ M_m & \text{if } |E_J| \in (a_{m-1}, \infty), \end{cases} \tag{3}$$

where $M_i$, $i \in \{1, 2, \ldots, m\}$ are positive constants satisfying $M_1 > M_2 > \cdots > M_m$, and $a_i$, $i \in \{1, 2, \ldots, m-1\}$ are also positive constants that indicate the ranges of $|E_J|$. One can assume the rating as $(M_1, M_2, \ldots, M_5, M_6) = (100, 80, \ldots, 20, 0)$ for a six-grade evaluation. In this setting, since the value of $r$ is quantized, the controller $K$ cannot access the exact value of $E_J(x^{(d)}, u^{(d)})$.

Remark 2.1

$E_J > -1$ holds since $\hat{J} > 0$. This fact is used in the analysis given in Section 4.

We impose a technical assumption on the model of the system-user, denoted by $H$. In addition to the user rating $r$, the system-user $H$ gives the sign of $E_J$ to the control system based on his/her experience $\{x^{(d)}, u^{(d)}\}$. Then, the model of $H$ is described by
$$H:\quad R(x^{(d)}, u^{(d)}) = \left\{ r(x^{(d)}, u^{(d)}),\ \mathrm{sgn}\!\left(\hat{J}(x^{(d)}, u^{(d)}) - J(x^{(d)}, u^{(d)})\right) \right\}, \tag{4}$$
where $\mathrm{sgn}(\cdot)$ is the sign function. We see that this $R$ is a “reward” in the reinforcement learning framework. The controller $K$ can access the reward $R$, which depends on the user experience $\{x^{(d)}, u^{(d)}\}$, to estimate $J$ and to update its control law.
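A minimal sketch of the user model $H$ in (2)–(4), assuming hypothetical grade values and thresholds in the spirit of (3): it computes the relative gap $E_J$, quantizes $|E_J|$ into a grade, and returns the sign of $\hat{J} - J$.

```python
import numpy as np

def user_model_H(J_true, J_hat,
                 grades=(100, 80, 60, 40, 20, 0),
                 thresholds=(0.003, 0.006, 0.009, 0.012, 0.015)):
    """Return the reward R = (r, sign) of (4).

    J_true : value of the user's private objective J(x^(d), u^(d))
    J_hat  : value of the estimated objective Jhat(x^(d), u^(d))
    The grade values M_i and thresholds a_i are placeholders.
    """
    # Relative gap (2); J_true > 0 is assumed, so E_J > -1.
    E_J = (J_hat - J_true) / J_true

    # Piecewise-constant rating (3): the smaller |E_J|, the higher the grade.
    r = grades[-1]                      # lowest grade if |E_J| exceeds a_{m-1}
    for grade, a in zip(grades[:-1], thresholds):
        if abs(E_J) <= a:
            r = grade
            break

    return r, int(np.sign(J_hat - J_true))

print(user_model_H(J_true=10.0, J_hat=10.02))   # small gap -> high grade
```

The controller only ever sees the pair returned here, never $E_J$ itself, which is the quantization that the estimation algorithm of Section 3 has to work around.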

Remark 2.2

A similar problem of estimating objective functions and/or rewards that generate control actions is known as inverse reinforcement learning (IRL); see e.g. [Citation12,Citation13] for the problem setting and e.g. [Citation14–16] for its applications. In most IRL frameworks, the control law is pre-defined and fixed, and the data it generates is available for the estimation. In this paper, on the other hand, the control law is not fixed but is to be updated, and what is available is the rating of a system-user, who is not included in the control loop.

The block diagram of the control system with the user rating is illustrated in Figure 2. In the figure, the blue line connecting the controller and the plant indicates the loop of the control operation, while the red line connecting the user, the controller, and the plant indicates the loop of the controller update.

Figure 2. Personalized control system updated based on user rating.


2.3. Problem of controller update

Control system (P,K) is updated based on user rating R(x(d),u(d)). The flow of the update is given as follows.

Flow of controller update

  1. The control system at version $\tau$, denoted by $(P, K_\tau)$, is operated, and the user gains the experience $\{x_\tau^{(d)}, u_\tau^{(d)}\}$.

  2. The user gives his/her rating $R(x_\tau^{(d)}, u_\tau^{(d)})$, i.e. a reward for $(P, K_\tau)$ is given to the controller.

  3. The parameters $q_i$, $i \in \{1, 2, \ldots, \ell\}$ are estimated.

  4. The control law $K$ is updated based on the estimated objective function $\hat{J}_\tau$.

  5. The version of the control system is updated as $\tau \leftarrow \tau + 1$, and go to Stage 1.

Note that the control law $K$ is uniquely determined once the parameter estimation is performed. This implies that the essence of the controller update is the parameter estimation, addressed at Stage 3 (a schematic sketch of the overall loop is given after Problem 2.1 below). The estimation problem is stated as follows:

Problem 2.1

Given $R(x^{(d)}, u^{(d)})$, estimate $q_i$, $i \in \{1, 2, \ldots, \ell\}$.
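Before moving to the estimation algorithm, the controller-update flow (Stages 1–5) can be written as a simple outer loop around any solver of Problem 2.1. The sketch below is schematic: the callables are placeholders for the components described in this and the next section, and the names are ours, not the paper's.

```python
from typing import Callable, Tuple

def controller_update_loop(
    synthesize_controller: Callable,   # Stage 4: controller K from the estimate q_hat
    run_experiment: Callable,          # Stage 1: experience {x^(d), u^(d)} under K
    user_rating: Callable,             # Stage 2: reward R = (r, sign) for the experience
    estimate_parameters: Callable,     # Stage 3: solver of Problem 2.1
    q_hat0: Tuple[float, float],
    n_updates: int = 20,
):
    """Outer loop of the controller update (Stages 1-5)."""
    q_hat = q_hat0
    for tau in range(n_updates):
        K = synthesize_controller(q_hat)      # version-tau controller
        experience = run_experiment(K)        # user gains experience
        reward = user_rating(experience)      # rating and sign of the gap
        q_hat = estimate_parameters(q_hat, experience, reward)
    return q_hat
```

Section 5 instantiates this loop with an LQR controller and the rating (18); the estimation step is the subject of Section 3.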

3. Parameter estimation algorithm

In this section, we propose an algorithm for estimating the parameters that characterize the user objective function given in (1). To simplify the discussion, the objective function is characterized by only two parameters $q_1$ and $q_2$, i.e.
$$J(x, u) = f_0(x, u) + q_1 f_1(x, u) + q_2 f_2(x, u).$$
The following discussion and the derived algorithm extend to the more general $\ell$-parameter case in a straightforward manner.

We now derive the estimation algorithm. Since the user rating is modelled by the quantized function (3), the parameter estimation reduces to a class of set-membership estimation [Citation17], as studied for state estimation problems [Citation18,Citation19]. Section 3.1 is devoted to the estimation of a parameter region. In Section 3.2, the parameter estimate is extracted from the region, and the estimation algorithm is presented.

3.1. Estimate of parameter region

Recall first that the user rating $R$, given in (4), includes rough information on his/her utility gained from the experience $\{x^{(d)}, u^{(d)}\}$. We suppose that $r = M_s$ holds for the experience, which implies that the user gives the $s$-th grade to the current control system. Further supposing $\mathrm{sgn}(\hat{J}^{(d)} - J^{(d)}) \geq 0$, we have the following inequality
$$a_{s-1} < E_J^{(d)} \leq a_s, \tag{5}$$
where $E_J^{(d)} := E_J(x^{(d)}, u^{(d)})$. Further, we let $f_i^{(d)} := f_i(x^{(d)}, u^{(d)})$, $i \in \{0, 1, 2\}$. Then, we see that $E_J^{(d)}$ is described by
$$E_J^{(d)} = \frac{\left(\hat{q}_{1,\tau-1} f_1^{(d)} + \hat{q}_{2,\tau-1} f_2^{(d)}\right) - \left(q_1 f_1^{(d)} + q_2 f_2^{(d)}\right)}{f_0^{(d)} + q_1 f_1^{(d)} + q_2 f_2^{(d)}}, \tag{6}$$
where $\hat{q}_{1,\tau-1}$ and $\hat{q}_{2,\tau-1}$ are the parameters estimated at the $(\tau-1)$-th trial of the controller update. Then, by substituting (6) into (5), we have
$$a_{s-1}\left(f_0^{(d)} + q_1 f_1^{(d)} + q_2 f_2^{(d)}\right) < \left(\hat{q}_{1,\tau-1} f_1^{(d)} + \hat{q}_{2,\tau-1} f_2^{(d)}\right) - \left(q_1 f_1^{(d)} + q_2 f_2^{(d)}\right) \leq a_s\left(f_0^{(d)} + q_1 f_1^{(d)} + q_2 f_2^{(d)}\right). \tag{7}$$
The set of linear inequalities in (7) represents the existence region of $q_1$ and $q_2$.

In a similar manner, supposing $\mathrm{sgn}(\hat{J}^{(d)} - J^{(d)}) < 0$, we see that
$$-a_s\left(f_0^{(d)} + q_1 f_1^{(d)} + q_2 f_2^{(d)}\right) \leq \left(\hat{q}_{1,\tau-1} f_1^{(d)} + \hat{q}_{2,\tau-1} f_2^{(d)}\right) - \left(q_1 f_1^{(d)} + q_2 f_2^{(d)}\right) < -a_{s-1}\left(f_0^{(d)} + q_1 f_1^{(d)} + q_2 f_2^{(d)}\right) \tag{8}$$
holds.
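For implementation purposes, the inequalities (7) and (8) can be rearranged into the standard half-space form $Aq \leq b$ for $q = (q_1, q_2)$. The sketch below is one such translation; strict inequalities are treated as non-strict for numerical purposes, and the function name and interface are ours, not the paper's.

```python
import numpy as np

def halfspaces_from_rating(f, q_hat_prev, a_lo, a_hi, sign):
    """Linear inequalities A q <= b describing the region implied by (7)/(8).

    f          : (f0, f1, f2) evaluated on the experience {x^(d), u^(d)}
    q_hat_prev : previous estimate (qhat_{1,tau-1}, qhat_{2,tau-1})
    a_lo, a_hi : thresholds a_{s-1}, a_s of the rated grade (a_0 = 0)
    sign       : reported sign of Jhat - J
    """
    f0, f1, f2 = f
    qf = q_hat_prev[0] * f1 + q_hat_prev[1] * f2   # qhat_1 f1 + qhat_2 f2
    fv = np.array([f1, f2])

    if sign >= 0:   # inequality (7):  a_lo * J < Jhat - J <= a_hi * J
        A = np.vstack([(1.0 + a_lo) * fv,     #  (1+a_lo)(q.f) <= qf - a_lo f0
                       -(1.0 + a_hi) * fv])   # -(1+a_hi)(q.f) <= a_hi f0 - qf
        b = np.array([qf - a_lo * f0, a_hi * f0 - qf])
    else:           # inequality (8): -a_hi * J <= Jhat - J < -a_lo * J
        A = np.vstack([(1.0 - a_hi) * fv,     #  (1-a_hi)(q.f) <= qf + a_hi f0
                       -(1.0 - a_lo) * fv])   # -(1-a_lo)(q.f) <= -(qf + a_lo f0)
        b = np.array([qf + a_hi * f0, -(qf + a_lo * f0)])
    return A, b
```

Keeping the coefficients in this form avoids dividing by $1 \pm a_i$ and makes the region $S_\tau$ directly usable by a linear-programming step such as the one sketched after Algorithm 1 below.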

Consider again the $\tau$-th trial of the controller update to derive the parameter estimation algorithm: a control operation is performed, and the system-user rates the control system $(P, K_\tau)$ based on his/her experience $\{x_\tau^{(d)}, u_\tau^{(d)}\}$. We suppose here that in the rating $R(x_\tau^{(d)}, u_\tau^{(d)})$, $r(x_\tau^{(d)}, u_\tau^{(d)}) = M_s$ holds, i.e. the user rates the system with the $s$-th grade. Then, we let
$$S_\tau = \left\{ (q_1, q_2) \,\middle|\, \text{(7) if } \mathrm{sgn}(\hat{J}^{(d)} - J^{(d)}) \geq 0,\ \text{(8) if } \mathrm{sgn}(\hat{J}^{(d)} - J^{(d)}) < 0 \right\}. \tag{9}$$
Recalling the user ratings given at the first to $(\tau-1)$-th control operations, we show that $(q_1, q_2) \in Q_\tau$, where
$$Q_\tau = S_0 \cap S_1 \cap S_2 \cap \cdots \cap S_\tau \tag{10}$$
and $S_0$ is the initial guess of the parameter region. The update of the parameter existence region is illustrated in Figure 3. In the figure, the region enclosed by the red line is $S_0$, and the region enclosed by the blue line is $S_1$. The coloured area represents the parameter existence region $Q_1$. In this way, we contract the parameter existence region by taking the intersection repeatedly. In the next subsection, we define the parameter estimate $(\hat{q}_{1,\tau}, \hat{q}_{2,\tau})$ from $Q_\tau$.

Figure 3. Estimate of parameter existence region.


3.2. Update method of estimated parameters

One can define the estimate $(\hat{q}_{1,\tau}, \hat{q}_{2,\tau})$ as the “centre” of the region $Q_\tau$. A drawback of taking the centre is its complexity: when the number of controller updates increases, i.e. $\tau$ increases, it becomes difficult to find the centre of the polytope $Q_\tau$ in (10). To find the estimate $(\hat{q}_{1,\tau}, \hat{q}_{2,\tau})$ in a computationally tractable way, we apply an approximation of $Q_\tau$. Let $Q_{\mathrm{rect},\tau}$ be the rectangular region that approximates $Q_\tau$, defined by
$$Q_{\mathrm{rect},\tau} = \left\{ (q_1, q_2) \,\middle|\, q_i \in [\underline{q}_{i,\tau}, \overline{q}_{i,\tau}],\ i \in \{1, 2\} \right\}, \tag{11}$$
where $\underline{q}_{i,\tau}$ and $\overline{q}_{i,\tau}$ are defined by
$$\underline{q}_{i,\tau} = \min_{(q_1, q_2) \in Q_{\mathrm{rect},\tau-1} \cap S_\tau} q_i, \qquad \overline{q}_{i,\tau} = \max_{(q_1, q_2) \in Q_{\mathrm{rect},\tau-1} \cap S_\tau} q_i,$$
respectively. Note that the region $Q_{\mathrm{rect},\tau}$ is an “outer” approximation of $Q_\tau$, i.e. it holds that $Q_\tau \subseteq Q_{\mathrm{rect},\tau}$. By taking the approximation of $Q_\tau$ as in (11) at every controller update $\tau$, we can find the estimate $(\hat{q}_{1,\tau}, \hat{q}_{2,\tau})$ in a simplified manner as
$$(\hat{q}_{1,\tau}, \hat{q}_{2,\tau}) = C(Q_{\mathrm{rect},\tau}), \tag{12}$$
where $C(\cdot)$ represents the centre of gravity, i.e.
$$\hat{q}_{1,\tau} = \frac{\underline{q}_{1,\tau} + \overline{q}_{1,\tau}}{2}, \qquad \hat{q}_{2,\tau} = \frac{\underline{q}_{2,\tau} + \overline{q}_{2,\tau}}{2}.$$
The approximation of the parameter existence region and the parameter estimate in (12) are illustrated in Figure 4. In the figure, the black dotted line represents the rectangular region $Q_{\mathrm{rect},1}$, while the blue triangle represents the estimated parameters $(\hat{q}_{1,1}, \hat{q}_{2,1})$.

Figure 4. Outer approximation of Qτ to define Qrect,τ, and the centre of Qrect,τ to define (qˆ1,τ,qˆ2,τ).


Finally, the algorithm for estimating the parameters $\{q_1, q_2\}$ is summarized in Algorithm 1.
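As a minimal sketch of one iteration of the estimation procedure, following our reading of Section 3.2: each bound $\underline{q}_{i,\tau}, \overline{q}_{i,\tau}$ in (11) is obtained by a small linear program over $Q_{\mathrm{rect},\tau-1} \cap S_\tau$, and the estimate is the centre of gravity (12). The half-space description of $S_\tau$ can come from a construction like the one sketched after (8); the infeasibility handling below is our own assumption, not something specified in the paper.

```python
import numpy as np
from scipy.optimize import linprog

def update_rectangle(lo, hi, A_new, b_new):
    """One rectangle update step (sketch of one iteration of Algorithm 1).

    lo, hi       : lower/upper corners of Q_rect,(tau-1), arrays of length 2
    A_new, b_new : half-space description A q <= b of the new region S_tau
    Returns the new bounds and the centre-of-gravity estimate (12).
    """
    lo_new, hi_new = lo.astype(float).copy(), hi.astype(float).copy()
    bounds = list(zip(lo, hi))                  # q_i constrained to the old rectangle
    for i in range(2):
        c = np.zeros(2)
        c[i] = 1.0
        res_min = linprog(c,  A_ub=A_new, b_ub=b_new, bounds=bounds, method="highs")
        res_max = linprog(-c, A_ub=A_new, b_ub=b_new, bounds=bounds, method="highs")
        if res_min.success and res_max.success:
            lo_new[i], hi_new[i] = res_min.x[i], res_max.x[i]
        # If the LPs are infeasible, the new rating is inconsistent with the
        # current rectangle; this sketch then simply keeps the previous bounds.
    q_hat = (lo_new + hi_new) / 2.0             # centre of gravity, Eq. (12)
    return lo_new, hi_new, q_hat
```

Because the feasible set of each LP is the intersection of a rectangle with two half-spaces, the optimization stays two-dimensional regardless of how many ratings have been collected, which is exactly the computational simplification motivated above.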

4. Analysis of algorithm

In this section, we address the convergence analysis of the proposed algorithm. We present the following theorem, which states the contraction of region Qrect,τ.

Theorem 4.1

In Algorithm 1, it holds that
$$\mathrm{vol}(Q_{\mathrm{rect},\tau+1}) < \mathrm{vol}(Q_{\mathrm{rect},\tau}), \tag{13}$$
where $\mathrm{vol}(S)$ represents the volume of $S$.

Proof.

We consider that, by the user rating,
$$a_{s-1} < |E_J^{(d)}| \leq a_s \tag{14}$$
holds for some $s \in \{1, 2, \ldots, m-1\}$, and $Q_{\mathrm{rect},\tau+1}$ is obtained as
$$Q_{\mathrm{rect},\tau+1} = \left\{ (q_1, q_2) \,\middle|\, q_i \in [\underline{q}_{i,\tau+1}, \overline{q}_{i,\tau+1}],\ i \in \{1, 2\} \right\}.$$
Consider here the following two cases in the user rating $R$: $\mathrm{sgn}(\hat{J}^{(d)} - J^{(d)}) > 0$ and $\mathrm{sgn}(\hat{J}^{(d)} - J^{(d)}) < 0$. First, suppose $\mathrm{sgn}(\hat{J}^{(d)} - J^{(d)}) > 0$, implying that (14) is reduced to $a_{s-1} < E_J^{(d)} \leq a_s$.

To prove (13), we show that one of the following four conditions holds:
$$\text{(i)}\ \underline{q}_{1,\tau} < \underline{q}_{1,\tau+1}, \quad \text{(ii)}\ \overline{q}_{1,\tau} > \overline{q}_{1,\tau+1}, \quad \text{(iii)}\ \underline{q}_{2,\tau} < \underline{q}_{2,\tau+1}, \quad \text{(iv)}\ \overline{q}_{2,\tau} > \overline{q}_{2,\tau+1}. \tag{15}$$
The graphical interpretation of conditions (i)–(iv) is given in Figure 5. We see that (i) holds if and only if the line $E_J^{(d)} = a_s$ intersects the segment $\{(q_1, q_2) \mid q_1 \in (\underline{q}_{1,\tau}, \overline{q}_{1,\tau}),\ q_2 = \overline{q}_{2,\tau}\}$, which is a part of the boundary of $Q_{\mathrm{rect},\tau}$. In the same way as the derivation of (7) from (5), the condition for the intersection is described by
$$\text{(i)}\quad \underline{q}_{1,\tau} f_1^{(d)} + \overline{q}_{2,\tau} f_2^{(d)} < X, \qquad \text{where } X = \frac{\hat{q}_{1,\tau} f_1^{(d)} + \hat{q}_{2,\tau} f_2^{(d)} - a_s f_0^{(d)}}{1 + a_s}.$$
In a similar manner, (ii)–(iv) are equivalently reduced to
$$\text{(ii)}\ \underline{q}_{1,\tau} f_1^{(d)} + \overline{q}_{2,\tau} f_2^{(d)} > Y, \quad \text{(iii)}\ \overline{q}_{1,\tau} f_1^{(d)} + \underline{q}_{2,\tau} f_2^{(d)} < X, \quad \text{(iv)}\ \overline{q}_{1,\tau} f_1^{(d)} + \underline{q}_{2,\tau} f_2^{(d)} > Y,$$
respectively, where
$$Y = \frac{\hat{q}_{1,\tau} f_1^{(d)} + \hat{q}_{2,\tau} f_2^{(d)} - a_{s-1} f_0^{(d)}}{1 + a_{s-1}}.$$
Now, we suppose that none of (i)–(iv) holds to prove (13) by contradiction. Then, it follows that both of the following inequalities hold:
$$X \leq \underline{q}_{1,\tau} f_1^{(d)} + \overline{q}_{2,\tau} f_2^{(d)} \leq Y, \qquad X \leq \overline{q}_{1,\tau} f_1^{(d)} + \underline{q}_{2,\tau} f_2^{(d)} \leq Y.$$
Since $\hat{q}_{i,\tau}$ is the midpoint of $[\underline{q}_{i,\tau}, \overline{q}_{i,\tau}]$, the quantity $\hat{q}_{1,\tau} f_1^{(d)} + \hat{q}_{2,\tau} f_2^{(d)}$ is the average of the two left-hand sides above, so we have
$$X \leq \hat{q}_{1,\tau} f_1^{(d)} + \hat{q}_{2,\tau} f_2^{(d)} \leq Y.$$
Here, we note that
$$\hat{q}_{1,\tau} f_1^{(d)} + \hat{q}_{2,\tau} f_2^{(d)} > Y$$
must hold since $a_{s-1} > 0$ and $f_0 \geq 0$. This contradicts the supposition that none of (i)–(iv) holds.

Figure 5. Condition for contraction, given in (15).


Next, we consider the case $\mathrm{sgn}(\hat{J}^{(d)} - J^{(d)}) < 0$, which implies that (14) is reduced to $-a_s \leq E_J^{(d)} < -a_{s-1}$.

In the same way as in the case $\mathrm{sgn}(\hat{J}^{(d)} - J^{(d)}) > 0$, to prove (13), we show that one of the following conditions holds:
$$\text{(i)}\ \underline{q}_{1,\tau} f_1^{(d)} + \overline{q}_{2,\tau} f_2^{(d)} < X, \quad \text{(ii)}\ \underline{q}_{1,\tau} f_1^{(d)} + \overline{q}_{2,\tau} f_2^{(d)} > Y, \quad \text{(iii)}\ \overline{q}_{1,\tau} f_1^{(d)} + \underline{q}_{2,\tau} f_2^{(d)} < X, \quad \text{(iv)}\ \overline{q}_{1,\tau} f_1^{(d)} + \underline{q}_{2,\tau} f_2^{(d)} > Y,$$
where
$$X = \frac{\hat{q}_{1,\tau} f_1^{(d)} + \hat{q}_{2,\tau} f_2^{(d)} + a_{s-1} f_0^{(d)}}{1 - a_{s-1}}, \qquad Y = \frac{\hat{q}_{1,\tau} f_1^{(d)} + \hat{q}_{2,\tau} f_2^{(d)} + a_s f_0^{(d)}}{1 - a_s}.$$
Supposing that none of (i)–(iv) holds, we have
$$X \leq \hat{q}_{1,\tau} f_1^{(d)} + \hat{q}_{2,\tau} f_2^{(d)} \leq Y.$$
Recall that $E_J > -1$, as stated in Remark 2.1. It follows that $a_{s-1} < 1$ holds. Consequently, it holds that
$$\hat{q}_{1,\tau} f_1^{(d)} + \hat{q}_{2,\tau} f_2^{(d)} < X.$$
This contradicts the supposition that none of (i)–(iv) holds, which concludes the proof of the theorem.

Remark 4.1

As implied by Theorem 4.1, by iterating the controller update with Algorithm 1, the parameter region $Q_\tau$ contracts monotonically in the sense of the outer approximation. The asymptotic convergence of $Q_\tau$ to $(q_1, q_2)$ is not guaranteed by the analysis, but it is numerically verified in the demonstration given in Section 5.

5. Numerical experiment

In this section, we present a numerical experiment of the proposed control system with Algorithm 1. In the experiment, we demonstrate personalization using the proposed control system, considering two users.

5.1. Problem setting

We address the LQR problem. The plant system is given by a discrete-time linear state space equation:
$$P:\quad x(k+1) = A x(k) + B u(k), \tag{16}$$
where the system matrices $A \in \mathbb{R}^{2\times 2}$ and $B \in \mathbb{R}^{2\times 1}$ are given by A=[01101], B=[11]. The objective function of a system-user is described by
$$J = \sum_{k=0}^{\infty} x(k+1)^{\top} Q\, x(k+1) + u(k)^{\top} W u(k),$$
where $Q$ and $W$ are weighting matrices. The corresponding optimal control law is given by
$$K:\quad u(k) = -K(Q, W)\, x(k), \tag{17}$$
where $K(Q, W) = W^{-1} B^{\top} P$ is the optimal feedback gain and $P$ is the solution to the Riccati equation $PA + A^{\top}P - PBW^{-1}B^{\top}P + Q = 0$. We consider two users, user A and user B. The parameters in $J$ for user A are given by $Q_A = \mathrm{diag}\{q_{1,A}, q_{2,A}\} = \mathrm{diag}\{50, 1\}$ and $W = 5$, and the parameters for user B are given by $Q_B = \mathrm{diag}\{q_{1,B}, q_{2,B}\} = \mathrm{diag}\{1, 50\}$ and $W = 5$.
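For the controller-synthesis step (Stage 4), the sketch below solves the Riccati equation in the form written above and builds the gain $K(Q, W) = W^{-1}B^{\top}P$ with SciPy. The numerical entries of $A$ and $B$ used here are hypothetical placeholders, not the matrices of the experiment; for the discrete-time plant (16), one could alternatively use the discrete-time Riccati equation (scipy.linalg.solve_discrete_are) with the corresponding gain formula.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

def lqr_gain(A, B, Q, W):
    """Feedback gain K(Q, W) = W^{-1} B^T P for the control law (17).

    P solves P A + A^T P - P B W^{-1} B^T P + Q = 0, the Riccati
    equation as written in the text.
    """
    P = solve_continuous_are(A, B, Q, W)
    return np.linalg.solve(W, B.T @ P)

# Hypothetical plant matrices (placeholders); user A's weights Q_A, W from the text.
A = np.array([[0.0, 1.0], [1.0, 0.1]])
B = np.array([[1.0], [1.0]])
Q_A = np.diag([50.0, 1.0])
W = np.array([[5.0]])

K = lqr_gain(A, B, Q_A, W)
print(K)    # applied as u(k) = -K x(k), Eq. (17)
```

Because the gain is uniquely determined by $(Q, W)$, replacing the true $Q$ by the estimate $\mathrm{diag}\{\hat{q}_1, \hat{q}_2\}$ in this function is all that the controller update of Section 2.3 requires in this experiment.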

It should be noted again that $J$ is private, i.e. the system-designer cannot access $Q$ directly. In this experiment, we try to estimate $q_1$ and $q_2$ based on the following user rating:
$$r = \begin{cases} 100 & (|E_J| = 0) \\ 80 & (0 < |E_J| \leq 0.003) \\ 60 & (0.003 < |E_J| \leq 0.006) \\ 40 & (0.006 < |E_J| \leq 0.009) \\ 20 & (0.009 < |E_J| \leq 0.012) \\ 0 & (0.012 < |E_J|). \end{cases} \tag{18}$$
The flow of the experiment is shown below.

  1. Set some initial estimate $q_{1,0}, q_{2,0}$.

  2. A control experiment is performed, where the initial state $x_0$ is determined by random values chosen from $[0, 10]^2$ (see the sketch after this list).

  3. The system-user evaluates the control system at that time and gives a rating to the system-designer based on (18).

  4. Apply Algorithm 1 to the user rating to update the estimate of $q_i$, $i \in \{1, 2\}$.

  5. Update the control law (17) with the updated parameters.

  6. Go back to Stage 2.

The experiment was performed under the initial guess $(q_{1,0}, q_{2,0}) = (60, 5)$.
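Stage 2 can be sketched as follows: the closed loop $u(k) = -K x(k)$ is simulated from a random initial state in $[0, 10]^2$, and the quantities $f_0 = \sum u^{\top} W u$, $f_1 = \sum x_1^2$, $f_2 = \sum x_2^2$ entering the user objective are accumulated. The finite horizon below is a numerical truncation of the infinite sum, and the decomposition into $f_0, f_1, f_2$ is our reading of $J$ with $Q = \mathrm{diag}\{q_1, q_2\}$; both are assumptions of this sketch.

```python
import numpy as np

def simulate_experience(A, B, K, W, n_steps=50, rng=None):
    """Stage 2: run the closed loop u(k) = -K x(k) from a random initial state
    in [0, 10]^2 and accumulate the basis functions of the user objective,
    f0 = sum u' W u, f1 = sum x1^2, f2 = sum x2^2 (finite-horizon truncation)."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.uniform(0.0, 10.0, size=2)
    f0 = f1 = f2 = 0.0
    for _ in range(n_steps):
        u = -K @ x                     # control law (17)
        x = A @ x + B @ u              # plant (16)
        f0 += float(u @ W @ u)
        f1 += x[0] ** 2
        f2 += x[1] ** 2
    return f0, f1, f2
```

The triple $(f_0, f_1, f_2)$ returned here is exactly the data needed by the rating model and by the half-space construction of Section 3.1, so the stages of the experiment can be chained directly.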

5.2. Experiment results

The results of the experiment are given in Figures 6–8. The transitions of the parameter estimates $\hat{q}_1$ and $\hat{q}_2$ obtained from the experiment are shown in Figure 6. In the figure, the horizontal axis represents the number of parameter updates and the vertical axis represents the parameter value. We see that the parameter estimate $(\hat{q}_1, \hat{q}_2)$ converges to $(q_{1,A}, q_{2,A}) = (50, 1)$ for user A, and to $(q_{1,B}, q_{2,B}) = (1, 50)$ for user B.

Figure 6. Transitions of parameter estimate. (a) user A and (b) user B.


Next, the transitions of the user utility are shown in Figure 7. In the figure, the horizontal axis represents the number of parameter updates and the vertical axis represents the value of the utility gained from the control system. For both user A and user B, we see that the utility is maximized by the algorithm even though the system-designer cannot access full information on the objective function.

Finally, we show the state transitions of the personalized control systems in Figure 8, where the converged parameter estimates are used for control. In the figure, the horizontal axis represents the discrete time $k$ and the vertical axis represents the state of the plant system (16). We see that the control behaviour differs between the users, which indicates that the control system is personalized.

Figure 7. Transitions of utility. (a) user A and (b) user B.


Figure 8. State transitions of personalized control system. (a) user A and (b) user B.


6. Conclusion

In this paper, we addressed the personalization of control systems, where the optimal controller is updated according to the user's private objective. We formulated the problem of estimating the individual objective function based on user ratings and proposed an algorithm to solve it. The algorithm was analysed for a special case where the objective function is characterized by only two parameters. Finally, a numerical experiment showed the usefulness of the algorithm.

We have not proved that Theorem 4.1 holds for the case with three or more parameters, so future work includes extending the analysis to a general objective function. Another direction of future work is to model the user rating in a manner different from (3).

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by Grant-in-Aid for Scientific Research (B), No. 20H02173, JSPS.

Notes on contributors

Tomotaka Nii

Tomotaka Nii received the B.E. degree from the Department of Applied Physico-informatics, Keio University, in 2022. His research interests include control theory for human-in-the-loop systems.

Masaki Inoue

Masaki Inoue received the M.E. and Ph.D. degrees in mechanical engineering from Osaka University in 2009 and 2012, respectively. He served as a Research Fellow of the Japan Society for the Promotion of Science from 2010 to 2012. From 2012 to 2014, he was a Project Researcher of FIRST, Aihara Innovative Mathematical Modelling Project, and also a Doctoral Researcher of the Graduate School of Information Science and Engineering, Tokyo Institute of Technology. Currently, he is an Associate Professor of the Department of Applied Physico-informatics, Keio University. His research interests include control theory for human-in-the-loop systems. He is a member of IEEE, SICE, and ISCIE.

References

  • Fan H, Poole MS. What is personalization? Perspectives on the design and implementation of personalization in information systems. J Organ Comput Electron Commer. 2006;16(3–4):179–202.
  • Tuzhilin A. Personalization: the state of the art and future directions. Bus Comput. 2009;3(3):3–43.
  • Hasenjager M, Heckmann M, Wersing H. A survey of personalization for advanced driver assistance systems. IEEE Trans Intell Veh. 2020;5(2):335–344.
  • Yi D, Su J, Hu L, et al. Implicit personalization in driving assistance: state-of-the-art and open issues. IEEE Trans Intell Veh. 2019;5(3):397–413.
  • Lu C, Gong J, Lv C, et al. A personalized behavior learning system for human-like longitudinal speed control of autonomous vehicles. Sensors. 2019;19(17):3672.
  • Noto N, Okuda H, Tazaki Y, et al. Steering assisting system for obstacle avoidance based on personalized potential field. In: 2012 15th International IEEE Conference on Intelligent Transportation Systems. IEEE; 2012. p. 1702–1707.
  • Wiering MA, Van Otterlo M. Reinforcement learning. Adapt Learn Optim. 2012;12(3):729.
  • Lewis FL, Vrabie D, Vamvoudakis KG. Reinforcement learning and feedback control: using natural decision methods to design optimal adaptive controllers. IEEE Contr Syst Mag. 2012;32(6):76–105.
  • Kiumarsi B, Vamvoudakis KG, Modares H, et al. Optimal and autonomous control using reinforcement learning: a survey. IEEE Trans Neural Netw Learn Syst. 2018;29(6):2042–2062.
  • Zanon M, Gros S, Bemporad A. Practical reinforcement learning of stabilizing economic MPC. In: 2019 18th European Control Conference (ECC), Naples; 2019. p. 2258–2263.
  • Ernst D, Glavic M, Capitanescu F, et al. Reinforcement learning versus model predictive control: a comparison on a power system problem. IEEE Trans Syst Man Cybern B. 2009;39(2):517–529.
  • Ng AY, Russell S. Algorithms for inverse reinforcement learning. In: International Conference on Machine Learning, Stanford, CA; Vol. 1; 2000. p. 2.
  • Arora S, Doshi P. A survey of inverse reinforcement learning: challenges, methods and progress. Artif Intell. 2021;297:Article ID 103500.
  • Ozkan MF, Ma Y. Modeling driver behavior in car-following interactions with automated and human-driven vehicles and energy efficiency evaluation. IEEE Access. 2021;9:64696–64707.
  • Ozkan MF, Rocque AJ, Ma Y. Inverse reinforcement learning based stochastic driver behavior learning. IFAC-PapersOnLine. 2021;54(20):882–888.
  • Ozkan MF, Ma Y. Personalized adaptive cruise control and impacts on mixed traffic. In: 2021 American Control Conference (ACC), New Orleans; 2021. p. 412–417.
  • Milanese M, Vicino A. Optimal estimation theory for dynamic systems with set membership uncertainty: an overview. Automatica. 1991;27(6):997–1009.
  • Savkin AV, Petersen IR. Set-valued state estimation via a limited capacity communication channel. IEEE Trans Automat Contr. 2003;48(4):676–680.
  • Shi D, Chen T, Shi L. On set-valued kalman filtering and its application to event-based state estimation. IEEE Trans Automat Control. 2015;60(5):1275–1290.