Research Article

Personalized control system via reinforcement learning: maximizing utility based on user ratings

Tomotaka Nii & Masaki Inoue
Pages 18-26 | Received 08 Sep 2022, Accepted 27 Dec 2022, Published online: 21 Jan 2023

Abstract

In this paper, we address the design of personalized control systems, which pursue individual objectives defined for each user. To this end, a reinforcement learning problem is formulated in which an individual objective function is estimated from the user's rating of his/her current control system, and the corresponding optimal controller is updated. The novelty of the problem setting lies in the modelling of the user rating. The rating is modelled as a quantization of the user utility gained from the control system, defined by the value of the objective function evaluated at the user's control experience. We propose an estimation algorithm to update the control law. Through a numerical experiment, we show that the proposed algorithm realizes a personalized control system.

1. Introduction

Robust design has been a basic design concept of control systems: common control systems are designed for various system-users such that the systems operate stably regardless of differences between user environments. Beyond robustness, “personalization” can be an advanced design concept of control systems: an individual control system is designed for each system-user such that the system pursues high control performance while improving user utility [Citation1,Citation2]. In this paper, we address the personalization of control systems, in particular, a design methodology for control systems equipped with a function of adaptation that improves user utility.

As studied in other research fields, a key to personalization is the modelling of system-users [Citation3,Citation4]. For example, in [Citation5,Citation6], data on manual driving is used to model driver intent. In these previous works, user modelling is based on measured data of the user's actions on the control system. In this paper, by contrast, the data on the actions is not available; instead, the result of a user rating of the control system is available. The user rating is collected for every control operation and is used to estimate the private objective function of the user, and the optimal controller is updated based on the estimated objective function. By repeating the controller update, we aim at maximizing the user utility gained from the control system.

As an application of the presented personalized control system, let us imagine an automatic driving system. The system drives automatically, and the user rates the driving control system based on his/her comfort, for example, once a month. The control system accesses the result of the user rating to estimate his/her preference, modelled by a parameterized objective function. Then, the implemented control law is updated based on the estimated objective function.

In general, the problem of improving system performance by learning techniques is called reinforcement learning (RL), which has been addressed extensively in the literature; see e.g. [Citation7] and the references therein. RL has been applied to the design of feedback control systems [Citation8–11]. In [Citation8,Citation9], continuous control problems are addressed, unlike standard RL problems, and the control law is updated directly. Furthermore, in [Citation10,Citation11], RL is combined with model predictive control, and the objective function is tuned to update the optimal control law indirectly. The main difference between this paper and the literature is the assumption on the objective function and/or reward. In the literature, the objective function is designable, while in this paper, the function is not designable: it is pre-defined but hidden by the system-user.

The rest of this paper is organized as follows. In Section 2, the models of control systems and system-user are given, and the problem of the control system update is formulated. In Section 3, we propose an algorithm of estimating the user objective function, which plays a central role in personalization. In Section 4, we give a convergence analysis of the proposed algorithm. In Section 5, we present a numerical experiment using the proposed algorithm. In Section 6, the conclusion is given.

2. Problem formulation

2.1. Model of control systems

We consider a control system that is composed of a plant system and a controller, which are denoted by $P$ and $K$, respectively. The plant system is modelled by the following discrete-time state space equation
$$P:\quad x(k+1) = h(x(k), u(k)),$$
where $k \in \mathbb{N}_+$ is the discrete time, and $x \in \mathbb{R}^n$ and $u \in \mathbb{R}^m$ are the state and control input, respectively. The symbol $h:\mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$ represents a function.

The controller is designed based on the estimate of an individual objective function, which is private and is defined for each user. Let $J$ and $\hat{J}$ represent the objective function and its estimate, respectively. Then, the control law is described by the following optimization problem
$$K:\quad \begin{cases} \min_{u}\ \hat{J}(x, u) \\ \text{s.t.}\ x(k+1) = h(x(k), u(k)),\ \forall k,\\ \phantom{\text{s.t.}}\ (x(k), u(k)) \in \mathcal{X} \times \mathcal{U},\ \forall k, \end{cases}$$
where $x$ and $u$ are stacked vectors composed of the sequences of the state and input, respectively, i.e.
$$x = [x(k+1), x(k+2), \ldots, x(k+N)],\qquad u = [u(k), u(k+1), \ldots, u(k+N-1)],$$
and $\mathcal{X}$ and $\mathcal{U}$ are the state and input constraints, respectively. In the following discussion, $\{x^{(d)}, u^{(d)}\}$ represents the measured data on $x$ and $u$, obtained in a control experiment and called the experience of the control system.

In addition to $P$ and $K$, we should note that a system-user participates in the control system. His/her objective function is modelled by
$$J(x, u) = f_0(x, u) + \sum_{i=1}^{\ell} q_i f_i(x, u), \tag{1}$$
where $f_i(x, u): \mathbb{R}^{nN} \times \mathbb{R}^{mN} \to \mathbb{R}_+$, $i \in \{0, 1, \ldots, \ell\}$ are non-negative functions defined by a system-designer and $q_i > 0$, $i \in \{1, 2, \ldots, \ell\}$ are weighting parameters to be estimated. As in (1), the user's objective is parameterized by $q_i$, $i \in \{1, 2, \ldots, \ell\}$. The system-user has his/her objective function (1) in mind, and the weighting parameters $q_i$, $i \in \{1, 2, \ldots, \ell\}$ are not directly accessible for the control system update. Instead, the user rating, which includes information on $q_i$, $i \in \{1, 2, \ldots, \ell\}$, is available for the update.
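To make these ingredients concrete, the following Python sketch illustrates one way the controller $K$ and the parameterized objective (1) could be realized numerically. The plant $h$, the basis functions $f_0, f_1, f_2$, the horizon, and the input bound are hypothetical placeholders rather than quantities taken from this paper, and the state constraint $\mathcal{X}$ is omitted for brevity.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical plant dynamics h(x, u); the paper leaves h general.
def h(x, u):
    A = np.array([[1.0, 0.1], [0.0, 1.0]])
    B = np.array([[0.0], [0.1]])
    return A @ x + B @ u

# Hypothetical non-negative basis functions f0, f1, f2 on stacked trajectories.
def f0(x_seq, u_seq):
    return float(np.sum(u_seq ** 2))          # control effort

def f1(x_seq, u_seq):
    return float(np.sum(x_seq[:, 0] ** 2))    # regulation of the first state

def f2(x_seq, u_seq):
    return float(np.sum(x_seq[:, 1] ** 2))    # regulation of the second state

def rollout(x0, u_seq):
    """Simulate x(k+1) = h(x(k), u(k)) over the whole input sequence."""
    xs, x = [], x0
    for u in u_seq:
        x = h(x, u)
        xs.append(x)
    return np.array(xs)

def J_estimated(x0, u_flat, q_hat, N, m):
    """Estimated objective Jhat = f0 + qhat_1 f1 + qhat_2 f2, as in (1)."""
    u_seq = u_flat.reshape(N, m)
    x_seq = rollout(x0, u_seq)
    return f0(x_seq, u_seq) + q_hat[0] * f1(x_seq, u_seq) + q_hat[1] * f2(x_seq, u_seq)

def controller_K(x0, q_hat, N=10, m=1, u_max=1.0):
    """Controller K: minimize the estimated objective over the input sequence,
    with a box input constraint U = [-u_max, u_max]."""
    u0 = np.zeros(N * m)
    bounds = [(-u_max, u_max)] * (N * m)
    res = minimize(lambda u: J_estimated(x0, u, q_hat, N, m), u0,
                   method="SLSQP", bounds=bounds)
    return res.x.reshape(N, m)

u_opt = controller_K(x0=np.array([1.0, 0.0]), q_hat=np.array([1.0, 1.0]))
print(u_opt[0])   # first input of the optimized sequence
```

The only structural assumptions carried over from the text are that $K$ optimizes the estimated objective subject to the plant dynamics and input constraints, and that the objective is an affine combination of known basis functions with unknown weights.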

In this paper, we aim at maximizing the user's utility achieved in the control system $(P, K)$ by updating $K$. To this end, the individual and private objective function $J$ is estimated, and the control law $K$ is updated based on the estimate $\hat{J}$. In other words, such “personalization” of the control system is achieved by an accurate estimate of $q_i$, $i \in \{1, 2, \ldots, \ell\}$. The estimation of $J$, i.e. that of $q_i$, $i \in \{1, 2, \ldots, \ell\}$, is based on the user rating without accessing the bare data $\{x^{(d)}, u^{(d)}\}$, unlike the conventional works [Citation10,Citation11].

2.2. Model of system-user

In the problem setting, we explicitly take into account the presence of a system-user. The user rates the control system $(P, K)$ based on the utility determined by his/her experience, denoted by $J(x^{(d)}, u^{(d)})$. An example of the user rating is realized by a questionnaire: the user gives an $m$-grade evaluation based on his/her satisfaction with the control system, as illustrated in Figure 1.

Figure 1. An example of user rating: five-grade evaluation is given to the control system, such as Excellent/Very Good/Good/Average/Poor.


Recall that in the control system $(P, K)$, the controller $K$ pursues performance in the sense of the estimated objective function $\hat{J}$, parameterized by $\hat{q}_i$, $i \in \{1, 2, \ldots, \ell\}$, to give the experience $\{x^{(d)}, u^{(d)}\}$ to the user. There exists a gap between the estimated utility $\hat{J}(x^{(d)}, u^{(d)})$ and the true utility $J(x^{(d)}, u^{(d)})$. We assume that the user rating depends on the gap: the user rates the control system high (low) if the gap is small (large).

To model the user rating, we define the gap in the utility gained from the experience $\{x^{(d)}, u^{(d)}\}$ as
$$E_J(x^{(d)}, u^{(d)}) = \frac{\hat{J}(x^{(d)}, u^{(d)}) - J(x^{(d)}, u^{(d)})}{J(x^{(d)}, u^{(d)})}. \tag{2}$$
Based on (2), the user rating $r$ is modelled by a piecewise constant function as
$$r = \begin{cases} M_1 & \text{if } |E_J| \in [0, a_1] \\ M_2 & \text{if } |E_J| \in (a_1, a_2] \\ \quad\vdots & \\ M_{m-1} & \text{if } |E_J| \in (a_{m-2}, a_{m-1}] \\ M_m & \text{if } |E_J| \in (a_{m-1}, \infty), \end{cases} \tag{3}$$

where $M_i$, $i \in \{1, 2, \ldots, m\}$ are positive constants satisfying $M_1 > M_2 > \cdots > M_m$, and $a_i$, $i \in \{1, 2, \ldots, m-1\}$ are also positive constants that indicate the ranges of $|E_J|$. One can assume the rating as $(M_1, M_2, \ldots, M_5, M_6) = (100, 80, \ldots, 20, 0)$ for a six-grade evaluation. In this setting, since the value of $r$ is quantized, the controller $K$ cannot access the exact value of $E_J(x^{(d)}, u^{(d)})$.

Remark 2.1

$E_J > -1$ holds since $\hat{J} > 0$. This fact is used in the analysis given in Section 4.

We impose a technical assumption on the model of the system-user, denoted by $H$. In addition to the user rating $r$, the system-user $H$ gives the sign of $E_J$ to the control system based on his/her experience $\{x^{(d)}, u^{(d)}\}$. Then, the model of $H$ is described by
$$H:\quad R(x^{(d)}, u^{(d)}) = \left\{ r(x^{(d)}, u^{(d)}),\ \mathrm{sgn}\!\left(\hat{J}(x^{(d)}, u^{(d)}) - J(x^{(d)}, u^{(d)})\right) \right\}, \tag{4}$$
where $\mathrm{sgn}(\cdot)$ is the sign function. We see that this $R$ is a “reward” in the reinforcement learning framework. The controller $K$ can access the reward $R$, which depends on the user experience $\{x^{(d)}, u^{(d)}\}$, to estimate $J$ and to update its control law.
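A minimal sketch of the user model $H$ in (2)–(4), assuming hypothetical grade values and thresholds in the spirit of (3): it computes the relative gap $E_J$, quantizes $|E_J|$ into a grade, and returns the sign of $\hat{J} - J$.

```python
import numpy as np

def user_model_H(J_true, J_hat,
                 grades=(100, 80, 60, 40, 20, 0),
                 thresholds=(0.003, 0.006, 0.009, 0.012, 0.015)):
    """Return the reward R = (r, sign) of (4).

    J_true : value of the user's private objective J(x^(d), u^(d))
    J_hat  : value of the estimated objective Jhat(x^(d), u^(d))
    The grade values M_i and thresholds a_i are placeholders.
    """
    # Relative gap (2); J_true > 0 is assumed, so E_J > -1.
    E_J = (J_hat - J_true) / J_true

    # Piecewise-constant rating (3): the smaller |E_J|, the higher the grade.
    r = grades[-1]                      # lowest grade if |E_J| exceeds a_{m-1}
    for grade, a in zip(grades[:-1], thresholds):
        if abs(E_J) <= a:
            r = grade
            break

    return r, int(np.sign(J_hat - J_true))

print(user_model_H(J_true=10.0, J_hat=10.02))   # small gap -> high grade
```

The controller only ever sees the pair returned here, never $E_J$ itself, which is the quantization that the estimation algorithm of Section 3 has to work around.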

Remark 2.2

A similar problem of estimating objective functions and/or rewards that generate control actions is known as inverse reinforcement learning (IRL); see e.g. [Citation12,Citation13] for the problem setting and e.g. [Citation14–16] for its applications. In most IRL frameworks, the control law is pre-defined and fixed, and the data it generates is available for the estimation. In this paper, on the other hand, the control law is not fixed but is to be updated, and what is available is the rating of a system-user, who is not included in the control loop.

The block diagram of the control system with the user rating is illustrated in Figure 2. In the figure, the blue line connecting the controller and the plant indicates the loop of the control operation, while the red line connecting the user, the controller, and the plant indicates the loop of the controller update.

Figure 2. Personalized control system updated based on user rating.


2.3. Problem of controller update

Control system (P,K) is updated based on user rating R(x(d),u(d)). The flow of the update is given as follows.

Flow of controller update

  1. The control system at version $\tau$, denoted by $(P, K_\tau)$, is operated, and the user gains the experience $\{x_\tau^{(d)}, u_\tau^{(d)}\}$.

  2. The user gives his/her rating $R(x_\tau^{(d)}, u_\tau^{(d)})$, i.e. a reward for $(P, K_\tau)$ is given to the controller.

  3. The parameters $q_i$, $i \in \{1, 2, \ldots, \ell\}$ are estimated.

  4. The control law $K$ is updated based on the estimated objective function $\hat{J}_\tau$.

  5. The version of the control system is updated as $\tau \leftarrow \tau + 1$, and go to Stage 1.

Note that the control law $K$ is uniquely determined once the parameter estimation is performed. This implies that the essence of the controller update is the parameter estimation, addressed at Stage 3 (a schematic sketch of the overall loop is given after Problem 2.1 below). The estimation problem is stated as follows:

Problem 2.1

Given $R(x^{(d)}, u^{(d)})$, estimate $q_i$, $i \in \{1, 2, \ldots, \ell\}$.
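Before moving to the estimation algorithm, the controller-update flow (Stages 1–5) can be written as a simple outer loop around any solver of Problem 2.1. The sketch below is schematic: the callables are placeholders for the components described in this and the next section, and the names are ours, not the paper's.

```python
from typing import Callable, Tuple

def controller_update_loop(
    synthesize_controller: Callable,   # Stage 4: controller K from the estimate q_hat
    run_experiment: Callable,          # Stage 1: experience {x^(d), u^(d)} under K
    user_rating: Callable,             # Stage 2: reward R = (r, sign) for the experience
    estimate_parameters: Callable,     # Stage 3: solver of Problem 2.1
    q_hat0: Tuple[float, float],
    n_updates: int = 20,
):
    """Outer loop of the controller update (Stages 1-5)."""
    q_hat = q_hat0
    for tau in range(n_updates):
        K = synthesize_controller(q_hat)      # version-tau controller
        experience = run_experiment(K)        # user gains experience
        reward = user_rating(experience)      # rating and sign of the gap
        q_hat = estimate_parameters(q_hat, experience, reward)
    return q_hat
```

Section 5 instantiates this loop with an LQR controller and the rating (18); the estimation step is the subject of Section 3.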

3. Parameter estimation algorithm

In this section, we propose an algorithm for estimating the parameters that characterize the user objective function given in (1). To simplify the discussion, the objective function is characterized by only two parameters $q_1$ and $q_2$, i.e.
$$J(x, u) = f_0(x, u) + q_1 f_1(x, u) + q_2 f_2(x, u).$$
The following discussion and the derived algorithm extend to the more general $\ell$-parameter case in a straightforward manner.

We now derive the estimation algorithm. Since the user rating is modelled by the quantized function (3), the parameter estimation reduces to a class of set-membership estimation [Citation17], as studied for state estimation problems [Citation18,Citation19]. Section 3.1 is devoted to the estimation of a parameter region. In Section 3.2, the parameter estimate is extracted from the region, and the estimation algorithm is presented.

3.1. Estimate of parameter region

Recall first that the user rating $R$, given in (4), includes rough information on his/her utility gained from the experience $\{x^{(d)}, u^{(d)}\}$. We suppose that $r = M_s$ holds for the experience, which implies that the user gives the $s$-th grade to the current control system. Further supposing $\mathrm{sgn}(\hat{J}^{(d)} - J^{(d)}) \geq 0$, we have the following inequality
$$a_{s-1} < E_J^{(d)} \leq a_s, \tag{5}$$
where $E_J^{(d)} := E_J(x^{(d)}, u^{(d)})$. Further, we let $f_i^{(d)} := f_i(x^{(d)}, u^{(d)})$, $i \in \{0, 1, 2\}$. Then, we see that $E_J^{(d)}$ is described by
$$E_J^{(d)} = \frac{\left(\hat{q}_{1,\tau-1} f_1^{(d)} + \hat{q}_{2,\tau-1} f_2^{(d)}\right) - \left(q_1 f_1^{(d)} + q_2 f_2^{(d)}\right)}{f_0^{(d)} + q_1 f_1^{(d)} + q_2 f_2^{(d)}}, \tag{6}$$
where $\hat{q}_{1,\tau-1}$ and $\hat{q}_{2,\tau-1}$ are the parameters estimated at the $(\tau-1)$-th trial of the controller update. Then, by substituting (6) into (5), we have
$$a_{s-1}\left(f_0^{(d)} + q_1 f_1^{(d)} + q_2 f_2^{(d)}\right) < \left(\hat{q}_{1,\tau-1} f_1^{(d)} + \hat{q}_{2,\tau-1} f_2^{(d)}\right) - \left(q_1 f_1^{(d)} + q_2 f_2^{(d)}\right) \leq a_s\left(f_0^{(d)} + q_1 f_1^{(d)} + q_2 f_2^{(d)}\right). \tag{7}$$
The set of linear inequalities in (7) represents the existence region of $q_1$ and $q_2$.

In a similar manner, supposing $\mathrm{sgn}(\hat{J}^{(d)} - J^{(d)}) < 0$, we see that
$$-a_s\left(f_0^{(d)} + q_1 f_1^{(d)} + q_2 f_2^{(d)}\right) \leq \left(\hat{q}_{1,\tau-1} f_1^{(d)} + \hat{q}_{2,\tau-1} f_2^{(d)}\right) - \left(q_1 f_1^{(d)} + q_2 f_2^{(d)}\right) < -a_{s-1}\left(f_0^{(d)} + q_1 f_1^{(d)} + q_2 f_2^{(d)}\right) \tag{8}$$
holds.
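For implementation purposes, the inequalities (7) and (8) can be rearranged into the standard half-space form $Aq \leq b$ for $q = (q_1, q_2)$. The sketch below is one such translation; strict inequalities are treated as non-strict for numerical purposes, and the function name and interface are ours, not the paper's.

```python
import numpy as np

def halfspaces_from_rating(f, q_hat_prev, a_lo, a_hi, sign):
    """Linear inequalities A q <= b describing the region implied by (7)/(8).

    f          : (f0, f1, f2) evaluated on the experience {x^(d), u^(d)}
    q_hat_prev : previous estimate (qhat_{1,tau-1}, qhat_{2,tau-1})
    a_lo, a_hi : thresholds a_{s-1}, a_s of the rated grade (a_0 = 0)
    sign       : reported sign of Jhat - J
    """
    f0, f1, f2 = f
    qf = q_hat_prev[0] * f1 + q_hat_prev[1] * f2   # qhat_1 f1 + qhat_2 f2
    fv = np.array([f1, f2])

    if sign >= 0:   # inequality (7):  a_lo * J < Jhat - J <= a_hi * J
        A = np.vstack([(1.0 + a_lo) * fv,     #  (1+a_lo)(q.f) <= qf - a_lo f0
                       -(1.0 + a_hi) * fv])   # -(1+a_hi)(q.f) <= a_hi f0 - qf
        b = np.array([qf - a_lo * f0, a_hi * f0 - qf])
    else:           # inequality (8): -a_hi * J <= Jhat - J < -a_lo * J
        A = np.vstack([(1.0 - a_hi) * fv,     #  (1-a_hi)(q.f) <= qf + a_hi f0
                       -(1.0 - a_lo) * fv])   # -(1-a_lo)(q.f) <= -(qf + a_lo f0)
        b = np.array([qf + a_hi * f0, -(qf + a_lo * f0)])
    return A, b
```

Keeping the coefficients in this form avoids dividing by $1 \pm a_i$ and makes the region $S_\tau$ directly usable by a linear-programming step such as the one sketched after Algorithm 1 below.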

Consider again the $\tau$-th trial of the controller update to derive the parameter estimation algorithm: a control operation is performed, and the system-user rates the control system $(P, K_\tau)$ based on his/her experience $\{x_\tau^{(d)}, u_\tau^{(d)}\}$. We suppose here that in the rating $R(x_\tau^{(d)}, u_\tau^{(d)})$, $r(x_\tau^{(d)}, u_\tau^{(d)}) = M_s$ holds, i.e. the user rates the system with the $s$-th grade. Then, we let
$$S_\tau = \left\{ (q_1, q_2) \,\middle|\, \text{(7) if } \mathrm{sgn}(\hat{J}^{(d)} - J^{(d)}) \geq 0,\ \text{(8) if } \mathrm{sgn}(\hat{J}^{(d)} - J^{(d)}) < 0 \right\}. \tag{9}$$
Recalling the user ratings given at the first to $(\tau-1)$-th control operations, we show that $(q_1, q_2) \in Q_\tau$, where
$$Q_\tau = S_0 \cap S_1 \cap S_2 \cap \cdots \cap S_\tau \tag{10}$$
and $S_0$ is the initial guess of the parameter region. The update of the parameter existence region is illustrated in Figure 3. In the figure, the region enclosed by the red line is $S_0$, and the region enclosed by the blue line is $S_1$. The coloured area represents the parameter existence region $Q_1$. In this way, we contract the parameter existence region by taking the intersection repeatedly. In the next subsection, we define the parameter estimate $(\hat{q}_{1,\tau}, \hat{q}_{2,\tau})$ from $Q_\tau$.

Figure 3. Estimate of parameter existence region.


3.2. Update method of estimated parameters

One can define the estimate $(\hat{q}_{1,\tau}, \hat{q}_{2,\tau})$ as the “centre” of the region $Q_\tau$. A drawback of taking the centre is its complexity: when the number of controller updates increases, i.e. $\tau$ increases, it becomes difficult to find the centre of the polytope $Q_\tau$ in (10). To find the estimate $(\hat{q}_{1,\tau}, \hat{q}_{2,\tau})$ in a computationally tractable way, we apply an approximation of $Q_\tau$. Let $Q_{\mathrm{rect},\tau}$ be the rectangular region that approximates $Q_\tau$, defined by
$$Q_{\mathrm{rect},\tau} = \left\{ (q_1, q_2) \,\middle|\, q_i \in [\underline{q}_{i,\tau}, \overline{q}_{i,\tau}],\ i \in \{1, 2\} \right\}, \tag{11}$$
where $\underline{q}_{i,\tau}$ and $\overline{q}_{i,\tau}$ are defined by
$$\underline{q}_{i,\tau} = \min_{(q_1, q_2) \in Q_{\mathrm{rect},\tau-1} \cap S_\tau} q_i, \qquad \overline{q}_{i,\tau} = \max_{(q_1, q_2) \in Q_{\mathrm{rect},\tau-1} \cap S_\tau} q_i,$$
respectively. Note that the region $Q_{\mathrm{rect},\tau}$ is an “outer” approximation of $Q_\tau$, i.e. it holds that $Q_\tau \subseteq Q_{\mathrm{rect},\tau}$. By taking the approximation of $Q_\tau$ as in (11) at every controller update $\tau$, we can find the estimate $(\hat{q}_{1,\tau}, \hat{q}_{2,\tau})$ in a simplified manner as
$$(\hat{q}_{1,\tau}, \hat{q}_{2,\tau}) = C(Q_{\mathrm{rect},\tau}), \tag{12}$$
where $C(\cdot)$ represents the centre of gravity, i.e.
$$\hat{q}_{1,\tau} = \frac{\underline{q}_{1,\tau} + \overline{q}_{1,\tau}}{2}, \qquad \hat{q}_{2,\tau} = \frac{\underline{q}_{2,\tau} + \overline{q}_{2,\tau}}{2}.$$
The approximation of the parameter existence region and the parameter estimate in (12) are illustrated in Figure 4. In the figure, the black dotted line represents the rectangular region $Q_{\mathrm{rect},1}$, while the blue triangle represents the estimated parameters $(\hat{q}_{1,1}, \hat{q}_{2,1})$.

Figure 4. Outer approximation of Qτ to define Qrect,τ, and the centre of Qrect,τ to define (qˆ1,τ,qˆ2,τ).


Finally, the algorithm for estimating the parameters $\{q_1, q_2\}$ is summarized in Algorithm 1.
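As a minimal sketch of one iteration of the estimation procedure, following our reading of Section 3.2: each bound $\underline{q}_{i,\tau}, \overline{q}_{i,\tau}$ in (11) is obtained by a small linear program over $Q_{\mathrm{rect},\tau-1} \cap S_\tau$, and the estimate is the centre of gravity (12). The half-space description of $S_\tau$ can come from a construction like the one sketched after (8); the infeasibility handling below is our own assumption, not something specified in the paper.

```python
import numpy as np
from scipy.optimize import linprog

def update_rectangle(lo, hi, A_new, b_new):
    """One rectangle update step (sketch of one iteration of Algorithm 1).

    lo, hi       : lower/upper corners of Q_rect,(tau-1), arrays of length 2
    A_new, b_new : half-space description A q <= b of the new region S_tau
    Returns the new bounds and the centre-of-gravity estimate (12).
    """
    lo_new, hi_new = lo.astype(float).copy(), hi.astype(float).copy()
    bounds = list(zip(lo, hi))                  # q_i constrained to the old rectangle
    for i in range(2):
        c = np.zeros(2)
        c[i] = 1.0
        res_min = linprog(c,  A_ub=A_new, b_ub=b_new, bounds=bounds, method="highs")
        res_max = linprog(-c, A_ub=A_new, b_ub=b_new, bounds=bounds, method="highs")
        if res_min.success and res_max.success:
            lo_new[i], hi_new[i] = res_min.x[i], res_max.x[i]
        # If the LPs are infeasible, the new rating is inconsistent with the
        # current rectangle; this sketch then simply keeps the previous bounds.
    q_hat = (lo_new + hi_new) / 2.0             # centre of gravity, Eq. (12)
    return lo_new, hi_new, q_hat
```

Because the feasible set of each LP is the intersection of a rectangle with two half-spaces, the optimization stays two-dimensional regardless of how many ratings have been collected, which is exactly the computational simplification motivated above.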

4. Analysis of algorithm

In this section, we address the convergence analysis of the proposed algorithm. We present the following theorem, which states the contraction of region Qrect,τ.

Theorem 4.1

In Algorithm 1, it holds that
$$\mathrm{vol}(Q_{\mathrm{rect},\tau+1}) < \mathrm{vol}(Q_{\mathrm{rect},\tau}), \tag{13}$$
where $\mathrm{vol}(S)$ represents the volume of $S$.

Proof.

We consider that, by the user rating,
$$a_{s-1} < |E_J^{(d)}| \leq a_s \tag{14}$$
holds for some $s \in \{1, 2, \ldots, m-1\}$, and $Q_{\mathrm{rect},\tau+1}$ is obtained as
$$Q_{\mathrm{rect},\tau+1} = \left\{ (q_1, q_2) \,\middle|\, q_i \in [\underline{q}_{i,\tau+1}, \overline{q}_{i,\tau+1}],\ i \in \{1, 2\} \right\}.$$
Consider here the following two cases in the user rating $R$: $\mathrm{sgn}(\hat{J}^{(d)} - J^{(d)}) > 0$ and $\mathrm{sgn}(\hat{J}^{(d)} - J^{(d)}) < 0$. First, suppose $\mathrm{sgn}(\hat{J}^{(d)} - J^{(d)}) > 0$, implying that (14) is reduced to $a_{s-1} < E_J^{(d)} \leq a_s$.

To prove (13), we show that one of the following four conditions holds:
$$\text{(i)}\ \underline{q}_{1,\tau} < \underline{q}_{1,\tau+1}, \quad \text{(ii)}\ \overline{q}_{1,\tau} > \overline{q}_{1,\tau+1}, \quad \text{(iii)}\ \underline{q}_{2,\tau} < \underline{q}_{2,\tau+1}, \quad \text{(iv)}\ \overline{q}_{2,\tau} > \overline{q}_{2,\tau+1}. \tag{15}$$
The graphical interpretation of conditions (i)–(iv) is given in Figure 5. We see that (i) holds if and only if the line $E_J^{(d)} = a_s$ intersects the segment $\{(q_1, q_2) \mid q_1 \in (\underline{q}_{1,\tau}, \overline{q}_{1,\tau}),\ q_2 = \overline{q}_{2,\tau}\}$, which is a part of the boundary of $Q_{\mathrm{rect},\tau}$. In the same way as the derivation of (7) from (5), the condition for the intersection is described by
$$\text{(i)}\quad \underline{q}_{1,\tau} f_1^{(d)} + \overline{q}_{2,\tau} f_2^{(d)} < X, \qquad \text{where } X = \frac{\hat{q}_{1,\tau} f_1^{(d)} + \hat{q}_{2,\tau} f_2^{(d)} - a_s f_0^{(d)}}{1 + a_s}.$$
In a similar manner, (ii)–(iv) are equivalently reduced to
$$\text{(ii)}\ \underline{q}_{1,\tau} f_1^{(d)} + \overline{q}_{2,\tau} f_2^{(d)} > Y, \quad \text{(iii)}\ \overline{q}_{1,\tau} f_1^{(d)} + \underline{q}_{2,\tau} f_2^{(d)} < X, \quad \text{(iv)}\ \overline{q}_{1,\tau} f_1^{(d)} + \underline{q}_{2,\tau} f_2^{(d)} > Y,$$
respectively, where
$$Y = \frac{\hat{q}_{1,\tau} f_1^{(d)} + \hat{q}_{2,\tau} f_2^{(d)} - a_{s-1} f_0^{(d)}}{1 + a_{s-1}}.$$
Now, we suppose that none of (i)–(iv) holds to prove (13) by contradiction. Then, it follows that both of the following inequalities hold:
$$X \leq \underline{q}_{1,\tau} f_1^{(d)} + \overline{q}_{2,\tau} f_2^{(d)} \leq Y, \qquad X \leq \overline{q}_{1,\tau} f_1^{(d)} + \underline{q}_{2,\tau} f_2^{(d)} \leq Y.$$
Since $\hat{q}_{i,\tau}$ is the midpoint of $[\underline{q}_{i,\tau}, \overline{q}_{i,\tau}]$, the quantity $\hat{q}_{1,\tau} f_1^{(d)} + \hat{q}_{2,\tau} f_2^{(d)}$ is the average of the two left-hand sides above, so we have
$$X \leq \hat{q}_{1,\tau} f_1^{(d)} + \hat{q}_{2,\tau} f_2^{(d)} \leq Y.$$
Here, we note that
$$\hat{q}_{1,\tau} f_1^{(d)} + \hat{q}_{2,\tau} f_2^{(d)} > Y$$
must hold since $a_{s-1} > 0$ and $f_0 \geq 0$. This contradicts the supposition that none of (i)–(iv) holds.

Figure 5. Condition for contraction, given in (15).


Next, we consider the case $\mathrm{sgn}(\hat{J}^{(d)} - J^{(d)}) < 0$, which implies that (14) is reduced to $-a_s \leq E_J^{(d)} < -a_{s-1}$.

In the same way as in the case $\mathrm{sgn}(\hat{J}^{(d)} - J^{(d)}) > 0$, to prove (13), we show that one of the following conditions holds:
$$\text{(i)}\ \underline{q}_{1,\tau} f_1^{(d)} + \overline{q}_{2,\tau} f_2^{(d)} < X, \quad \text{(ii)}\ \underline{q}_{1,\tau} f_1^{(d)} + \overline{q}_{2,\tau} f_2^{(d)} > Y, \quad \text{(iii)}\ \overline{q}_{1,\tau} f_1^{(d)} + \underline{q}_{2,\tau} f_2^{(d)} < X, \quad \text{(iv)}\ \overline{q}_{1,\tau} f_1^{(d)} + \underline{q}_{2,\tau} f_2^{(d)} > Y,$$
where
$$X = \frac{\hat{q}_{1,\tau} f_1^{(d)} + \hat{q}_{2,\tau} f_2^{(d)} + a_{s-1} f_0^{(d)}}{1 - a_{s-1}}, \qquad Y = \frac{\hat{q}_{1,\tau} f_1^{(d)} + \hat{q}_{2,\tau} f_2^{(d)} + a_s f_0^{(d)}}{1 - a_s}.$$
Supposing that none of (i)–(iv) holds, we have
$$X \leq \hat{q}_{1,\tau} f_1^{(d)} + \hat{q}_{2,\tau} f_2^{(d)} \leq Y.$$
Recall that $E_J > -1$, as stated in Remark 2.1. It follows that $a_{s-1} < 1$ holds. Consequently, it holds that
$$\hat{q}_{1,\tau} f_1^{(d)} + \hat{q}_{2,\tau} f_2^{(d)} < X.$$
This contradicts the supposition that none of (i)–(iv) holds, which concludes the proof of the theorem.

Remark 4.1

As implied by Theorem 4.1, by iterating the controller update with Algorithm 1, the parameter region $Q_\tau$ contracts monotonically in the sense of the outer approximation. The asymptotic convergence of $Q_\tau$ to $(q_1, q_2)$ is not guaranteed by the analysis, but it is numerically verified in the demonstration given in Section 5.

5. Numerical experiment

In this section, we present a numerical experiment of the proposed control system with Algorithm 1. In the experiment, we demonstrate personalization using the proposed control system, considering two users.

5.1. Problem setting

We address the LQR problem. The plant system is given by a discrete-time linear state space equation:
$$P:\quad x(k+1) = A x(k) + B u(k), \tag{16}$$
where the system matrices $A \in \mathbb{R}^{2\times 2}$ and $B \in \mathbb{R}^{2\times 1}$ are given by A=[01101], B=[11]. The objective function of a system-user is described by
$$J = \sum_{k=0}^{\infty} x(k+1)^{\top} Q\, x(k+1) + u(k)^{\top} W u(k),$$
where $Q$ and $W$ are weighting matrices. The corresponding optimal control law is given by
$$K:\quad u(k) = -K(Q, W)\, x(k), \tag{17}$$
where $K(Q, W) = W^{-1} B^{\top} P$ is the optimal feedback gain and $P$ is the solution to the Riccati equation $PA + A^{\top}P - PBW^{-1}B^{\top}P + Q = 0$. We consider two users, user A and user B. The parameters in $J$ for user A are given by $Q_A = \mathrm{diag}\{q_{1,A}, q_{2,A}\} = \mathrm{diag}\{50, 1\}$ and $W = 5$, and the parameters for user B are given by $Q_B = \mathrm{diag}\{q_{1,B}, q_{2,B}\} = \mathrm{diag}\{1, 50\}$ and $W = 5$.
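For the controller-synthesis step (Stage 4), the sketch below solves the Riccati equation in the form written above and builds the gain $K(Q, W) = W^{-1}B^{\top}P$ with SciPy. The numerical entries of $A$ and $B$ used here are hypothetical placeholders, not the matrices of the experiment; for the discrete-time plant (16), one could alternatively use the discrete-time Riccati equation (scipy.linalg.solve_discrete_are) with the corresponding gain formula.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

def lqr_gain(A, B, Q, W):
    """Feedback gain K(Q, W) = W^{-1} B^T P for the control law (17).

    P solves P A + A^T P - P B W^{-1} B^T P + Q = 0, the Riccati
    equation as written in the text.
    """
    P = solve_continuous_are(A, B, Q, W)
    return np.linalg.solve(W, B.T @ P)

# Hypothetical plant matrices (placeholders); user A's weights Q_A, W from the text.
A = np.array([[0.0, 1.0], [1.0, 0.1]])
B = np.array([[1.0], [1.0]])
Q_A = np.diag([50.0, 1.0])
W = np.array([[5.0]])

K = lqr_gain(A, B, Q_A, W)
print(K)    # applied as u(k) = -K x(k), Eq. (17)
```

Because the gain is uniquely determined by $(Q, W)$, replacing the true $Q$ by the estimate $\mathrm{diag}\{\hat{q}_1, \hat{q}_2\}$ in this function is all that the controller update of Section 2.3 requires in this experiment.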

It should be noted again that $J$ is private, i.e. the system-designer cannot access $Q$ directly. In this experiment, we try to estimate $q_1$ and $q_2$ based on the following user rating:
$$r = \begin{cases} 100 & (|E_J| = 0) \\ 80 & (0 < |E_J| \leq 0.003) \\ 60 & (0.003 < |E_J| \leq 0.006) \\ 40 & (0.006 < |E_J| \leq 0.009) \\ 20 & (0.009 < |E_J| \leq 0.012) \\ 0 & (0.012 < |E_J|). \end{cases} \tag{18}$$
The flow of the experiment is shown below.

  1. Set some initial estimate $q_{1,0}, q_{2,0}$.

  2. A control experiment is performed, where the initial state $x_0$ is determined by random values chosen from $[0, 10]^2$ (see the sketch after this list).

  3. The system-user evaluates the control system at that time and gives a rating to the system-designer based on (18).

  4. Apply Algorithm 1 to the user rating to update the estimate of $q_i$, $i \in \{1, 2\}$.

  5. Update the control law (17) with the updated parameters.

  6. Go back to Stage 2.

The experiment was performed under the initial guess $(q_{1,0}, q_{2,0}) = (60, 5)$.
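Stage 2 can be sketched as follows: the closed loop $u(k) = -K x(k)$ is simulated from a random initial state in $[0, 10]^2$, and the quantities $f_0 = \sum u^{\top} W u$, $f_1 = \sum x_1^2$, $f_2 = \sum x_2^2$ entering the user objective are accumulated. The finite horizon below is a numerical truncation of the infinite sum, and the decomposition into $f_0, f_1, f_2$ is our reading of $J$ with $Q = \mathrm{diag}\{q_1, q_2\}$; both are assumptions of this sketch.

```python
import numpy as np

def simulate_experience(A, B, K, W, n_steps=50, rng=None):
    """Stage 2: run the closed loop u(k) = -K x(k) from a random initial state
    in [0, 10]^2 and accumulate the basis functions of the user objective,
    f0 = sum u' W u, f1 = sum x1^2, f2 = sum x2^2 (finite-horizon truncation)."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.uniform(0.0, 10.0, size=2)
    f0 = f1 = f2 = 0.0
    for _ in range(n_steps):
        u = -K @ x                     # control law (17)
        x = A @ x + B @ u              # plant (16)
        f0 += float(u @ W @ u)
        f1 += x[0] ** 2
        f2 += x[1] ** 2
    return f0, f1, f2
```

The triple $(f_0, f_1, f_2)$ returned here is exactly the data needed by the rating model and by the half-space construction of Section 3.1, so the stages of the experiment can be chained directly.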

5.2. Experiment results

The results of the experiment are given in Figures 6–8. The transitions of the parameter estimates $\hat{q}_1$ and $\hat{q}_2$ obtained from the experiment are shown in Figure 6. In the figure, the horizontal axis represents the number of parameter updates and the vertical axis represents the parameter value. We see that the parameter estimate $(\hat{q}_1, \hat{q}_2)$ converges to $(q_{1,A}, q_{2,A}) = (50, 1)$ for user A, and to $(q_{1,B}, q_{2,B}) = (1, 50)$ for user B.

Figure 6. Transitions of parameter estimate. (a) user A and (b) user B.


Next, the transitions of the user utility are shown in Figure 7. In the figure, the horizontal axis represents the number of parameter updates and the vertical axis represents the value of the utility gained from the control system. For both user A and user B, we see that the utility is maximized by the algorithm even though the system-designer cannot access full information on the objective function.

Finally, we show the state transitions of the personalized control systems in Figure 8, where the converged parameter estimates are used for control. In the figure, the horizontal axis represents the discrete time $k$ and the vertical axis represents the state of the plant system (16). We see that the control behaviour differs between the users, which indicates that the control system is personalized.

Figure 7. Transitions of utility. (a) user A and (b) user B.


Figure 8. State transitions of personalized control system. (a) user A and (b) user B.


6. Conclusion

In this paper, we addressed the personalization of control systems, where the optimal controller is updated according to the user's private objective. We formulated the problem of estimating the individual objective function based on user ratings and proposed an algorithm to solve it. The algorithm was analysed for a special case where the objective function is characterized by only two parameters. Finally, a numerical experiment showed the usefulness of the algorithm.

We have not proved that Theorem 4.1 holds for the case with three or more parameters, so future work includes extending the analysis to a general objective function. Another direction of future work is to model the user rating in a manner different from (3).

Disclosure statement

No potential conflict of interest was reported by the author(s).

Additional information

Funding

This work was supported by Grant-in-Aid for Scientific Research (B), No. 20H02173, JSPS.

Notes on contributors

Tomotaka Nii

Tomotaka Nii received the B.E. degree from the Department of Applied Physico-informatics, Keio University, in 2022. His research interests include control theory for human-in-the-loop systems.

Masaki Inoue

Masaki Inoue received the M.E. and Ph.D. degrees in mechanical engineering from Osaka University in 2009 and 2012, respectively. He served as a Research Fellow of the Japan Society for the Promotion of Science from 2010 to 2012. From 2012 to 2014, he was a Project Researcher of FIRST, Aihara Innovative Mathematical Modelling Project, and also a Doctoral Researcher of the Graduate School of Information Science and Engineering, Tokyo Institute of Technology. Currently, he is an Associate Professor of the Department of Applied Physico-informatics, Keio University. His research interests include control theory for human-in-the-loop systems. He is a member of IEEE, SICE, and ISCIE.

References

  • Fan H, Poole MS. What is personalization? Perspectives on the design and implementation of personalization in information systems. J Organ Comput Electron Commer. 2006;16(3–4):179–202.
  • Tuzhilin A. Personalization: the state of the art and future directions. Bus Comput. 2009;3(3):3–43.
  • Hasenjager M, Heckmann M, Wersing H. A survey of personalization for advanced driver assistance systems. IEEE Trans Intell Veh. 2020;5(2):335–344.
  • Yi D, Su J, Hu L, et al. Implicit personalization in driving assistance: state-of-the-art and open issues. IEEE Trans Intell Veh. 2019;5(3):397–413.
  • Lu C, Gong J, Lv C, et al. A personalized behavior learning system for human-like longitudinal speed control of autonomous vehicles. Sensors. 2019;19(17):3672.
  • Noto N, Okuda H, Tazaki Y, et al. Steering assisting system for obstacle avoidance based on personalized potential field. In: 2012 15th International IEEE Conference on Intelligent Transportation Systems. IEEE; 2012. p. 1702–1707.
  • Wiering MA, Van Otterlo M. Reinforcement learning. Adapt Learn Optim. 2012;12(3):729.
  • Lewis FL, Vrabie D, Vamvoudakis KG. Reinforcement learning and feedback control: using natural decision methods to design optimal adaptive controllers. IEEE Contr Syst Mag. 2012;32(6):76–105.
  • Kiumarsi B, Vamvoudakis KG, Modares H, et al. Optimal and autonomous control using reinforcement learning: a survey. IEEE Trans Neural Netw Learn Syst. 2018;29(6):2042–2062.
  • Zanon M, Gros S, Bemporad A. Practical reinforcement learning of stabilizing economic MPC. In: 2019 18th European Control Conference (ECC), Naples; 2019. p. 2258–2263.
  • Ernst D, Glavic M, Capitanescu F, et al. Reinforcement learning versus model predictive control: a comparison on a power system problem. IEEE Trans Syst Man Cybern B. 2009;39(2):517–529.
  • Ng AY, Russell S. Algorithms for inverse reinforcement learning. In: International Conference on Machine Learning, Stanford, CA; Vol. 1; 2000. p. 2.
  • Arora S, Doshi P. A survey of inverse reinforcement learning: challenges, methods and progress. Artif Intell. 2021;297:Article ID 103500.
  • Ozkan MF, Ma Y. Modeling driver behavior in car-following interactions with automated and human-driven vehicles and energy efficiency evaluation. IEEE Access. 2021;9:64696–64707.
  • Ozkan MF, Rocque AJ, Ma Y. Inverse reinforcement learning based stochastic driver behavior learning. IFAC-PapersOnLine. 2021;54(20):882–888.
  • Ozkan MF, Ma Y. Personalized adaptive cruise control and impacts on mixed traffic. In: 2021 American Control Conference (ACC), New Orleans; 2021. p. 412–417.
  • Milanese M, Vicino A. Optimal estimation theory for dynamic systems with set membership uncertainty: an overview. Automatica. 1991;27(6):997–1009.
  • Savkin AV, Petersen IR. Set-valued state estimation via a limited capacity communication channel. IEEE Trans Automat Contr. 2003;48(4):676–680.
  • Shi D, Chen T, Shi L. On set-valued kalman filtering and its application to event-based state estimation. IEEE Trans Automat Control. 2015;60(5):1275–1290.