Invited Article

A tutorial introduction to reinforcement learning

Pages 172-191 | Received 27 Dec 2022, Accepted 19 Mar 2023, Published online: 19 Apr 2023

Abstract

In this paper, we present a brief survey of reinforcement learning, with particular emphasis on stochastic approximation (SA) as a unifying theme. The scope of the paper includes Markov reward processes, Markov decision processes, SA algorithms, and widely used algorithms such as temporal difference learning and Q-learning.

This article is part of the following collections:
Virtual Issue on the SICE Annual Conference 2022

1. Introduction

In this paper, we present a brief survey of reinforcement learning (RL), with particular emphasis on stochastic approximation (SA) as a unifying theme. The scope of the paper includes Markov reward processes, Markov decision processes (MDPs), SA methods, and widely used algorithms such as temporal difference learning and Q-learning. RL is a vast subject, and this brief survey can barely do justice to the topic. There are several excellent texts on RL, such as Refs. [1–4]. The dynamics of the SA algorithm are analysed in Refs. [5–11]. The interested reader may consult those sources for more information.

In this survey, we use the phrase "RL" to refer to decision-making with uncertain models in which, in addition, current actions alter the future behaviour of the system. Therefore, if the same action is taken at a future time, the consequences might not be the same. This additional feature distinguishes RL from "mere" decision-making under uncertainty. Figure 1 rather arbitrarily divides decision-making problems into four quadrants. Examples from each quadrant are now briefly described.

  • Many if not most decision-making problems fall into the lower-left quadrant of “good model, no alteration” (meaning that the control actions do not alter the environment). An example is a fighter aircraft which usually has an excellent model, thanks to aerodynamical modelling and/or wind tunnel tests. In turn this permits the control system designers to formulate and to solve an optimal (or some other form of) control problem.

  • Controlling a chemical reactor would be an example from the lower-right quadrant. As a traditional control system, it can be assumed that the environment in which the reactor operates does not change as a consequence of the control strategy adopted. However, due to the complexity of a reactor, it is difficult to obtain a very accurate model, in contrast with a fighter aircraft for example. In such a case, one can adopt one of two approaches. The first, which is a traditional approach in control system theory, is to use a nominal model of the system and to treat the deviations from the nominal model as uncertainties in the model. A controller is designed based on the nominal model, and robust control theory would be invoked to ensure that the controller would still perform satisfactorily (though not necessarily optimally) for the actual system. The second, which would move the problem from the lower-right to the upper-right quadrant, is to attempt to "learn" the unknown dynamical model by probing its response to various inputs. This approach is suggested in Ref. [4, Example 3.1]. A similar statement can be made about robots, where the geometry determines the form of the dynamical equations describing the robot, but not the parameters in those equations; see, for example, Ref. [12]. In this case too, it is possible to "learn" the dynamics through experimentation. In practice, such an approach is far slower than the traditional control systems approach of using a nominal model and designing a "robust" controller. However, "learning control" is a popular area in the world of machine learning. One reason is that if the initial modelling error is too large, then robust control theory alone may not be sufficient to ensure the stability of the actual system with the designed controller. In contrast (and in principle), a "learning control" approach can withstand larger modelling errors. The widely used model predictive control paradigm can be viewed as an example of a learning-based approach.

  • A classic example of a problem belonging to the upper-left quadrant is an MDP, which forms the backbone of one approach to RL. In an MDP, there is a state space $X$ and an action space $U$, both of which are usually assumed to be finite. In most MDPs, $|X| \gg |U|$. Board games without an element of randomness, such as tic-tac-toe or chess, would belong to the upper-left quadrant, at least in principle. Tic-tac-toe belongs here because the rules of the game are clear, and the number of possible games is manageable. In principle, games such as chess which are "deterministic" (i.e. there is no throwing of dice, as in Backgammon for example) would also belong here. Chess is a two-person game in which, for each board position, it is possible to assign the likelihood of the three possible outcomes: white wins, black wins, or it is a draw. However, due to the enormous number of possibilities, it is often not possible to determine these likelihoods precisely. It is pointed out explicitly in Ref. [13] that merely because we cannot explicitly compute this likelihood function, it does not follow that the likelihood does not exist! However, as a practical matter, it is not a bad idea to treat this likelihood function as being unknown, and to infer it on the basis of experiment/experience. Thus, as with chemical reactors, it is not uncommon to move chess-playing from the upper-left quadrant to the upper-right quadrant.

  • The upper-right quadrant is the focus of RL. There are many possible ways to formulate RL problems, each of which leads to its own solution methodologies. A very popular approach is to formulate the RL problem as an MDP whose dynamics are unknown. That is the approach adopted in this paper.

Figure 1. The four quadrants of decision-making under uncertainty.


In an MDP, at each time $t$, the learner (also known as the actor or the agent) measures the state $X_t \in X$. Based on this measurement, the learner chooses an action $U_t \in U$ and receives a reward $R(X_t, U_t)$. Future rewards are discounted by a discount factor $\gamma \in (0,1)$. The rule by which the current action $U_t$ is chosen as a function of the current state $X_t$ is known as a policy. With each policy, one can associate the expected value of the total (discounted) reward over time. The problem is to find the best policy. There is a variant called the partially observable MDP in which the state $X_t$ cannot be measured directly; rather, there is an output (or observation) $Y_t \in Y$ which is a memoryless function, either deterministic or random, of $X_t$. These problems are not studied here; it is always assumed that $X_t$ can be measured directly. When the parameters of the MDP are known, there are several approaches to determining the optimal policy. RL is distinct from an MDP in that, in RL, the parameters of the underlying MDP are constant but not known to the learner; they must be learnt on the basis of experimentation. Figure 2 depicts the situation.

Figure 2. Depiction of a RL problem.


The remainder of the paper is organized as follows: in Section 2, Markov reward processes are introduced. These are a precursor to MDPs, which are introduced in Section 3. Specifically, in Section 3.1, the relevant problems in the study of MDPs are formulated. In Section 3.2, the solutions to these problems are given in terms of the Bellman value iteration, the action-value function, and the F-iteration to determine the optimal action-value function. In Section 3.3, we study the situation where the dynamics of the MDP under study are not known precisely. Instead, one has access only to a sample path $\{X_t\}$ of the Markov process under study. For this situation, we present two standard algorithms, known as temporal difference learning and Q-learning. Starting from Section 4, the paper consists of results due to the author. In Section 4.1, the concept of SA is introduced, and its relevance to RL is outlined in Section 4.2. In Section 4.3, a new theorem on the global asymptotic stability of nonlinear ODEs is stated; this theorem is of independent interest. Some theorems on the convergence of the SA algorithm are presented in Sections 4.4 and 4.5. In Section 5, the results of Section 4 are applied to RL problems. In Section 5.1, a technical result on the sample paths of an irreducible Markov process is stated. Using this result, simplified conditions are given for the convergence of the temporal difference algorithm (Section 5.2) and Q-learning (Section 5.3). A brief set of concluding remarks ends the paper.

2. Markov reward processes

Markov reward processes are standard (stationary) Markov processes where each state has a "reward" associated with it. Markov reward processes are a precursor to MDPs; so we review those in this section. There are several standard texts on Markov processes, one of which is Ref. [14].

Suppose $X$ is a finite set of cardinality $n$, written as $\{x_1, \ldots, x_n\}$. If $\{X_t\}_{t \ge 0}$ is a stationary Markov process assuming values in $X$, then the corresponding state transition matrix $A$ is defined by
(1) $$a_{ij} = \Pr\{X_{t+1} = x_j \mid X_t = x_i\}.$$
Thus the $i$th row of $A$ is the conditional probability vector of $X_{t+1}$ when $X_t = x_i$. Clearly the row sums of the matrix $A$ are all equal to one. This can be expressed as $A \mathbf{1}_n = \mathbf{1}_n$, where $\mathbf{1}_n$ denotes the $n$-dimensional column vector whose entries all equal one. Therefore, if we define the induced matrix norm $\|A\|_{\infty\to\infty}$ as
$$\|A\|_{\infty\to\infty} := \max_{v \ne 0} \frac{\|Av\|_\infty}{\|v\|_\infty},$$
then $\|A\|_{\infty\to\infty}$ equals one, which also equals the spectral radius of $A$.

Now suppose that there is a "reward" function $R : X \to \mathbb{R}$ associated with each state. There is no consensus within the community about whether the reward corresponding to the state $X_t$ is paid at time $t$ as in Ref. [3] or at time $t+1$ as in Refs. [2,4]. In this paper, it is assumed that the reward is paid at time $t$ and is denoted by $R_t$; the modifications required to handle the other convention are easy and left to the reader. The reward $R_t$ can be either a deterministic or a random function of $X_t$. If $R_t$ is a deterministic function of $X_t$, then $R_t = R(X_t)$, where $R$ is the reward function mapping $X$ into (a finite subset of) $\mathbb{R}$. Thus, whenever the trajectory $\{X_t\}$ of the Markov process equals some state $x_i \in X$, the resulting reward $R(X_t)$ always equals $R(x_i) =: r_i$. Thus the reward is captured by an $n$-dimensional vector $r$, where $r_i = R(x_i)$. On the other hand, if $R_t$ is a random function of $X_t$, then one would have to provide the probability distribution of $R_t$ given $X_t$. Since $X_t$ has only $n$ different values, we would have to provide $n$ different probability distributions. Note that because the set $X$ is finite, if the reward function is deterministic, then we have that
$$\max_{x_i \in X} R(x_i) < \infty.$$
In case the reward function $R$ is random, it is common, in order to avoid technical difficulties, to assume that $R(x_i)$ is a bounded random variable for each index $i \in [n]$, where the symbol $[n]$ denotes $\{1, \ldots, n\}$. With this assumption, it follows that
$$\max_{x_i \in X} E[R(x_i)] < \infty.$$
Two kinds of Markov reward processes are widely studied, namely discounted reward processes and average reward processes. In this paper, we restrict attention to discounted reward processes. However, we briefly introduce average reward processes. Define (if it exists)
$$V(x_i) := \lim_{T \to \infty} \frac{1}{T+1}\, E\Big[\sum_{t=0}^{T} R(X_t) \,\Big|\, X_0 = x_i\Big].$$
An excellent review of average reward processes can be found in Ref. [15].

In a discounted Markov reward process, there is a "discount factor" $\gamma \in (0,1)$. This factor captures the extent to which future rewards are less valuable than immediate rewards. Fix an initial state $x_i \in X$. Then the expected discounted future reward $V(x_i)$ is defined as
(2) $$V(x_i) := E\Big[\sum_{t=0}^{\infty} \gamma^t R_t \,\Big|\, X_0 = x_i\Big] = E\Big[\sum_{t=0}^{\infty} \gamma^t R(X_t) \,\Big|\, X_0 = x_i\Big].$$
We often just say "discounted reward" instead of the longer phrase. With these assumptions, because $\gamma < 1$, the above summation converges and is well defined. The quantity $V(x_i)$ is referred to as the value function associated with $x_i$, and the vector
(3) $$v = [\,V(x_1) \;\cdots\; V(x_n)\,]^\top$$
is referred to as the value vector. Note that, throughout this paper, we view the value both as a function $V : X \to \mathbb{R}$ and as a vector $v \in \mathbb{R}^n$. The relationship between the two is given by (3). We shall use whichever interpretation is more convenient in a given context.

This raises the question as to how the value function and/or value vector is to be determined. Define the vector $r \in \mathbb{R}^n$ via
(4) $$r := [\,r_1 \;\cdots\; r_n\,]^\top,$$
where $r_i = R(x_i)$ if $R$ is a deterministic function, and, if $R_t$ is a random function of $X_t$,
(5) $$r_i := E[R(x_i)].$$
The next result gives a useful characterization of the value vector.

Theorem 2.1

The vector $v$ satisfies the recursive relationship
(6) $$v = r + \gamma A v,$$
or, in expanded form,
(7) $$V(x_i) = r_i + \gamma \sum_{j=1}^{n} a_{ij} V(x_j).$$

Proof.

Let $x_i \in X$ be arbitrary. Then by definition we have
(8) $$V(x_i) = E\Big[\sum_{t=0}^{\infty} \gamma^t R_t \,\Big|\, X_0 = x_i\Big] = r_i + E\Big[\sum_{t=1}^{\infty} \gamma^t R_t \,\Big|\, X_0 = x_i\Big].$$
However, if $X_0 = x_i$, then $X_1 = x_j$ with probability $a_{ij}$. Therefore, we can write
(9) $$E\Big[\sum_{t=1}^{\infty} \gamma^t R_t \,\Big|\, X_0 = x_i\Big] = \sum_{j=1}^{n} a_{ij}\, E\Big[\sum_{t=1}^{\infty} \gamma^t R_t \,\Big|\, X_1 = x_j\Big] = \gamma \sum_{j=1}^{n} a_{ij}\, E\Big[\sum_{t=0}^{\infty} \gamma^t R_t \,\Big|\, X_0 = x_j\Big] = \gamma \sum_{j=1}^{n} a_{ij} V(x_j).$$
In the second step we use the fact that the Markov process is stationary. Substituting (9) into (8) gives the recursive relationship (7).

Example 2.2

As an illustration of a Markov reward process, we analyse a toy snakes and ladders game with the transitions shown in Figure 3. Here W and L denote "win" and "lose," respectively. The rules of the game are as follows:

  • Initial state is S.

  • A four-sided, fair die is thrown at each stage.

  • Player must land exactly on W to win and exactly on L to lose.

  • If implementing a move causes crossing of W and L, then the move is not implemented.

Figure 3. A toy snakes and ladders game.


There are 12 possible states in all: S, 1, …, 9, W, and L. However, 2, 3, and 9 can be omitted, leaving nine states, namely S, 1, 4, 5, 6, 7, 8, W, and L. At each step, there are at most four possible outcomes. For example, from the state S, the four outcomes are 1, 7, 5, and 4. From state 6, the four outcomes are 7, 8, 1, and W. From state 7, the four outcomes are 8, 1, W, and L. From state 8, the four possible outcomes are 1, W, L, and 8, with probability 1/4 each, because if the die comes up with 4, then the move cannot be implemented. It is time-consuming but straightforward to compute the state transition matrix as
$$A = \begin{array}{c|ccccccccc}
 & S & 1 & 4 & 5 & 6 & 7 & 8 & W & L \\ \hline
S & 0 & 0.25 & 0.25 & 0.25 & 0 & 0.25 & 0 & 0 & 0 \\
1 & 0 & 0 & 0.25 & 0.50 & 0 & 0.25 & 0 & 0 & 0 \\
4 & 0 & 0 & 0 & 0.25 & 0.25 & 0.25 & 0.25 & 0 & 0 \\
5 & 0 & 0.25 & 0 & 0 & 0.25 & 0.25 & 0.25 & 0 & 0 \\
6 & 0 & 0.25 & 0 & 0 & 0 & 0.25 & 0.25 & 0.25 & 0 \\
7 & 0 & 0.25 & 0 & 0 & 0 & 0 & 0.25 & 0.25 & 0.25 \\
8 & 0 & 0.25 & 0 & 0 & 0 & 0 & 0.25 & 0.25 & 0.25 \\
W & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
L & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{array}$$

We define a reward function for this problem as follows: we set $R_t = f(X_{t+1})$, where $f(W) = 5$, $f(L) = -2$, and $f(x) = 0$ for all other states. With this definition, the reward $R_t$ is a random function of the current state $X_t$; however, there is an expected reward depending on the state at the next time instant. For example, if $X_0 = 6$, then the expected value of $R_0$ is $5/4$, whereas if $X_0 = 7$ or $X_0 = 8$, then the expected value of $R_0$ is $3/4$.

Now let us see how the implicit equation (6) can be solved to determine the value vector $v$. Since the induced matrix norm $\|A\|_{\infty\to\infty} = 1$ and $\gamma < 1$, it follows that the matrix $I - \gamma A$ is nonsingular. Therefore, for every reward vector $r$, there is a unique $v$ that satisfies (6). In principle, it is possible to deduce from (6) that
(10) $$v = (I - \gamma A)^{-1} r.$$
The difficulty with this formula, however, is that in most actual applications of Markov decision problems, the integer $n$ denoting the size of the state space $X$ is quite large. Moreover, inverting a matrix has cubic complexity in the size of the matrix. Therefore, it may not be practicable to invert the matrix $I - \gamma A$. So we are forced to look for alternative approaches. A feasible approach is provided by the contraction mapping theorem.

Theorem 2.3

The map $y \mapsto Ty := r + \gamma A y$ is monotone and is a contraction with respect to the $\ell_\infty$-norm, with contraction constant $\gamma$. Therefore, we can choose some vector $y_0$ arbitrarily, and then define
$$y_{i+1} = r + \gamma A y_i.$$
Then $y_i$ converges to the value vector $v$.

Proof.

The first statement is that if $y_1 \le y_2$ componentwise (and note that the vectors $y_1$ and $y_2$ need not consist of only positive components), then $Ty_1 \le Ty_2$. This is obvious from the fact that the matrix $A$ has only nonnegative components, so that $Ay_1 \le Ay_2$. For the second statement, note that, because the matrix $A$ is row-stochastic, the induced matrix norm $\|A\|_{\infty\to\infty}$ is equal to one. Therefore,
$$\|Ty_1 - Ty_2\|_\infty = \gamma \|A(y_1 - y_2)\|_\infty \le \gamma \|y_1 - y_2\|_\infty.$$
This completes the proof.
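To make Theorem 2.3 concrete, the following minimal Python sketch (not from the paper) applies the iteration $y_{i+1} = r + \gamma A y_i$ to the snakes-and-ladders game of Example 2.2. The discount factor $\gamma = 0.9$, the tolerance, and the function names are my own choices.

```python
import numpy as np

# Transition matrix of the toy snakes-and-ladders game (state order S,1,4,5,6,7,8,W,L).
A = np.array([
    [0, 0.25, 0.25, 0.25, 0,    0.25, 0,    0,    0   ],  # S
    [0, 0,    0.25, 0.50, 0,    0.25, 0,    0,    0   ],  # 1
    [0, 0,    0,    0.25, 0.25, 0.25, 0.25, 0,    0   ],  # 4
    [0, 0.25, 0,    0,    0.25, 0.25, 0.25, 0,    0   ],  # 5
    [0, 0.25, 0,    0,    0,    0.25, 0.25, 0.25, 0   ],  # 6
    [0, 0.25, 0,    0,    0,    0,    0.25, 0.25, 0.25],  # 7
    [0, 0.25, 0,    0,    0,    0,    0.25, 0.25, 0.25],  # 8
    [0, 0,    0,    0,    0,    0,    0,    1,    0   ],  # W (absorbing)
    [0, 0,    0,    0,    0,    0,    0,    0,    1   ],  # L (absorbing)
])
f = np.array([0, 0, 0, 0, 0, 0, 0, 5, -2])  # terminal rewards f(W)=5, f(L)=-2
r = A @ f        # expected reward r_i = E[f(X_{t+1}) | X_t = x_i]; gives 5/4 at state 6, 3/4 at 7 and 8
gamma = 0.9      # assumed discount factor (not specified in the text)

def value_iteration(r, A, gamma, tol=1e-10):
    """Iterate y <- r + gamma * A @ y until the sup-norm change is below tol (Theorem 2.3)."""
    y = np.zeros_like(r, dtype=float)
    while True:
        y_next = r + gamma * A @ y
        if np.max(np.abs(y_next - y)) < tol:
            return y_next
        y = y_next

v = value_iteration(r, A, gamma)
# For this small example, v agrees with the closed form (10):
# np.linalg.solve(np.eye(9) - gamma * A, r)
```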

There is however a limitation to this approach, namely that the state transition matrix A has to be known. In RL, this assumption is often not satisfied. Instead, one has access to a single sample path $\{X_t\}$ of a Markov process over $X$, whose state transition matrix is $A$. The question therefore arises: How can one compute the value vector $v$ in such a scenario? The answer is provided by the so-called temporal difference algorithm, which is discussed in Section 3.3.

3. MDPs

3.1. Problem formulation

In a Markov reward process, the state $X_t$ evolves on its own, according to a predetermined state transition matrix. In contrast, in an MDP, there is also another variable called the "action" which affects the dynamics. Specifically, in addition to the state space $X$, there is also a finite set of actions $U$. Each action $u_k \in U$ leads to a corresponding state transition matrix $A^{u_k} = [a_{ij}^{u_k}]$. So at time $t$, if the state is $X_t$ and an action $U_t \in U$ is applied, then
(11) $$\Pr\{X_{t+1} = x_j \mid X_t = x_i, U_t = u_k\} = a_{ij}^{u_k}.$$
Obviously, for each fixed $u_k \in U$, the corresponding state transition matrix $A^{u_k}$ is row-stochastic. In addition, there is also a "reward" function $R : X \times U \to \mathbb{R}$. Note that in a Markov reward process, the reward depends only on the current state, whereas in an MDP, the reward depends on both the current state and the action taken. As in Markov reward processes, it is possible to permit $R$ to be a random function of $X_t$ and $U_t$ as opposed to a deterministic function. Moreover, to be consistent with the earlier convention, it is assumed that the reward $R(X_t, U_t)$ is paid at time $t$.

The most important aspect of an MDP is the concept of a "policy," which is just a systematic way of choosing $U_t$ given $X_t$. If $\pi : X \to U$ is any map, this would be called a deterministic policy, and the set of all deterministic policies is denoted by $\Pi_d$. Alternatively, let $S(U)$ denote the set of probability distributions on the finite set $U$. Then a map $\pi : X \to S(U)$ would be called a probabilistic policy, and the set of probabilistic policies is denoted by $\Pi_p$. Note that the cardinality of $\Pi_d$ equals $|U|^{|X|}$, while the set $\Pi_p$ is uncountable.

A vital point about MDPs is this: whenever any policy $\pi$, whether deterministic or probabilistic, is implemented, the resulting process $\{X_t\}$ is a Markov process with an associated state transition matrix, which is denoted by $A^\pi$. This matrix can be determined as follows: if $\pi \in \Pi_d$, then at time $t$, if $X_t = x_i$, the corresponding action $U_t$ equals $\pi(x_i)$. Therefore,
(12) $$\Pr\{X_{t+1} = x_j \mid X_t = x_i, \pi\} = a_{ij}^{\pi(x_i)}.$$
If $\pi \in \Pi_p$ and
(13) $$\pi(x_i) = [\,\phi_{i1} \;\cdots\; \phi_{im}\,],$$
where $m = |U|$, then
(14) $$\Pr\{X_{t+1} = x_j \mid X_t = x_i, \pi\} = \sum_{k=1}^{m} \phi_{ik}\, a_{ij}^{u_k}.$$
In a similar manner, for every policy $\pi$, the reward function $R : X \times U \to \mathbb{R}$ can be converted into a reward map $R^\pi : X \to \mathbb{R}$ as follows: if $\pi \in \Pi_d$, then
(15) $$R^\pi(x_i) = R(x_i, \pi(x_i)),$$
whereas if $\pi \in \Pi_p$, then
(16) $$R^\pi(x_i) = \sum_{k=1}^{m} \phi_{ik}\, R(x_i, u_k).$$
Thus, given any policy $\pi$, whether deterministic or probabilistic, we can associate with it a reward vector $r^\pi$. To summarize, given any MDP, once a policy $\pi$ is chosen, the resulting process $\{X_t\}$ is a Markov reward process with state transition matrix $A^\pi$ and reward vector $r^\pi$.

Example 3.1

To illustrate these ideas, suppose $n = 4$ and $m = 2$, so that there are four states and two actions. Thus, there are two $4 \times 4$ state transition matrices $A^1, A^2$ corresponding to the two actions. (In the interests of clarity, we write $A^1$ and $A^2$ instead of $A^{u_1}$ and $A^{u_2}$.) Suppose $\pi_1$ is a deterministic policy, represented as an $n \times m$ matrix (in this case a $4 \times 2$ matrix), as follows:
$$M_1 = \begin{bmatrix} 0 & 1 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}.$$
This means that if $X_t = x_1, x_2$, or $x_4$, then $U_t = u_2$, while if $X_t = x_3$, then $U_t = u_1$. Let us use the notation $(A^k)_i$ to denote the $i$th row of the matrix $A^k$, where $k = 1, 2$ and $i = 1, 2, 3, 4$. Then the state transition matrix $A^{\pi_1}$ is given by
$$A^{\pi_1} = \begin{bmatrix} (A^2)_1 \\ (A^2)_2 \\ (A^1)_3 \\ (A^2)_4 \end{bmatrix}.$$
Thus, the first, second, and fourth rows of $A^{\pi_1}$ come from $A^2$, while the third row comes from $A^1$.

Next, suppose $\pi_2$ is a probabilistic policy, represented by the matrix
$$M_2 = \begin{bmatrix} 0.3 & 0.7 \\ 0.2 & 0.8 \\ 0.9 & 0.1 \\ 0.4 & 0.6 \end{bmatrix}.$$
Thus, if $X_t = x_1$, then the action $U_t$ equals $u_1$ with probability 0.3 and equals $u_2$ with probability 0.7, and so on. For this policy, the resulting state transition matrix is determined as follows:
$$A^{\pi_2} = \begin{bmatrix} 0.3 (A^1)_1 + 0.7 (A^2)_1 \\ 0.2 (A^1)_2 + 0.8 (A^2)_2 \\ 0.9 (A^1)_3 + 0.1 (A^2)_3 \\ 0.4 (A^1)_4 + 0.6 (A^2)_4 \end{bmatrix}.$$
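The construction of $A^\pi$ and $r^\pi$ in (12)–(16) is mechanical, and a short sketch may help. The code below is purely illustrative; the function name and the representation of the policy as an $n \times m$ matrix $M$ (as in Example 3.1, with a deterministic policy being the 0/1 special case) are my own assumptions.

```python
import numpy as np

def induced_mrp(A_list, R, M):
    """Build the Markov reward process (A_pi, r_pi) induced by a policy.

    A_list : list of m row-stochastic (n x n) matrices, one per action.
    R      : (n x m) reward matrix, R[i, k] = R(x_i, u_k).
    M      : (n x m) policy matrix; row i is a probability distribution over actions.
    """
    n, m = M.shape
    A_pi = np.zeros((n, n))
    r_pi = np.zeros(n)
    for i in range(n):
        for k in range(m):
            A_pi[i, :] += M[i, k] * A_list[k][i, :]   # Equations (12)/(14)
            r_pi[i] += M[i, k] * R[i, k]              # Equations (15)/(16)
    return A_pi, r_pi
```

For the policies of Example 3.1, one would pass the matrix $M_1$ (deterministic) or $M_2$ (probabilistic) as the argument M.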

For an MDP, one can pose three questions:

  1. Policy evaluation: We have seen already that given a Markov reward process, with a reward vector $r$ and a discount factor $\gamma$, there corresponds a unique value vector $v$. We have also seen that, for any choice of a policy $\pi$, whether deterministic or probabilistic, there correspond a state transition matrix $A^\pi$ and a reward vector $r^\pi$. Therefore, once a policy $\pi$ is chosen, the MDP becomes a Markov reward process with state transition matrix $A^\pi$ and reward vector $r^\pi$. We define $v^\pi$ to be the value vector associated with this Markov reward process. The question is: How can $v^\pi$ be computed?

  2. Optimal value determination: For each policy $\pi$, there is an associated value vector $v^\pi$. Let us view $v^\pi$ as a map from $X$ to $\mathbb{R}$, so that $V^\pi(x_i)$ is the $i$th component of $v^\pi$. Now suppose $x_i \in X$ is a specified initial state, and define
(17) $$V^*(x_i) := \max_{\pi \in \Pi_p} V^\pi(x_i)$$
to be the optimal value over all policies, when the MDP is started in the initial state $X_0 = x_i$. How can $V^*(x_i)$ be computed? Note that in (17), the optimum is taken over all probabilistic policies. However, it can be shown that the optimum is the same even if $\pi$ is restricted to only deterministic policies.

  3. Optimal policy determination: In (17) above, we associate an optimal policy with each state $x_i$. Now we can extend the idea and define the optimal policy map $\pi^* : X \to \Pi_d$ via
(18) $$\pi^*(x_i) := \arg\max_{\pi \in \Pi_d} V^\pi(x_i).$$
How can the optimal policy map $\pi^*$ be determined? Note that it is not a priori evident that there exists one policy that is optimal for all initial states. But the existence of such an optimal policy can be shown. Also, we can restrict the maximization to $\pi \in \Pi_d$ in (18) because it can be shown that the maximum over $\pi \in \Pi_p$ is not any larger. In other words,
$$\max_{\pi \in \Pi_d} V^\pi(x_i) = \max_{\pi \in \Pi_p} V^\pi(x_i).$$

3.2. MDPs: solution

In this subsection, we present answers to the three questions above.

3.2.1. Policy evaluation

Suppose a policy $\pi$ in $\Pi_d$ or $\Pi_p$ is specified. Then the corresponding state transition matrix $A^\pi$ and reward vector $r^\pi$ are given by (12) (or (14)) and (15), respectively. As pointed out above, once the policy is chosen, the process becomes just a Markov reward process. Then it readily follows from Theorem 2.1 that $v^\pi$ satisfies an equation analogous to (6), namely
(19) $$v^\pi = r^\pi + \gamma A^\pi v^\pi.$$
As before, it is inadvisable to compute $v^\pi$ via $v^\pi = (I - \gamma A^\pi)^{-1} r^\pi$. Instead, one should use value iteration to solve (19). Observe that, whatever the policy $\pi$ might be, the resulting state transition matrix $A^\pi$ satisfies $\|A^\pi\|_{\infty\to\infty} = 1$. Therefore, the map $y \mapsto r^\pi + \gamma A^\pi y$ is a contraction with respect to $\|\cdot\|_\infty$, with contraction constant $\gamma$.

3.2.2. Optimal value determination

Now we introduce one of the key ideas in MDPs. Define the Bellman iteration map $B : \mathbb{R}^n \to \mathbb{R}^n$ via
(20) $$(Bv)_i := \max_{u_k \in U}\Big[R(x_i, u_k) + \gamma \sum_{j=1}^{n} a_{ij}^{u_k} v_j\Big].$$

Theorem 3.2

The map $B$ is monotone and a contraction with respect to the $\ell_\infty$-norm. Therefore, the fixed point $\bar v$ of the map $B$ satisfies the relation
(21) $$(\bar v)_i = \max_{u_k \in U}\Big[R(x_i, u_k) + \gamma \sum_{j=1}^{n} a_{ij}^{u_k} (\bar v)_j\Big].$$

Note that (21) is known as the Bellman optimality equation. Thus, in principle at least, we can choose an arbitrary initial guess $v_0 \in \mathbb{R}^n$ and repeatedly apply the Bellman iteration. The resulting iterates converge to the unique fixed point of the operator $B$, which we denote by $\bar v$.

The significance of the Bellman iteration is given by the next theorem.

Theorem 3.3

Define $\bar v \in \mathbb{R}^n$ to be the unique fixed point of $B$, and define $v^* \in \mathbb{R}^n$ to equal $[V^*(x_i), x_i \in X]$, where $V^*(x_i)$ is defined in (17). Then $\bar v = v^*$.

Therefore, the optimal value vector can be computed using the Bellman iteration. However, knowing the optimal value vector does not, by itself, give us an optimal policy.
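As an illustration (not taken from the paper), the Bellman iteration (20) can be implemented in a few lines when the matrices $A^{u_k}$ and the reward table are available; the helper name, data layout, and tolerance below are my own choices.

```python
import numpy as np

def bellman_iteration(A_list, R, gamma, tol=1e-10):
    """Approximate the optimal value vector by iterating the map B of Equation (20).

    A_list : list of m row-stochastic (n x n) matrices A^{u_k}.
    R      : (n x m) matrix with R[i, k] = R(x_i, u_k).
    """
    n, m = R.shape
    v = np.zeros(n)
    while True:
        # (Bv)_i = max_k [ R(x_i, u_k) + gamma * sum_j a_ij^{u_k} v_j ]
        Bv = np.max(R + gamma * np.stack([A @ v for A in A_list], axis=1), axis=1)
        if np.max(np.abs(Bv - v)) < tol:
            return Bv
        v = Bv
```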

3.2.3. Optimal policy determination

To solve the problem of optimal policy determination, we introduce another function $Q^\pi : X \times U \to \mathbb{R}$, known as the action-value function, which is defined as follows:
(22) $$Q^\pi(x_i, u_k) := R(x_i, u_k) + E_\pi\Big[\sum_{t=1}^{\infty} \gamma^t R^\pi(X_t) \,\Big|\, X_0 = x_i, U_0 = u_k\Big].$$
This function was first defined in Ref. [16]. Note that $Q^\pi$ is defined only for deterministic policies. In principle, it is possible to define it for probabilistic policies, but this is not commonly done. In the above definition, the expectation $E_\pi$ is with respect to the evolution of the state $X_t$ under the policy $\pi$.

The way in which an MDP is set up is that at time $t$, the Markov process reaches a state $X_t$, based on the previous state $X_{t-1}$ and the state transition matrix $A^\pi$ corresponding to the policy $\pi$. Once $X_t$ is known, the policy $\pi$ determines the action $U_t = \pi(X_t)$, and then the reward $R^\pi(X_t) = R(X_t, \pi(X_t))$ is generated. In particular, when defining the value function $V^\pi(x_i)$ corresponding to a policy $\pi$, we start off the MDP in the initial state $X_0 = x_i$ and choose the action $U_0 = \pi(x_i)$. However, in defining the action-value function $Q^\pi$, we do not feel compelled to set $U_0 = \pi(X_0) = \pi(x_i)$, and can choose an arbitrary action $u_k \in U$. From $t = 1$ onwards, however, the action $U_t$ is chosen as $U_t = \pi(X_t)$. This seemingly small change leads to some simplifications.

Just as we can interpret $V^\pi : X \to \mathbb{R}$ as an $n$-dimensional vector, we can interpret $Q^\pi : X \times U \to \mathbb{R}$ as an $nm$-dimensional vector, or as a matrix of dimension $n \times m$. Consequently, the $Q^\pi$-vector has higher dimension than the value vector.

Theorem 3.4

For each policy $\pi \in \Pi_d$, the function $Q^\pi$ satisfies the recursive relationship

(23) $$Q^\pi(x_i, u_k) = R(x_i, u_k) + \gamma \sum_{j=1}^{n} a_{ij}^{u_k}\, Q^\pi(x_j, \pi(x_j)).$$

Proof.

Observe that at time $t = 0$, the state transition matrix is $A^{u_k}$. So, given that $X_0 = x_i$ and $U_0 = u_k$, the next state $X_1$ has the distribution $[a_{ij}^{u_k}, j = 1, \ldots, n]$. Moreover, $U_1 = \pi(X_1)$ because the policy $\pi$ is implemented from time $t = 1$ onwards. Therefore,
$$Q^\pi(x_i, u_k) = R(x_i, u_k) + E_\pi\Big[\sum_{j=1}^{n} a_{ij}^{u_k}\Big(\gamma R(x_j, \pi(x_j)) + \sum_{t=2}^{\infty} \gamma^t R^\pi(X_t) \,\Big|\, X_1 = x_j, U_1 = \pi(x_j)\Big)\Big]$$
$$= R(x_i, u_k) + E_\pi\Big[\gamma \sum_{j=1}^{n} a_{ij}^{u_k}\Big(R(x_j, \pi(x_j)) + \sum_{t=1}^{\infty} \gamma^t R^\pi(X_t) \,\Big|\, X_1 = x_j, U_1 = \pi(x_j)\Big)\Big]$$
$$= R(x_i, u_k) + \gamma \sum_{j=1}^{n} a_{ij}^{u_k}\, Q^\pi(x_j, \pi(x_j)).$$
This is the desired conclusion.

Theorem 3.5

The functions $V^\pi$ and $Q^\pi$ are related via
(24) $$V^\pi(x_i) = Q^\pi(x_i, \pi(x_i)).$$

Proof.

If we choose $u_k = \pi(x_i)$, then (23) becomes
$$Q^\pi(x_i, \pi(x_i)) = R^\pi(x_i) + \gamma \sum_{j=1}^{n} a_{ij}^{\pi(x_i)}\, Q^\pi(x_j, \pi(x_j)).$$
This is the same as (19) written out componentwise, with $Q^\pi(x_j, \pi(x_j))$ in place of $V^\pi(x_j)$. We know that (19) has a unique solution, namely $V^\pi$. This shows that (24) holds.

The import of Theorem 3.5 is the following: in defining the function $Q^\pi(x_i, u_k)$ for a fixed policy $\pi \in \Pi_d$, we have the freedom to choose the initial action $u_k$ as any element we wish in the action space $U$. However, if we choose the initial action $u_k = \pi(x_i)$ for each state $x_i \in X$, then the corresponding action-value function $Q^\pi(x_i, u_k)$ equals the value function $V^\pi(x_i)$, for each state $x_i \in X$.

In view of (24), the recursive equation for $Q^\pi$ can be rewritten as
(25) $$Q^\pi(x_i, u_k) = R(x_i, u_k) + \gamma \sum_{j=1}^{n} a_{ij}^{u_k}\, V^\pi(x_j).$$
This motivates the next theorem.

Theorem 3.6

Define $Q^* : X \times U \to \mathbb{R}$ by
(26) $$Q^*(x_i, u_k) = R(x_i, u_k) + \gamma \sum_{j=1}^{n} a_{ij}^{u_k}\, V^*(x_j).$$
Then $Q^*(\cdot,\cdot)$ satisfies the following relationships:
(27) $$Q^*(x_i, u_k) = R(x_i, u_k) + \gamma \sum_{j=1}^{n} a_{ij}^{u_k} \max_{w_l \in U} Q^*(x_j, w_l),$$
(28) $$V^*(x_i) = \max_{u_k \in U} Q^*(x_i, u_k).$$
Moreover, every policy $\pi^* \in \Pi_d$ such that
(29) $$\pi^*(x_i) = \arg\max_{u_k \in U} Q^*(x_i, u_k)$$
is optimal.

Proof.

Since $Q^*(\cdot,\cdot)$ is defined by (26), it follows that
$$\max_{u_k \in U} Q^*(x_i, u_k) = \max_{u_k \in U}\Big[R(x_i, u_k) + \gamma \sum_{j=1}^{n} a_{ij}^{u_k} V^*(x_j)\Big] = V^*(x_i),$$
where the last step is the Bellman optimality equation (21). This establishes (28) and (29). Substituting (28) into (26) gives (27).

Theorem 3.6 converts the problem of determining an optimal policy into one of solving the implicit equation (27). For this purpose, we define an iteration on action-value functions that is analogous to (20) for value functions. As with the value function, the action-value function can be viewed either as a map $Q : X \times U \to \mathbb{R}$, or as a vector in $\mathbb{R}^{nm}$, or as an $n \times m$ matrix. We use whichever interpretation is convenient in the given situation.

Theorem 3.7

Define $F : \mathbb{R}^{|X| \times |U|} \to \mathbb{R}^{|X| \times |U|}$ by
(30) $$[F(Q)](x_i, u_k) := R(x_i, u_k) + \gamma \sum_{j=1}^{n} a_{ij}^{u_k} \max_{w_l \in U} Q(x_j, w_l).$$
Then the map $F$ is monotone and is a contraction. Moreover, for all $Q_0 : X \times U \to \mathbb{R}$, the sequence of iterates $\{F^t(Q_0)\}$ converges to $Q^*$ as $t \to \infty$.

If we rewrite (21) and (27) in terms of expected values, the differences between the $Q$-function and the $V$-function become apparent. We can rewrite (21) as
(31) $$V^*(X_t) = \max_{U_t \in U}\big\{R(X_t, U_t) + \gamma E[V^*(X_{t+1}) \mid X_t, U_t]\big\}$$
and (27) as
(32) $$Q^*(X_t, U_t) = R(X_t, U_t) + \gamma E\Big[\max_{U_{t+1} \in U} Q^*(X_{t+1}, U_{t+1}) \,\Big|\, X_t, U_t\Big].$$
Thus in the Bellman formulation and iteration, the maximization occurs outside the expectation, whereas with the $Q$-formulation and $F$-iteration, the maximization occurs inside the expectation.
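For completeness, here is a hypothetical sketch of the $F$-iteration (30), which also reads off a greedy policy via (29). The function name, tolerance, and data layout are my own; the paper itself does not prescribe an implementation.

```python
import numpy as np

def f_iteration(A_list, R, gamma, tol=1e-10):
    """Iterate the map F of Equation (30) to approximate Q*, then read off
    an optimal deterministic policy via Equation (29).

    A_list : list of m row-stochastic (n x n) matrices A^{u_k}.
    R      : (n x m) matrix with R[i, k] = R(x_i, u_k).
    """
    n, m = R.shape
    Q = np.zeros((n, m))
    while True:
        V = Q.max(axis=1)                                   # V(x_j) = max_w Q(x_j, w)
        FQ = R + gamma * np.stack([A @ V for A in A_list], axis=1)
        if np.max(np.abs(FQ - Q)) < tol:
            break
        Q = FQ
    policy = Q.argmax(axis=1)                               # pi*(x_i) = argmax_k Q*(x_i, u_k)
    return Q, policy
```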

3.3. Iterative algorithms for MDPs with unknown dynamics

In principle, Theorem 2.3 can be used to compute, to arbitrary precision, the value vector of a Markov reward process. Similarly, Theorem 3.7 can be used to compute, to arbitrary precision, the optimal action-value function of an MDP, from which both the optimal value function and the optimal policy can be determined. However, both theorems depend crucially on knowing the dynamics of the underlying process. For instance, if the state transition matrix $A$ is not known, it would not be possible to carry out the iterations
$$y_{i+1} = r + \gamma A y_i.$$
Early researchers in RL were aware of this issue and developed several algorithms that do not require explicit knowledge of the dynamics of the underlying process. Instead, it is assumed that a sample path $\{X_t\}_{t \ge 0}$ of the Markov process, together with the associated reward process, is available for use. With this information, one can think of two distinct approaches. First, one can use the sample path to estimate the state transition matrix, call it $\hat A$. After a sufficiently long sample path has been observed, the contraction iteration above can be applied with $A$ replaced by $\hat A$; this would correspond to so-called "indirect adaptive control." The second approach is to use the sample path right from time $t = 0$, and to adjust only one component of the estimated value function at each time instant $t$; this would correspond to so-called "direct adaptive control." Using a similar approach, it is also possible to estimate the action-value function based on a single sample path. We describe two such algorithms, namely temporal difference learning for estimating the value function of a Markov reward process, and Q-learning for estimating the action-value function of an MDP. Within temporal difference learning, we make a further distinction between estimating the full value vector and estimating a projection of the value vector onto a lower-dimensional subspace.

3.3.1. Temporal difference learning without function approximation

In this subsection and the next, we describe the so-called "temporal difference" family of algorithms, first introduced in Ref. [17]. The objective of the algorithm is to compute the value vector of a Markov reward process. Recall that the value vector $v$ of a Markov reward process satisfies (6). In the temporal difference approach, it is not assumed that the state transition matrix $A$ is known. Rather, it is assumed that the learner has available a sample path $\{X_t\}$ of the Markov process under study, together with the associated reward at each time. For simplicity, it is assumed that the reward is deterministic and not random; thus the reward at time $t$ is just $R(X_t)$ and does not add any information.

There are two variants of the algorithm. In the first, one constructs a sequence of approximations $\hat v_t$ that, one hopes, converges to the true value vector $v$ as $t \to \infty$. In the second, which is used when $n$ is very large, one chooses a "basis representation matrix" $\Psi \in \mathbb{R}^{n \times d}$, where $d \ll n$. Then one constructs a sequence of vectors $\theta_t \in \mathbb{R}^d$, such that the corresponding sequence of vectors $\Psi\theta_t \in \mathbb{R}^n$ forms an approximation to the value vector $v$. Since there is no a priori reason to believe that $v$ belongs to the range of $\Psi$, there is also no reason to believe that $\Psi\theta_t$ would converge to $v$. The second approach is called temporal difference learning with function approximation. The first is studied in this subsection, while the second is studied in the next subsection.

In principle, by observing the sample path for a sufficiently long duration, it is possible to make a reliable estimate of $A$. However, a key feature of the temporal difference algorithm is that it is a "direct" method, which works directly with the sample path, without attempting to infer the underlying Markov process. With the sample path $\{X_t\}$ of the Markov process, one can associate a corresponding "index process" $\{N_t\}$ taking values in $[n]$ as follows: $N_t = i$ if $X_t = x_i \in X$. It is obvious that the index process has the same transition matrix $A$ as the process $\{X_t\}$. The idea is to start with an initial estimate $\hat v_0$, and to update it at each time $t$ based on the sample path $\{(X_t, R(X_t))\}$.

Now we introduce the temporal difference, or TD($\lambda$), algorithm studied in this paper. This version of the TD($\lambda$) algorithm comes from Ref. [18, Equation (4.7)] and is as follows: let $v$ denote the unique solution of the equation
$$v = r + \gamma A v.$$
At time $t$, let $\hat v_t \in \mathbb{R}^n$ denote the current estimate of $v$. Thus the $i$th component of $\hat v_t$, denoted by $\hat V_{t,i}$, is the estimate of the reward when the initial state is $x_i$. Let $\{N_t\}$ be the index process defined above. Define the "temporal difference"
(33) $$\delta_{t+1} := R_{N_t} + \gamma \hat V_{t, N_{t+1}} - \hat V_{t, N_t}, \quad t \ge 0,$$
where $\hat V_{t, N_t}$ denotes the $N_t$th component of the vector $\hat v_t$. Equivalently, if the state at time $t$ is $x_i \in X$ and the state at the next time $t+1$ is $x_j$, then
(34) $$\delta_{t+1} = R_i + \gamma \hat V_{t,j} - \hat V_{t,i}.$$
Next, choose a number $\lambda \in [0, 1)$. Define the "eligibility vector"
(35) $$z_t = \sum_{\tau=0}^{t} (\gamma\lambda)^\tau I_{\{N_{t-\tau} = N_t\}}\, e_{N_{t-\tau}},$$
where $e_{N_s}$ is a unit vector with a 1 in location $N_s$ and zeros elsewhere. Since the indicator function in the above summation picks up only those occurrences where $N_{t-\tau} = N_t$, the vector $z_t$ can also be expressed as
(36) $$z_t = \bar z_t\, e_{N_t}, \quad \bar z_t = \sum_{\tau=0}^{t} (\gamma\lambda)^\tau I_{\{N_{t-\tau} = N_t\}}.$$
Thus the support of the vector $z_t$ consists of the singleton $\{N_t\}$. Finally, update the estimate $\hat v_t$ as
(37) $$\hat v_{t+1} = \hat v_t + \alpha_t \delta_{t+1} z_t,$$
where $\{\alpha_t\}$ is a sequence of step sizes. Note that, at time $t$, only the $N_t$th component of $\hat v_t$ is updated, and the rest remain the same.
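A minimal sketch of the above update, assuming a recorded sample path and reward sequence, is given below. Note that, per (35)–(36), only the component corresponding to the current state is updated; the variable names and the step-size interface are mine.

```python
import numpy as np

def td_lambda(path, rewards, n, gamma, lam, alpha):
    """Tabular TD(lambda) as in (33)-(37).

    path    : observed state indices N_0, N_1, ... (values in 0..n-1)
    rewards : rewards R_{N_t} observed along the path
    alpha   : callable t -> step size alpha_t
    """
    v_hat = np.zeros(n)
    w = np.zeros(n)   # accumulating trace: w[i] = sum_tau (gamma*lam)^tau I{N_{t-tau}=i}
    for t in range(len(path) - 1):
        i, j = path[t], path[t + 1]
        delta = rewards[t] + gamma * v_hat[j] - v_hat[i]   # temporal difference (33)
        w *= gamma * lam
        w[i] += 1.0
        # Per (35)-(36), the eligibility vector z_t keeps only the N_t-th component
        # of w, so only that component of v_hat is updated, as in (37).
        v_hat[i] += alpha(t) * delta * w[i]
    return v_hat
```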

A sufficient condition for the convergence of the TD($\lambda$) algorithm is given in Ref. [18].

Theorem 3.8

The sequence $\{\hat v_t\}$ converges almost surely to $v$ as $t \to \infty$, provided that
(38) $$\sum_{t=0}^{\infty} \alpha_t^2\, I_{\{N_t = i\}} < \infty \ \text{a.s.}, \quad \forall i \in [n],$$
(39) $$\sum_{t=0}^{\infty} \alpha_t\, I_{\{N_t = i\}} = \infty \ \text{a.s.}, \quad \forall i \in [n].$$

3.3.2. TD-learning with function approximation

In this set-up, we again observe a time series $\{(X_t, R(X_t))\}$. The new feature is that there is a "basis" matrix $\Psi \in \mathbb{R}^{n \times d}$, where $d \ll n$. The estimated value vector at time $t$ is given by $\hat v_t = \Psi \theta_t$, where $\theta_t \in \mathbb{R}^d$ is the parameter vector to be updated. In this representation, it is clear that for any index $i \in [n]$, we have
$$\hat V_{t,i} = \Psi_i \theta_t,$$
where $\Psi_i$ denotes the $i$th row of the matrix $\Psi$.

Now we define the learning rule for updating $\theta_t$. Let $\{X_t\}$ be the observed sample path. By a slight abuse of notation, define $y_t = (\Psi_{X_t})^\top \in \mathbb{R}^d$; thus, if $X_t = x_i$, then $y_t = (\Psi_i)^\top$. The eligibility vector $z_t \in \mathbb{R}^d$ is defined via
(40) $$z_t = \sum_{\tau=0}^{t} (\gamma\lambda)^{t-\tau} y_\tau.$$
Note that $z_t$ satisfies the recursion
$$z_t = \gamma\lambda z_{t-1} + y_t.$$
Hence it is not necessary to keep track of an ever-growing set of past values of $y_\tau$. In contrast to (35), there is no term of the type $I_{\{N_{t-\tau} = N_t\}}$ in (40). Thus, unlike the eligibility vector defined in (35), the current vector $z_t$ can have more than one nonzero component. Next, define the temporal difference $\delta_{t+1}$ as in (33). Note that if $X_t = x_i$ and $X_{t+1} = x_j$, then
$$\delta_{t+1} = r_i + \gamma \Psi_j \theta_t - \Psi_i \theta_t.$$
Then the updating rule is
(41) $$\theta_{t+1} = \theta_t + \alpha_t \delta_{t+1} z_t,$$
where $\alpha_t$ is the step size.
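A corresponding sketch of TD($\lambda$) with linear function approximation, following (40)–(41), might look as follows; again the interface and names are my own invention.

```python
import numpy as np

def td_lambda_fa(path, rewards, Psi, gamma, lam, alpha):
    """TD(lambda) with linear function approximation, following (40)-(41).

    path    : observed state indices X_0, X_1, ... (values in 0..n-1)
    rewards : rewards r_{X_t} observed along the path
    Psi     : (n x d) basis matrix; row i is the feature vector of state x_i
    alpha   : callable t -> step size alpha_t
    """
    n, d = Psi.shape
    theta = np.zeros(d)
    z = np.zeros(d)                              # eligibility vector in R^d
    for t in range(len(path) - 1):
        i, j = path[t], path[t + 1]
        z = gamma * lam * z + Psi[i]             # recursion below (40)
        delta = rewards[t] + gamma * Psi[j] @ theta - Psi[i] @ theta
        theta = theta + alpha(t) * delta * z     # update (41)
    return theta
```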

The convergence analysis of (41) is carried out in detail in Ref. [19], based on the assumption that the state transition matrix $A$ is irreducible. This is quite reasonable, as it ensures that every state $x_i$ occurs infinitely often in any sample path, with probability one. Since that convergence analysis does not readily fit into the methods studied in subsequent sections, we state the main results without proof. However, we state and prove various intermediate results that are useful in their own right.

Suppose $A$ is row-stochastic and irreducible, and let $\mu$ denote its stationary distribution. Define $M = \mathrm{Diag}(\mu_i)$, and define a norm $\|\cdot\|_M$ on $\mathbb{R}^n$ by
$$\|v\|_M = (v^\top M v)^{1/2}.$$
Then the corresponding distance between two vectors $v_1$ and $v_2$ is given by
$$\|v_1 - v_2\|_M = \big((v_1 - v_2)^\top M (v_1 - v_2)\big)^{1/2}.$$
Then the following result is proved in Ref. [19].

Lemma 3.9

Suppose $A \in [0,1]^{n \times n}$ is row-stochastic and irreducible. Let $\mu$ be the stationary distribution of $A$. Then
$$\|Av\|_M \le \|v\|_M \quad \forall v \in \mathbb{R}^n.$$
Consequently, the map $v \mapsto r + \gamma A v$ is a contraction with respect to $\|\cdot\|_M$.

Proof.

We will show that
$$\|Av\|_M^2 \le \|v\|_M^2 \quad \forall v \in \mathbb{R}^n,$$
which is clearly equivalent to $\|Av\|_M \le \|v\|_M$. Now
$$\|Av\|_M^2 = \sum_{i=1}^{n} \mu_i (Av)_i^2 = \sum_{i=1}^{n} \mu_i \Big(\sum_{j=1}^{n} a_{ij} v_j\Big)^2.$$
However, for each fixed index $i$, the row $A_i$ is a probability distribution, and the function $f(y) = y^2$ is convex. If we apply Jensen's inequality with $f(y) = y^2$, we see that
$$\Big(\sum_{j=1}^{n} a_{ij} v_j\Big)^2 \le \sum_{j=1}^{n} a_{ij} v_j^2 \quad \forall i.$$
Therefore,
$$\|Av\|_M^2 \le \sum_{i=1}^{n} \mu_i \Big(\sum_{j=1}^{n} a_{ij} v_j^2\Big) = \sum_{j=1}^{n} \Big(\sum_{i=1}^{n} \mu_i a_{ij}\Big) v_j^2 = \sum_{j=1}^{n} \mu_j v_j^2 = \|v\|_M^2,$$
where in the last step we use the fact that $\mu^\top A = \mu^\top$.

To analyse the behaviour of the TD($\lambda$) algorithm with function approximation, the following map $T^\lambda : \mathbb{R}^n \to \mathbb{R}^n$ is defined in Ref. [19]:
$$[T^\lambda v]_i := (1 - \lambda) \sum_{l=0}^{\infty} \lambda^l\, E\Big[\sum_{\tau=0}^{l} \gamma^\tau R(X_\tau) + \gamma^{l+1} V(X_{l+1}) \,\Big|\, X_0 = x_i\Big].$$
Note that $T^\lambda v$ can be written explicitly as
$$T^\lambda v = (1 - \lambda) \sum_{l=0}^{\infty} \lambda^l \Big[\sum_{\tau=0}^{l} \gamma^\tau A^\tau r + \gamma^{l+1} A^{l+1} v\Big].$$

Lemma 3.10

The map $T^\lambda$ is a contraction with respect to $\|\cdot\|_M$, with contraction constant $\gamma(1-\lambda)/(1-\gamma\lambda)$.

Proof.

Note that the first term on the right side does not depend on $v$. Therefore,
$$T^\lambda v_1 - T^\lambda v_2 = \gamma(1-\lambda) \sum_{l=0}^{\infty} (\gamma\lambda)^l A^{l+1} (v_1 - v_2).$$
However, it is already known that $\|A(v_1 - v_2)\|_M \le \|v_1 - v_2\|_M$. By repeatedly applying this bound, it follows that
$$\|A^l (v_1 - v_2)\|_M \le \|v_1 - v_2\|_M \quad \forall l.$$
Therefore,
$$\|T^\lambda v_1 - T^\lambda v_2\|_M \le \gamma(1-\lambda) \sum_{l=0}^{\infty} (\gamma\lambda)^l \|v_1 - v_2\|_M = \frac{\gamma(1-\lambda)}{1 - \gamma\lambda} \|v_1 - v_2\|_M.$$
This is the desired bound.

Define a projection $\Pi : \mathbb{R}^n \to \mathbb{R}^n$ by
$$\Pi a := \Psi (\Psi^\top M \Psi)^{-1} \Psi^\top M a.$$
Then
$$\Pi a = \arg\min_{b \in \Psi(\mathbb{R}^d)} \|a - b\|_M.$$
Thus $\Pi$ projects the space $\mathbb{R}^n$ onto the image of the matrix $\Psi$, which is a $d$-dimensional subspace if $\Psi$ has full column rank. In other words, $\Pi a$ is the closest point to $a$ in the subspace $\Psi(\mathbb{R}^d)$.

Next, observe that the projection $\Pi$ is nonexpansive with respect to $\|\cdot\|_M$. As a result, the composite map $\Pi T^\lambda$ is a contraction. Thus there exists a unique $\bar v \in \mathbb{R}^n$ such that
$$\Pi T^\lambda \bar v = \bar v.$$
Moreover, the above equation shows that in fact $\bar v$ belongs to the range of $\Psi$. Thus there exists a $\theta^* \in \mathbb{R}^d$ such that $\bar v = \Psi \theta^*$, and $\theta^*$ is unique if $\Psi$ has full column rank.

The limit behaviour of the TD($\lambda$) algorithm is given by the next theorem, which is a key result from Ref. [19].

Theorem 3.11

Suppose that $\Psi$ has full column rank and that
$$\sum_{t=0}^{\infty} \alpha_t = \infty \quad \text{and} \quad \sum_{t=0}^{\infty} \alpha_t^2 < \infty.$$
Then the sequence $\{\theta_t\}$ converges almost surely to $\theta^* \in \mathbb{R}^d$, where $\theta^*$ is the unique solution of
$$\Pi T^\lambda (\Psi \theta^*) = \Psi \theta^*.$$
Moreover,
$$\|\Psi \theta^* - v\|_M \le \frac{1 - \gamma\lambda}{1 - \gamma}\, \|\Pi v - v\|_M.$$

Note that since $\Psi\theta$ belongs to the range of $\Psi$ for every $\theta \in \mathbb{R}^d$, the best that one can hope for is that
$$\|\Psi\theta^* - v\|_M = \|\Pi v - v\|_M.$$
The theorem states that this identity might not hold, and provides an upper bound on the distance between the limit $\Psi\theta^*$ and the true value vector $v$: it is bounded by a factor $(1-\gamma\lambda)/(1-\gamma)$ times this minimum.

Note that $(1-\gamma\lambda)/(1-\gamma) > 1$. So this is the extent to which the TD($\lambda$) iterations can miss the optimal approximation.

3.3.3. Q-learning

The Q-learning algorithm proposed in Ref. [16] has the characterization (27) of $Q^*$ as its starting point. The algorithm is based on the following premise: at time $t$, the current state $X_t$ can be observed; call it $x_i \in X$. Then the learner is free to choose the action $U_t$; call it $u_k \in U$. With this choice, the next state $X_{t+1}$ has probability distribution equal to the $i$th row of the state transition matrix $A^{u_k}$. Suppose the observed next state $X_{t+1}$ is $x_j \in X$. With these conventions, the Q-learning algorithm proceeds as follows.

  1. Choose an arbitrary initial guess $Q_0 : X \times U \to \mathbb{R}$ and an initial state $X_0 \in X$.

  2. At time $t$, with current state $X_t = x_i$, choose a current action $U_t = u_k \in U$, and let the Markov process run for one time step. Observe the resulting next state $X_{t+1} = x_j$. Then update the function $Q_t$ as follows (see the sketch after this list):
(42) $$Q_{t+1}(x_i, u_k) = Q_t(x_i, u_k) + \alpha_t \big[R(x_i, u_k) + \gamma V_t(x_j) - Q_t(x_i, u_k)\big], \qquad Q_{t+1}(x_s, w_l) = Q_t(x_s, w_l) \ \forall (x_s, w_l) \ne (x_i, u_k),$$
where
(43) $$V_t(x_j) = \max_{w_l \in U} Q_t(x_j, w_l),$$
and $\{\alpha_t\}$ is a deterministic sequence of step sizes.

  3. Repeat.
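Here is a minimal sketch of the above algorithm. The exploration rule choose_action is left abstract, since (as noted below) the algorithm itself gives no guidance on how to choose $U_t$; all names and interfaces are assumptions of mine.

```python
import numpy as np

def q_learning(env_step, reward, choose_action, n, m, gamma, alpha, T):
    """Tabular Q-learning following (42)-(43).

    env_step      : (i, k) -> j, samples the next state from row i of A^{u_k}
    reward        : (i, k) -> R(x_i, u_k)
    choose_action : (Q, i, t) -> k, the exploration rule (left open by the text)
    alpha         : callable t -> step size alpha_t
    """
    Q = np.zeros((n, m))
    i = 0                                            # initial state X_0
    for t in range(T):
        k = choose_action(Q, i, t)                   # choose U_t
        j = env_step(i, k)                           # observe X_{t+1}
        target = reward(i, k) + gamma * Q[j].max()   # R + gamma * V_t(x_j), cf. (43)
        Q[i, k] += alpha(t) * (target - Q[i, k])     # (42); all other entries unchanged
        i = j
    return Q
```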

It is evident that in the Q-learning algorithm, at any instant of time $t$, only one element (namely $Q(X_t, U_t)$) gets updated. In the original paper by Watkins and Dayan [16], the convergence of the algorithm was established using some rather ad hoc methods. Subsequently, a general class of algorithms known as "asynchronous stochastic approximation (ASA)," which includes Q-learning as a special case, was introduced in Refs. [18,20]. A sufficient condition for the convergence of the Q-learning algorithm, originally presented in Ref. [16], is rederived using these methods.

Theorem 3.12

The Q-learning algorithm converges to the optimal action-value function $Q^*$ provided the following conditions are satisfied:
(44) $$\sum_{t=0}^{\infty} \alpha_t\, I_{\{(X_t, U_t) = (x_i, u_k)\}} = \infty \quad \forall (x_i, u_k) \in X \times U,$$
(45) $$\sum_{t=0}^{\infty} \alpha_t^2\, I_{\{(X_t, U_t) = (x_i, u_k)\}} < \infty \quad \forall (x_i, u_k) \in X \times U.$$

The main shortcoming of Theorems 3.8 and 3.12 is that the sufficient conditions (38), (39), (44), and (45) are probabilistic in nature. Thus it is not clear how they are to be verified in a specific application. Note that in the Q-learning algorithm, there is no guidance on how to choose the next action $U_t$. Presumably $U_t$ is chosen so as to ensure that (44) and (45) are satisfied. In Section 5, we show how these theorems can be proven, and also how the troublesome probabilistic sufficient conditions can be replaced by purely algebraic conditions.

4. SA algorithms

4.1. SA and relevance to RL

The contents of the previous section make it clear that in MDP theory, a central role is played by the need to solve fixed-point problems. Determining the value of a Markov reward process requires the solution of (6). Determining the optimal value of an MDP requires finding the fixed point of the Bellman iteration. Finally, determining the optimal policy for an MDP requires finding the fixed point of the F-iteration. As pointed out in Section 3.3, when the dynamics of an MDP are completely known, these fixed-point problems can be solved by repeatedly applying the corresponding contraction mapping. However, when the dynamics of the MDP are not known, and one has access only to a sample path of the MDP, a different approach is required. In Section 3.3, we presented two such methods, namely the temporal difference algorithm for value determination and the Q-learning algorithm for determining the optimal action-value function. Theorems 3.8 and 3.12, respectively, give sufficient conditions for the convergence of these algorithms. The proofs of these theorems, as given in the original papers, tend to be "one-off," that is, tailored to the specific algorithm. It is now shown that a probabilistic method known as SA can be used to unify these methods in a common format, so that, instead of the convergence proofs being "one-off," the SA framework provides a unifying approach.

The applications of SA go beyond these two specific algorithms. There is another area called "deep RL" for problems in which the size of the state space is very large. Recall that the action-value function $Q : X \times U \to \mathbb{R}$ can be viewed either as an $nm$-dimensional vector or as an $n \times m$ matrix. In deep RL, one determines (either exactly or approximately) the action-value function $Q(x_i, u_k)$ for a small number of pairs $(x_i, u_k) \in X \times U$. Using these as a starting point, the overall function $Q$ defined for all pairs $(x_i, u_k) \in X \times U$ is obtained by training a deep neural network. Training a neural network (in this or any other application) requires the minimization of the average mean-squared error, denoted by $J(\theta)$, where $\theta$ denotes the vector of adjustable parameters. In general, the function $J(\cdot)$ is not convex; hence one can at best aspire to find a stationary point of $J(\cdot)$, i.e. a solution to the equation $\nabla J(\theta) = 0$. This problem is also amenable to the application of the SA approach.

Now we give a brief introduction to SA. Suppose $f : \mathbb{R}^d \to \mathbb{R}^d$ is some function, where $d$ can be any integer. The objective of SA is to find a solution to the equation $f(\theta) = 0$ when only noisy measurements of $f(\cdot)$ are available. The SA method was introduced in Ref. [21], where the objective was to find a solution to a scalar equation $f(\theta) = 0$, where $f : \mathbb{R} \to \mathbb{R}$. The extension to the case where $d > 1$ was first proposed in Ref. [22]. The problem of finding a fixed point of a map $g : \mathbb{R}^d \to \mathbb{R}^d$ can be formulated as the above problem with $f(\theta) := g(\theta) - \theta$. If it is desired to find a stationary point of a $C^1$ function $J : \mathbb{R}^d \to \mathbb{R}$, then we simply set $f(\theta) = -\nabla J(\theta)$. Thus the above problem formulation is quite versatile. More details are given at the start of Section 4.2.

SA is a family of iterative algorithms, in which one begins with an initial guess $\theta_0$ and derives the next guess $\theta_{t+1}$ from $\theta_t$. Several variants of SA are possible. In synchronous SA, every component of $\theta_t$ is changed to obtain $\theta_{t+1}$. This was the original form of SA. If, at any time $t$, only one component of $\theta_t$ is changed to obtain $\theta_{t+1}$, and the others remain unchanged, this is known as ASA. This phrase was apparently first introduced in Ref. [20]. A variant of the approach in Ref. [20] is presented in Ref. [23]. Specifically, in Ref. [23], a distinction is introduced between using a "local clock" and using a "global clock." It is also possible to study an intermediate situation where, at each time $t$, some but not necessarily all components of $\theta_t$ are updated. There does not appear to be a common name for this situation; the phrase batch asynchronous stochastic approximation (BASA) is introduced in Ref. [24]. More details about these variations are given below. There is a fourth variant, known as two-time-scale SA, introduced in Ref. [25]. In this set-up, one attempts to solve two coupled equations of the form
$$f(\theta, \phi) = 0, \quad g(\theta, \phi) = 0,$$
where $\theta \in \mathbb{R}^n$, $\phi \in \mathbb{R}^m$, and $f : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$, $g : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^m$. The idea is that one of the iterations (say $\theta_{t+1}$) is updated "more slowly" than the other (say $\phi_{t+1}$). Due to space limitations, two-time-scale SA is not discussed further in this paper. The interested reader is referred to Refs. [25–27] for the theory and to Refs. [28,29] for applications to a specific type of RL, known as actor-critic algorithms.

The relevance of SA to RL arises from the following factors:

  • Many (though not all) algorithms used in RL can be formulated as some type of SA algorithm.

  • Examples include temporal difference learning, temporal difference learning with function approximation, Q-learning, deep neural network learning, and actor-critic learning. The first three are discussed in detail in Section 5.

Thus, SA provides a unifying framework for several disparate-looking RL algorithms.

4.2. Problem formulation

There are several equivalent formulations of the basic SA problem.

  1. Finding a zero of a function: Suppose $f : \mathbb{R}^d \to \mathbb{R}^d$ is some function. Note that $f(\cdot)$ need not be available in closed form. The only thing needed is that, given any $\theta \in \mathbb{R}^d$, an "oracle" returns a noise-corrupted version of $f(\theta)$. The objective is to determine a solution of the equation $f(\theta) = 0$.

  2. Finding a fixed point of a mapping: Suppose $g : \mathbb{R}^d \to \mathbb{R}^d$. The objective is to find a fixed point of $g(\cdot)$, that is, a solution to $g(\theta) = \theta$. If we define $f(\theta) = g(\theta) - \theta$, this is the same problem as the above. One might ask: why not define $f(\theta) = \theta - g(\theta)$? As we shall see below, the convergence of the SA algorithm (in its various forms) is closely related to the global asymptotic stability of the ODE $\dot\theta = f(\theta)$. Also, as seen in the previous section, in many applications the map $g(\cdot)$ of which we wish to find a fixed point is a contraction, in which case there is a unique fixed point $\theta^*$ of $g(\cdot)$. In such a case, under relatively mild conditions, $\theta^*$ is a globally asymptotically stable equilibrium of the ODE $\dot\theta = g(\theta) - \theta$, but not if the sign is reversed.

  3. Finding a stationary point of a function: Suppose $J : \mathbb{R}^d \to \mathbb{R}$ is a $C^1$ function. The objective is to find a stationary point of $J(\cdot)$, that is, a $\theta$ such that $\nabla J(\theta) = 0$. If we define $f(\theta) = -\nabla J(\theta)$, then this is the same problem as above. Here again, if we wish the SA algorithm to converge to a global minimum of $J(\cdot)$, then the minus sign is essential. On the other hand, if we wish the SA algorithm to converge to a global maximum of $J(\cdot)$, then we remove the minus sign.

Suppose the problem is one of finding a zero of a given function $f(\cdot)$. The synchronous version of SA proceeds as follows. An initial guess $\theta_0 \in \mathbb{R}^d$ is chosen (usually in a deterministic manner, but it can also be randomly chosen). At time $t$, the available measurement is
(46) $$y_{t+1} = f(\theta_t) + \xi_{t+1},$$
where $\xi_{t+1}$ is the measurement noise. Based on this, the current guess is updated to
(47) $$\theta_{t+1} = \theta_t + \alpha_t y_{t+1} = \theta_t + \alpha_t [f(\theta_t) + \xi_{t+1}],$$
where $\{\alpha_t\}$ is a predefined sequence of "step sizes," with $\alpha_t \in (0,1)$ for all $t$. If the problem is that of finding a fixed point of $g(\cdot)$, the updating rule is
(48) $$\theta_{t+1} = \theta_t + \alpha_t [g(\theta_t) - \theta_t + \xi_{t+1}] = (1 - \alpha_t)\theta_t + \alpha_t [g(\theta_t) + \xi_{t+1}].$$
If the problem is to find a stationary point of $J(\cdot)$, the updating rule is
(49) $$\theta_{t+1} = \theta_t + \alpha_t y_{t+1} = \theta_t + \alpha_t [-\nabla J(\theta_t) + \xi_{t+1}].$$
These updating rules represent what might be called synchronous SA, because at each time $t$, every component of $\theta_t$ is updated. Other variants of SA are studied in subsequent sections.
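As a toy illustration (my own, not from the paper), the synchronous rule (47) can be written in a few lines; the example finds the fixed point of a scalar contraction via (48), with step sizes $\alpha_t = 1/(t+1)$ and an oracle that adds zero-mean Gaussian noise.

```python
import numpy as np

def robbins_monro(noisy_f, theta0, T, alpha=lambda t: 1.0 / (t + 1)):
    """Synchronous SA, Equation (47): theta_{t+1} = theta_t + alpha_t * y_{t+1},
    where y_{t+1} is a noisy measurement of f(theta_t)."""
    theta = np.asarray(theta0, dtype=float)
    for t in range(T):
        y = noisy_f(theta)              # oracle returns f(theta_t) + xi_{t+1}
        theta = theta + alpha(t) * y
    return theta

# Fixed-point problem as in (48): f(theta) = g(theta) - theta for a contraction g.
rng = np.random.default_rng(0)
g = lambda th: 0.5 * th + 1.0                                   # fixed point at theta* = 2
noisy_f = lambda th: g(th) - th + 0.1 * rng.standard_normal(th.shape)
print(robbins_monro(noisy_f, np.zeros(1), T=20000))             # approaches [2.]
```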

4.3. A new theorem for global asymptotic stability

In this section, we state a new theorem on the global asymptotic stability of nonlinear ODEs. This theorem is of interest in its own right, aside from its applications to the convergence of SA algorithms. The contents of this section and the next are taken from Ref. [30]. To state the result (Theorem 4.3), we introduce a few preliminary concepts from Lyapunov stability theory. The required background can be found in Refs. [31–33].

Definition 4.1

A function $\phi : \mathbb{R}_+ \to \mathbb{R}_+$ is said to belong to class $\mathcal{K}$, denoted by $\phi \in \mathcal{K}$, if $\phi(0) = 0$ and $\phi(\cdot)$ is strictly increasing. A function $\phi \in \mathcal{K}$ is said to belong to class $\mathcal{KR}$, denoted by $\phi \in \mathcal{KR}$, if in addition $\phi(r) \to \infty$ as $r \to \infty$. A function $\phi : \mathbb{R}_+ \to \mathbb{R}_+$ is said to belong to class $\mathcal{B}$, denoted by $\phi \in \mathcal{B}$, if $\phi(0) = 0$ and, in addition, for all $0 < \epsilon < M < \infty$, we have
(50) $$\inf_{\epsilon \le r \le M} \phi(r) > 0.$$

The concepts of functions of class $\mathcal{K}$ and class $\mathcal{KR}$ are standard. The concept of a function of class $\mathcal{B}$ is new. Note that if $\phi(\cdot)$ is continuous, then it belongs to class $\mathcal{B}$ if and only if $\phi(0) = 0$ and $\phi(r) > 0$ for all $r > 0$.

Example 4.2

Observe that every $\phi$ of class $\mathcal{K}$ also belongs to class $\mathcal{B}$. However, the converse is not true. Define
$$\phi(r) = \begin{cases} r, & r \in [0,1], \\ e^{-(r-1)}, & r > 1. \end{cases}$$
Then $\phi$ belongs to class $\mathcal{B}$. However, since $\phi(r) \to 0$ as $r \to \infty$, $\phi$ cannot be bounded below by any function of class $\mathcal{K}$.

Suppose we wish to find a solution of $f(\theta) = 0$. The convergence analysis of synchronous SA depends on the stability of an associated ODE $\dot\theta = f(\theta)$. We now state a new theorem on global asymptotic stability, and then use this to establish the convergence of the synchronous SA algorithm. In order to state this theorem, we first introduce some standing assumptions on $f(\cdot)$. Note that these assumptions are standard in the literature.

(F1)

The equation $f(\theta) = 0$ has a unique solution $\theta^*$.

(F2)

The function $f$ is globally Lipschitz-continuous with constant $L$:
(51) $$\|f(\theta) - f(\phi)\|_2 \le L \|\theta - \phi\|_2 \quad \forall \theta, \phi \in \mathbb{R}^d.$$

Theorem 4.3

Suppose Assumption (F1) holds, and that there exist a function $V : \mathbb{R}^d \to \mathbb{R}_+$ and functions $\eta, \psi \in \mathcal{KR}$, $\phi \in \mathcal{B}$ such that
(52) $$\eta(\|\theta - \theta^*\|_2) \le V(\theta) \le \psi(\|\theta - \theta^*\|_2) \quad \forall \theta \in \mathbb{R}^d,$$
(53) $$\dot V(\theta) \le -\phi(\|\theta - \theta^*\|_2) \quad \forall \theta \in \mathbb{R}^d.$$
Then $\theta^*$ is a globally asymptotically stable equilibrium of the ODE $\dot\theta = f(\theta)$.

This is Ref. [30, Theorem 4], and the proof can be found therein. Well-known classical theorems for global asymptotic stability, such as those found in Refs. [31–33], require the function $\phi(\cdot)$ to belong to class $\mathcal{K}$. Theorem 4.3 is an improvement, in that the function $\phi(\cdot)$ is required only to belong to the larger class $\mathcal{B}$.

4.4. A convergence theorem for synchronous SA

In this subsection, we present a convergence theorem for synchronous SA. Theorem 4.4 is slightly more general than a corresponding result in Ref. [30]. This theorem is obtained by combining some results from Refs. [24,30]. Other convergence theorems and examples can be found in Ref. [30].

In order to analyse the convergence of the SA algorithm, we need to make some assumptions about the nature of the measurement error sequence $\{\xi_t\}$. These assumptions are couched in terms of the conditional expectation of a random variable with respect to a $\sigma$-algebra. Readers who are unfamiliar with the concept are referred to Ref. [34] for the relevant background.

Let $\theta_0^t$ denote the tuple $(\theta_0, \theta_1, \ldots, \theta_t)$, and define $\xi_1^t$ analogously; note that there is no $\xi_0$. Let $\{\mathcal{F}_t\}_{t \ge 0}$ be any filtration (i.e. increasing sequence of $\sigma$-algebras) such that $\theta_0^t, \xi_1^t$ are measurable with respect to $\mathcal{F}_t$. For example, one can choose $\mathcal{F}_t$ to be the $\sigma$-algebra generated by the tuples $\theta_0^t, \xi_1^t$.

(N1)

There exists a sequence $\{b_t\}$ of nonnegative numbers such that
(54) $$\|E(\xi_{t+1} \mid \mathcal{F}_t)\|_2 \le b_t \ \text{a.s.}, \quad \forall t \ge 0.$$
Thus $b_t$ provides a bound on the Euclidean norm of the conditional expectation of the measurement error with respect to the $\sigma$-algebra $\mathcal{F}_t$.

(N2)

There exists a sequence $\{\sigma_t\}$ of nonnegative numbers such that
(55) $$E\big(\|\xi_{t+1} - E(\xi_{t+1} \mid \mathcal{F}_t)\|_2^2 \,\big|\, \mathcal{F}_t\big) \le \sigma_t^2 (1 + \|\theta_t\|_2^2) \ \text{a.s.}, \quad \forall t \ge 0.$$

Note that the quantity on the left side of (55) is the conditional variance of $\xi_{t+1}$ with respect to the $\sigma$-algebra $\mathcal{F}_t$.

Now we can state a theorem about the convergence of synchronous SA.

Theorem 4.4

Suppose $f(\theta^*) = 0$, and that Assumptions (F1)–(F2) and (N1)–(N2) hold. Suppose in addition that there exists a $C^2$ Lyapunov function $V : \mathbb{R}^d \to \mathbb{R}_+$ that satisfies the following conditions:

  • There exist constants $a, b > 0$ such that
(56) $$a\|\theta - \theta^*\|_2^2 \le V(\theta) \le b\|\theta - \theta^*\|_2^2 \quad \forall \theta \in \mathbb{R}^d.$$

  • There is a finite constant $M$ such that
(57) $$\|\nabla^2 V(\theta)\|_S \le 2M \quad \forall \theta \in \mathbb{R}^d,$$
where $\|\cdot\|_S$ denotes the spectral norm.

With this hypothesis, we can state the following conclusions:
(1)

If $\dot V(\theta) \le 0$ for all $\theta \in \mathbb{R}^d$, and if
(58) $$\sum_{t=0}^{\infty} \alpha_t^2 < \infty, \quad \sum_{t=0}^{\infty} \alpha_t b_t < \infty, \quad \sum_{t=0}^{\infty} \alpha_t^2 \sigma_t^2 < \infty,$$
then the iterations $\{\theta_t\}$ are bounded almost surely.

(2)

Suppose further that there exists a function $\phi \in \mathcal{B}$ such that
(59) $$\dot V(\theta) \le -\phi(\|\theta - \theta^*\|_2) \quad \forall \theta \in \mathbb{R}^d,$$
and that, in addition to (58), we also have
(60) $$\sum_{t=0}^{\infty} \alpha_t = \infty.$$
Then $\theta_t \to \theta^*$ almost surely as $t \to \infty$.

Observe the nice "division of labour" between the two conditions: Equation (58) guarantees the almost sure boundedness of the iterations, while the addition of (60) leads to the almost sure convergence of the iterations to the desired limit, namely the solution of $f(\theta) = 0$. This division of labour is first found in Ref. [35]. Theorem 4.4 is a substantial improvement on Ref. [36], which contained the previously best results. The interested reader is referred to Ref. [30] for further details.

Theorem 4.4 is a slight generalization of Ref. [30, Theorem 5]. In that theorem, it is assumed that $b_t = 0$ for all $t$, and that the constants $\sigma_t$ are uniformly bounded by some constant $\sigma$. In this case, (58) and (60) become
(61) $$\sum_{t=0}^{\infty} \alpha_t^2 < \infty, \quad \sum_{t=0}^{\infty} \alpha_t = \infty.$$
These two conditions are usually referred to as the Robbins–Monro conditions.

4.5. Convergence of BASA

Equations (47) through (49) represent what might be called synchronous SA, because at each time $t$, every component of $\theta_t$ is updated. Variants of synchronous SA include ASA, where at each time $t$ exactly one component of $\theta_t$ is updated, and BASA, where at each time $t$ some but not necessarily all components of $\theta_t$ are updated. We present the results for BASA, because ASA is a special case of BASA. Moreover, we focus on (48), where the objective is to find a fixed point of a contractive map $g$. The modifications required for (47) and (49) are straightforward.

The relevant reference for these results is Ref. [24]. As a slight modification of (46), it is assumed that at each time $t+1$, there is available a noisy measurement
(62) $$y_{t+1} = g(\theta_t) - \theta_t + \xi_{t+1}.$$
We assume that there is a given deterministic sequence of "step sizes" $\{\beta_t\}$. In BASA, not every component of $\theta_t$ is updated at time $t$. To determine which components are to be updated, we define $d$ different binary "update processes" $\{\kappa_{t,i}\}$, $i \in [d]$. No assumptions are made regarding their independence. At time $t$, define
(63) $$S(t) := \{i \in [d] : \kappa_{t,i} = 1\}.$$
This means that
(64) $$\theta_{t+1, i} = \theta_{t, i} \quad \forall i \notin S(t).$$
In order to define $\theta_{t+1,i}$ when $i \in S(t)$, we make a distinction between two different approaches: global clocks and local clocks. If a global clock is used, then
(65) $$\alpha_{t,i} = \beta_t \ \forall i \in S(t), \qquad \alpha_{t,i} = 0 \ \forall i \notin S(t).$$
If a local clock is used, then we first define the local counter
(66) $$\nu_{t,i} = \sum_{\tau=0}^{t} \kappa_{\tau,i}, \quad i \in [d],$$
which is the total number of occasions when $i \in S(\tau)$, $0 \le \tau \le t$. Equivalently, $\nu_{t,i}$ is the total number of times up to and including time $t$ when $\theta_{\tau,i}$ is updated. With this convention, we define
(67) $$\alpha_{t,i} = \beta_{\nu_{t,i}} \ \forall i \in S(t), \qquad \alpha_{t,i} = 0 \ \forall i \notin S(t).$$
The distinction between global clocks and local clocks was apparently introduced in Ref. [23]. Traditional RL algorithms such as TD($\lambda$) and Q-learning, discussed in detail in Section 3.3 and again in Sections 5.2 and 5.3, use a global clock. That is not surprising, because Ref. [23] came after Refs. [16,17]. It is shown in Ref. [24] that the use of local clocks actually simplifies the analysis of these algorithms.

Now we present the BASA updating rules. Let us define the “step size vector” $\alpha_t \in \mathbb{R}_+^d$ via (65) or (67) as appropriate. Then the update rule is (68) $\theta_{t+1} = \theta_t + \alpha_t \circ y_{t+1}$, where $y_{t+1}$ is defined in (62). Here, the symbol $\circ$ denotes the Hadamard product of two vectors of equal dimensions. Thus if a, b have the same dimensions, then $c = a \circ b$ is defined by $c_i = a_i b_i$ for all i.
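As a sanity check on the recursion (68), the following minimal sketch runs BASA with a local clock on a hand-made affine contraction $g(\theta) = A\theta + b$; the particular g, the noise model, and the step sizes are illustrative assumptions, not taken from Ref. [Citation24].

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
A = rng.standard_normal((d, d))
A *= 0.5 / np.abs(A).sum(axis=1).max()          # scale so the infinity-norm of A is 0.5 < 1
b = rng.standard_normal(d)
g = lambda th: A @ th + b                        # a contraction in the infinity-norm
theta_star = np.linalg.solve(np.eye(d) - A, b)   # its unique fixed point

theta = np.zeros(d)
nu = np.zeros(d, dtype=int)                      # local counters nu_{t,i}
for t in range(200_000):
    kappa = rng.integers(0, 2, size=d)           # which components lie in S(t)
    if not kappa.any():
        continue
    xi = 0.1 * rng.standard_normal(d)            # zero-mean measurement noise
    y = g(theta) - theta + xi                    # noisy measurement, Eq. (62)
    nu += kappa                                  # advance the local clocks
    beta = 1.0 / np.maximum(nu, 1)               # beta_k = 1/k evaluated at the local counters
    alpha = kappa * beta                         # step-size vector, Eq. (67)
    theta = theta + alpha * y                    # Hadamard-product update, Eq. (68)

print(np.max(np.abs(theta - theta_star)))        # should be small: the iterate approaches the fixed point
```

Here the choice $\beta_k = 1/k$ satisfies (S1) and (S2) along each component, since the noise has zero conditional mean and bounded conditional variance.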

Recall that we are given a function $g: \mathbb{R}^d \to \mathbb{R}^d$, and the objective is to find a solution to the fixed point equation $g(\theta) = \theta$. Towards this end, we begin by stating the assumptions about the noise sequence.

(N1′)

There exists a sequence of constants $\{b_t\}$ such that (69) $E(\|\xi_{t+1}\|_2 \mid \mathcal{F}_t) \leq b_t (1 + \|\theta_0^t\|) \quad \forall t \geq 0$.

(N2′)

There exists a sequence of constants $\{\sigma_t\}$ such that (70) $E(\|\xi_{t+1} - E(\xi_{t+1} \mid \mathcal{F}_t)\|_2^2 \mid \mathcal{F}_t) \leq \sigma_t^2 (1 + \|\theta_0^t\|^2) \quad \forall t \geq 0$.

Comparing (54) and (55) with (69) and (70), respectively, we see that the term $\|\theta_t\|_2^2$ is replaced by the corresponding norm of the history $\theta_0^t$. So the constants $b_t$ and $\sigma_t$ can be different in the two cases. But because the two formulations are quite similar, we denote the first set of conditions as (N1) and (N2), and the second set of conditions as (N1′) and (N2′).

Next, we state the assumptions on the step size sequence, which allow us to state the theorems in a compact manner. Note that if a local clock is used, then $\alpha_{t,i}$ can be random even if $\beta_t$ is deterministic.

(S1)

The random step size sequences $\{\alpha_{t,i}\}$ and the sequences $\{b_t\}$, $\{\sigma_t^2\}$ satisfy (71) $\sum_{t=0}^{\infty} \alpha_{t,i}^2 < \infty, \quad \sum_{t=0}^{\infty} \sigma_t^2 \alpha_{t,i}^2 < \infty, \quad \sum_{t=0}^{\infty} b_t \alpha_{t,i} < \infty$, a.s., $\forall i \in [d]$.

(S2)

The random step size sequence $\{\alpha_{t,i}\}$ satisfies (72) $\sum_{t=0}^{\infty} \alpha_{t,i} = \infty$, a.s., $\forall i \in [d]$.

Finally, we state an assumption about the map g.

(G)

g is a contraction with respect to the $\ell_\infty$-norm, with some contraction constant $\gamma < 1$.

Theorem 4.5

Suppose that Assumptions (N1′) and (N2′) about the noise sequence, (S1) about the step size sequence, and (G) about the function g hold. Then $\sup_t \|\theta_t\| < \infty$ almost surely.

Theorem 4.6

Let $\theta^*$ denote the unique fixed point of g. Suppose that Assumptions (N1′) and (N2′) about the noise sequence, (S1) and (S2) about the step size sequence, and (G) about the function g hold. Then $\theta_t$ converges almost surely to $\theta^*$ as $t \to \infty$.

The proofs of these theorems can be found in Ref. [Citation24].

5. Applications to RL

In this section, we apply the contents of the previous section to derive sufficient conditions for the convergence of two distinct RL algorithms, namely temporal difference learning (without function approximation) and Q-learning. Previously known results are stated in Section 3.3. So why re-analyse those algorithms from the standpoint of SA? There are two reasons for doing so. First, the historical TD(λ) and Q-learning algorithms are stated using a “global clock” as defined in Section 4.5; the concept of a “local clock” was introduced subsequently in Ref. [Citation23]. In Ref. [Citation24], the authors build upon this distinction to achieve two objectives. First, when a local clock is used, fewer assumptions are required. Second, by invoking a result on the sample paths of an irreducible Markov process (proved in Ref. [Citation24] and restated as Theorem 5.1 below), probabilistic conditions such as (38)–(39) and (44)–(45) are replaced by purely algebraic conditions.

5.1. A useful theorem about irreducible Markov processes

Theorem 5.1

Suppose $\{N(t)\}$ is a Markov process on [d] with a state transition matrix A that is irreducible. Suppose $\{\beta_t\}_{t \geq 0}$ is a sequence of real numbers in (0,1) such that $\beta_{t+1} \leq \beta_t$ for all t, and (73) $\sum_{t=0}^{\infty} \beta_t = \infty$. Then (74) $\sum_{t=0}^{\infty} \beta_t I_{\{N(t)=i\}}(\omega) = \sum_{t=0}^{\infty} \beta_t f_i(N(t)(\omega)) = \infty \quad \forall i \in [d], \ \forall \omega \in \Omega_0$, where I denotes the indicator function and $\Omega_0$ is a set of sample paths of full probability.
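The content of Theorem 5.1 can be visualized with a short simulation; the three-state chain and the step sizes below are illustrative choices, not taken from the text. Along a single sample path of an irreducible chain, the partial sums of $\beta_t$ collected at the visits to each state all grow without bound.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3
# An (illustrative) irreducible state transition matrix on [d].
A = np.array([[0.1, 0.6, 0.3],
              [0.5, 0.2, 0.3],
              [0.3, 0.3, 0.4]])

T = 100_000
beta = 1.0 / np.arange(1, T + 1)     # nonincreasing, with sum beta_t = infinity
partial = np.zeros(d)                # partial sums of beta_t * I{N(t) = i}, one per state
state = 0
for t in range(T):
    partial[state] += beta[t]
    state = rng.choice(d, p=A[state])

print(partial)  # each entry keeps growing with T (roughly like pi_i * log T for this beta)
```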

5.2. TD-learning without function approximation

Recall the TD(λ) algorithm without function approximation, presented in Section 3.3. One observes a time series $\{(X_t, R(X_t))\}$, where $\{X_t\}$ is a Markov process over $\mathcal{X} = \{x_1, \ldots, x_n\}$ with a (possibly unknown) state transition matrix A, and $R: \mathcal{X} \to \mathbb{R}$ is a known reward function. With the sample path $\{X_t\}$ of the Markov process, one can associate a corresponding “index process” $\{N_t\}$ taking values in [n] as follows: $N_t = i$ if $X_t = x_i \in \mathcal{X}$. It is obvious that the index process has the same transition matrix A as the process $\{X_t\}$. The idea is to start with an initial estimate $\hat{v}_0$ and update it at each time t based on the sample path $\{(X_t, R_t)\}$.

We now state the TD(λ) updates. At time t, let $\hat{v}_t \in \mathbb{R}^n$ denote the current estimate of v. Let $\{N_t\}$ be the index process defined above. Define the “temporal difference” (75) $\delta_{t+1} := R_{N_t} + \gamma \hat{V}_{t,N_{t+1}} - \hat{V}_{t,N_t}, \quad t \geq 0$, where $\hat{V}_{t,N_t}$ denotes the $N_t$th component of the vector $\hat{v}_t$. Equivalently, if the state at time t is $x_i \in \mathcal{X}$ and the state at the next time t + 1 is $x_j$, then (76) $\delta_{t+1} = R_i + \gamma \hat{V}_{t,j} - \hat{V}_{t,i}$. Next, choose a number $\lambda \in [0,1)$. Define the “eligibility vector” (77) $z_t = \sum_{\tau=0}^{t} (\gamma\lambda)^{\tau} I_{\{N_{t-\tau} = N_t\}} e_{N_{t-\tau}}$, where $e_{N_s}$ is a unit vector with a 1 in location $N_s$ and zeros elsewhere. Finally, update the estimate $\hat{v}_t$ as (78) $\hat{v}_{t+1} = \hat{v}_t + \delta_{t+1} \alpha_t z_t$, where $\alpha_t$ is the step size chosen in accordance with either a global or a local clock. The distinction between the two is described next.

To complete the problem specification, we need to specify how the step size $\alpha_t$ is chosen in (78). The two possibilities studied here are global clocks and local clocks. If a global clock is used, then $\alpha_t = \beta_t$, whereas if a local clock is used, then $\alpha_t = \beta_{\nu_{t,i}}$, where $\nu_{t,i} = \sum_{\tau=0}^{t} I_{\{z_{\tau,i} \neq 0\}}$. Note that in the traditional implementation of the TD(λ) algorithm suggested in Refs. [Citation17–19], a global clock is used. Moreover, the algorithm is shown to converge provided (79) $\sum_{t=0}^{\infty} \alpha_t^2 < \infty, \quad \sum_{t=0}^{\infty} \alpha_t = \infty$, a.s. As we shall see below, the theorem statements when local clocks are used involve slightly fewer assumptions than when global clocks are used. Moreover, neither involves probabilistic conditions such as (38) and (39), in contrast to Theorem 3.8.
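To fix ideas, the following minimal sketch implements the updates (75)–(78) with a local clock on a small synthetic Markov reward process; the chain, the rewards, the discount factor, and the step sizes $\beta_k = 1/k$ are illustrative assumptions, and the eligibility vector is evaluated directly from the definition (77), truncating its geometrically decaying tail.

```python
import numpy as np

rng = np.random.default_rng(3)
n, gamma, lam = 4, 0.9, 0.5

# Illustrative Markov reward process: irreducible transition matrix A and reward vector R.
A = np.array([[0.20, 0.50, 0.20, 0.10],
              [0.30, 0.10, 0.40, 0.20],
              [0.25, 0.25, 0.25, 0.25],
              [0.10, 0.20, 0.30, 0.40]])
R = np.array([1.0, 0.0, -1.0, 2.0])
v = np.linalg.solve(np.eye(n) - gamma * A, R)   # the true value vector v

v_hat = np.zeros(n)
visits = np.zeros(n, dtype=int)                  # local counters nu_{t,i}
path = []                                        # sample path of the index process {N_t}
state = 0
for t in range(100_000):
    path.append(state)
    nxt = rng.choice(n, p=A[state])
    delta = R[state] + gamma * v_hat[nxt] - v_hat[state]        # temporal difference, Eq. (75)
    z = np.zeros(n)                                             # eligibility vector, Eq. (77)
    for tau in range(min(t, 60) + 1):                           # (gamma*lam)^tau is negligible beyond 60
        if path[t - tau] == state:
            z[state] += (gamma * lam) ** tau
    visits += (z != 0).astype(int)                              # local clock: count times z_{t,i} != 0
    beta = 1.0 / np.maximum(visits, 1)                          # beta_k = 1/k at the local counters
    v_hat = v_hat + delta * beta * z                            # Eq. (78) with a local-clock step size
    state = nxt

print(np.max(np.abs(v_hat - v)))   # shrinks towards 0 as the number of iterations grows
```

If a global clock were used instead, the per-component counters above would simply be replaced by the global time index, as in Theorem 5.3.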

Next we present two theorems regarding the convergence of the TD(λ) algorithm. As the hypotheses are slightly different, they are presented separately. But the proofs are quite similar and can be found in Ref. [Citation24].

Theorem 5.2

Consider the TD(λ) algorithm using a local clock to determine the step size. Suppose that the state transition matrix A is irreducible, and that the deterministic step size sequence $\{\beta_t\}$ satisfies the Robbins–Monro conditions $\sum_{t=0}^{\infty} \beta_t = \infty$, $\sum_{t=0}^{\infty} \beta_t^2 < \infty$. Then $\hat{v}_t \to v$ almost surely as $t \to \infty$.

Theorem 5.3

Consider the TD(λ) algorithm using a global clock to determine the step size. Suppose that the state transition matrix A is irreducible and that the deterministic step size sequence is nonincreasing (i.e. $\beta_{t+1} \leq \beta_t$ for all t) and satisfies the Robbins–Monro conditions as described above. Then $\hat{v}_t \to v$ almost surely as $t \to \infty$.

5.3. Q-learning

The Q-learning algorithm proposed in Ref. [Citation16] is now recalled for the convenience of the reader.

  1. Choose an arbitrary initial guess $Q_0: \mathcal{X} \times \mathcal{U} \to \mathbb{R}$ and an initial state $X_0 \in \mathcal{X}$.

  2. At time t, with current state $X_t = x_i$, choose a current action $U_t = u_k \in \mathcal{U}$, and let the Markov process run for one time step. Observe the resulting next state $X_{t+1} = x_j$. Then update the function $Q_t$ as follows: (80) $Q_{t+1}(x_i, u_k) = Q_t(x_i, u_k) + \beta_t [R(x_i, u_k) + \gamma V_t(x_j) - Q_t(x_i, u_k)]$, and $Q_{t+1}(x_s, w_l) = Q_t(x_s, w_l)$ for all $(x_s, w_l) \neq (x_i, u_k)$, where (81) $V_t(x_j) = \max_{w_l \in \mathcal{U}} Q_t(x_j, w_l)$, and $\{\beta_t\}$ is a deterministic sequence of step sizes.

  3. Repeat.
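For concreteness, a minimal tabular implementation of the update (80)–(81) might look as follows; the randomly generated MDP, the uniformly random choice of $U_t$, and the step sizes $\beta_t = 1/(t+1)$ are illustrative assumptions made only so that the sketch is self-contained.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, gamma = 5, 2, 0.9

# Illustrative MDP: one transition matrix per action (rows drawn from a Dirichlet
# distribution, hence strictly positive and irreducible), and a reward table R(x_i, u_k).
A_u = rng.dirichlet(np.ones(n), size=(m, n))      # A_u[k, i, :] is row i of the matrix A_{u_k}
Rmat = rng.standard_normal((n, m))

# Reference solution Q* computed by value iteration on Q (for comparison only).
Q_star = np.zeros((n, m))
for _ in range(2000):
    Q_star = Rmat + gamma * np.einsum('kij,j->ik', A_u, Q_star.max(axis=1))

Q = np.zeros((n, m))
state = 0
for t in range(200_000):
    beta = 1.0 / (t + 1)                          # Robbins-Monro step sizes (global clock)
    k = rng.integers(m)                           # action choice: uniform at random (illustrative)
    nxt = rng.choice(n, p=A_u[k, state])          # one step of the controlled Markov process
    V_next = Q[nxt].max()                         # Eq. (81)
    Q[state, k] += beta * (Rmat[state, k] + gamma * V_next - Q[state, k])   # Eq. (80)
    state = nxt

print(np.max(np.abs(Q - Q_star)))   # decreases (slowly, for beta_t = 1/(t+1)) as t grows
```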

In earlier work such as Refs. [Citation18,Citation20], it is shown that the Q-learning algorithm converges to the optimal action-value function $Q^*$ provided (82) $\sum_{t=0}^{\infty} \beta_t I_{\{(X_t, U_t) = (x_i, u_k)\}} = \infty \quad \forall (x_i, u_k) \in \mathcal{X} \times \mathcal{U}$, and (83) $\sum_{t=0}^{\infty} \beta_t^2 I_{\{(X_t, U_t) = (x_i, u_k)\}} < \infty \quad \forall (x_i, u_k) \in \mathcal{X} \times \mathcal{U}$. This result was stated earlier as Theorem 3.12. Similar hypotheses are present in all existing results on ASA. Note that in the Q-learning algorithm, there is no guidance on how to choose the next action $U_t$; presumably $U_t$ is chosen so as to ensure that (82) and (83) are satisfied. However, we now demonstrate a way to avoid such conditions using Theorem 5.1. We also introduce batch updating and show that it is possible to use a local clock instead of a global clock.

The batch Q-learning algorithm introduced here is as follows:

  1. Choose an arbitrary initial guess $Q_0: \mathcal{X} \times \mathcal{U} \to \mathbb{R}$, and m initial states $X_0^k \in \mathcal{X}$, $k \in [m]$, in some fashion (deterministic or random). Note that the m initial states need not be distinct.

  2. At time t, for each action index $k \in [m]$, with current state $X_t^k = x_{i_k}$, choose the current action as $U_t = u_k \in \mathcal{U}$, and let the Markov process run for one time step. Observe the resulting next state $X_{t+1}^k = x_{j_k}$. Then update the function $Q_t$ as follows, once for each $k \in [m]$: (84) $Q_{t+1}(x_s, u_k) = Q_t(x_s, u_k) + \alpha_{t,i_k,k} [R(x_{i_k}, u_k) + \gamma V_t(x_{j_k}) - Q_t(x_{i_k}, u_k)]$ if $x_s = x_{i_k}$, and $Q_{t+1}(x_s, u_k) = Q_t(x_s, u_k)$ if $x_s \neq x_{i_k}$, where (85) $V_t(x_{j_k}) = \max_{w_l \in \mathcal{U}} Q_t(x_{j_k}, w_l)$. Here $\alpha_{t,i,k}$ equals $\beta_t$ for all i, k if a global clock is used, and equals (86) $\alpha_{t,i,k} = \beta_{\nu_{t,i,k}}$, where $\nu_{t,i,k} = \sum_{\tau=0}^{t} I_{\{X_\tau^k = x_i\}}$, if a local clock is used.

  3. Repeat.

Remark

Note that m different simulations are being run in parallel, and that in the kth simulation, the next action $U_t$ is always chosen as $u_k$. Hence, at each instant of time t, exactly m components of $Q(\cdot,\cdot)$ (viewed as an $n \times m$ matrix) are updated, namely the $(X_t^k, u_k)$ component, for each $k \in [m]$. In typical MDPs, the size of the action space m is much smaller than the size of the state space n. For example, in the Blackjack problem discussed in Ref. [Citation4, Chapter 4], $n \approx 2^{100}$ while m = 2! Therefore the proposed batch Q-learning algorithm is quite efficient in practice.
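Using the same kind of illustrative random MDP as in the previous sketch, a minimal implementation of the batch updates (84)–(86) with a local clock might look as follows; each of the m parallel chains applies its own fixed action, in line with the remark above.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, gamma = 5, 2, 0.9
A_u = rng.dirichlet(np.ones(n), size=(m, n))     # irreducible matrix A_{u_k} for each action u_k
Rmat = rng.standard_normal((n, m))               # reward table R(x_i, u_k)

Q = np.zeros((n, m))
states = np.zeros(m, dtype=int)                  # X_t^k: current state of the k-th chain
nu = np.zeros((n, m), dtype=int)                 # local counters nu_{t,i,k}
for t in range(200_000):
    for k in range(m):                           # in the k-th chain the action is always u_k
        i = states[k]
        nxt = rng.choice(n, p=A_u[k, i])
        nu[i, k] += 1                            # advance the local clock of the pair (x_i, u_k)
        alpha = 1.0 / nu[i, k]                   # beta_{nu_{t,i,k}} with beta_j = 1/j, Eq. (86)
        V_next = Q[nxt].max()                    # Eq. (85)
        Q[i, k] += alpha * (Rmat[i, k] + gamma * V_next - Q[i, k])   # Eq. (84)
        states[k] = nxt

# For comparison, Q* by value iteration (same construction as in the previous sketch).
Q_star = np.zeros((n, m))
for _ in range(2000):
    Q_star = Rmat + gamma * np.einsum('kij,j->ik', A_u, Q_star.max(axis=1))
print(np.max(np.abs(Q - Q_star)))   # small, and shrinks as the number of iterations grows
```

At every time t exactly m entries of Q are touched, one per chain, and under the hypotheses of Theorem 5.4 the iterates converge almost surely to $Q^*$.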

Now, by fitting this algorithm into the framework of Theorem 5.1, we can prove the following general result. The proof can be found in Ref. [Citation24].

Theorem 5.4

Suppose that each matrix $A_{u_k}$ is irreducible, and that the step size sequence $\{\beta_t\}$ satisfies the Robbins–Monro conditions (61) with $\alpha_t$ replaced by $\beta_t$. With this assumption, we have the following:

(1)

If a local clock is used as in (84), then $Q_t$ converges almost surely to $Q^*$.

(2)

If a global clock is used (i.e. $\alpha_{t,i,k} = \beta_t$ for all t, i, k), and $\{\beta_t\}$ is nonincreasing, then $Q_t$ converges almost surely to $Q^*$.

Remark

Note that, in the statement of the theorem, it is not assumed that every policy π leads to an irreducible Markov process – only that every action leads to an irreducible Markov process. In other words, the assumption is that the m different matrices $A_{u_k}$, $k \in [m]$, correspond to irreducible Markov processes. This is a substantial improvement. It is shown in Ref. [Citation37] that the following problem is NP (Nondeterministic Polynomial)-hard: given an MDP, determine whether every policy π results in a Markov process that is a unichain, that is, consists of a single set of recurrent states with the associated state transition matrix being irreducible, plus possibly some transient states. Our problem is slightly different, because we do not permit any transient states. Nevertheless, this problem is also likely to be very difficult. By not requiring any condition of this sort, and also by dispensing with conditions analogous to (82) and (83), the above theorem statement is more useful.

6. Conclusions and problems for future research

In this brief survey, we have attempted to sketch some of the highlights of RL. Our viewpoint, which is quite mainstream, is to view RL as solving Markov decision problems (MDPs) when the underlying dynamics are unknown. We have used the paradigm of SA as a unifying approach. We have presented convergence theorems for the standard approach, which might be thought of as “synchronous” SA, as well as for variants such as ASA and BASA. Many of these results are due to the author and his collaborators.

In this survey, due to length limitations, we have not discussed actor-critic algorithms. These can be viewed as applications of the policy gradient theorem [Citation38,Citation39] coupled with SA applied to two-time-scale (i.e. singularly perturbed) systems [Citation25,Citation27]. Some other relevant references are [Citation28,Citation29,Citation40]. Also, the rapidly emerging field of finite-time SA has not been discussed. Finite-Time Stochastic Approximation (FTSA) can lead to estimates of the rate of convergence of various RL algorithms, whereas conventional SA leads to only asymptotic results. Some recent relevant papers include Refs. [Citation41,Citation42].

Disclosure statement

No potential conflict of interest was reported by the author.

Additional information

Funding

This work was supported by Science and Engineering Research Board, Government of India.

Notes on contributors

Mathukumalli Vidyasagar

Mathukumalli Vidyasagar was born in Guntur, India on September 29, 1947. He received the B.S., M.S. and Ph.D. degrees in electrical engineering from the University of Wisconsin in Madison, in 1965, 1967 and 1969 respectively. Between 1969 and 1989, he was a Professor of Electrical Engineering at Marquette University, Concordia University, and the University of Waterloo. In 1989 he returned to India as the Director of the newly created Centre for Artificial Intelligence and Robotics (CAIR) in Bangalore under the Ministry of Defence, Government of India. In 2000 he moved to the Indian private sector as an Executive Vice President of India's largest software company, Tata Consultancy Services. In 2009 he retired from TCS and joined the Erik Jonsson School of Engineering & Computer Science at the University of Texas at Dallas, as a Cecil & Ida Green Chair in Systems Biology Science. In January 2015 he received the Jawaharlal Nehru Science Fellowship and divided his time between UT Dallas and the Indian Institute of Technology Hyderabad. In January 2018 he stopped teaching at UT Dallas and now resides full-time in Hyderabad. Since March 2020, he is a SERB National Science Chair. His current research is in the area of Reinforcement Learning, with emphasis on using stochastic approximation theory. More broadly, he is interested in machine learning, systems and control theory, and their applications. Vidyasagar has received a number of awards in recognition of his research contributions, including Fellowship in The Royal Society, the world's oldest scientific academy in continuous existence, the IEEE Control Systems (Technical Field) Award, the Rufus Oldenburger Medal of ASME, the John R. Ragazzini Education Award from AACC, and others. He is the author of thirteen books and more than 160 papers in peer-reviewed journals.

Notes

1 For clarity, we have changed the notation so that the value vector is now denoted by v instead of V as in (6).

References

  • Bertsekas DP, Tsitsiklis JN. Neuro-dynamic programming. Nashua: Athena Scientific; 1996.
  • Puterman ML. Markov decision processes: discrete stochastic dynamic programming. Hoboken: John Wiley; 2005.
  • Szepesvári C. Algorithms for reinforcement learning. San Rafael: Morgan and Claypool; 2010.
  • Sutton RS, Barto AG. Reinforcement learning: an introduction. 2nd ed. Cambridge: MIT Press; 2018.
  • Ljung L. Strong convergence of a stochastic approximation algorithm. Ann Stat. 1978;6:680–696.
  • Kushner HJ, Clark DS. Stochastic approximation methods for constrained and unconstrained systems. Springer-Verlag; 1978 (Applied mathematical sciences).
  • Benveniste A, Metivier M, Priouret P. Adaptive algorithms and stochastic approximation. Springer-Verlag; 1990.
  • Kushner HJ, Yin GG. Stochastic approximation and recursive algorithms and applications. Springer-Verlag; 1997.
  • Benaim M. Dynamics of stochastic approximation algorithms. Springer Verlag; 1999.
  • Borkar VS. Stochastic approximation: a dynamical systems viewpoint. Cambridge University Press; 2008.
  • Borkar VS. Stochastic approximation: a dynamical systems viewpoint. 2nd ed. Hindustan Book Agency; 2022.
  • Spong MW, Hutchinson SR, Vidyasagar M. Robot modeling and control. 2nd ed. John Wiley; 2020.
  • Shannon CE. Programming a computer for playing chess. Philos Mag Ser7. 1950;41(314):1–18.
  • Vidyasagar M. Hidden Markov processes: theory and applications to biology. Princeton University Press; 2014.
  • Arapostathis A, Borkar VS, Fernández-Gaucherand E, et al. Discrete-time controlled Markov processes with average cost criterion: a survey. SIAM J Control Optim. 1993;31(2):282–344.
  • Watkins CJCH, Dayan P. Q-learning. Mach Learn. 1992;8(3–4):279–292.
  • Sutton RS. Learning to predict by the method of temporal differences. Mach Learn. 1988;3(1):9–44.
  • Jaakkola T, Jordan MI, Singh SP. Convergence of stochastic iterative dynamic programming algorithms. Neural Comput. 1994;6(6):1185–1201.
  • Tsitsiklis JN, Roy BV. An analysis of temporal-difference learning with function approximation. IEEE Trans Autom Contr. 1997;42(5):674–690.
  • Tsitsiklis JN. Asynchronous stochastic approximation and q-learning. Mach Learn. 1994;16:185–202.
  • Robbins H, Monro S. A stochastic approximation method. Ann Math Stat. 1951;22(3):400–407.
  • Blum JR. Multivariable stochastic approximation methods. Ann Math Stat. 1954;25(4):737–744.
  • Borkar VS. Asynchronous stochastic approximations. SIAM J Control Optim. 1998;36(3):840–851.
  • Karandikar RL, Vidyasagar M. Convergence of batch asynchronous stochastic approximation with applications to reinforcement learning [Arxiv:2109.03445v2]; 2022.
  • Borkar VS. Stochastic approximation in two time scales. Syst Control Lett. 1997;29(5):291–294.
  • Tadić VB. Almost sure convergence of two time-scale stochastic approximation algorithms. In: Proceedings of the American Control Conference, Boston, Vol. 4; 2004. p. 3802–3807.
  • Lakshminarayanan C, Bhatnagar S. A stability criterion for two timescale stochastic approximation schemes. Automatica. 2017;79:108–114.
  • Konda VR, Borkar VS. Actor-critic learning algorithms for Markov decision processes. SIAM J Control Optim. 1999;38(1):94–123.
  • Konda VR, Tsitsiklis JN. Actor-critic algorithms. In: Neural information processing systems (NIPS1999); 1999. p. 1008–1014.
  • Vidyasagar M. Convergence of stochastic approximation via martingale and converse Lyapunov methods [Arxiv:2205.01303v1]; 2022.
  • Vidyasagar M. Nonlinear systems analysis (SIAM classics series). Soc Ind Appl Math SIAM; 2002;42.
  • Hahn W. Stability of motion. Springer-Verlag; 1967.
  • Khalil HK. Nonlinear systems. 3rd ed. Prentice Hall; 2002.
  • Durrett R. Probability: theory and examples. 5th ed. Cambridge University Press; 2019.
  • Gladyshev EG. On stochastic approximation. Theory Probab Appl. 1965;X(2):275–278.
  • Borkar VS, Meyn SP. The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM J Control Optim. 2000;38:447–469.
  • Tsitsiklis JN. NP-hardness of checking the unichain condition in average cost MDPs. Oper Res Lett. 2007;35:319–323.
  • Sutton RS, McAllester D, Singh S, et al. Policy gradient methods for reinforcement learning with function approximation. In: Advances in neural information processing systems 12 (proceedings of the 1999 conference), Denver: MIT Press; 2000. p. 1057–1063.
  • Marbach P, Tsitsiklis JN. Simulation-based optimization of Markov reward processes. IEEE Trans Autom Contr. 2001;46(2):191–209.
  • Konda V, Tsitsiklis J. On actor-critic algorithms. SIAM J Control Optim. 2003;42(4):1143–1166.
  • Chen Z, Maguluri ST, Shakkottai S, et al. Finite-sample analysis of contractive stochastic approximation using smooth convex envelopes [Arxiv:2002.00874v4]; 2020.
  • Chen Z, Maguluri ST, Shakkottai S, et al. Finite-sample analysis of off-policy TD-learning via generalized bellman operators [Arxiv:2106.12729v1]; 2021.