Invited Article

A tutorial introduction to reinforcement learning

Pages 172-191 | Received 27 Dec 2022, Accepted 19 Mar 2023, Published online: 19 Apr 2023

Abstract

In this paper, we present a brief survey of reinforcement learning, with particular emphasis on stochastic approximation (SA) as a unifying theme. The scope of the paper includes Markov reward processes, Markov decision processes, SA algorithms, and widely used algorithms such as temporal difference learning and Q-learning.

This article is part of the following collections:
Virtual Issue on the SICE Annual Conference 2022

1. Introduction

In this paper, we present a brief survey of reinforcement learning (RL), with particular emphasis on stochastic approximation (SA) as a unifying theme. The scope of the paper includes Markov reward processes, Markov decision processes (MDPs), SA methods, and widely used algorithms such as temporal difference learning and Q-learning. RL is a vast subject, and this brief survey can barely do justice to the topic. There are several excellent texts on RL, such as Refs. [1–4]. The dynamics of the SA algorithm are analysed in Refs. [5–11]. The interested reader may consult those sources for more information.

In this survey, we use the phrase "RL" to refer to decision-making with uncertain models in which, in addition, current actions alter the future behaviour of the system. Therefore, if the same action is taken at a future time, the consequences might not be the same. This additional feature distinguishes RL from "mere" decision-making under uncertainty. Figure 1 rather arbitrarily divides decision-making problems into four quadrants. Examples from each quadrant are now briefly described.

  • Many if not most decision-making problems fall into the lower-left quadrant of “good model, no alteration” (meaning that the control actions do not alter the environment). An example is a fighter aircraft which usually has an excellent model, thanks to aerodynamical modelling and/or wind tunnel tests. In turn this permits the control system designers to formulate and to solve an optimal (or some other form of) control problem.

  • Controlling a chemical reactor would be an example from the lower-right quadrant. As a traditional control system, it can be assumed that the environment in which the reactor operates does not change as a consequence of the control strategy adopted. However, due to the complexity of a reactor, it is difficult to obtain a very accurate model, in contrast with a fighter aircraft for example. In such a case, one can adopt one of two approaches. The first, which is a traditional approach in control system theory, is to use a nominal model of the system and to treat the deviations from the nominal model as uncertainties in the model. A controller is designed based on the nominal model, and robust control theory would be invoked to ensure that the controller would still perform satisfactorily (though not necessarily optimally) for the actual system. The second, which would move the problem from the lower-right to the upper-right quadrant, is to attempt to "learn" the unknown dynamical model by probing its response to various inputs. This approach is suggested in Ref. [4, Example 3.1]. A similar statement can be made about robots, where the geometry determines the form of the dynamical equations describing the robot, but not the parameters in those equations; see, for example, Ref. [12]. In this case too, it is possible to "learn" the dynamics through experimentation. In practice, such an approach is far slower than the traditional control systems approach of using a nominal model and designing a "robust" controller. However, "learning control" is a popular area in the world of machine learning. One reason is that if the initial modelling error is too large, then robust control theory alone may not be sufficient to ensure the stability of the actual system with the designed controller. In contrast (and in principle), a "learning control" approach can withstand larger modelling errors. The widely used model predictive control paradigm can be viewed as an example of a learning-based approach.

  • A classic example of a problem belonging to the upper-left quadrant is an MDP, which forms the backbone of one approach to RL. In an MDP, there is a state space $X$ and an action space $U$, both of which are usually assumed to be finite. In most MDPs, $|X| \gg |U|$. Board games without an element of randomness, such as tic-tac-toe or chess, would belong to the upper-left quadrant, at least in principle. Tic-tac-toe belongs here because the rules of the game are clear, and the number of possible games is manageable. In principle, games such as chess which are "deterministic" (i.e. there is no throwing of dice, as in Backgammon for example) would also belong here. Chess is a two-person game in which, for each board position, it is possible to assign the likelihood of the three possible outcomes: white wins, black wins, or it is a draw. However, due to the enormous number of possibilities, it is often not possible to determine these likelihoods precisely. It is pointed out explicitly in Ref. [13] that merely because we cannot explicitly compute this likelihood function, it does not follow that the likelihood does not exist! However, as a practical matter, it is not a bad idea to treat this likelihood function as being unknown, and to infer it on the basis of experiment/experience. Thus, as with chemical reactors, it is not uncommon to move chess-playing from the upper-left quadrant to the upper-right quadrant.

  • The upper-right quadrant is the focus of RL. There are many possible ways to formulate RL problems, each of which leads to its own solution methodologies. A very popular approach is to formulate the RL problem as an MDP whose dynamics are unknown. That is the approach adopted in this paper.

Figure 1. The four quadrants of decision-making under uncertainty.


In an MDP, at each time $t$, the learner (also known as the actor or the agent) measures the state $X_t \in X$. Based on this measurement, the learner chooses an action $U_t \in U$ and receives a reward $R(X_t, U_t)$. Future rewards are discounted by a discount factor $\gamma \in (0,1)$. The rule by which the current action $U_t$ is chosen as a function of the current state $X_t$ is known as a policy. With each policy, one can associate the expected value of the total (discounted) reward over time. The problem is to find the best policy. There is a variant called the partially observable MDP in which the state $X_t$ cannot be measured directly; rather, there is an output (or observation) $Y_t \in Y$ which is a memoryless function, either deterministic or random, of $X_t$. These problems are not studied here; it is always assumed that $X_t$ can be measured directly. When the parameters of the MDP are known, there are several approaches to determining the optimal policy. RL is distinct from an MDP in that, in RL, the parameters of the underlying MDP are constant but not known to the learner; they must be learnt on the basis of experimentation. Figure 2 depicts the situation.

Figure 2. Depiction of a RL problem.


The remainder of the paper is organized as follows: in Section 2, Markov reward processes are introduced. These are a precursor to MDPs, which are introduced in Section 3. Specifically, in Section 3.1, the relevant problems in the study of MDPs are formulated. In Section 3.2, the solutions to these problems are given in terms of the Bellman value iteration, the action-value function, and the F-iteration to determine the optimal action-value function. In Section 3.3, we study the situation where the dynamics of the MDP under study are not known precisely. Instead, one has access only to a sample path $\{X_t\}$ of the Markov process under study. For this situation, we present two standard algorithms, known as temporal difference learning and Q-learning. Starting from Section 4, the paper consists of results due to the author. In Section 4.1, the concept of SA is introduced, and its relevance to RL is outlined in Section 4.2. In Section 4.3, a new theorem on the global asymptotic stability of nonlinear ODEs is stated; this theorem is of independent interest. Some theorems on the convergence of the SA algorithm are presented in Sections 4.4 and 4.5. In Section 5, the results of Section 4 are applied to RL problems. In Section 5.1, a technical result on the sample paths of an irreducible Markov process is stated. Using this result, simplified conditions are given for the convergence of the temporal difference algorithm (Section 5.2) and Q-learning (Section 5.3). A brief set of concluding remarks ends the paper.

2. Markov reward processes

Markov reward processes are standard (stationary) Markov processes where each state has a "reward" associated with it. Markov reward processes are a precursor to MDPs; so we review those in this section. There are several standard texts on Markov processes, one of which is Ref. [14].

Suppose $X$ is a finite set of cardinality $n$, written as $\{x_1, \ldots, x_n\}$. If $\{X_t\}_{t \ge 0}$ is a stationary Markov process assuming values in $X$, then the corresponding state transition matrix $A$ is defined by
(1) $$a_{ij} = \Pr\{X_{t+1} = x_j \mid X_t = x_i\}.$$
Thus the $i$th row of $A$ is the conditional probability vector of $X_{t+1}$ when $X_t = x_i$. Clearly the row sums of the matrix $A$ are all equal to one. This can be expressed as $A \mathbf{1}_n = \mathbf{1}_n$, where $\mathbf{1}_n$ denotes the $n$-dimensional column vector whose entries all equal one. Therefore, if we define the induced matrix norm $\|A\|_{\infty\to\infty}$ as
$$\|A\|_{\infty\to\infty} := \max_{v \ne 0} \frac{\|Av\|_\infty}{\|v\|_\infty},$$
then $\|A\|_{\infty\to\infty}$ equals one, which also equals the spectral radius of $A$.

Now suppose that there is a "reward" function $R : X \to \mathbb{R}$ associated with each state. There is no consensus within the community about whether the reward corresponding to the state $X_t$ is paid at time $t$ as in Ref. [3] or at time $t+1$ as in Refs. [2,4]. In this paper, it is assumed that the reward is paid at time $t$ and is denoted by $R_t$; the modifications required to handle the other convention are easy and left to the reader. The reward $R_t$ can be either a deterministic or a random function of $X_t$. If $R_t$ is a deterministic function of $X_t$, then $R_t = R(X_t)$, where $R$ is the reward function mapping $X$ into (a finite subset of) $\mathbb{R}$. Thus, whenever the trajectory $\{X_t\}$ of the Markov process equals some state $x_i \in X$, the resulting reward $R(X_t)$ always equals $R(x_i) =: r_i$. Thus the reward is captured by an $n$-dimensional vector $r$, where $r_i = R(x_i)$. On the other hand, if $R_t$ is a random function of $X_t$, then one would have to provide the probability distribution of $R_t$ given $X_t$. Since $X_t$ has only $n$ different values, we would have to provide $n$ different probability distributions. Note that because the set $X$ is finite, if the reward function is deterministic, then we have that
$$\max_{x_i \in X} R(x_i) < \infty.$$
In case the reward function $R$ is random, it is common, in order to avoid technical difficulties, to assume that $R(x_i)$ is a bounded random variable for each index $i \in [n]$, where the symbol $[n]$ denotes $\{1, \ldots, n\}$. With this assumption, it follows that
$$\max_{x_i \in X} E[R(x_i)] < \infty.$$
Two kinds of Markov reward processes are widely studied, namely discounted reward processes and average reward processes. In this paper, we restrict attention to discounted reward processes. However, we briefly introduce average reward processes. Define (if it exists)
$$V(x_i) := \lim_{T \to \infty} \frac{1}{T+1}\, E\Big[\sum_{t=0}^{T} R(X_t) \,\Big|\, X_0 = x_i\Big].$$
An excellent review of average reward processes can be found in Ref. [15].

In a discounted Markov reward process, there is a "discount factor" $\gamma \in (0,1)$. This factor captures the extent to which future rewards are less valuable than immediate rewards. Fix an initial state $x_i \in X$. Then the expected discounted future reward $V(x_i)$ is defined as
(2) $$V(x_i) := E\Big[\sum_{t=0}^{\infty} \gamma^t R_t \,\Big|\, X_0 = x_i\Big] = E\Big[\sum_{t=0}^{\infty} \gamma^t R(X_t) \,\Big|\, X_0 = x_i\Big].$$
We often just say "discounted reward" instead of the longer phrase. With these assumptions, because $\gamma < 1$, the above summation converges and is well defined. The quantity $V(x_i)$ is referred to as the value function associated with $x_i$, and the vector
(3) $$v = [\,V(x_1) \;\cdots\; V(x_n)\,]^\top$$
is referred to as the value vector. Note that, throughout this paper, we view the value both as a function $V : X \to \mathbb{R}$ and as a vector $v \in \mathbb{R}^n$. The relationship between the two is given by (3). We shall use whichever interpretation is more convenient in a given context.

This raises the question as to how the value function and/or value vector is to be determined. Define the vector $r \in \mathbb{R}^n$ via
(4) $$r := [\,r_1 \;\cdots\; r_n\,]^\top,$$
where $r_i = R(x_i)$ if $R$ is a deterministic function, and, if $R_t$ is a random function of $X_t$,
(5) $$r_i := E[R(x_i)].$$
The next result gives a useful characterization of the value vector.

Theorem 2.1

The vector $v$ satisfies the recursive relationship
(6) $$v = r + \gamma A v,$$
or, in expanded form,
(7) $$V(x_i) = r_i + \gamma \sum_{j=1}^{n} a_{ij} V(x_j).$$

Proof.

Let $x_i \in X$ be arbitrary. Then by definition we have
(8) $$V(x_i) = E\Big[\sum_{t=0}^{\infty} \gamma^t R_t \,\Big|\, X_0 = x_i\Big] = r_i + E\Big[\sum_{t=1}^{\infty} \gamma^t R_t \,\Big|\, X_0 = x_i\Big].$$
However, if $X_0 = x_i$, then $X_1 = x_j$ with probability $a_{ij}$. Therefore, we can write
(9) $$E\Big[\sum_{t=1}^{\infty} \gamma^t R_t \,\Big|\, X_0 = x_i\Big] = \sum_{j=1}^{n} a_{ij}\, E\Big[\sum_{t=1}^{\infty} \gamma^t R_t \,\Big|\, X_1 = x_j\Big] = \gamma \sum_{j=1}^{n} a_{ij}\, E\Big[\sum_{t=0}^{\infty} \gamma^t R_t \,\Big|\, X_0 = x_j\Big] = \gamma \sum_{j=1}^{n} a_{ij} V(x_j).$$
In the second step we use the fact that the Markov process is stationary. Substituting (9) into (8) gives the recursive relationship (7).

Example 2.2

As an illustration of a Markov reward process, we analyse a toy snakes and ladders game with the transitions shown in Figure 3. Here W and L denote "win" and "lose," respectively. The rules of the game are as follows:

  • Initial state is S.

  • A four-sided, fair die is thrown at each stage.

  • Player must land exactly on W to win and exactly on L to lose.

  • If implementing a move causes crossing of W and L, then the move is not implemented.

Figure 3. A toy snakes and ladders game.


There are 12 possible states in all: S, 1, …, 9, W, and L. However, 2, 3, and 9 can be omitted, leaving nine states, namely S, 1, 4, 5, 6, 7, 8, W, and L. At each step, there are at most four possible outcomes. For example, from the state S, the four outcomes are 1, 7, 5, and 4. From state 6, the four outcomes are 7, 8, 1, and W. From state 7, the four outcomes are 8, 1, W, and L. From state 8, the four possible outcomes are 1, W, L, and 8, with probability 1/4 each, because if the die comes up with 4, then the move cannot be implemented. It is time-consuming but straightforward to compute the state transition matrix as
$$A = \begin{array}{c|ccccccccc}
 & S & 1 & 4 & 5 & 6 & 7 & 8 & W & L \\ \hline
S & 0 & 0.25 & 0.25 & 0.25 & 0 & 0.25 & 0 & 0 & 0 \\
1 & 0 & 0 & 0.25 & 0.50 & 0 & 0.25 & 0 & 0 & 0 \\
4 & 0 & 0 & 0 & 0.25 & 0.25 & 0.25 & 0.25 & 0 & 0 \\
5 & 0 & 0.25 & 0 & 0 & 0.25 & 0.25 & 0.25 & 0 & 0 \\
6 & 0 & 0.25 & 0 & 0 & 0 & 0.25 & 0.25 & 0.25 & 0 \\
7 & 0 & 0.25 & 0 & 0 & 0 & 0 & 0.25 & 0.25 & 0.25 \\
8 & 0 & 0.25 & 0 & 0 & 0 & 0 & 0.25 & 0.25 & 0.25 \\
W & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
L & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{array}$$

We define a reward function for this problem as follows: we set $R_t = f(X_{t+1})$, where $f(W) = 5$, $f(L) = -2$, and $f(x) = 0$ for all other states. With this definition, the reward $R_t$ is a random function of the current state $X_t$; however, there is an expected reward depending on the state at the next time instant. For example, if $X_0 = 6$, then the expected value of $R_0$ is $5/4$, whereas if $X_0 = 7$ or $X_0 = 8$, then the expected value of $R_0$ is $3/4$.

Now let us see how the implicit equation (6) can be solved to determine the value vector $v$. Since the induced matrix norm $\|A\|_{\infty\to\infty} = 1$ and $\gamma < 1$, it follows that the matrix $I - \gamma A$ is nonsingular. Therefore, for every reward vector $r$, there is a unique $v$ that satisfies (6). In principle, it is possible to deduce from (6) that
(10) $$v = (I - \gamma A)^{-1} r.$$
The difficulty with this formula, however, is that in most actual applications of Markov decision problems, the integer $n$ denoting the size of the state space $X$ is quite large. Moreover, inverting a matrix has cubic complexity in the size of the matrix. Therefore, it may not be practicable to invert the matrix $I - \gamma A$. So we are forced to look for alternative approaches. A feasible approach is provided by the contraction mapping theorem.

Theorem 2.3

The map $y \mapsto Ty := r + \gamma A y$ is monotone and is a contraction with respect to the $\ell_\infty$-norm, with contraction constant $\gamma$. Therefore, we can choose some vector $y_0$ arbitrarily, and then define
$$y_{i+1} = r + \gamma A y_i.$$
Then $y_i$ converges to the value vector $v$.

Proof.

The first statement is that if $y_1 \le y_2$ componentwise (and note that the vectors $y_1$ and $y_2$ need not consist of only positive components), then $Ty_1 \le Ty_2$. This is obvious from the fact that the matrix $A$ has only nonnegative components, so that $Ay_1 \le Ay_2$. For the second statement, note that, because the matrix $A$ is row-stochastic, the induced matrix norm $\|A\|_{\infty\to\infty}$ is equal to one. Therefore,
$$\|Ty_1 - Ty_2\|_\infty = \gamma \|A(y_1 - y_2)\|_\infty \le \gamma \|y_1 - y_2\|_\infty.$$
This completes the proof.
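To make Theorem 2.3 concrete, the following minimal Python sketch (not from the paper) applies the iteration $y_{i+1} = r + \gamma A y_i$ to the snakes-and-ladders game of Example 2.2. The discount factor $\gamma = 0.9$, the tolerance, and the function names are my own choices.

```python
import numpy as np

# Transition matrix of the toy snakes-and-ladders game (state order S,1,4,5,6,7,8,W,L).
A = np.array([
    [0, 0.25, 0.25, 0.25, 0,    0.25, 0,    0,    0   ],  # S
    [0, 0,    0.25, 0.50, 0,    0.25, 0,    0,    0   ],  # 1
    [0, 0,    0,    0.25, 0.25, 0.25, 0.25, 0,    0   ],  # 4
    [0, 0.25, 0,    0,    0.25, 0.25, 0.25, 0,    0   ],  # 5
    [0, 0.25, 0,    0,    0,    0.25, 0.25, 0.25, 0   ],  # 6
    [0, 0.25, 0,    0,    0,    0,    0.25, 0.25, 0.25],  # 7
    [0, 0.25, 0,    0,    0,    0,    0.25, 0.25, 0.25],  # 8
    [0, 0,    0,    0,    0,    0,    0,    1,    0   ],  # W (absorbing)
    [0, 0,    0,    0,    0,    0,    0,    0,    1   ],  # L (absorbing)
])
f = np.array([0, 0, 0, 0, 0, 0, 0, 5, -2])  # terminal rewards f(W)=5, f(L)=-2
r = A @ f        # expected reward r_i = E[f(X_{t+1}) | X_t = x_i]; gives 5/4 at state 6, 3/4 at 7 and 8
gamma = 0.9      # assumed discount factor (not specified in the text)

def value_iteration(r, A, gamma, tol=1e-10):
    """Iterate y <- r + gamma * A @ y until the sup-norm change is below tol (Theorem 2.3)."""
    y = np.zeros_like(r, dtype=float)
    while True:
        y_next = r + gamma * A @ y
        if np.max(np.abs(y_next - y)) < tol:
            return y_next
        y = y_next

v = value_iteration(r, A, gamma)
# For this small example, v agrees with the closed form (10):
# np.linalg.solve(np.eye(9) - gamma * A, r)
```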

There is however a limitation to this approach, namely that the state transition matrix A has to be known. In RL, this assumption is often not satisfied. Instead, one has access to a single sample path $\{X_t\}$ of a Markov process over $X$, whose state transition matrix is $A$. The question therefore arises: How can one compute the value vector $v$ in such a scenario? The answer is provided by the so-called temporal difference algorithm, which is discussed in Section 3.3.

3. MDPs

3.1. Problem formulation

In a Markov reward process, the state $X_t$ evolves on its own, according to a predetermined state transition matrix. In contrast, in an MDP, there is also another variable called the "action" which affects the dynamics. Specifically, in addition to the state space $X$, there is also a finite set of actions $U$. Each action $u_k \in U$ leads to a corresponding state transition matrix $A^{u_k} = [a_{ij}^{u_k}]$. So at time $t$, if the state is $X_t$ and an action $U_t \in U$ is applied, then
(11) $$\Pr\{X_{t+1} = x_j \mid X_t = x_i, U_t = u_k\} = a_{ij}^{u_k}.$$
Obviously, for each fixed $u_k \in U$, the corresponding state transition matrix $A^{u_k}$ is row-stochastic. In addition, there is also a "reward" function $R : X \times U \to \mathbb{R}$. Note that in a Markov reward process, the reward depends only on the current state, whereas in an MDP, the reward depends on both the current state and the action taken. As in Markov reward processes, it is possible to permit $R$ to be a random function of $X_t$ and $U_t$ as opposed to a deterministic function. Moreover, to be consistent with the earlier convention, it is assumed that the reward $R(X_t, U_t)$ is paid at time $t$.

The most important aspect of an MDP is the concept of a "policy," which is just a systematic way of choosing $U_t$ given $X_t$. If $\pi : X \to U$ is any map, this would be called a deterministic policy, and the set of all deterministic policies is denoted by $\Pi_d$. Alternatively, let $S(U)$ denote the set of probability distributions on the finite set $U$. Then a map $\pi : X \to S(U)$ would be called a probabilistic policy, and the set of probabilistic policies is denoted by $\Pi_p$. Note that the cardinality of $\Pi_d$ equals $|U|^{|X|}$, while the set $\Pi_p$ is uncountable.

A vital point about MDPs is this: whenever any policy $\pi$, whether deterministic or probabilistic, is implemented, the resulting process $\{X_t\}$ is a Markov process with an associated state transition matrix, which is denoted by $A^\pi$. This matrix can be determined as follows: if $\pi \in \Pi_d$, then at time $t$, if $X_t = x_i$, the corresponding action $U_t$ equals $\pi(x_i)$. Therefore,
(12) $$\Pr\{X_{t+1} = x_j \mid X_t = x_i, \pi\} = a_{ij}^{\pi(x_i)}.$$
If $\pi \in \Pi_p$ and
(13) $$\pi(x_i) = [\,\phi_{i1} \;\cdots\; \phi_{im}\,],$$
where $m = |U|$, then
(14) $$\Pr\{X_{t+1} = x_j \mid X_t = x_i, \pi\} = \sum_{k=1}^{m} \phi_{ik}\, a_{ij}^{u_k}.$$
In a similar manner, for every policy $\pi$, the reward function $R : X \times U \to \mathbb{R}$ can be converted into a reward map $R^\pi : X \to \mathbb{R}$ as follows: if $\pi \in \Pi_d$, then
(15) $$R^\pi(x_i) = R(x_i, \pi(x_i)),$$
whereas if $\pi \in \Pi_p$, then
(16) $$R^\pi(x_i) = \sum_{k=1}^{m} \phi_{ik}\, R(x_i, u_k).$$
Thus, given any policy $\pi$, whether deterministic or probabilistic, we can associate with it a reward vector $r^\pi$. To summarize, given any MDP, once a policy $\pi$ is chosen, the resulting process $\{X_t\}$ is a Markov reward process with state transition matrix $A^\pi$ and reward vector $r^\pi$.

Example 3.1

To illustrate these ideas, suppose $n = 4$ and $m = 2$, so that there are four states and two actions. Thus, there are two $4 \times 4$ state transition matrices $A^1, A^2$ corresponding to the two actions. (In the interests of clarity, we write $A^1$ and $A^2$ instead of $A^{u_1}$ and $A^{u_2}$.) Suppose $\pi_1$ is a deterministic policy, represented as an $n \times m$ matrix (in this case a $4 \times 2$ matrix), as follows:
$$M_1 = \begin{bmatrix} 0 & 1 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}.$$
This means that if $X_t = x_1, x_2$, or $x_4$, then $U_t = u_2$, while if $X_t = x_3$, then $U_t = u_1$. Let us use the notation $(A^k)_i$ to denote the $i$th row of the matrix $A^k$, where $k = 1, 2$ and $i = 1, 2, 3, 4$. Then the state transition matrix $A^{\pi_1}$ is given by
$$A^{\pi_1} = \begin{bmatrix} (A^2)_1 \\ (A^2)_2 \\ (A^1)_3 \\ (A^2)_4 \end{bmatrix}.$$
Thus, the first, second, and fourth rows of $A^{\pi_1}$ come from $A^2$, while the third row comes from $A^1$.

Next, suppose $\pi_2$ is a probabilistic policy, represented by the matrix
$$M_2 = \begin{bmatrix} 0.3 & 0.7 \\ 0.2 & 0.8 \\ 0.9 & 0.1 \\ 0.4 & 0.6 \end{bmatrix}.$$
Thus, if $X_t = x_1$, then the action $U_t$ equals $u_1$ with probability 0.3 and equals $u_2$ with probability 0.7, and so on. For this policy, the resulting state transition matrix is determined as follows:
$$A^{\pi_2} = \begin{bmatrix} 0.3 (A^1)_1 + 0.7 (A^2)_1 \\ 0.2 (A^1)_2 + 0.8 (A^2)_2 \\ 0.9 (A^1)_3 + 0.1 (A^2)_3 \\ 0.4 (A^1)_4 + 0.6 (A^2)_4 \end{bmatrix}.$$
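The construction of $A^\pi$ and $r^\pi$ in (12)–(16) is mechanical, and a short sketch may help. The code below is purely illustrative; the function name and the representation of the policy as an $n \times m$ matrix $M$ (as in Example 3.1, with a deterministic policy being the 0/1 special case) are my own assumptions.

```python
import numpy as np

def induced_mrp(A_list, R, M):
    """Build the Markov reward process (A_pi, r_pi) induced by a policy.

    A_list : list of m row-stochastic (n x n) matrices, one per action.
    R      : (n x m) reward matrix, R[i, k] = R(x_i, u_k).
    M      : (n x m) policy matrix; row i is a probability distribution over actions.
    """
    n, m = M.shape
    A_pi = np.zeros((n, n))
    r_pi = np.zeros(n)
    for i in range(n):
        for k in range(m):
            A_pi[i, :] += M[i, k] * A_list[k][i, :]   # Equations (12)/(14)
            r_pi[i] += M[i, k] * R[i, k]              # Equations (15)/(16)
    return A_pi, r_pi
```

For the policies of Example 3.1, one would pass the matrix $M_1$ (deterministic) or $M_2$ (probabilistic) as the argument M.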

For an MDP, one can pose three questions:

  1. Policy evaluation: We have seen already that given a Markov reward process, with a reward vector $r$ and a discount factor $\gamma$, there corresponds a unique value vector $v$. We have also seen that, for any choice of a policy $\pi$, whether deterministic or probabilistic, there correspond a state transition matrix $A^\pi$ and a reward vector $r^\pi$. Therefore, once a policy $\pi$ is chosen, the MDP becomes a Markov reward process with state transition matrix $A^\pi$ and reward vector $r^\pi$. We define $v^\pi$ to be the value vector associated with this Markov reward process. The question is: How can $v^\pi$ be computed?

  2. Optimal value determination: For each policy $\pi$, there is an associated value vector $v^\pi$. Let us view $v^\pi$ as a map from $X$ to $\mathbb{R}$, so that $V^\pi(x_i)$ is the $i$th component of $v^\pi$. Now suppose $x_i \in X$ is a specified initial state, and define
(17) $$V^*(x_i) := \max_{\pi \in \Pi_p} V^\pi(x_i)$$
to be the optimal value over all policies, when the MDP is started in the initial state $X_0 = x_i$. How can $V^*(x_i)$ be computed? Note that in (17), the optimum is taken over all probabilistic policies. However, it can be shown that the optimum is the same even if $\pi$ is restricted to only deterministic policies.

  3. Optimal policy determination: In (17) above, we associate an optimal policy with each state $x_i$. Now we can extend the idea and define the optimal policy map $\pi^* : X \to \Pi_d$ via
(18) $$\pi^*(x_i) := \arg\max_{\pi \in \Pi_d} V^\pi(x_i).$$
How can the optimal policy map $\pi^*$ be determined? Note that it is not a priori evident that there exists one policy that is optimal for all initial states. But the existence of such an optimal policy can be shown. Also, we can restrict the maximization to $\pi \in \Pi_d$ in (18) because it can be shown that the maximum over $\pi \in \Pi_p$ is not any larger. In other words,
$$\max_{\pi \in \Pi_d} V^\pi(x_i) = \max_{\pi \in \Pi_p} V^\pi(x_i).$$

3.2. MDPs: solution

In this subsection, we present answers to the three questions above.

3.2.1. Policy evaluation

Suppose a policy $\pi$ in $\Pi_d$ or $\Pi_p$ is specified. Then the corresponding state transition matrix $A^\pi$ and reward vector $r^\pi$ are given by (12) (or (14)) and (15), respectively. As pointed out above, once the policy is chosen, the process becomes just a Markov reward process. Then it readily follows from Theorem 2.1 that $v^\pi$ satisfies an equation analogous to (6), namely
(19) $$v^\pi = r^\pi + \gamma A^\pi v^\pi.$$
As before, it is inadvisable to compute $v^\pi$ via $v^\pi = (I - \gamma A^\pi)^{-1} r^\pi$. Instead, one should use value iteration to solve (19). Observe that, whatever the policy $\pi$ might be, the resulting state transition matrix $A^\pi$ satisfies $\|A^\pi\|_{\infty\to\infty} = 1$. Therefore, the map $y \mapsto r^\pi + \gamma A^\pi y$ is a contraction with respect to $\|\cdot\|_\infty$, with contraction constant $\gamma$.

3.2.2. Optimal value determination

Now we introduce one of the key ideas in MDPs. Define the Bellman iteration map $B : \mathbb{R}^n \to \mathbb{R}^n$ via
(20) $$(Bv)_i := \max_{u_k \in U}\Big[R(x_i, u_k) + \gamma \sum_{j=1}^{n} a_{ij}^{u_k} v_j\Big].$$

Theorem 3.2

The map $B$ is monotone and a contraction with respect to the $\ell_\infty$-norm. Therefore, the fixed point $\bar v$ of the map $B$ satisfies the relation
(21) $$(\bar v)_i = \max_{u_k \in U}\Big[R(x_i, u_k) + \gamma \sum_{j=1}^{n} a_{ij}^{u_k} (\bar v)_j\Big].$$

Note that (21) is known as the Bellman optimality equation. Thus, in principle at least, we can choose an arbitrary initial guess $v_0 \in \mathbb{R}^n$ and repeatedly apply the Bellman iteration. The resulting iterates converge to the unique fixed point of the operator $B$, which we denote by $\bar v$.

The significance of the Bellman iteration is given by the next theorem.

Theorem 3.3

Define $\bar v \in \mathbb{R}^n$ to be the unique fixed point of $B$, and define $v^* \in \mathbb{R}^n$ to equal $[V^*(x_i), x_i \in X]$, where $V^*(x_i)$ is defined in (17). Then $\bar v = v^*$.

Therefore, the optimal value vector can be computed using the Bellman iteration. However, knowing the optimal value vector does not, by itself, give us an optimal policy.
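As an illustration (not taken from the paper), the Bellman iteration (20) can be implemented in a few lines when the matrices $A^{u_k}$ and the reward table are available; the helper name, data layout, and tolerance below are my own choices.

```python
import numpy as np

def bellman_iteration(A_list, R, gamma, tol=1e-10):
    """Approximate the optimal value vector by iterating the map B of Equation (20).

    A_list : list of m row-stochastic (n x n) matrices A^{u_k}.
    R      : (n x m) matrix with R[i, k] = R(x_i, u_k).
    """
    n, m = R.shape
    v = np.zeros(n)
    while True:
        # (Bv)_i = max_k [ R(x_i, u_k) + gamma * sum_j a_ij^{u_k} v_j ]
        Bv = np.max(R + gamma * np.stack([A @ v for A in A_list], axis=1), axis=1)
        if np.max(np.abs(Bv - v)) < tol:
            return Bv
        v = Bv
```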

3.2.3. Optimal policy determination

To solve the problem of optimal policy determination, we introduce another function $Q^\pi : X \times U \to \mathbb{R}$, known as the action-value function, which is defined as follows:
(22) $$Q^\pi(x_i, u_k) := R(x_i, u_k) + E_\pi\Big[\sum_{t=1}^{\infty} \gamma^t R^\pi(X_t) \,\Big|\, X_0 = x_i, U_0 = u_k\Big].$$
This function was first defined in Ref. [16]. Note that $Q^\pi$ is defined only for deterministic policies. In principle, it is possible to define it for probabilistic policies, but this is not commonly done. In the above definition, the expectation $E_\pi$ is with respect to the evolution of the state $X_t$ under the policy $\pi$.

The way in which an MDP is set up is that at time $t$, the Markov process reaches a state $X_t$, based on the previous state $X_{t-1}$ and the state transition matrix $A^\pi$ corresponding to the policy $\pi$. Once $X_t$ is known, the policy $\pi$ determines the action $U_t = \pi(X_t)$, and then the reward $R^\pi(X_t) = R(X_t, \pi(X_t))$ is generated. In particular, when defining the value function $V^\pi(x_i)$ corresponding to a policy $\pi$, we start off the MDP in the initial state $X_0 = x_i$ and choose the action $U_0 = \pi(x_i)$. However, in defining the action-value function $Q^\pi$, we do not feel compelled to set $U_0 = \pi(X_0) = \pi(x_i)$, and can choose an arbitrary action $u_k \in U$. From $t = 1$ onwards, however, the action $U_t$ is chosen as $U_t = \pi(X_t)$. This seemingly small change leads to some simplifications.

Just as we can interpret $V^\pi : X \to \mathbb{R}$ as an $n$-dimensional vector, we can interpret $Q^\pi : X \times U \to \mathbb{R}$ as an $nm$-dimensional vector, or as a matrix of dimension $n \times m$. Consequently, the $Q^\pi$-vector has higher dimension than the value vector.

Theorem 3.4

For each policy $\pi \in \Pi_d$, the function $Q^\pi$ satisfies the recursive relationship

(23) $$Q^\pi(x_i, u_k) = R(x_i, u_k) + \gamma \sum_{j=1}^{n} a_{ij}^{u_k}\, Q^\pi(x_j, \pi(x_j)).$$

Proof.

Observe that at time $t = 0$, the state transition matrix is $A^{u_k}$. So, given that $X_0 = x_i$ and $U_0 = u_k$, the next state $X_1$ has the distribution $[a_{ij}^{u_k}, j = 1, \ldots, n]$. Moreover, $U_1 = \pi(X_1)$ because the policy $\pi$ is implemented from time $t = 1$ onwards. Therefore,
$$Q^\pi(x_i, u_k) = R(x_i, u_k) + E_\pi\Big[\sum_{j=1}^{n} a_{ij}^{u_k}\Big(\gamma R(x_j, \pi(x_j)) + \sum_{t=2}^{\infty} \gamma^t R^\pi(X_t) \,\Big|\, X_1 = x_j, U_1 = \pi(x_j)\Big)\Big]$$
$$= R(x_i, u_k) + E_\pi\Big[\gamma \sum_{j=1}^{n} a_{ij}^{u_k}\Big(R(x_j, \pi(x_j)) + \sum_{t=1}^{\infty} \gamma^t R^\pi(X_t) \,\Big|\, X_1 = x_j, U_1 = \pi(x_j)\Big)\Big]$$
$$= R(x_i, u_k) + \gamma \sum_{j=1}^{n} a_{ij}^{u_k}\, Q^\pi(x_j, \pi(x_j)).$$
This is the desired conclusion.

Theorem 3.5

The functions $V^\pi$ and $Q^\pi$ are related via
(24) $$V^\pi(x_i) = Q^\pi(x_i, \pi(x_i)).$$

Proof.

If we choose $u_k = \pi(x_i)$, then (23) becomes
$$Q^\pi(x_i, \pi(x_i)) = R^\pi(x_i) + \gamma \sum_{j=1}^{n} a_{ij}^{\pi(x_i)}\, Q^\pi(x_j, \pi(x_j)).$$
This is the same as (19) written out componentwise, with $Q^\pi(x_j, \pi(x_j))$ in place of $V^\pi(x_j)$. We know that (19) has a unique solution, namely $V^\pi$. This shows that (24) holds.

The import of Theorem 3.5 is the following: in defining the function $Q^\pi(x_i, u_k)$ for a fixed policy $\pi \in \Pi_d$, we have the freedom to choose the initial action $u_k$ as any element we wish in the action space $U$. However, if we choose the initial action $u_k = \pi(x_i)$ for each state $x_i \in X$, then the corresponding action-value function $Q^\pi(x_i, u_k)$ equals the value function $V^\pi(x_i)$, for each state $x_i \in X$.

In view of (24), the recursive equation for $Q^\pi$ can be rewritten as
(25) $$Q^\pi(x_i, u_k) = R(x_i, u_k) + \gamma \sum_{j=1}^{n} a_{ij}^{u_k}\, V^\pi(x_j).$$
This motivates the next theorem.

Theorem 3.6

Define $Q^* : X \times U \to \mathbb{R}$ by
(26) $$Q^*(x_i, u_k) = R(x_i, u_k) + \gamma \sum_{j=1}^{n} a_{ij}^{u_k}\, V^*(x_j).$$
Then $Q^*(\cdot,\cdot)$ satisfies the following relationships:
(27) $$Q^*(x_i, u_k) = R(x_i, u_k) + \gamma \sum_{j=1}^{n} a_{ij}^{u_k} \max_{w_l \in U} Q^*(x_j, w_l),$$
(28) $$V^*(x_i) = \max_{u_k \in U} Q^*(x_i, u_k).$$
Moreover, every policy $\pi^* \in \Pi_d$ such that
(29) $$\pi^*(x_i) = \arg\max_{u_k \in U} Q^*(x_i, u_k)$$
is optimal.

Proof.

Since $Q^*(\cdot,\cdot)$ is defined by (26), it follows that
$$\max_{u_k \in U} Q^*(x_i, u_k) = \max_{u_k \in U}\Big[R(x_i, u_k) + \gamma \sum_{j=1}^{n} a_{ij}^{u_k} V^*(x_j)\Big] = V^*(x_i),$$
where the last step is the Bellman optimality equation (21). This establishes (28) and (29). Substituting (28) into (26) gives (27).

Theorem 3.6 converts the problem of determining an optimal policy into one of solving the implicit equation (27). For this purpose, we define an iteration on action-value functions that is analogous to (20) for value functions. As with the value function, the action-value function can be viewed either as a map $Q : X \times U \to \mathbb{R}$, or as a vector in $\mathbb{R}^{nm}$, or as an $n \times m$ matrix. We use whichever interpretation is convenient in the given situation.

Theorem 3.7

Define $F : \mathbb{R}^{|X| \times |U|} \to \mathbb{R}^{|X| \times |U|}$ by
(30) $$[F(Q)](x_i, u_k) := R(x_i, u_k) + \gamma \sum_{j=1}^{n} a_{ij}^{u_k} \max_{w_l \in U} Q(x_j, w_l).$$
Then the map $F$ is monotone and is a contraction. Moreover, for all $Q_0 : X \times U \to \mathbb{R}$, the sequence of iterates $\{F^t(Q_0)\}$ converges to $Q^*$ as $t \to \infty$.

If we rewrite (21) and (27) in terms of expected values, the differences between the $Q$-function and the $V$-function become apparent. We can rewrite (21) as
(31) $$V^*(X_t) = \max_{U_t \in U}\big\{R(X_t, U_t) + \gamma E[V^*(X_{t+1}) \mid X_t, U_t]\big\}$$
and (27) as
(32) $$Q^*(X_t, U_t) = R(X_t, U_t) + \gamma E\Big[\max_{U_{t+1} \in U} Q^*(X_{t+1}, U_{t+1}) \,\Big|\, X_t, U_t\Big].$$
Thus in the Bellman formulation and iteration, the maximization occurs outside the expectation, whereas with the $Q$-formulation and $F$-iteration, the maximization occurs inside the expectation.
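For completeness, here is a hypothetical sketch of the $F$-iteration (30), which also reads off a greedy policy via (29). The function name, tolerance, and data layout are my own; the paper itself does not prescribe an implementation.

```python
import numpy as np

def f_iteration(A_list, R, gamma, tol=1e-10):
    """Iterate the map F of Equation (30) to approximate Q*, then read off
    an optimal deterministic policy via Equation (29).

    A_list : list of m row-stochastic (n x n) matrices A^{u_k}.
    R      : (n x m) matrix with R[i, k] = R(x_i, u_k).
    """
    n, m = R.shape
    Q = np.zeros((n, m))
    while True:
        V = Q.max(axis=1)                                   # V(x_j) = max_w Q(x_j, w)
        FQ = R + gamma * np.stack([A @ V for A in A_list], axis=1)
        if np.max(np.abs(FQ - Q)) < tol:
            break
        Q = FQ
    policy = Q.argmax(axis=1)                               # pi*(x_i) = argmax_k Q*(x_i, u_k)
    return Q, policy
```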

3.3. Iterative algorithms for MDPs with unknown dynamics

In principle, Theorem 2.3 can be used to compute, to arbitrary precision, the value vector of a Markov reward process. Similarly, Theorem 3.7 can be used to compute, to arbitrary precision, the optimal action-value function of an MDP, from which both the optimal value function and the optimal policy can be determined. However, both theorems depend crucially on knowing the dynamics of the underlying process. For instance, if the state transition matrix $A$ is not known, it would not be possible to carry out the iterations
$$y_{i+1} = r + \gamma A y_i.$$
Early researchers in RL were aware of this issue and developed several algorithms that do not require explicit knowledge of the dynamics of the underlying process. Instead, it is assumed that a sample path $\{X_t\}_{t \ge 0}$ of the Markov process, together with the associated reward process, is available for use. With this information, one can think of two distinct approaches. First, one can use the sample path to estimate the state transition matrix, call it $\hat A$. After a sufficiently long sample path has been observed, the contraction iteration above can be applied with $A$ replaced by $\hat A$; this would correspond to so-called "indirect adaptive control." The second approach is to use the sample path right from time $t = 0$, and to adjust only one component of the estimated value function at each time instant $t$; this would correspond to so-called "direct adaptive control." Using a similar approach, it is also possible to estimate the action-value function based on a single sample path. We describe two such algorithms, namely temporal difference learning for estimating the value function of a Markov reward process, and Q-learning for estimating the action-value function of an MDP. Within temporal difference learning, we make a further distinction between estimating the full value vector and estimating a projection of the value vector onto a lower-dimensional subspace.

3.3.1. Temporal difference learning without function approximation

In this subsection and the next, we describe the so-called "temporal difference" family of algorithms, first introduced in Ref. [17]. The objective of the algorithm is to compute the value vector of a Markov reward process. Recall that the value vector $v$ of a Markov reward process satisfies (6). In the temporal difference approach, it is not assumed that the state transition matrix $A$ is known. Rather, it is assumed that the learner has available a sample path $\{X_t\}$ of the Markov process under study, together with the associated reward at each time. For simplicity, it is assumed that the reward is deterministic and not random; thus the reward at time $t$ is just $R(X_t)$ and does not add any information.

There are two variants of the algorithm. In the first, one constructs a sequence of approximations $\hat v_t$ that, one hopes, converges to the true value vector $v$ as $t \to \infty$. In the second, which is used when $n$ is very large, one chooses a "basis representation matrix" $\Psi \in \mathbb{R}^{n \times d}$, where $d \ll n$. Then one constructs a sequence of vectors $\theta_t \in \mathbb{R}^d$, such that the corresponding sequence of vectors $\Psi\theta_t \in \mathbb{R}^n$ forms an approximation to the value vector $v$. Since there is no a priori reason to believe that $v$ belongs to the range of $\Psi$, there is also no reason to believe that $\Psi\theta_t$ would converge to $v$. The second approach is called temporal difference learning with function approximation. The first is studied in this subsection, while the second is studied in the next subsection.

In principle, by observing the sample path for a sufficiently long duration, it is possible to make a reliable estimate of $A$. However, a key feature of the temporal difference algorithm is that it is a "direct" method, which works directly with the sample path, without attempting to infer the underlying Markov process. With the sample path $\{X_t\}$ of the Markov process, one can associate a corresponding "index process" $\{N_t\}$ taking values in $[n]$ as follows: $N_t = i$ if $X_t = x_i \in X$. It is obvious that the index process has the same transition matrix $A$ as the process $\{X_t\}$. The idea is to start with an initial estimate $\hat v_0$, and to update it at each time $t$ based on the sample path $\{(X_t, R(X_t))\}$.

Now we introduce the temporal difference, or TD($\lambda$), algorithm studied in this paper. This version of the TD($\lambda$) algorithm comes from Ref. [18, Equation (4.7)] and is as follows: let $v$ denote the unique solution of the equation
$$v = r + \gamma A v.$$
At time $t$, let $\hat v_t \in \mathbb{R}^n$ denote the current estimate of $v$. Thus the $i$th component of $\hat v_t$, denoted by $\hat V_{t,i}$, is the estimate of the reward when the initial state is $x_i$. Let $\{N_t\}$ be the index process defined above. Define the "temporal difference"
(33) $$\delta_{t+1} := R_{N_t} + \gamma \hat V_{t, N_{t+1}} - \hat V_{t, N_t}, \quad t \ge 0,$$
where $\hat V_{t, N_t}$ denotes the $N_t$th component of the vector $\hat v_t$. Equivalently, if the state at time $t$ is $x_i \in X$ and the state at the next time $t+1$ is $x_j$, then
(34) $$\delta_{t+1} = R_i + \gamma \hat V_{t,j} - \hat V_{t,i}.$$
Next, choose a number $\lambda \in [0, 1)$. Define the "eligibility vector"
(35) $$z_t = \sum_{\tau=0}^{t} (\gamma\lambda)^\tau I_{\{N_{t-\tau} = N_t\}}\, e_{N_{t-\tau}},$$
where $e_{N_s}$ is a unit vector with a 1 in location $N_s$ and zeros elsewhere. Since the indicator function in the above summation picks up only those occurrences where $N_{t-\tau} = N_t$, the vector $z_t$ can also be expressed as
(36) $$z_t = \bar z_t\, e_{N_t}, \quad \bar z_t = \sum_{\tau=0}^{t} (\gamma\lambda)^\tau I_{\{N_{t-\tau} = N_t\}}.$$
Thus the support of the vector $z_t$ consists of the singleton $\{N_t\}$. Finally, update the estimate $\hat v_t$ as
(37) $$\hat v_{t+1} = \hat v_t + \alpha_t \delta_{t+1} z_t,$$
where $\{\alpha_t\}$ is a sequence of step sizes. Note that, at time $t$, only the $N_t$th component of $\hat v_t$ is updated, and the rest remain the same.
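A minimal sketch of the above update, assuming a recorded sample path and reward sequence, is given below. Note that, per (35)–(36), only the component corresponding to the current state is updated; the variable names and the step-size interface are mine.

```python
import numpy as np

def td_lambda(path, rewards, n, gamma, lam, alpha):
    """Tabular TD(lambda) as in (33)-(37).

    path    : observed state indices N_0, N_1, ... (values in 0..n-1)
    rewards : rewards R_{N_t} observed along the path
    alpha   : callable t -> step size alpha_t
    """
    v_hat = np.zeros(n)
    w = np.zeros(n)   # accumulating trace: w[i] = sum_tau (gamma*lam)^tau I{N_{t-tau}=i}
    for t in range(len(path) - 1):
        i, j = path[t], path[t + 1]
        delta = rewards[t] + gamma * v_hat[j] - v_hat[i]   # temporal difference (33)
        w *= gamma * lam
        w[i] += 1.0
        # Per (35)-(36), the eligibility vector z_t keeps only the N_t-th component
        # of w, so only that component of v_hat is updated, as in (37).
        v_hat[i] += alpha(t) * delta * w[i]
    return v_hat
```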

A sufficient condition for the convergence of the TD($\lambda$) algorithm is given in Ref. [18].

Theorem 3.8

The sequence $\{\hat v_t\}$ converges almost surely to $v$ as $t \to \infty$, provided that
(38) $$\sum_{t=0}^{\infty} \alpha_t^2\, I_{\{N_t = i\}} < \infty \ \text{a.s.}, \quad \forall i \in [n],$$
(39) $$\sum_{t=0}^{\infty} \alpha_t\, I_{\{N_t = i\}} = \infty \ \text{a.s.}, \quad \forall i \in [n].$$

3.3.2. TD-learning with function approximation

In this set-up, we again observe a time series $\{(X_t, R(X_t))\}$. The new feature is that there is a "basis" matrix $\Psi \in \mathbb{R}^{n \times d}$, where $d \ll n$. The estimated value vector at time $t$ is given by $\hat v_t = \Psi \theta_t$, where $\theta_t \in \mathbb{R}^d$ is the parameter vector to be updated. In this representation, it is clear that for any index $i \in [n]$, we have
$$\hat V_{t,i} = \Psi_i \theta_t,$$
where $\Psi_i$ denotes the $i$th row of the matrix $\Psi$.

Now we define the learning rule for updating $\theta_t$. Let $\{X_t\}$ be the observed sample path. By a slight abuse of notation, define $y_t = (\Psi_{X_t})^\top \in \mathbb{R}^d$; thus, if $X_t = x_i$, then $y_t = (\Psi_i)^\top$. The eligibility vector $z_t \in \mathbb{R}^d$ is defined via
(40) $$z_t = \sum_{\tau=0}^{t} (\gamma\lambda)^{t-\tau} y_\tau.$$
Note that $z_t$ satisfies the recursion
$$z_t = \gamma\lambda z_{t-1} + y_t.$$
Hence it is not necessary to keep track of an ever-growing set of past values of $y_\tau$. In contrast to (35), there is no term of the type $I_{\{N_{t-\tau} = N_t\}}$ in (40). Thus, unlike the eligibility vector defined in (35), the current vector $z_t$ can have more than one nonzero component. Next, define the temporal difference $\delta_{t+1}$ as in (33). Note that if $X_t = x_i$ and $X_{t+1} = x_j$, then
$$\delta_{t+1} = r_i + \gamma \Psi_j \theta_t - \Psi_i \theta_t.$$
Then the updating rule is
(41) $$\theta_{t+1} = \theta_t + \alpha_t \delta_{t+1} z_t,$$
where $\alpha_t$ is the step size.
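A corresponding sketch of TD($\lambda$) with linear function approximation, following (40)–(41), might look as follows; again the interface and names are my own invention.

```python
import numpy as np

def td_lambda_fa(path, rewards, Psi, gamma, lam, alpha):
    """TD(lambda) with linear function approximation, following (40)-(41).

    path    : observed state indices X_0, X_1, ... (values in 0..n-1)
    rewards : rewards r_{X_t} observed along the path
    Psi     : (n x d) basis matrix; row i is the feature vector of state x_i
    alpha   : callable t -> step size alpha_t
    """
    n, d = Psi.shape
    theta = np.zeros(d)
    z = np.zeros(d)                              # eligibility vector in R^d
    for t in range(len(path) - 1):
        i, j = path[t], path[t + 1]
        z = gamma * lam * z + Psi[i]             # recursion below (40)
        delta = rewards[t] + gamma * Psi[j] @ theta - Psi[i] @ theta
        theta = theta + alpha(t) * delta * z     # update (41)
    return theta
```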

The convergence analysis of (41) is carried out in detail in Ref. [19], based on the assumption that the state transition matrix $A$ is irreducible. This is quite reasonable, as it ensures that every state $x_i$ occurs infinitely often in any sample path, with probability one. Since that convergence analysis does not readily fit into the methods studied in subsequent sections, we state the main results without proof. However, we state and prove various intermediate results that are useful in their own right.

Suppose $A$ is row-stochastic and irreducible, and let $\mu$ denote its stationary distribution. Define $M = \mathrm{Diag}(\mu_i)$, and define a norm $\|\cdot\|_M$ on $\mathbb{R}^n$ by
$$\|v\|_M = (v^\top M v)^{1/2}.$$
Then the corresponding distance between two vectors $v_1$ and $v_2$ is given by
$$\|v_1 - v_2\|_M = \big((v_1 - v_2)^\top M (v_1 - v_2)\big)^{1/2}.$$
Then the following result is proved in Ref. [19].

Lemma 3.9

Suppose $A \in [0,1]^{n \times n}$ is row-stochastic and irreducible. Let $\mu$ be the stationary distribution of $A$. Then
$$\|Av\|_M \le \|v\|_M \quad \forall v \in \mathbb{R}^n.$$
Consequently, the map $v \mapsto r + \gamma A v$ is a contraction with respect to $\|\cdot\|_M$.

Proof.

We will show that
$$\|Av\|_M^2 \le \|v\|_M^2 \quad \forall v \in \mathbb{R}^n,$$
which is clearly equivalent to $\|Av\|_M \le \|v\|_M$. Now
$$\|Av\|_M^2 = \sum_{i=1}^{n} \mu_i (Av)_i^2 = \sum_{i=1}^{n} \mu_i \Big(\sum_{j=1}^{n} a_{ij} v_j\Big)^2.$$
However, for each fixed index $i$, the row $A_i$ is a probability distribution, and the function $f(y) = y^2$ is convex. If we apply Jensen's inequality with $f(y) = y^2$, we see that
$$\Big(\sum_{j=1}^{n} a_{ij} v_j\Big)^2 \le \sum_{j=1}^{n} a_{ij} v_j^2 \quad \forall i.$$
Therefore,
$$\|Av\|_M^2 \le \sum_{i=1}^{n} \mu_i \Big(\sum_{j=1}^{n} a_{ij} v_j^2\Big) = \sum_{j=1}^{n} \Big(\sum_{i=1}^{n} \mu_i a_{ij}\Big) v_j^2 = \sum_{j=1}^{n} \mu_j v_j^2 = \|v\|_M^2,$$
where in the last step we use the fact that $\mu^\top A = \mu^\top$.

To analyse the behaviour of the TD($\lambda$) algorithm with function approximation, the following map $T^\lambda : \mathbb{R}^n \to \mathbb{R}^n$ is defined in Ref. [19]:
$$[T^\lambda v]_i := (1 - \lambda) \sum_{l=0}^{\infty} \lambda^l\, E\Big[\sum_{\tau=0}^{l} \gamma^\tau R(X_\tau) + \gamma^{l+1} V(X_{l+1}) \,\Big|\, X_0 = x_i\Big].$$
Note that $T^\lambda v$ can be written explicitly as
$$T^\lambda v = (1 - \lambda) \sum_{l=0}^{\infty} \lambda^l \Big[\sum_{\tau=0}^{l} \gamma^\tau A^\tau r + \gamma^{l+1} A^{l+1} v\Big].$$

Lemma 3.10

The map $T^\lambda$ is a contraction with respect to $\|\cdot\|_M$, with contraction constant $\gamma(1-\lambda)/(1-\gamma\lambda)$.

Proof.

Note that the first term on the right side does not depend on $v$. Therefore,
$$T^\lambda v_1 - T^\lambda v_2 = \gamma(1-\lambda) \sum_{l=0}^{\infty} (\gamma\lambda)^l A^{l+1} (v_1 - v_2).$$
However, it is already known that $\|A(v_1 - v_2)\|_M \le \|v_1 - v_2\|_M$. By repeatedly applying this bound, it follows that
$$\|A^l (v_1 - v_2)\|_M \le \|v_1 - v_2\|_M \quad \forall l.$$
Therefore,
$$\|T^\lambda v_1 - T^\lambda v_2\|_M \le \gamma(1-\lambda) \sum_{l=0}^{\infty} (\gamma\lambda)^l \|v_1 - v_2\|_M = \frac{\gamma(1-\lambda)}{1 - \gamma\lambda} \|v_1 - v_2\|_M.$$
This is the desired bound.

Define a projection $\Pi : \mathbb{R}^n \to \mathbb{R}^n$ by
$$\Pi a := \Psi (\Psi^\top M \Psi)^{-1} \Psi^\top M a.$$
Then
$$\Pi a = \arg\min_{b \in \Psi(\mathbb{R}^d)} \|a - b\|_M.$$
Thus $\Pi$ projects the space $\mathbb{R}^n$ onto the image of the matrix $\Psi$, which is a $d$-dimensional subspace if $\Psi$ has full column rank. In other words, $\Pi a$ is the closest point to $a$ in the subspace $\Psi(\mathbb{R}^d)$.

Next, observe that the projection $\Pi$ is nonexpansive with respect to $\|\cdot\|_M$. As a result, the composite map $\Pi T^\lambda$ is a contraction. Thus there exists a unique $\bar v \in \mathbb{R}^n$ such that
$$\Pi T^\lambda \bar v = \bar v.$$
Moreover, the above equation shows that in fact $\bar v$ belongs to the range of $\Psi$. Thus there exists a $\theta^* \in \mathbb{R}^d$ such that $\bar v = \Psi \theta^*$, and $\theta^*$ is unique if $\Psi$ has full column rank.

The limit behaviour of the TD($\lambda$) algorithm is given by the next theorem, which is a key result from Ref. [19].

Theorem 3.11

Suppose that $\Psi$ has full column rank and that
$$\sum_{t=0}^{\infty} \alpha_t = \infty \quad \text{and} \quad \sum_{t=0}^{\infty} \alpha_t^2 < \infty.$$
Then the sequence $\{\theta_t\}$ converges almost surely to $\theta^* \in \mathbb{R}^d$, where $\theta^*$ is the unique solution of
$$\Pi T^\lambda (\Psi \theta^*) = \Psi \theta^*.$$
Moreover,
$$\|\Psi \theta^* - v\|_M \le \frac{1 - \gamma\lambda}{1 - \gamma}\, \|\Pi v - v\|_M.$$

Note that since $\Psi\theta$ belongs to the range of $\Psi$ for every $\theta \in \mathbb{R}^d$, the best that one can hope for is that
$$\|\Psi\theta^* - v\|_M = \|\Pi v - v\|_M.$$
The theorem states that this identity might not hold, and provides an upper bound on the distance between the limit $\Psi\theta^*$ and the true value vector $v$: it is bounded by a factor $(1-\gamma\lambda)/(1-\gamma)$ times this minimum.

Note that $(1-\gamma\lambda)/(1-\gamma) > 1$. So this is the extent to which the TD($\lambda$) iterations can miss the optimal approximation.

3.3.3. Q-learning

The Q-learning algorithm proposed in Ref. [16] has the characterization (27) of $Q^*$ as its starting point. The algorithm is based on the following premise: at time $t$, the current state $X_t$ can be observed; call it $x_i \in X$. Then the learner is free to choose the action $U_t$; call it $u_k \in U$. With this choice, the next state $X_{t+1}$ has probability distribution equal to the $i$th row of the state transition matrix $A^{u_k}$. Suppose the observed next state $X_{t+1}$ is $x_j \in X$. With these conventions, the Q-learning algorithm proceeds as follows.

  1. Choose an arbitrary initial guess $Q_0 : X \times U \to \mathbb{R}$ and an initial state $X_0 \in X$.

  2. At time $t$, with current state $X_t = x_i$, choose a current action $U_t = u_k \in U$, and let the Markov process run for one time step. Observe the resulting next state $X_{t+1} = x_j$. Then update the function $Q_t$ as follows (see the sketch after this list):
(42) $$Q_{t+1}(x_i, u_k) = Q_t(x_i, u_k) + \alpha_t \big[R(x_i, u_k) + \gamma V_t(x_j) - Q_t(x_i, u_k)\big], \qquad Q_{t+1}(x_s, w_l) = Q_t(x_s, w_l) \ \forall (x_s, w_l) \ne (x_i, u_k),$$
where
(43) $$V_t(x_j) = \max_{w_l \in U} Q_t(x_j, w_l),$$
and $\{\alpha_t\}$ is a deterministic sequence of step sizes.

  3. Repeat.
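Here is a minimal sketch of the above algorithm. The exploration rule choose_action is left abstract, since (as noted below) the algorithm itself gives no guidance on how to choose $U_t$; all names and interfaces are assumptions of mine.

```python
import numpy as np

def q_learning(env_step, reward, choose_action, n, m, gamma, alpha, T):
    """Tabular Q-learning following (42)-(43).

    env_step      : (i, k) -> j, samples the next state from row i of A^{u_k}
    reward        : (i, k) -> R(x_i, u_k)
    choose_action : (Q, i, t) -> k, the exploration rule (left open by the text)
    alpha         : callable t -> step size alpha_t
    """
    Q = np.zeros((n, m))
    i = 0                                            # initial state X_0
    for t in range(T):
        k = choose_action(Q, i, t)                   # choose U_t
        j = env_step(i, k)                           # observe X_{t+1}
        target = reward(i, k) + gamma * Q[j].max()   # R + gamma * V_t(x_j), cf. (43)
        Q[i, k] += alpha(t) * (target - Q[i, k])     # (42); all other entries unchanged
        i = j
    return Q
```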

It is evident that in the Q-learning algorithm, at any instant of time $t$, only one element (namely $Q(X_t, U_t)$) gets updated. In the original paper by Watkins and Dayan [16], the convergence of the algorithm was established using some rather ad hoc methods. Subsequently, a general class of algorithms known as "asynchronous stochastic approximation (ASA)," which includes Q-learning as a special case, was introduced in Refs. [18,20]. A sufficient condition for the convergence of the Q-learning algorithm, originally presented in Ref. [16], is rederived using these methods.

Theorem 3.12

The Q-learning algorithm converges to the optimal action-value function $Q^*$ provided the following conditions are satisfied:
(44) $$\sum_{t=0}^{\infty} \alpha_t\, I_{\{(X_t, U_t) = (x_i, u_k)\}} = \infty \quad \forall (x_i, u_k) \in X \times U,$$
(45) $$\sum_{t=0}^{\infty} \alpha_t^2\, I_{\{(X_t, U_t) = (x_i, u_k)\}} < \infty \quad \forall (x_i, u_k) \in X \times U.$$

The main shortcoming of Theorems 3.8 and 3.12 is that the sufficient conditions (38), (39), (44), and (45) are probabilistic in nature. Thus it is not clear how they are to be verified in a specific application. Note that in the Q-learning algorithm, there is no guidance on how to choose the next action $U_t$. Presumably $U_t$ is chosen so as to ensure that (44) and (45) are satisfied. In Section 5, we show how these theorems can be proven, and also how the troublesome probabilistic sufficient conditions can be replaced by purely algebraic conditions.

4. SA algorithms

4.1. SA and relevance to RL

The contents of the previous section make it clear that in MDP theory, a central role is played by the need to solve fixed-point problems. Determining the value of a Markov reward process requires the solution of (6). Determining the optimal value of an MDP requires finding the fixed point of the Bellman iteration. Finally, determining the optimal policy for an MDP requires finding the fixed point of the F-iteration. As pointed out in Section 3.3, when the dynamics of an MDP are completely known, these fixed-point problems can be solved by repeatedly applying the corresponding contraction mapping. However, when the dynamics of the MDP are not known, and one has access only to a sample path of the MDP, a different approach is required. In Section 3.3, we presented two such methods, namely the temporal difference algorithm for value determination and the Q-learning algorithm for determining the optimal action-value function. Theorems 3.8 and 3.12, respectively, give sufficient conditions for the convergence of these algorithms. The proofs of these theorems, as given in the original papers, tend to be "one-off," that is, tailored to the specific algorithm. It is now shown that a probabilistic method known as SA can be used to unify these methods in a common format, so that, instead of the convergence proofs being "one-off," the SA framework provides a unifying approach.

The applications of SA go beyond these two specific algorithms. There is another area called "deep RL" for problems in which the size of the state space is very large. Recall that the action-value function $Q : X \times U \to \mathbb{R}$ can be viewed either as an $nm$-dimensional vector or as an $n \times m$ matrix. In deep RL, one determines (either exactly or approximately) the action-value function $Q(x_i, u_k)$ for a small number of pairs $(x_i, u_k) \in X \times U$. Using these as a starting point, the overall function $Q$ defined for all pairs $(x_i, u_k) \in X \times U$ is obtained by training a deep neural network. Training a neural network (in this or any other application) requires the minimization of the average mean-squared error, denoted by $J(\theta)$, where $\theta$ denotes the vector of adjustable parameters. In general, the function $J(\cdot)$ is not convex; hence one can at best aspire to find a stationary point of $J(\cdot)$, i.e. a solution to the equation $\nabla J(\theta) = 0$. This problem is also amenable to the application of the SA approach.

Now we give a brief introduction to SA. Suppose $f : \mathbb{R}^d \to \mathbb{R}^d$ is some function, where $d$ can be any integer. The objective of SA is to find a solution to the equation $f(\theta) = 0$ when only noisy measurements of $f(\cdot)$ are available. The SA method was introduced in Ref. [21], where the objective was to find a solution to a scalar equation $f(\theta) = 0$, where $f : \mathbb{R} \to \mathbb{R}$. The extension to the case where $d > 1$ was first proposed in Ref. [22]. The problem of finding a fixed point of a map $g : \mathbb{R}^d \to \mathbb{R}^d$ can be formulated as the above problem with $f(\theta) := g(\theta) - \theta$. If it is desired to find a stationary point of a $C^1$ function $J : \mathbb{R}^d \to \mathbb{R}$, then we simply set $f(\theta) = -\nabla J(\theta)$. Thus the above problem formulation is quite versatile. More details are given at the start of Section 4.2.

SA is a family of iterative algorithms, in which one begins with an initial guess $\theta_0$ and derives the next guess $\theta_{t+1}$ from $\theta_t$. Several variants of SA are possible. In synchronous SA, every component of $\theta_t$ is changed to obtain $\theta_{t+1}$. This was the original form of SA. If, at any time $t$, only one component of $\theta_t$ is changed to obtain $\theta_{t+1}$, and the others remain unchanged, this is known as ASA. This phrase was apparently first introduced in Ref. [20]. A variant of the approach in Ref. [20] is presented in Ref. [23]. Specifically, in Ref. [23], a distinction is introduced between using a "local clock" and using a "global clock." It is also possible to study an intermediate situation where, at each time $t$, some but not necessarily all components of $\theta_t$ are updated. There does not appear to be a common name for this situation; the phrase batch asynchronous stochastic approximation (BASA) is introduced in Ref. [24]. More details about these variations are given below. There is a fourth variant, known as two-time-scale SA, introduced in Ref. [25]. In this set-up, one attempts to solve two coupled equations of the form
$$f(\theta, \phi) = 0, \quad g(\theta, \phi) = 0,$$
where $\theta \in \mathbb{R}^n$, $\phi \in \mathbb{R}^m$, and $f : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$, $g : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^m$. The idea is that one of the iterations (say $\theta_{t+1}$) is updated "more slowly" than the other (say $\phi_{t+1}$). Due to space limitations, two-time-scale SA is not discussed further in this paper. The interested reader is referred to Refs. [25–27] for the theory and to Refs. [28,29] for applications to a specific type of RL, known as actor-critic algorithms.

The relevance of SA to RL arises from the following factors:

  • Many (though not all) algorithms used in RL can be formulated as some type of SA algorithm.

  • Examples include temporal difference learning, temporal difference learning with function approximation, Q-learning, deep neural network learning, and actor-critic learning. The first three are discussed in detail in Section 5.

Thus, SA provides a unifying framework for several disparate-looking RL algorithms.

4.2. Problem formulation

There are several equivalent formulations of the basic SA problem.

  1. Finding a zero of a function: Suppose $f : \mathbb{R}^d \to \mathbb{R}^d$ is some function. Note that $f(\cdot)$ need not be available in closed form. The only thing needed is that, given any $\theta \in \mathbb{R}^d$, an "oracle" returns a noise-corrupted version of $f(\theta)$. The objective is to determine a solution of the equation $f(\theta) = 0$.

  2. Finding a fixed point of a mapping: Suppose $g : \mathbb{R}^d \to \mathbb{R}^d$. The objective is to find a fixed point of $g(\cdot)$, that is, a solution to $g(\theta) = \theta$. If we define $f(\theta) = g(\theta) - \theta$, this is the same problem as the above. One might ask: why not define $f(\theta) = \theta - g(\theta)$? As we shall see below, the convergence of the SA algorithm (in its various forms) is closely related to the global asymptotic stability of the ODE $\dot\theta = f(\theta)$. Also, as seen in the previous section, in many applications the map $g(\cdot)$ of which we wish to find a fixed point is a contraction, in which case there is a unique fixed point $\theta^*$ of $g(\cdot)$. In such a case, under relatively mild conditions, $\theta^*$ is a globally asymptotically stable equilibrium of the ODE $\dot\theta = g(\theta) - \theta$, but not if the sign is reversed.

  3. Finding a stationary point of a function: Suppose $J : \mathbb{R}^d \to \mathbb{R}$ is a $C^1$ function. The objective is to find a stationary point of $J(\cdot)$, that is, a $\theta$ such that $\nabla J(\theta) = 0$. If we define $f(\theta) = -\nabla J(\theta)$, then this is the same problem as above. Here again, if we wish the SA algorithm to converge to a global minimum of $J(\cdot)$, then the minus sign is essential. On the other hand, if we wish the SA algorithm to converge to a global maximum of $J(\cdot)$, then we remove the minus sign.

Suppose the problem is one of finding a zero of a given function $f(\cdot)$. The synchronous version of SA proceeds as follows. An initial guess $\theta_0 \in \mathbb{R}^d$ is chosen (usually in a deterministic manner, but it can also be randomly chosen). At time $t$, the available measurement is
(46) $$y_{t+1} = f(\theta_t) + \xi_{t+1},$$
where $\xi_{t+1}$ is the measurement noise. Based on this, the current guess is updated to
(47) $$\theta_{t+1} = \theta_t + \alpha_t y_{t+1} = \theta_t + \alpha_t [f(\theta_t) + \xi_{t+1}],$$
where $\{\alpha_t\}$ is a predefined sequence of "step sizes," with $\alpha_t \in (0,1)$ for all $t$. If the problem is that of finding a fixed point of $g(\cdot)$, the updating rule is
(48) $$\theta_{t+1} = \theta_t + \alpha_t [g(\theta_t) - \theta_t + \xi_{t+1}] = (1 - \alpha_t)\theta_t + \alpha_t [g(\theta_t) + \xi_{t+1}].$$
If the problem is to find a stationary point of $J(\cdot)$, the updating rule is
(49) $$\theta_{t+1} = \theta_t + \alpha_t y_{t+1} = \theta_t + \alpha_t [-\nabla J(\theta_t) + \xi_{t+1}].$$
These updating rules represent what might be called synchronous SA, because at each time $t$, every component of $\theta_t$ is updated. Other variants of SA are studied in subsequent sections.
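As a toy illustration (my own, not from the paper), the synchronous rule (47) can be written in a few lines; the example finds the fixed point of a scalar contraction via (48), with step sizes $\alpha_t = 1/(t+1)$ and an oracle that adds zero-mean Gaussian noise.

```python
import numpy as np

def robbins_monro(noisy_f, theta0, T, alpha=lambda t: 1.0 / (t + 1)):
    """Synchronous SA, Equation (47): theta_{t+1} = theta_t + alpha_t * y_{t+1},
    where y_{t+1} is a noisy measurement of f(theta_t)."""
    theta = np.asarray(theta0, dtype=float)
    for t in range(T):
        y = noisy_f(theta)              # oracle returns f(theta_t) + xi_{t+1}
        theta = theta + alpha(t) * y
    return theta

# Fixed-point problem as in (48): f(theta) = g(theta) - theta for a contraction g.
rng = np.random.default_rng(0)
g = lambda th: 0.5 * th + 1.0                                   # fixed point at theta* = 2
noisy_f = lambda th: g(th) - th + 0.1 * rng.standard_normal(th.shape)
print(robbins_monro(noisy_f, np.zeros(1), T=20000))             # approaches [2.]
```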

4.3. A new theorem for global asymptotic stability

In this section, we state a new theorem on the global asymptotic stability of nonlinear ODEs. This theorem is of interest in its own right, aside from its applications to the convergence of SA algorithms. The contents of this section and the next are taken from Ref. [30]. To state the result (Theorem 4.3), we introduce a few preliminary concepts from Lyapunov stability theory. The required background can be found in Refs. [31–33].

Definition 4.1

A function $\phi : \mathbb{R}_+ \to \mathbb{R}_+$ is said to belong to class $\mathcal{K}$, denoted by $\phi \in \mathcal{K}$, if $\phi(0) = 0$ and $\phi(\cdot)$ is strictly increasing. A function $\phi \in \mathcal{K}$ is said to belong to class $\mathcal{KR}$, denoted by $\phi \in \mathcal{KR}$, if in addition $\phi(r) \to \infty$ as $r \to \infty$. A function $\phi : \mathbb{R}_+ \to \mathbb{R}_+$ is said to belong to class $\mathcal{B}$, denoted by $\phi \in \mathcal{B}$, if $\phi(0) = 0$ and, in addition, for all $0 < \epsilon < M < \infty$, we have
(50) $$\inf_{\epsilon \le r \le M} \phi(r) > 0.$$

The concepts of functions of class $\mathcal{K}$ and class $\mathcal{KR}$ are standard. The concept of a function of class $\mathcal{B}$ is new. Note that if $\phi(\cdot)$ is continuous, then it belongs to class $\mathcal{B}$ if and only if $\phi(0) = 0$ and $\phi(r) > 0$ for all $r > 0$.

Example 4.2

Observe that every $\phi$ of class $\mathcal{K}$ also belongs to class $\mathcal{B}$. However, the converse is not true. Define
$$\phi(r) = \begin{cases} r, & r \in [0,1], \\ e^{-(r-1)}, & r > 1. \end{cases}$$
Then $\phi$ belongs to class $\mathcal{B}$. However, since $\phi(r) \to 0$ as $r \to \infty$, $\phi$ cannot be bounded below by any function of class $\mathcal{K}$.

Suppose we wish to find a solution of $f(\theta) = 0$. The convergence analysis of synchronous SA depends on the stability of an associated ODE $\dot\theta = f(\theta)$. We now state a new theorem on global asymptotic stability, and then use this to establish the convergence of the synchronous SA algorithm. In order to state this theorem, we first introduce some standing assumptions on $f(\cdot)$. Note that these assumptions are standard in the literature.

(F1)

The equation $f(\theta) = 0$ has a unique solution $\theta^*$.

(F2)

The function $f$ is globally Lipschitz-continuous with constant $L$:
(51) $$\|f(\theta) - f(\phi)\|_2 \le L \|\theta - \phi\|_2 \quad \forall \theta, \phi \in \mathbb{R}^d.$$

Theorem 4.3

Suppose Assumption (F1) holds, and that there exist a function $V : \mathbb{R}^d \to \mathbb{R}_+$ and functions $\eta, \psi \in \mathcal{KR}$, $\phi \in \mathcal{B}$ such that
(52) $$\eta(\|\theta - \theta^*\|_2) \le V(\theta) \le \psi(\|\theta - \theta^*\|_2) \quad \forall \theta \in \mathbb{R}^d,$$
(53) $$\dot V(\theta) \le -\phi(\|\theta - \theta^*\|_2) \quad \forall \theta \in \mathbb{R}^d.$$
Then $\theta^*$ is a globally asymptotically stable equilibrium of the ODE $\dot\theta = f(\theta)$.

This is Ref. [30, Theorem 4], and the proof can be found therein. Well-known classical theorems for global asymptotic stability, such as those found in Refs. [31–33], require the function $\phi(\cdot)$ to belong to class $\mathcal{K}$. Theorem 4.3 is an improvement, in that the function $\phi(\cdot)$ is required only to belong to the larger class $\mathcal{B}$.

4.4. A convergence theorem for synchronous SA

In this subsection, we present a convergence theorem for synchronous SA. Theorem 4.4 is slightly more general than a corresponding result in Ref. [30]. This theorem is obtained by combining some results from Refs. [24,30]. Other convergence theorems and examples can be found in Ref. [30].

In order to analyse the convergence of the SA algorithm, we need to make some assumptions about the nature of the measurement error sequence $\{\xi_t\}$. These assumptions are couched in terms of the conditional expectation of a random variable with respect to a $\sigma$-algebra. Readers who are unfamiliar with the concept are referred to Ref. [34] for the relevant background.

Let $\theta_0^t$ denote the tuple $(\theta_0, \theta_1, \ldots, \theta_t)$, and define $\xi_1^t$ analogously; note that there is no $\xi_0$. Let $\{\mathcal{F}_t\}_{t \ge 0}$ be any filtration (i.e. increasing sequence of $\sigma$-algebras) such that $\theta_0^t, \xi_1^t$ are measurable with respect to $\mathcal{F}_t$. For example, one can choose $\mathcal{F}_t$ to be the $\sigma$-algebra generated by the tuples $\theta_0^t, \xi_1^t$.

(N1)

There exists a sequence $\{b_t\}$ of nonnegative numbers such that
(54) $$\|E(\xi_{t+1} \mid \mathcal{F}_t)\|_2 \le b_t \ \text{a.s.}, \quad \forall t \ge 0.$$
Thus $b_t$ provides a bound on the Euclidean norm of the conditional expectation of the measurement error with respect to the $\sigma$-algebra $\mathcal{F}_t$.

(N2)

There exists a sequence $\{\sigma_t\}$ of nonnegative numbers such that
(55) $$E\big(\|\xi_{t+1} - E(\xi_{t+1} \mid \mathcal{F}_t)\|_2^2 \,\big|\, \mathcal{F}_t\big) \le \sigma_t^2 (1 + \|\theta_t\|_2^2) \ \text{a.s.}, \quad \forall t \ge 0.$$

Note that the quantity on the left side of (55) is the conditional variance of $\xi_{t+1}$ with respect to the $\sigma$-algebra $\mathcal{F}_t$.

Now we can state a theorem about the convergence of synchronous SA.

Theorem 4.4

Suppose $f(\theta^*) = 0$, and that Assumptions (F1)–(F2) and (N1)–(N2) hold. Suppose in addition that there exists a $C^2$ Lyapunov function $V : \mathbb{R}^d \to \mathbb{R}_+$ that satisfies the following conditions:

  • There exist constants $a, b > 0$ such that
(56) $$a\|\theta - \theta^*\|_2^2 \le V(\theta) \le b\|\theta - \theta^*\|_2^2 \quad \forall \theta \in \mathbb{R}^d.$$

  • There is a finite constant $M$ such that
(57) $$\|\nabla^2 V(\theta)\|_S \le 2M \quad \forall \theta \in \mathbb{R}^d,$$
where $\|\cdot\|_S$ denotes the spectral norm.

With this hypothesis, we can state the following conclusions:
(1)

If $\dot V(\theta) \le 0$ for all $\theta \in \mathbb{R}^d$, and if
(58) $$\sum_{t=0}^{\infty} \alpha_t^2 < \infty, \quad \sum_{t=0}^{\infty} \alpha_t b_t < \infty, \quad \sum_{t=0}^{\infty} \alpha_t^2 \sigma_t^2 < \infty,$$
then the iterations $\{\theta_t\}$ are bounded almost surely.

(2)

Suppose further that there exists a function $\phi \in \mathcal{B}$ such that
(59) $$\dot V(\theta) \le -\phi(\|\theta - \theta^*\|_2) \quad \forall \theta \in \mathbb{R}^d,$$
and that, in addition to (58), we also have
(60) $$\sum_{t=0}^{\infty} \alpha_t = \infty.$$
Then $\theta_t \to \theta^*$ almost surely as $t \to \infty$.

Observe the nice "division of labour" between the two conditions: Equation (58) guarantees the almost sure boundedness of the iterations, while the addition of (60) leads to the almost sure convergence of the iterations to the desired limit, namely the solution of $f(\theta) = 0$. This division of labour is first found in Ref. [35]. Theorem 4.4 is a substantial improvement on Ref. [36], which contained the previously best results. The interested reader is referred to Ref. [30] for further details.

Theorem 4.4 is a slight generalization of Ref. [30, Theorem 5]. In that theorem, it is assumed that $b_t = 0$ for all $t$, and that the constants $\sigma_t$ are uniformly bounded by some constant $\sigma$. In this case, (58) and (60) become
(61) $$\sum_{t=0}^{\infty} \alpha_t^2 < \infty, \quad \sum_{t=0}^{\infty} \alpha_t = \infty.$$
These two conditions are usually referred to as the Robbins–Monro conditions.

4.5. Convergence of BASA

Equations (47) through (49) represent what might be called synchronous SA, because at each time $t$, every component of $\theta_t$ is updated. Variants of synchronous SA include ASA, where at each time $t$ exactly one component of $\theta_t$ is updated, and BASA, where at each time $t$ some but not necessarily all components of $\theta_t$ are updated. We present the results for BASA, because ASA is a special case of BASA. Moreover, we focus on (48), where the objective is to find a fixed point of a contractive map $g$. The modifications required for (47) and (49) are straightforward.

The relevant reference for these results is Ref. [24]. As a slight modification of (46), it is assumed that at each time $t+1$, there is available a noisy measurement
(62) $$y_{t+1} = g(\theta_t) - \theta_t + \xi_{t+1}.$$
We assume that there is a given deterministic sequence of "step sizes" $\{\beta_t\}$. In BASA, not every component of $\theta_t$ is updated at time $t$. To determine which components are to be updated, we define $d$ different binary "update processes" $\{\kappa_{t,i}\}$, $i \in [d]$. No assumptions are made regarding their independence. At time $t$, define
(63) $$S(t) := \{i \in [d] : \kappa_{t,i} = 1\}.$$
This means that
(64) $$\theta_{t+1, i} = \theta_{t, i} \quad \forall i \notin S(t).$$
In order to define $\theta_{t+1,i}$ when $i \in S(t)$, we make a distinction between two different approaches: global clocks and local clocks. If a global clock is used, then
(65) $$\alpha_{t,i} = \beta_t \ \forall i \in S(t), \qquad \alpha_{t,i} = 0 \ \forall i \notin S(t).$$
If a local clock is used, then we first define the local counter
(66) $$\nu_{t,i} = \sum_{\tau=0}^{t} \kappa_{\tau,i}, \quad i \in [d],$$
which is the total number of occasions when $i \in S(\tau)$, $0 \le \tau \le t$. Equivalently, $\nu_{t,i}$ is the total number of times up to and including time $t$ when $\theta_{\tau,i}$ is updated. With this convention, we define
(67) $$\alpha_{t,i} = \beta_{\nu_{t,i}} \ \forall i \in S(t), \qquad \alpha_{t,i} = 0 \ \forall i \notin S(t).$$
The distinction between global clocks and local clocks was apparently introduced in Ref. [23]. Traditional RL algorithms such as TD($\lambda$) and Q-learning, discussed in detail in Section 3.3 and again in Sections 5.2 and 5.3, use a global clock. That is not surprising, because Ref. [23] came after Refs. [16,17]. It is shown in Ref. [24] that the use of local clocks actually simplifies the analysis of these algorithms.

Now we present the BASA updating rules. Let us define the “step size vector” $\alpha_t \in \mathbb{R}_+^d$ via (65) or (67) as appropriate. Then the update rule is (68) $\theta_{t+1} = \theta_t + \alpha_t \circ y_{t+1}$, where $y_{t+1}$ is defined in (62). Here, the symbol $\circ$ denotes the Hadamard product of two vectors of equal dimensions. Thus if a, b have the same dimensions, then $c = a \circ b$ is defined by $c_i = a_i b_i$ for all i.
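As a sanity check on the recursion (68), the following minimal sketch runs BASA with a local clock on a hand-made affine contraction $g(\theta) = A\theta + b$; the particular g, the noise model, and the step sizes are illustrative assumptions, not taken from Ref. [Citation24].

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
A = rng.standard_normal((d, d))
A *= 0.5 / np.abs(A).sum(axis=1).max()          # scale so the infinity-norm of A is 0.5 < 1
b = rng.standard_normal(d)
g = lambda th: A @ th + b                        # a contraction in the infinity-norm
theta_star = np.linalg.solve(np.eye(d) - A, b)   # its unique fixed point

theta = np.zeros(d)
nu = np.zeros(d, dtype=int)                      # local counters nu_{t,i}
for t in range(200_000):
    kappa = rng.integers(0, 2, size=d)           # which components lie in S(t)
    if not kappa.any():
        continue
    xi = 0.1 * rng.standard_normal(d)            # zero-mean measurement noise
    y = g(theta) - theta + xi                    # noisy measurement, Eq. (62)
    nu += kappa                                  # advance the local clocks
    beta = 1.0 / np.maximum(nu, 1)               # beta_k = 1/k evaluated at the local counters
    alpha = kappa * beta                         # step-size vector, Eq. (67)
    theta = theta + alpha * y                    # Hadamard-product update, Eq. (68)

print(np.max(np.abs(theta - theta_star)))        # should be small: the iterate approaches the fixed point
```

Here the choice $\beta_k = 1/k$ satisfies (S1) and (S2) along each component, since the noise has zero conditional mean and bounded conditional variance.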

Recall that we are given a function $g: \mathbb{R}^d \to \mathbb{R}^d$, and the objective is to find a solution to the fixed point equation $g(\theta) = \theta$. Towards this end, we begin by stating the assumptions about the noise sequence.

(N1′)

There exists a sequence of constants $\{b_t\}$ such that (69) $E(\|\xi_{t+1}\|_2 \mid \mathcal{F}_t) \leq b_t (1 + \|\theta_0^t\|) \quad \forall t \geq 0$.

(N2′)

There exists a sequence of constants $\{\sigma_t\}$ such that (70) $E(\|\xi_{t+1} - E(\xi_{t+1} \mid \mathcal{F}_t)\|_2^2 \mid \mathcal{F}_t) \leq \sigma_t^2 (1 + \|\theta_0^t\|^2) \quad \forall t \geq 0$.

Comparing (54) and (55) with (69) and (70), respectively, we see that the term $\|\theta_t\|_2^2$ is replaced by the corresponding norm of the history $\theta_0^t$. So the constants $b_t$ and $\sigma_t$ can be different in the two cases. But because the two formulations are quite similar, we denote the first set of conditions as (N1) and (N2), and the second set of conditions as (N1′) and (N2′).

Next, we state the assumptions on the step size sequence, which allow us to state the theorems in a compact manner. Note that if a local clock is used, then $\alpha_{t,i}$ can be random even if $\beta_t$ is deterministic.

(S1)

The random step size sequences $\{\alpha_{t,i}\}$ and the sequences $\{b_t\}$, $\{\sigma_t^2\}$ satisfy (71) $\sum_{t=0}^{\infty} \alpha_{t,i}^2 < \infty, \quad \sum_{t=0}^{\infty} \sigma_t^2 \alpha_{t,i}^2 < \infty, \quad \sum_{t=0}^{\infty} b_t \alpha_{t,i} < \infty$, a.s., $\forall i \in [d]$.

(S2)

The random step size sequence $\{\alpha_{t,i}\}$ satisfies (72) $\sum_{t=0}^{\infty} \alpha_{t,i} = \infty$, a.s., $\forall i \in [d]$.

Finally, we state an assumption about the map g.

(G)

g is a contraction with respect to the $\ell_\infty$-norm, with some contraction constant $\gamma < 1$.

Theorem 4.5

Suppose that Assumptions (N1′) and (N2′) about the noise sequence, (S1) about the step size sequence, and (G) about the function g hold. Then $\sup_t \|\theta_t\| < \infty$ almost surely.

Theorem 4.6

Let $\theta^*$ denote the unique fixed point of g. Suppose that Assumptions (N1′) and (N2′) about the noise sequence, (S1) and (S2) about the step size sequence, and (G) about the function g hold. Then $\theta_t$ converges almost surely to $\theta^*$ as $t \to \infty$.

The proofs of these theorems can be found in Ref. [Citation24].

5. Applications to RL

In this section, we apply the contents of the previous section to derive sufficient conditions for the convergence of two distinct RL algorithms, namely temporal difference learning (without function approximation) and Q-learning. Previously known results are stated in Section 3.3. So why re-analyse those algorithms from the standpoint of SA? There are two reasons for doing so. First, the historical TD(λ) and Q-learning algorithms are stated using a “global clock” as defined in Section 4.5; the concept of a “local clock” was introduced subsequently in Ref. [Citation23]. In Ref. [Citation24], the authors build upon this distinction to achieve two objectives. First, when a local clock is used, fewer assumptions are required. Second, by invoking a result on the sample paths of an irreducible Markov process (proved in Ref. [Citation24] and restated as Theorem 5.1 below), probabilistic conditions such as (38)–(39) and (44)–(45) are replaced by purely algebraic conditions.

5.1. A useful theorem about irreducible Markov processes

Theorem 5.1

Suppose $\{N(t)\}$ is a Markov process on [d] with a state transition matrix A that is irreducible. Suppose $\{\beta_t\}_{t \geq 0}$ is a sequence of real numbers in (0,1) such that $\beta_{t+1} \leq \beta_t$ for all t, and (73) $\sum_{t=0}^{\infty} \beta_t = \infty$. Then (74) $\sum_{t=0}^{\infty} \beta_t I_{\{N(t)=i\}}(\omega) = \sum_{t=0}^{\infty} \beta_t f_i(N(t)(\omega)) = \infty \quad \forall i \in [d], \ \forall \omega \in \Omega_0$, where I denotes the indicator function and $\Omega_0$ is a set of sample paths of full probability.
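The content of Theorem 5.1 can be visualized with a short simulation; the three-state chain and the step sizes below are illustrative choices, not taken from the text. Along a single sample path of an irreducible chain, the partial sums of $\beta_t$ collected at the visits to each state all grow without bound.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3
# An (illustrative) irreducible state transition matrix on [d].
A = np.array([[0.1, 0.6, 0.3],
              [0.5, 0.2, 0.3],
              [0.3, 0.3, 0.4]])

T = 100_000
beta = 1.0 / np.arange(1, T + 1)     # nonincreasing, with sum beta_t = infinity
partial = np.zeros(d)                # partial sums of beta_t * I{N(t) = i}, one per state
state = 0
for t in range(T):
    partial[state] += beta[t]
    state = rng.choice(d, p=A[state])

print(partial)  # each entry keeps growing with T (roughly like pi_i * log T for this beta)
```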

5.2. TD-learning without function approximation

Recall the TD(λ) algorithm without function approximation, presented in Section 3.3. One observes a time series $\{(X_t, R(X_t))\}$, where $\{X_t\}$ is a Markov process over $\mathcal{X} = \{x_1, \ldots, x_n\}$ with a (possibly unknown) state transition matrix A, and $R: \mathcal{X} \to \mathbb{R}$ is a known reward function. With the sample path $\{X_t\}$ of the Markov process, one can associate a corresponding “index process” $\{N_t\}$ taking values in [n] as follows: $N_t = i$ if $X_t = x_i \in \mathcal{X}$. It is obvious that the index process has the same transition matrix A as the process $\{X_t\}$. The idea is to start with an initial estimate $\hat{v}_0$ and update it at each time t based on the sample path $\{(X_t, R_t)\}$.

We now state the TD(λ) updates. At time t, let $\hat{v}_t \in \mathbb{R}^n$ denote the current estimate of v. Let $\{N_t\}$ be the index process defined above. Define the “temporal difference” (75) $\delta_{t+1} := R_{N_t} + \gamma \hat{V}_{t,N_{t+1}} - \hat{V}_{t,N_t}, \quad t \geq 0$, where $\hat{V}_{t,N_t}$ denotes the $N_t$th component of the vector $\hat{v}_t$. Equivalently, if the state at time t is $x_i \in \mathcal{X}$ and the state at the next time t + 1 is $x_j$, then (76) $\delta_{t+1} = R_i + \gamma \hat{V}_{t,j} - \hat{V}_{t,i}$. Next, choose a number $\lambda \in [0,1)$. Define the “eligibility vector” (77) $z_t = \sum_{\tau=0}^{t} (\gamma\lambda)^{\tau} I_{\{N_{t-\tau} = N_t\}} e_{N_{t-\tau}}$, where $e_{N_s}$ is a unit vector with a 1 in location $N_s$ and zeros elsewhere. Finally, update the estimate $\hat{v}_t$ as (78) $\hat{v}_{t+1} = \hat{v}_t + \delta_{t+1} \alpha_t z_t$, where $\alpha_t$ is the step size chosen in accordance with either a global or a local clock. The distinction between the two is described next.

To complete the problem specification, we need to specify how the step size $\alpha_t$ is chosen in (78). The two possibilities studied here are global clocks and local clocks. If a global clock is used, then $\alpha_t = \beta_t$, whereas if a local clock is used, then $\alpha_t = \beta_{\nu_{t,i}}$, where $\nu_{t,i} = \sum_{\tau=0}^{t} I_{\{z_{\tau,i} \neq 0\}}$. Note that in the traditional implementation of the TD(λ) algorithm suggested in Refs. [Citation17–19], a global clock is used. Moreover, the algorithm is shown to converge provided (79) $\sum_{t=0}^{\infty} \alpha_t^2 < \infty, \quad \sum_{t=0}^{\infty} \alpha_t = \infty$, a.s. As we shall see below, the theorem statements when local clocks are used involve slightly fewer assumptions than when global clocks are used. Moreover, neither involves probabilistic conditions such as (38) and (39), in contrast to Theorem 3.8.
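To fix ideas, the following minimal sketch implements the updates (75)–(78) with a local clock on a small synthetic Markov reward process; the chain, the rewards, the discount factor, and the step sizes $\beta_k = 1/k$ are illustrative assumptions, and the eligibility vector is evaluated directly from the definition (77), truncating its geometrically decaying tail.

```python
import numpy as np

rng = np.random.default_rng(3)
n, gamma, lam = 4, 0.9, 0.5

# Illustrative Markov reward process: irreducible transition matrix A and reward vector R.
A = np.array([[0.20, 0.50, 0.20, 0.10],
              [0.30, 0.10, 0.40, 0.20],
              [0.25, 0.25, 0.25, 0.25],
              [0.10, 0.20, 0.30, 0.40]])
R = np.array([1.0, 0.0, -1.0, 2.0])
v = np.linalg.solve(np.eye(n) - gamma * A, R)   # the true value vector v

v_hat = np.zeros(n)
visits = np.zeros(n, dtype=int)                  # local counters nu_{t,i}
path = []                                        # sample path of the index process {N_t}
state = 0
for t in range(100_000):
    path.append(state)
    nxt = rng.choice(n, p=A[state])
    delta = R[state] + gamma * v_hat[nxt] - v_hat[state]        # temporal difference, Eq. (75)
    z = np.zeros(n)                                             # eligibility vector, Eq. (77)
    for tau in range(min(t, 60) + 1):                           # (gamma*lam)^tau is negligible beyond 60
        if path[t - tau] == state:
            z[state] += (gamma * lam) ** tau
    visits += (z != 0).astype(int)                              # local clock: count times z_{t,i} != 0
    beta = 1.0 / np.maximum(visits, 1)                          # beta_k = 1/k at the local counters
    v_hat = v_hat + delta * beta * z                            # Eq. (78) with a local-clock step size
    state = nxt

print(np.max(np.abs(v_hat - v)))   # shrinks towards 0 as the number of iterations grows
```

If a global clock were used instead, the per-component counters above would simply be replaced by the global time index, as in Theorem 5.3.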

Next we present two theorems regarding the convergence of the TD(λ) algorithm. As the hypotheses are slightly different, they are presented separately. But the proofs are quite similar and can be found in Ref. [Citation24].

Theorem 5.2

Consider the TD(λ) algorithm using a local clock to determine the step size. Suppose that the state transition matrix A is irreducible, and that the deterministic step size sequence $\{\beta_t\}$ satisfies the Robbins–Monro conditions $\sum_{t=0}^{\infty} \beta_t = \infty$, $\sum_{t=0}^{\infty} \beta_t^2 < \infty$. Then $\hat{v}_t \to v$ almost surely as $t \to \infty$.

Theorem 5.3

Consider the TD(λ) algorithm using a global clock to determine the step size. Suppose that the state transition matrix A is irreducible and that the deterministic step size sequence is nonincreasing (i.e. $\beta_{t+1} \leq \beta_t$ for all t) and satisfies the Robbins–Monro conditions as described above. Then $\hat{v}_t \to v$ almost surely as $t \to \infty$.

5.3. Q-learning

The Q-learning algorithm proposed in Ref. [Citation16] is now recalled for the convenience of the reader.

  1. Choose an arbitrary initial guess $Q_0: \mathcal{X} \times \mathcal{U} \to \mathbb{R}$ and an initial state $X_0 \in \mathcal{X}$.

  2. At time t, with current state $X_t = x_i$, choose a current action $U_t = u_k \in \mathcal{U}$, and let the Markov process run for one time step. Observe the resulting next state $X_{t+1} = x_j$. Then update the function $Q_t$ as follows: (80) $Q_{t+1}(x_i, u_k) = Q_t(x_i, u_k) + \beta_t [R(x_i, u_k) + \gamma V_t(x_j) - Q_t(x_i, u_k)]$, and $Q_{t+1}(x_s, w_l) = Q_t(x_s, w_l)$ for all $(x_s, w_l) \neq (x_i, u_k)$, where (81) $V_t(x_j) = \max_{w_l \in \mathcal{U}} Q_t(x_j, w_l)$, and $\{\beta_t\}$ is a deterministic sequence of step sizes.

  3. Repeat.
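For concreteness, a minimal tabular implementation of the update (80)–(81) might look as follows; the randomly generated MDP, the uniformly random choice of $U_t$, and the step sizes $\beta_t = 1/(t+1)$ are illustrative assumptions made only so that the sketch is self-contained.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, gamma = 5, 2, 0.9

# Illustrative MDP: one transition matrix per action (rows drawn from a Dirichlet
# distribution, hence strictly positive and irreducible), and a reward table R(x_i, u_k).
A_u = rng.dirichlet(np.ones(n), size=(m, n))      # A_u[k, i, :] is row i of the matrix A_{u_k}
Rmat = rng.standard_normal((n, m))

# Reference solution Q* computed by value iteration on Q (for comparison only).
Q_star = np.zeros((n, m))
for _ in range(2000):
    Q_star = Rmat + gamma * np.einsum('kij,j->ik', A_u, Q_star.max(axis=1))

Q = np.zeros((n, m))
state = 0
for t in range(200_000):
    beta = 1.0 / (t + 1)                          # Robbins-Monro step sizes (global clock)
    k = rng.integers(m)                           # action choice: uniform at random (illustrative)
    nxt = rng.choice(n, p=A_u[k, state])          # one step of the controlled Markov process
    V_next = Q[nxt].max()                         # Eq. (81)
    Q[state, k] += beta * (Rmat[state, k] + gamma * V_next - Q[state, k])   # Eq. (80)
    state = nxt

print(np.max(np.abs(Q - Q_star)))   # decreases (slowly, for beta_t = 1/(t+1)) as t grows
```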

In earlier work such as Refs. [Citation18,Citation20], it is shown that the Q-learning algorithm converges to the optimal action-value function $Q^*$ provided (82) $\sum_{t=0}^{\infty} \beta_t I_{\{(X_t, U_t) = (x_i, u_k)\}} = \infty \quad \forall (x_i, u_k) \in \mathcal{X} \times \mathcal{U}$, and (83) $\sum_{t=0}^{\infty} \beta_t^2 I_{\{(X_t, U_t) = (x_i, u_k)\}} < \infty \quad \forall (x_i, u_k) \in \mathcal{X} \times \mathcal{U}$. This result was stated earlier as Theorem 3.12. Similar hypotheses are present in all existing results on ASA. Note that in the Q-learning algorithm, there is no guidance on how to choose the next action $U_t$; presumably $U_t$ is chosen so as to ensure that (82) and (83) are satisfied. However, we now demonstrate a way to avoid such conditions using Theorem 5.1. We also introduce batch updating and show that it is possible to use a local clock instead of a global clock.

The batch Q-learning algorithm introduced here is as follows:

  1. Choose an arbitrary initial guess $Q_0: \mathcal{X} \times \mathcal{U} \to \mathbb{R}$, and m initial states $X_0^k \in \mathcal{X}$, $k \in [m]$, in some fashion (deterministic or random). Note that the m initial states need not be distinct.

  2. At time t, for each action index $k \in [m]$, with current state $X_t^k = x_{i_k}$, choose the current action as $U_t = u_k \in \mathcal{U}$, and let the Markov process run for one time step. Observe the resulting next state $X_{t+1}^k = x_{j_k}$. Then update the function $Q_t$ as follows, once for each $k \in [m]$: (84) $Q_{t+1}(x_s, u_k) = Q_t(x_s, u_k) + \alpha_{t,i_k,k} [R(x_{i_k}, u_k) + \gamma V_t(x_{j_k}) - Q_t(x_{i_k}, u_k)]$ if $x_s = x_{i_k}$, and $Q_{t+1}(x_s, u_k) = Q_t(x_s, u_k)$ if $x_s \neq x_{i_k}$, where (85) $V_t(x_{j_k}) = \max_{w_l \in \mathcal{U}} Q_t(x_{j_k}, w_l)$. Here $\alpha_{t,i,k}$ equals $\beta_t$ for all i, k if a global clock is used, and equals (86) $\alpha_{t,i,k} = \beta_{\nu_{t,i,k}}$, where $\nu_{t,i,k} = \sum_{\tau=0}^{t} I_{\{X_\tau^k = x_i\}}$, if a local clock is used.

  3. Repeat.

Remark

Note that m different simulations are being run in parallel, and that in the kth simulation, the next action $U_t$ is always chosen as $u_k$. Hence, at each instant of time t, exactly m components of $Q(\cdot,\cdot)$ (viewed as an $n \times m$ matrix) are updated, namely the $(X_t^k, u_k)$ component, for each $k \in [m]$. In typical MDPs, the size of the action space m is much smaller than the size of the state space n. For example, in the Blackjack problem discussed in Ref. [Citation4, Chapter 4], $n \approx 2^{100}$ while m = 2! Therefore the proposed batch Q-learning algorithm is quite efficient in practice.
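Using the same kind of illustrative random MDP as in the previous sketch, a minimal implementation of the batch updates (84)–(86) with a local clock might look as follows; each of the m parallel chains applies its own fixed action, in line with the remark above.

```python
import numpy as np

rng = np.random.default_rng(5)
n, m, gamma = 5, 2, 0.9
A_u = rng.dirichlet(np.ones(n), size=(m, n))     # irreducible matrix A_{u_k} for each action u_k
Rmat = rng.standard_normal((n, m))               # reward table R(x_i, u_k)

Q = np.zeros((n, m))
states = np.zeros(m, dtype=int)                  # X_t^k: current state of the k-th chain
nu = np.zeros((n, m), dtype=int)                 # local counters nu_{t,i,k}
for t in range(200_000):
    for k in range(m):                           # in the k-th chain the action is always u_k
        i = states[k]
        nxt = rng.choice(n, p=A_u[k, i])
        nu[i, k] += 1                            # advance the local clock of the pair (x_i, u_k)
        alpha = 1.0 / nu[i, k]                   # beta_{nu_{t,i,k}} with beta_j = 1/j, Eq. (86)
        V_next = Q[nxt].max()                    # Eq. (85)
        Q[i, k] += alpha * (Rmat[i, k] + gamma * V_next - Q[i, k])   # Eq. (84)
        states[k] = nxt

# For comparison, Q* by value iteration (same construction as in the previous sketch).
Q_star = np.zeros((n, m))
for _ in range(2000):
    Q_star = Rmat + gamma * np.einsum('kij,j->ik', A_u, Q_star.max(axis=1))
print(np.max(np.abs(Q - Q_star)))   # small, and shrinks as the number of iterations grows
```

At every time t exactly m entries of Q are touched, one per chain, and under the hypotheses of Theorem 5.4 the iterates converge almost surely to $Q^*$.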

Now, by fitting this algorithm into the framework of Theorem 5.1, we can prove the following general result. The proof can be found in Ref. [Citation24].

Theorem 5.4

Suppose that each matrix $A_{u_k}$ is irreducible, and that the step size sequence $\{\beta_t\}$ satisfies the Robbins–Monro conditions (61) with $\alpha_t$ replaced by $\beta_t$. With this assumption, we have the following:

(1)

If a local clock is used as in (84), then $Q_t$ converges almost surely to $Q^*$.

(2)

If a global clock is used (i.e. $\alpha_{t,i,k} = \beta_t$ for all t, i, k), and $\{\beta_t\}$ is nonincreasing, then $Q_t$ converges almost surely to $Q^*$.

Remark

Note that, in the statement of the theorem, it is not assumed that every policy π leads to an irreducible Markov process – only that every action leads to an irreducible Markov process. In other words, the assumption is that the m different matrices $A_{u_k}$, $k \in [m]$, correspond to irreducible Markov processes. This is a substantial improvement. It is shown in Ref. [Citation37] that the following problem is NP (Nondeterministic Polynomial)-hard: given an MDP, determine whether every policy π results in a Markov process that is a unichain, that is, consists of a single set of recurrent states with the associated state transition matrix being irreducible, plus possibly some transient states. Our problem is slightly different, because we do not permit any transient states. Nevertheless, this problem is also likely to be very difficult. By not requiring any condition of this sort, and also by dispensing with conditions analogous to (82) and (83), the above theorem statement is more useful.

6. Conclusions and problems for future research

In this brief survey, we have attempted to sketch some of the highlights of RL. Our viewpoint, which is quite mainstream, is to view RL as solving Markov decision problems (MDPs) when the underlying dynamics are unknown. We have used the paradigm of SA as a unifying approach. We have presented convergence theorems for the standard approach, which might be thought of as “synchronous” SA, as well as for variants such as ASA and BASA. Many of these results are due to the author and his collaborators.

In this survey, due to length limitations, we have not discussed actor-critic algorithms. These can be viewed as applications of the policy gradient theorem [Citation38,Citation39] coupled with SA applied to two-time-scale (i.e. singularly perturbed) systems [Citation25,Citation27]. Some other relevant references are [Citation28,Citation29,Citation40]. Also, the rapidly emerging field of finite-time SA has not been discussed. Finite-Time Stochastic Approximation (FTSA) can lead to estimates of the rate of convergence of various RL algorithms, whereas conventional SA leads to only asymptotic results. Some recent relevant papers include Refs. [Citation41,Citation42].

Disclosure statement

No potential conflict of interest was reported by the author.

Additional information

Funding

This work was supported by Science and Engineering Research Board, Government of India.

Notes on contributors

Mathukumalli Vidyasagar

Mathukumalli Vidyasagar was born in Guntur, India on September 29, 1947. He received the B.S., M.S. and Ph.D. degrees in electrical engineering from the University of Wisconsin in Madison, in 1965, 1967 and 1969 respectively. Between 1969 and 1989, he was a Professor of Electrical Engineering at Marquette University, Concordia University, and the University of Waterloo. In 1989 he returned to India as the Director of the newly created Centre for Artificial Intelligence and Robotics (CAIR) in Bangalore under the Ministry of Defence, Government of India. In 2000 he moved to the Indian private sector as an Executive Vice President of India's largest software company, Tata Consultancy Services. In 2009 he retired from TCS and joined the Erik Jonsson School of Engineering & Computer Science at the University of Texas at Dallas, as a Cecil & Ida Green Chair in Systems Biology Science. In January 2015 he received the Jawaharlal Nehru Science Fellowship and divided his time between UT Dallas and the Indian Institute of Technology Hyderabad. In January 2018 he stopped teaching at UT Dallas and now resides full-time in Hyderabad. Since March 2020, he is a SERB National Science Chair. His current research is in the area of Reinforcement Learning, with emphasis on using stochastic approximation theory. More broadly, he is interested in machine learning, systems and control theory, and their applications. Vidyasagar has received a number of awards in recognition of his research contributions, including Fellowship in The Royal Society, the world's oldest scientific academy in continuous existence, the IEEE Control Systems (Technical Field) Award, the Rufus Oldenburger Medal of ASME, the John R. Ragazzini Education Award from AACC, and others. He is the author of thirteen books and more than 160 papers in peer-reviewed journals.

Notes

1 For clarity, we have changed the notation so that the value vector is now denoted by v instead of V as in (6).

References

  • Bertsekas DP, Tsitsiklis JN. Neuro-dynamic programming. Nashua: Athena Scientific; 1996.
  • Puterman ML. Markov decision processes: discrete stochastic dynamic programming. Hoboken: John Wiley; 2005.
  • Szepesvári C. Algorithms for reinforcement learning. San Rafael: Morgan and Claypool; 2010.
  • Sutton RS, Barto AG. Reinforcement learning: an introduction. 2nd ed. Cambridge: MIT Press; 2018.
  • Ljung L. Strong convergence of a stochastic approximation algorithm. Ann Stat. 1978;6:680–696.
  • Kushner HJ, Clark DS. Stochastic approximation methods for constrained and unconstrained systems. Springer-Verlag; 1978 (Applied mathematical sciences).
  • Benveniste A, Metivier M, Priouret P. Adaptive algorithms and stochastic approximation. Springer-Verlag; 1990.
  • Kushner HJ, Yin GG. Stochastic approximation and recursive algorithms and applications. Springer-Verlag; 1997.
  • Benaim M. Dynamics of stochastic approximation algorithms. Springer Verlag; 1999.
  • Borkar VS. Stochastic approximation: a dynamical systems viewpoint. Cambridge University Press; 2008.
  • Borkar VS. Stochastic approximation: a dynamical systems viewpoint. 2nd ed. Hindustan Book Agency; 2022.
  • Spong MW, Hutchinson SR, Vidyasagar M. Robot modeling and control. 2nd ed. John Wiley; 2020.
  • Shannon CE. Programming a computer for playing chess. Philos Mag Ser7. 1950;41(314):1–18.
  • Vidyasagar M. Hidden Markov processes: theory and applications to biology. Princeton University Press; 2014.
  • Arapostathis A, Borkar VS, Fernández-Gaucherand E, et al. Discrete-time controlled Markov processes with average cost criterion: a survey. SIAM J Control Optim. 1993;31(2):282–344.
  • Watkins CJCH, Dayan P. Q-learning. Mach Learn. 1992;8(3–4):279–292.
  • Sutton RS. Learning to predict by the method of temporal differences. Mach Learn. 1988;3(1):9–44.
  • Jaakkola T, Jordan MI, Singh SP. Convergence of stochastic iterative dynamic programming algorithms. Neural Comput. 1994;6(6):1185–1201.
  • Tsitsiklis JN, Roy BV. An analysis of temporal-difference learning with function approximation. IEEE Trans Autom Contr. 1997;42(5):674–690.
  • Tsitsiklis JN. Asynchronous stochastic approximation and q-learning. Mach Learn. 1994;16:185–202.
  • Robbins H, Monro S. A stochastic approximation method. Ann Math Stat. 1951;22(3):400–407.
  • Blum JR. Multivariable stochastic approximation methods. Ann Math Stat. 1954;25(4):737–744.
  • Borkar VS. Asynchronous stochastic approximations. SIAM J Control Optim. 1998;36(3):840–851.
  • Karandikar RL, Vidyasagar M. Convergence of batch asynchronous stochastic approximation with applications to reinforcement learning [Arxiv:2109.03445v2]; 2022.
  • Borkar VS. Stochastic approximation in two time scales. Syst Control Lett. 1997;29(5):291–294.
  • Tadić VB. Almost sure convergence of two time-scale stochastic approximation algorithms. In: Proceedings of the American Control Conference, Boston, Vol. 4; 2004. p. 3802–3807.
  • Lakshminarayanan C, Bhatnagar S. A stability criterion for two timescale stochastic approximation schemes. Automatica. 2017;79:108–114.
  • Konda VR, Borkar VS. Actor-critic learning algorithms for Markov decision processes. SIAM J Control Optim. 1999;38(1):94–123.
  • Konda VR, Tsitsiklis JN. Actor-critic algorithms. In: Neural information processing systems (NIPS1999); 1999. p. 1008–1014.
  • Vidyasagar M. Convergence of stochastic approximation via martingale and converse Lyapunov methods [Arxiv:2205.01303v1]; 2022.
  • Vidyasagar M. Nonlinear systems analysis (SIAM classics series). Soc Ind Appl Math SIAM; 2002;42.
  • Hahn W. Stability of motion. Springer-Verlag; 1967.
  • Khalil HK. Nonlinear systems. 3rd ed. Prentice Hall; 2002.
  • Durrett R. Probability: theory and examples. 5th ed. Cambridge University Press; 2019.
  • Gladyshev EG. On stochastic approximation. Theory Probab Appl. 1965;X(2):275–278.
  • Borkar VS, Meyn SP. The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM J Control Optim. 2000;38:447–469.
  • Tsitsiklis JN. NP-hardness of checking the unichain condition in average cost MDPs. Oper Res Lett. 2007;35:319–323.
  • Sutton RS, McAllester D, Singh S, et al. Policy gradient methods for reinforcement learning with function approximation. In: Advances in neural information processing systems 12 (proceedings of the 1999 conference), Denver: MIT Press; 2000. p. 1057–1063.
  • Marbach P, Tsitsiklis JN. Simulation-based optimization of Markov reward processes. IEEE Trans Autom Contr. 2001;46(2):191–209.
  • Konda V, Tsitsiklis J. On actor-critic algorithms. SIAM J Control Optim. 2003;42(4):1143–1166.
  • Chen Z, Maguluri ST, Shakkottai S, et al. Finite-sample analysis of contractive stochastic approximation using smooth convex envelopes [Arxiv:2002.00874v4]; 2020.
  • Chen Z, Maguluri ST, Shakkottai S, et al. Finite-sample analysis of off-policy TD-learning via generalized bellman operators [Arxiv:2106.12729v1]; 2021.