DeepMind researchers claimed state-of-the-art results in the Arcade Learning Environment in their recent ICML paper “A Distributional Perspective on Reinforcement Learning”. They investigate a distributional perspective of the value function in the reinforcement learning context and further design an algorithm applying Bellman’s equation to approximately learn value distributions, which results in better policy optimization. The discussion in their paper follows closely the classical analysis of Bellman operator and dynamic programming, which should be familiar to most AI researchers.

In this article, I will introduce the mathematical background of the classical analysis of reinforcement learning, that is, the Bellman operator is essentially a contraction mapping on a complete metric space, and explain how value-based reinforcement learning, e.g. Q-learning, finds the optimal policy. Based on these general ideas, readers will see how straightforward the idea that the DeepMind paper comes up with, and hopefully draw some inspiration from this topological perspective to design future RL algorithms.

For the reader who prefers to systematically learn from books, I strongly recommend you to read the textbook Linear Operator Theory in Engineering and Science for acquiring mathematical background, and the book Abstract Dynamic Programming by Dimitri P. Bertsekas to study the connection between contraction mapping and dynamic programming. If you don’t remember Markov decision processes or policy optimization in reinforcement learning, my previous blog has introduced them briefly.

Topology of Metric Spaces

Literally, the word “topology” means “the study of spatial properties”. The space here could be any abstract space with certain properties. A natural way for us to quickly start the discussion is to observe a metric space.

Definition 1. (Metric Space) A metric space is an ordered pair consists of an underlying set and a real-valued function , called metric, defined for such that for any the following conditions are satisfied:

  1.                          [non-negativity]
  2.         [identity of indiscernibles]
  3.                 [symmetry]
  4. [triangle inequality]

All these four conditions are in harmony with our intuition of distance. Indeed, the Euclidean distance is a valid metric.

Cauchy Sequences and Completeness

When we construct a metric space , it is common to expect our metric space has no “point missing” from it. For example, the space of rational numbers, with the metric by the absolute value of the difference, is not complete. Just consider the rational sequence , , whose elements become arbitrarily close to each other as the sequence progresses, but it doesn’t converge to any point in space . Therefore, the space is incomplete, since it obviously misses points.

A sequence like that in our example is called Cauchy sequence. Formally, we define the Cauchy sequence in and the completeness of a metric space as follows:

Definition 2. (Cauchy Sequence) A sequence is said to be Cauchy sequence if for each there exists an such that for any .

Definition 3. (Complete Metric Space) A metric space is said to be complete if each Cauchy sequence is a convergent sequence in it.

The significance of complete metric space is difficult to be overemphasized. In many application, it is easier to show a given sequence is a Cauchy sequence than to show it is convergent. However, if the underlying metric space is complete, to show a given sequence is a Cauchy sequence is sufficient for convergence.

Contraction Mapping Theorem

Now we introduce the contraction mapping based on the concept of completeness.

Definition 4. (Contraction) Let (X, d) be a metric space and . We say that is a contraction, or a contraction mapping, if there is a real number , such that \[ d(f(x), f(y)) \leq kd(x,y) \] for all x and y in X, where the term is called a Lipschitz coefficent for .

The reason that contractions play an important role in dynamic programming problems lies in the following Banach fixed-point theorem.

Theorem 1. (Contraction Mapping Theorem) Let be a complete metric space and let be a contraction. Then there is one and only one fixed point such that \[ f(x^* ) = x^* . \] Moreover, if is any point in and is inductively defined by , , then as .

proof. Let us first prove every sequence convergent to a fixed point. Since the metric space is complete, we only need to prove every is a Cauchy sequence. By the triangle inequality and symmetry of metric, for all ,

Consider two points , in the sequence. Their distance is bounded by

since the distance converges to zero as , proving that is a Cauchy sequence. Because the space is complete, converges.

We now show that is a fixed point, and it is unique. Since is a contration, by definition it is continous. Therefore \[ f(x^* ) = f(\lim_{m\rightarrow \infty} f^m(x)) = \lim_{m\rightarrow \infty} f(f^m(x)) = \lim_{m\rightarrow \infty} f^m(x) = x^* . \] The uniqueness of can be proved by contradiction. Assume and are two distinct fixed points of . We then get the contradiction \[ 0 < d(x^* , y^* ) = d(f(x^* ) , f(y^* )) \leq k d(x^* , y^* ) < d(x^* , y^* ). \] Hence has only one fixed point and converges to it for any .

Contractive Dynamic Programming

The contractive dynamic programming model is introduced by Eric V. Denardo in 1967, which applies contraction theorem to a broad class of optimization problems. It captures two widely shared properties of the optimization problem, contraction and monotonicity properties to develop a high level abstraction of optimization methods.

Some terminology is now introduced. Let , be two sets, which we refer to as a set of “states” and a set of “actions”, respectively. For each , let be the nonempty subset of actions that are feasible at state . We denote by the set of all stationary deterministic policies , with , for all .

To introduce return function, let be the set of all bouned real valued functions from to reals, and let be a given mapping. One might think as the total payoff for starting at state and choosing action with the prospect of receiving if the transition is happened. We consider the operator defined by \[ (\mathcal T_\pi v)(s) := H(s, \pi(s), v), ~\forall s\in S, v\in V, \] and we also consier the operator defined by \[ (\mathcal T v)(s) := \sup_{a \in A_s}H(s, a, v), ~\forall s\in S, v\in V. \] We call the “evaluation operator” and the “optimality operator”. Our goal is to find a policy maximizing our total payoff. Following two properties provide conditions under which our goal can be realized by iteratively applying these two operators.

Contraction Property

Let a metric on is defined by . Using the fact is a space of bounded functions, it is easy to show that the metric space is complete.

Assumption 1. (Contraction) For all and , for some , we have \[ d(\mathcal T_{\pi} u, \mathcal T_{\pi} v) \leq \gamma d(u ,v). \]

Proposition 1. Suppose the contraction assumption is satisfied. Then is also a contraction with the same Lipschitz coefficient .

Proof. Consider arbitrary and we can write for each by assumption. For any fixed state , we take supremum of both side over , we have \[ \sup_{\pi\in\Pi} \left\{(\mathcal T_\pi u) (s)\right\} = \sup_{\pi(s) \in A_s}H(s, \pi(s), v) = (\mathcal T u) (s) \leq (\mathcal T v)(s) + \gamma d(u, v) \] Reversing the rolee of and , we have for any fixed state . Jointly, we have proved for all . Hence, \[ d(\mathcal T u, \mathcal T v) = \sup_{s\in S} |(\mathcal T u)(s), (\mathcal T v)(s)| \leq \gamma d(u, v). \] The optimality operator is also a contraction with Lipschitz coefficient .

The Banach fixed-point theorem guarantees that iteratively applying evaluation operator to any function will converge to a unique function in such that \[ v_\pi(s) = H(s, \pi(s), v_\pi), ~~ \text{for each}~ s \in S. \] The function is called value fuction of the policy . The optimal value function is defined by for all . Similarly, iteratively applying optimality operator to any funciton will converge to a unique function such that \[ v^* (s) = \sup_{a\in A_s}H(s, \pi(s), v^* ), ~~\text{for each}~ s \in S. \] How should we interpretate this fixed point ? When does equal that we desire? One may have a sense that since the following result aways holds.

Lemma 1. () For any , there exists a policy such that for any , , and any such satisfies .

Proof. Note that for any and , we have an upper bound

Now let , and such that for any , we have .

If we can show the reverse inequality, , then we conclude . However, it does not always hold true. A sufficient condition, monotonicity property, is therefore introduced to make them identical.

Monotonicity Property

Assumption 2. (Monotonicity) For , if , then for each , where we write if for each state , and if and .

The monotonicity property was introduced by L. G. Mitten in 1964. There are many value functions, including the one in the infinite horizon MDP, have the monotonicity property. The following theorem shows why the contraction and monotonicity properties are important in the contractive dynamic programming model.

Theorem 2. Suppose the contraction and monotonicity properties hold. Then .

Proof. Note that for any and . It means for all . Repeatedly applying the monotonicity property times, we derive . By contraction property . Therefore for any and . On the other hand, by Lemma 1 which utilizes the contraction property we have for any . Jointly, \[ v^* - \epsilon \leq v_{\mathtt{opt}} \leq v^* \] for any . This implies and completes the proof.

Value-Based Reinforcement Learning

Reinforcement learning (RL) approximates the optimal policy of a Markov decision process. This method succeeded in solving many challenging problems in diverse fields, such as robostics, transportation, medicine, and finance. Many RL methods are considered by the classic textbook Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto (We are translating its latest version into Chinese!).

Roughly speaking, the value (function) based reinforcement learning is a large category of RL methods which take advantage of the Bellman equation and approximate the value function to find the optimal policy, such as SARSA and Q-Learning. By contrast, the policy-based reinforcement learning methods directly optimize the quantity of total reward, e.g. REINFORCE, which are more stable but less sample efficient than value-based methods.

The key idea of value-based RL relates two Bellman equations of an infinite horizon discounted MDP , where is the immediate reward function, and is the probabilistic transition model, and is the discount factor (If you just forget the definition of MDPs, please refer to my previous blog). The first equation is so-called Bellman expectation equation: \[ v_{\pi}(s) = R(s, \pi(s)) + \gamma \sum_{s’ \in S} \mathcal P(s’ | s, \pi(s))\cdot v_{\pi}(s’). \] This equation describe the expected total reward for taking the action prescribed by policy . The other equation is called Bellman optimality equation: \[ v^* (s) = \max_{a\in A} \left\{R(s, a) + \gamma \sum_{s’ \in S} \mathcal P(s’ | s, a)\cdot v^* (s’)\right\}. \]

Now we consider two corresponding operators, defined by the following \[ (\mathcal T_{\pi}v)(s) := R(s, \pi(s)) + \gamma \sum_{s’ \in S} \mathcal P(s’ | s, \pi(s))\cdot v(s’) ~~~\text{[Bellman evaluation operator]}, \] and \[ (\mathcal Tv)(s) := \max_{a\in A} \left\{ R(s, a) + \gamma \sum_{s’ \in S} \mathcal P(s’ | s, a)\cdot v(s’) \right\} \text{[Bellman optimality operator]} \] for any value function . Note that and are fixed points of operator and respectively. If we can show that satisfies both contraction and monotonicity properties, according to results we have drawn in the last section, and are unique fixed points that repeatedly applying and will converge to respectively, and .

Proposition 2. The contraction and monotonicity properties hold for Bellman evaluation operator for any stationary policy .

Proof. For any value functions , and any policy , we have

Therefore is a contraction with Lipschitz coefficient . The contraction property holds. According to Proposition 1, the optimality operator is also an -contraction.

Further more, for any value functions , we have \[ (\mathcal T_{\pi} u)(s) - (\mathcal T_{\pi} v)(s) = \gamma \sum_{s’ \in S} \mathcal P(s’ | s, \pi(s))\cdot (u(s’) - v(s’)) \geq 0 \] for all . Hence . The monotonicity property holds.

Approximate Value Iteration

As the discussion above indicates, if the model is known, one can design an policy optimization algorithm by generating which converges to and then extracting the corresponding optimal policy. This is exactly the value iteration (VI) method. This algorithm simply starts with any value function guess and repeatedly apply the Bellman optimality operator on it to obtain the optimal value function. Sometimes VI method may be implementable only by approximation, e.g., when the state space is extremely large or when the value function is updated by some constant step size. A common solution is depicted as the following approximate value iteration algorithm.

# Approximate Value Iteration 
Initialize value function v arbitrarily.
    Update v, such that sup|v-T(v)| ≤ δ
until v converges
Extract a policy: π, such that sup|T_π(v) - T(v)| ≤ ϵ

By applying the triangle inequality and the contraction property of and , we can assert the error bounds for the output value function , which is bounded by , and the extracted policy , whose value is bounded by (proof is straightforward, check them yourself :-)). The VI algorithm converges exponentially when and .

Approximate Policy Iteration

The VI method is a direct creation based on the fact that the Bellman optimality operator is a contraction. However, people may think by involving policy evaluation each step one can have a better rate of convergence and a better approximation of optimal policy, though it will incur more calculation each iteration. This is plausible, but not insignificantly so in theory. The approximate policy iteration (PI) algorithm is stated as follows:

# Approximate Policy Iteration
Given the current policy π_k
    Policy evaluation: Compute v such that
        sup|v - (T_π_k)^m_k (v) | ≤ δ
    by appying Bellman evaluation operator T_π

    Policy improvement: Extract a policy π_{k+1} that satisfies
        sup|T_π_{k+1}(v)- T(v)| ≤ ϵ
    π_k ← π_{k+1}
until v converges.
Extract π_{k+1} 

In this algorithm, the parameter is an positive intergers. Taken the usual choice of , we can show that for each iteration. Therefore, its final error bound is asymptotically proportional to , which indicates PI does not hold an accuracy advantage over VI. And the exact PI converges at an exponential rate which is as same as that of exact VI.

Q-Learning: Asynchronous Value Iteration

In the last two section we explained how we do classical model based planing for MDPs from the topological point of view. Many readers by now have a sense that how this stuff and value-based reinforcement learning are connected. But there remains a critical question: Can this explanation is still suitable for those general model free cases? Can we explain those classical reinforcement “learning” algorithm, such as Q-learning, in an analog way?

Intuitively, the answer should be “yes”. Consider the evaluation operator and optimality operator defined by \[ (\mathcal T_{\pi} q)(s, a) := R(s, a) + \gamma\mathbb E_{\mathcal P, \pi} q(s’, a’) \] and \[ (\mathcal T q)(s, a) := R(s, a) + \gamma\mathbb E_{\mathcal P} \max_{a’\in A} q(s’, a’) \] respectively, over the Q value function space , with the similar metric as that of . Following the similar analysis, the contraction and the monotonicity properties hold for them. However, since the model is unknown, we cannot update the Q value function as that in value iteration. We can only approximate and update the Q value for some certain pair by sampling or temporal difference method each iteration. Is its convergence still guaranteed? The following theorem resolves our doubts.

Definition 5. (Asynchronous Value Iteration) Conside value function as a composition of such that in each iteration, \[ v_s^{t+1}(s’) = \begin{cases} \mathcal Tv^{t}(s’), & \text{if}~s’ = s\newline \mathcal v_s^{t}(s’), & \text{otherwise.} \end{cases} \] Theorem 3. (Asynchronous Convergence Theorem) If each restricted value function can be update infinite times, and there is a sequence of nonempty subsets with , such that if is any sequence with for all , the converges pointwise to . Assume further the following:

  1. Synchronous Convergence Condition: We have \[ \mathcal T v \in V^{k+1}, \forall v \in V^k \]
  2. Box Condition: For all is a Cartesian product of the form \[ V^k = \times_{s\in S} V_s^k \]

where is a set of bounded real-valued functions on state . Then for every the sequence generated by the asynchronous VI algorithm converges to .

Proof. To show that the algorithm converges is equivalent to show that the iterates from will get in to eventually, i.e., for each , there is a time such that for all . We can prove it by mathematical induction.

When , the statement is true since . Assuming the statement is true for a given , we will show there is a time with the required properties. For each , let set record the time an update happend on the state . Let be the first element in such that . Then by the synchrnous convergence condition, we have for all . In the view of the box condition, it implies for all for any . Therefore, let , using the box condition, we have for all . By mathematical induction, the statement holds, and will get in to eventually. Hence, converges to .

A Methodology for Developing Iterative RL Algorithms

It is really exciting to see that the convergence and correctness of value-based reinforcement learning algorithm, such as Q-learning, can be interpreted in a topological view. In fact, the contraction mapping theory provides us with a methodology for developing iterative value-based RL algorithm. In a high-level overview, iterative value-based RL algorithms are concerned with the following five essential elements:

  • Value Space : A common choice of is the space of value functions in or the space of Q-value functions in . There are many other choices such as the space of ordered pair of (Example 2.6.2 in Dimitri’s book), or the space of multi-dimensional value functions.

  • Value Metric : It defines the “distance” between two points in the value space. Besides the basic four conditions, the metric should ensure a complete metric space to validate the Banach’s fixed-point theorem. A compatible selection of value metric will make our analysis much easier.

  • Evaluation Operator : We have constructed a recursive expression of a certain value point in the value space associated with some policy, e.g., the Bellman expectation equation, \[ v_{\pi}(\cdot) = H(\cdot, \pi(\cdot), v_{\pi}) \] to depict the value of a policy as a fixed point we desire. Carefully verify that the contraction and the monotonicity properties hold for .

  • Optimality Operator : Write down recursive expression of the optimal point in the value space, e.g., the Bellman optimality equation, \[ v^* (\cdot) = \sup_{\pi\in \Pi}\mathcal T_{\pi} v^* (\cdot). \] One thing important is to make sure the contraction property holds for the optimality operator . Note that when the metric is the supremum of the absolute value of the difference, the contraction property of is automatically satisfied according to Proposition 1.

  • Updating Scheme: It is the most tricky part. For those small-scale problems, the asynchronous value iteration is enough. However, to make our reinforcement learning algorithm practical and scalable, we have to consider much more factors. For example: Should our algorithm be on-policy or off-policy? How do we approximate the value and policies? If it is an online algorithm, how do we trade off exploration and exploitation? All these details will significantly influence the performance of our algorithm on real-world tasks.

Learning from Value Distribution

Let’s get back to the DeepMind paper mentioned at the begining, which proposes a new reinforcement learning algorithm from a distributional perspective. This work is actually a good example of developing and analyzing a new reinforcement learning algorithm using the methodology that we summarize above.

In this example, the value space consists of all the “value distribution” . The value metric of the value space is defined by , where is the Wasserstein distance between two random variables. The evaluation operator is defined by \[ \mathcal T_{\pi} Z(s, a) \stackrel{D}{:=} R(s, a) + \gamma P^{\pi}Z(s, a), \] which is proved to be a contraction in , where “” means random variables on both sides are drawn from the same distribution. The monotonicity property also holds as they define the order as the order of the magnitude of expectation (). The optimality operator is defined by . However, since the relative order is incompatible with the metric, this optimality operator is NOT a contraction in . They use approximate asynchronous value iteration as the updating scheme, where the value distribution function is approximated by a neural network which outputs a discrete distribution between .

If we only consider the space of the first-order moment of the value distribution and the metric , both and are contractions. And this explains why the C51 algorithm works well. However, as for predicting value distribution, I doubt the correctness of directly using the result of value iteration. In my opinion, a policy iteration algorithm would be a better choice here for maintaining value distributions, or we should correct the value distribution function by repeatedly applying evaluation operator after the optimal policy is found.

Multi-Objective RL

Another example which might be helpful for us to understand this methodology is the multi-objective reinforcement learning. And this is also an interesting topic I am recently working on. The value space of this problem should be some multi-dimensional value function. The metric can be extended from that of a normed vector space. If we only want to find one policy resulting a Pareto optimal return, or simply want to find the best policy maximizing the return associated with specific weights, the evaluation operator and the optimality operator are very similar to and (the is defined by Pareto domination and the order of magnitude of weighted sum, respectively). Designing an updating scheme according to the real problem, we are guaranteed to have a viable algorithm. If we want to find all the policies resulting in a Pareto optimal return, or make the policies controllable, how to represent a collection of policies and write down the corresponding evaluation/optimality operator are very challenging.

Summary: Contraction, Value Iteration, and More

In this article, some topological backgrounds of metric spaces and the Banach fixed-point theorem are first introduced. Based on these mathematical foundations, we investigate the contractive dynamic programming, where the contraction property and the monotonicity property together guarantee a unique optimal solution can be achieved exponentially fast. Bring these ideas to the reinforcement learning context, we can explain and analyze value iteration, policy iteration and Q-learning algorithms in a unified framework. It further inspires us to develope a methodology for developing iterative value-based reinforcement algorithm, which requires us to consider value space, value metric, evaluation operator, optimality operator and updating scheme.

The next time someone asks, “how does a value-based RL algorithm converge?”, or “how do people come up with a new RL algorithm?”, we can answer him not first by complex, detailed mathematical deduction, but simply by drawing an intuitive picture and analyzing on those accessible abstractions! Our journey on the topic of topological explanation of value-based reinforcement learning comes to an end here. Hope you enjoyed it!