This article is about the theoretical derivation of Policy Improvement Bound and practical policy optimization algorithms discussed in the paper Trust Region Policy Optimization (Schulman, et al, 2015). TRPO is an interesting idea which optimizes policies with guaranteed monotonic improvement. In theory, its algorithm design looks elegant and justified. In practice, it performs robustly on a wide variety of tasks. However, the original proof in Appendix A of their ICML version of paper is a little bit unsatisfying. I will also try to perfect their proof in this article.
 MDPs, Policies, and Optimization
 Analysis on Policy Difference Equation
 Conservative Policy Iteration
 General Policy Improvement Bound
 Trust Region Policy Optimization
Let’s start with some background.
MDPs, Policies, and Optimization
MDPs The interaction between the agent and the environment is typically formulated as an infinitehorizon discounted Markov decision process (MDP), which is defined by the tuple . Here, and are finite sets of states and actions respectively. And is the probabilistic transition model, which means when the agent is at some state and it takes an action , the probability of the agent moving to a certain subsequent state is given by \(\mathcal P(s’s,a) = \mathsf{Pr}[s_{t+1} = s’s_t = s, a_t = a]\), where states and action . The function in the tuple is called the expected immediate cost function, indicating how costly it is for a particular move from a particular state. The initial state probability distribution gives a probability distribution over states the agent initially stays. The scalar is the discount factor, which represents the difference in importance between future effects and present effects.
Policies Let be the set of all stationary stochastic policies depicting how the agent takes actions in given states in , i.e., \(\pi(as) = \mathsf{Pr}[a_t=as_t=s]\) for some . Then for the agent who acts under a certain policy , there would be many possible trajectories of alternate states and actions, like , where \(s_0 \sim \rho_0(\cdot), a_t\sim \pi(\cdots_t)\) and . We simply write the trajectory (sampled by a policy) given context for convenience. For survival in the environment, or for expanding its individual utility out of individual rationality, the agent always wants to make its cost lower. Thus, the ability to search for an optimal policy which minimizes the agent’s expected discounted cumulative cost
\[\pi^\star \in \arg\min_{\pi \in \Pi} \mathbb E_{\tau\sim\pi} \left[ \sum_{t=0}^{\infty} \gamma^{t}c(s_t,a_t) \right],\]
is a key feature of intelligent agents. The existence and characterization of such have been widely studied in the theory of optimal control. In reinforcement learning, we are concerned with how to learn or approximate the optimal policy.
Example (2D Navigation Problem as MDP)
The illustration above shows a 2D navigation problem, where Tony plans to go back school as soon as possible, but unfortunately a volcano blocks his way. Even worse, a strong east wind blows occasionally, which might shift Tony out of the direction intending to move in. To reach out the school in safety, Tony needs to make a detour.
Obviously, Tony is the agent and the map is the environment. The state space consists of all locations in the map, . The action space contains moves in four directions, , where denotes the unit vector. The trainsition model where is a deterministic model, which assigns a valid move with probability 1, and transports Tony to the location one step left (if valid) with probability 1. Every step causes one unit time cost. Dropping in to the volcano cost an addtional 10 units time penalty.
States Space Actions Space Trasition Model Policy such as some deterministic policies. Cost Funciton Initialization Discount Factor Trajectory almost surely 0.99 e.g., the one shown in illustration [Terminal] In this problem Tony’s trave has a terminal (school ), while MDP is often in potentially infinit horizon, i.e. nonterminating process. However, we can covert the terminate state to a trap by enforcing the trainsition probability , and setting the cost after arrival as 0. [Reward/Cost] Most people write MDPs in maximizing reward other than minimizing cost. Actually, these two are equivelent, by seeing the cost as the negtive reward. If the immediate reward is defined as , in our definition, the expected immediate cost is , which will be sufficient for policy evaluation and optimization.
Optimization There are roughly there categories of policy optimization algorithms.

Policy iteration methods, which alternate between estimating the value function and improving policy greedily. Here, the value function is defined as \(V^{\pi}(s)=\mathbb E_{\tau\sim\pis_0=s}\left[ \sum_{t=0}^{\infty} \gamma^{t}c(s_t,a_t) \right]\), meaning the discounted expected cumulative cost performing the policy staring at a specific state . Once is estimated, we can greedily update policy accordingly. Optimal policies can also be implicitly obtained (though it requires more iterations) by Value Iteration according to Bellman optimality equation: \[V^{\pi^\star}(s) = \min_{a\in\mathcal A}c(s,a)+\gamma\mathbb E_{s’\sim\mathcal P(\cdots,a)}\left[V^{\pi^\star}(s’)\right]\] which can be applied to the Q value function .
In the modelfree control problem, policy iteration methods is the foundation of some famous RL algorithms such as Sarsa and Qlearning. 
Policy gradient methods, which parameterize the policy and estimate the gradient of the expected discounted cumulative cost from sampled trajectories. Take the advantage actorcritic method as example, the gradient of loss function is \[\nabla_\theta J(\theta) = \mathbb E_{\tau\sim\pi_\theta} \left[\nabla_\theta \log\pi_\theta(as)~A^{\pi_\theta}(s, a)\right], \] where is called advantage function, which is useful to reduce the variance while training. Update with a suitable learning rate each time after estimating advantage function by several steps sampling.

Derivativefree stochastic optimization methods, such as the crossentropy method (CEM), which treat the total cost as a black box function to be minimized in terms of the policy parameters. This kind of approaches achieves good results while being simple to understand and implement. However, it relies on deliberately designed handengineered policy classes with lowdimensional parameterizations, which is highly taskspecific and also limits the expression power of policy models.
On the other hand, gradientbased optimization algorithms can optimize large nonlinear policies such as neural networks so that have much better complexity guarantees. If an efficient training process could also be guaranteed, then the agent would be able to learn more complex and powerful policies.
Analysis on Policy Difference Equation
Let denote the expected cumulative discounted cost of policy for short. A useful identity is given by Kakade & Langford (2002) to express the cost of another policy in term of .
Lemma 1. Given two policy and for an MDP , we have
\[
\eta(\widetilde \pi) = \eta(\pi) + \mathbb E_{\tau \sim \widetilde \pi}\left[\sum_{t = 0}^{\infty} \gamma^{t} A^{\pi}(s_t, a_t)\right]
\]
where all the notations are defined as above.
Proof. It is easy to see by definition that the advantage can be rewritten as
Therefore the expectation of the advantage over possible trajectory generate by is
By defining the discounted visitation measure as , which can be interpreted as unnormalized probability distribution over states that the agent encounters when practicing policy under the MDP , we can rewrite the equation in Lemma 1 as , which we shall call as policy difference equation.
This equation is very informative. When optimizing policy from the current policy , if we can find another policy such that at every state , then a nonincreasing improvement on cost is guaranteed. This also implies a classic result in policy iteration optimization. When the agent takes a deterministic policy maximizing the Q value, i.e., for certain , the minimum is taken for all . Then if there is at least one feasible action to reduce the cost (negative advantage) at current state, and discounted visitation measure is nonzero with respect to , the improvement of policy can be always possible by taking it deterministically, otherwise, the algorithm has already converged to the optimal policy.
However, the advantage or Q value is estimated and approximated during planning, and it is unavoidable that there exist some states at which the expected advantage is positive. In this case, the classic policy iteration would not work easily. To continue to enjoy benefits from theoretical guaranteed improvement at every step, we have to optimize policy directly from the policy difference equation. But the complex dependency of on make the direct optimization intractable. Instead, we study on a local approximation to as follow
The local approximation ignores changes in state visitation measure due to changes in the policy, but matches to the first order for any differentiable parameterized policy .
Claim 1. For any parameter value , matches to the first order at , i.e.,
Proof. Note the identity the first equation is obvious. Taking a derivative on the lefthand side wrt. , we have,
Representing in term of according to policy difference equation, we take a derivative on the righthand side,
The second last equivalence holds due to the identity again.
This result gives us an intuition: a sufficiently small step updating that improves will also improve . But the most important questions is, how big a step we could take, will guarantee a monotonic improvement of ?
Conservative Policy Iteration
To answer the question above, Kakade and Langford proposed a policy updating scheme called conservative policy iteration, for which they could provide explicit lower bounds on the improvement of . They update the policy as the following mixture
where and can be interpreted as the step size.
Lemma 2. There is an upper bound where , on the expected cost of new policy updated by conservative policy iteration.
Proof. Temporally expand and . We get
and
where we use the notation to represent the estimated expected advantage of new policy at each state.
Note that policy can be executed by flip a biased coin first, then decide whether to sample an action from or . That means for every step , , where the random variable .
Let be the counter of how many times that picks as its execution plan. We can divide the expectation of when is executed into two part, condition on whether has been executed,
And also a similar partition can be performed on the expected advantage
Since means that is equivalent to first steps so far, we can have a nice replacement . And it is easy to see that by definition, thus .
Use the Linearity of Expectation, we can derive an upper bound of the difference of and
Rearranging, the Lemma 2 follows.
Corollary 1. A simpler upper bound holds for conservative policy iteration. It is easy to verify since and .
See that when , . Hence, by minimize the righthand side of inequality in Corollary 1 with respect to , conservative policy iteration could find a reasonable step size wich guarantees a monotonic improvement.
General Policy Improvement Bound
The conservative policy iteration is limited on mixture policies. In most case, a mixture of policies is what not we prefer due to its high design complexity and expressive power restriction. As a principal theoretical result of the paper we are introducing, John Schulman extends this result of policy improvement bound to general stochastic policies.
The idea is that if a new policy is updated from with a small step as measured by some statistical distance, such as total variation distance , where and are two probability distribution over a countable set, the difference between and is bounded. Then a monotonic improvement algorithm is induced just as the same argument for conservative policy iteration.
Theorem 1. Let . Then the following bound holds
where .
Proof. Using the coupling lemma, there exists of a joint distribution that we can regard the old policy and the new policy as two marginal distributions of it, i.e.,
such that .
The construction of coupled policy is a kind of routine. If , a justified construction is
where . Otherwise, trivially, let when .
Similar to lemma 2, we can divide the expected into two parts to induce an upper bound
Let be the random indicator of whether disagrees with , by seeing policy as the margin of coulping policy . And let . Because is equivalent to within first steps ( and always agree for ), following almost the same argument in the proof of lemma 2 we have
Applying the result of corollay 1, we have proved
According to Pinsker’s inequality, we obtained a better looking bound in term of KL divergence.
where is the maximum KL divergence of policies at some state.
By minimizing the righthand side of the inequality above at each iteration, it guarantees the expected cumulative cost is nonincreasing.
Trust Region Policy Optimization
So far, the main task of this article has been accomplished. We have proved the policy improvement bound for general stochastic policies. It derives a policy iteration algorithm which ensures nonincreasing cost at each step.
However, the obtained algorithm is inefficient in practice. If we used the penalty coefficient recommended by the theory above, the step sizes would be very small. In order to take larger steps in a robust way, Schulman et al. use a constraint on the KL divergence between the new policy and the old policy, and replace by a heuristic using the average KL divergence . The optimization task to update the current policy is
where is a hyperparameter which relates to the trust region radius, though I still doubt a bit whether their algorithm could be somehow regarded as the trust region method that we familiar with.
Parameterizing policies as differentiable functions with respect to parameters in , estimating the local cost approximation by sampling with current policy , we can obtain following expanded optimization objective
This is exactly the optimization task corresponding to the “single path” sampling scheme in their paper. Another sampling scheme called “vine”, which involves constructing a rollout set and then performing multiple actions from each state in the rollout set. It can reduce variance in estimation, while it requires the environment reset to an arbitrary state and therefore is not suitable for the modelfree reinforcement learning.
They solve the constrained optimization problem above by following procedure:
 Calculate the firstorder approximation to the objective by expanding in the Taylor series around the current point , i.e., , where the dimensional vectors and .
 Use Fisher information matrix to calculate the quadratic approximation to the KL divergence constraint as , where .
 Determine the update direction by approximately solving equation with the conjugate gradient method. This step actually changes the optimization objective as , which is quite similar to the optimization objective we obtained from policy improvement bound directly but without a coefficient on KL regularizer.
 Having computed the search direction , we compute the maximal step size such that satisfies the KL divergence constraint. That is an upper bound on it, . We have .
 Perform the line search on the objective by shrinking exponentially until the objective improves, where equals 0 when its assertion is satisfied and otherwise.
After performing these five steps, we finish one iteration and obtain a return as the new policy which approximately guarantees a nonincreasing improvement of the expected cumulative cost.