Manan Tomar

Mirror Descent Policy Optimization (MDPO)

2020-11-07T00:00:00-08:00

Reinforcement Learning (RL) approaches can be broadly casted in three categories--- value based, model based, and policy based. Value based methods tend to model the optimal value function and then extract the optimal policy. Model based methods try to learn the model (transition and reward dynamics) and then extract the optimal policy using planning techniques. Intead, policy based methods try to learn the optimal policy by directly optimizing the objective funtion of inerest, i.e. the expected discounted return. Trus-region based policy optimization methods fall under this third category and includes popular off-the-shelf RL methods such as TRPO [1], PPO [3], and SAC [2]. The general idea in trust-region methods is to keep the policy updates close to each other, which results in more stable learning.

Mirror Descent

Mirror Descent (MD) is a first order trust-region method for constrained convex optimization.

$x^* \in \arg\!\min_{x\in C} \; f(x)$

MD offers the following iterative update rule for the above problem:

$x_{k+1} \in \arg\!\min_{x \in C} \; \langle \nabla f(x_{k}), x - x_{k} \rangle + \frac{1}{t_k} B_{\psi}(x, x_k)$

Mirror Descent in RL

Applying the MD update to the RL objective results in having to solve an optimization problem at each iteration $k$. We test a simple solution to this--- solving the optimization problem approximately using multiple steps of SGD. In doing so, we can derive on-policy and off-policy versions based on how we wish to sample data:

On-policy MDPO

On-policy MDPO tries to solve the following optimization problem at each iterate $k$, using multiple steps of SGD:

$\theta_{k+1} \leftarrow \arg\max_{\theta\in\Theta} \mathbb{E_{\color{red}{s\sim\rho_{\theta_k}}}}\Big[\mathbb{E_{a\sim\pi_\theta}}\big[A^{\theta_k}(s, a)\big] - \frac{1}{t_{k}}\text{KL}(s;\pi_\theta,\pi_{\theta_k})\Big]$

Off-policy MDPO

Off-policy MDPO tries to solve the following optimization problem at each iterate $k$, using multiple steps of SGD:

$\theta_{k+1} \leftarrow \arg\max_{\theta\in\Theta} \mathbb{E_{\color{red}{s\sim\mathcal{D}}}}\Big[\mathbb{E_{a\sim\pi_\theta}}\big[A^{\theta_k}(s, a)\big] - \frac{1}{t_{k}}\text{KL}(s;\pi_\theta,\pi_{\theta_k})\Big]$

Key Differences with TRPO

TRPO deploys a hard constraint on the policy update, through a line search routine. MDPO does not have a hard constraint.
TRPO performs natural gradient descent, thus computing the FIM at each step. MDPO performs simple SGD.
TRPO uses the opposite KL direction compared to MDPO.
TRPO uses heuristics for step size. MDPO uses a simple annealing schedule.

Key Differences with PPO

PPO uses a clipping based objective, removing any dependence on the KL divergence.
PPO-KL is similar to MDPO but uses opposite KL direction/does not use multiple SGD steps. The adaptive version also uses heuristics for computing the step size.

MDPO and SAC

Coming back to the off-policy version of MDPO, we see that after applying the reparameterization trick, the MDPO update closely resembles the SAC update:

$L^\text{MDPO}(\theta,\theta_k) = \mathbb{E_{s \sim \mathcal{D}, \epsilon \sim \mathcal{N}}} \big[\log \pi_\theta\big(\widetilde{a}_\theta(\epsilon,s)|s\big) - \log \pi_{\theta_k}\big(\widetilde{a}_\theta(\epsilon,s)|s\big) - t_k Q^{\theta_k}_\psi\big(s,\widetilde{a}_\theta(\epsilon,s)\big)\big]$

$L^\text{SAC}(\theta,\theta_k) = \mathbb{E_{s \sim \mathcal{D}, \epsilon \sim \mathcal{N}}} \big[\lambda \log \pi_{\theta}\big(\widetilde{a}_\theta (\epsilon, s)|s\big) - Q^{\theta_k}_\psi\big(s,\widetilde{a}_\theta(\epsilon, s)\big)\big]$

Key Differences with SAC

SAC constraints the current policy to be close to uniform policy. MDPO constraints it to be close to the previous step policy.
MDPO offers a new derivation to SAC, through a optimization perspective. SAC is originally derived from a ‘soft’ policy iteration perspective.
SAC argues that any projection can be used for it’s loss. MDPO shows that the projection must be the same as the choice of Bregman.

Takeaways (on-policy)

MDPO performs better than or on par to TRPO and better than PPO.
MDPO reduces wall clock time, and is more efficient than TRPO.
TRPO is a better performing algorithm than PPO.
PPO is prone to instability issues, particularly when run for longer iterations.

Takeaways (off-policy)

MDPO performs better than or on par with SAC.
Off-policy MDPO has a better sample efficiency than on-policy MDPO.
Off-policy MDPO has a much higher wall clock time than on-policy MDPO.
Regularized and un-regularized MDPO do not have a huge performance gap, similar to how SAC and it's deterministic variant do not as well.

References

Trust Region Policy Optimization. John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, Pieter Abbeel. ICML 2015.
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, Sergey Levine. ICML 2018.
Proximal Policy Optimization Algorithms. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov. Preprint 2017.

Successor Representation and Eigen Options

2018-04-10T00:00:00-07:00

This post is written to gather a better understanding of recent work done in eigen option discovery using successor representation. I try to list out most of the major ideas building upto eigen option discovery and show results obtained on simple gridworld tasks. I start by introducing proto value functions and move onto eigen option discovery and how the successor representation (SR) comes into play.

Proto Value Functions

These were introduced in Mahadevan et al.[3] and can be considered as task independent basis functions. A linear combination of these can thus be used to define the value function of a particular task. They are constructed by diagonalizing a diffusion flow matrix which in turn is the Laplacian of the a function formed using the adjacency matrix. The advantage of learning proto value functions is that they represent the general topology of the state space irrespective of or in absence of the reward function. Although a large state space may have highly complex value function vector, it can be approximately represented as a linear combination of a substantially small number of proto value functions.

Fig 1. a. A 10 x 10 three room gridworld. b. The SR learned for start state as 15

Proto value functions (PVFs) try to move away from the idea of representing the value function in terms of a vector space and try to represent them using the manifold itself. For ex. let the optimal value function be represented by a circlur arc, which is a one dimensional manifold. Now, estimating this value function will require us to calculate distance between any two points on the circle, which, in the ambient space ie. the two dimensional euclidean space is given by the length of the line segment joining these points. However, in fact the distance between the two points is the length of the arc formed between the two ie. distance on the manifold and not the ambient space. Distance here basically tells us how far two points or states are from each other. "The inherent smoothness in value functions is due to the bellman equations which dictate that the value at a given state is a linear function of the value at neighboring states. In the study of continuous manifolds, the eigenfunctions of the Laplacian form a discrete orthonormal basis [Rosenberg, 1997]" The manifold can be learnt by constructing a laplacian operator matrix on the graph representing the state space, and finding its *k* eigenvectors. These *k* eigenvectors can be used as the columns to construct the basis function matrix. Thereafter, a linear least square approximation using the optimal value function can be used to construct the value function.

Fig 2. The eigenvectors corresponding to the largest five eigenvalues for the given gridworld

Eigen Purposes

They are described as intrinsic reward functions represented by the difference in one of the protovalue functions at two given states. Since eigenpurposes are defined using protovalue functions, they too are independent of the reward function. Such a reward will induce a policy, which is referred as an eigenbehaviour and can be looked at as an option[6] (called as an eigenoption). Different such options are produced for different intrinsic reward functions or eigenpurposes which correspond to different protovalue basis functions. In the tabular case, this intrinsic reward is just the difference in the proto value function at two states ie. the current and the successor state. Therefore, since each such intrinsic reward corresponds to an eigen option, the number of eigen options discoverable through this method is equal to the number of protovalue functions which is essentially the number of eigenvectors of the combinatorial laplacian as descibed by the PVF theory. The one assumption considered in the initial idea is that the adjacency matrix of the state graph is available so as to generate the graph laplacian. The authors note that learning options which do not use the external reward and are thus task independent allow them to not only focus on finding bottleneck states, but finding more general options which can help in learning the optimal policies for a varied number of tasks. They also show that only learning options having goals as bottleneck states can actually hinder exploration as they allow traversing only between the subgoals, and not other parts of the state space. Learning options that are task independent allows for more efficient exploration.
This idea is then extended to the case when the adjacency matrix is not readily available by sampling transitions and creating an 'incidence' matrix with each row being the difference in the feature representation of the sampled transition. SVD is performed on this matrix to get the eigenpurposes as the columns of the V matrix. Avoiding appending a sampled transition again in such an 'incidence' matrix allows the authors to extend this to the continuous state case.
The next work in this line tries to establish a link between finding such eigenoptions using the successor representation[4] instead of using the combinatorial laplacian. The authors show that the eigenvectors generated by the successor representation matrix are similar to those generated by the graph laplacian and show results on learning a deep successor representation[5] on the Atari domain. A SR for the continuous case or SF (Successor Feature) is learnt while learning an auxiliary task of predicting the next state given current state and action. The SR vector value for each sampled state is appended to the transition matrix using which the eigenpurposes are formed. In the presence of extrinsic reward, in the tabular case, Q learning is used over primitive actions while following a uniform behavior policy over both the primitive actions and the learned eigenoptions.

$M(s, s^{'}) = \mathop{\mathbb{E}}_{s^{'} \sim P} \ [\sum_{t=0}^{\infty} \ \gamma^{t} \mathop{\mathbb{I}}(s_{t} = s^{'}) \ | \ s_{0} = s ]$

where, $\mathop{\mathbb{I}}(.)$ is 1 if its argument is true, else 0.

$ M(s, :) = 1_{s} + \gamma \ \mathop{\mathbb{E}}_{s^{'} \sim P} \ [M(s^{'}, :)] $
$r_{e}(s, s^{'}) = e[s^{'}] - e[s] $

where $M$ and $e$ denote the successor representation and proto value function respectively.

Here we consider a maze for which initial exploration is unable to discover all states in the maze*. The SR constructed therefore has values of 0 for such states. Now, some of the eigenoptions discovered for such a SR are able to reach the unexplored part of the state space, and thus relearn a SR for starting at the state where the corresponding option terminates. The relearnt SR can be used to generate eigenvectors again which now correspond to all of the space (which is now explored). This way eigenoptions can help in exploration. To achieve this, it was important for the eigenvector to have the states for the unexplored space as having value zero (as they remain unexplored) and the rest as having negative values (so as to motivate the agent to reach high reward states ie. unexplored space). Basically, it is perhaps recommended in such a scenario to learn the SR and the eigenoptions in an iterative fashion instead of learning the SR and then the corresponding eigenoptions (the learned SR might not be accurate as not the whole state space is explored). It will be interesting to see the difference in learning a reward based task for both the approaches. However, this cannot be assured in the function approximation setting, and therefore does not guarantee that we find / learn eigenoptions that can help in exploring. Learning eigenoptions for Atari through Q learning is regarded as computationally expensive. This is because we are required to learn almost 1024 options using Q learning on an Atari game, each of which at least takes more than a million steps.

Fig 3. Comparing the first eigen vector with (left) and without (right) iteratively updating SR

Learning the SR once, and then using the combined set of learnt eigenoptions and primitive actions to learn a reward based task can lead to exploratory issues when the goal state is not easily explored. Since a uniform policy over eigenoptions and primitive actions will be sampling eigenoptions more frequently than a policy of just the primitve actions, it is more likely to lead the agent in different parts of the explored space rather than into the unexplored regions. Consider the gridworld task as shown in the figure above. Since the reward lies in a region which is unexplored in the initial stages, the eigenoptions learned are only guaranteed to move the agent smoothly in the explored region. Consider an eigenoption which lands the agent near the unexplored region. Now since primitive actions are very rarely selected (especially as the number of eigenoptions increase), it is highly unlikely for the uniform behivor policy to choose a string of primitive actions that are able to explore the unexplroed space efficiently.On the other hand, in the case where we iteratively learn SR, we allow the agent to choose only primitive actions (while learning the SR), and thus when an eigenoption terminates near the unexplored region, the SR learning ensures that more unexplored region is explored, the SR and thus the eigenoptions are better estimated. It is clear from this argument that as the number of eigenoptions in the behivor policy grow very large, the agent is forced to spend most of the time in the explored regions of the state space, which hinders exploration required for a sparse reward based task.
Some questions worth exploring : 1. How is the option discovery affected when the initial SR is not a good estimate ? Is a policy made of primitive actions and learnt eigenoptions enough for exploration required to learn optimal policies in extremely sparse reward settings ? I feel I have tried to answer part of this question in the above argument. 2. How are the smoothest eigen vectors identified in the function approximation setting ? 3. Is there an alternative to the incidence matrix formulation and / or the eigenpurpose definition ?

References

Eigenoption Discovery through the Deep Successor Representation, Marlos C. Machado, Clemens Rosenbaum, Xiaoxiao Guo, Miao Liu, Gerald Tesauro, Murray Campbell.
A Laplacian Framework for Option Discovery in Reinforcement Learning, Marlos C. Machado, Marc G. Bellemare, Michael Bowling.
Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes, Sridhar Mahadevan, Mauro Maggioni.
Dayan P. Improving Generalisation for Temporal Difference Learning: The Successor Representation.
Deep Successor Reinforcement Learning, Tejas D. Kulkarni and Ardavan Saeedi and Simanta Gautam and Samuel J. Gershman.
Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning, Richard S. Sutton, Doina Precup, Satinder Singh.

$*$ (transitions through wall joining room 1 and 2 are restricted only for actions from room 1 towards 2)