Mirror Descent Policy Optimization (MDPO)

November 07, 2020

Reinforcement Learning (RL) approaches can be broadly casted in three categories--- value based, model based, and policy based. Value based methods tend to model the optimal value function and then extract the optimal policy. Model based methods try to learn the model (transition and reward dynamics) and then extract the optimal policy using planning techniques. Intead, policy based methods try to learn the optimal policy by directly optimizing the objective funtion of inerest, i.e. the expected discounted return. Trus-region based policy optimization methods fall under this third category and includes popular off-the-shelf RL methods such as TRPO [1], PPO [3], and SAC [2]. The general idea in trust-region methods is to keep the policy updates close to each other, which results in more stable learning.

Mirror Descent

Mirror Descent (MD) is a first order trust-region method for constrained convex optimization.

$x^* \in \arg\!\min_{x\in C} \; f(x)$

MD offers the following iterative update rule for the above problem:

$x_{k+1} \in \arg\!\min_{x \in C} \; \langle \nabla f(x_{k}), x - x_{k} \rangle + \frac{1}{t_k} B_{\psi}(x, x_k)$

Mirror Descent in RL

Applying the MD update to the RL objective results in having to solve an optimization problem at each iteration $k$. We test a simple solution to this--- solving the optimization problem approximately using multiple steps of SGD. In doing so, we can derive on-policy and off-policy versions based on how we wish to sample data:

On-policy MDPO

On-policy MDPO tries to solve the following optimization problem at each iterate $k$, using multiple steps of SGD:

$\theta_{k+1} \leftarrow \arg\max_{\theta\in\Theta} \mathbb{E_{\color{red}{s\sim\rho_{\theta_k}}}}\Big[\mathbb{E_{a\sim\pi_\theta}}\big[A^{\theta_k}(s, a)\big] - \frac{1}{t_{k}}\text{KL}(s;\pi_\theta,\pi_{\theta_k})\Big]$

Off-policy MDPO

Off-policy MDPO tries to solve the following optimization problem at each iterate $k$, using multiple steps of SGD:

$\theta_{k+1} \leftarrow \arg\max_{\theta\in\Theta} \mathbb{E_{\color{red}{s\sim\mathcal{D}}}}\Big[\mathbb{E_{a\sim\pi_\theta}}\big[A^{\theta_k}(s, a)\big] - \frac{1}{t_{k}}\text{KL}(s;\pi_\theta,\pi_{\theta_k})\Big]$

Key Differences with TRPO

TRPO deploys a hard constraint on the policy update, through a line search routine. MDPO does not have a hard constraint.
TRPO performs natural gradient descent, thus computing the FIM at each step. MDPO performs simple SGD.
TRPO uses the opposite KL direction compared to MDPO.
TRPO uses heuristics for step size. MDPO uses a simple annealing schedule.

Key Differences with PPO

PPO uses a clipping based objective, removing any dependence on the KL divergence.
PPO-KL is similar to MDPO but uses opposite KL direction/does not use multiple SGD steps. The adaptive version also uses heuristics for computing the step size.

MDPO and SAC

Coming back to the off-policy version of MDPO, we see that after applying the reparameterization trick, the MDPO update closely resembles the SAC update:

$L^\text{MDPO}(\theta,\theta_k) = \mathbb{E_{s \sim \mathcal{D}, \epsilon \sim \mathcal{N}}} \big[\log \pi_\theta\big(\widetilde{a}_\theta(\epsilon,s)|s\big) - \log \pi_{\theta_k}\big(\widetilde{a}_\theta(\epsilon,s)|s\big) - t_k Q^{\theta_k}_\psi\big(s,\widetilde{a}_\theta(\epsilon,s)\big)\big]$

$L^\text{SAC}(\theta,\theta_k) = \mathbb{E_{s \sim \mathcal{D}, \epsilon \sim \mathcal{N}}} \big[\lambda \log \pi_{\theta}\big(\widetilde{a}_\theta (\epsilon, s)|s\big) - Q^{\theta_k}_\psi\big(s,\widetilde{a}_\theta(\epsilon, s)\big)\big]$

Key Differences with SAC

SAC constraints the current policy to be close to uniform policy. MDPO constraints it to be close to the previous step policy.
MDPO offers a new derivation to SAC, through a optimization perspective. SAC is originally derived from a ‘soft’ policy iteration perspective.
SAC argues that any projection can be used for it’s loss. MDPO shows that the projection must be the same as the choice of Bregman.

Takeaways (on-policy)

MDPO performs better than or on par to TRPO and better than PPO.
MDPO reduces wall clock time, and is more efficient than TRPO.
TRPO is a better performing algorithm than PPO.
PPO is prone to instability issues, particularly when run for longer iterations.

Takeaways (off-policy)

MDPO performs better than or on par with SAC.
Off-policy MDPO has a better sample efficiency than on-policy MDPO.
Off-policy MDPO has a much higher wall clock time than on-policy MDPO.
Regularized and un-regularized MDPO do not have a huge performance gap, similar to how SAC and it's deterministic variant do not as well.

References

Trust Region Policy Optimization. John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, Pieter Abbeel. ICML 2015.
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, Sergey Levine. ICML 2018.
Proximal Policy Optimization Algorithms. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov. Preprint 2017.