Mirror Descent Policy Optimization

3 minute read


Reinforcement Learning (RL) approaches can be broadly casted in three categories--- **value** based, **model** based, and **policy** based. Value based methods tend to model the optimal value function and then extract the optimal policy. Model based methods try to learn the model (transition and reward dynamics) and then extract the optimal policy using planning techniques. Intead, policy based methods try to learn the optimal policy by directly optimizing the objective funtion of inerest, i.e. the expected discounted return. Policy optimization falls under this third category and includes popular off-the-shelf RL methods such as TRPO, PPO, and SAC.

Mirror Descent

Mirror Descent (MD) is a first order trust-region method for constrained convex optimization.

$x^* \in \arg\!\min_{x\in C} \; f(x)$

MD offers the following iterative update rule for the above problem:

$x_{k+1} \in \arg\!\min_{x \in C} \; \langle \nabla f(x_{k}), x - x_{k} \rangle + \frac{1}{t_k} B_{\psi}(x, x_k)$

Mirror Descent in RL

Applying the MD update to the RL objective results in having to solve an optimization problem at each iteration $k$. We test a simple solution to this--- solving the optimization problem approximately using multiple steps of SGD. In doing so, we can derive on-policy and off-policy versions based on how we wish to sample data:

On-policy MDPO

On-policy MDPO tries to solve the following optimization problem at each iterate $k$, using multiple steps of SGD:

$\theta_{k+1} \leftarrow \arg\max_{\theta\in\Theta} \mathbb{E_{s\sim\rho_{\theta_k}}}\Big[\mathbb{E_{a\sim\pi_\theta}}\big[A^{\theta_k}(s, a)\big] - \frac{1}{t_{k}}\text{KL}(s;\pi_\theta,\pi_{\theta_k})\Big]$

Off-policy MDPO

Off-policy MDPO tries to solve the following optimization problem at each iterate $k$, using multiple steps of SGD:

$\theta_{k+1} \leftarrow \arg\max_{\theta\in\Theta} \mathbb{E_{s\sim\mathcal{D}}}\Big[\mathbb{E_{a\sim\pi_\theta}}\big[A^{\theta_k}(s, a)\big] - \frac{1}{t_{k}}\text{KL}(s;\pi_\theta,\pi_{\theta_k})\Big]$

Key Differences with TRPO

  • TRPO deploys a hard constraint on the policy update, through a line search routine. MDPO does not have a hard constraint.
  • TRPO performs natural gradient descent, thus computing the FIM at each step. MDPO performs simple SGD.
  • TRPO uses the opposite KL direction compared to MDPO.
  • TRPO uses heuristics for step size. MDPO uses a simple annealing schedule.

Key Differences with PPO

  • PPO uses a clipping based objective, removing any dependence on the KL divergence.
  • PPO-KL is similar to MDPO but uses opposite KL direction/does not use multiple SGD steps. The adaptive version also uses heuristics for computing the step size.


Coming back to the off-policy version of MDPO, we see that after applying the reparameterization trick, the MDPO update closely resembles the SAC update:

$L^\text{MDPO}(\theta,\theta_k) = \mathbb{E_{s \sim \mathcal{D}, \epsilon \sim \mathcal{N}}} \big[\log \pi_\theta\big(\widetilde{a}_\theta(\epsilon,s)|s\big) - \log \pi_{\theta_k}\big(\widetilde{a}_\theta(\epsilon,s)|s\big) - t_k Q^{\theta_k}_\psi\big(s,\widetilde{a}_\theta(\epsilon,s)\big)\big]$

$L^\text{SAC}(\theta,\theta_k) = \mathbb{E_{s \sim \mathcal{D}, \epsilon \sim \mathcal{N}}} \big[\lambda \log \pi_{\theta}\big(\widetilde{a}_\theta (\epsilon, s)|s\big) - Q^{\theta_k}_\psi\big(s,\widetilde{a}_\theta(\epsilon, s)\big)\big]$

Key Differences with SAC

  • SAC constraints the current policy to be close to uniform policy. MDPO constraints it to be close to the previous step policy.
  • MDPO offers a new derivation to SAC, through a optimization perspective. SAC is originally derived from a ‘soft’ policy iteration perspective.
  • SAC argues that any projection can be used for it’s loss. MDPO shows that the projection must be the same as the choice of Bregman.