Successor Representation and Eigen Options
Proto Value Functions
These were introduced by Mahadevan et al.[3] and can be considered task-independent basis functions: a linear combination of them can be used to represent the value function of a particular task. They are constructed by diagonalizing a diffusion operator over the state space, in this case the graph Laplacian formed from the adjacency matrix. The advantage of learning proto-value functions is that they capture the general topology of the state space, irrespective of (or in the absence of) the reward function. Although a large state space may have a highly complex value function, it can often be approximated well by a linear combination of a relatively small number of proto-value functions.
Proto-value functions (PVFs) move away from representing the value function over a flat vector space and instead represent it using the manifold on which it lives. For example, suppose the optimal value function is supported on a circular arc, which is a one-dimensional manifold. Estimating this value function requires a notion of distance between any two points on the circle; in the ambient space, i.e. the two-dimensional Euclidean plane, this would be the length of the line segment joining the two points, but the relevant distance is the length of the arc between them, i.e. distance on the manifold rather than in the ambient space. Distance here essentially tells us how far apart two points or states are. "The inherent smoothness in value functions is due to the Bellman equations which dictate that the value at a given state is a linear function of the value at neighboring states. In the study of continuous manifolds, the eigenfunctions of the Laplacian form a discrete orthonormal basis [Rosenberg, 1997]". The manifold can be learnt by constructing the Laplacian operator matrix on the graph representing the state space and finding its *k* smoothest eigenvectors (those with the smallest eigenvalues). These *k* eigenvectors form the columns of the basis-function matrix, and a linear least-squares fit to the optimal value function then gives its approximation in this basis, as in the sketch below.
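A minimal sketch of this construction on a toy ring graph (the graph, the number of PVFs *k*, and the target value function below are illustrative choices, not taken from the papers):

```python
# Minimal sketch of proto-value function (PVF) construction on a toy state
# graph, assuming the adjacency matrix is known. The 10-state ring graph and
# the "true" value function used for the fit are illustrative stand-ins.
import numpy as np

n_states, k = 10, 4                      # states in the ring, number of PVFs

# Adjacency matrix of a ring graph: each state is connected to its neighbours.
A = np.zeros((n_states, n_states))
for s in range(n_states):
    A[s, (s + 1) % n_states] = A[(s + 1) % n_states, s] = 1

# Combinatorial graph Laplacian L = D - A (D is the degree matrix).
D = np.diag(A.sum(axis=1))
L = D - A

# PVFs are eigenvectors of L; the smoothest ones correspond to the smallest
# eigenvalues. eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(L)
pvfs = eigvecs[:, :k]                    # basis matrix, one PVF per column

# Least-squares fit of a (hypothetical) optimal value function onto the basis.
v_true = np.sin(2 * np.pi * np.arange(n_states) / n_states)  # smooth target
weights, *_ = np.linalg.lstsq(pvfs, v_true, rcond=None)
v_approx = pvfs @ weights
print("max approximation error:", np.abs(v_approx - v_true).max())
```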
Eigen Purposes
Eigenpurposes are described as intrinsic reward functions given by the difference of a proto-value function evaluated at two states. Since eigenpurposes are defined using proto-value functions, they too are independent of the (extrinsic) reward function. Such a reward induces a policy, referred to as an eigenbehaviour, which can be viewed as an option[6] (called an eigenoption). A different option is produced for each intrinsic reward function or eigenpurpose, which in turn corresponds to a different proto-value basis function. In the tabular case, this intrinsic reward is simply the difference of the proto-value function at the current and the successor state. Since each such intrinsic reward corresponds to an eigenoption, the number of eigenoptions discoverable through this method equals the number of proto-value functions, which is essentially the number of eigenvectors of the combinatorial Laplacian as described by the PVF theory. The one assumption in the initial formulation is that the adjacency matrix of the state graph is available, so that the graph Laplacian can be constructed. The authors note that learning options which do not use the external reward, and are thus task independent, lets them move beyond finding bottleneck states towards more general options that can help in learning optimal policies for a variety of tasks. They also show that learning only options whose goals are bottleneck states can actually hinder exploration, since such options only move the agent between subgoals rather than into other parts of the state space; task-independent options allow for more efficient exploration.

This idea is then extended to the case where the adjacency matrix is not readily available: transitions are sampled and an 'incidence' matrix is built whose rows are the differences between the feature representations of the two states in each sampled transition. SVD is performed on this matrix, and the eigenpurposes are read off as the columns of the V matrix. Avoiding appending a sampled transition more than once to this 'incidence' matrix allows the authors to extend the approach to the continuous-state case, roughly as in the sketch below.
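A small sketch of the sample-based construction, assuming tabular one-hot features, a 5x5 gridworld, and a random-walk sampler (all illustrative choices rather than the authors' exact setup):

```python
# Minimal sketch of the sample-based eigenpurpose construction: rows of an
# "incidence" matrix are differences of feature vectors over sampled
# transitions, and the right singular vectors act as eigenpurposes.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_samples = 25, 500            # 5x5 grid, number of sampled steps

def phi(s):
    """One-hot feature representation (tabular case)."""
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

def step(s, a):
    """Random-walk transition on a 5x5 grid; actions: up/down/left/right."""
    r, c = divmod(s, 5)
    if a == 0: r = max(r - 1, 0)
    elif a == 1: r = min(r + 1, 4)
    elif a == 2: c = max(c - 1, 0)
    else: c = min(c + 1, 4)
    return r * 5 + c

# Build the incidence matrix from sampled transitions, skipping transitions
# that were already added (and self-transitions at walls).
rows, seen, s = [], set(), 0
for _ in range(n_samples):
    a = rng.integers(4)
    s_next = step(s, a)
    if (s, s_next) not in seen and s != s_next:
        seen.add((s, s_next))
        rows.append(phi(s_next) - phi(s))
    s = s_next
T = np.array(rows)

# Eigenpurposes are the right singular vectors (columns of V).
U, S, Vt = np.linalg.svd(T)
e = Vt[0]                                # first eigenpurpose

def intrinsic_reward(e, s, s_next):
    """Intrinsic reward for a transition (s, s') under eigenpurpose e."""
    return e @ (phi(s_next) - phi(s))
```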
The next work in this line establishes a link between eigenoption discovery and the successor representation[4], removing the need for the combinatorial Laplacian. The authors show that the eigenvectors of the successor representation matrix are similar to those of the graph Laplacian, and present results using a deep successor representation[5] on the Atari domain. An SR for the continuous case, the successor feature (SF), is learnt alongside an auxiliary task of predicting the next state given the current state and action. The SR vector of each sampled state is appended as a row of a matrix from which the eigenpurposes are formed. In the presence of extrinsic reward, in the tabular case, Q-learning is applied over primitive actions while following a uniform behaviour policy over both the primitive actions and the learned eigenoptions.
$M(s, s^{'}) = \mathop{\mathbb{E}}_{P} \left[ \sum_{t=0}^{\infty} \gamma^{t} \, \mathop{\mathbb{I}}(s_{t} = s^{'}) \ \middle| \ s_{0} = s \right]$
where $\mathop{\mathbb{I}}(\cdot)$ is 1 if its argument is true, and 0 otherwise.
$ M(s, :) = 1_{s} + \gamma \ \mathop{\mathbb{E}}_{s^{'} \sim P} \ [M(s^{'}, :)] $
$r_{e}(s, s^{'}) = e[s^{'}] - e[s] $
where $M$ and $e$ denote the successor representation and a proto-value function (used as the eigenpurpose), respectively.
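A minimal tabular sketch of this pipeline, learning $M$ with the TD-style update implied by the recursion above and then taking eigenvectors of $M$ as eigenpurposes (the gridworld, random-walk policy, step size, and step budget are illustrative assumptions):

```python
# Minimal sketch: learn a tabular successor representation with a TD update,
# then use eigenvectors of M as PVF-like eigenpurposes.
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma, alpha = 25, 0.9, 0.1
I = np.eye(n_states)

def step(s, a):
    """Random-walk transition on a 5x5 grid; actions: up/down/left/right."""
    r, c = divmod(s, 5)
    if a == 0: r = max(r - 1, 0)
    elif a == 1: r = min(r + 1, 4)
    elif a == 2: c = max(c - 1, 0)
    else: c = min(c + 1, 4)
    return r * 5 + c

M = np.zeros((n_states, n_states))       # successor representation M(s, s')
s = 0
for _ in range(50000):
    a = rng.integers(4)
    s_next = step(s, a)
    # TD update towards the target 1_s + gamma * M(s', :)
    M[s] += alpha * (I[s] + gamma * M[s_next] - M[s])
    s = s_next

# Eigenvectors of the (symmetrised) SR play the role of proto-value functions.
eigvals, eigvecs = np.linalg.eigh((M + M.T) / 2)
e = eigvecs[:, -1]                       # e.g. the top eigenvector

def intrinsic_reward(e, s, s_next):
    """Eigenpurpose reward r_e(s, s') = e[s'] - e[s] in the tabular case."""
    return e[s_next] - e[s]
```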
Here we consider a maze for which initial exploration is unable to discover all of its states*. The SR learned from this exploration therefore has values of 0 for the undiscovered states. Some of the eigenoptions discovered from such an SR are nevertheless able to reach the unexplored part of the state space, so the SR can be relearned starting from the state where the corresponding option terminates. The relearned SR can then be used to generate eigenvectors again, which now cover all of the (now explored) space. In this way eigenoptions can help with exploration. For this to work it was important that the eigenvector assign value zero to the unexplored states (as they remain unvisited in the SR) and negative values to the rest, so that the intrinsic reward drives the agent towards the relatively higher-valued, i.e. unexplored, states. In short, it seems preferable in such a scenario to learn the SR and the eigenoptions in an iterative fashion rather than learning the SR once and then all the corresponding eigenoptions, since the initially learned SR may be inaccurate while much of the state space is unexplored. It would be interesting to see the difference between the two approaches when learning a reward-based task. However, this sign structure cannot be assured in the function approximation setting, so there is no guarantee that the eigenoptions we find will help with exploration. Learning eigenoptions for Atari through Q-learning is also regarded as computationally expensive: almost 1024 options would have to be learned with Q-learning on an Atari game, each of which takes more than a million steps.
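As a concrete illustration of this sign argument (the numbers here are made up): suppose the eigenpurpose assigns $e[s] = -0.3$ to an explored state bordering the unexplored region and $e[s^{'}] = 0$ to the neighbouring unexplored state. Then

$r_{e}(s, s^{'}) = e[s^{'}] - e[s] = 0 - (-0.3) = 0.3 > 0$

so maximising this intrinsic reward pushes the agent towards the unexplored region, while transitions deep inside the explored region (both entries negative and similar in magnitude) yield rewards close to zero.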
Learning the SR once and then using the combined set of learned eigenoptions and primitive actions to learn a reward-based task can lead to exploration issues when the goal state is not easily reached. Since a uniform policy over eigenoptions and primitive actions samples eigenoptions far more frequently than primitive actions, it tends to move the agent between different parts of the explored space rather than into the unexplored regions. Consider the gridworld task shown in the figure above. Since the reward lies in a region that is unexplored in the initial stages, the learned eigenoptions are only guaranteed to move the agent smoothly within the explored region. Suppose an eigenoption lands the agent near the unexplored region. Because primitive actions are selected very rarely (especially as the number of eigenoptions grows), it is highly unlikely that the uniform behaviour policy will choose the string of primitive actions needed to explore the unexplored space efficiently.

On the other hand, when the SR is learned iteratively, the agent chooses only primitive actions while learning the SR, so when an eigenoption terminates near the unexplored region, the subsequent SR learning ensures that more of the unexplored region is visited and both the SR and the eigenoptions are better estimated. It follows that as the number of eigenoptions in the behaviour policy grows very large, the agent is forced to spend most of its time in the explored regions of the state space, which hinders the exploration required for a sparse-reward task.
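To make this argument quantitative (with illustrative numbers): with $|A| = 4$ primitive actions and $|O| = 64$ eigenoptions, a uniform behaviour policy picks a primitive action with probability $4/68 \approx 0.06$ at each decision point, so the probability of choosing, say, five primitive actions in a row is

$\left(\frac{4}{68}\right)^{5} \approx 7 \times 10^{-7}$

which makes a long string of primitive steps into the unexplored region effectively impossible once the option set is large.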
Some questions worth exploring:

1. How is option discovery affected when the initial SR is not a good estimate? Is a policy made of primitive actions and learnt eigenoptions enough for the exploration required to learn optimal policies in extremely sparse reward settings? I feel I have tried to answer part of this question in the argument above.
2. How are the smoothest eigenvectors identified in the function approximation setting?
3. Is there an alternative to the incidence matrix formulation and / or the eigenpurpose definition?
References
1. Eigenoption Discovery through the Deep Successor Representation, Marlos C. Machado, Clemens Rosenbaum, Xiaoxiao Guo, Miao Liu, Gerald Tesauro, Murray Campbell.
2. A Laplacian Framework for Option Discovery in Reinforcement Learning, Marlos C. Machado, Marc G. Bellemare, Michael Bowling.
3. Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes, Sridhar Mahadevan, Mauro Maggioni.
4. Improving Generalisation for Temporal Difference Learning: The Successor Representation, Peter Dayan.
5. Deep Successor Reinforcement Learning, Tejas D. Kulkarni, Ardavan Saeedi, Simanta Gautam, Samuel J. Gershman.
6. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning, Richard S. Sutton, Doina Precup, Satinder Singh.
$*$ Transitions through the wall joining rooms 1 and 2 are restricted only for actions taken from room 1 towards room 2.