Cell-Free Latent Go-Explore

Quentin Gallouédec, Emmanuel Dellandréa

Abstract

In this paper, we introduce Latent Go-Explore (LGE), a simple and general approach based on the Go-Explore paradigm for exploration in reinforcement learning (RL). Go-Explore was initially introduced with a strong domain knowledge constraint for partitioning the state space into cells. However, in most real-world scenarios, drawing domain knowledge from raw observations is complex and tedious. If the cell partitioning is not informative enough, Go-Explore can completely fail to explore the environment. We argue that the Go-Explore approach can be generalized to any environment without domain knowledge and without cells by exploiting a learned latent representation. Thus, we show that LGE can be flexibly combined with any strategy for learning a latent representation. We show that LGE, although simpler than Go-Explore, is more robust and outperforms all state-of-the-art algorithms in terms of pure exploration on multiple hard-exploration environments. The LGE implementation is available as open-source at https://github.com/qgallouedec/lge.

\affiliations

École Centrale de Lyon
LIRIS, CNRS UMR 5205, France
quentin.gallouedec@ec-lyon.fr

1 Introduction

RL algorithms aim to learn a policy by maximizing a reward signal. In some cases, the rewards from the environment are sufficiently informative for the agent to learn a complex policy, and thus achieve impressive results, including world level in Go silver2016mastering, StarCraft vinyals2019grandmaster, or even learning sophisticated robotic tasks lee2019robust. However, many real-world environments provide extremely sparse bellemare2016unifying, deceptive lehman2011abandoning rewards, or none at all. In such environments, random exploration, on which many current RL approaches rely, may not be sufficient to collect data that is diverse and informative enough for the agent to learn anything. In these cases, the agent must adopt an efficient exploration strategy to reach high reward areas, which may require a significant amount of interaction with the environment.

Recently, ecoffet2021first introduced a new paradigm in which a goal-conditioned agent is trained to reach states it has already encountered, and then explore from there. The agent thus iteratively pushes back the frontier of its knowledge of the environment. We call this family of algorithms return-then-explore. ecoffet2021first provide Go-Explore, an algorithm of this new family, that outperforms by several orders of magnitude the state-of-the-art scores on the game Montezuma’s Revenge, known as a hard-exploration problem. Go-Explore relies on a grouping of observations into cells. These cells are used (1) to select target observations that are at the frontier of yet undiscovered states and (2) to build a subgoal trajectory that the agent follows to reach the final goal cell. As ecoffet2021first initially spotted, the cell design is not obvious. It requires detailed knowledge of the observation space, the dynamics of the environment, and the subsequent task. If any important information about the dynamics of the environment is missing from the cell representation, the agent may fail to explore at all. For example, in Montezuma’s Revenge, possession of a key is a crucial piece of information that when included in the cell representation increases exploration by several orders of magnitude. We also show in Appendix B that the cell design has a major influence on the results.

In this paper, we present Latent Go-Explore (LGE), a new algorithm derived from Go-Explore which operates without cells. This new algorithm meets the definition of a return-then-explore family of algorithms since the agent samples a final goal state at the frontier of the achieved states, returns to it, and then explores further from it. Our main contribution consists of three major modifications.

A latent representation is learned simultaneously with the exploration of the agent to provide the most up-to-date and informative representation possible.
Sampling of the final goal is based on a non-parametric density model in latent space. This leverages the learned latent representation for sampling the states of interest to be reached.
The subgoal path pursued by the agent is reduced using a characteristic latent distance.

These three modifications, detailed in Section 3, allow us to generalize the Go-Explore approach to any continuous high-dimensional environment. It also allows to automate the encoding of the observation into an informative latent representation, and thus get rid of the tedious work of designing by hand a cell representation. The end-to-end architecture of LGE is presented in Figure 1.

Figure 1: End-to-end learning architecture of LGE. The encountered observations are encoded in a latent space. A latent density is estimated. A final goal is sampled from the states already reached, by skewing the distribution with the density. A goal-conditioned agent is trained to reach this goal by pursuing a sequence of subgoals, derived from the experiment that led to the final goal. Once the agent has reached the final goal, it explores from it with any exploration strategy.

To evaluate LGE, we place ourselves in the context of reward-free exploration in two hard-exploration environments: a maze and the control of a robotic arm interacting with an object. LGE is compatible with any type of latent representation learning. We show two of them in this paper: one based on inverse dynamics, and the other on an auto-encoding mechanism. We show in Section 4.4 that for the two environments studied, LGE outperforms all state-of-the-art algorithms, and in particular Go-Explore for the exploration task.

2 Preliminaries and Related Work

2.1 Preliminaries

Markow Decision Process

This paper uses the standard formalism of a discounted Markov Decision Process (MDP) defined as the tuple $(S, A, P, R, γ, ρ_{0})$ where $S$ is the set of states, $A$ is the set of actions, $P : S \times A \to S$ is the (unknown) transition function providing the probability distribution of the next state given a current state and action, $R : S \times A \times S \to R$ is the reward function, $γ$ is the discount factor and $ρ_{0}$ is the initial distribution of states. A policy, denoted $π (\cdot | \cdot) : S \times A \to R^{+}$ is the probability distribution such that $π (s | a)$ is the probability of selecting action $a \in A$ in state $s \in S$ . We denote the previously defined values with discrete time $t$ such that $s_{t}$ , $a_{t}$ and $r_{t}$ denote respectively the state, action, and reward at timestep $t$ . The aim is to learn a policy $π$ that maximises the long-term expected reward $E [\sum_{t = 0}^{+ \infty} γ^{t} r_{t}]$ .

Goal-conditioned MDP

We note that every MDP can be augmented into a goal-conditioned MDP with a goal space $G$ and an initial goal distribution $ρ_{g}$ . At each timestep, the observation is augmented with a goal and the reward function depends on this goal. A goal-conditioned policy kaelbling1993learning, denoted $π (\cdot | \cdot, \cdot)$ also depends on the goal.

2.2 Related Work

Exploration in RL can be divided into three types ladosz2022exploration: random exploration, intrinsic rewards-based methods, and goal-based methods.

Random exploration

In random exploration, the agent does not follow a specific exploration approach. The agent’s actions can be purely random, or in the continuous case, augmented by exploration noise parametrized by the current state haarnoja2018soft or not lillicrap2016continuous.

Intrinsic rewards-based methods

Intrinsic rewards-based methods are inspired by intrinsic motivation from cognitive science oudeyer2009intrinsic. They consist in adding an additional reward (called intrinsic) to the reward signal from the environment (called extrinsic). This intrinsic reward is intended to foster exploration. It can be based on the state visitation count bellemare2016unifying; machado2020count or on the prediction error of a model learned from the collected data houthooft2016vime; pathak2017curiosity; achiam2017surprise; pathak2019self; burda2019exploration; tao2020novelty.

Goal-based methods

Methods that modify the reward of the environment provide no mechanism for distilling the knowledge gained from visiting various states. Agents visit new states, and they quickly forget them when other states become newer. In contrast, recent work suggests the use of a goal-conditioned autotelic agent specifically trained for the exploration task allows to use this knowledge for the realization of new user-specified goals levine2021understanding; colas2022autotelic. The reward signal is ignored during the exploration phase. After the exploration phase, the data collected by the agent are used to learn one or more subsequent tasks jin2020reward. In goal-based methods, the agent is conditioned by a goal that is used to guide exploration towards unknown areas. These methods rely on a goal generator to create goals for the agent. We divide goal-based methods into two categories: exploratory goal methods and goals to explore from methods.

Exploratory goal methods follow the intuition that the agent discovers new areas of the observation space by pursuing goals that have been little or not achieved before. The challenge of these methods is to choose the goal to be neither too easy nor too hard. The literature contains several ways to approach this trade-off. Some methods sample goals that either maximize Learning Progress colas2019curious; portelas2020teacher or value disagreement zhang2020automatic. Other methods sample goals from the least visited areas using a parametric density model on the visited states pong2020skew. It is also possible to imagine goals that have never been reached using a language model colas2020language, a generative model racaniere2020automated or a GAN florensa2018automatic.

In goals to explore from methods the agent sample a goal from previously visited states. It returns to it, either by teleportation ecoffet2019go; matheron2020efficient, or with goal-conditioned policy ecoffet2021first. The challenge of these methods is to choose a goal that is of high exploratory interest. Similarly, some methods estimate the density of the encountered states, either with parametric methods pitis2020maximum or with non-parametric methods ecoffet2021first; matheron2020efficient, to target the low-density areas.

In summary, methods based on a goal reaching policy should facilitate scalable RL. The stunning results of Go-Explore illustrate this point but remain circumscribed to few environments and require a lot of domain knowledge to work. By bridging with concepts already used in the intrinsic reward literature, we show a way to make this approach more general and simple.

3 Latent Go-Explore

Input: The number of iterations in the exploration phase $T$ .
Initialize: Replay buffer $D = \emptyset$ ; Encoding module; Goal-conditioned policy $π$

1: while

t < T

2: Sample a final goal state with Equation (3).

3: Build a subgoal trajectory

τ^{g}

using Equation (4).

4: Initialize the subgoal index:

i \leftarrow 0

5: while the last goal of

τ^{g}

is not reached do

6: Collect interaction using

π (\cdot | \cdot, τ_{i}^{g})

and store it into dataset

D

7: if subgoal

τ_{i}^{g}

is reached, i.e.

| | ϕ (s_{t}) - ϕ (τ_{i}^{g}) | | < d

then

8: Move to the next subgoal:

i \leftarrow i + 1

9: end if

10: end while

11: Explore until the end of the episode with any exploration strategy.

12: Update policy

π

with any off-policy algorithm and HER.

13: Every

u p d a t e_e n c o d e r_f r e q

timesteps, update encoder

ϕ

with any representation learning algorithm.

14: end while

15: return the goal-conditioned policy

π

and the dataset

D

Algorithm 1 LGE

LGE meets the definition of the return-then-explore family of algorithms. First, a final goal state is sampled from the replay buffer, then the agent learns a goal-conditioned policy to reach this goal. When the agent reaches the goal, the agent starts to explore. LGE learns a latent representation of observations and samples the goal pursued by the goal-conditioned agent in priority in low latent density areas. In Section 3.1, we present how the latent representation of observations is learned. In Section 3.2, we show how the latent density is estimated and how the final goal state pursued by the agent is sampled. Finally, in Section 3.3, we show how to build a subgoal trajectory from the final goal to increase the agent’s performance, in particular in far-away goal situations. The pseudo-code of the resulting algorithm is presented in Algorithm 1.

3.1 Learning a latent representation

The literature contains several latent representation learning methods for RL. Learning such a representation is orthogonal to our approach. Thus, LGE can be combined with any learning method without the need for further modifications. Choosing the best representation learning method given the environment is out of the scope of this paper.

In this paper, we present two methods of representation learning that have been found to work well with our test environments. These methods are inspired by the literature on intrinsic reward-based methods.

(a) Inverse dynamic to learn representation.

Inverse dynamic representation learning

pathak2017curiosity proposed an intrinsic reward calculated based on the agent’s prediction error of the consequence of its own actions. The representation is learned using two submodules. The first encodes the observation into a latent representation $ϕ (s_{t})$ . The second takes as input $ϕ (s_{t})$ and $ϕ (s_{t} + 1)$ and outputs the prediction of the action taken by the agent at time step $t$ . The parameters $θ$ of the inverse model $P_{θ}^{i n v}$ are optimized by minimizing the loss function:

L = \frac{1}{| N |} \sum (s_{t}, a_{t}, s_{t + 1}) \sim D \frac{1}{2} ∥ a_{t} - P_{θ}^{i n v} (s_{t}, s_{t + 1}) ∥_{2}^{2}

(1)

The inverse dynamics representation learning allows getting a latent representation of the states containing only the aspects of the state on which the agent can have an influence.

Forward dynamic representation learning

In achiam2017surprise, the intrinsic reward is calculated based on the prediction error of a model approximating the transition probability function of the MDP. Two submodules are used. The first one encodes the observation to a latent representation $ϕ (s_{t})$ . The second takes as input $ϕ (s_{t})$ and $a_{t}$ and outputs the prediction of the next state ${^s}_{t + 1}$ . The model parameters $θ$ are optimized by minimizing the loss function:

L = - \frac{1}{| N |} \sum (s_{t}, a_{t}, s_{t + 1}) \sim D log P_{θ} (s_{t + 1} ∣ s_{t}, a_{t})

(2)

3.2 Density estimation for intrinsic goal sampling

The success of the proposed method relies on the agent’s ability to generate for itself goals that it will be able to reach and then explore from there. For the agent to progress in the exploration of the environment, these goals must be at the edge of the yet unexplored areas. To identify these areas, we use an estimator of the density of latent representations of the encountered states. Moreover, we require the goal to be reachable. The set of reachable states is a subset of the state space that we assume to be unknown. The easiest way to satisfy the previous requirement is therefore to sample among the states that have already been reached.

We estimate the density of latent representations of the encountered states (called latent density) using the particle-based entropy estimator originally proposed by kung2012optimal and used in the literature on intrinsically motivated RLs liu2021behavior; liu2021active. This estimator has the advantage of being nonparametric and thus does not hinge on the learning capabilities of a learned model. Appendix C describes the details of the implementation of this estimator, denoted $^f$ .

Denote $d_{i} =^f (ϕ (s_{i}))$ the estimation of the latent density of $s_{i}$ . Following vaart1998asymptotic, we denote the order statistics of $d_{1}, d_{2}, \dots, d_{N}$ by $d_{N, (1)} ⩽ d_{N, (2)} ⩽ \dots ⩽ d_{N, (N)}$ , and denote the order statistic vector by $X_{N, (\cdot)}$ . Let the rank $R_{N, i}$ be the position of $d_{i}$ in the order statistic, so in the absence of ties, $d_{i} = d_{N, (R_{N, i})}$ . Therefore, $R_{N, i}$ refers to the rank of $s_{i}$ in the latent density sorting of the encountered states.

The sampling of the final goal state follows a geometric law on the rank in the latent density sort. The probability to draw $s_{i}$ as the final goal state is

P (G = s_{i}) = (1 - p)^{R_{N, i} - 1} p

(3)

Figure 3: PandaNoTask-v2, a reward-free robotics environment.

where $G$ is the random variable corresponding to the final goal state, and $0 ⩽ p ⩽ 1$ is a hyperparameter controlling the bias towards states with a low latent density.

The representation is jointly learned with the exploration of the agent. Therefore, the latent density must be regularly recomputed to take into account the most recent representation on the one hand, and the recently visited states on the other hand. However, considering the slow evolution of this value, we choose to recompute the latent density only once every 5000 timesteps. This allows us to significantly reduce the computation needs while having a low empirical impact on the results.

3.3 Subgoal trajectory

As learning progresses, the sampled final goal states are increasingly distant. However, reaching a distant goal is challenging because it implies a sparse reward problem.

To overcome this problem, we condition the agent to successive intermediate goals $τ^{g} = (g_{0}, g_{1} . . ., g_{L})$ that should guide it to the final goal state $g_{L}$ . These intermediate goals are chosen from the trajectory that led the agent to the final goal state $(s_{0}, s_{1}, \dots, s_{T})$ .

The trajectory that led the agent to the final goal state is unlikely to be optimal. Plus, if the agent is conditioned by the whole trajectory, it may fail to reach all of them, even though some of them may not be necessary to reach the final goal state. To allow the agent to find a better path to the final goal state, we remove some subgoals from this trajectory. To decide whether a subgoal should be removed from the trajectory, we evaluate the latent distance to the previous subgoal. If the distance is less than the threshold, then the goal is removed.

\forall i ⩽ L - 1, | | ϕ (g_{i}) - ϕ (g_{i + 1}) | | > d

(4)

In contrast to Go-Explore, we do not use the best known trajectory that leads to the sampled goal area (cell). The main reason is that the best known trajectory may be particularly difficult to reproduce, due to the dynamics and the stochasticity of the environment. Consider a game in which two paths exist to a target position. The first one is short, but the environment kills the agent 9 times out of 10. The second one is longer but succeeds every time. If the agent always chooses the shorter path (like Go-Explore), it will fail to reach the target most of the time.

Once the goal is reached, the agent explores using any exploration strategy. For the sake of simplicity, we choose a random exploration strategy for our experiments. We also impose that the agent repeats the previous action with a probability of 90%. This technique has been shown to increase the results significantly ecoffet2021first.

4 Experiments

To demonstrate the effectiveness of our method, we apply it to a range of pure exploration tasks. We focus on environments for which naive random exploration is not sufficient to explore the rich variety of reachable states.

We compare the results obtained with LGE with the results obtained using several algorithms based on intrinsic curiosity and others based on goal-directed strategies.

In terms of infrastructure, each run was performed on a single worker machine equipped with one CPU and one NVIDIA® V100 GPU + RAM 16 Gb of RAM.

4.1 Environments

Continuous maze

We train an agent to navigate in a continuous 2D maze. The corresponding configuration is shown in Figure 5. The agent is initially at the center of the maze. At each timestep, the agent receives the current coordinates as an observation and chooses an action that controls its location change. If the agent collides with a wall, it returns to its previous position. The reachable space is a square of $12 \times 12$ and the agent’s action is limited to $[- 1, 1]$ horizontally and vertically. The agent can interact 100 times with the environment, after which the episode ends and the agent returns to its initial position in the center of the maze.

Robotics environment

Robotic environments are interesting and challenging application cases of RL, especially since the reward is often sparse. We simulate a Franka robot under the PyBullet physics engine using panda-gym gallouedec2021pandagym. Figure 3 depicts a rendering of the setup. The robot can move and interact with an object. The agent has access to the position of the end-effector and the position of the object, as well as to the opening of the gripper. The agent interacts 50 times with the environment and then the object and the robot arm are reset to their initial position.

4.2 Baselines

Random exploration

Most RL methods from the literature do not follow any specific exploration strategy. In a reward-free context, the performance of the latter is often equivalent to a random walk. We will take as a baseline a random agent, whose actions are uniformly sampled over the action space at each timestep, Soft Actor-Critic (SAC; haarnoja2018soft) and Deep Deterministic Policy Gradient (DDPG; lillicrap2016continuous).

Intrinsic reward-based exploration

In this paper, we take as reference two widely used intrinsic reward-based methods combined with either SAC or DDPG. These methods stand out from the others because, despite their simplicity, they have demonstrated good performance on a wide variety of tasks.

Figure 4: Comparison of the space coverage of the maze environment. Each experiment is run 10 times. The left plot represents the space coverage across timesteps. The solid lines are the interquantile means and the shaded areas are the 95% confidence intervals. The right plot is the final performance profile (higher is better).

Intrinsic Curiosity Module (ICM; pathak2017curiosity): the intrinsic reward is computed as the mean square error between the true latent representation and the one predicted by a learned dynamic model given the action taken. The encoder is trained jointly with an inverse dynamics model.
Surprise achiam2017surprise: the intrinsic reward is the approximation of the KL divergence between the actual transition probabilities and a learned transition model.

Goal-directed exploration

We use the following algorithms as a baseline.

Go-explore ecoffet2021first: the agent discretizes the state space into cells, selects the least visited ones, returns to them using a goal-conditioned policy, and further explores from there. We use a constant cell generation strategy during learning, like the one proposed in the original paper for robotic tasks. Go-Explore is the closest baseline to our algorithm. Appendix D details the differences between LGE and Go-Explore.
Diversity Is All You Need (DIAYN; eysenbach2019diversity): the agent is conditioned by a skill and a discriminator predicts the skill pursued by the agent. The more the discriminator predicts with certainty the skill pursued, the bigger the reward. Conjointly, the discriminator is trained to maximize the distinguishability of skills.
Skew-Fit pong2020skew: the agent’s goal sampling is skewed to maximize the entropy of a density model learned on the achieved states.

For the goal-directed methods, we use Hindsight Experience Replay (HER; andrychowicz2017hindsight) relabeling which has shown to significantly increase learning.

To nullify the variation in results due to different implementations, we implement all algorithms in the same framework: Stable-Baselines3. raffin2021stable.

The set of intrinsic reward-based methods and goal-directed methods are underpinned by the same off-policy algorithm. The hyperparameters for this algorithm are identical. For the Maze environment, we use SAC, while for the robotic environment, we use DDPG as it gives better results for all methods. To negate the influence of a bad choice of hyperparameter on the results, the method-specific hyperparameters are optimized. Appendix A details the optimization process for all methods and the resulting hyperparameters.

4.3 Measuring the exploration

In this paper, we focus on the agent’s ability to explore its environment in a pure exploration context, i.e. in the absence of extrinsic reward. This step is particularly important because, in the case of an environment with very sparse rewards, the agent can interact a large number of times with the environment without getting any reward. It is therefore necessary to follow an efficient exploration strategy to discover the few areas of the state space where the agent can get a reward. To be able to compare the results obtained by different methods in this context, it is necessary to use a common metric for the quality of exploration.

The literature uses many different metrics. Some papers use the average reward on a hard-exploration task taiga2020on, the zero-shot performance on a predefined task sekar2020planning or monitor specific identifiable events in the environment that indirectly informs the degree of exploration gulcehre2020making. We argue that these indirect measures are unsatisfactory because they rely on the subsequent learning ability of an online and offline agent respectively.

Figure 6: Comparison of exploration with the robotic environment. Each experiment is run 10 times. The left plot represents the number of explored bins across timesteps. The solid lines are the interquantile means and the shaded areas are the 95% confidence intervals. The right plot is the final performance profile (higher is better).

For the discrete case, the authors use the number of unique states encountered (menard2021fast). However, we are working in a continuous framework and must therefore extend this idea by discretizing the continuous space. For both environments, we discretize the state space into bins. For the maze environment, we use a 24 x 24 grid. The possible positions being bounded and known, we report the coverage rate of the space. For the robotic environment, we use the position of the object and the gripper position that we discretize into 0.1 m bins in each dimension. We report the raw number of explored bins. Although arbitrary, we observe empirically that the choice of the discretization parameter has hardly any impact on the results obtained.

Following the guidelines of agarwal2021deep, we use for all plots in this paper the interquartile mean (IQM) with the 95% confidence interval. We also plot the performance profile, which is the empirical tail distribution function of the score at the end of the training.

4.4 Main Results

The exploration results for the Maze environment are presented in Figure 4. A rendering of the positions explored by the agent is presented in Figure 5. We note that only LGE and Go-Explore significantly outperform the results obtained with random exploration. This demonstrates the effectiveness of the return-then-explore paradigm in this environment. We note that exploration based on intrinsic curiosity does not yield significantly better results than those obtained by random exploration. We hypothesize that this is due to the simple dynamics of the environment that quickly converges the intrinsic reward to $0.0$ . Surprisingly, neither Skew-Fit nor DIAYN performs significantly better than random exploration. For DIAYN, we find that most of the skills were concentrated in the initial position area of the agent. Finally, we note that, although the cell size has been optimized, LGE significantly outperforms Go-Explore. LGE manages to cover almost the entire reachable space at the end of the runs while exhibiting low variability in the results.

The exploration results for the robotic environment are presented in Figure 6. We notice that LGE significantly outperforms all other methods. Notably, Go-Explore performs only slightly better than random exploration. We note that Go-Explore does not learn to grasp the object throughout the learning process. The results presented on a robotic environment by ecoffet2021first are much better. We presume this is mainly due to the meticulous work done on the state space examination and the induced cell design. Here, we use a naive grid-like cell design. Although the grid parameter is optimized, it does not yield good exploration results with this environment. We thus demonstrate the benefit of using a learned representation to automatically capture important features of the environment’s dynamics.

4.5 Ablation study

We conduct several ablation studies to measure the contribution of each component in LGE. Specifically, we test the three following variants:

LGE without further exploration: the environment is reset once the agent reaches the final goal instead of performing the exploration random interactions. Exploratory goal methods (see Section 2.2) are designed that way.
LGE without skewing the final goal distribution: instead of skewing the final goal distribution in favor of low latent density areas, the final goals are sampled uniformly among the reached goals.
LGE without subgoals trajectory reduction: the agent is directly conditioned by the final goal state.

We perform these ablation studies on the maze environment. We use the same hyperparameters as those used in Section 4.4. The results are shown in Figure 7.

Figure 7: Result of ablation study on the maze environment. Each experiment is run 10 times. The left plot represents the space coverage across timesteps. The solid lines are the interquantile means and the shaded areas are the 95% confidence intervals. The right plot is the final performance profile (higher is better).

Three ablations studied have a substantial impact on the results. We show that the exploration that follows the reaching of a goal is crucial: without it, the agent reaches the frontier of its knowledge but has little chance to go beyond it. We also show that promoting the sampling of low latent density goals can significantly increase the results. In doing so, we guide exploration towards states with high exploratory value. Finally, we show that the agent faces significant difficulties in reaching distant goals. Conditioning the agent by successive subgoals greatly improves its exploration.

5 Discussion

5.1 Limitation and future work

The initial state must remain the same across episodes

The approach we propose is based on the assumption that the agent is always initialized in the same state. This assumption guarantees that at the beginning of each episode, all the states previously reached are reachable and that the subgoal trajectory starts with the initial state of the agent. However, in some environments, especially in procedurally generated environments, this assumption is not fulfilled kuttler2020nethack. In this situation, trying to follow the subgoal trajectory may be counterproductive in reaching the final goal. It is also possible that the final goal is not even reachable. However, note that even the pursuit of an unreachable goal can foster exploration. Future work should investigate how to tailor the algorithm to overcome this problem. We believe that an approach inspired by generative networks such as racaniere2020automated may be appropriate.

Goal-achievement functions

In LGE, an agent is considered to have reached a goal (whether final or intermediate) when the latent distance between its state and the goal is below a specified threshold. This is a naive way of defining a goal achievement function colas2022autotelic that depends crucially on the latent representation. We believe that the results could be improved by envisioning a more informative and suitable goal achievement function for our method.

High-dimension environments and representation learning

The main contribution of our work consists in the generalization of the Go-Explore approach by using a latent representation. LGE can be applied without further adaptation to high-dimensional environments. Our approach is particularly useful in environments where the observation is an image, and for which learning directly from the raw pixels is not desirable. Thus, representation learning is the keystone of the method. We provide a proof of concept for a forward and inverse model. We believe that these results can be greatly improved by choosing more finely the representation learning method for each environment by taking advantage of the many works dealing with this subject lesort2018state. Future work should confirm this intuition.

5.2 Conclusion

We introduce LGE, a new exploration method for RL. In this method, our agent explores the environment by selecting its own goals based on a jointly learned latent representation. LGE can be used as pre-training in environments where rewards are sparse or deceptive. Our main contribution is to generalize the Go-Explore algorithm, allowing us to benefit from representation learning algorithms for exploration. We present statistically robust empirical results conducted in a maze environment and in a robotic environment. These results show that our approach significantly improves exploration performance.

Acknowledgments

This work was granted access to the HPC resources of IDRIS under the allocation 2022-[AD011012172R1] made by GENCI.

References

Appendix A Hyperparameters

To limit the impact of the large variability of results depending on the hyperparameters, we chose to optimize the hyparameters for each experiment. We selected 100 unique sets of hyperparameters from a search space presented in Table 1 using Optuna akiba2019optuna. For hyperparameter set, we train the model with 3 different seeds and keep the median score.

Method	Hyperparameter	Search space	Maze	Robotic
LGE	Latent distance threshold	$[0.1, 0.2, 0.5, 1.0, 2.0]$	$1.0$	$1.0$
	Latent dimension	$[4, 8, 16, 32, 64]$	$16$	$8$
	Geometric parameter	$[0.001, 0.002, 0.005, 0.01, 0.02, 0.05]$	$0.05$	$0.01$
Go-Explore	Cell size	$[0.2, 0.5, 1.0, 2.0, 5.0]$	$2.0$	$0.2$
ICM	Scaling factor	$[10^{- 3}, 10^{- 2}, 10^{- 1}, 10^{0}, 10^{1}, 10^{2}]$	$10^{- 1}$	$10^{- 2}$
	Actor loss coefficient	$[10^{- 3}, 10^{- 2}, 10^{- 1}, 10^{0}, 10^{1}, 10^{2}]$	$10^{2}$	$10^{- 3}$
	Inverse loss coefficient $^{a}$	$[10^{- 3}, 10^{- 2}, 10^{- 1}, 10^{0}, 10^{1}, 10^{2}]$	$10^{1}$	$10^{- 3}$
	Forward loss coefficient $^{a}$	$[10^{- 3}, 10^{- 2}, 10^{- 1}, 10^{0}, 10^{1}, 10^{2}]$	$10^{1}$	$10^{2}$
Surprise	Feature dimension	$[2, 4, 8, 16, 32]$	2	16
	Desired average bonus	$[10^{- 2}, 10^{- 1}, 10^{0}, 10^{1}]$	$10^{- 2}$	$10^{- 2}$
	Model train frequency	$[2, 4, 8, 16, 32, 64, 128]$	64	8
	Model learning rate	$[10^{- 6}, 10^{- 5}, 10^{- 4}, 10^{- 3}, 10^{- 2}]$	$10^{- 5}$	$10^{- 5}$
DIAYN	Number of skills	$[4, 8, 16, 32, 64, 128]$	32	32
Skew-Fit	Number of models	$[5, 10, 20, 50, 100, 200]$	50	50
	Density power	$[- 5.0, - 2.0, - 1.0, - 0.5, - 0.2, - 0.1]$	$- 1.0$	$- 0.2$
	Number of pre-sampled goals	$[64, 128, 256, 512, 1024, 2048]$	64	128
	Success distance threshold	$[0.05, 0.1, 0.2, 0.5, 1.0]$	0.5	0.2

$^{a}$ In the original paper, the sum of the forward and inverse loss coefficients is $1$ . We get better results without this constraint.

Table 1: Search space and resulting hyparparameters after optimization

The method-specific parameters that have not been optimized are presented in Table 2.

Method	Hyperparameter	Value
LGE	Encoder module train frequency	Every 1000 timesteps
	Learning rate	0.001
	Batch size	32
	Gradient steps	500
	Exploration strategy	Random
	Repeat action probability	0.9
Go-Explore	Exploration strategy	Random
	Repeat action probability	0.9
ICM	Feature size	16
	Networks	$[64, 64]$
	Activation function	ReLU
Surprise	Networks	$[64, 64]$
	Activation function	ReLU
DIAYN	Discriminator networks	$[256, 256]$
	Activation function	ReLU
Skew-Fit	Gradient steps	100
	Batch size	2048
	Learning rate	0.01

Table 2: Hyperparameters specific to each method. Their value is identical for all experiments and have not been optimized.

The hyperparameters used for the off-policy agent are identical for all algorithms. They are presented in Table 3.

Hyperparameter	SAC	DDPG
Networks	$[300, 400]$	$[300, 400]$
Learning rate	$3 \times 10^{- 4}$	$10^{- 3}$
Learning starts after	100 timesteps	100 timesteps
Batch size	256	100
Polyak update coefficient ( $τ$ )	0.005	0.005
Discount factor ( $γ$ )	0.99	0.99
Target entropy	2.0	N/A
Train frequency	1 timestep	1 timestep
Gradient steps	1	1
HER sampling probability	0.8	0.8
HER relabeling strategy	Future	Future

Table 3: Hyperparameters of the off-policy agent. These hyperparameters are identical for all methods and for all experiments. The hyperparameters related to HER relabeling only apply to the methods for which the agent is goal-conditioned (DIAYN, Go-Explore, Skew-Fit and LGE).

Appendix B On the criticality of cell representation in Go-Explore

In Go-Explore, similar observations are grouped into cells and each cell encountered is stored in an archive. The cell representation is a critical aspect of Go-Explore. In the Montezuma’s Revenge environment, a slight variation in cell representation results in an order of magnitude difference in the results. The cells are used to (1) estimate the density of states encountered in the observation space and sample a target cell against it; (2) divide this goal reaching task into a sequence of subgoals.

We argue that building a cell representation to capture the relevant components of an environment to perform the desired task requires a significant amount of domain knowledge. In general, this cell representation cannot be generalized to other tasks or to other environments.

To support our claim, we present in Figure 8 the space coverage in a continuous maze for different cell design. We show that even in this simple environment, a small variation in cell design has a significant impact on the result.

On the left, the cells are small, and the agent must visit each of them. If the agent interacts long enough with the environment, it should eventually explore the whole space. On the right, the cells are large. We can observe some detached areas, because the agent has not visited the cell enough to discover the next one, but enough so that this cell is no longer listed as a target cell.

Appendix C On the density estimation

Let $s_{1}, \dots, s_{n}$ be a sample of event locations in a $R^{d}$ -space. Assume that the event location $s$ follows a common distribution with density function $f (s)$ . For any sample $s_{i}$ and $s_{j}$ in this sample, assume that $D_{i} (s_{j}) = | | s_{i} - s_{j} | |$ , denotes the euclidian distance between $s_{i}$ and $s_{j}$ . For any $k ⩽ n$ , let $D_{(k)} (s_{i})$ be the distance with $k$ -th nearest neighbors of $s_{i}$ with respect to the euclidian distance.

kung2012optimal propose an optimal unbiased estimator $^f$ for the density:

^f = \frac{k U_{k}^{*}}{k - 1}

(5)

where

U_{k}^{*} = \frac{(k - 1)}{n C_{d} D_{(k)}^{d}}

(6)

and

C_{d} = \frac{π^{d / 2}}{Γ (d / 2 + 1)}

(7)

Thus we have

^f = \frac{k}{n C_{d}} D_{(k)}^{- d}

(8)

We follow the recommendation of kung2012optimal to take

k = 2 n^{1 / d}

(9)

Appendix D Comparing Go-Explore and LGE

The Go-Explore algorithm as presented by ecoffet2021first has many components. All these components allow to obtain good results on test environments. In this article, we implement our own version of Go-Explore. We have tried to stick as much as possible to the initial implementation and to improve some aspects. We keep the essence of Go-Explore, but our implementation is not intended to be equivalent to the initial implementation. The main goal here is to compare LGE and Go-Explore. Thus, the two implementations differ only in the elements that make them unique. To the best of our knowledge, all the composnts that we did not implement are compatible with LGE. It is likely that they improve LGE and Go-Explore in a similar way. In this section, we describe the implementation of LGE and Go-Explore. We explain their differences if any.

Policy-based Go-Explore

The initial implementation of Go-Explore distinguishes between the case where the environment can be reset to any desired state and the case where this is not possible. In this paper, we choose the general setting where the environment can’t be reset to any desired state, and we therefore work with the so-called policy-based implementation of Go-Explore.

Exploration after returning

In the original implementation of Go-Explore, once a cell is returned, exploration proceeds with random actions for a certain number of timesteps. For both LGE and Go-Explore, we set this number of timesteps to 50 for all environments. Note that the agent can interrupt this exploration beforehand if the maximum number of interactions with the environment is reached. ecoffet2021first shows that action consistency generally allows for more effective exploration, especially in the robotic environment. For LGE and Go-Explore, we use the same trick: the agent chooses the previous action with a probability of 90%, and uniformly samples an action with a probability of 10%.

Cell design

The original implementation of Go-Explore provides two methods for generating the cell representation.

When the observation is an image, the observation is grayscaled and downscaled. The image produced is the cell representation. The parameters to get this representation (downscaling width and height and number of shades of gray) are optimized during training to maximize an objective function that depends on a target split factor.
When the observation is a vector, each component of the vector is discretized separately by hand before learning.

For all the environments presented in this paper, we use a naive method of cell generation corresponding to a discretization of the observation. The granularity of the discretization is a hyperparameter. The choice of this hyperparameter is crucial, we develop it in more detail in the appendix B.

Goal-conditioning

In the original implementation of Go-Explore, the agent is conditioned by the cell representation of the goal. We note that this representation can vary during learning, and even in size (see previous paragraph). It is not clear how to structure the agent’s network when the size of the input varies during the learning process.

In our implementation, we choose to condition the agent by the goal observation rather than by the representation of its cell. We also condition the agent by the goal observation in LGE.

RL agent

In the original implementation of Go-Explore, the goal-conditioned agent is based on the on-policy PPO algorithm schulman2017proximal. For both LGE and Go-Explore, we rather chose to use an off-policy algorithm (SAC or DDPG) to use a Hindsight Experience Replay (HER, andrychowicz2017hindsight) relabelling, which has shown to perform better in a sparse reward environment.

Reward

In the original implementation of Go-Explore, the agent gets $+ 1$ reward for reaching intermediate cells and $+ 5$ reward for reaching the final cell of a path. The rest of the time, it receives a $0$ reward. For both LGE and Go-Explore, following the suggestions of tang2021hindsight, we rather choose the following structure for the reward: the agent gets a reward of $0$ for a success (target cell reached for Go-Explore and latent distance with the goal state below the distance threshold for LGE) and $- 1$ the rest of the time.