C3PO: Learning to Achieve Arbitrary Goals
via Massively Entropic Pretraining

Alexis Jacq*
Manu Orsini*
Gabriel Dulac-Arnold
Olivier Pietquin
Matthieu Geist
Olivier Bachem
Google Research
October 2022

Introduction

Reinforcement learning (RL) has shown great results when optimizing for a single reward function, that is, when a controller has to solve a specific task that is known beforehand. If the task is not known a priori, or is likely to be re-configured often, then re-training a new policy from scratch can be very expensive and wasteful. For multipurpose systems deployed in contexts where they will likely be required to perform a large range of tasks, it makes sense to invest significant resources upfront into training a high-performance, general goal-based controller. We propose an approach for training a universal goal-achievement policy: a policy able to reach any arbitrary state the system can attain.

Our proposed approach, Conditioned Continuous Control Policy Optimization (C3PO), is based on the hypothesis that disentangling the exploration phase from the policy learning phase can lead to simpler and more robust algorithms. It is composed of two steps: (1) a goal discovery step, which explores the environment to collect a diverse set of reachable states, and (2) a policy learning step, which trains a goal-conditioned policy to reach any of these states.

To address the goal discovery step in C3PO, we propose the Chronological Greedy Entropy Maximization (ChronoGEM) algorithm, designed to exhaustively explore reachable states, even in complex high dimensional environments. ChronoGEM does not rely on any form of trained policy and thus doesn't require any interaction with the environment to learn to explore. Instead it uses a highly-parallelized random-branching policy to cover the environment, whose branching tree is iteratively re-pruned to maintain uniform leaf coverage. This iterative pruning process leverages learnt density models and inverse sampling to maintain a set of leaf states that are as uniform as possible over the state space. Training the goal-conditioned policy is then performed by leveraging the uniform states generated by ChronoGEM as a dataset of goals that provide well-distributed coverage over achievable states.

Conditioned Continuous Control
Policy Optimization (C3PO)

Massively Entropic Pre-Training. As described above, the first step is to discover the set of achievable goals. This collection is key to the effectiveness of the resulting policy: we want it to be as uniform as possible so that no reachable region is neglected. Therefore, without any prior, the ideal set of goals should be uniformly sampled from the manifold of states that are reachable in a given number of steps (T). Since the shape of that manifold is unknown and can be arbitrarily complex, such direct sampling is impossible.

However, it is possible to approximate such a sampling if enough states are simultaneously explored at the previous time step (T-1). Assume we are able to sample N states that approximate the uniform distribution at time T-1. Then, from each of these states, playing K uniform actions to obtain NK next states would not lead to a uniform sampling over the possible next states. However, with N large enough, it would at least induce a good coverage of the set of reachable next states. Let \rho_T be the distribution induced by these NK next states. Since the set of states achievable in T steps is necessarily bounded (at least in any realistic environment), and given that we are able to closely estimate \rho_T, we can sub-sample these states with probability proportional to the inverse density \frac{1}{\rho_T} in order to approximate a uniform sampling.
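As an illustration, here is a minimal NumPy sketch of this inverse-density sub-sampling step; the density model is passed in as a black-box `density_estimate` function, which stands in for whatever learned density model is used in practice.

```python
import numpy as np

def inverse_density_subsample(states, density_estimate, n_keep, rng=None):
    """Sub-sample `n_keep` states with probability proportional to 1 / rho_T(state).

    `density_estimate` maps a (num_states, state_dim) array to positive densities;
    here it is a placeholder for the learned density model.
    """
    rng = rng or np.random.default_rng(0)
    rho = density_estimate(states)               # estimated density of each candidate
    weights = 1.0 / np.clip(rho, 1e-12, None)    # inverse-density weights
    probs = weights / weights.sum()              # normalize into a distribution
    idx = rng.choice(len(states), size=n_keep, replace=False, p=probs)
    return states[idx]
```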

ChronoGEM. This suggests a recursive approach to approximate a uniform sampling of the states reachable in T steps: start with states sampled from the environment's initial distribution \rho_0, play uniform actions, sub-sample to get a highly covering set that approximates a uniform distribution, re-explore actions from that new set, and so on, for T iterations. We call this process ChronoGEM (Chronological Greedy Entropy Maximization) since, at a given step, it only focuses on maximizing the entropy of the next step by directly approximating a uniform distribution over it, without further planning.
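Putting the pieces together, the loop below is a hedged end-to-end sketch of ChronoGEM, reusing `inverse_density_subsample` from the previous snippet. The batched environment interface (`reset_batch`, `step_batch`, `action_dim`) and the `fit_density_model` routine are hypothetical stand-ins for the Brax machinery; the hyperparameter values used in the experiments are given further below.

```python
import numpy as np

def chronogem(env, n_states, branching_k, horizon, fit_density_model, rng=None):
    """Sketch of ChronoGEM: maintain `n_states` states that approximately cover,
    as uniformly as possible, the set of states reachable in `horizon` steps.

    Hypothetical interfaces:
      env.reset_batch(n)              -> (n, state_dim) states sampled from rho_0
      env.step_batch(states, actions) -> (n, state_dim) next states, one per row
      env.action_dim                  -> dimension of the action space
      fit_density_model(states)       -> callable estimating the density of a batch of states
    """
    rng = rng or np.random.default_rng(0)
    states = env.reset_batch(n_states)                     # approximate samples from rho_0
    for _ in range(horizon):
        # Branch: play K uniform actions from each of the N current states.
        repeated = np.repeat(states, branching_k, axis=0)  # (N*K, state_dim)
        actions = rng.uniform(-1.0, 1.0, size=(len(repeated), env.action_dim))
        candidates = env.step_batch(repeated, actions)     # samples distributed as rho_T
        # Prune: keep N candidates, sampled with probability proportional to 1 / rho_T.
        density = fit_density_model(candidates)
        states = inverse_density_subsample(candidates, density, n_states, rng)
    return states  # used as the training goal set
```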

Continuous maze. As a toy scenario to test ChronoGEM, we implemented a two-dimensional continuous maze in which actions are (d_x, d_y) steps bounded by the [-1, 1]^2 square, and the whole state space is bounded by the [-100, 100]^2 square. The starting point is at the center of the square. This maze is particularly hard to fully explore, as two narrow corridors need to be crossed in order to reach the upper-left room. The main goal of this experiment is to verify that even in such a challenging toy environment, ChronoGEM still manages to induce a uniform distribution over the whole state space. In order to emphasize the relative difficulty of exploring this maze, we also ran the exploration baselines RND and SMM and compared the obtained state coverages.
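For concreteness, a minimal version of such a maze environment, compatible with the batched interface assumed in the ChronoGEM sketch above, could look like this (the wall layout is omitted for brevity, so this sketch only captures the bounds and the start state):

```python
import numpy as np

class ContinuousMaze:
    """Toy 2D maze: states are (x, y) in [-100, 100]^2, actions are (dx, dy) in [-1, 1]^2.

    The wall geometry is omitted here; the real maze blocks moves that would cross a
    wall, which is what makes the upper-left room hard to reach.
    """
    action_dim = 2

    def reset_batch(self, n):
        return np.zeros((n, 2))                       # all episodes start at the center

    def step_batch(self, states, actions):
        actions = np.clip(actions, -1.0, 1.0)
        next_states = np.clip(states + actions, -100.0, 100.0)
        # A full implementation would reject moves that intersect a wall segment.
        return next_states
```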

Evolution of the discretized state visitation when taking the final states of 4000 episodes (a cell is colored if at least one trajectory ends in it), for SMM, RND and ChronoGEM. Both RND and SMM fail to pass through the first corridor; only ChronoGEM manages to visit states in the upper-left room.

Goal-conditioned training. To build C3PO, we modified Brax's implementation of SAC to take a set of goals as input and train to reach them. The reward is the negative of the maximum Euclidean distance between a body part (e.g. an arm, the torso, a leg) and its goal position. As a result, the policy is encouraged to move to the correct location and then match the correct pose. The goal is added to the observation as a position relative to the current position. We say that a goal is reached if the Euclidean distance between the agent's state and the goal is smaller than a tolerance threshold \epsilon. In other words, an episode \mathcal{E} is successful when its closest state to the goal is close enough: \text{success}(\mathcal{E}\vert g) \Leftrightarrow \min_{s\in \mathcal{E}}\Vert s - g\Vert_2 < \epsilon. We set the environment to stop an episode as soon as it is successful, or when it exceeds the allowed number of steps. We initially set the tolerance \epsilon to a high value (1.0) and slowly anneal it down when the success rate reaches 90\% on the training data. As a result, SAC first learns to move towards the target and then to match the exact position. We call C3PO the resulting procedure that combines ChronoGEM for the training data collection with goal-conditioned SAC with tolerance annealing.
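Below is a hedged sketch of the reward, the success test and the tolerance annealing described above; the body-position layout and the annealing constants are illustrative assumptions rather than the paper's exact values, and the actual training uses Brax's SAC.

```python
import numpy as np

def goal_reward(body_positions, goal_positions):
    """Negative of the worst (maximum) per-body Euclidean distance to the goal pose.

    `body_positions` and `goal_positions`: (num_bodies, 3) arrays of (x, y, z) positions.
    """
    per_body_dist = np.linalg.norm(body_positions - goal_positions, axis=-1)
    return -per_body_dist.max()

def is_success(state, goal, tolerance):
    """A step is successful when the state is within `tolerance` of the goal."""
    return np.linalg.norm(state - goal) < tolerance

def anneal_tolerance(tolerance, success_rate, factor=0.9, trigger=0.9, min_tol=0.05):
    """Shrink the tolerance once the training success rate is high enough.

    The shrink factor, the 90% trigger and the floor are illustrative values.
    """
    if success_rate >= trigger:
        tolerance = max(min_tol, factor * tolerance)
    return tolerance
```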

Experiments

Environments. We used the control tasks from Brax as high-dimensional environments. To compare the entropy and the richness of the obtained sets of states with bonus-based exploration baselines, we focused on four classical tasks: Hopper, Walker2d, Halfcheetah and Ant. Since we focus on achieving arbitrary poses and positions of the embodied agents, we modified the environments' observations so that they contain the (x, y, z) positions of all body parts. All measures (cross entropy or reaching distances) are based on that type of state. To get reasonable trajectories (as opposed to trajectories where HalfCheetah jumps to the sky), we explore the environment within a low-energy regime by putting a multiplier on the maximum action. The multiplier is 0.1 for Hopper, 0.1 for Walker, 0.01 for HalfCheetah and 1.0 for Ant. In the two following subsections, we consider episodes of length T=128. So that the physical time horizon is similar across tasks, we added an action repeat of 6 for Hopper and Walker. All episode termination conditions (for example, because the torso is too low or too high) have been removed, so we have no prior.
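For reference, the low-energy exploration settings described in this paragraph can be summarized as a simple configuration; the dictionary layout is illustrative, and an action repeat of 1 for HalfCheetah and Ant is implied rather than stated.

```python
# Episode length T = 128 for all tasks; action repeat keeps the physical horizon comparable.
EPISODE_LENGTH = 128

EXPLORATION_CONFIG = {
    "hopper":      {"action_multiplier": 0.1,  "action_repeat": 6},
    "walker2d":    {"action_multiplier": 0.1,  "action_repeat": 6},
    "halfcheetah": {"action_multiplier": 0.01, "action_repeat": 1},  # action repeat implied
    "ant":         {"action_multiplier": 1.0,  "action_repeat": 1},  # action repeat implied
}
```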

Algorithm and baselines. To collect training data, ChronoGEM was run with N=2^{17} parallel environments and a branching factor K=4 in all following experiments, except for Humanoid, in which N=2^{15} and K=64. We detail the implementations of C3PO and of all baselines (goal-SAC+RND, goal-SAC+SMM and goal-SAC+random walk) in the paper's appendix. For each baseline, we separately tuned the hyperparameters in order to obtain the best performance across the different environments.

Training goals entropy. Given a set of points x_1, \dots, x_N sampled from a distribution with an unknown density \rho, one can estimate an upper bound on the entropy of \rho given by the cross entropy H(\rho, \hat{\rho}), where \hat{\rho} is an estimate of \rho:

H(\rho, \hat{\rho}) = -\mathbb{E}_{x\sim \rho}[\log \hat{\rho}(x)] = H(\rho) + \mathrm{KL}(\rho\,\Vert\,\hat{\rho}) \geq H(\rho).

Since the estimate \hat{\rho} is trained by maximum likelihood specifically on this set of points, it directly minimises the cross entropy, which then closely approximates the true entropy. We used this upper bound to compare the coverage and richness of the sets of training goals produced by ChronoGEM, RND, SMM and a random walk. The figure below displays box plots over 10 seeds of the resulting cross entropy measured on the sets of states induced by the different algorithms, on the 4 continuous control tasks. As expected, the random walk has the lowest entropy, and RND and SMM have, on average over the environments, similar performance. ChronoGEM has the highest entropy on all environments, especially on HalfCheetah, where it was the only method that managed to explore while the actions were drastically reduced by the low multiplier.

Distribution over 10 seeds of the cross entropies of the state visitation induced by ChronoGEM, RND, SMM and a random walk, on different continuous control tasks.
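As a hedged illustration of this measurement, the snippet below fits a simple density estimate on the collected states (a Gaussian kernel density estimator standing in for the learned density model used in the paper) and reports the corresponding cross-entropy upper bound:

```python
import numpy as np
from scipy.stats import gaussian_kde

def cross_entropy_upper_bound(states):
    """Estimate H(rho, rho_hat) = -E_{x~rho}[log rho_hat(x)] >= H(rho).

    `states`: (num_states, state_dim) samples from the unknown distribution rho.
    A KDE stands in here for the learned density model.
    """
    rho_hat = gaussian_kde(states.T)        # fit a density estimate on the samples
    log_density = rho_hat.logpdf(states.T)  # log rho_hat(x_i) at each sample
    return -log_density.mean()              # Monte-Carlo estimate of the cross entropy
```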

Reaching unseen goals. If an exploration method is good, drawing goals from the states it explored should be a good approximation of drawing from all achievable states. The state distribution induced by an exploration algorithm can therefore be used both as a training set of goals and as an evaluation set of goals.
For each environment, we ran each of the four examined exploration methods (ChronoGEM, Random Walk, SMM and RND) with 3 seeds to build 3 training goal sets and 3 evaluation goal sets per method. Training goal sets contain 4096 goals and evaluation goal sets contain 128 goals. We plot the success rate as a function of the tolerance, for each environment and evaluation goal set. The next figure shows that, when evaluated on ChronoGEM goals, only C3PO -- which is trained on ChronoGEM goals -- obtains good results, whereas policies trained on goals from the other methods do not. This is a good hint that the diversity of ChronoGEM goals is higher than that of the other exploration methods. C3PO also performs well on the other evaluation sets, in particular in the low distance-threshold regime (see Hopper and Walker). This can be explained by the fact that C3PO learns to reach a high variety of poses, and being able to achieve poses with high fidelity is exactly what matters in the low-threshold regime.

For each environment (rows) and each set of evaluation goals (columns), success rates as a function of the tolerance threshold obtained by SAC when trained on the different sets of training goals (ChronoGEM, Random Walk, SMM, RND). Each exploration algorithm was run over 3 seeds to collect evaluation and training goals, and each SAC training was also run over 3 seeds, so each resulting curve averages 9 different values.

Entropy Weighted Goal Achievement (EWGA). However, the previous achievement rates alone are hard to interpret: for example, being good at reaching goals generated by the random walk matters less than achieving harder goals, especially those from the highly entropic distributions (like ChronoGEM goals on Halfcheetah or SMM goals on Walker). We therefore summarized the results by collecting all the areas under the curve (AUC) and weighting them proportionally to the exponential of the entropy of the evaluation goal set, as shown in the figure below. Indeed, if a set is very diverse, being able to achieve all of its goals means more, and vice versa: if a set is not diverse, we do not want to give it too much importance, as always achieving the same goal is not very interesting. The exponential of the entropy quantifies the effective number of states in the distribution. We call this metric Entropy Weighted Goal Achievement (EWGA):

\text{EWGA}(\text{method}) = \frac{\sum_{s \in \text{evaluation sets}} \exp(\text{entropy}(s)) \cdot \text{AUC}(\text{method on } s)}{\sum_{s \in \text{evaluation sets}} \exp(\text{entropy}(s))}

Entropy Weighted Goal-Achievement (EWGA). This estimates the ability of a policy to achieve goal sets that better cover the space (for example, a policy like C3PO that reaches a larger variety of states has a higher EWGA than a policy like SAC trained on the Random Walk, which only reaches states close to the origin).
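A minimal sketch of the EWGA computation, assuming the per-set entropies and the areas under the success-rate curves have already been computed (the numbers in the usage example are made up):

```python
import numpy as np

def ewga(auc_per_eval_set, entropy_per_eval_set):
    """Entropy Weighted Goal Achievement.

    `auc_per_eval_set[s]`: area under the method's success-rate-vs-tolerance curve
    on evaluation set `s`; `entropy_per_eval_set[s]`: estimated entropy of that set.
    """
    weights = {s: np.exp(h) for s, h in entropy_per_eval_set.items()}
    total = sum(weights.values())
    return sum(weights[s] * auc_per_eval_set[s] for s in auc_per_eval_set) / total

# Usage example with made-up numbers, one evaluation set per exploration method:
ewga({"chronogem": 0.70, "rnd": 0.60, "smm": 0.55, "random_walk": 0.90},
     {"chronogem": 5.1,  "rnd": 4.2,  "smm": 4.0,  "random_walk": 2.3})
```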

Massive goal-conditioned training

Now that we have established that ChronoGEM is the best exploration method for producing training goals in a goal-conditioned setup, we use only this method. We now allow ourselves to train for a massive number of steps and see what is the best policy we can achieve. Thanks to Brax's high parallelization and efficient infrastructure, it is possible to run 30 billion steps in a couple of days.

Humanoid. We also add Humanoid to our set of environments. By default, ChronoGEM would mostly explore positions where the humanoid is on the floor. However, it was simple to modify the algorithm so that it only explores uniformly in the space of states where the humanoid is standing: one can simply assign zero weight to undesired states during the re-sampling step. That way, we avoided states in which the torso goes below an altitude of 0.8 (the default failure condition). ChronoGEM is thus modified to never draw states where the humanoid is too low, and the goal-conditioned learner also gets a high penalty for going too low. The visual results of a policy able to achieve 90\% success at a tolerance of 0.25 are shown in the GIFs below. This shows that when we do have a prior, we can leverage it to steer both the exploration and the policy learning.
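A hedged sketch of this prior-based filtering of the re-sampling step; `torso_height` is a hypothetical helper that reads the torso's z coordinate from each state:

```python
import numpy as np

MIN_TORSO_HEIGHT = 0.8  # the default Humanoid failure threshold

def standing_weights(states, torso_height):
    """Zero out the re-sampling weight of states where the humanoid has fallen.

    `torso_height(states)` -> (num_states,) array of torso altitudes (hypothetical helper).
    The returned mask multiplies the inverse-density weights used by ChronoGEM.
    """
    return (torso_height(states) >= MIN_TORSO_HEIGHT).astype(np.float64)
```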

GIFs (one per environment): Hopper, Walker, HalfCheetah, Ant, Humanoid.

Conclusion

In the real world, no reward function is provided. To be able to learn anyway, we designed ChronoGEM, an exploration method that generates high-entropy behaviors, both in theory (see the proof in our paper's appendix) and in practice (see the training goals entropy experiment), outperforming the baseline algorithms. All the states discovered by an exploration algorithm can be used as goals to train a goal-conditioned policy, and we showed that training on ChronoGEM goals results in the most capable policies compared to the other exploration methods. On Hopper, Walker, HalfCheetah, Ant and Humanoid, visuals and metrics show that the resulting policy is able to achieve a large variety of goals with high fidelity, by moving to the correct position and then matching the pose.

This article was prepared using the Distill template.

Citation

For attribution in academic contexts, please cite this work as

Jacq et al., "C3PO: Learning to Achieve Arbitrary Goals
via Massively Entropic Pretraining", 2022

BibTeX citation

@article{jacq2022c3po,
  title   = {C3PO: Learning to Achieve Arbitrary Goals
via Massively Entropic Pretraining},
  author  = {Jacq, Alexis and Orsini, Manu and Dulac-Arnold, Gabriel and Pietquin, Olivier
             and Geist, Matthieu and Bachem, Olivier},
  journal = {arXiv preprint arXiv:TBD},
  year    = {2022},
  pdf     = {https://arxiv.org/pdf/TBD.pdf},
}