Action Quantization from Demonstrations

Reinforcement Learning (RL) relies on Markov Decision Processes (MDP) as its cornerstone, a general framework under which vastly different problems can be casted. There is a clear separation in the class of MDPs between the finite discrete action setup, where an agent faces a finite number of possible actions, and the continuous action setup, where an agent faces an infinite number of actions. The former is arguably simpler, since exploration is more manageable with a finite number of actions, and computing the maximum of the action-value function is straightforward (and implicitly defines a greedily-improved policy). In the continuous action setup, the parametrized policy either directly optimizes the expected value function that is estimated through Monte Carlo rollouts

Therefore, a natural workaround consists in turning a continuous control problem into a discrete one. The simplest approach is to naively (uniformly) discretize the action space, an idea which dates back to the "bang-bang" controller

We thus propose **A**ction **Qua**ntization from **Dem**onstrations, or AQuaDem, a novel paradigm where we learn a state dependent discretization of a continuous action space using demonstrations, enabling the use of discrete-action deep RL methods by virtue of this learned discretization. We formalize this paradigm, provide a neural implementation and analyze it through visualizations in simple grid worlds. We empirically evaluate this discretization strategy on three downstream task setups: Reinforcement Learning with demonstrations, Reinforcement Learning with play data, and Imitation Learning. We test the resulting methods on manipulation tasks and show that they outperform state-of-the-art continuous control methods both in terms of sample-efficiency and performance on every setup.

The AQuaDem framework quantizes a continuous action space conditionally on the state, from a dataset of demonstrations *action candidates*. When this mapping is learned (fully offline), we can learn a controller uniquely on a discrete action space (using a discrete deep RL method *e.g.* DQN), which consists in executing one of the K actions candidates. We propose the following objective to learn the quantization mapping (where T is a hyperparameter):

This equation corresponds to minimizing a soft-minimum of the distances between the candidate actions

The choice of reducing the action space to a few actions is somewhat counterintuitive, as it constrains the family of policies for the problem (and might exclude near-optimal policies). Nevertheless, this discretization strategy has the benefit of 1) turning a continuous action problem into a discrete one 2) constraining the possible actions to be close to the one taken by the demonstrator (which supposedly are * good*). The reason why discrete action spaces are arguably preferable is that they enable the exact computation of the maximum of the approximate action-value function. In the continuous action setting, state of the art methods such as SAC

We introduce a grid world environment where the start state is in the bottom left, and the goal state is in the top right. Actions are continuous (2-dimensional), and give the direction in which the agent take a step. These steps are normalized by the environment to have fixed L2 norm. The stochastic demonstrator moves either right or up in the bottom left of the environment then moves diagonally until reaching the edge of the grid, and goes either up or right to reach the target. The demonstrations are represented in the different colors.

We define a neural network *diagonal* even in the bottom left part of the action space, where the demonstrator only takes horizontal or vertical actions. On the contrary, the action candidates learned by AQuaDem include the actions taken by the demonstrator conditioned on the states.

Behavioral Cloning Actions

AQuaDem Candidates Actions

K
Temperature

We represent the actions candidates learned using the AQuaDem framework on the Door environment which comes with 25 demonstrations of the task. As the action space is of high dimensionality, we choose to represent each action dimension on the x-axis, and the value for each dimension on the y-axis. We connect the dots on the x-axis to facilitate the visualization through time. We replay trajectories from the human demonstrations and show at each step the 10 actions proposed by the AQuaDem network, and the action actually taken by the human demonstrator. Each action candidate has a color consistent across time (meaning that the blue action always correspond to the same head of the

Demonstration trajectory #

**Setup. ** In the Reinforcement Learning with demonstrations setup (RLfD), the environment of interest comes with a reward function and demonstrations (which include the reward), and the goal is to learn a policy that maximizes the expected return. This setup is particularly interesting for sparse reward tasks, where the reward function is easy to define (say reaching a goal state) and where RL methods typically fail because the exploration problem is too hard. We consider the Adroit tasks**sparse reward**: 1 if the goal is achieved, 0 otherwise.

**Algorithm & baselines. **The algorithm we propose is a two-fold training procedure: 1. we learn a quantization of the action space using the AQuaDem framework from human demonstrations; 2. we train a discrete action deep RL algorithm on top of this this quantization. We refer to this algorithm as AQuaDQN. The RL algorithm considered is Munchausen DQN

**Evaluation & results. ** We train the different methods on 1M environment interactions on 10 seeds for the chosen hyperparameters (a single set of hyperameters for all tasks) and evaluate the agents every 50k environment interactions (without exploration noise) on 30 episodes. An episode is considered a success if the goal is achieved during the episode. The AQuaDem discretization is trained offline using 50k gradient steps on batches of size 256. The number of actions considered were 10,15,20 and we found 10 to be performing the best. The Figure below shows the returns of the trained agents as well as their success rate. On Door, Pen, and Hammer, the AQuaDQN agent reaches high success rate, largely outperforming SACfD in terms of success and sample efficiency.

On Relocate, all methods reach poor results (although AQuaDQN slightly outperforms the baselines). The task requires a larger degree of generalisation than the other three since the goal state and the initial ball position are changing at each episode. However, we show below that when tuned uniquely on the Relocate environment and with more environment interactions, AQuaDQN manages to reach a 50% success rate where other methods still fail. Notice that on the Door environment, the SAC and SACfD agents outperform the AQuaDQN agent in terms of final return (but not in term of success rate). The behavior of these agents are however different from the demonstrator since they consist in slapping the handle and abruptly pulling it back. We provide videos of all resulting agents below (one episode for each seed which is not cherry picked) to demonstrate that AQuaDQN consistently learns a behavior that is qualitatively closer to the demonstrator.

AQuaDQN
environment

seed #

seed #

Algo
environment

seed #

seed #

**Setup. ** In Imitation Learning, the task is not specified by the reward function but by the demonstrations themselves. The goal is to mimic the demonstrated behavior. There is no reward function and the notion of success is ill-defined. Imitation Learning is of particular interest when designing a satisfying reward function --one that would lead the desired behavior to be the only optimal policy-- is harder than directly demonstrating this behavior. In this setup, there is no reward provided, not in the environment interactions nor in the demonstrations. We again consider the Adroit environments and the human demonstrations which consist of 25 episodes acquired via a virtual reality system.

**Algorithm & baselines. ** Again, the algorithm we propose has two stages: 1) we learn --fully offline-- a discretization of the action space using AQuaDem, 2) we train a discrete action version of the GAIL algorithm

**Evaluation & results. ** We train AQuaGAIL and GAIL for 1M environment interactions on 10 seeds for the selected hyperparameters (a single set for all tasks). BC is trained for 60k gradient steps with batch size 256. We evaluate the agents every 50k environment steps during training (without exploration noise) on 30 episodes. The AQuaDem discretization is trained offline using 50k gradient steps on batches of size 256. Evaluating imitation learning algorithms has to be done carefully as the goal to "mimic a behavior" is ill-defined. Here, we provide the results according to two metrics. On top, the success rate (notice that the human demonstrations do not have a success score of 1 on every task). We see that, except for Relocate, which is a hard task to solve with only 25 human demonstrations due to the necessity to generalize to new positions of the ball and the target, AQuaGAIL solves the tasks as successfully as the humans, outperforming GAIL and BC. Notice that our results corroborate previous work

AQuaGAIL
environment

seed #

seed #

Algo
environment

seed #

seed #

**Setup. ** The Reinforcement Learning with play data is an under-explored yet natural setup

**Algorithm & baselines. ** Similarly to the RLfD setting, we propose a two-fold training procedure: 1) we learn a discretization of the action space in a fully offline fashion using the AQuaDem framework on the play data, 2) we train a discrete action deep RL algorithm using this discretization on each tasks. We refer to this algorithm as AQuaPlay. Unlike the RLfD setting, the demonstrations do not include any task specific reward nor goal labels meaning that we cannot incorporate the demonstration episodes in the replay buffer nor use some form of goal-conditioned BC. We use SAC as a baseline, which is trained to optimize task specific rewards. Since the action space dimensionality is fairly low (5-dimensional), we can include naive discretization baselines "bang-bang" (BB)

We train the different methods on 1M environment interactions on 10 seeds for the chosen hyperparameters (a single set of hyperameters for all tasks) and evaluate the agents every 50k environment interactions (without exploration noise) on 30 episodes. The AQuaDem discretization is trained offline on play data using 50k gradient steps on batches of size 256. The number of actions considered were 10,20,30,40 and we found 30 to be performing the best. It is interesting to notice that it is higher than for the previous setups. It aligns with the intuition that with play data, several behaviors needs to be modelled. The AQuaPlay agent consistently outperforms SAC in this setup. Interestingly, the performance of the BB agent decreases with the discretization granularity, well exemplifying the curse of dimensionality of the method. In fact, BB with a binary discretization (BB-2) is competitive with AQuaPlay, which validates that discrete action RL algorithms are well performing if the discrete actions are sufficient to solve the task. Note however that the Robodesk environment is a relatively low-dimensional action environment, making it possible to have BB as a baseline, which is not the case of e.g. Adroit where the action space is high-dimensional.

Play episode example

Algo
Task
Seed #

With the AQuaDem paradigm, we provide a simple yet powerful method that enables to use discrete-action deep RL methods on continuous control tasks using demonstrations, thus escaping the complexity or curse of dimensionality of existing discretization methods. We showed in three different setups that it provides substantial gains in sample efficiency and performance and that it leads to qualitatively better agents, as enlightened by the videos. There are a number of different research avenues opened by AQuaDem. Other discrete action specific methods could be leveraged in a similar way in the context of continuous control: count-based exploration, planning or offline RL. Similarly a number of methods in Imitation Learning or in offline RL are evaluated on continuous control tasks and are based on Behavioral Cloning regularization which could be refined using the same type of multioutput architecture used in this work. Another possible direction for the AQuaDem framework is to be analyzed in the light of risk-MDPs as the constraint of the action space arguably reduces a notion of risk when acting in this environment. Finally, as the gain of sample efficiency is clear in different experimental settings, we believe that the AQuaDem framework could be an interesting avenue for learning controllers on physical systems.

This article was prepared using the Distill template, we also took inspiration from CLIPort.

For attribution in academic contexts, please cite this work as

Dadashi et al., "Continuous Control with Action Quantization from Demonstrations", 2021

BibTeX citation

@article{dadashi2021continuous, title = {Continuous Control with Action Quantization from Demonstrations}, author = {Dadashi, Robert and Hussenot, L{\'e}onard and Vincent, Damien and Girgin, Sertan and Raichuk, Anton and Geist, Matthieu and Pietquin, Olivier}, journal = {arXiv preprint arXiv:2110.10149}, year = {2021}, pdf = {https://arxiv.org/pdf/2110.10149.pdf}, }