Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics

Denis Steckelmacher, Hélène Plisnier, Diederik M. Roijers, and Ann Nowé
Vrije Universiteit Brussel (VUB), Brussels, Belgium

Contact: dsteckel@ai.vub.ac.be

1 Introduction

Sample efficiency is key to many applications of reinforcement learning in the real world, for instance when learning directly on a physical robot [1]. In discrete-action settings, value-based methods tend to be more sample-efficient than actor-critic ones [2]. We argue that this is because actor-critic algorithms learn a critic Q^π that must accurately evaluate the actor, instead of Q^*, the optimal Q-function [3]. Some algorithms allow the agent to execute a policy different from the actor, which their authors refer to as off-policy, but the critic remains on-policy with regard to the actor [4,5]. We propose a new actor-critic algorithm, inspired by Conservative Policy Iteration [6], that uses off-policy critics approximating Q^* instead of Q^π.
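For reference, the distinction can be made precise with the standard Bellman equations (textbook definitions, not equations taken from this abstract): the on-policy critic Q^π bootstraps through the actor's own action distribution, whereas the optimal Q^* bootstraps through a maximisation over actions.

Q^{\pi}(s, a) = \mathbb{E}_{r, s'}\left[ r + \gamma\, \mathbb{E}_{a' \sim \pi(\cdot \mid s')}\left[ Q^{\pi}(s', a') \right] \right]

Q^{*}(s, a) = \mathbb{E}_{r, s'}\left[ r + \gamma \max_{a'} Q^{*}(s', a') \right]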

2 Bootstrapped Dual Policy Iteration

Our algorithm, fully described in [7], is divided into two parts: off-policy critics, and an actor that is robust to off-policy critics. Our critic learning rule, inspired by Clipped DQN [8], is given in Equation 1. This learning rule is used to train 16 critics, each of them on a distinct batch of 256 experiences sampled from a single shared experience buffer, as suggested by [9]. After every time-step, we update each critic i on a batch B_i of experiences, then sequentially update the actor with Equation 2 on every batch B_1, ..., B_16:

Q_{A,i}(s, a) \leftarrow Q_{A,i}(s, a) + \alpha \big( r + \gamma V(s') - Q_{A,i}(s, a) \big) \qquad \forall (s, a, r, s') \in B_i \qquad (1)
[Figure 1: learning curves (return per episode versus episode) on the Table, LunarLander, FrozenLake and Hallway environments, comparing ACKTR, PPO, BDQN, ABCDQN, BDPI/mimic and BDPI (ours).]

Fig. 1. BDPI outperforms many other algorithms in hard-to-explore, highly-stochastic and pixel-based environments.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


V(s') \equiv \min_{l \in \{A,B\}} Q_{l,i}\!\left(s', \operatorname*{argmax}_{a'} Q_{A,i}(s', a')\right)

\pi(s) \leftarrow (1 - \lambda)\, \pi(s) + \lambda\, \Gamma\!\left(Q_{A,i}(s, \cdot)\right) \qquad \forall i, \forall s \in B_i \qquad (2)

with Q_{A,i} and Q_{B,i} the two Clipped DQN Q-functions of critic i, which are swapped every time-step, and Γ the greedy function, which returns the action having the largest Q-value in a given state.
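To make the two update rules concrete, here is a minimal tabular sketch of one update step, assuming a small discrete state and action space, NumPy arrays in place of the function approximators used in the full algorithm, and illustrative hyper-parameter values (n_states, n_actions, alpha, gamma, lam, the critic count and the batch size are assumptions of this sketch, not values taken from the paper):

import numpy as np

# Hedged sketch of Equations 1 and 2 with tabular critics and actor.
n_states, n_actions = 16, 4
n_critics, batch_size = 16, 256
alpha, gamma, lam = 0.2, 0.99, 0.05
rng = np.random.default_rng(0)

# Two Clipped-DQN Q-tables (A and B) per critic, and one shared actor pi.
Q = np.zeros((n_critics, 2, n_states, n_actions))
pi = np.full((n_states, n_actions), 1.0 / n_actions)

def bdpi_update(replay_buffer):
    """One update after a time-step; replay_buffer holds (s, a, r, s2, done) tuples."""
    for i in range(n_critics):
        Q[i] = Q[i, ::-1].copy()  # swap Q_{A,i} and Q_{B,i} every time-step
        # Sample a distinct batch B_i for critic i from the shared buffer.
        idx = rng.choice(len(replay_buffer), size=min(batch_size, len(replay_buffer)), replace=False)
        batch = [replay_buffer[j] for j in idx]

        # Equation 1: Clipped-DQN-style critic update on B_i.
        for (s, a, r, s2, done) in batch:
            a_star = np.argmax(Q[i, 0, s2])
            # Terminal states bootstrap to 0 (standard assumption, not spelled out in Eq. 1).
            v_next = 0.0 if done else min(Q[i, 0, s2, a_star], Q[i, 1, s2, a_star])
            Q[i, 0, s, a] += alpha * (r + gamma * v_next - Q[i, 0, s, a])

        # Equation 2: move the actor towards the greedy policy of critic i on B_i.
        for (s, _a, _r, _s2, _done) in batch:
            greedy = np.zeros(n_actions)
            greedy[np.argmax(Q[i, 0, s])] = 1.0
            pi[s] = (1.0 - lam) * pi[s] + lam * greedy

In the full algorithm [7], the critics and the actor are parametric approximators rather than tables, but the structure of the update is the same.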

3 Experiment

We compare BDPI to a variety of state-of-the-art reinforcement-learning algorithms in four environments: Table [7], LunarLander and FrozenLake (OpenAI Gym), and Hallway¹. Figure 1 shows that BDPI largely outperforms every other algorithm, even in the pixel-based 3D Hallway environment. More importantly, BDPI outperforms ABCDQN, the critics of BDPI used without an actor, and BDPI/mimic, which uses a different actor training rule [7]. This demonstrates that both our actor and critic learning rules advance the state of the art in sample-efficient reinforcement learning. Our results are further illustrated by our robotic wheelchair demonstration, also submitted to this conference.
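As an illustration of how such an evaluation could be run, the following is a generic sketch using the classic OpenAI Gym API, with a random policy standing in for a trained BDPI agent; the environment id and the reset/step signatures are assumptions and differ across gym versions.

import gym

env = gym.make("LunarLander-v2")  # or another discrete-action Gym environment
for episode in range(5):
    obs, done, ret = env.reset(), False, 0.0
    while not done:
        action = env.action_space.sample()  # stand-in for the agent's policy pi(obs)
        obs, reward, done, info = env.step(action)
        ret += reward
    print("Episode", episode, "return", ret)
env.close()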

References

1. Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17:39:1–39:40, 2016.
2. Matteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Daniel Horgan, Bilal Piot, Mohammad Gheshlaghi Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. arXiv, abs/1710.02298, 2017.
3. Vijaymohan R. Konda and Vivek S. Borkar. Actor-critic-type learning algorithms for Markov decision processes. SIAM Journal on Control and Optimization, 38(1):94–123, January 1999.
4. Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv, abs/1801.01290, 2018.
5. Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Rémi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience replay. Technical report, 2016.
6. Bruno Scherrer. Approximate policy iteration schemes: A comparison. In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 1314–1322, 2014.
7. Denis Steckelmacher, Hélène Plisnier, Diederik M. Roijers, and Ann Nowé. Sample-efficient model-free reinforcement learning with off-policy critics. arXiv, abs/1903.04193, 2019.
8. Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning (ICML), pages 1582–1591, 2018.
9. Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems (NIPS), 2016.

¹ https://github.com/maximecb/gym-miniworld
