A Motorized Wheelchair that Learns to Make its Way through a Crowd

Denis Steckelmacher, Hélène Plisnier, and Ann Nowé

Vrije Universiteit Brussel (VUB), Brussels, Belgium
dsteckel@ai.vub.ac.be

http://steckdenis.be/bnaic_demo_wheelchair.mp4

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. Reinforcement Learning on physical robots is challenging, especially when the robot only provides high-dimensional sensor readings, and must learn a task in an environment for which no model is available. In this demonstration, a robotic wheelchair moves in a crowd of people. It is rewarded for moving fast, but punished when too close to a person or obstacle. Because the room the chair will be in is unknown, and because it is impossible to model how every person moves, learning our go-fast task requires the use of model-free reinforcement learning. We propose to use Bootstrapped Dual Policy Iteration [4], a highly sample-efficient model-free RL algorithm, which we show needs only two hours to learn to avoid obstacles on our wheelchair.

1 Algorithm

Bootstrapped Dual Policy Iteration (BDPI) [4] is a model-free actor-critic reinforcement-learning algorithm for continuous states and discrete actions. Contrary to conventional actor-critic algorithms [3], BDPI's critics learn the optimal Q function off-policy, with a variant of Q-Learning, instead of the on-policy Q^π function. BDPI's actor learning rule, based on Conservative Policy Iteration [2], is able to use several off-policy critics. The combination of highly sample-efficient off-policy critics and a smart actor learning rule makes BDPI one or two orders of magnitude more sample-efficient than state-of-the-art algorithms. Because the sample-efficiency of BDPI increases with the amount of computation done per time-step, we extend BDPI by processing many batches of experiences and many critics in parallel, à la Hogwild! [1], instead of sequentially training each critic on each set of experiences. This allows us to fully utilize the 32 cores of an AMD Threadripper 2990WX machine, maximizing the amount of learning that happens per time-step.
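
To make the structure of these updates concrete, the following is a minimal, illustrative sketch of a BDPI-style update for our setting (160-dimensional observations, 4 discrete actions). It is not the authors' implementation: the network sizes, learning rates, number of critics, the step-size LAMBDA and the plain Q-Learning critic target are assumptions, and the per-critic loop shown sequentially here is what the Hogwild!-style extension runs in parallel.

```python
# Illustrative BDPI-style sketch (not the authors' code). Assumed: network
# sizes, learning rates, N_CRITICS, LAMBDA, and a plain Q-Learning critic
# target (BDPI uses a more refined off-policy variant).
import random
import torch
import torch.nn as nn

OBS_DIM, N_ACTIONS = 160, 4          # 160 distance readings, 4 wheelchair actions
N_CRITICS, GAMMA, LAMBDA = 8, 0.99, 0.05

def mlp(out_dim):
    return nn.Sequential(nn.Linear(OBS_DIM, 128), nn.Tanh(), nn.Linear(128, out_dim))

actor = mlp(N_ACTIONS)                                   # action logits
critics = [mlp(N_ACTIONS) for _ in range(N_CRITICS)]     # one Q-head per critic
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opts = [torch.optim.Adam(c.parameters(), lr=1e-3) for c in critics]
replay = []                                              # (s, a, r, s2, done) tensors

def train_step(batch_size=64):
    if len(replay) < batch_size:
        return
    # Sequential for clarity; the Hogwild!-style extension runs this loop
    # with one worker per critic, each on its own batch of experiences.
    for critic, opt in zip(critics, critic_opts):
        s, a, r, s2, d = map(torch.stack, zip(*random.sample(replay, batch_size)))

        # Off-policy critic update towards a Q-Learning-style target.
        with torch.no_grad():
            target = r + GAMMA * (1.0 - d) * critic(s2).max(dim=1).values
        q = critic(s).gather(1, a.unsqueeze(1)).squeeze(1)
        critic_loss = nn.functional.mse_loss(q, target)
        opt.zero_grad(); critic_loss.backward(); opt.step()

        # Conservative Policy Iteration-style actor update: move the policy
        # a small step LAMBDA towards the greedy policy of this critic.
        with torch.no_grad():
            greedy = nn.functional.one_hot(critic(s).argmax(dim=1), N_ACTIONS).float()
            target_pi = (1 - LAMBDA) * torch.softmax(actor(s), dim=1) + LAMBDA * greedy
        log_pi = torch.log_softmax(actor(s), dim=1)
        actor_loss = -(target_pi * log_pi).sum(dim=1).mean()  # cross-entropy to target_pi
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```

The point of the sketch is the structure: several off-policy critics trained from a replay buffer, and an actor pulled towards their greedy policies rather than towards an on-policy gradient, so every stored transition keeps being reused.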

2 Setting, Task and Safety

The reinforcement-learning agent running on the wheelchair observes a 160-dimensional vector of distance readings, produced by a webcam pointing towards the front of the wheelchair, with a simple edge-detection algorithm. We assume a smooth floor, with few/weak edges on it, and observe that obstacles (shoes, people, walls) produce strong edges. The agent has access to four actions: going forwards, turning left, turning right, and emitting a slight coughing sound. Actions are executed every half-second. Over two hours, this leads to only 14K actions being executed, an immense sample-efficiency challenge for an agent observing a 160-dimensional continuous input. A positive reward is given whenever the agent goes forwards. A negative reward is given when the agent is too close to an obstacle (less than 40 cm), or if any person presses the punish button on the wheelchair.
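
The reward structure above is simple enough to express directly; the sketch below mirrors it. The function name and the numeric magnitudes are illustrative assumptions; only the sign structure (positive for going forwards, negative within 40 cm of an obstacle or when the punish button is pressed) comes from the text.

```python
# Illustrative reward function matching the description above; magnitudes
# and names are assumptions, only the sign structure is from the paper.
FORWARD, LEFT, RIGHT, COUGH = range(4)        # the four discrete actions

def reward(action, distances_m, punish_pressed):
    """distances_m: the 160 distance readings (in metres) from the edge-detection camera."""
    r = 0.0
    if action == FORWARD:
        r += 1.0                              # positive reward for going forwards
    if min(distances_m) < 0.40:               # closer than 40 cm to an obstacle
        r -= 1.0
    if punish_pressed:                        # a bystander pressed the punish button
        r -= 1.0
    return r
```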


[Figure 1, right panel: learning curves plotted against Time (hour:minute) for ACKTR, Rainbow, BDPI (ours) and a Human Policy; the y-axis ticks range from -150 to 0.]

Fig. 1. Left: our motorized wheelchair in a 4-by-4 meter area. Right: BDPI is the only algorithm able to learn to avoid obstacles in 2 hours, and even outperforms a good human-designed policy.

Our demonstration is a slightly more challenging task, as obstacles will be people, and the agent will be able to clear its throat to try to get people to make way for it.


Because the wheelchair will move in the crowd of people watching the demonstrations, safety is critical. The wheelchair will run at its slowest setting, a backup policy prevents the chair from going too close to an obstacle, the off-button of the wheelchair is easily accessible from all sides, and one of the authors will always be closely supervising the chair. Our demonstration requires one power plug and 1 square meter for our infrastructure. This demonstration can be contained in a 4×3 meter area, but would be more interesting if the chair is allowed to move amongst people.
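
As a hypothetical illustration of the kind of backup policy mentioned above (this is not the authors' safety code), a minimal shield can simply veto the forward command whenever any distance reading falls below a safety margin:

```python
# Hypothetical safety-shield sketch, not the authors' code. The 40 cm margin
# reuses the penalty radius from Section 2; the stop behaviour is assumed.
FORWARD, LEFT, RIGHT, COUGH = range(4)
SAFETY_MARGIN_M = 0.40

def safe_action(learned_action, distances_m):
    """Veto the learned 'forwards' command when an obstacle is within the margin."""
    if learned_action == FORWARD and min(distances_m) < SAFETY_MARGIN_M:
        return None        # None = stop the motors at the low-level controller
    return learned_action
```

Keeping this check outside the learned policy means the hard constraint is enforced regardless of what the agent has learned so far.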

Acknowledgments

The first and second authors are funded by the Science Foundation of Flanders (FWO, Belgium), respectively as 1129319N Aspirant, and 1SA6619N Applied Researcher.

References

1. Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems (NIPS), pages 693–701, 2011.

2. Bruno Scherrer. Approximate policy iteration schemes: A comparison. In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 1314–1322, 2014.

3. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv, abs/1707.06347, 2017.

4. Denis Steckelmacher, Hélène Plisnier, Diederik M. Roijers, and Ann Nowé. Sample-efficient model-free reinforcement learning with off-policy critics. arXiv, abs/1903.04193, 2019.
