• Aucun résultat trouvé

MESSI: Maximum Entropy Semi-Supervised Inverse Reinforcement Learning


Academic year: 2021

Partager "MESSI: Maximum Entropy Semi-Supervised Inverse Reinforcement Learning"


Texte intégral


HAL Id: hal-01177446


Submitted on 16 Jul 2015

HAL is a multi-disciplinary open access

archive for the deposit and dissemination of

sci-entific research documents, whether they are

pub-lished or not. The documents may come from

teaching and research institutions in France or

abroad, or from public or private research centers.

L’archive ouverte pluridisciplinaire HAL, est

destinée au dépôt et à la diffusion de documents

scientifiques de niveau recherche, publiés ou non,

émanant des établissements d’enseignement et de

recherche français ou étrangers, des laboratoires

publics ou privés.

MESSI: Maximum Entropy Semi-Supervised Inverse

Reinforcement Learning

Julien Audiffren, Michal Valko, Alessandro Lazaric, Mohammad Ghavamzadeh

To cite this version:

Julien Audiffren, Michal Valko, Alessandro Lazaric, Mohammad Ghavamzadeh. MESSI: Maximum

Entropy Semi-Supervised Inverse Reinforcement Learning. NIPS Workshop on Novel Trends and

Applications in Reinforcement Learning, 2014, Montreal, Canada. �hal-01177446�


MESSI: Maximum Entropy Semi-Supervised

Inverse Reinforcement Learning

Julien Audiffren CMLA, ENS Cachan julien.audiffren@cmla.ens-cachan.fr

Michal Valko INRIA Lille – Nord Europe


Alessandro Lazaric INRIA Lille – Nord Europe


Mohammad Ghavamzadeh INRIA/Adobe Research mohammad.ghavamzadeh@inria.fr

Introduction. The most common approach to solve a sequential decision-making problem is to formulate it as a Markov decision process (MDP). This process requires the definition of a reward function, but in many applications, such as driving or playing tennis, it is easier and more natural to learn how to perform such tasks by observing an expert’s demonstration, rather than by definition of a reward function. The task of learning from an expert is called apprenticeship learning. A pow-erful and relatively novel approach to apprenticeship learning (AL) is to formulate it as an inverse reinforcement learning(IRL) problem [6]. The basic idea is to assume that the expert is trying to optimize an MDP and to derive an algorithm for learning the task demonstrated by the expert [6, 1]. This approach has been shown to be effective in learning non-trivial tasks such as inverted helicopter flight control [5], ball-in-a-cup [2], and driving on a highway [1, 4].

In the IRL approach to AL, we assume that several trajectories generated by an expert are available and the unknown reward function optimized by the expert can be specified as a linear combination of a number of state features. In many applications, in addition to the expert’s trajectories, we may have access to a large number of trajectories that are not necessarily performed by an “expert”. For example, in learning to drive, we may ask an expert driver to demonstrate a few trajectories and use them in an AL algorithm to mimic her behavior. At the same time, we may record trajectories from many other drivers for which we cannot assess their quality and that may or may not demonstrate an expert-level behavior. We will refer to them as unsupervised trajectories and to the task of learning with them as semi-supervised apprenticeship learning following Valko et al. [8] who combine the IRL approach of Abbeel and Ng [1] with semi-supervised SVMs. However, unlike in classification, we do not regard the unsupervised trajectories as being a mixture of expert and non-expert classes. This is because the unsupervised trajectories might have been generated by the expert herself, by another expert(s), by near-expert agents, by agents maximizing different reward functions, or sim-ply they can be some noisy data. The objective of IRL is to find the reward function that expert trajectories maximize, and thus, semi-supervised apprenticeship learning cannot be considered as a special case of semi-supervised classification.

Maximum Entropy Semi-Supervised Inverse Reinforcement Learning. We propose the algo-rithm MESSI (MaxEnt Semi-Supervised IRL, see algoalgo-rithm 1) to address the challenge above by combining the MaxEnt-IRL approach of Ziebart et al. [9] with SSL. MESSI integrates the unsuper-vised trajectories in a principled way such that it performs better than MaxEnt-IRL. For this purpose, we assume that the learner is provided with a set of expert trajectoriesΣ∗ =

i}li=1and a set of unsupervised trajectories eΣ ={ζj}uj=1. We also assume that a function s is provided to measure the similarity s(ζ, ζ0) between any pair of trajectories (ζ, ζ0). We define the pairwise penalty R as

R(θ|Σ) = 1 |Σ| X ζ,ζ0∈Σ s(ζ, ζ0) (θT(fζ − fζ0))2, (1)

whereΣ = Σ∗∪ eΣ, and fζ and fζ0 are the feature counts for trajectories ζ, ζ0 ∈ Σ, and finally (θT

(fζ− fζ0))2= (¯rθ(ζ)− ¯rθ(ζ0))2is the difference in rewards accumulated by the two trajectories


Algorithm 1 MESSI - MaxEnt SSIRL

Input: Set of l expert trajectories Σ∗= {ζi∗}li=1, set of u unsupervised trajectories eΣ = {ζj}uj=1, similarity

function s, number of iterations T , constraint θmax, regularizer λ0

Initialization: Compute {fζ∗ i} l i=1, {fζj} u j=1and f ∗ = 1/lPl i=1fζ∗

i and generate a random reward vector θ0 for t = 1 to T do

1. Compute policy πt−1from θt−1(Solving the MDP)

2. Compute feature counts ft−1of πt−1(forward pass of MaxEnt)

3. Update the reward vector by doing a gradient descent step on (2) 4. If kθtk∞> θmax, project back by θt← θtkθθmaxtk∞

end for

w.r.t. the reward vector θ. The purpose of the pairwise penalty is to penalize reward vectors θ that assign very different rewards to similar trajectories (as measured by s(ζ, ζ0)). We then follow the framework of Erkan and Altun [3] to integrate the regularization R into the learning objective. This leads to the following optimization problem :

θ∗= argmax

θ (L(θ|Σ ∗)

− λR(θ|Σ)) , (2) where L is the the log-likelihood of θ w.r.t. the expert’s trajectories and λ is a parameter trading off between L and the coherence with the similarity between the provided trajectories both in eΣ and Σ∗. Although hand-crafted similarity functions usually perform better, our experiments show that even the simple RBF s(ζ, ζ0) = exp(

−kfζ− fζ0k2/2σ), (where σ is the bandwidth) is an effective

similarity for the feature counts.

0 20 40 60 80 100 Number of iterations −10 −9 −8 −7 −6 −5 −4 −3 −2 −1 Re wa rd MaxEnt MESSIMAX MESSI with distribution Pµ1 MESSI with distribution Pµ2 MESSI with distribution Pµ3

0 10 20 30 40 50 60 Number of iterations −4000 −3900 −3800 −3700 −3600 −3500 Re wa rd MaxEnt MESSIMAX MESSI with distribution Pµ1 MESSI with distribution Pµ2 MESSI with distribution Pµ3

10−3 10−2 10−1 100 parameter lambda −12 −10 −8 −6 −4 −2 0 R ew ar d MaxEnt MESSIMAX

MESSI with distribution Pµ1 MESSI with distribution Pµ2 MESSI with distribution Pµ3

10−3 10−2 10−1 100 parameter lambda −1700 −1680 −1660 −1640 −1620 −1600 −1580 −1560 −1540 R ew ar d MaxEnt MESSIMAX

MESSI with distribution Pµ1 MESSI with distribution Pµ2 MESSI with distribution Pµ3

Figure 1: Results as a function of number of iterations (up) and the parameter lambda (down) of the MaxEnt, MESSIMAX (MESSI with all unsupervised trajectories drawn from expert) and MESSI (with different distributions of unsupervised trajectories) algorithms on the Highway driving (left) and the gridworld (right) dataset.

Experimental Results.

Our experiments shows that MESSI takes advantage of unsupervised trajectories and can perform better and more efficiently than MaxEnt-IRL in the highway driving problem of Syed et al. [7] and the grid-world domain in Abbeel and Ng [1] (see fig 1).



[1] P. Abbeel and A. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceed-ings of the 21st International Conference on Machine Learning, 2004.

[2] A. Boularias, J. Kober, and J. Peters. Relative Entropy Inverse Reinforcement Learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, vol-ume 15, pages 182–189, 2011.

[3] A. Erkan and Y. Altun. Semi-Supervised Learning via Generalized Maximum Entropy. In Proceedings of JMLR Workshop, pages 209–216. New York University, 2009.

[4] S. Levine, Z. Popovic, and V. Koltun. Nonlinear Inverse Reinforcement Learning with Gaussian Processes. In Advances in Neural Information Processing Systems 24, pages 1–9, 2011. [5] A. Ng, A. Coates, M. Diel, V. Ganapathi, J. Schulte, B. Tse, E. Berger, and E. Liang. Inverted

Autonomous Helicopter Flight via Reinforcement Learning. In International Symposium on Experimental Robotics, 2004.

[6] A. Ng and S. Russell. Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning, pages 663–670, 2000.

[7] U. Syed, R. Schapire, and M. Bowling. Apprenticeship Learning Using Linear Programming. In Proceedings of the 25th International Conference on Machine Learning, pages 1032–1039, 2008.

[8] M. Valko, M. Ghavamzadeh, and A. Lazaric. Semi-Supervised Apprenticeship Learning. In Proceedings of the 10th European Workshop on Reinforcement Learning, volume 24, pages 131–241, 2012.

[9] B. Ziebart, A. Maas, A. Bagnell, and A. Dey. Maximum Entropy Inverse Reinforcement Learn-ing. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence, 2008.


Figure 1: Results as a function of number of iterations (up) and the parameter lambda (down) of the MaxEnt, MESSIMAX (MESSI with all unsupervised trajectories drawn from expert) and MESSI (with different distributions of unsupervised trajectories) algorith


Documents relatifs

A chirped pulse millimeter wave (CP- mmW) setup was combined with the buffer gas cooling apparatus for combined laser and millimeter wave spectroscopy experiments

Rather, we use the strength of classifiers that have access to images and associated tags to obtain additional examples to train a classifier that uses only visual features,

un monde qui ne va plus de l’aliénation à l’émancipation mais de l’intrication à l’encore plus intriqué, qui ne va plus du prémoderne au moderne, mais du moderne au

The Multi-Robot Deploying wireless Sensor nodes problem, called the MRDS problem, consists in minimizing the longest tour duration (i.e. the total deployment duration), the number

This paper focuses on the scientific results of the project “Development and validation of multitemporal image analysis methodologies for multirisk monitoring of critical

— In this paper, hypocoercivity methods are applied to linear kinetic equa- tions with mass conservation and without confinement, in order to prove that the solutions have an

In the following set of experiments with the practical FOM and CG algorithms, we assume that the products Ap can be computed in double, single or half precision (with

Assign to each OL of the test matrix, the obtained result, hence go back to step 1 till number of trigrams of the reference dataset does not change. 0 1 2 3 4 5

Minimum entropy logistic regression is also compared to the classic EM algorithm for mixture models (fitted by maximum likelihood on labeled and unlabeled examples, see

In the non-asymptotic case we should expect transduction to perform better than induction (again, assuming test samples are drawn from the same distribution as the training data),

The main idea is to apply the same smoothness criterion that is behind the original semi-supervised algorithm, adding the terms corresponding to the new example, but keeping

Bilmes, “Acoustic classification using semi-supervised deep neural networks and stochastic entropy- regularization over nearest-neighbor graphs,” in IEEE Interna- tional Conference

On peut en outre remarquer que le taux de croissance de l’économie augmente avec le ratio épargne privée/dette publique car les recettes fiscales sur les

L’objet de cette thèse est de comparer les systèmes de soins bucco-dentaires français et suédois grâce à l’analyse de l’état de santé bucco-dentaire des

With a set of 1, 000 labeling, the SVM classification error is 4.6% with SVM training set of 60 labels (1% of the database size), with no kernel optimization nor advanced

Notre outil d'aide diagnostique et thérapeutique est une proposition pour aider les médecins généralistes dans le diagnostic et la prise en charge des

q in bottom-up manner, by creating a tree from the observations (agglomerative techniques), or top-down, by creating a tree from its root (divisives techniques).. q Hierarchical

In practice, popular SSL methods as Standard Laplacian (SL), Normalized Laplacian (NL) or Page Rank (PR), exploit those operators defined on graphs to spread the labels and, from

In our last experiment, we show that L γ - PageRank, adapted to the multi-class setting described in Section 2.3, can improve the performance of G-SSL in the presence of

Contrary to the lack of clearly supported species clusters in the sequence analyses, the AFLP markers resolved the phylogeny with higher bootstrap and clade credibility support.

We propose an online SSL algorithm that uses a combi- nation of quantization and regularization to obtain fast and accurate performance with provable performance bounds. Our

A typical scenario for apprenticeship learning is to have an expert performing a few optimal (or close to the optimal) trajectories and use these to learn a good policy.. In this

The Inverse Reinforcement Learning (IRL) [15] problem, which is addressed here, aims at inferring a reward function for which a demonstrated expert policy is optimal.. IRL is one