Bayesian Optimisation

Academic year: 2021

(1)

Bayesian optimisation

Gilles Louppe

(2)

Problem statement

x∗ = arg max_x f(x)

Constraints:

• f is a black box for which no closed form is known; gradients df/dx are not available;

• f is expensive to evaluate;

• (optional) observations yi of f are noisy, e.g. yi = f(xi) + εi because of Poisson fluctuations.
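As a running illustration, such a black-box objective can be mocked up in a few lines (the function f and the noise level here are hypothetical, chosen only for illustration):

```python
import math
import random

rng = random.Random(0)  # fixed seed for reproducible noise

def f(x):
    """Hypothetical black-box objective; its closed form is unknown to the optimiser."""
    return x * math.sin(3.0 * x)

def observe(x, noise=0.1):
    """Noisy observation y_i = f(x_i) + eps_i, mimicking measurement fluctuations."""
    return f(x) + rng.gauss(0.0, noise)
```

The optimiser only ever sees the return values of `observe`, never `f` itself.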

(3)

Disclaimer

If you do not have these constraints, there is certainly a better optimisation algorithm than Bayesian optimisation.

(4)

Bayesian optimisation

for t = 1 : T ,

1. Given observations (xi, yi) for i = 1 : t, build a probabilistic model for the objective f.

Integrate out all possible true functions, using Gaussian process regression.

2. Optimise a cheap utility function u based on the posterior distribution for sampling the next point.

xt+1 = arg max_x u(x)

Exploit uncertainty to balance exploration against exploitation.
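A minimal sketch of this loop, assuming a one-dimensional search space, an RBF kernel, and a UCB acquisition maximised by grid search (all names, kernel choices, and defaults are illustrative, not prescribed by the slides):

```python
import numpy as np

def rbf(A, B, length=0.5):
    """Squared-exponential kernel matrix between two sets of 1-D inputs."""
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    """GP posterior mean and std at test points Xs, given observations (X, y)."""
    K = rbf(X, X) + noise * np.eye(len(X))  # jitter keeps K positive definite
    Ks = rbf(X, Xs)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.diag(rbf(Xs, Xs)) - np.sum(v ** 2, axis=0)
    return mu, np.sqrt(np.maximum(var, 0.0))

def bayes_opt(f, lo, hi, T=10, kappa=2.0, seed=0):
    """Run T rounds of GP-based Bayesian optimisation with a UCB acquisition."""
    rng = np.random.default_rng(seed)
    X = list(rng.uniform(lo, hi, size=2))   # two random initial evaluations
    y = [f(x) for x in X]
    grid = np.linspace(lo, hi, 200)         # cheap acquisition maximisation by grid search
    for _ in range(T):
        mu, sigma = gp_posterior(np.array(X), np.array(y), grid)
        u = mu + kappa * sigma              # UCB: exploit mean, explore uncertainty
        x_next = grid[np.argmax(u)]
        X.append(float(x_next))
        y.append(f(float(x_next)))
    best = int(np.argmax(y))
    return X[best], y[best]
```

Maximising the acquisition on a grid is only viable in low dimension; in practice u is itself optimised numerically, which is cheap because it only queries the surrogate.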

(5)

Where shall we sample next?

[Figure: observations of the true (unknown) function f(x) over x ∈ [−2, 2].]

(6)

Build a probabilistic model for the objective function

[Figure: GP posterior mean µGP(x) with confidence interval CI, fitted to the observations, overlaid on the true (unknown) f(x).]

This gives a posterior distribution over functions that could have generated the observed data.

(7)

Acquisition functions

Acquisition functions u(x ) specify which sample x should be tried next:

• Upper confidence bound: UCB(x) = µGP(x) + κσGP(x);

• Probability of improvement: PI(x) = P(f(x) ≥ f(xt+) + κ);

• Expected improvement: EI(x) = E[f(x) − f(xt+)];

• ... and many others.

where xt+ is the best point observed so far.

In most cases, acquisition functions provide knobs (e.g., κ) for controlling the exploration-exploitation trade-off.

• Search in regions where µGP(x) is high (exploitation);
• Search in regions where σGP(x) is high (exploration).
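These three acquisition functions can be written down directly from the GP posterior mean µ and standard deviation σ at a candidate point, assuming a Gaussian posterior (a sketch; Φ and φ denote the standard normal CDF and PDF):

```python
import math

def norm_cdf(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_pdf(z):
    """Standard normal PDF."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def ucb(mu, sigma, kappa=1.0):
    """Upper confidence bound: optimistic value under uncertainty."""
    return mu + kappa * sigma

def pi(mu, sigma, y_best, kappa=0.0):
    """Probability of improving on the best observation y_best by margin kappa."""
    if sigma == 0.0:
        return float(mu > y_best + kappa)
    return norm_cdf((mu - y_best - kappa) / sigma)

def ei(mu, sigma, y_best):
    """Expected improvement over y_best (closed form for a Gaussian posterior)."""
    if sigma == 0.0:
        return max(mu - y_best, 0.0)
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm_cdf(z) + sigma * norm_pdf(z)
```

Note how each function trades off µ against σ differently: UCB does so linearly via κ, while PI and EI weight the uncertainty through the Gaussian tail around the incumbent y_best.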

(8)

Plugging everything together (t = 0)

[Figure: posterior µGP(x) with CI, acquisition u(x), observations, and the true (unknown) f(x); best point so far xt+ = 0.1000.]

(9)

... and repeat until convergence (t = 1)

[Figure: posterior µGP(x) with CI, acquisition u(x), observations, and the true (unknown) f(x); best point so far xt+ = 0.1000.]

(10)

... and repeat until convergence (t = 2)

[Figure: posterior µGP(x) with CI, acquisition u(x), observations, and the true (unknown) f(x); best point so far xt+ = 0.1000.]

(11)

... and repeat until convergence (t = 3)

[Figure: posterior µGP(x) with CI, acquisition u(x), observations, and the true (unknown) f(x); best point so far xt+ = 0.1000.]

(12)

... and repeat until convergence (t = 4)

[Figure: posterior µGP(x) with CI, acquisition u(x), observations, and the true (unknown) f(x); best point so far xt+ = 0.1000.]

(13)

... and repeat until convergence (t = 5)

[Figure: posterior µGP(x) with CI, acquisition u(x), observations, and the true (unknown) f(x); best point so far xt+ = 0.2858.]

(14)

What is Bayesian about Bayesian optimisation?

• The Bayesian strategy treats the unknown objective function as a random function and places a prior over it.

The prior captures our beliefs about the behaviour of the function. It is here defined by a Gaussian process whose covariance function captures assumptions about the smoothness of the objective.

• Function evaluations are treated as data. They are used to update the prior to form the posterior distribution over the objective function.

• The posterior distribution, in turn, is used to construct an acquisition function for querying the next point.
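The "prior over functions" can be made concrete by sampling from it (a sketch assuming a zero-mean GP with an illustrative RBF covariance; the length scale is an arbitrary choice):

```python
import numpy as np

def sample_gp_prior(xs, n_samples=3, length=0.5, seed=0):
    """Draw random functions from a zero-mean GP prior with an RBF covariance."""
    d = xs[:, None] - xs[None, :]
    K = np.exp(-0.5 * (d / length) ** 2) + 1e-9 * np.eye(len(xs))  # jitter for stability
    rng = np.random.default_rng(seed)
    # Each row is one function drawn from the prior, evaluated on the grid xs.
    return rng.multivariate_normal(np.zeros(len(xs)), K, size=n_samples)
```

Each sampled row is one plausible objective before seeing any data; conditioning on the observations reweights these draws, yielding the posterior used by the acquisition function.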

(15)

Limitations

• Bayesian optimisation has parameters itself!

Choice of the acquisition function;
Choice of the kernel (i.e., design of the prior);
Parameter wrapping;
Initialization scheme.

• Gaussian processes usually do not scale well to many observations and to high-dimensional data.

Sequential model-based optimization provides a direct and effective alternative (e.g., replacing the GP with a tree-based model).

(16)

Applications

• Bayesian optimisation has been used in many scientific fields, including robotics, machine learning, and the life sciences.

• Use cases for high energy physics?

Optimisation of simulation parameters in event generators;
Optimisation of compiler flags to maximize execution speed;
Optimisation of hyper-parameters in machine learning for HEP;
... let’s discuss further ideas?

(17)

Software

• Python:
Spearmint: https://github.com/JasperSnoek/spearmint
GPyOpt: https://github.com/SheffieldML/GPyOpt
RoBO: https://github.com/automl/RoBO
scikit-optimize: https://github.com/MechCoder/scikit-optimize (work in progress)

• C++:
MOE: https://github.com/yelp/MOE

Check also this GitHub repo for a vanilla implementation reproducing these slides.

(18)

Summary

• Bayesian optimisation provides a principled approach for optimising an expensive function f ;

• Often very effective, provided it is itself properly configured;

• Hot topic in machine learning research. Expect quick improvements!

(19)

References

Brochu, E., Cora, V. M., and de Freitas, N. (2010). A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599.

Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and de Freitas, N. (2016). Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175.
