
(1)

HAL Id: hal-03277561

https://cel.archives-ouvertes.fr/hal-03277561

Submitted on 4 Jul 2021


Distributed under a Creative Commons Attribution - NonCommercial - NoDerivatives 4.0 International License

Bayesian Optimization

Julien Bect, Emmanuel Vazquez

To cite this version:

Julien Bect, Emmanuel Vazquez. Bayesian Optimization: Engineering design under uncertainty using expensive-to-evaluate numerical models. Doctoral lecture, CEA-EDF-INRIA Numerical Analysis Summer School 2017: Design and optimization under uncertainty of large-scale numerical models, Université Pierre et Marie Curie (Paris VI), Paris, France, 2017, pp. 224. hal-03277561

(2)

Bayesian Optimization

Engineering design under uncertainty using expensive-to-evaluate numerical models

Julien Bect and Emmanuel Vazquez

Laboratoire des signaux et systèmes, Gif-sur-Yvette, France

CEA-EDF-INRIA Numerical Analysis Summer School, Université Pierre et Marie Curie (Paris VI)

Paris, July 3–7, 2017

(3)

Lecture 1: From meta-models to UQ
1.1 Introduction
1.2 Black-box modeling
1.3 Bayesian approach
1.4 Posterior distribution of a quantity of interest
1.5 Complements on Gaussian processes

Lecture 2: Bayesian optimization (BO)
2.1 Decision-theoretic framework
2.2 From Bayes-optimal to myopic strategies
2.3 Design under uncertainty

References

(4)

Lecture 1: From meta-models to UQ
1.1 Introduction
1.2 Black-box modeling
1.3 Bayesian approach
1.4 Posterior distribution of a quantity of interest
1.5 Complements on Gaussian processes

Lecture 2: Bayesian optimization (BO)
2.1 Decision-theoretic framework
2.2 From Bayes-optimal to myopic strategies
2.3 Design under uncertainty

References

(5)

Lecture 1: From meta-models to UQ
1.1 Introduction
1.2 Black-box modeling
1.3 Bayesian approach
1.4 Posterior distribution of a quantity of interest
1.5 Complements on Gaussian processes

(6)

Computer-simulation based design

The numerical model might be time- or resource-consuming

Design parameters might be subject to dispersions

The system might operate under unknown conditions

(7)

Computer-simulation based design – Example

Computer simulations to design a product or a process, in particular

to find the best feasible values for design parameters (optimization problem)

to minimize the probability of failure of a product

To comply with European emissions standards, the design parameters of combustion engines have to be carefully optimized

The shape of intake ports controls airflow characteristics, which have a direct impact on

the performance of the engine

emissions of NOx and CO

f : X ⊂ R^d → R, the performance as a function of the design parameters (d = 20–100)

Computing f(x) takes 5–20 hours

Simulation of an intake port (Navier-Stokes equ.) (courtesy of Renault)

(8)

Lecture 1 : From meta-models to UQ 1.1 Introduction

1.2 Black-box modeling 1.3 Bayesian approach

1.4 Posterior distribution of a quantity of interest 1.5 Complements on Gaussian processes

(9)

Black-box modeling

(10)

Black-box modeling

(11)

Black-box modeling

For simplification → drop u

(12)

Black-box modeling

Let f : X → R be a real function defined on X ⊆ R^d, where

X is the input/parameter domain of the computer simulation under study, or the factor space (from Latin, "which acts")

f is a performance or cost function (a function of the outputs of the computer simulation)

(13)

Black-box modeling

Let x_1, ..., x_n ∈ X be n simulation points

Denote by

z_1 = f(x_1), ..., z_n = f(x_n)

the corresponding simulation results (observations/evaluations of f)

Our objective: use the data D_n = (x_i, z_i)_{i=1,...,n} to infer properties about f

Example: given a new x ∈ R^d, predict the value f(x)

(14)

Black-box modeling

f is a black box, only known through evaluation results: query an evaluation at x, observe the result

Predict the value of f at a given x?

→ the problem is that of constructing an approximation / an estimator f̂_n of f from D_n

Such an f̂_n is also called a model or a meta-model of f (because the numerical simulator is itself a model)

(15)

A simple curve fitting problem

Suppose that we are given a data set of n simulation results, i.e., evaluation results of an unknown function f : [0, 1] → R at points x_1, ..., x_n.

A data set of size n = 8:

[Figure: the n = 8 data points, z versus x]

(16)

A simple curve fitting problem

Any approximation procedure for f consists in building a function f̂_n = h(·; θ), where θ ∈ R^l is a vector of parameters to be estimated from D_n and the available prior information

Fundamental example: the linear model

f̂_n(x) = h(x; θ) = Σ_{i=1}^{l} θ_i r_i(x)

where the functions r_i : X → R are called regressors (e.g., r_1(x) = 1, r_2(x) = x, r_3(x) = x², ... → polynomial model)

Most classical method to obtain a good value of θ:

least squares → minimize the sum of squared errors

J(θ) = Σ_{i=1}^{n} (z_i − f̂_n(x_i))² = Σ_{i=1}^{n} (z_i − h(x_i; θ))²
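As a quick illustration (not from the original slides), a minimal numpy sketch of such a least-squares fit, assuming a small synthetic data set and a polynomial regressor basis:

```python
import numpy as np

# Synthetic data set (stand-in for the n = 8 simulation results of the slides)
rng = np.random.default_rng(0)
n = 8
x = rng.uniform(0.0, 1.0, n)
z = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)  # unknown f plus a little noise

# Polynomial regressors r_1(x) = 1, r_2(x) = x, r_3(x) = x^2
def design_matrix(x, degree):
    return np.vander(x, degree + 1, increasing=True)  # columns 1, x, ..., x^degree

R = design_matrix(x, 2)

# Least squares: theta minimizes sum_i (z_i - h(x_i; theta))^2
theta, *_ = np.linalg.lstsq(R, z, rcond=None)

# Prediction f̂_n at new points
x_new = np.linspace(0.0, 1.0, 200)
z_hat = design_matrix(x_new, 2) @ theta
```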

(17)

A simple curve fitting problem

Linear fit

[Figure: linear fit of the data set, z versus x]

(18)

A simple curve fitting problem

Quadratic fit

[Figure: quadratic fit of the data set, z versus x]

(19)

A simple curve fitting problem

Poor fit!

Why? Model capacity is weak

Now, as an example, consider the model f̂_n(x) = θ^t r(x) with

r(x) = (1, cos(2πx), sin(2πx), ..., cos(2mπx), sin(2mπx))^t ∈ R^{2m+1} (a truncated Fourier basis)

The increase in the number of parameters yields an ill-defined problem (l ≫ n)

(20)

A simple curve fitting problem

When the problem becomes ill-defined (as capacity increases), a classical solution for finding a good value of θ is to minimize the sum of an approximation-error term and a regularization term:

J(θ) = Σ_{i=1}^{n} (z_i − θ^t r(x_i))² + C ‖θ‖_2²,  C > 0

‖θ‖_2² penalizes vectors θ with large elements

C strikes a balance between regularization and data fidelity

This approach is known as Tikhonov regularization (Tikhonov & Arsenin, 1977) → at the basis of numerous approximation methods (ridge regression, splines, RBF, SVM, ...)
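A minimal sketch, under the assumption of the Fourier regressor basis above, of the closed-form Tikhonov/ridge solution θ = (RᵗR + C I)⁻¹ Rᵗz; the helper name fourier_design is illustrative, not from the slides:

```python
import numpy as np

def fourier_design(x, m):
    """Regressors (1, cos 2*pi*k*x, sin 2*pi*k*x), k = 1..m, as an (n, 2m+1) matrix."""
    cols = [np.ones_like(x)]
    for k in range(1, m + 1):
        cols.append(np.cos(2 * np.pi * k * x))
        cols.append(np.sin(2 * np.pi * k * x))
    return np.column_stack(cols)

def ridge_fit(x, z, m, C):
    """Minimize sum_i (z_i - theta^t r(x_i))^2 + C * ||theta||_2^2 (normal equations)."""
    R = fourier_design(x, m)
    l = R.shape[1]
    return np.linalg.solve(R.T @ R + C * np.eye(l), R.T @ z)

# Usage: theta = ridge_fit(x, z, m=50, C=1e-8); prediction: fourier_design(x_new, 50) @ theta
```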

(21)

A simple curve fitting problem

n = 8, m = 50, l = 101, C = 10^{-8}

[Figure: regularized Fourier fit of the data set, z versus x]

(22)

A simple curve fitting problem

The regularization principle alone is not enough to obtain a good approximation

As modeling capacity increases, overfitting may arise

(23)

A simple curve fitting problem

To avoid overfitting, we should try a regularization that penalizes high frequencies more

For instance, take

‖θ‖² = θ_1² + Σ_{k=1}^{m} (θ_{2k}² + θ_{2k+1}²) (1 + (2kπ)^α)²
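A possible implementation sketch of this frequency-weighted penalty, assuming the penalty is applied through a diagonal weight matrix with the weights above, and reusing the hypothetical fourier_design helper from the previous sketch:

```python
import numpy as np

def weighted_ridge_fit(x, z, m, C, alpha):
    """Tikhonov fit with penalty C * sum_j w_j * theta_j^2, where the weight on the
    frequency-k coefficients grows like (1 + (2*pi*k)**alpha)**2."""
    R = fourier_design(x, m)                 # regressor matrix from the previous sketch
    w = [1.0]                                # weight for the constant term
    for k in range(1, m + 1):
        w += [(1.0 + (2 * np.pi * k) ** alpha) ** 2] * 2   # cos and sin share a weight
    W = np.diag(w)
    return np.linalg.solve(R.T @ R + C * W, R.T @ z)

# e.g. theta = weighted_ridge_fit(x, z, m=50, C=1e-8, alpha=1.3)
```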

(24)

A simple curve fitting problem

n = 8, m = 50, l = 101, C = 10^{-8}, α = 1.3

[Figure: fit with the frequency-weighted regularization, z versus x]

(25)

A simple curve fitting problem

From this example, we can see that the construction of a regularization scheme should result from a procedure that takes into account the data (using cross-validation, for instance) and/or prior knowledge

(for instance, high-frequency content ≪ low-frequency content)

(26)

Black-box modeling

A large number of methods are available in the literature:

polynomial regression, splines, NN, RBF. . .

All methods are based on mixing prior information and regularization principles

Instances of “regularization” in regression:

t-tests, F-tests, ANOVA, AIC (Akaike info criterion). . . in linear regression

Early stopping in NN

Regularized reproducing-kernel regression

1960s: splines (Schoenberg 1964, Duchon 1976–1979)

1970s: ridge regression (Hoerl & Kennard)

1980s: RBF (Micchelli 1986, Powell 1987)

1995: SVM (Vapnik 1995)

1997: SVR (Smola 1997) & semi-parametric SVR (Smola 1999)

(27)

Black-box modeling

How to choose a regularization scheme?

The Bayesian setting is a principled approach that makes it possible to construct regularized regressions

(28)

Lecture 1: From meta-models to UQ
1.1 Introduction
1.2 Black-box modeling
1.3 Bayesian approach
1.4 Posterior distribution of a quantity of interest
1.5 Complements on Gaussian processes

(29)

Why a Bayesian approach?

Objective: infer properties about f : X → R through pointwise evaluations

Why a Bayesian approach?

a principled approach to choose a regularization scheme according to prior information

through probability calculus and/or Monte Carlo simulations, the user can infer properties about the unknown function

for instance: given prior knowledge and D_n, what is the probability that the global maximum of f is greater than a given threshold u ∈ R?

Main idea: use a statistical model of the observations, together with a probability model for the parameter of the statistical model

(30)

Some reminders about probabilities

Recall that a random variable is a function that maps a sample space Ω to an outcome space E (e.g., E = R), and that assigns probabilities (weights) to possible outcomes

Def.

Formally, let (Ω, A, P) be a probability space, and (E, E) be a measurable outcome space

a random variable X is a measurable function (Ω, A, P) → (E, E)

X is used to assign probabilities to events: for instance P(X ∈ [0, 1]) = P_X([0, 1]) = 1/2

Case of a random variable with a density:

P(X ∈ [a, b]) = P_X([a, b]) = ∫_a^b p_X(x) dx

(31)

Gaussian random variables

A real-valued random variable Z is said to be Gaussian N(µ, σ²) if it has the continuous probability density function

g_{µ,σ²}(z) = 1/√(2πσ²) · exp(−(z − µ)² / (2σ²))

[Figure: the Gaussian density g_{µ,σ²}, centered at µ with standard deviation σ]

(32)

Gaussian random variables

The mean of Z (also called expectation or first-order moment) is

E(Z) = ∫_R z g_{µ,σ²}(z) dz = µ

and its second-order moment is defined as

E(Z²) = ∫_R z² g_{µ,σ²}(z) dz = σ² + µ²

The variance of Z is defined as

var(Z) = E[(Z − E(Z))²] = E[Z²] − E[Z]² = σ²

(33)

Gaussian random variables

A Gaussian variable Z can be used as a stochastic model of some uncertain real-valued quantity

In other words, Z can be thought of as a prior about some uncertain quantity of interest

(34)

Gaussian random variables

Using a random generator, it is possible to "generate" sample values z_1, z_2, ... of our model Z → possible values for our uncertain quantity of interest

[Figure: histogram of n = 100 sample values against the Gaussian density]

(35)

Gaussian random variables

Using a random generator, it is possible to "generate" sample values z_1, z_2, ... of our model Z → possible values for our uncertain quantity of interest

[Figure: histogram of n = 1000 sample values against the Gaussian density]

(36)

Gaussian random variables

Using a random generator, it is possible to "generate" sample values z_1, z_2, ... of our model Z → possible values for our uncertain quantity of interest

[Figure: histogram of n = 10000 sample values against the Gaussian density]

(as n → ∞, the empirical distribution of the realizations tends to the Gaussian distribution N(µ, σ²))
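As a minimal illustration (not from the slides), one can draw sample values with numpy and check that the empirical mean and standard deviation approach µ and σ as n grows:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.5
for n in (100, 1000, 10000):
    z = rng.normal(mu, sigma, size=n)   # n sample values of Z ~ N(mu, sigma^2)
    print(n, z.mean(), z.std())         # empirical mean and std tend to mu and sigma
```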

(37)

Bayesian model

Formally, recall that a statistical model is a triplet M = (Z, F, P)

Z → observation space (typically, Z = R^n)

F → σ-algebra on Z

P → parametric family {P_θ; θ ∈ Θ} of probability distributions on (Z, F)

Def.

A Bayesian model is defined by the specification of

a parametric statistical model M (model of the observations)

a prior probability distribution Π on (Θ, Ξ) → probability model that describes the uncertainty about θ before an observation is made

(38)

Example

Suppose we repeat measurements of a quantity of interest:

z_1, z_2, ... ∈ R

Model of observations: Z_i ∼ N(θ_1, θ_2), i.i.d., i = 1, ..., n

The statistical model can formally be written as the triplet M = (R^n, B(R^n), {N(θ_1, θ_2)^⊗n}_{(θ_1, θ_2) ∈ Θ})

Moreover, if we assume a prior distribution about θ_1 and θ_2 (e.g., θ_1 ∼ N(1, 1), θ_2 ∼ IG(3, 2)), we obtain a Bayesian model

From this Bayesian model (model of observations + prior), we can compute the posterior distribution of (θ_1, θ_2) given Z_1, ..., Z_n (will be explained later)
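As an illustration only (not part of the slides), the posterior of (θ_1, θ_2) in this example can be evaluated numerically on a grid, up to normalization; the synthetic data, the grid bounds, and the (shape, scale) parameterization of the inverse-gamma prior are assumptions:

```python
import numpy as np
from scipy import stats

# Synthetic observations, mimicking the slide's setting
rng = np.random.default_rng(0)
n, theta1_true, theta2_true = 50, 1.2, 0.8
z = rng.normal(theta1_true, np.sqrt(theta2_true), size=n)

# Priors: theta1 ~ N(1, 1), theta2 ~ Inverse-Gamma(3, 2) (shape 3, scale 2 assumed)
theta1_grid = np.linspace(0.0, 2.0, 200)
theta2_grid = np.linspace(0.05, 2.0, 200)
T1, T2 = np.meshgrid(theta1_grid, theta2_grid, indexing="ij")

log_prior = stats.norm.logpdf(T1, loc=1.0, scale=1.0) \
          + stats.invgamma.logpdf(T2, a=3.0, scale=2.0)
# Gaussian log-likelihood of the n observations for each grid value of (theta1, theta2)
log_lik = np.sum(stats.norm.logpdf(z[:, None, None], loc=T1, scale=np.sqrt(T2)), axis=0)

log_post = log_prior + log_lik
post = np.exp(log_post - log_post.max())   # unnormalized posterior on the grid
post /= post.sum() * (theta1_grid[1] - theta1_grid[0]) * (theta2_grid[1] - theta2_grid[0])
```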

(39)

Example

n = 50, θ_1 = 1.2, θ_2 = 0.8

[Figure: histogram (with pdf) of the sample of size n = 50, and the prior and posterior densities of (θ_1, θ_2)]

(40)

Simple curve fitting problem from a Bayesian approach

Recall our simple curve fitting model f̂_n(x) = θ^t r(x) with

r(x) = (1, cos(2πx), sin(2πx), ..., cos(2mπx), sin(2mπx))^t ∈ R^{2m+1}

Bayesian model?

(41)

Assume the following statistical model for the observations:

Z_i = ξ(x_i) + ε_i, i = 1, ..., n, with ξ(x) = θ^t r(x), x ∈ X, and ε_i ∼ N(0, σ_ε²), i.i.d.

or equivalently,

Z_i ∼ N(θ^t r(x_i), σ_ε²), independently, i = 1, ..., n

Moreover, choose a prior distribution for θ:

θ_j ∼ N(0, σ²_{θ_j}), independently, j = 1, ..., 2m+1

The random variables Z_i constitute a Bayesian model of the observations

ξ is a random function / random process → prior about f

(42)

Random process

Def.

A random process ξ : Ω × X → R is a collection of random variables ξ(·, x) : Ω → R indexed by x ∈ X

Random processes can be viewed as a generalization of random vectors

For a fixed ω ∈ Ω, the function ξ(ω, ·) : X → R is called a sample path

(43)

In our Bayesian setting, we can say that

we use a random process ξ as a stochastic model of the unknown function f

f is viewed as a sample path of ξ

ξ represents our knowledge about f before any evaluation has been made

the distribution Π = P_ξ is a prior about f

[Figure: the set of all real functions, with the prior P_ξ as a distribution over it and the unknown function f as one point]

(44)

Here, ξ(ω, ·) = θ(ω)^t r(·)

Fixing ω (a sample path) amounts to “choosing” a value for the random vector θ

(45)

Example of sample paths with

θ_1 ∼ N(0, 1)

θ_{2k}, θ_{2k+1} ∼ N(0, 1 / (1 + (k/ω_0)^α)), independently, k = 1, ..., m

with ω_0 = 10, α = 4

[Figure: sample paths z(x) on [0, 1]]
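A small sketch of how such sample paths can be generated, assuming the variance profile 1/(1 + (k/ω_0)^α) reconstructed above; the function name and the evaluation grid are illustrative choices:

```python
import numpy as np

def sample_fourier_prior_paths(n_paths, m=50, omega0=10.0, alpha=4.0, n_grid=500, seed=0):
    """Draw sample paths of xi(x) = theta^t r(x) on [0, 1], with independent Gaussian
    coefficients theta_1 ~ N(0, 1) and theta_{2k}, theta_{2k+1} ~ N(0, 1/(1+(k/omega0)**alpha))."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_grid)
    k = np.arange(1, m + 1)
    var_k = 1.0 / (1.0 + (k / omega0) ** alpha)   # variance of the frequency-k coefficients
    paths = []
    for _ in range(n_paths):
        theta1 = rng.normal(0.0, 1.0)
        a = rng.normal(0.0, np.sqrt(var_k))        # cosine coefficients
        b = rng.normal(0.0, np.sqrt(var_k))        # sine coefficients
        xi = theta1 + np.cos(2 * np.pi * np.outer(x, k)) @ a + np.sin(2 * np.pi * np.outer(x, k)) @ b
        paths.append(xi)
    return x, np.array(paths)

# e.g. x, paths = sample_fourier_prior_paths(5)   # 5 sample paths on a 500-point grid
```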

(46)

Example of sample paths with

θ_1 ∼ N(0, 1)

θ_{2k}, θ_{2k+1} ∼ N(0, 1 / (1 + (k/ω_0)^α)), independently, k = 1, ..., m

with ω_0 = 50, α = 4

[Figure: sample paths z(x) on [0, 1]]

(47)

Example of sample paths with

θ_1 ∼ N(0, 1)

θ_{2k}, θ_{2k+1} ∼ N(0, 1 / (1 + (k/ω_0)^α)), independently, k = 1, ..., m

with ω_0 = 10, α = 1.5

[Figure: sample paths z(x) on [0, 1]]

(48)

Bayesian approach

The choice of a prior in a Bayesian approach reflects the user's knowledge about the uncertain parameters

In the case of function approximation → the regularity of the function

(49)

Bayesian approach

Where shall we go now?

Objective: compute posterior distributions from data

[Figure: data points and sample paths, z versus x]


(54)

A “simplification” of the Bayesian model for the simple curve fitting problem

ξ = θ^t r (our prior about f) is a Gaussian process

Why?

(55)

Gaussian random vectors

Def.

A real-valued random vector Z = (Z_1, ..., Z_d) ∈ R^d is said to be Gaussian iff any linear combination of its components, Σ_{i=1}^{d} a_i Z_i with a_1, ..., a_d ∈ R, is a Gaussian variable

A Gaussian random vector Z is characterized by its mean vector, µ = (E[Z_1], ..., E[Z_d]) ∈ R^d, and the covariances of the pairs of components (Z_i, Z_j), i, j ∈ {1, ..., d},

cov(Z_i, Z_j) = E[(Z_i − E(Z_i))(Z_j − E(Z_j))]

If Z ∈ R^d is a Gaussian vector with mean µ ∈ R^d and covariance matrix Σ ∈ R^{d×d}, we shall write Z ∼ N(µ, Σ)

(56)

Exercise: Let Z ∼ N(µ, Σ). Determine E[Σ_{i=1}^{d} a_i Z_i] and var(Σ_{i=1}^{d} a_i Z_i)

The correlation coefficient of two components Z_i and Z_j of Z is defined by

ρ(Z_i, Z_j) = cov(Z_i, Z_j) / √(var(Z_i) var(Z_j)) ∈ [−1, 1];

it measures the (linear) similarity between Z_i and Z_j

(57)

Gaussian random vectors: correlation

[Figure: scatter plots of samples from a bivariate Gaussian distribution with ρ = 0 and with ρ = 0.8]

(58)

Gaussian random processes

Recall that a random process is a set ξ = {ξ(x), x ∈ X} of random variables indexed by the elements of X

Gaussian random process → generalization of a Gaussian random vector

Def.

ξ is a Gaussian random process iff, ∀n ∈ N, ∀x_1, ..., x_n ∈ X and ∀a_1, ..., a_n ∈ R, the real-valued random variable Σ_{i=1}^{n} a_i ξ(x_i) is Gaussian

(59)

Application

If

ξ(x) = θ^t r(x), x ∈ X

θ_j ∼ N(0, σ²_{θ_j}), independently, j = 1, ..., 2m+1

then, ∀x_1, ..., x_n ∈ X and ∀a_1, ..., a_n ∈ R,

Σ_{i=1}^{n} a_i ξ(x_i) = Σ_i a_i Σ_j θ_j r_j(x_i) = Σ_j (Σ_i a_i r_j(x_i)) θ_j ∼ N(0, Σ_j (Σ_i a_i r_j(x_i))² σ²_{θ_j})

Thus, ξ = θ^t r is a Gaussian process

(60)

Gaussian random processes

A Gaussian process is characterized by

its mean function

m : x ∈ X ↦ E[ξ(x)]

and its covariance function

k : (x, y) ∈ X² ↦ cov(ξ(x), ξ(y))

Notation: ξ ∼ GP(m, k)

(61)

Exercise: determine E[Σ_{i=1}^{n} a_i ξ(x_i)] and var(Σ_{i=1}^{n} a_i ξ(x_i))

What is the distribution of Σ_i a_i ξ(x_i)?

→ The distribution of a linear combination of a Gaussian process GP(m, k) can simply be obtained as a function of m and k

(62)

Application

If

ξ(x) = θ^t r(x), x ∈ X

θ_j ∼ N(0, σ²_{θ_j}), independently, j = 1, ..., 2m+1

then

ξ ∼ GP(0, k) with

k : (x, y) ↦ Σ_j σ²_{θ_j} r_j(x) r_j(y)

(63)

Covariance function corresponding to

θ_1 ∼ N(0, 1)

θ_{2k}, θ_{2k+1} ∼ N(0, 1 / (1 + (k/ω_0)^α)), independently, k = 1, ..., m

with ω_0 = 10, α = 4

[Figure: the corresponding covariance function k]

(64)

Covariance function corresponding to

θ_1 ∼ N(0, 1)

θ_{2k}, θ_{2k+1} ∼ N(0, 1 / (1 + (k/ω_0)^α)), independently, k = 1, ..., m

with ω_0 = 50, α = 4

[Figure: the corresponding covariance function k]

(65)

The covariance of a Gaussian random process

Main properties: a covariance function k is

symmetric: ∀x, y ∈ X, k(x, y) = k(y, x)

positive: ∀n ∈ N, ∀x_1, ..., x_n ∈ X, ∀a_1, ..., a_n ∈ R, Σ_{i,j=1}^{n} a_i k(x_i, x_j) a_j ≥ 0

In the following, we shall assume that the covariance of ξ ∼ GP(m, k) is invariant under translations, or stationary:

k(x + h, y + h) = k(x, y), ∀x, y, h ∈ X

When k is stationary, there exists a stationary covariance k_sta : R^d → R such that

cov(ξ(x), ξ(y)) = k(x, y) = k_sta(x − y)

(66)

Stationary covariances

When k is stationary, the variance

var(ξ(x)) = cov(ξ(x), ξ(x)) = k_sta(0) does not depend on x

The covariance function can be written as k_sta(x − y) = σ² ρ(x − y),

with σ² = var(ξ(x)), and where ρ is the correlation function of ξ.

(67)

Stationary covariances

The graph of the correlation function has a symmetric "bell curve" shape

[Figure: a stationary correlation function ρ(h), with its correlation range indicated]

(68)

We have, by the Cauchy–Schwarz inequality, ∀h ∈ X,

|k_sta(h)| = |cov(ξ(x), ξ(x + h))|
= |E[(ξ(x) − m(x))(ξ(x + h) − m(x + h))]|
≤ E[(ξ(x) − m(x))²]^{1/2} E[(ξ(x + h) − m(x + h))²]^{1/2}
= k_sta(0)^{1/2} k_sta(0)^{1/2} = k_sta(0)

Recall Bochner's spectral representation theorem:

Theorem

A real function k(h), h ∈ R^d, is symmetric positive iff it is the Fourier transform of a finite positive measure, i.e.,

k(h) = ∫_{R^d} e^{i⟨u, h⟩} dµ(u),

where µ is a finite positive measure on R^d.

(69)

Gaussian process simulation

Using a random generator, it is possible to "generate" sample paths f_1, f_2, ... of a Gaussian process ξ

[Figure: covariance functions and corresponding Gaussian process sample paths]

(70)

Gaussian process simulation

How to simulate sample paths of a zero-mean Gaussian random process?

Choose a set of points x_1, ..., x_n ∈ X

Denote by K the n×n covariance matrix of the random vector ξ = (ξ(x_1), ..., ξ(x_n))^t (NB: ξ ∼ N(0, K))

Consider the Cholesky factorization of K

K = C C^t,

with C a lower triangular matrix (such a factorization exists since K is a symmetric positive definite matrix)

Let ε = (ε_1, ..., ε_n)^t be a Gaussian vector with ε_i i.i.d. ∼ N(0, 1)

Then Cε ∼ N(0, K)
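A minimal numpy sketch of this simulation procedure; the squared-exponential covariance used in the example is an illustrative choice, not prescribed by the slide:

```python
import numpy as np

def simulate_gp_paths(x, kernel, n_paths=3, jitter=1e-10, seed=0):
    """Simulate sample paths of a zero-mean GP at the points x via Cholesky factorization."""
    rng = np.random.default_rng(seed)
    K = kernel(x[:, None], x[None, :])                       # n x n covariance matrix
    C = np.linalg.cholesky(K + jitter * np.eye(len(x)))      # K = C C^t (jitter for stability)
    eps = rng.standard_normal((len(x), n_paths))             # i.i.d. N(0, 1) entries
    return C @ eps                                           # each column ~ N(0, K)

# Example with a squared-exponential (Gaussian) covariance, an illustrative choice
def squared_exponential(x, y, sigma2=1.0, rho=0.2):
    return sigma2 * np.exp(-0.5 * (x - y) ** 2 / rho ** 2)

x = np.linspace(0.0, 1.0, 200)
paths = simulate_gp_paths(x, squared_exponential, n_paths=5)
```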

(71)

“Simplification” of the Bayesian model for the simple curve fitting problem

Instead of choosing the model

ξ(x) = θ^t r(x), x ∈ X

θ_j ∼ N(0, σ²_{θ_j}), independently, j = 1, ..., 2m+1,

simply choose a covariance function k and assume ξ ∼ GP(0, k)

More details about how to choose k will be given below

(72)

Posterior

How to compute a posterior distribution, given the Gaussian-process prior ξ and data?

[Figure: observed data points, z versus x]

(73)

Conditional distributions

Let X be a random variable modeling an unknown quantity of interest

Assume we observe a random variable T, or a random vector T = (T1, . . . ,Tn)

Provided that T and X are not independent, T contains information about X

Aim: define a notion of distribution of X “knowing” T

(74)

Conditional probabilities

Recall the following

Def.

Let (Ω, A, P) be a probability space. Given two events A, B ∈ A such that P(B) ≠ 0, define the probability of A given B (or conditional on B) by

P(A | B) = P(A ∩ B) / P(B)

(75)

The notion of conditional density

Def.

Assume that the pair (X, T) ∈ R² has a density p_{(X,T)}. Define the conditional density of X given the event T = t by

p_{X|T}(x | t) = p_{(X,T)}(x, t) / p_T(t) = p_{(X,T)}(x, t) / ∫ p_{(X,T)}(x, t) dx,  if p_T(t) > 0,

and an arbitrary density if p_T(t) = 0.

(76)

Application

Given a Gaussian random vector Z = (Z_1, Z_2) ∼ N(0, Σ), what is the distribution of Z_2 "knowing" Z_1?

Define

p_{Z_2|Z_1}(z_2 | z_1) = p_{(Z_1,Z_2)}(z_1, z_2) / p_{Z_1}(z_1) = σ_1 / ((2π)^{1/2} (det Σ)^{1/2}) · exp(−½ (z^t Σ⁻¹ z − z_1²/σ_1²))

→ Gaussian distribution!

[Figure: conditional density of Z_2 given Z_1]

The random variable denoted by Z_2 | Z_1, with density p_{Z_2|Z_1}(· | Z_1), represents the residual uncertainty about Z_2 when Z_1 has been observed

High correlation → small residual uncertainty

(77)

Conditional mean/expectation

A fundamental notion: conditional expectation

Def.

Assume that the pair (X, T) has a density p_{(X,T)}. Define the conditional mean of X given T = t as

E(X | T = t) = ∫_R x p_{X|T}(x | t) dx = h(t)

Def.

The random variable E(X | T) = h(T) is called the conditional expectation of X given T

(78)

Conditional expectation

Why is conditional expectation a fundamental notion?

We have the following

Theorem

Under the previous assumptions, the solution of the problem X̂ = argmin_Y E[(X − Y)²], where the minimum is taken over all functions Y of T, is given by X̂ = E(X | T)

In other words, E(X | T) is the best approximation (in the quadratic-mean sense) of X by a function of T

(79)

Important properties of conditional expectation

(1) E(X | T) is a random variable depending on T → there exists a function h such that E(X | T) = h(T)

(2) The operator π_H : X ↦ E(X | T) is a (linear) operator of orthogonal projection onto the space of all functions of T (for the inner product X, Y ↦ (X, Y) = E(XY))

(3) Let X, Y, T ∈ L². Then

i) ∀α ∈ R, E(αX + Y | T) = α E(X | T) + E(Y | T) a.s.

ii) E(E[X | T]) = E(X)

iii) If X ⊥ T (independent), E(X | T) = E(X)

(80)

Conditional expectation: Gaussian random variables

Recall that the space of second-order random variables L²(Ω, A, P), endowed with the inner product X, Y ↦ (X, Y) = E(XY), is a Hilbert space

Gaussian linear space

Def.

A linear subspace G of L²(Ω, A, P) is Gaussian iff ∀X_1, ..., X_n ∈ G and ∀a_1, ..., a_n ∈ R, the random variable Σ_i a_i X_i is Gaussian

In what follows, assume that G is centered, i.e., each element in G is a zero-mean random variable

(81)

Theorem (Projection theorem in centered Gaussian spaces). Let G be a centered Gaussian space, and let X, T_1, ..., T_n ∈ G. Then E(X | T_1, ..., T_n) is the orthogonal projection (in L²) of X on T = span{T_1, ..., T_n}.

Proof. Let X̂ ∈ G be the orthogonal projection of X on T.

We have X = X̂ + ε, where ε ∈ G is orthogonal to T.

In G, orthogonality ⇔ independence. Thus, ε ⊥ T_i, i = 1, ..., n.

Then,

E(X | T_1, ..., T_n) = E(X̂ | T_1, ..., T_n) + E(ε | T_1, ..., T_n) = X̂ + E(ε) = X̂

The result follows.

(82)

Application

Let Z = (Z_1, Z_2) be a zero-mean Gaussian random vector, with covariance matrix

Σ = ( σ_1²  σ_{1,2} ; σ_{2,1}  σ_2² )

Then E(Z_1 | Z_2) is the orthogonal projection of Z_1 onto Z_2. Thus E(Z_1 | Z_2) = λ Z_2

with

(Z_1 − λ Z_2, Z_2) = (Z_1, Z_2) − λ (Z_2, Z_2) = 0. Hence,

λ = σ_{1,2} / σ_2²

(83)

Application

Let Z = (Z_1, Z_2) ∼ N(0, Σ) as above; recall that the conditional distribution of Z_2 given Z_1 is a Gaussian distribution

Hence Z_2 | Z_1 ∼ N(µ(Z_1), σ(Z_1)²). µ? σ?

µ = E(Z_2 | Z_1) = λ Z_1, with λ = σ_{1,2} / σ_1²

Using the properties of the orthogonal projection:

σ² = E[(Z_2 − E(Z_2 | Z_1))² | Z_1]
= E[(Z_2 − E(Z_2 | Z_1))²]
= E[(Z_2 − λ Z_1)²]
= E[(Z_2 − λ Z_1) Z_2]
= σ_2² − σ_{1,2}² / σ_1²
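A small Monte Carlo check of these formulas (an illustration, not from the slides), with an arbitrary covariance matrix: the empirical mean and variance of Z_2 over samples with Z_1 ≈ z_1 should be close to λ z_1 and σ_2² − σ_{1,2}²/σ_1²:

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.5]])              # (sigma1^2, sigma12; sigma12, sigma2^2), illustrative
Z = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=200_000)

z1 = 1.0                                    # condition on Z_1 close to z1
cond_sample = Z[np.abs(Z[:, 0] - z1) < 0.02, 1]

lam = Sigma[0, 1] / Sigma[0, 0]             # lambda = sigma12 / sigma1^2
mu_theory = lam * z1                        # = 0.8
var_theory = Sigma[1, 1] - Sigma[0, 1] ** 2 / Sigma[0, 0]   # = 0.86

print(cond_sample.mean(), mu_theory)        # empirical vs. theoretical conditional mean
print(cond_sample.var(), var_theory)        # empirical vs. theoretical conditional variance
```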

(84)

Generalization

Exercise: Let (Z0,Z1, . . . ,Zn) be a centered Gaussian vector.

Determine E(Z_0 | Z_1, ..., Z_n).

(85)

Application: computation of the posterior distrib. of a GP

Let ξ ∼ GP(0,k)

Assume we observe ξ at x_1, ..., x_n ∈ X

Given x0 ∈X, what is the conditional distrib.—or the posterior distrib. in our Bayesian framework—of

ξ(x0)|ξ(x1), . . . , ξ(xn) ?

More generally, what is the posterior distribution of the random process

ξ(·)|ξ(x1), . . . , ξ(xn) ?

(86)

Computation of the posterior distribution of a GP

Prop.

Let ξ ∼ GP(0, k). The random process ξ conditioned on F_n = {ξ(x_1), ..., ξ(x_n)}, denoted by ξ | F_n, is a Gaussian process with

– mean ξ̂_n : x ↦ E(ξ(x) | ξ(x_1), ..., ξ(x_n))

– covariance k_n : (x, y) ↦ E[(ξ(x) − ξ̂_n(x))(ξ(y) − ξ̂_n(y))]

(87)

Computation of the posterior distrib. of a GP

By the property of the conditional expectation in Gaussian spaces, for all x ∈ X, ξ̂_n(x) is a linear combination of ξ(x_1), ..., ξ(x_n):

ξ̂_n(x) := Σ_{i=1}^{n} λ_i(x) ξ(x_i)

Moreover, the posterior mean ξ̂_n(x) is the orthogonal projection of ξ(x) onto span{ξ(x_i), i = 1, ..., n}, such that

ξ̂_n(x) = argmin_Y E[(ξ(x) − Y)²] → the variance of the prediction error is minimal

E(ξ̂_n(x)) = E[E(ξ(x) | F_n)] = E(ξ(x)) = 0 → unbiased estimation

ξ̂_n(x) is the best linear predictor (BLP) of ξ(x) from ξ(x_1), ..., ξ(x_n), also called the kriging predictor of ξ(x)
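As an illustration (not from the slides), the following sketch computes the kriging predictor and posterior covariance from the standard Gaussian conditioning formulas, then uses posterior samples to estimate a quantity such as P(max f > u), as mentioned in the introduction; the kernel, data, and threshold are illustrative assumptions:

```python
import numpy as np

def gp_posterior(x_obs, z_obs, x_new, kernel, jitter=1e-10):
    """Posterior (kriging) mean and covariance of a zero-mean GP at x_new,
    given noiseless observations z_obs at x_obs (Gaussian conditioning formulas)."""
    K_nn = kernel(x_obs[:, None], x_obs[None, :]) + jitter * np.eye(len(x_obs))
    K_sn = kernel(x_new[:, None], x_obs[None, :])
    K_ss = kernel(x_new[:, None], x_new[None, :])
    alpha = np.linalg.solve(K_nn, z_obs)
    mean = K_sn @ alpha                                    # posterior mean (kriging predictor)
    cov = K_ss - K_sn @ np.linalg.solve(K_nn, K_sn.T)      # posterior covariance k_n
    return mean, cov

# Illustrative usage: estimate P(max f > u) by sampling from the posterior
kernel = lambda x, y: np.exp(-0.5 * (x - y) ** 2 / 0.2 ** 2)   # squared-exponential, illustrative
x_obs = np.array([0.1, 0.35, 0.6, 0.9])
z_obs = np.sin(2 * np.pi * x_obs)                               # stand-in for expensive evaluations
x_new = np.linspace(0.0, 1.0, 200)

mean, cov = gp_posterior(x_obs, z_obs, x_new, kernel)
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean, cov + 1e-10 * np.eye(len(x_new)), size=2000)
u = 0.8
print("Estimated P(max f > u):", np.mean(samples.max(axis=1) > u))
```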
