HAL Id: hal-03277561
https://cel.archives-ouvertes.fr/hal-03277561
Submitted on 4 Jul 2021
Distributed under a Creative Commons Attribution - NonCommercial - NoDerivatives 4.0 International License
Bayesian Optimization
Julien Bect, Emmanuel Vazquez
To cite this version:
Julien Bect, Emmanuel Vazquez. Bayesian Optimization: Engineering design under uncertainty using expensive-to-evaluate numerical models. Doctoral lecture, CEA-EDF-INRIA Numerical Analysis Summer School 2017: Design and optimization under uncertainty of large-scale numerical models, Université Pierre et Marie Curie (Paris VI), Paris, France, 2017, pp. 224. ⟨hal-03277561⟩
Bayesian Optimization
— Engineering design under uncertainty using expensive-to-evaluate numerical models —
Julien Bect and Emmanuel Vazquez
Laboratoire des signaux et systèmes, Gif-sur-Yvette, France
CEA-EDF-INRIA Numerical Analysis Summer School, Université Pierre et Marie Curie (Paris VI)
Paris, July 3–7, 2017
Lecture 1: From meta-models to UQ
  1.1 Introduction
  1.2 Black-box modeling
  1.3 Bayesian approach
  1.4 Posterior distribution of a quantity of interest
  1.5 Complements on Gaussian processes

Lecture 2: Bayesian optimization (BO)
  2.1 Decision-theoretic framework
  2.2 From Bayes-optimal to myopic strategies
  2.3 Design under uncertainty

References
1.1 Introduction
Computer-simulation based design
◮ The numerical model might be time- or resource-consuming
◮ Design parameters might be subject to dispersions
◮ The system might operate under unknown conditions
Computer-simulation based design – Example
◮ Computer simulations to design a product or a process, in particular
◮ to find the best feasible values for design parameters (optimization problem)
◮ to minimize the probability of failure of a product
◮ To comply with European emissions standards, the design parameters of combustion engines have to be carefully optimized
◮ The shape of the intake ports controls the airflow characteristics, which have a direct impact on
◮ the performance of the engine
◮ the emissions of NOx and CO
◮ $f : \mathbb{X} \subset \mathbb{R}^d \to \mathbb{R}$: the performance as a function of the design parameters ($d = 20$ to $100$)
◮ Computing $f(x)$ takes 5 to 20 hours
[Figure: simulation of an intake port (Navier–Stokes equations); courtesy of Renault.]
1.2 Black-box modeling
Black-box modeling
[Figure: block-diagram view of the computer simulation; for simplification, the input u is dropped in what follows.]
◮ Let $f : \mathbb{X} \to \mathbb{R}$ be a real function defined on $\mathbb{X} \subseteq \mathbb{R}^d$, where
◮ $\mathbb{X}$ is the input/parameter domain of the computer simulation under study, or the factor (from the Latin, "that which acts") space
◮ $f$ is a performance or cost function (a function of the outputs of the computer simulation)
Black-box modeling
◮ Let $x_1, \dots, x_n \in \mathbb{X}$ be $n$ simulation points
◮ Denote by
$$z_1 = f(x_1), \;\dots,\; z_n = f(x_n)$$
the corresponding simulation results (observations/evaluations of $f$)
◮ Our objective: use the data $D_n = (x_i, z_i)_{i=1,\dots,n}$ to infer properties about $f$
◮ Example: given a new $x \in \mathbb{R}^d$, predict the value $f(x)$
Black-box modeling
◮ $f$ is a black box, only known through evaluation results: query an evaluation at $x$, observe the result
◮ Predict the value of $f$ at a given $x$?
→ the problem is that of constructing an approximation / an estimator $\hat f_n$ of $f$ from $D_n$
◮ Such an $\hat f_n$ is also called a model or a meta-model of $f$ (because the numerical simulator is itself a model)
A simple curve fitting problem
◮ Suppose that we are given a data set of $n$ simulation results, i.e., evaluation results of an unknown function $f : [0, 1] \to \mathbb{R}$ at points $x_1, \dots, x_n$
◮ A data set of size $n = 8$:
[Figure: the $n = 8$ data points $(x_i, z_i)$ in the $(x, z)$ plane.]
A simple curve fitting problem
◮ Any approximation procedure for $f$ consists in building a function $\hat f_n = h(\cdot\,; \theta)$, where $\theta \in \mathbb{R}^l$ is a vector of parameters to be estimated from $D_n$ and available prior information
◮ Fundamental example: the linear model
$$\hat f_n(x) = h(x; \theta) = \sum_{i=1}^{l} \theta_i\, r_i(x),$$
where the functions $r_i : \mathbb{X} \to \mathbb{R}$ are called regressors (e.g., $r_1(x) = 1$, $r_2(x) = x$, $r_3(x) = x^2$, ... → polynomial model)
◮ The most classical method to obtain a good value of $\theta$: least squares → minimize the sum of squared errors
$$J(\theta) = \sum_{i=1}^{n} \bigl(z_i - \hat f_n(x_i)\bigr)^2 = \sum_{i=1}^{n} \bigl(z_i - h(x_i; \theta)\bigr)^2$$
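In matrix form, introducing the $n \times l$ matrix $R$ with entries $R_{ij} = r_j(x_i)$ and $z = (z_1, \dots, z_n)^{\mathsf T}$ (notation added here for convenience), the least-squares estimate has the standard closed form
$$\hat\theta = \operatorname*{arg\,min}_{\theta \in \mathbb{R}^l} \|z - R\theta\|_2^2 = (R^{\mathsf T} R)^{-1} R^{\mathsf T} z,$$
provided $R^{\mathsf T} R$ is invertible (which requires $l \le n$).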
A simple curve fitting problem
◮ Linear fit
[Figure: linear least-squares fit of the $n = 8$ data points.]
A simple curve fitting problem
◮ Quadratic fit
[Figure: quadratic least-squares fit of the $n = 8$ data points.]
A simple curve fitting problem
◮ Poor fit!
◮ Why? The capacity of the model is too weak
◮ Now, as an example, consider the model $\hat f_n(x) = \theta^{\mathsf T} r(x)$ with
$$r(x) = \bigl(1,\, \cos(2\pi x),\, \sin(2\pi x),\, \dots,\, \cos(2m\pi x),\, \sin(2m\pi x)\bigr)^{\mathsf T} \in \mathbb{R}^{2m+1}$$
(a truncated Fourier series)
◮ The increase in the number of parameters yields an ill-defined problem ($l \gg n$)
A simple curve fitting problem
◮ When the problem becomes ill-defined (as capacity increases), a classical solution for finding a good value of $\theta$ is to minimize the sum of an approximation-error term and a regularization term:
$$J(\theta) = \sum_{i=1}^{n} \bigl(z_i - \theta^{\mathsf T} r(x_i)\bigr)^2 + C\, \|\theta\|_2^2, \qquad C > 0$$
◮ $\|\theta\|_2^2$ penalizes vectors $\theta$ with large elements
◮ $C$ strikes a balance between regularization and data fidelity
◮ This approach is known as Tikhonov regularization (Tikhonov & Arsenin, 1977) → at the basis of numerous approximation methods (ridge regression, splines, RBF, SVM, ...)
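As a minimal sketch (not the authors' code; the helper names and the made-up data set are illustrative only), the Tikhonov/ridge estimate for the truncated Fourier model can be computed as follows:

```python
import numpy as np

def fourier_features(x, m):
    """Regressors r(x) = (1, cos(2*pi*x), sin(2*pi*x), ..., cos(2m*pi*x), sin(2m*pi*x))."""
    cols = [np.ones_like(x)]
    for k in range(1, m + 1):
        cols += [np.cos(2 * np.pi * k * x), np.sin(2 * np.pi * k * x)]
    return np.column_stack(cols)               # shape (n, 2m+1)

def tikhonov_fit(x, z, m, C):
    """Minimize sum_i (z_i - theta^T r(x_i))^2 + C * ||theta||_2^2."""
    R = fourier_features(x, m)
    A = R.T @ R + C * np.eye(R.shape[1])
    return np.linalg.solve(A, R.T @ z)

# made-up data for illustration (the original data set is not reproduced here)
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 8))
z = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(8)
theta = tikhonov_fit(x, z, m=50, C=1e-8)       # parameters used in the next slide
fhat = fourier_features(np.linspace(0, 1, 200), 50) @ theta
```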
A simple curve fitting problem
◮ $n = 8$, $m = 50$, $l = 101$, $C = 10^{-8}$
[Figure: regularized Fourier fit of the $n = 8$ data points.]
A simple curve fitting problem
◮ The regularization principle alone is not enough to obtain a good approximation
◮ As modeling capacity increases, overfitting may arise
A simple curve fitting problem
◮ To avoid overfitting, we should try a regularization that penalizes high frequencies more
◮ For instance, take
$$\|\theta\|^2 = \theta_1^2 + \sum_{k=1}^{m} \bigl(\theta_{2k}^2 + \theta_{2k+1}^2\bigr)\bigl(1 + (2k\pi)^{\alpha}\bigr)^2$$
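This weighted penalty is still quadratic in $\theta$, so the minimizer keeps a closed form: with $W$ the diagonal matrix holding the weights above ($W_{11} = 1$ and entries $(1 + (2k\pi)^{\alpha})^2$ for the harmonic coefficients), and $R$, $z$ as before,
$$J(\theta) = \|z - R\theta\|_2^2 + C\, \theta^{\mathsf T} W \theta \quad \Longrightarrow \quad \hat\theta = (R^{\mathsf T} R + C\, W)^{-1} R^{\mathsf T} z.$$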
A simple curve fitting problem
◮ $n = 8$, $m = 50$, $l = 101$, $C = 10^{-8}$, $\alpha = 1.3$
[Figure: fit with frequency-weighted regularization.]
A simple curve fitting problem
◮ From this example, we see that the construction of a regularization scheme should result from a procedure that takes into account the data (using cross-validation, for instance) and/or prior knowledge (for instance, that high-frequency components should be much smaller than low-frequency ones)
Black-box modeling
◮ A large number of methods are available in the literature: polynomial regression, splines, NN, RBF, ...
◮ All methods are based on mixing prior information and regularization principles
◮ Instances of "regularization" in regression:
◮ t-tests, F-tests, ANOVA, AIC (Akaike information criterion), ... in linear regression
◮ early stopping in NN
◮ regularized reproducing-kernel regression:
◮ 1960s: splines (Schoenberg 1964, Duchon 1976–1979)
◮ 1970s: ridge regression (Hoerl & Kennard)
◮ 1980s: RBF (Micchelli 1986, Powell 1987)
◮ 1995: SVM (Vapnik 1995)
◮ 1997: SVR (Smola 1997) & semi-parametric SVR (Smola 1999)
Black-box modeling
◮ How to choose a regularization scheme?
◮ The Bayesian setting is a principled approach that makes it possible to construct regularized regressions
1.3 Bayesian approach
Why a Bayesian approach?
◮ Objective: infer properties about $f : \mathbb{X} \to \mathbb{R}$ through pointwise evaluations
◮ Why a Bayesian approach?
◮ a principled approach to choose a regularization scheme according to prior information
◮ through probability calculus and/or Monte Carlo simulations, the user can infer properties about the unknown function
◮ for instance: given prior knowledge and $D_n$, what is the probability that the global maximum of $f$ is greater than a given threshold $u \in \mathbb{R}$?
◮ Main idea: use a statistical model of the observations, together with a probability model for the parameter of the statistical model
Some reminders about probabilities
◮ Recall that a random variable is a function that maps a sample space $\Omega$ to an outcome space $E$ (e.g. $E = \mathbb{R}$), and that assigns probabilities (weights) to possible outcomes
Def.
Formally, let $(\Omega, \mathcal{A}, P)$ be a probability space, and $(E, \mathcal{E})$ be a measurable outcome space
→ a random variable $X$ is a measurable function $(\Omega, \mathcal{A}, P) \to (E, \mathcal{E})$
◮ $X$ is used to assign probabilities to events: for instance, $P(X \in [0, 1]) = P_X([0, 1]) = 1/2$
◮ Case of a random variable with a density:
$$P(X \in [a, b]) = P_X([a, b]) = \int_a^b p_X(x)\, dx$$
Gaussian random variables
◮ A real-valued random variable $Z$ is said to be Gaussian $\mathcal{N}(\mu, \sigma^2)$ if it has the continuous probability density function
$$g_{\mu,\sigma^2}(z) = \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\Bigl(-\frac{1}{2}\, \frac{(z - \mu)^2}{\sigma^2}\Bigr)$$
[Figure: the density $g_{\mu,\sigma^2}$, centered at $\mu$, with spread $\sigma$.]
Gaussian random variables
◮ The mean of $Z$ (also called expectation or first-order moment) is
$$\mathbb{E}(Z) = \int_{\mathbb{R}} z\, g_{\mu,\sigma^2}(z)\, dz = \mu,$$
and its second-order moment is defined as
$$\mathbb{E}(Z^2) = \int_{\mathbb{R}} z^2\, g_{\mu,\sigma^2}(z)\, dz = \sigma^2 + \mu^2$$
◮ The variance of $Z$ is defined as
$$\operatorname{var}(Z) = \mathbb{E}\bigl[(Z - \mathbb{E}(Z))^2\bigr] = \mathbb{E}(Z^2) - \mathbb{E}(Z)^2 = \sigma^2$$
Gaussian random variables
◮ A Gaussian variable $Z$ can be used as a stochastic model of some uncertain real-valued quantity
◮ In other words, $Z$ can be thought of as a prior about some uncertain quantity of interest
Gaussian random variables
◮ Using a random generator, it is possible to "generate" sample values $z_1, z_2, \dots$ of our model $Z$ → possible values for our uncertain quantity of interest
[Figure: histograms of $n = 100$, $1000$, and $10000$ sample values, approaching the Gaussian density.]
(as $n \to \infty$, the empirical distribution of the realizations tends to the distribution of $Z$)
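A minimal numpy sketch of this experiment ($\mu = 0$ and $\sigma = 1$ are arbitrary illustration values):

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma = 0.0, 1.0                      # arbitrary illustration values
for n in (100, 1000, 10000):
    z = rng.normal(mu, sigma, size=n)     # n sample values of Z ~ N(mu, sigma^2)
    hist, _ = np.histogram(z, bins=30, density=True)
    print(n, z.mean(), z.var())           # empirical moments approach mu and sigma^2
```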
Bayesian model
Formally, recall that a statistical model is a triplet $M = (\mathcal{Z}, \mathcal{F}, \mathcal{P})$, where
◮ $\mathcal{Z}$ → the observation space (typically, $\mathcal{Z} = \mathbb{R}^n$)
◮ $\mathcal{F}$ → a $\sigma$-algebra on $\mathcal{Z}$
◮ $\mathcal{P}$ → a parametric family $\{P_\theta;\, \theta \in \Theta\}$ of probability distributions on $(\mathcal{Z}, \mathcal{F})$
Def.
A Bayesian model is defined by the specification of
◮ a parametric statistical model $M$ (model of the observations)
◮ a prior probability distribution $\Pi$ on $(\Theta, \Xi)$ → a probability model that describes the uncertainty about $\theta$ before any observation is made
Example
◮ Suppose we repeat measurements of a quantity of interest: $z_1, z_2, \ldots \in \mathbb{R}$
◮ Model of the observations: $Z_i \overset{\text{iid}}{\sim} \mathcal{N}(\theta_1, \theta_2)$, $i = 1, \dots, n$
◮ The statistical model can formally be written as the triplet
$$M = \Bigl(\mathbb{R}^n,\; \mathcal{B}(\mathbb{R}^n),\; \bigl\{\mathcal{N}(\theta_1, \theta_2)^{\otimes n}\bigr\}_{\theta_1, \theta_2}\Bigr)$$
◮ Moreover, if we assume a prior distribution for $\theta_1$ and $\theta_2$ (e.g. $\theta_1 \sim \mathcal{N}(1, 1)$, $\theta_2 \sim \mathcal{IG}(3, 2)$), we obtain a Bayesian model
◮ From this Bayesian model (model of observations + prior), we can compute the posterior distribution of $(\theta_1, \theta_2)$ given $Z_1, \dots, Z_n$ (will be explained later)
Example
$n = 50$, $\theta_1 = 1.2$, $\theta_2 = 0.8$
[Figure: histogram of a sample of size $n = 50$; contour plots of the prior and posterior densities of $(\theta_1, \theta_2)$.]
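The priors in this example are conditionally conjugate, so one concrete way to compute the posterior is a basic Gibbs sampler. The following is a hedged sketch (not the authors' code), using the standard full conditionals of the Gaussian model with unknown mean and variance:

```python
import numpy as np

rng = np.random.default_rng(1)

# synthetic data, as in the example: n = 50, theta1 = 1.2, theta2 = 0.8
n, theta1_true, theta2_true = 50, 1.2, 0.8
z = rng.normal(theta1_true, np.sqrt(theta2_true), size=n)

# priors: theta1 ~ N(mu0, tau02), theta2 ~ IG(a, b)
mu0, tau02, a, b = 1.0, 1.0, 3.0, 2.0

theta1, theta2 = z.mean(), z.var()        # initial values
samples = []
for _ in range(5000):
    # theta1 | theta2, z ~ N(m, v)  (conjugate Gaussian update)
    v = 1.0 / (1.0 / tau02 + n / theta2)
    m = v * (mu0 / tau02 + z.sum() / theta2)
    theta1 = rng.normal(m, np.sqrt(v))
    # theta2 | theta1, z ~ IG(a + n/2, b + sum((z_i - theta1)^2)/2)
    shape = a + n / 2.0
    rate = b + 0.5 * np.sum((z - theta1) ** 2)
    theta2 = 1.0 / rng.gamma(shape, 1.0 / rate)   # inverse-gamma draw via gamma
    samples.append((theta1, theta2))
```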
Simple curve fitting problem from a Bayesian approach
◮ Recall our simple curve fitting model $\hat f_n(x) = \theta^{\mathsf T} r(x)$ with
$$r(x) = \bigl(1,\, \cos(2\pi x),\, \sin(2\pi x),\, \dots,\, \cos(2m\pi x),\, \sin(2m\pi x)\bigr)^{\mathsf T} \in \mathbb{R}^{2m+1}$$
◮ Bayesian model?
◮ Assume the following statistical model for the observations:
$$Z_i = \xi(x_i) + \varepsilon_i, \quad i = 1, \dots, n, \qquad \xi(x) = \theta^{\mathsf T} r(x), \quad x \in \mathbb{X}, \qquad \varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma_\varepsilon^2),$$
or equivalently, $Z_i \overset{\text{indep}}{\sim} \mathcal{N}(\theta^{\mathsf T} r(x_i), \sigma_\varepsilon^2)$, $i = 1, \dots, n$
◮ Moreover, choose a prior distribution for $\theta$:
$$\theta_j \overset{\text{indep}}{\sim} \mathcal{N}(0, \sigma_{\theta_j}^2), \quad j = 1, \dots, 2m+1$$
◮ The random variables $Z_i$ constitute a Bayesian model of the observations
◮ $\xi$ is a random function / random process → a prior about $f$
Random process
Def.
A random process $\xi : \Omega \times \mathbb{X} \to \mathbb{R}$ is a collection of random variables $\xi(\cdot, x) : \Omega \to \mathbb{R}$ indexed by $x \in \mathbb{X}$
◮ Random processes can be viewed as a generalization of random vectors
◮ For a fixed $\omega \in \Omega$, the function $\xi(\omega, \cdot) : \mathbb{X} \to \mathbb{R}$ is called a sample path
◮ In our Bayesian setting, we can say that
◮ we use a random process $\xi$ as a stochastic model of the unknown function $f$
◮ $f$ is viewed as a sample path of $\xi$
◮ $\xi$ represents our knowledge about $f$ before any evaluation has been made
◮ the distribution $\Pi = P_\xi$ is a prior about $f$
[Figure: the prior $P_\xi$ as a probability distribution over the set of all real functions, of which the unknown function $f$ is one element.]
◮ Here, $\xi(\omega, \cdot) = \theta(\omega)^{\mathsf T} r(\cdot)$
◮ Fixing $\omega$ (a sample path) amounts to "choosing" a value for the random vector $\theta$
◮ Example of sample paths with
$$\theta_1 \sim \mathcal{N}(0, 1), \qquad \theta_{2k},\, \theta_{2k+1} \overset{\text{indep}}{\sim} \mathcal{N}\Bigl(0,\; \frac{1}{1 + (\omega_0 k)^{\alpha}}\Bigr), \quad k = 1, \dots, m,$$
with $\omega_0 = 2\pi/10$, $\alpha = 4$
[Figure: sample paths on $[0, 1]$.]
◮ Example of sample paths with the same prior, now with $\omega_0 = 2\pi/50$, $\alpha = 4$
[Figure: sample paths on $[0, 1]$.]
◮ Example of sample paths with the same prior, now with $\omega_0 = 2\pi/10$, $\alpha = 1.5$
[Figure: sample paths on $[0, 1]$.]
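As a sketch (not the authors' code), such sample paths can be generated by drawing $\theta$ from its prior and evaluating $\theta^{\mathsf T} r(x)$ on a grid; the grid size and the number of paths are arbitrary choices:

```python
import numpy as np

def sample_paths(m=50, omega0=2 * np.pi / 10, alpha=4.0, n_paths=5, rng=None):
    """Sample paths of xi(x) = theta^T r(x), with theta drawn from the Gaussian prior."""
    rng = rng or np.random.default_rng()
    x = np.linspace(0.0, 1.0, 400)
    regressors = [np.ones_like(x)]          # r_1(x) = 1
    stdevs = [1.0]                          # std. dev. of theta_1
    for k in range(1, m + 1):
        regressors += [np.cos(2 * np.pi * k * x), np.sin(2 * np.pi * k * x)]
        s = np.sqrt(1.0 / (1.0 + (omega0 * k) ** alpha))
        stdevs += [s, s]                    # std. dev. of theta_2k and theta_2k+1
    R = np.stack(regressors)                # shape (2m+1, 400)
    theta = rng.standard_normal((n_paths, 2 * m + 1)) * np.array(stdevs)
    return x, theta @ R                     # each row is one sample path

x, paths = sample_paths()                   # omega0 = 2*pi/10, alpha = 4, as above
```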
Bayesian approach
◮ The choice of a prior in a Bayesian approach reflects the user's knowledge about the uncertain parameters
◮ In the case of function approximation → the regularity of the function
Bayesian approach
◮ Where shall we go now?
◮ Objective: compute posterior distributions from data
[Figure, repeated over several slides: data points and function values in the $(x, z)$ plane.]
A “simplification” of the Bayesian model for the simple curve fitting problem
◮ $\xi = \theta^{\mathsf T} r$ (our prior about $f$) is a Gaussian process
◮ Why?
Gaussian random vectors
Def.
A real-valued random vector $Z = (Z_1, \dots, Z_d) \in \mathbb{R}^d$ is said to be Gaussian iff any linear combination of its components, $\sum_{i=1}^{d} a_i Z_i$ with $a_1, \dots, a_d \in \mathbb{R}$, is a Gaussian variable
◮ A Gaussian random vector $Z$ is characterized by its mean vector $\mu = (\mathbb{E}[Z_1], \dots, \mathbb{E}[Z_d]) \in \mathbb{R}^d$ and the covariances of the pairs of components $(Z_i, Z_j)$, $i, j \in \{1, \dots, d\}$:
$$\operatorname{cov}(Z_i, Z_j) = \mathbb{E}\bigl[(Z_i - \mathbb{E}(Z_i))(Z_j - \mathbb{E}(Z_j))\bigr]$$
◮ If $Z \in \mathbb{R}^d$ is a Gaussian vector with mean $\mu \in \mathbb{R}^d$ and covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$, we shall write $Z \sim \mathcal{N}(\mu, \Sigma)$
◮ Exercise: let $Z \sim \mathcal{N}(\mu, \Sigma)$. Determine $\mathbb{E}\bigl(\sum_{i=1}^{d} a_i Z_i\bigr)$ and $\operatorname{var}\bigl(\sum_{i=1}^{d} a_i Z_i\bigr)$
◮ The correlation coefficient of two components $Z_i$ and $Z_j$ of $Z$ is defined by
$$\rho(Z_i, Z_j) = \frac{\operatorname{cov}(Z_i, Z_j)}{\sqrt{\operatorname{var}(Z_i)\, \operatorname{var}(Z_j)}} \in [-1, 1];$$
it measures the similarity between $Z_i$ and $Z_j$
Gaussian random vectors: correlation
[Figure: scatter plots of samples from a bivariate Gaussian, for $\rho = 0$ and $\rho = 0.8$.]
Gaussian random processes
◮ Recall that a random process is a set $\xi = \{\xi(x),\, x \in \mathbb{X}\}$ of random variables indexed by the elements of $\mathbb{X}$
◮ Gaussian random process → generalization of a Gaussian random vector
Def.
$\xi$ is a Gaussian random process iff, $\forall n \in \mathbb{N}$, $\forall x_1, \dots, x_n \in \mathbb{X}$, and $\forall a_1, \dots, a_n \in \mathbb{R}$, the real-valued random variable $\sum_{i=1}^{n} a_i\, \xi(x_i)$ is Gaussian
Application
◮ If
$$\xi(x) = \theta^{\mathsf T} r(x), \quad x \in \mathbb{X}, \qquad \theta_j \overset{\text{indep}}{\sim} \mathcal{N}(0, \sigma_{\theta_j}^2), \quad j = 1, \dots, 2m+1,$$
then, $\forall x_1, \dots, x_n \in \mathbb{X}$ and $\forall a_1, \dots, a_n \in \mathbb{R}$,
$$\sum_{i=1}^{n} a_i\, \xi(x_i) = \sum_i a_i \sum_j \theta_j\, r_j(x_i) = \sum_j \Bigl(\sum_i a_i\, r_j(x_i)\Bigr) \theta_j \;\sim\; \mathcal{N}\Bigl(0,\; \sum_j \Bigl(\sum_i a_i\, r_j(x_i)\Bigr)^{\!2} \sigma_{\theta_j}^2\Bigr)$$
◮ Thus, $\xi = \theta^{\mathsf T} r$ is a Gaussian process
Gaussian random processes
◮ A Gaussian process is characterized by
◮ its mean function $m : x \in \mathbb{X} \mapsto \mathbb{E}[\xi(x)]$
◮ and its covariance function $k : (x, y) \in \mathbb{X}^2 \mapsto \operatorname{cov}(\xi(x), \xi(y))$
◮ Notation: $\xi \sim \mathrm{GP}(m, k)$
◮ Exercise: determine $\mathbb{E}\bigl(\sum_{i=1}^{n} a_i\, \xi(x_i)\bigr)$ and $\operatorname{var}\bigl(\sum_{i=1}^{n} a_i\, \xi(x_i)\bigr)$
◮ What is the distribution of $\sum_i a_i\, \xi(x_i)$?
→ The distribution of a linear combination of a Gaussian process $\mathrm{GP}(m, k)$ can be simply obtained as a function of $m$ and $k$
Application
◮ If
$$\xi(x) = \theta^{\mathsf T} r(x), \quad x \in \mathbb{X}, \qquad \theta_j \overset{\text{indep}}{\sim} \mathcal{N}(0, \sigma_{\theta_j}^2), \quad j = 1, \dots, 2m+1,$$
then $\xi \sim \mathrm{GP}(0, k)$ with
$$k : (x, y) \mapsto \sum_j \sigma_{\theta_j}^2\, r_j(x)\, r_j(y)$$
◮ Covariance function corresponding to
$$\theta_1 \sim \mathcal{N}(0, 1), \qquad \theta_{2k},\, \theta_{2k+1} \overset{\text{indep}}{\sim} \mathcal{N}\Bigl(0,\; \frac{1}{1 + (\omega_0 k)^{\alpha}}\Bigr), \quad k = 1, \dots, m,$$
with $\omega_0 = 2\pi/10$, $\alpha = 4$
[Figure: graph of the covariance function $k$.]
◮ Covariance function corresponding to the same prior with $\omega_0 = 2\pi/50$, $\alpha = 4$
[Figure: graph of the covariance function $k$.]
The covariance of a Gaussian random process
◮ Main properties: a covariance function $k$ is
◮ symmetric: $\forall x, y \in \mathbb{X}$, $k(x, y) = k(y, x)$
◮ positive: $\forall n \in \mathbb{N}$, $\forall x_1, \dots, x_n \in \mathbb{X}$, $\forall a_1, \dots, a_n \in \mathbb{R}$,
$$\sum_{i,j=1}^{n} a_i\, k(x_i, x_j)\, a_j \ge 0$$
◮ In the following, we shall assume that the covariance of $\xi \sim \mathrm{GP}(m, k)$ is invariant under translations, or stationary:
$$k(x + h, y + h) = k(x, y), \quad \forall x, y, h \in \mathbb{X}$$
◮ When $k$ is stationary, there exists a stationary covariance $k_{\mathrm{sta}} : \mathbb{R}^d \to \mathbb{R}$ such that
$$\operatorname{cov}(\xi(x), \xi(y)) = k(x, y) = k_{\mathrm{sta}}(x - y)$$
(below, by a slight abuse of notation, $k_{\mathrm{sta}}$ is also written $k$)
Stationary covariances
◮ When $k$ is stationary, the variance $\operatorname{var}(\xi(x)) = \operatorname{cov}(\xi(x), \xi(x)) = k(0)$ does not depend on $x$
◮ The covariance function can then be written as
$$k(x - y) = \sigma^2 \rho(x - y),$$
with $\sigma^2 = \operatorname{var}(\xi(x))$, and where $\rho$ is the correlation function of $\xi$
Stationary covariances
◮ The graph of the correlation function has a symmetric "bell curve" shape
[Figure: a stationary correlation function, equal to 1 at the origin and decaying over a correlation range.]
◮ We have, by the Cauchy–Schwarz inequality, $\forall h \in \mathbb{X}$,
$$|k(h)| = |\operatorname{cov}(\xi(x), \xi(x + h))| = \bigl|\mathbb{E}\bigl[(\xi(x) - m(x))(\xi(x + h) - m(x + h))\bigr]\bigr| \le \mathbb{E}\bigl[(\xi(x) - m(x))^2\bigr]^{1/2}\, \mathbb{E}\bigl[(\xi(x + h) - m(x + h))^2\bigr]^{1/2} = k(0)^{1/2}\, k(0)^{1/2} = k(0)$$
◮ Recall Bochner's spectral representation theorem:
Theorem
A real function $k(h)$, $h \in \mathbb{R}^d$, is symmetric positive iff it is the Fourier transform of a finite positive measure, i.e.,
$$k(h) = \int_{\mathbb{R}^d} e^{\imath \langle u, h\rangle}\, \mu(du),$$
where $\mu$ is a finite positive measure on $\mathbb{R}^d$.
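For instance (a standard example, not from the slides), in dimension $d = 1$ the squared-exponential covariance is the Fourier transform of a Gaussian spectral measure:
$$\sigma^2 e^{-h^2/(2\rho^2)} = \int_{\mathbb{R}} e^{\imath u h}\, \frac{\sigma^2 \rho}{\sqrt{2\pi}}\, e^{-\rho^2 u^2/2}\, du.$$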
Gaussian process simulation
◮ Using a random generator, it is possible to "generate" sample paths $f_1, f_2, \dots$ of a Gaussian process $\xi$
[Figure: sample paths of Gaussian processes.]
Gaussian process simulation
How to simulate sample paths of a zero-mean Gaussian random process?
◮ Choose a set of points $x_1, \dots, x_n \in \mathbb{X}$
◮ Denote by $K$ the $n \times n$ covariance matrix of the random vector $\xi = (\xi(x_1), \dots, \xi(x_n))^{\mathsf T}$ (NB: $\xi \sim \mathcal{N}(0, K)$)
◮ Consider the Cholesky factorization of $K$,
$$K = CC^{\mathsf T},$$
with $C$ a lower triangular matrix (such a factorization exists since $K$ is a symmetric positive definite matrix)
◮ Let $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)^{\mathsf T}$ be a Gaussian vector with $\varepsilon_i$ i.i.d. $\sim \mathcal{N}(0, 1)$
◮ Then $C\varepsilon \sim \mathcal{N}(0, K)$; see the sketch below
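A minimal numpy sketch of this procedure (the squared-exponential covariance is an arbitrary choice of $k$; a small diagonal jitter keeps $K$ numerically positive definite):

```python
import numpy as np

def sq_exp(x, y, sigma2=1.0, rho=0.2):
    """Squared-exponential covariance (an arbitrary choice of k)."""
    return sigma2 * np.exp(-0.5 * ((x - y) / rho) ** 2)

x = np.linspace(0.0, 1.0, 200)                 # simulation points x_1, ..., x_n
K = sq_exp(x[:, None], x[None, :])             # n x n covariance matrix of xi
C = np.linalg.cholesky(K + 1e-10 * np.eye(len(x)))           # K = C C^T (with jitter)
eps = np.random.default_rng().standard_normal((len(x), 3))   # iid N(0, 1) draws
paths = C @ eps                                # three sample paths ~ N(0, K)
```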
“Simplification” of the Bayesian model for the simple curve fitting problem
◮ Instead of choosing the model
$$\xi(x) = \theta^{\mathsf T} r(x), \quad x \in \mathbb{X}, \qquad \theta_j \overset{\text{indep}}{\sim} \mathcal{N}(0, \sigma_{\theta_j}^2), \quad j = 1, \dots, 2m+1,$$
simply choose a covariance function $k$ and assume $\xi \sim \mathrm{GP}(0, k)$
◮ More details about how to choose $k$ will be given below
Posterior
◮ How to compute a posterior distribution, given the Gaussian-process prior ξ and data?
[Figure: evaluation results $(x_i, z_i)$ in the $(x, z)$ plane.]
Conditional distributions
◮ Let X be a random variable modeling an unknown quantity of interest
◮ Assume we observe a random variable $T$, or a random vector $T = (T_1, \dots, T_n)$
◮ Provided that $T$ and $X$ are not independent, $T$ contains information about $X$
◮ Aim: define a notion of distribution of X “knowing” T
Conditional probabilities
Recall the following
Def.
Let $(\Omega, \mathcal{A}, P)$ be a probability space. Given two events $A, B \in \mathcal{A}$ such that $P(B) \ne 0$, define the probability of $A$ given $B$ (or conditional on $B$) by
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
The notion of conditional density
Def.
Assume that the pair $(X, T) \in \mathbb{R}^2$ has a density $p_{(X,T)}$. Define the conditional density of $X$ given the event $T = t$ by
$$p_{X|T}(x \mid t) = \begin{cases} \dfrac{p_{(X,T)}(x, t)}{p_T(t)} = \dfrac{p_{(X,T)}(x, t)}{\int p_{(X,T)}(x, t)\, dx} & \text{if } p_T(t) > 0, \\[1ex] \text{an arbitrary density} & \text{if } p_T(t) = 0. \end{cases}$$
Application
◮ Given a Gaussian random vector $Z = (Z_1, Z_2) \sim \mathcal{N}(0, \Sigma)$, what is the distribution of $Z_2$ "knowing" $Z_1$?
◮ Define
$$p_{Z_2|Z_1}(z_2 \mid z_1) = \frac{p_{(Z_1,Z_2)}(z_1, z_2)}{p_{Z_1}(z_1)} = \frac{\sigma_1}{(2\pi)^{1/2} (\det \Sigma)^{1/2}}\, \exp\Bigl(-\tfrac{1}{2}\bigl(z^{\mathsf T} \Sigma^{-1} z - z_1^2/\sigma_1^2\bigr)\Bigr)$$
→ a Gaussian distribution!
[Figure: the conditional density of $Z_2$ given $Z_1$.]
◮ The random variable denoted by $Z_2 \mid Z_1$, with density $p_{Z_2|Z_1}(\cdot \mid Z_1)$, represents the residual uncertainty about $Z_2$ when $Z_1$ has been observed
◮ High correlation → small residual uncertainty
Conditional mean/expectation
A fundamental notion: conditional expectation
Def.
Assume that the pair $(X, T)$ has a density $p_{(X,T)}$. Define the conditional mean of $X$ given $T = t$ as
$$\mathbb{E}(X \mid T = t) \stackrel{\Delta}{=} \int_{\mathbb{R}} x\, p_{X|T}(x \mid t)\, dx = h(t)$$
Def.
The random variable $\mathbb{E}(X \mid T) = h(T)$ is called the conditional expectation of $X$ given $T$
Conditional expectation
◮ Why is conditional expectation a fundamental notion?
◮ We have the following
Theorem
Under the previous assumptions, the solution of the problem
$$\hat X = \operatorname*{arg\,min}_{Y} \mathbb{E}\bigl[(X - Y)^2\bigr],$$
where the minimum is taken over all functions of $T$, is given by $\hat X = \mathbb{E}(X \mid T)$
◮ In other words, $\mathbb{E}(X \mid T)$ is the best approximation (in the quadratic-mean sense) of $X$ by a function of $T$
Important properties of conditional expectation
(1) $\mathbb{E}(X \mid T)$ is a random variable depending on $T$ → there exists a function $h$ such that $\mathbb{E}(X \mid T) = h(T)$
(2) The operator $\pi_H : X \mapsto \mathbb{E}(X \mid T)$ is a (linear) operator of orthogonal projection onto the space of all functions of $T$ (for the inner product $X, Y \mapsto (X, Y) = \mathbb{E}(XY)$)
(3) Let $X, Y, T \in L^2$. Then
i) $\forall \alpha \in \mathbb{R}$, $\mathbb{E}(\alpha X + Y \mid T) = \alpha\, \mathbb{E}(X \mid T) + \mathbb{E}(Y \mid T)$ a.s.
ii) $\mathbb{E}(\mathbb{E}[X \mid T]) = \mathbb{E}(X)$
iii) if $X \perp\!\!\!\perp T$, then $\mathbb{E}(X \mid T) = \mathbb{E}(X)$
Conditional expectation: Gaussian random variables
◮ Recall that the space of second-order random variables $L^2(\Omega, \mathcal{A}, P)$, endowed with the inner product $X, Y \mapsto (X, Y) = \mathbb{E}(XY)$, is a Hilbert space
◮ Gaussian linear space
Def.
A linear subspace $G$ of $L^2(\Omega, \mathcal{A}, P)$ is Gaussian iff $\forall X_1, \dots, X_n \in G$ and $\forall a_1, \dots, a_n \in \mathbb{R}$, the random variable $\sum_i a_i X_i$ is Gaussian
◮ In what follows, assume that $G$ is centered, i.e., each element of $G$ is a zero-mean random variable
Theorem (Projection theorem in centered Gaussian spaces)
Let $G$ be a centered Gaussian space and let $X, T_1, \dots, T_n \in G$. Then $\mathbb{E}(X \mid T_1, \dots, T_n)$ is the orthogonal projection (in $L^2$) of $X$ onto $\mathcal{T} = \operatorname{span}\{T_1, \dots, T_n\}$.
Proof. Let $\hat X \in G$ be the orthogonal projection of $X$ onto $\mathcal{T}$.
◮ We have $X = \hat X + \varepsilon$, where $\varepsilon \in G$ is orthogonal to $\mathcal{T}$. In $G$, orthogonality ⇔ independence; thus $\varepsilon \perp\!\!\!\perp T_i$, $i = 1, \dots, n$.
◮ Then
$$\mathbb{E}(X \mid T_1, \dots, T_n) = \mathbb{E}(\hat X \mid T_1, \dots, T_n) + \mathbb{E}(\varepsilon \mid T_1, \dots, T_n) = \hat X + \mathbb{E}(\varepsilon) = \hat X.$$
The result follows.
Application
Let $Z = (Z_1, Z_2)$ be a zero-mean Gaussian random vector with covariance matrix
$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{1,2} \\ \sigma_{2,1} & \sigma_2^2 \end{pmatrix}.$$
Then $\mathbb{E}(Z_1 \mid Z_2)$ is the orthogonal projection of $Z_1$ onto $Z_2$. Thus $\mathbb{E}(Z_1 \mid Z_2) = \lambda Z_2$, with
$$(Z_1 - \lambda Z_2,\, Z_2) = (Z_1, Z_2) - \lambda\, (Z_2, Z_2) = 0.$$
Hence,
$$\lambda = \frac{\sigma_{1,2}}{\sigma_2^2}.$$
Application
◮ Let $Z = (Z_1, Z_2) \sim \mathcal{N}(0, \Sigma)$ as above → recall that the conditional distribution of $Z_2$ given $Z_1$ is a Gaussian distribution
◮ Hence $Z_2 \mid Z_1 \sim \mathcal{N}(\mu(Z_1), \sigma(Z_1)^2)$ → $\mu$? $\sigma$?
◮ $\mu = \mathbb{E}(Z_2 \mid Z_1) = \lambda Z_1$, with $\lambda = \sigma_{1,2}/\sigma_1^2$
◮ Using the properties of the orthogonal projection ($Z_2 - \lambda Z_1$ is independent of $Z_1$):
$$\sigma^2 = \mathbb{E}\bigl[(Z_2 - \mathbb{E}(Z_2 \mid Z_1))^2 \mid Z_1\bigr] = \mathbb{E}\bigl[(Z_2 - \lambda Z_1)^2\bigr] = \mathbb{E}\bigl[(Z_2 - \lambda Z_1)\, Z_2\bigr] = \sigma_2^2 - \lambda\, \sigma_{1,2} = \sigma_2^2 - \frac{\sigma_{1,2}^2}{\sigma_1^2}$$
Generalization
Exercise: Let $(Z_0, Z_1, \dots, Z_n)$ be a centered Gaussian vector. Determine $\mathbb{E}(Z_0 \mid Z_1, \dots, Z_n)$.
Application: computation of the posterior distribution of a GP
◮ Let $\xi \sim \mathrm{GP}(0, k)$
◮ Assume we observe $\xi$ at $x_1, \dots, x_n \in \mathbb{X}$
◮ Given $x_0 \in \mathbb{X}$, what is the conditional distribution (the posterior distribution, in our Bayesian framework) of $\xi(x_0) \mid \xi(x_1), \dots, \xi(x_n)$?
◮ More generally, what is the posterior distribution of the random process $\xi(\cdot) \mid \xi(x_1), \dots, \xi(x_n)$?
Computation of the posterior distribution of a GP
Prop.
Let $\xi \sim \mathrm{GP}(0, k)$. The random process $\xi$ conditioned on $\mathcal{F}_n = \{\xi(x_1), \dots, \xi(x_n)\}$, denoted by $\xi \mid \mathcal{F}_n$, is a Gaussian process with
– mean $\hat\xi_n : x \mapsto \mathbb{E}(\xi(x) \mid \xi(x_1), \dots, \xi(x_n))$
– covariance $k_n : (x, y) \mapsto \mathbb{E}\bigl[(\xi(x) - \hat\xi_n(x))(\xi(y) - \hat\xi_n(y))\bigr]$
Computation of the posterior distribution of a GP
◮ By the property of conditional expectation in Gaussian spaces, for all $x \in \mathbb{X}$, $\hat\xi_n(x)$ is a linear combination of $\xi(x_1), \dots, \xi(x_n)$:
$$\hat\xi_n(x) := \sum_{i=1}^{n} \lambda_i(x)\, \xi(x_i)$$
◮ Moreover, the posterior mean $\hat\xi_n(x)$ is the orthogonal projection of $\xi(x)$ onto $\operatorname{span}\{\xi(x_i),\, i = 1, \dots, n\}$, so that
◮ $\hat\xi_n(x) = \operatorname*{arg\,min}_{Y} \mathbb{E}\bigl[(\xi(x) - Y)^2\bigr]$ → the variance of the prediction error is minimal
◮ $\mathbb{E}(\hat\xi_n(x)) = \mathbb{E}[\mathbb{E}(\xi(x) \mid \mathcal{F}_n)] = \mathbb{E}(\xi(x)) = 0$ → unbiased estimation
◮ $\hat\xi_n(x)$ is the best linear predictor (BLP) of $\xi(x)$ from $\xi(x_1), \dots, \xi(x_n)$, also called the kriging predictor of $\xi(x)$
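To make the construction concrete, here is a hedged numpy sketch (not the authors' code; the covariance and the stand-in test function are arbitrary choices) that computes the kriging weights $\lambda_i(x)$, the posterior mean and the posterior variance by solving $K\lambda(x) = (k(x, x_i))_i$:

```python
import numpy as np

def k(x, y, sigma2=1.0, rho=0.2):
    """Stationary squared-exponential covariance (an arbitrary choice)."""
    return sigma2 * np.exp(-0.5 * ((x - y) / rho) ** 2)

xn = np.array([0.1, 0.35, 0.6, 0.9])        # design points x_1, ..., x_n
zn = np.sin(6 * xn)                         # stand-in for the observed f(x_i)

xt = np.linspace(0.0, 1.0, 100)             # prediction points x
K = k(xn[:, None], xn[None, :])             # n x n covariance of the observations
kx = k(xt[:, None], xn[None, :])            # cross-covariances k(x, x_i)

lam = np.linalg.solve(K, kx.T).T            # kriging weights lambda_i(x)
mean = lam @ zn                             # posterior mean xi_hat_n(x)
var = k(xt, xt) - np.sum(lam * kx, axis=1)  # posterior variance k_n(x, x)
```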