
AN ANALYTIC COMPARISON OF REGULARIZATION METHODS FOR GAUSSIAN PROCESSES

Hossein Mohammadi 1,2, Rodolphe Le Riche 2,1, Eric Touboul 1,2, Xavier Bay 1,2, Nicolas Durrande 1,2

1 École Nationale Supérieure des Mines de Saint-Étienne (EMSE)
2 CNRS LIMOS, UMR 5168

ABSTRACT

The implementation of conditional Gaussian Processes (GPs), also known as kriging models, requires the inversion of a covariance matrix. In practice, e.g., when performing optimization with the EGO algorithm, this matrix is often ill-conditioned. The two most classical regularization methods to avoid degeneracy of the covariance matrix are i) adding a small positive constant to the diagonal (which we will refer to, with a slight abuse of language, as “nugget” regularization) [4] and ii) the pseudoinverse (PI) [1], in which singular values smaller than a threshold are zeroed.

This work first provides algebraic calculations which allow comparing PI and nugget regularizations with respect to their interpolation properties when the observed points tend towards each other, i.e., become redundant. The analysis is made possible by approximating ill-conditioned covariance matrices with neighboring, truly non-invertible ones. Second, a distribution-wise GP model is proposed that has improved interpolation properties at redundant points.

CONDITIONAL GAUSSIAN PROCESSES

Let $K(\cdot,\cdot)$ be a given kernel, $X = (x^1, \ldots, x^n)$ be the $n$ data points where the samples are taken and $y = (y_1, \ldots, y_n)^\top$ be the corresponding response values. The conditional mean and variance of the centered GP $(Y(x))_{x \in D}$ are [3]:

$m(x) = c(x)^\top C^{-1} y$ ,  (1)

$v(x) = c(x,x) - c(x)^\top C^{-1} c(x)$ ,  (2)

where $c(x) = \big(K(x, x^1), \ldots, K(x, x^n)\big)^\top$ is the vector of covariances between a new point $x$ and the $n$ already observed sample points. The $n \times n$ matrix $C$ is the covariance matrix between the data points. In many practical cases (e.g., points close to each other), $C$ is not (numerically) invertible.
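For concreteness, Equations (1) and (2) can be sketched in a few lines of NumPy. The squared-exponential kernel, the length-scale value and the toy data below are illustrative assumptions rather than anything specified in the poster; the point is only to compute (1)-(2) and to show how nearly coincident points drive the condition number of $C$ up.

```python
import numpy as np

def sq_exp(a, b, theta=0.5):
    """Squared-exponential kernel; theta is an arbitrary illustrative length-scale."""
    return np.exp(-np.subtract.outer(a, b) ** 2 / (2.0 * theta ** 2))

def gp_predict(X, y, x_new, kernel=sq_exp):
    """Kriging mean and variance of Equations (1)-(2) by direct inversion of C."""
    C = kernel(X, X)                      # n x n covariance matrix between data points
    c = kernel(x_new, X)                  # covariances between new points and data points
    C_inv = np.linalg.inv(C)              # fragile when C is ill-conditioned
    m = c @ C_inv @ y                                                             # Equation (1)
    v = kernel(x_new, x_new).diagonal() - np.einsum('ij,jk,ik->i', c, C_inv, c)   # Equation (2)
    return m, v

X = np.array([1.0, 1.5, 2.5, 3.0])
y = np.array([0.5, 1.2, -0.4, 0.8])
m, v = gp_predict(X, y, np.linspace(1.0, 3.0, 5))

# Two (numerically) coincident points make C singular and Equation (1) unusable as-is:
X_bad = np.array([1.0, 2.0, 2.0 + 1e-12, 3.0])
print(np.linalg.cond(sq_exp(X_bad, X_bad)))   # astronomically large condition number
```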

PSEUDOINVERSE (PI) REGULARIZATION

EIGEN ANALYSIS OF PI REGULARIZATION

The covariance matrix $C$ can be decomposed into $C = U \Sigma U^\top$, where $U$ has the eigenvectors of $C$ as columns, $U^\top U = U U^\top = I$, and $\Sigma$ is a diagonal matrix containing the eigenvalues of $C$, $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$. If $\mathrm{rank}(C) = r < n$, then the pseudoinverse of $C$ is expressed as

$C^{\dagger} = [U', U''] \begin{pmatrix} \mathrm{diag}(1/\lambda_i)_{r \times r} & 0_{r \times (n-r)} \\ 0_{(n-r) \times r} & 0_{(n-r) \times (n-r)} \end{pmatrix} [U', U'']^\top$ ,  (3)

in which $U'$ contains the $r$ first eigenvectors, associated to strictly positive eigenvalues, and $U''$ contains those associated to zero eigenvalues. Numerically, the PI threshold is set to $\tau = \lambda_1 / \kappa_{\max}$, where $\kappa_{\max}$ is the maximum allowed condition number of $C$. Using PI regularization, the kriging mean and variance are equal to

$m_{PI}(x) = c(x)^\top \sum_{i=1}^{r} \dfrac{U_i^\top y}{\lambda_i} U_i$ ,  (4)

$v_{PI}(x) = c(x,x) - \sum_{i=1}^{r} \dfrac{\big(U_i^\top c(x)\big)^2}{\lambda_i}$ .  (5)
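The eigen-truncation of Equations (3) to (5) translates almost directly into NumPy. The sketch below is illustrative and follows the same assumptions as the snippet above; the value of `kappa_max` is an arbitrary example, the poster only defines the threshold $\tau = \lambda_1/\kappa_{\max}$.

```python
import numpy as np

def pi_kriging(C, y, c_new, c_diag, kappa_max=1e8):
    """PI-regularized kriging mean and variance, Equations (4)-(5).
    Eigenvalues below tau = lambda_1 / kappa_max are treated as zero, as in Equation (3)."""
    lam, U = np.linalg.eigh(C)                 # ascending eigenvalues, orthonormal eigenvectors
    lam, U = lam[::-1], U[:, ::-1]             # reorder so that lambda_1 >= ... >= lambda_n
    tau = lam[0] / kappa_max                   # PI truncation threshold
    keep = lam > tau                           # the r retained "strictly positive" eigenvalues
    lam_r, U_r = lam[keep], U[:, keep]
    alpha = U_r @ ((U_r.T @ y) / lam_r)        # sum_i (U_i^T y / lambda_i) U_i
    m = c_new @ alpha                                        # Equation (4)
    proj = c_new @ U_r                                       # U_i^T c(x) for each retained i
    v = c_diag - np.sum(proj ** 2 / lam_r, axis=1)           # Equation (5)
    return m, v
```

Here `C` is the covariance matrix between the data points, `c_new` stacks the vectors $c(x)^\top$ row-wise for the prediction points, and `c_diag` holds the prior variances $c(x,x)$.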

PI Averaging Property: The PI kriging prediction at redundant points is the average of the outputs at those points.

Fig. 1: PI regularized kriging mean ±2 standard deviations (y versus x). The bullets are data points. Notice i) the averaging property of the kriging mean at redundant points (here x = 2) and ii) that the variance is zero at redundant points.

Summary of the proof: The PI kriging mean at all design points is given by

$m_{PI}(X) = C C^{\dagger} y$ .  (6)

In Equation (6), $C C^{\dagger}$ is the orthogonal projection matrix onto the image space of $C$, $\mathrm{Im}(C)$ [5]. Therefore, the PI kriging prediction is the projection of the observed outputs onto $\mathrm{Im}(C)$. Now, suppose that there are redundant points at $k$ different locations. The corresponding columns of the covariance matrix are identical,

$C = \Big[\underbrace{C_1, \ldots, C_{N_1}}_{=\, C_{N_1}}, \;\ldots,\; \underbrace{C_{N_1 + \cdots + N_{k-1} + 1}, \ldots, C_{N}}_{=\, C_{N}}, \; C_{N+1}, \ldots, C_n\Big]$ ,

where $\sum_{i=1}^{k} N_i = N$. In this case the orthogonal projection operator onto $\mathrm{Im}(C)$ is

$P_{\mathrm{Im}(C)} = \begin{pmatrix} \frac{J_{N_1}}{N_1} & & & 0 \\ & \ddots & & \\ & & \frac{J_{N_k}}{N_k} & \\ 0 & & & I_{n-N} \end{pmatrix}$ ,

where $J_{N_i}$ is an $N_i \times N_i$ matrix of ones and $I_{n-N}$ is the identity matrix of size $n - N$. $m_{PI}(X)$ can be written as:

$m_{PI}(X) = P_{\mathrm{Im}(C)}\, y = \big[\bar{y}_1, \ldots, \bar{y}_1, \;\ldots,\; \bar{y}_k, \ldots, \bar{y}_k, \; y_{N+1}, \ldots, y_n\big]^\top$ ,  (7)

in which $\bar{y}_i = \frac{1}{N_i} \sum_{j = N_1 + \cdots + N_{i-1} + 1}^{N_1 + \cdots + N_i} y_j$. This property is shown in Fig. 1 when k = 3.
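The averaging property of Equation (7) is easy to verify numerically. The snippet below is an illustrative check, not the authors' code; it reuses the hypothetical squared-exponential kernel from above and places three coincident observations at x = 2.

```python
import numpy as np

def sq_exp(a, b, theta=0.5):
    return np.exp(-np.subtract.outer(a, b) ** 2 / (2.0 * theta ** 2))

X = np.array([1.0, 1.5, 2.0, 2.0, 2.0, 3.0])   # x = 2 is repeated: N_1 = 3 redundant points
y = np.array([0.5, 1.2, -1.0, 2.0, 5.0, 0.8])
C = sq_exp(X, X)                                # rank-deficient: three identical columns
m_at_X = C @ np.linalg.pinv(C) @ y              # Equation (6): projection of y onto Im(C)

print(m_at_X[2:5])                         # all close to mean([-1.0, 2.0, 5.0]) = 2.0
print(m_at_X[[0, 1, 5]] - y[[0, 1, 5]])    # approximately zero: non-redundant outputs kept
```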

NUGGET REGULARIZATION

In nugget regularization, a positive value $\delta$ is added to the main diagonal of $C$, which increases all the eigenvalues by $\delta$. The condition number of the covariance matrix becomes $\kappa(C + \delta I) = \dfrac{\lambda_{\max} + \delta}{\lambda_{\min} + \delta}$. In [2], it is proved that the vector $c(x)$ is perpendicular to the null space of $C$. Therefore, the kriging mean and variance regularized by nugget are

$m_{Nug}(x) = c(x)^\top \sum_{i=1}^{r} \dfrac{U_i^\top y}{\lambda_i + \delta} U_i$ ,  (8)

$v_{Nug}(x) = c(x,x) - \sum_{i=1}^{r} \dfrac{\big(U_i^\top c(x)\big)^2}{\lambda_i + \delta}$ .  (9)

Comparing Equations (8) and (9) with Equations (4) and (5) shows that nugget and PI regularizations converge to the same behavior as the nugget magnitude decreases, cf. the leftmost plot of Fig. 2. However, maximum likelihood estimation of the nugget term may yield large values which degrade the interpolation quality: in the rightmost plot of Fig. 2, for example, the nugget value estimated by maximum likelihood is 7.07.

Fig. 2: PI regularized kriging is plotted in black (y versus x). Left: the nugget value δ is 1 (cyan) and 0.1 (magenta); for nugget values smaller than 0.01, PI and nugget regularizations cannot be visually distinguished. Right: the nugget estimated by maximum likelihood is δ = 7.07 (blue).
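A nugget-regularized predictor differs from Equations (1) and (2) only by the matrix it inverts, so a sketch is short. The function below is illustrative (same hypothetical kernel and argument conventions as before); it solves with $C + \delta I$ directly, which agrees with Equations (8) and (9) because $c(x)$ is perpendicular to the null space of $C$, as noted above.

```python
import numpy as np

def nugget_kriging(C, y, c_new, c_diag, delta):
    """Nugget-regularized kriging: replace C by C + delta*I in Equations (1)-(2),
    equivalent to Equations (8)-(9) since c(x) is orthogonal to Null(C)."""
    A = C + delta * np.eye(C.shape[0])
    alpha = np.linalg.solve(A, y)                 # (C + delta I)^{-1} y
    W = np.linalg.solve(A, c_new.T)               # (C + delta I)^{-1} c(x) for each new x
    m = c_new @ alpha                             # Equation (8)
    v = c_diag - np.sum(c_new.T * W, axis=0)      # Equation (9)
    return m, v

# As delta -> 0 the nugget prediction approaches the PI prediction (Fig. 2, left),
# whereas a large delta (e.g. the maximum-likelihood value 7.07 of Fig. 2, right)
# pulls the mean away from the observations and leaves nonzero variance at data points.
```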

A DISTRIBUTION-WISE GP MODEL

A distribution-wise GP model is a model with improved interpolation properties at redundant points: the trajectories pass through data points that are uniquely defined (non-redundant), hence the variance is zero there. At redundant points with different outputs, the GP mean and variance are the empirical average and variance of the outputs, respectively. This model is obtained by conditioning on probability distributions instead of data points, see Fig. 3.

Fig. 3: The leftmost picture shows a distribution-wise GP (y versus x). At x = 2 a probability distribution is considered whose mean and variance are the empirical average and variance of the redundant points. On the right: comparison of the kriging means (solid lines) ±2 standard deviations (dashed lines) for maximum likelihood nugget regularization (blue), PI regularization (black) and the distribution-wise GP (DGP, red). PI and DGP pass through the mean of the outputs at redundant points while nugget does not. Contrary to PI, DGP preserves the empirical variance.

Let $Z(X) = [Z(x^1), \ldots, Z(x^n)]^\top$ denote the vector of observed outputs, with mean vector $z$ and diagonal matrix $V$ whose diagonal elements are the distributions' variances: $Z(X) \sim \mathcal{N}(z, V)$. The conditional mean and variance of the distribution-wise GP are expressed as

$m_Z(x) = \mathbb{E}_Z\!\left[c_Z(x)^\top C_Z^{-1} Z(X)\right] = c_Z(x)^\top C_Z^{-1} z$ ,  (10)

$v_Z(x) = \mathbb{E}_Z\!\left[\mathrm{Var}\big(Z(x) \mid Z(X)\big)\right] + \mathrm{Var}\!\left(\mathbb{E}_Z\big[Z(x) \mid Z(X)\big]\right)$  (11)

$\phantom{v_Z(x)} = c_Z(x,x) - c_Z(x)^\top C_Z^{-1} c_Z(x) + c_Z(x)^\top C_Z^{-1} V C_Z^{-1} c_Z(x)$ .  (12)
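The construction can be sketched as follows: group the redundant observations by location, summarize each group by its empirical mean and variance, and plug these summaries into Equations (10) and (12). Everything in the snippet below (the grouping via np.unique, the kernel, the choice of the biased empirical variance) is an illustrative assumption layered on the poster's description, not the authors' code.

```python
import numpy as np

def distribution_wise_gp(X, y, x_new, kernel):
    """Distribution-wise GP of Equations (10) and (12): condition on one empirical
    distribution per unique location instead of on the individual observations."""
    X_u, idx = np.unique(X, return_inverse=True)                  # unique locations
    z = np.array([y[idx == i].mean() for i in range(len(X_u))])   # empirical means
    V = np.diag([y[idx == i].var() for i in range(len(X_u))])     # empirical variances
    C_Z = kernel(X_u, X_u)                  # invertible: redundancy has been removed
    c_Z = kernel(x_new, X_u)
    w = np.linalg.solve(C_Z, c_Z.T)         # C_Z^{-1} c_Z(x) for each prediction point
    m = c_Z @ np.linalg.solve(C_Z, z)                              # Equation (10)
    v = (kernel(x_new, x_new).diagonal()
         - np.sum(c_Z.T * w, axis=0)        # - c_Z(x)^T C_Z^{-1} c_Z(x)
         + np.sum(w * (V @ w), axis=0))     # + c_Z(x)^T C_Z^{-1} V C_Z^{-1} c_Z(x), Eq. (12)
    return m, v

# At a unique location the empirical variance is zero, so the model interpolates there;
# at a redundant location the prediction is the empirical mean and the empirical
# variance is preserved, as illustrated in Fig. 3.
```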

REFERENCES

[1] Donald R. Jones, Matthias Schonlau, and William J. Welch. Efficient Global Optimization of Expensive Black-Box Functions. Journal of Global Optimization, 13(4):455–492, 1998.

[2] Hossein Mohammadi, Rodolphe Le Riche, Eric Touboul, and Xavier Bay. On regularization techniques in statistical learning by Gaussian processes. In NICST’2013, New and smart Information Communication Science and Technology to support Sustainable Development, Clermont Ferrand, France, September 2013.

[3] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. The MIT Press, 2005.

[4] Olivier Roustant, David Ginsbourger, and Yves Deville. DiceKriging, DiceOptim: Two R Packages for the Analysis of Computer Experiments by Kriging-Based Metamodeling and Optimization. Journal of Statistical Software, 51(1):1–55, 2012.

[5] Gilbert Strang. Introduction to Linear Algebra. Wellesley-Cambridge Press, 2009.
