
(1)

A Comparison of Regularization Methods for Gaussian Processes

Rodolphe Le Riche^{1,2}, Hossein Mohammadi^3, Nicolas Durrande^2, Eric Touboul^2, Xavier Bay^2

^1 CNRS LIMOS, France

^2 Ecole des Mines de Saint Etienne, France

^3 Univ. of Exeter, UK

May 2017

SIAM OP17 conference, Vancouver BC

(2)

From PDEs to Gaussian Processes

[Figure: PDE results mapped from inputs x to outputs y]

PDEs parameterized by x (a shape, boundary conditions, ...) and post-processed to yield y(x) (drag, lift, ...).

To describe the many probable y(x) between the observed/simulated (x^i, y(x^i)): conditional Gaussian Processes, GPs.

(3)

GPs and the covariance matrix

Every Gaussian Process involves a covariance matrix C, made of C_ij = Cov(Y(x^i), Y(x^j)), that must be inverted.

[Figure: n = 5 observations (x^1, y(x^1)), ..., (x^{n=5}, y(x^{n=5}))]

Covariance function (of the non conditional GP): k(x, x') = Cov(Y(x), Y(x'))
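To make the notation concrete, here is a minimal sketch of how a covariance function turns design points into the matrix C that must be inverted. The squared exponential kernel and its length scale theta = 0.25 are assumptions for illustration, not prescribed by the slides.

```python
import numpy as np

def k(x, xp, theta=0.25):
    """Covariance function of the non conditional GP, k(x, x') = Cov(Y(x), Y(x'))."""
    return np.exp(-(x - xp) ** 2 / (2 * theta ** 2))

X = np.linspace(0.0, 1.0, 5)                          # n = 5 design points in 1D
C = np.array([[k(xi, xj) for xj in X] for xi in X])   # C_ij = k(x^i, x^j)
y = np.sin(6 * X)                                     # illustrative observations
alpha = np.linalg.solve(C, y)                         # the C^{-1} y term used by the GP mean
```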

(4)

Motivations (1)

Two issues arise. First, points get too close to each other. Consider the columns of C associated with x^3 and x^4,

[ ...  k(x^1, x^3)  k(x^1, x^4)  ... ]
[ ...  k(x^2, x^3)  k(x^2, x^4)  ... ]
[ ...      ...          ...      ... ]
[ ...  k(x^n, x^3)  k(x^n, x^4)  ... ]

The two columns become similar as x^3 → x^4, which makes C ill-conditioned.
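An illustration of this first issue, under the same assumed squared exponential kernel: as x^5 approaches x^4, the condition number of C blows up (the printed values are whatever NumPy returns, not figures from the talk).

```python
import numpy as np

def sqexp(A, B, theta=0.25):
    # squared exponential kernel matrix (assumed, as in the sketch above)
    return np.exp(-(A.reshape(-1, 1) - B.reshape(1, -1)) ** 2 / (2 * theta ** 2))

base = np.array([0.1, 0.3, 0.5, 0.7])
for eps in [1e-1, 1e-3, 1e-6, 1e-9]:
    X = np.append(base, 0.7 + eps)            # x^5 approaches x^4
    print(f"eps = {eps:.0e}   cond(C) = {np.linalg.cond(sqexp(X, X)):.2e}")
```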

(5)

Motivations (2)

Examples of points too close:

(right) optimization using GPs with the EGO algorithm [D.R. Jones et al. 1998] on the 6D Hartmann function. Note the clusters of points.

Data spatially distributed like human beings (e.g., phones) are strongly clustered (towns vs. low-density areas).

Ill-conditioning even happens without close points: additive kernels and rectangular design patterns, periodic kernels. Cf. the HAL report no. 01264192 accompanying this talk (same title).

(6)

Motivations (3)

If points are too close, just delete the extra ones? Second issue: they may carry different information.

[Figure: two nearby points x^3 and x^4 whose responses differ, |y^3 − y^4| large]

Examples:

Repeated performance measures on human patients.

PDEs of chaotic phenomena, e.g., crash simulation: force vs. time under infinitesimal perturbations of the mesh [from M. Maliki, Adaptive surrogate models for the reliable lightweight design of automotive body structures, PhD, 2016].

(7)

Overview of this work

Covariance matrix ill-conditioning is endemic: GP software always includes a regularization strategy, nugget (a.k.a. Tikhonov regularization, ridge regression) or pseudo-inverse.

What is the difference between nugget and pseudo-inverse? Do they always have the required interpolation properties?

⇒ Clear differences between pseudo-inverse and nugget arise when pushing ill-conditioning to true singularity of C: close points are merged, almost redundant points are made redundant.

⇒ We propose another regularization approach, the distribution-wise GP.

(8)

Related studies

In order to deal with C ill-conditioning in Gaussian Processes, the following approaches have been proposed.

Matrix regularization (more general than GPs): nugget and pseudo-inverse, see later.

Choice of the design points X: [Salagame and Barton, Factorial Des. for Spat. Correl. Reg., J. of Appl. Stats., 97], [Rennen, Subset select. from large datasets for krg. modeling, SMO, 09], ...

Choice of the covariance function k(x, x'): [Davis and Morris, 6 factors which affect the condit. nb of mat. associated with krg, Math. Geol., 97], [Belsley, Regression Diagnostics, 2005], ...

GPs without C inversion: [Gibbs, Bayesian GPs for Reg. and Classif., PhD, 97].

(9)

Conditional Gaussian Processes: density

Y(x) | x^1, y(x^1), ..., x^n, y(x^n) ~ N(m(x), c(x, x)),   x ∈ X ⊂ R^d,   y ∈ R

[Figure: conditional density of Y(x); the curves shown are m(x) + c(x, x), m(x) and m(x) − c(x, x), with c(x, x) ≡ v(x)]

Cf. [Rasmussen and Williams, Gaussian Processes for Machine Learning, 2006]

(10)

Gaussian Processes: conditional mean and covariance

Y(x) | X, y ~ N(m(x), c(x, x))   (reminders in gray)

m(x) = c(x)^T C^{-1} [y(x^1), ..., y(x^n)]^T

c(x, x') = k(x, x') − c(x)^T C^{-1} c(x')   (at x' = x, c(x, x) ≡ v(x))

(next: the covariances c(x) and C)

(11)

Gaussian Processes: covariance vector and matrix

Y(x) | X, y ~ N(m(x), c(x, x))

m(x) = c(x)^T C^{-1} y
c(x, x) = k(x, x) − c(x)^T C^{-1} c(x)

Covariance vector: c(x)^T = [k(x^1, x), ..., k(x^n, x)]

Covariance matrix (built from the covariance function k):

C = [ k(x^1, x^1)  k(x^1, x^2)  ...  k(x^1, x^n) ]
    [ k(x^2, x^1)  k(x^2, x^2)  ...  k(x^2, x^n) ]
    [     ...          ...      ...      ...     ]
    [ k(x^n, x^1)  k(x^n, x^2)  ...  k(x^n, x^n) ]

X = {x^1, ..., x^n},   m(X) ≡ [m(x^1), ..., m(x^n)]^T = C C^{-1} y ∈ Im(C)
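A minimal sketch of these conditional formulas for well-separated 1D points (no regularization yet); the squared exponential kernel, its length scale and the toy data are our assumptions for illustration.

```python
import numpy as np

def sqexp(A, B, theta=0.25):
    # squared exponential kernel matrix (assumed, as in the earlier sketches)
    return np.exp(-(A.reshape(-1, 1) - B.reshape(1, -1)) ** 2 / (2 * theta ** 2))

def gp_predict(x, X, y, theta=0.25):
    """m(x) = c(x)^T C^{-1} y  and  v(x) = k(x, x) - c(x)^T C^{-1} c(x)."""
    C = sqexp(X, X, theta)              # covariance matrix
    cx = sqexp(X, x, theta)             # covariance vectors, one column per prediction point
    Cinv_cx = np.linalg.solve(C, cx)    # C^{-1} c(x), no explicit inverse
    m = cx.T @ np.linalg.solve(C, y)
    v = sqexp(x, x, theta).diagonal() - np.sum(cx * Cinv_cx, axis=0)
    return m, v

X = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([0.0, 1.0, 3.0, 2.0, -1.0])
m, v = gp_predict(np.linspace(1.0, 3.0, 11), X, y)   # m interpolates, v ~ 0 at the data points
```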

(12)

C Eigenanalysis

Image space: Im(C) = span(V_1, ..., V_r) = span(V)
Null space (non-empty when C is singular): Null(C) = span(W_1, ..., W_{n−r}) = span(W)

C = [V W] [ diag(λ)_{r×r}     0_{r×(n−r)}     ] [V W]^T
          [ 0_{(n−r)×r}       0_{(n−r)×(n−r)} ]
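A numerical sketch of this eigenanalysis; the relative tolerance used to decide the rank r is an implementation choice of ours, not something specified on the slides.

```python
import numpy as np

def eigen_split(C, tol=1e-10):
    """Bases V of Im(C) and W of Null(C) from the eigendecomposition of C."""
    lam, U = np.linalg.eigh(C)            # C is symmetric positive semi-definite
    keep = lam > tol * lam.max()          # numerical decision on the rank r
    return U[:, keep], U[:, ~keep], lam[keep]   # V, W, and the r positive eigenvalues
```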

(13)

Definition of redundant points

Redundant points definition: the points responsible for the linear dependency of the columns of C. Their indices are the non-zero off-diagonal terms of V V^T.

Example where points {1, 2, 6} and {3, 4} are redundant (here actually repeated):

[Figure: the six design points X1, ..., X6 in [0, 1]^2, with the repeated points overlapping]

         [ 0.33  0.33  0.00  0.00  0.00  0.33 ]
         [ 0.33  0.33  0.00  0.00  0.00  0.33 ]
V V^T =  [ 0.00  0.00  0.50  0.50  0.00  0.00 ]
         [ 0.00  0.00  0.50  0.50  0.00  0.00 ]
         [ 0.00  0.00  0.00  0.00  1.00  0.00 ]
         [ 0.33  0.33  0.00  0.00  0.00  0.33 ]

k(x, x') = exp(−(x_1 − x'_1)^2 / (2 × 0.25^2)) × exp(−(x_2 − x'_2)^2 / (2 × 0.25^2))

Definitions and properties framed with a green background are, to the authors' knowledge, "new".
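A sketch of how the redundant-point indices could be read off the non-zero off-diagonal terms of V V^T numerically; the helper name and the tolerance are ours, not from the talk.

```python
import numpy as np

def redundant_groups(C, tol=1e-10):
    """Group indices of redundant points from the off-diagonal terms of V V^T."""
    lam, U = np.linalg.eigh(C)
    V = U[:, lam > tol * lam.max()]       # eigenvectors spanning Im(C)
    P = V @ V.T                           # = V V^T, orthogonal projector onto Im(C)
    n = C.shape[0]
    groups = set()
    for i in range(n):
        linked = tuple(sorted(j for j in range(n) if abs(P[i, j]) > tol))
        if len(linked) > 1:               # keep only points linked to others
            groups.add(linked)
    return [list(g) for g in sorted(groups)]

# for the slide's example (points {1, 2, 6} and {3, 4} repeated), this should
# return [[0, 1, 5], [2, 3]] in 0-based indexing
```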

(14)

Kriging with Pseudo-Inverse (1)

E.g., in [ Siefert et al., MAPS software ].

Moore-Penrose pseudo-inverse:

C^† = [V W] [ diag(1/λ)_{r×r}    0_{r×(n−r)}     ] [V W]^T
            [ 0_{(n−r)×r}        0_{(n−r)×(n−r)} ]

m(x) = c(x)^T C^{-1} y   becomes, when C is singular,   m_PI(x) = c(x)^T C^† y   (idem with c(x, x))

PI kriging property 1: the PI kriging prediction at the data points X is the orthogonal projection of the observations onto Im(C),

m_PI(X) = V V^T y

(proof straightforward, cf. report)

m_PI(·) is all the kriging model can express.

(15)

Kriging with Pseudo-Inverse (2)

PI kriging properties 2 & 3: at data points, repeated or not, the PI kriging mean averages the output values and the PI variance is zero.

Proof of Property 2: exhibit the projection matrix, which is made of averaging formulas at repeated points. Proof of Property 3: direct application of the pseudo-inverse property C C^† C = C.

[Figure: PI kriging on a 1D example with repeated data points]

(16)

Kriging with nugget (1)

The most often used regularization approach, a.k.a. Tikhonov regularization [A. Tikhonov, 1943], ridge regression [A. Hoerl, 1962].

In GPs, see for example [Andrianakis and Challenor, The effect of the nugget on Gaussian process emulators of computer models, Comput. Stats. & Data Anal., 12], [Ranjan et al., A computationally stable approach to GP interpolation of deterministic computer simulation, Technometrics, 11], [Gramacy and Lee, Cases for the nugget in modeling computer experiments, Stats. & Computing, 12], ...

Principle: add τ^2 to the diagonal,

m(x) = c(x)^T C^{-1} y   becomes, when C is singular,   m_Nug(x) = c(x)^T (C + τ^2 I)^{-1} y   (idem with c(x, x))
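A minimal sketch of nugget regularization under the same assumed kernel; the default value of tau2 is an arbitrary illustration, not a recommendation from the talk.

```python
import numpy as np

def sqexp(A, B, theta=0.25):
    # squared exponential kernel matrix (assumed, as before)
    return np.exp(-(A.reshape(-1, 1) - B.reshape(1, -1)) ** 2 / (2 * theta ** 2))

def nugget_kriging(x, X, y, tau2=1e-6, theta=0.25):
    """Nugget (Tikhonov) regularization: replace C by C + tau^2 I."""
    C = sqexp(X, X, theta) + tau2 * np.eye(len(X))
    cx = sqexp(X, x, theta)
    m = cx.T @ np.linalg.solve(C, y)                                     # m_Nug(x)
    v = sqexp(x, x, theta).diagonal() - np.sum(cx * np.linalg.solve(C, cx), axis=0)
    return m, v
```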

(17)

Kriging with nugget (2)

C  ⇒  (C + τ^2 I)

With nugget, kriging no longer interpolates and has a non-zero variance at data points.

[Figure, right: the 1D example with PI (solid lines), nugget estimated by maximum likelihood (dashed lines) and nugget estimated by cross-validation (dash-dotted lines)]

(18)

Nugget estimation

C  ⇒  (C + τ^2 I), an intuitive result:

Nugget ML property: the nugget estimated by maximum likelihood, τ̂^2, increases with the spread at repeated points.

Proof: eigendecomposition and monotonicities of the log-likelihood w.r.t. spread and nugget.

[Figure, right: observations repeated at x^1, x^2, ..., x^k with two levels of spread; τ̂_+^2 is associated with the + points, τ̂^2 with the • points, and τ̂_+^2 > τ̂^2]
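A sketch of how τ̂^2 could be obtained numerically, assuming a zero-mean GP with the standard Gaussian log-likelihood and the other kernel hyperparameters held fixed; this is our illustration, not the authors' estimation code.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_lik(log_tau2, C, y):
    """Negative Gaussian log-likelihood of y under N(0, C + tau^2 I)."""
    K = C + np.exp(log_tau2) * np.eye(len(y))
    L = np.linalg.cholesky(K)
    a = np.linalg.solve(L.T, np.linalg.solve(L, y))      # K^{-1} y via Cholesky
    return 0.5 * y @ a + np.log(np.diag(L)).sum() + 0.5 * len(y) * np.log(2 * np.pi)

def ml_nugget(C, y):
    """tau^2 maximizing the likelihood, the kernel hyperparameters being fixed."""
    res = minimize_scalar(neg_log_lik, bounds=(-20.0, 5.0), args=(C, y), method="bounded")
    return np.exp(res.x)
```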

(19)

GP model-data discrepancy

Definition: model-data discrepancy,

discr = ||y − m_PI(X)||^2 / ||y||^2 = ||W W^T y||^2 / ||y||^2   if r < n,
      = 0   if r = n,
with 0 ≤ discr ≤ 1.

How to change y to reduce discr (if applicable): step along

−∇_y ||y − m_PI(X)||^2 ∝ −W W^T y
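A small numerical sketch of the discrepancy formula; the rank tolerance is again an implementation choice of ours.

```python
import numpy as np

def model_data_discrepancy(C, y, tol=1e-10):
    """discr = ||W W^T y||^2 / ||y||^2, with W an orthonormal basis of Null(C)."""
    lam, U = np.linalg.eigh(C)
    W = U[:, lam <= tol * lam.max()]      # null-space eigenvectors
    if W.shape[1] == 0:                   # r = n: C non-singular, discr = 0
        return 0.0
    # ||W W^T y|| = ||W^T y|| because the columns of W are orthonormal
    return float(np.sum((W.T @ y) ** 2) / np.sum(y ** 2))
```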

(20)

PI vs. Nugget GP regularization (1)

PI - nugget equivalence property: as the nugget decreases to 0, the mean and covariance of GPs regularized by PI and by nugget tend towards each other.

The proof involves an eigendecomposition and the following property:

Covariance vector property: c(x) is perpendicular to the null space of C.

Proof based on the positive definiteness of the extended covariance matrix at (x, X).

[Figures: the 1D example regularized with nugget τ^2 = 1 and τ^2 = 0.1; the nugget GP approaches the PI GP as τ^2 decreases]

(21)

PI vs. Nugget GP regularization (2)

As the model-data discrepancy decreases (C remaining singular), the nugget maximizing the (regularized) likelihood tends to 0 ⇒ PI and nugget regularizations become equivalent.

A non-zero discrepancy affects m_Nug(·) throughout the X space, while it only affects m_PI(·) at the redundant points.

[Contour plots of the predicted mean over (x_1, x_2) with design points x^1, ..., x^7. Left: additive kernel and function, discr = 0, τ̂^2 = 10^{−2}. Right: y^3 changed so the data are no longer additive, τ̂^2 ≈ 1.91.]

(22)

Interpolating repeated points?

The point-wise definition of interpolation does not apply. We wish for a GP model whose trajectories pass through uniquely defined data points (c(x^i, x^i) = 0) and whose process mean and variance equal the empirical mean and variance at repeated points.

⇒ interpolate distributions.

[Figure: 1D example with repeated points having different outputs]

(23)

Distribution-wise GP (1)

Derived and used differently in [Titsias, Variational learning of inducing var. in sparse GPs, JMLR, 2009]

Derivation: assume that the observations are actually random, y(x^i) → Z(x^i). Let k be the number of unique points (repeated points counted once, e.g., k = 5, n = 8 on the previous plot). Use the laws of total expectation and total variance,

m_Dist(x) = E_Z[ E( Y(x) | y(x^i) = Z(x^i), 1 ≤ i ≤ k ) ]
          = E_Z[ c_Z(x)^T C_Z^{-1} Z ]
          = c_Z(x)^T C_Z^{-1} E(Z)

c_Dist(x, x) = E_Z[ Var( Y(x) | y(x^i) = Z(x^i), 1 ≤ i ≤ k ) ]
             + Var_Z[ E( Y(x) | y(x^i) = Z(x^i), 1 ≤ i ≤ k ) ]
             = k(x, x) − c_Z(x)^T C_Z^{-1} c_Z(x) + c_Z(x)^T C_Z^{-1} Var(Z) C_Z^{-1} c_Z(x)

(compare to m(x) = c(x)^T C^{-1} y and c(x, x) = k(x, x) − c(x)^T C^{-1} c(x))

(24)

Distribution-wise GP (2)

Interpolation property of distribution-wise GPs: they interpolate the means and variances at data points,

m_Dist(x^i) = E(Z_i),   c_Dist(x^i, x^i) = Var(Z_i)

Distribution-wise GPs scale as O(k^3), which may be cheaper than the usual O(n^3) cost of traditional GPs.

Implementation: use the empirical means for E(Z) and the diagonal of empirical (uncorrelated) variances for Var(Z).
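A minimal sketch of this implementation for 1D inputs with exactly repeated sites, under the same assumed squared exponential kernel; the function name and the toy data are ours.

```python
import numpy as np

def sqexp(A, B, theta=0.25):
    # squared exponential kernel matrix (assumed, as before)
    return np.exp(-(A.reshape(-1, 1) - B.reshape(1, -1)) ** 2 / (2 * theta ** 2))

def dist_gp(x, X, y, theta=0.25):
    """Distribution-wise GP mean and variance with empirical moments of Z."""
    Xu, inv = np.unique(X, return_inverse=True)          # k unique sites
    EZ = np.array([y[inv == j].mean() for j in range(len(Xu))])
    VarZ = np.diag([y[inv == j].var() for j in range(len(Xu))])
    CZ = sqexp(Xu, Xu, theta)                            # k x k covariance matrix C_Z
    cZ = sqexp(Xu, x, theta)                             # c_Z(x)
    A = np.linalg.solve(CZ, cZ)                          # C_Z^{-1} c_Z(x)
    m = cZ.T @ np.linalg.solve(CZ, EZ)                   # m_Dist(x)
    v = sqexp(x, x, theta).diagonal() - np.sum(cZ * A, axis=0) \
        + np.sum(A * (VarZ @ A), axis=0)                 # c_Dist(x, x)
    return m, v

# toy usage: three unique sites, the middle one observed three times with spread;
# at the data sites m returns the empirical means and v the empirical variances
X = np.array([1.0, 2.0, 2.0, 2.0, 3.0])
y = np.array([2.0, 2.5, 3.0, 3.5, 1.0])
m, v = dist_gp(np.array([1.0, 2.0, 3.0]), X, y)
```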

(25)

Distribution-wise GP versus nugget regularization

Z ~ N( [2, 3, 1]^T , diag(0.25, 0, 0.25) )

Observe how the distribution-wise GP (solid) is independent of the number of samples n, while the variance of the GP regularized by nugget (dashes) tends to 0 as n increases.

[Figures: the two models on this example for a small and a large number of repeated samples n]

(26)

Concluding remarks

We have given new algebraic results for comparing nugget and pseudo-inverse as regularization strategies in Gaussian Processes.

Proofs and further discussion in our report, An analytic comparison of regularization methods for Gaussian Processes, HAL Technical report no. hal-01264192 (version 1, Jan. 2016 → version 4, May 2017).

Possible continuations: investigate clustering of data points for distribution-wise GPs as a way to deal with large numbers of data points; investigate the use of a heteroskedastic nugget as a regularization strategy (using, e.g., developments of [Binois et al., Practical heteroskedastic GP model. for large simul. exp., ArXiv TR, 2016]).
