(1)

Deep Gaussian Processes

Maurizio Filippone

EURECOM, Sophia Antipolis, France

August 31st, 2018

(2)

1 Introduction

2 Inference for Deep Gaussian Processes

3 Convolutional Deep Gaussian Processes

4 Conclusions


(3)

Introduction

(4)

Gaussian Processes - Priors over Functions

• Infinite Gaussian random variables with parametric and input-dependent covariance

Rasmussen and Williams, 2006


(5)

Gaussian Processes as Infinitely-Wide Shallow Neural Nets

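• Take $W^{(i)} \sim \mathcal{N}(0, \alpha_i I)$

• Central Limit Theorem implies that $F$ is Gaussian

[Diagram: single-hidden-layer network, X → Φ → F, with weights W(0) and W(1)]

• $F$ has zero mean

• $\mathrm{cov}(F) = \mathbb{E}_{p(W^{(0)}, W^{(1)})}\left[\Phi(X W^{(0)})\, W^{(1)} W^{(1)\top}\, \Phi(X W^{(0)})^\top\right]$

Neal, LNS, 1996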

(6)

Gaussian Processes as Infinitely-Wide Shallow Neural Nets

• Take $W^{(i)} \sim \mathcal{N}(0, \alpha_i I)$

• Central Limit Theorem implies that $F$ is Gaussian

[Diagram: single-hidden-layer network, X → Φ → F, with weights W(0) and W(1)]

• $F$ has zero mean

• $\mathrm{cov}(F) = \alpha_1\, \mathbb{E}_{p(W^{(0)})}\left[\Phi(X W^{(0)})\, \Phi(X W^{(0)})^\top\right]$

• Some choices of $\Phi$ lead to analytic expressions of known kernels (RBF, Matérn, arc-cosine, Brownian motion, ...)

Neal, LNS, 1996
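The expectation over $W^{(0)}$ can be checked numerically for one choice of $\Phi$. The sketch below is an illustration, not part of the talk: it assumes $\Phi$ = ReLU, standard normal first-layer weights, and an average over hidden units (that is, $\alpha_1$ scaled by one over the width). It compares the Monte Carlo estimate with the closed-form arc-cosine kernel of degree 1. The function names are hypothetical.

```python
import numpy as np

def relu_feature_cov(x, y, n_hidden=200_000, seed=0):
    """Monte Carlo estimate of E_w[relu(w.x) * relu(w.y)] with w ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    W0 = rng.standard_normal((x.shape[0], n_hidden))  # one column per hidden unit
    phi_x = np.maximum(x @ W0, 0.0)
    phi_y = np.maximum(y @ W0, 0.0)
    return float(np.mean(phi_x * phi_y))

def arc_cosine_1(x, y):
    """Closed form of the same expectation (arc-cosine kernel, degree 1)."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    theta = np.arccos(np.clip(x @ y / (nx * ny), -1.0, 1.0))
    return nx * ny * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2.0 * np.pi)

x = np.array([1.0, 0.5, -0.3])
y = np.array([0.2, -1.0, 0.8])
mc = relu_feature_cov(x, y)
exact = arc_cosine_1(x, y)
print(f"Monte Carlo: {mc:.4f}   closed form: {exact:.4f}")
```

The two numbers agree up to Monte Carlo error, which shrinks as the number of hidden units grows.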

(7)

Gaussian Processes - Priors over Functions

K = [Figure: the n × n covariance matrix]

Rasmussen and Williams, 2006

(10)

Gaussian Processes - Regression example

• Inputs = $X$, labels = $Y$

• Introduce latent variables $F$ with covariance $K = K(X, \theta)$

• Introduce Gaussian likelihood $p(Y \mid F)$

[Figure: regression example, vertical axis from −3 to 3]

• Posterior: $p(F \mid Y, X, \theta) = \dfrac{p(Y \mid F)\, p(F \mid X, \theta)}{\int p(Y \mid F)\, p(F \mid X, \theta)\, dF}$

Rasmussen and Williams, 2006

(11)

Gaussian Processes - Regression example

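• Predictive distribution at test points (latent values $F_*$): $p(F_* \mid Y, X, \theta) = \int p(F_* \mid F, \theta)\, p(F \mid Y, X, \theta)\, dF$

[Figure: regression example, x and y axes from −3 to 3]

• Posterior: $p(F \mid Y, X, \theta) = \dfrac{p(Y \mid F)\, p(F \mid X, \theta)}{\int p(Y \mid F)\, p(F \mid X, \theta)\, dF}$

Rasmussen and Williams, 2006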
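With a Gaussian likelihood both the posterior and the predictive integral are available in closed form. The following is a minimal sketch of that computation under assumptions not stated on the slides: an RBF kernel, a fixed noise variance, and toy sine data. The helper names `rbf_kernel` and `gp_predict` are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between the rows of A and the rows of B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def gp_predict(X, y, X_star, noise_var=0.1):
    """Mean and covariance of p(F_* | Y, X, theta) for a Gaussian likelihood."""
    K_y = rbf_kernel(X, X) + noise_var * np.eye(len(X))
    K_s = rbf_kernel(X, X_star)
    K_ss = rbf_kernel(X_star, X_star)
    L = np.linalg.cholesky(K_y)                       # K_y = L L^T, O(n^3)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_s.T @ alpha
    V = np.linalg.solve(L, K_s)
    cov = K_ss - V.T @ V
    return mean, cov

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 20)[:, None]
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)
X_star = np.linspace(-3, 3, 50)[:, None]
mean, cov = gp_predict(X, y, X_star, noise_var=0.01)
```

The Cholesky factorization is the cubic-cost step that later slides try to avoid.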

(12)

Gaussian Processes - Classification example

• Inputs = $X$, labels = $Y$

• Introduce latent variables $F$ with covariance $K = K(X, \theta)$

• Introduce Bernoulli likelihood $p(Y \mid F)$

[Figure: classification example, vertical axis from 0 to 1]

• Posterior: $p(F \mid Y, X, \theta) = \dfrac{p(Y \mid F)\, p(F \mid X, \theta)}{\int p(Y \mid F)\, p(F \mid X, \theta)\, dF}$

Rasmussen and Williams, 2006

(13)

Gaussian Processes - Classification example

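• Predictive distribution at test points (latent values $F_*$): needs an approximation to $p(F \mid Y, X, \theta)$!

$p(F_* \mid Y, X, \theta) = \int p(F_* \mid F, \theta)\, p(F \mid Y, X, \theta)\, dF$

[Figure: classification example, vertical axis from 0 to 1]

• Posterior: $p(F \mid Y, X, \theta) = \dfrac{p(Y \mid F)\, p(F \mid X, \theta)}{\int p(Y \mid F)\, p(F \mid X, \theta)\, dF}$

Rasmussen and Williams, 2006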

(14)

Challenges and Limitations

• Marginal likelihood: $p(Y \mid X, \theta) = \int p(Y \mid F)\, p(F \mid X, \theta)\, dF$

• Kernel design

• $p(Y \mid X, \theta)$ might be expensive to compute (factorize $K$)

• $p(Y \mid X, \theta)$ might not even be computable!

[Graphical model with nodes X, θ, F and Y]

(15)

Deep Gaussian Processes for Large Representational Power

• Bypassing kernel design through composition of processes

(f ◦ g)(x)??

Neal,LNS, 1996 – Damianou and Lawrence,AISTATS, 2013


(16)

Deep Gaussian Processes for Large Representational Power

• Composition of stationary processes yields something very complex

[Graphical model with nodes X, F(1), F(2) and Y, and hyperparameters θ(1) and θ(2)]

Neal, LNS, 1996 – Damianou and Lawrence, AISTATS, 2013 – Duvenaud et al., AISTATS, 2014

(17)

Pathologies of Deep Gaussian Processes

• Deep is not necessarily good!

• Example

[Figure: four panels labelled 1 Layer, 2 Layers, 5 Layers and 10 Layers]

[Graphical model with nodes X, F(1), F(2) and Y, and hyperparameters θ(1) and θ(2)]

Neal, LNS, 1996 – Duvenaud et al., AISTATS, 2014 – Matthews et al., arXiv, 2018

(18)

Pathologies of Deep Gaussian Processes

• Deep is not necessarily good!

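• Feeding the input to each layer helps...

[Figure: four panels labelled 1 Layer, 2 Layers, 5 Layers and 10 Layers]

[Graphical model with nodes X, F(1), F(2) and Y, and hyperparameters θ(1) and θ(2)]

Neal, LNS, 1996 – Duvenaud et al., AISTATS, 2014 – Matthews et al., arXiv, 2018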

(19)

Learning Deep Gaussian Processes

• Inference requires calculating integrals of this kind:

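$$p(Y \mid X, \theta) = \int p\left(Y \mid F^{(N_h)}, \theta^{(N_h)}\right) \times p\left(F^{(N_h)} \mid F^{(N_h - 1)}, \theta^{(N_h - 1)}\right) \times \ldots \times p\left(F^{(1)} \mid X, \theta^{(0)}\right)\, dF^{(N_h)} \ldots dF^{(1)}$$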

• Extremely challenging!


(20)

The Deep Learning Revolution

• Large representational power

• Mini-batch-based learning

• Exploit GPU and distributed computing

• Automatic differentiation

• Mature development of regularization (e.g., dropout)

• Application-specific representations (e.g., convolutional)


(21)

Stochastic Gradient Optimization

$$\mathbb{E}\left\{\widehat{\nabla}_{\mathrm{par}}\, \mathrm{LowerBound}\right\} = \nabla_{\mathrm{par}}\, \mathrm{LowerBound}$$

[Figure: two panels, both axes from −4 to 4]

Robbins and Monro,AoMS, 1951


(22)

Stochastic Variational Inference - Illustration

$$\mathrm{vpar}' = \mathrm{vpar} + \frac{\alpha_t}{2}\, \widehat{\nabla}_{\mathrm{vpar}}(\mathrm{LowerBound}), \qquad \alpha_t \to 0$$

[Figure: illustration panels, including a plot against Iteration (0 to 100) and a plot of p(par | data)]

(23)

Is There Any Hope for GPs and DGPs?

• Mini-batch training is straightforward when the objective factorizes over training points

$$\mathrm{objective} = \sum_i f(y_i, x_i, \mathrm{par})$$

• In GPs the latent variables are fully correlated

$$p(F \mid X, \theta) = \mathcal{N}(F \mid 0, K(X, \theta)) \propto \exp\left(-\frac{1}{2} F^\top K^{-1} F\right)$$

• Naïve mini-batch approaches would totally break this!

Can we exploit what made Deep Learning successful for practical and scalable learning of (Deep) Gaussian processes?
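A small numerical check makes the point concrete. The sketch below is illustrative only; the RBF kernel, the jitter and the batch size are assumptions. It evaluates the quadratic form $-\frac{1}{2} F^\top K^{-1} F$ on the full data and compares it with the sum of the same expression computed independently on each mini-batch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, batch = 8, 4
x = np.linspace(-3, 3, n)
# RBF covariance with a small jitter for numerical stability (assumption)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2) + 1e-6 * np.eye(n)
F = np.linalg.cholesky(K) @ rng.standard_normal(n)    # one draw from N(0, K)

# Exponent of the GP prior using all points jointly
full = -0.5 * F @ np.linalg.solve(K, F)

# Naive "mini-batch" version: treat each batch as if it were independent
naive = 0.0
for start in range(0, n, batch):
    idx = slice(start, start + batch)
    naive += -0.5 * F[idx] @ np.linalg.solve(K[idx, idx], F[idx])

print(f"full: {full:.4f}   sum over batches: {naive:.4f}")
```

The two values differ because the per-batch sum ignores the off-diagonal blocks of $K$, which is exactly the correlation structure that defines the GP prior.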

(25)

Inference for Deep Gaussian Processes

(26)

Inference for DGPs

• Inducing points-based approximations
  • VI + Titsias (AISTATS, 2009) sparse GP
    • Damianou and Lawrence, AISTATS, 2013
    • Hensman and Lawrence, arXiv, 2014
    • Salimbeni and Deisenroth, NIPS, 2017
  • EP + FITC: Bui et al., ICML, 2016
  • MCMC + Titsias (AISTATS, 2009) sparse GP
    • Havasi et al., arXiv, 2018

• Random feature-based approximations
  • Gal and Ghahramani, ICML, 2016
  • Cutajar et al., ICML, 2017

(27)

Inference for DGPs

• Low-rank approximation options: $O(n m^2)$

• Let $P$ be a low-rank approximation to $K_y$

• The Woodbury identity exploits the low-rank structure of $P$

[Embedded poster] Preconditioning Kernel Matrices

K. Cutajar (1), M. A. Osborne (2), J. P. Cunningham (3), M. Filippone (1)

(1) EURECOM, Sophia Antipolis, France
(2) University of Oxford, Oxford, UK
(3) Columbia University, New York City, USA

Kernel Machines and Solving Linear Systems

• Operate in a high-dimensional, implicit feature space;

• Rely on the construction of an n × n Gram matrix K;

• Popular kernels:

RBF: $k(x_i, x_j) = \sigma^2 \exp\left(-\tfrac{1}{2} d^2\right)$; Matérn: $k_{\nu = 3/2}(x_i, x_j) = \sigma^2 \left(1 + \sqrt{3}\, d\right) \exp\left(-\sqrt{3}\, d\right)$; where $d^2 = (x_i - x_j)^\top \Lambda\, (x_i - x_j)$.

• Involve the solution of linear systems Kz = v;

• Cholesky decomposition: $O(n^2)$ space and $O(n^3)$ time, infeasible for large n.

Fig. 1: Kernel machines enable non-linear separation of data.

• Conjugate Gradient (CG):
  Numerical solution of linear systems constructed using matrix-vector multiplications;
  $O(t n^2)$ for t CG iterations; in theory t = n (possibly worse).

Fig. 2: CG [illustration of the iterates from z0 to z]

• Preconditioned Conjugate Gradient (PCG):
  Transforms the linear system to be better conditioned, improving convergence;
  Yields a new linear system of the form $P^{-1} K z = P^{-1} v$;
  $O(t n^2)$ for t PCG iterations; in practice t ≪ n.

Fig. 3: Preconditioned CG [illustration of the iterates from z0 to z]

Preconditioning Approaches

• Suppose we want to precondition $K_y = K + \lambda I$;

• Our choice of preconditioner should:
  Approximate $K_y$ as closely as possible;
  Be easy to invert.

• For low-rank preconditioners we employ the Woodbury inversion lemma: $K_y \approx P = [\ldots]$, $P^{-1} = [\ldots]$

• For other preconditioners we solve inner linear systems once again using CG!

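| Preconditioner | Formulation | Strategy |
|---|---|---|
| Nyström | $P = K_{XU} K_{UU}^{-1} K_{UX} + \lambda I$, where $U \subset X$ | Woodbury |
| FITC | $P = K_{XU} K_{UU}^{-1} K_{UX} + \mathrm{diag}\left(K - K_{XU} K_{UU}^{-1} K_{UX}\right) + \lambda I$ | Woodbury |
| PITC | $P = K_{XU} K_{UU}^{-1} K_{UX} + \mathrm{bldiag}\left(K - K_{XU} K_{UU}^{-1} K_{UX}\right) + \lambda I$ | Woodbury |
| Spectral | $P_{ij} = \frac{\sigma^2}{m} \sum_{r=1}^{m} \cos\left[2\pi s_r^\top (x_i - x_j)\right] + \lambda I_{ij}$ | Woodbury |
| Partial SVD | $K = A \Lambda A^\top \Rightarrow P = A_{[\cdot, 1:m]}\, \Lambda_{[1:m, 1:m]}\, A^\top_{[1:m, \cdot]} + \lambda I$ | Woodbury |
| Block Jacobi | $P = \mathrm{bldiag}(K) + \lambda I$ | Block inverse |
| SKI | $P = W K_{UU} W^\top + \lambda I$, where $K_{UU}$ is Kronecker | Inner CG |
| Regularization | $P = K + \delta I + \lambda I$ | Inner CG |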
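As an illustration of the Nyström row, the sketch below runs plain CG and PCG on the same system $K_y z = v$. It is not the authors' implementation. The data, kernel settings, choice of inducing points (every tenth input), and the jitter added to $K_{UU}$ are all assumptions. The preconditioner is applied through the Woodbury identity in the form $(B B^\top + \lambda I)^{-1} v = \lambda^{-1}\left[v - B(\lambda I + B^\top B)^{-1} B^\top v\right]$, with $B B^\top = K_{XU} K_{UU}^{-1} K_{UX}$.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def nystrom_preconditioner(K_XU, K_UU, lam, jitter=1e-6):
    """Return v -> P^{-1} v for P = K_XU K_UU^{-1} K_UX + lam * I (Woodbury)."""
    m = K_UU.shape[0]
    L = np.linalg.cholesky(K_UU + jitter * np.eye(m))  # jitter is an assumption
    B = np.linalg.solve(L, K_XU.T).T                   # n x m, B B^T ~ Nystrom term
    inner = lam * np.eye(m) + B.T @ B                  # m x m system only
    def apply(v):
        return (v - B @ np.linalg.solve(inner, B.T @ v)) / lam
    return apply

def pcg(matvec, b, precond=None, tol=1e-8, max_iter=1000):
    """(Preconditioned) conjugate gradient; returns the solution and iteration count."""
    x = np.zeros_like(b)
    r = b.copy()
    z = precond(r) if precond is not None else r.copy()
    p = z.copy()
    rz = r @ z
    for it in range(1, max_iter + 1):
        Ap = matvec(p)
        alpha = rz / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            return x, it
        z = precond(r) if precond is not None else r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter

rng = np.random.default_rng(0)
n, lam = 400, 1e-2
X = np.linspace(0, 10, n)[:, None]
K_y = rbf(X, X) + lam * np.eye(n)
U = X[::10]                                            # inducing points, U subset of X
precond = nystrom_preconditioner(rbf(X, U), rbf(U, U), lam)
v = rng.standard_normal(n)

z_cg, it_cg = pcg(lambda p: K_y @ p, v)
z_pcg, it_pcg = pcg(lambda p: K_y @ p, v, precond)
print(f"CG iterations: {it_cg}   PCG iterations: {it_pcg}")
```

Each application of the preconditioner costs $O(nm)$ after an $O(nm^2)$ setup, so the saving in iterations is not cancelled by the extra work per iteration.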

[Figure, reproduced from the paper "Preconditioning Kernel Matrices": grids over log10(l) on the x axis and log10(λ) on the y axis for the Concrete (n = 1030, d = 8), Power plant (n = 9568, d = 4) and Protein (n = 45730, d = 9) datasets. Top row: log10 of the number of CG iterations. Rows below: one panel per preconditioner (Block Jacobi, PITC, FITC, Nyström, Spectral, Randomized SVD, Regularized, SKI) on a Loss-to-Gain colour scale from −2 to 2; two panels (Regularized, SKI) are marked "Too Expensive!".]

Figure 1. Comparison of preconditioners for different settings of kernel parameters. The lengthscale l and the noise variance λ are shown on the x and y axes respectively. The top figure indicates the number of iterations required to solve the corresponding linear system using CG, whilst the bottom part of the figure shows the rate of improvement (negative - blue) or degradation (positive - red) achieved by using PCG to solve the same linear system. Parameters and results are reported in log10. Symbols added to facilitate reading in B/W print.

[...] preconditioner requires one matrix-vector product, and we add this to the overall count of such computations. For this preconditioner, we add a diagonal offset δ to the original matrix, equivalent to two orders of magnitude greater than the noise of the process. In general, although the complexity of PCG is indeed no different from that of CG, we emphasize that experiencing a 2-fold or 5-fold (in some cases even an order of magnitude) improvement can be very substantial when plain CG takes very long to converge or when the dataset is large.

We focus on an isotropic RBF variant of the kernel in eq. 1, fixing the marginal variance σ² to one. We vary the lengthscale parameter l and the noise variance λ in log10 scale.

The top part of fig. 1 shows the number of iterations that the standard CG algorithm takes, where we have capped the number of iterations to 100,000.

The bottom part of the figure reports the improvement offered by various preconditioners measured as

$$\log_{10}\left(\frac{\#\ \text{PCG iterations}}{\#\ \text{CG iterations}}\right).$$

It is worth noting that when both CG and PCG fail to converge within the upper bound, the improvement will be marked as 0, i.e. neither a gain nor a loss within the given bound. The results plotted in fig. 1 indicate that the low-rank preconditioners (PITC, FITC and Nyström) achieve significant reductions in the number of iterations for each dataset, and all approaches work best when the lengthscale is longer, characterising smoother processes. In contrast, preconditioning seems to be less effective when the lengthscale is shorter, corresponding to a kernel matrix that is more sparse. However, for cases yielding positive results, the improvement is often in the range of an order of magnitude, which can be substantial when a large number of iterations is required by the CG algorithm.

The results also confirm that, as alluded to in the previous section, Block Jacobi preconditioning is generally a poor preconditioner, particularly when the corresponding kernel matrix is dense. The only minor improvements were observed when CG itself converges quickly, in which case preconditioning serves very little purpose either way.

The regularization approach with flexible conjugate gradient does not appear to be effective in any case either, particularly due to the substantial number of iterations required for solving an inner system at every iteration of the PCG algorithm. This implies that introducing additional small jitter to the diagonal does not necessarily make the system much easier to solve, whilst adding an overly large offset would negatively impact convergence of the outer algorithm. One could assume that tuning the value of this parameter could result in slightly better results; however, preliminary experiments in this regard yielded only minor [...]

Fig. 4: Comparison of preconditioners for different settings of kernel parameters across multiple datasets. Top: Number of iterations required to solve the corresponding linear system using CG. Bottom: Rate of improvement (blue) or degradation (red) achieved by using PCG to solve the same linear system.

Motivating Example - Gaussian Processes

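• Marginal likelihood:

$$\log[p(y \mid \mathrm{par})] = -\frac{1}{2} \log |K_y| - \frac{1}{2}\, y^\top K_y^{-1} y + \mathrm{const.}$$

• Derivatives wrt par:

$$\frac{\partial \log[p(y \mid \mathrm{par})]}{\partial \mathrm{par}_i} = -\frac{1}{2} \mathrm{Tr}\left(K_y^{-1} \frac{\partial K_y}{\partial \mathrm{par}_i}\right) + \frac{1}{2}\, y^\top K_y^{-1} \frac{\partial K_y}{\partial \mathrm{par}_i} K_y^{-1} y$$

• Stochastic estimate of the trace: assuming $\mathbb{E}[r r^\top] = I$, then

$$\mathrm{Tr}\left(K_y^{-1} \frac{\partial K_y}{\partial \mathrm{par}_i}\right) = \mathrm{Tr}\left(K_y^{-1} \frac{\partial K_y}{\partial \mathrm{par}_i}\, \mathbb{E}[r r^\top]\right) = \mathbb{E}\left[r^\top K_y^{-1} \frac{\partial K_y}{\partial \mathrm{par}_i}\, r\right] \approx \frac{1}{N_r} \sum_{i=1}^{N_r} r^{(i)\top} K_y^{-1} \frac{\partial K_y}{\partial \mathrm{par}_i}\, r^{(i)}$$

• Linear systems only!

• Laplace approximation for non-Gaussian likelihoods may be formulated in a similar way!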
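The stochastic trace estimate can be tried on a small problem where the exact trace is still affordable. The sketch below makes several assumptions that are not on the poster: Rademacher probe vectors (which satisfy $\mathbb{E}[r r^\top] = I$), an RBF kernel differentiated with respect to its lengthscale, and a dense `np.linalg.solve` standing in for the CG or PCG solver. The function name is hypothetical.

```python
import numpy as np

def stochastic_trace(solve, dK, n_probes=2000, seed=0):
    """Estimate Tr(K_y^{-1} dK) as the mean of r^T K_y^{-1} dK r over random probes."""
    rng = np.random.default_rng(seed)
    n = dK.shape[0]
    R = rng.choice([-1.0, 1.0], size=(n, n_probes))    # Rademacher: E[r r^T] = I
    vals = np.sum(R * solve(dK @ R), axis=0)           # one quadratic form per probe
    return float(vals.mean()), float(vals.std(ddof=1) / np.sqrt(n_probes))

x = np.linspace(0, 5, 50)
sq = (x[:, None] - x[None, :]) ** 2
lengthscale, lam = 1.0, 0.1
K = np.exp(-0.5 * sq / lengthscale**2)
K_y = K + lam * np.eye(50)
dK = K * sq / lengthscale**3                           # derivative of K wrt the lengthscale

# Dense solve used as a stand-in for CG/PCG (assumption for this toy size)
est, stderr = stochastic_trace(lambda M: np.linalg.solve(K_y, M), dK)
exact = float(np.trace(np.linalg.solve(K_y, dK)))
print(f"estimate: {est:.3f} +/- {stderr:.3f}   exact: {exact:.3f}")
```

Only products with $\partial K_y / \partial \mathrm{par}_i$ and solves with $K_y$ are needed, which is why the gradient can be estimated without a factorization.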

Experimental Setup and Results

• Exact gradient-based optimization using Cholesky decomposition (CHOL);

• Stochastic gradient-based optimization using ADAGRAD, with CG and PCG;

• GP approximations:
  Variational learning of inducing variables (VAR);
  Fully Independent Training Conditional (FITC);
  Partially Independent Training Conditional (PITC).

[Figure: performance against log10(seconds) for PCG, CG, CHOL, FITC, PITC and VAR. Classification panels (Error Rate and Negative Test Log-Lik): Spam dataset (n = 4061, d = 57) and EEG dataset (n = 14979, d = 14). Regression panels (RMSE and Negative Test Log-Lik): Power plant dataset (n = 9568, d = 4) and Protein dataset (n = 45730, d = 9).]

Fig. 5: Error and negative log likelihood on held-out test data over time. Curves are averaged over multiple repetitions.

Conclusions

• Our solution:
  ✓ Exact in the limit of iterations;
  ✓ Straightforward to construct and easy to tune;
  ✓ Scalable to large datasets, no need to store K;
  ✓ Competitive with exact Cholesky decomposition;
  ✓ Superior to approximate methods.

• Ongoing work:
  Extending this work to other kernel functions and models;
  Implementation on a distributed framework;
  Exploiting PCG in the solution of f(K) z = v.

DGPs: Low-rank approximation of covariance at each layer

(29)

Scalable Expectation Propagation for DGPs

• Pseudo-inputs $Z^{(i)}$

• Inducing variables $U^{(i)}$

• VI targets $q(U^{(i)})$

• Assuming $q(U^{(i)}) \propto p(U^{(i)})\, g(U^{(i)})^N$, learn $g$ as an average data factor

• Reduces memory and allows for factorization of the objective (output of each layer made Gaussian)

[Graphical model: inputs X, latent layers F(1) and F(2), labels Y, hyperparameters θ(1) to θ(3), pseudo-inputs Z(1) to Z(3) and the associated inducing variables U]

Bui et al., ICML, 2016

(31)

Inducing Points for DGPs extending Titsias, AISTATS, 2009

• Pseudo-inputs $Z^{(i)}$

• Inducing variables $U^{(i)}$

• VI targets $q(F^{(i)}, U^{(i)} \mid F^{(i-1)}) = p(F^{(i)} \mid U^{(i)}, F^{(i-1)})\, q(U^{(i)})$

• Lower bound factorizes across training points...

• ...and the i-th marginal of the final layer depends only on the i-th marginals of all layers

[Graphical model: inputs X, latent layers F(1) and F(2), labels Y, hyperparameters θ(1) to θ(3), pseudo-inputs Z(1) to Z(3) and the associated inducing variables U]

Hensman and Lawrence, arXiv, 2014 – Salimbeni and Deisenroth, NIPS, 2017

(33)

Random Feature Expansions for DGPs - Bochner’s theorem

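• Continuous shift-invariant covariance function

$$k(x_i - x_j \mid \theta) = \sigma^2 \int p(\omega \mid \theta)\, \exp\left(\iota\, (x_i - x_j)^\top \omega\right) d\omega$$

• Monte Carlo estimate

$$k(x_i - x_j \mid \theta) \approx \frac{\sigma^2}{N_{\mathrm{RF}}} \sum_{r=1}^{N_{\mathrm{RF}}} z(x_i \mid \tilde{\omega}_r)^\top z(x_j \mid \tilde{\omega}_r)$$

with

$$\tilde{\omega}_r \sim p(\omega \mid \theta), \qquad z(x \mid \omega) = \left[\cos(x^\top \omega), \sin(x^\top \omega)\right]^\top$$

Rahimi and Recht, NIPS, 2008 - Lázaro-Gredilla et al., JMLR, 2010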
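The Monte Carlo estimate can be compared with an exact kernel in a few lines. This sketch assumes the RBF kernel, whose spectral density $p(\omega \mid \theta)$ is Gaussian with standard deviation equal to one over the lengthscale. The function name and the toy data are hypothetical.

```python
import numpy as np

def random_fourier_features(X, n_rf, lengthscale=1.0, variance=1.0, seed=0):
    """Feature map Phi such that Phi @ Phi.T approximates an RBF kernel matrix."""
    rng = np.random.default_rng(seed)
    # For the RBF kernel, p(omega | theta) = N(0, I / lengthscale^2)
    omega = rng.standard_normal((X.shape[1], n_rf)) / lengthscale
    proj = X @ omega
    return np.sqrt(variance / n_rf) * np.hstack([np.cos(proj), np.sin(proj)])

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 3))
Phi = random_fourier_features(X, n_rf=20_000)
K_approx = Phi @ Phi.T

sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2.0 * X @ X.T
K_exact = np.exp(-0.5 * np.maximum(sq, 0.0))
print("max abs error:", np.max(np.abs(K_approx - K_exact)))
```

The error decays at the usual Monte Carlo rate in the number of random features, so a modest number already gives a usable low-rank surrogate of the kernel matrix.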

(35)

Random Feature Expansions for DGPs

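• Define

$$\Phi^{(l)} = \sqrt{\frac{\sigma^2}{N_{\mathrm{RF}}^{(l)}}} \left[\cos\left(F^{(l)} \Omega^{(l)}\right), \sin\left(F^{(l)} \Omega^{(l)}\right)\right]$$

where the columns of $\Omega^{(l)}$ are spectral frequencies sampled from $p(\omega \mid \theta)$ at layer $l$, and

$$F^{(l+1)} = \Phi^{(l)} W^{(l)}$$

• We are stacking Bayesian linear models with $p\left(W^{(l)}_{\cdot i}\right) = \mathcal{N}(0, I)$

• Expansion of arc-cosine kernel yields ReLU activations!

Cutajar, Bonilla, Michiardi, Filippone, ICML, 2017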
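Stacking these two operations gives a direct way to draw a function from the approximate DGP prior. The sketch below is an illustration under assumptions: RBF kernels at every layer with a shared lengthscale and variance, $F^{(0)} = X$, Gaussian spectral frequencies, and hypothetical names throughout.

```python
import numpy as np

def dgp_rf_sample(X, widths, n_rf=100, lengthscale=1.0, variance=1.0, seed=0):
    """Draw one sample from a random-feature DGP prior with RBF layers.

    widths[l] is the number of GP outputs of layer l (hypothetical parameter).
    """
    rng = np.random.default_rng(seed)
    F = X
    for d_out in widths:
        # Spectral frequencies of the RBF kernel for this layer
        Omega = rng.standard_normal((F.shape[1], n_rf)) / lengthscale
        proj = F @ Omega
        Phi = np.sqrt(variance / n_rf) * np.hstack([np.cos(proj), np.sin(proj)])
        # Bayesian linear model on top of the features: columns of W ~ N(0, I)
        W = rng.standard_normal((2 * n_rf, d_out))
        F = Phi @ W
    return F

X = np.linspace(-3, 3, 200)[:, None]
F_out = dgp_rf_sample(X, widths=[2, 1])
print(F_out.shape)
```

Every step is a matrix product followed by an element-wise nonlinearity, which is why the resulting model can be trained with the same tooling as a deep neural network.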

(37)

DGPs with random features become DNNs

[Diagram: network with nodes X, Φ(0), F(1), Φ(1), F(2) and Y, weights W(0) and W(1), and hyperparameters θ(0) and θ(1)]

We can learn the model using Stochastic Variational Inference for Bayesian DNNs!

Cutajar, Bonilla, Michiardi, Filippone, ICML, 2017

(38)

Results - Classification

EEG dataset (n = 14979, d = 14)

[Figure: error rate and MNLL against log10(sec) for dgp-rbf, dgp-arc, dgp-ep, dnn and var-gp]

Cutajar, Bonilla, Michiardi, Filippone, ICML, 2017

(39)

Results - Multiclass Classification

MNIST dataset (n = 60000, d = 784)

[Figure: error rate and MNLL against log10(sec) for dgp-rbf, dgp-arc, dgp-ep, dnn and var-gp]

Cutajar, Bonilla, Michiardi, Filippone, ICML, 2017

(40)

Results - MNIST-8M

• Variant of MNIST with 8.1M images

• 99+% accuracy!

• Also, check out Krauth et al., UAI 2017

Cutajar, Bonilla, Michiardi, Filippone,ICML, 2017 – Krauth, Cutajar, Bonilla, Filippone,UAI, 2017


(41)

Results - Model (Depth) Selection

Airline dataset (n = 5M+, d = 8)

[Figure: error rate and MNLL against log10(sec), and negative lower bound against number of layers, for models with 2, 10, 20 and 30 layers and for SV-DKL]

Cutajar, Bonilla, Michiardi, Filippone, ICML, 2017

(42)

Convolutional Deep Gaussian Processes

(43)

Convolutional Nets

• Convolutional nets are widely used. . .

• . . .but they are known to be overconfident!

[Diagram: Convolution + ReLU and Pooling stages (twice), followed by fully connected layers and the output]

Guo et al., ICML, 2017

(44)

Calibration as a Measure of Quantification of Uncertainty

• Reliability diagrams

[Figure: reliability diagram, fraction of positives against predicted value, both axes from 0 to 1]

(46)

Calibration as a Measure of Quantification of Uncertainty

• Reliability diagrams - Under-confident predictions

[Figure: reliability diagram; x-axis: predicted value, y-axis: fraction of positives]

• We can extract the Expected Calibration Error (ECE) score

• The Brier score is another measure of calibration
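As an illustration of the two scores named above, here is a minimal NumPy sketch. The function names, the choice of 10 equal-width confidence bins, and the multiclass convention for the Brier score (sum of squared differences over classes, averaged over samples) are assumptions made for this example; the slides do not state which conventions were used.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin-size-weighted average of |accuracy - confidence| per confidence bin."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    confidences = probs.max(axis=1)        # confidence of the predicted class
    predictions = probs.argmax(axis=1)
    correct = (predictions == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap     # weight by the fraction of samples in the bin
    return ece

def brier_score(probs, labels):
    """Mean over samples of the squared distance between probabilities and one-hot labels."""
    probs = np.asarray(probs, dtype=float)
    onehot = np.eye(probs.shape[1])[np.asarray(labels)]
    return np.mean(np.sum((probs - onehot) ** 2, axis=1))

if __name__ == "__main__":
    # Hypothetical predictions for four samples and two classes.
    p = np.array([[0.95, 0.05], [0.65, 0.35], [0.15, 0.85], [0.75, 0.25]])
    y = np.array([0, 1, 1, 0])
    print(expected_calibration_error(p, y), brier_score(p, y))
```

A reliability diagram plots the same per-bin accuracy against the per-bin confidence; ECE summarises its distance from the diagonal in a single number.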


(47)

Calibration as a Measure of Quantification of Uncertainty

• Reliability diagrams - Overconfident predictions

[Figure: reliability diagram; x-axis: predicted value, y-axis: fraction of positives]

Reliability diagrams of modern deep CNNs look like this!

Post-calibration fixes it
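The slide does not say which post-calibration method is meant. One common choice, described by Guo et al. (ICML, 2017), is temperature scaling: divide the logits by a single scalar T fitted on held-out data. The sketch below assumes that method; the function names and the simple grid search over T are illustrative choices, not the authors' code.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, temperature):
    """Mean negative log-likelihood of the labels under temperature-scaled logits."""
    probs = softmax(logits / temperature)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(logits, labels, grid=None):
    """Pick the temperature on a grid that minimises the held-out NLL (assumed search)."""
    if grid is None:
        grid = np.linspace(0.5, 5.0, 91)
    losses = [nll(logits, labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])
```

A fitted T greater than 1 softens the predicted probabilities, which is the direction needed to correct overconfidence. Dividing all logits by the same positive T leaves the arg-max unchanged, so the error rate is unaffected.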



(49)

Combining Convolutional Nets with GPs

• There have been attempts to combine CNNs with GPs

• Most popular ones replace fully connected layers with GPs

[Diagram: CNN with Convolution + ReLU and Pooling stages, with the fully connected layers replaced by a Gaussian Process, followed by the output]

Wilson et al., NIPS, 2016 – Bradshaw et al., arXiv, 2017 – Tran et al., arXiv, 2018

(50)

Combining Convolutional Nets with GPs

• There have been attempts to combine CNNs with GPs

• Most popular ones replace fully connected layers with GPs

[Diagram: CNN with Convolution + ReLU and Pooling stages, with the fully connected layers replaced by a Deep Gaussian Process, followed by the output]

• Better quantification of uncertainty? NO!

Tran et al., arXiv, 2018


(52)

Existing Combinations of CNNs and GPs

• Convolutional Neural Nets - CNN

• Hybrid GPs and DNNs - gpdnn

• Stochastic Variational Deep Kernel Learning - svdkl

• Convolutional GP - cgp

[Figure: reliability diagrams (accuracy versus predictive output) for CNN, GPDNN, CGP, and SVDKL; legend: cifar10-lenet, cifar10-resnet, cifar100-resnet, cifar10, cifar100]

Bradshaw et al., arXiv, 2017 – Wilson et al., NIPS, 2016 – van der Wilk et al., NIPS, 2017

(53)

Bayesian CNNs are calibrated

• Inferring the parameters of the convolutional filters recovers calibration

• Example with Monte Carlo Dropout
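To make the Monte Carlo Dropout idea concrete, here is a minimal NumPy sketch: dropout is kept active at prediction time and the class probabilities are averaged over several stochastic forward passes. The tiny dense network, the parameter names (`W1`, `b1`, `W2`, `b2`), the dropout rate, and the number of samples are all hypothetical. The slides apply the technique to convolutional networks, which this toy example does not reproduce.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def mc_dropout_predict(x, W1, b1, W2, b2, rate=0.5, n_samples=100, seed=0):
    """Average softmax outputs over n_samples forward passes with random dropout masks."""
    rng = np.random.default_rng(seed)
    keep = 1.0 - rate
    probs = np.zeros((x.shape[0], W2.shape[1]))
    for _ in range(n_samples):
        h = np.maximum(x @ W1 + b1, 0.0)      # hidden layer with ReLU
        mask = rng.random(h.shape) < keep     # fresh Bernoulli mask for every pass
        h = h * mask / keep                   # inverted-dropout rescaling
        probs += softmax(h @ W2 + b2)
    return probs / n_samples                  # Monte Carlo estimate of the predictive distribution
```

Averaging over masks typically yields less extreme probabilities than a single deterministic pass, which is one reason this approach tends to improve calibration.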

[Figure: RELIABILITY DIAGRAM FOR MCD; x-axis: predictive output; rows: cifar10-lenet, cifar10-resnet, cifar100-resnet; columns labelled 1/8, 1/4, 1/2, 1. ECE and Brier scores reported in the panels:]

cifar10-lenet: 1/8: ECE 0.0254, Brier 0.3951; 1/4: ECE 0.0145, Brier 0.3300; 1/2: ECE 0.0270, Brier 0.2827; 1: ECE 0.0495, Brier 0.2404

cifar10-resnet: 1/8: ECE 0.0485, Brier 0.3518; 1/4: ECE 0.0140, Brier 0.2765; 1/2: ECE 0.0295, Brier 0.2126; 1: ECE 0.0456, Brier 0.1799

cifar100-resnet: 1/8: ECE 0.2201, Brier 0.8334; 1/4: ECE 0.1357, Brier 0.6886; 1/2: ECE 0.0492, Brier 0.5463; 1: ECE 0.0270, Brier 0.4615

Tran et al., arXiv, 2018

(54)

Bayesian CNNs with DGPs with Random Features

• We extended our work on Random Feature Expansions for DGPs to replace fully connected layers
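As a reminder of what a random feature expansion does, the sketch below approximates an RBF kernel with random Fourier features, so that a GP layer becomes a linear model on the features Φ(X). The function names, the lengthscale and variance parameterisation, and the cos/sin feature construction are assumptions chosen for illustration; this is a single-layer sketch, not the full DGP model used with the convolutional features.

```python
import numpy as np

def rbf_random_features(X, n_features, lengthscale=1.0, variance=1.0, seed=0):
    """Random Fourier features whose inner products approximate an RBF kernel."""
    rng = np.random.default_rng(seed)
    # Spectral frequencies of the RBF kernel: Omega ~ N(0, I / lengthscale^2).
    Omega = rng.normal(scale=1.0 / lengthscale, size=(X.shape[1], n_features))
    proj = X @ Omega
    return np.sqrt(variance / n_features) * np.hstack([np.cos(proj), np.sin(proj)])

def rbf_kernel(X, lengthscale=1.0, variance=1.0):
    """Exact RBF kernel, k(x, x') = variance * exp(-||x - x'||^2 / (2 lengthscale^2))."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq / lengthscale**2)

if __name__ == "__main__":
    X = np.random.default_rng(0).normal(size=(6, 3))
    Phi = rbf_random_features(X, n_features=5000)
    # Largest entrywise gap between the approximate and the exact kernel matrix.
    print(np.max(np.abs(Phi @ Phi.T - rbf_kernel(X))))
```

With features of this kind, a GP layer can be written as F = Φ(X) W with a Gaussian prior on W, which has the same form as a fully connected layer with random-feature activations.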

[Figure: RELIABILITY DIAGRAM FOR RF; x-axis: predictive output; rows: cifar10-lenet, cifar10-resnet, cifar100-resnet; columns labelled 1/8, 1/4, 1/2, 1. ECE and Brier scores reported in the panels:]

cifar10-lenet: 1/8: ECE 0.1034, Brier 0.5104; 1/4: ECE 0.0679, Brier 0.4117; 1/2: ECE 0.0410, Brier 0.3293; 1: ECE 0.0205, Brier 0.2701

cifar10-resnet: 1/8: ECE 0.0203, Brier 0.3239; 1/4: ECE 0.0242, Brier 0.2491; 1/2: ECE 0.0562, Brier 0.1997; 1: ECE 0.0725, Brier 0.1766

cifar100-resnet: 1/8: ECE 0.0979, Brier 0.7468; 1/4: ECE 0.0194, Brier 0.6137; 1/2: ECE 0.0639, Brier 0.5092; 1: ECE 0.1156, Brier 0.4523

Tran et al., arXiv, 2018

(55)

Comparison with competitors

SHALLOW CONVOLUTIONAL STRUCTURE

[Figure: err, mnll, ece, and brier (columns) on mnist and cifar10 (rows) for cnn+gp(rf), cnn+gp(sorf), cnn+mcd, cnn+cal, gpdnn, cgp, and svdkl]

Tran et al., arXiv, 2018

(56)

Comparison with competitors

DEEP CONVOLUTIONAL STRUCTURE

[Figure: err, mnll, ece, and brier (columns) on cifar10 and cifar100 (rows) versus log10(train size) for cnn+gp(rf), cnn+gp(sorf), cnn+mcd, cnn+cal, gpdnn, cgp, and svdkl]

Tran et al., arXiv, 2018

(57)

Analysis of Depth of DGP

• Increasing the depth of the DGP slightly improves the error rate...

• ...and slightly worsens calibration

[Figure: err, mnll, ece, and brier versus depth]

Tran et al., arXiv, 2018

(58)

Other Interesting DGP-based models

• Autoencoders

Dai et al., ICLR, 2015 – Domingues et al., Mach. Learn., 2018

• DGPs with constrained dynamics

Lorenzi and Filippone, ICML, 2018

(59)

Conclusions

(60)

Conclusions

• DGPs offer probabilistic deep learning with sensible priors

• Inference for DGPs is hard

• Model approximations

• Approximate inference

• Difficult to assess the impact of these approximations



(62)

Conclusions

• We are borrowing ideas from GPs and deep learning

• Stochastic-based approximate inference

• Low-rank process decompositions

• Algebraic/computational tricks

