slides

(1)

Less is More:

Nystr¨ om Computational Regularization

Alessandro Rudi, Raffaello Camoriano, Lorenzo Rosasco University of Genova - Istituto Italiano di Tecnologia

Massachusetts Institute of Technology ale [email protected]

Dec 10th NIPS 2015

(2)

A Starting Point

Classically:

Statistics and optimizationdistinct steps in algorithm design

Large Scale:

Considerinterplaybetween statistics and optimization!

(Bottou, Bousquet ’08)

(3)

A Starting Point

Classically:

Statistics and optimizationdistinct steps in algorithm design

Large Scale:

Considerinterplaybetween statistics and optimization!

(Bottou, Bousquet ’08)

(4)

Supervised Learning Problem:

Estimatef^∗

f_⇤

The Setting

yi=f^∗(xi) +εi i∈ {1, . . . , n}

I εi ∈R, xi∈R^d random (with unknown distribution)

I f^∗ unknown

(5)

Supervised Learning

Problem:

Estimatef^∗ given Sn={(x1, y1), . . . ,(xn, yn)}

f_⇤

(x2, y2) (x₃, y₃)

(x4, y4)

(x5, y5) (x1, y1)

The Setting

yi=f^∗(xi) +εi i∈ {1, . . . , n}

I f^∗ unknown

(6)

Supervised Learning

Problem:

Estimatef^∗ given Sn={(x1, y1), . . . ,(xn, yn)}

f_⇤

(x2, y2) (x₃, y₃)

(x4, y4)

(x5, y5) (x1, y1)

The Setting

yi=f^∗(xi) +εi i∈ {1, . . . , n}

I f^∗ unknown

(7)

Outline

Learning with kernels

Data Dependent Subsampling

(8)

Non-linear/non-parametric learning

fb(x) = XM

i=1

ciq(x, wi)

I q non linear function

I wi∈R^d centers

I ci∈Rcoefficients

I M =Mn could/shouldgrowwithn

Question: How to choose wi,ci andM givenSn ?

(9)

Non-linear/non-parametric learning

fb(x) = XM

i=1

ciq(x, wi)

I wi∈R^d centers

(10)

Non-linear/non-parametric learning

fb(x) = XM

i=1

ciq(x, wi)

I wi∈R^d centers

(11)

Non-linear/non-parametric learning

fb(x) = XM

i=1

ciq(x, wi)

I wi∈R^d centers

(12)

Non-linear/non-parametric learning

fb(x) = XM

i=1

ciq(x, wi)

I wi∈R^d centers

(13)

Non-linear/non-parametric learning

fb(x) = XM

i=1

ciq(x, wi)

I wi∈R^d centers

(14)

Learning with Positive Definite Kernels

There is anelegantanswer if:

I q issymmetric

I all the matricesQbij =q(xi, xj)arepositive semi-definite¹

Representer Theorem(Kimeldorf, Wahba ’70; Sch¨olkopf et al. ’01) I M =n,

I wi=xi,

I ci byconvex optimization!

1They have non-negative eigenvalues

(15)

Learning with Positive Definite Kernels

There is anelegantanswer if:

I q issymmetric

I all the matricesQbij =q(xi, xj)arepositive semi-definite¹

Representer Theorem(Kimeldorf, Wahba ’70; Sch¨olkopf et al. ’01) I M =n,

I wi=xi,

I ci byconvexoptimization!

1They have non-negative eigenvalues

(16)

Kernel Ridge Regression (KRR)

a.k.a. Penalized Least Squares

fbλ= argmin

f∈H

1 n

Xn

i=1

(yi−f(xi))²+λkfk²

where

H={f | f(x) = XM

i=1

ciq(x, wi), ci∈R, wi∈R^d

| {z }

any center!

, M| {z }∈N

any length!

}

Solution

fbλ= Xn

i=1

ciq(x,xi) with c= (Qb+λnI)⁻¹yb

(17)

Kernel Ridge Regression (KRR)

a.k.a. Penalized Least Squares

fbλ= argmin

f∈H

1 n

Xn

i=1

(yi−f(xi))²+λkfk² where

H={f | f(x) = XM

i=1

ciq(x, wi), ci∈R, w| {z }i ∈R^d

any center!

, M| {z }∈N

any length!

}

Solution

fbλ= Xn

i=1

(18)

Kernel Ridge Regression (KRR)

a.k.a. Penalized Least Squares

fbλ= argmin

f∈H

1 n

Xn

i=1

(yi−f(xi))²+λkfk² where

H={f | f(x) = XM

i=1

ciq(x, wi), ci∈R, w| {z }i ∈R^d

any center!

, M| {z }∈N

any length!

}

Solution

fbλ= Xn

i=1

(19)

KRR: Statistics

Well understoodstatistical properties:

Classical Theorem

Iff^∗∈ H, then λ_∗= 1

√n E(fbλ∗(x)−f^∗(x))² . 1

√n

Remarks

1. Optimal nonparametric bound

2. Results forgeneralkernels (e.g. splines/Sobolev etc.)

λ∗=n⁻^2s+1¹ , E(fbλ∗(x)−f^∗(x))² . n⁻^2s+1^2s

3. Adaptive tuning via cross validation

(20)

KRR: Statistics

Classical Theorem

√n E(fbλ∗(x)−f^∗(x))² . 1

√n

Remarks

λ∗=n⁻^2s+1¹ , E(fbλ∗(x)−f^∗(x))² . n⁻^2s+1^2s

(21)

KRR: Statistics

Classical Theorem

√n E(fbλ∗(x)−f^∗(x))² . 1

√n

Remarks

λ∗=n⁻^2s+1¹ , E(fbλ∗(x)−f^∗(x))² . n⁻^2s+1^2s

(22)

KRR: Statistics

Classical Theorem

√n E(fbλ∗(x)−f^∗(x))² . 1

√n

Remarks

λ∗=n⁻^2s+1¹ , E(fbλ∗(x)−f^∗(x))² . n⁻^2s+1^2s

(23)

KRR: Statistics

Classical Theorem

√n E(fbλ∗(x)−f^∗(x))² . 1

√n

Remarks

λ∗=n⁻^2s+1¹ , E(fbλ_∗(x)−f^∗(x))² . n⁻^2s+1^2s

(24)

KRR: Statistics

Classical Theorem

√n E(fbλ∗(x)−f^∗(x))² . 1

√n

Remarks

λ∗=n⁻^2s+1¹ , E(fbλ_∗(x)−f^∗(x))² . n⁻^2s+1^2s

(25)

KRR: Optimization

fbλ= Xn

i=1

ciq(x, xi) with c= (Qb+λnI)⁻¹yb Linear System

Qb

c

⁼ by

Complexity

I SpaceO(n²)

I TimeO(n³)

BIG DATA?

Running out of space before running out of time... Can this be fixed?

(26)

KRR: Optimization

fbλ= Xn

i=1

ciq(x, xi) with c= (Qb+λnI)⁻¹yb Linear System

Qb

c

⁼ by

Complexity

I SpaceO(n²)

I TimeO(n³)

BIG DATA?

Running out of space before running out of time...

Can this be fixed?

(27)

Outline

Learning with kernels

Data Dependent Subsampling

(28)

Subsampling

1. pickwi at random...

from training set

(Smola, Scholk¨opf ’00)

˜

w1, . . . ,w˜M ⊂x1, . . . xn M n

2. perform KRR on

HM ={f | f(x) = XM

i=1

ciq(x,w˜i), ci∈R,wi∈R^d,M ∈N}.

Linear System

b y c b =

Q_M

Complexity

I Space

O(n²)→O(nM)

I Time

O(n³)→O(nM²)

What aboutstatistics? What’s theprice for efficient computations?

(29)

Subsampling

1. pickwi at random... from training set

˜

w1, . . . ,w˜M ⊂x1, . . . xn M n

2. perform KRR on

HM ={f | f(x) = XM

i=1

Linear System

b y c b =

Q_M

Complexity

I Space

O(n²)→O(nM)

I Time

O(n³)→O(nM²)

(30)

Subsampling

˜

w1, . . . ,w˜M ⊂x1, . . . xn M n

2. perform KRR on

HM ={f | f(x) = XM

i=1

Linear System

b y c b =

Q_M

Complexity

I Space

O(n²)→O(nM)

I Time

O(n³)→O(nM²)

(31)

Subsampling

˜

w1, . . . ,w˜M ⊂x1, . . . xn M n

2. perform KRR on

HM ={f | f(x) = XM

i=1

Linear System

b y c b =

Q_M

Complexity

I Space

O(n²)→O(nM)

I Time

O(n³)→O(nM²)

(32)

Subsampling

˜

w1, . . . ,w˜M ⊂x1, . . . xn M n

2. perform KRR on

HM ={f | f(x) = XM

i=1

Linear System

b y c b =

Q_M

Complexity

I Space

O(n²)→O(nM)

I Time

O(n³)→O(nM²)

What aboutstatistics? What’s the price for efficient computations?

(33)

Putting our Result in Context

I *Many*different subsampling schemes

(Smola, Scholkopf ’00; Williams, Seeger ’01; . . . 20+)

I Theoretical guaranteesmainly onmatrix approximation

(Mahoney and Drineas ’09; Cortes et al ’10, Kumar et al.’12 . . . 10+)

kQb−QbMk. 1

√M

I Few prediction guarantees either suboptimalor inrestricted setting (Cortes et al. ’10; Jin et al. ’11, Bach ’13, Alaoui, Mahoney ’14)

(34)

Putting our Result in Context

kQb−QbMk. 1

√M

I Few prediction guarantees either suboptimalor inrestricted setting (Cortes et al. ’10; Jin et al. ’11, Bach ’13, Alaoui, Mahoney ’14)

(35)

Putting our Result in Context

kQb−QbMk. 1

√M

I Few prediction guarantees either suboptimalor in restricted setting (Cortes et al. ’10; Jin et al. ’11, Bach ’13, Alaoui, Mahoney ’14)

(36)

Main Result

Theorem

√n ,M_∗= 1

λ_∗, E(fbλ∗,M∗(x)−f^∗(x))² . 1

√n

Remarks

1. Subsampling achivesoptimalbound. . . 2. . . . withM_∗∼√

n !! 3. More generally,

λ_∗=n⁻^2s+1¹ , M_∗= 1

λ_∗, Ex(fbλ∗,M∗(x)−f^∗(x))² . n⁻^2s+1^2s Note: An interesting insight is obtained rewriting the result. . .

(37)

Main Result

Theorem

√n ,M_∗= 1

λ_∗, E(fbλ∗,M∗(x)−f^∗(x))² . 1

√n

Remarks

n !! 3. More generally,

λ_∗=n⁻^2s+1¹ , M_∗= 1

(38)

Main Result

Theorem

√n ,M_∗= 1

λ_∗, E(fbλ∗,M∗(x)−f^∗(x))² . 1

√n

Remarks

1. Subsampling achivesoptimalbound. . .

2. . . . withM_∗∼√ n !! 3. More generally,

λ_∗=n⁻^2s+1¹ , M_∗= 1

(39)

Main Result

Theorem

√n ,M_∗= 1

λ_∗, E(fbλ∗,M∗(x)−f^∗(x))² . 1

√n

Remarks

n !!

3. More generally,

λ_∗=n⁻^2s+1¹ , M_∗= 1

(40)

Main Result

Theorem

√n ,M_∗= 1

λ_∗, E(fbλ∗,M∗(x)−f^∗(x))² . 1

√n

Remarks

n !!

3. More generally,

λ_∗=n⁻^2s+1¹ , M_∗= 1

λ_∗, Ex(fbλ∗,M∗(x)−f^∗(x))² . n⁻^2s+1^2s

Note: An interesting insight is obtained rewriting the result. . .

(41)

Main Result

Theorem

√n ,M_∗= 1

λ_∗, E(fbλ∗,M∗(x)−f^∗(x))² . 1

√n

Remarks

n !!

3. More generally,

λ_∗=n⁻^2s+1¹ , M_∗= 1

(42)

Computational Regularization (CoRe)

A simple idea: “swap” the role ofλandM. . .

Theorem

Iff^∗∈ H, then

M_∗=n^2s+1¹ , λ_∗= 1

M_∗, Ex(fbλ∗,M∗(x)−f^∗(x))² . n⁻^2s+1^2s

I λandM play the same role. . .

. . . new interpretation: subsampling regularizes!

I New natural incrementalalgorithm...

Algorithm

1. Pick a center +compute solution 2. Pick another center +rank one update 3. Pick another center . . .

(43)

Computational Regularization (CoRe)

Theorem

Iff^∗∈ H, then

M_∗=n^2s+1¹ , λ_∗= 1

M_∗, Ex(fbλ∗,M∗(x)−f^∗(x))² . n⁻^2s+1^2s

Algorithm

(44)

Computational Regularization (CoRe)

Theorem

Iff^∗∈ H, then

M_∗=n^2s+1¹ , λ_∗= 1

M_∗, Ex(fbλ∗,M∗(x)−f^∗(x))² . n⁻^2s+1^2s

Algorithm

(45)

Computational Regularization (CoRe)

Theorem

Iff^∗∈ H, then

M_∗=n^2s+1¹ , λ_∗= 1

M_∗, Ex(fbλ∗,M∗(x)−f^∗(x))² . n⁻^2s+1^2s

Algorithm

(46)

Computational Regularization (CoRe)

Theorem

Iff^∗∈ H, then

M_∗=n^2s+1¹ , λ_∗= 1

M_∗, Ex(fbλ∗,M∗(x)−f^∗(x))² . n⁻^2s+1^2s

Algorithm

1. Pick a center +compute solution

2. Pick another center +rank one update 3. Pick another center . . .

(47)

Computational Regularization (CoRe)

Theorem

Iff^∗∈ H, then

M_∗=n^2s+1¹ , λ_∗= 1

M_∗, Ex(fbλ∗,M∗(x)−f^∗(x))² . n⁻^2s+1^2s

Algorithm

1. Pick a center +compute solution 2. Pick another center +rank one update

3. Pick another center . . .

(48)

Computational Regularization (CoRe)

Theorem

Iff^∗∈ H, then

M_∗=n^2s+1¹ , λ_∗= 1

M_∗, Ex(fbλ∗,M∗(x)−f^∗(x))² . n⁻^2s+1^2s

Algorithm

(49)

CoRe Illustrated

n, λare fixed

0 50 100 150 200 250 300

Validation Error

0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11

Computation controls stability!

Time/space requirement tailored to generalization

(50)

Experiments

comparable/better w.r.t. the state of the art

Dataset ntr d Incremental Standard Standard Random Fastfood

CoRe KRLS Nystr¨om Features RF

Ins. Co. 5822 85 0.23180±4×10⁻⁵ 0.231 0.232 0.266 0.264 CPU 6554 21 2.8466±0.0497 7.271 6.758 7.103 7.366 CT slices 42800 384 7.1106±0.0772 NA 60.683 49.491 43.858 Year Pred. 463715 90 0.10470±5×10⁻⁵ NA 0.113 0.123 0.115

Forest 522910 54 0.9638±0.0186 NA 0.837 0.840 0.840

I Random Features (Rahimi, Recht ’07) I Fastfood (Le et al. ’13)

(51)

Contributions

I Optimallearning with data dependent subsampling

I Beyond uniform sampling – come to the poster!

Some questions:

I Beyond ridge regression– SGD and early stopping

I Data independent sampling– random features

I Beyond randomization–non convex optimization?

Some perspectives:

I Computational regularization: subsampling regularizes!

I Algorithm design: Control statistics with computations

(52)

Contributions

Some questions:

Some perspectives:

(53)

Contributions

Some questions:

Some perspectives:

(54)

Thank you!

Come to poster N.63 for the details!!

CODE: lcsl.github.io/NystromCoRe

Alessandro Rudi [email protected]

Laboratory for Computational and Statistical Learning - lcsl.mit.edu