• Aucun résultat trouvé

slides

N/A
N/A
Protected

Academic year: 2022

Partager "slides"

Copied!
54
0
0

Texte intégral

(1)

Less is More:

Nystr¨ om Computational Regularization

Alessandro Rudi, Raffaello Camoriano, Lorenzo Rosasco University of Genova - Istituto Italiano di Tecnologia

Massachusetts Institute of Technology ale [email protected]

Dec 10th NIPS 2015

(2)

A Starting Point

Classically:

Statistics and optimizationdistinct steps in algorithm design

Large Scale:

Considerinterplaybetween statistics and optimization!

(Bottou, Bousquet ’08)

(3)

A Starting Point

Classically:

Statistics and optimizationdistinct steps in algorithm design

Large Scale:

Considerinterplaybetween statistics and optimization!

(Bottou, Bousquet ’08)

(4)

Supervised Learning Problem:

Estimatef

f

The Setting

yi=f(xi) +εi i∈ {1, . . . , n}

I εi ∈R, xi∈Rd random (with unknown distribution)

I f unknown

(5)

Supervised Learning

Problem:

Estimatef given Sn={(x1, y1), . . . ,(xn, yn)}

f

(x2, y2) (x3, y3)

(x4, y4)

(x5, y5) (x1, y1)

The Setting

yi=f(xi) +εi i∈ {1, . . . , n}

I εi ∈R, xi∈Rd random (with unknown distribution)

I f unknown

(6)

Supervised Learning

Problem:

Estimatef given Sn={(x1, y1), . . . ,(xn, yn)}

f

(x2, y2) (x3, y3)

(x4, y4)

(x5, y5) (x1, y1)

The Setting

yi=f(xi) +εi i∈ {1, . . . , n}

I εi ∈R, xi∈Rd random (with unknown distribution)

I f unknown

(7)

Outline

Learning with kernels

Data Dependent Subsampling

(8)

Non-linear/non-parametric learning

fb(x) = XM

i=1

ciq(x, wi)

I q non linear function

I wi∈Rd centers

I ci∈Rcoefficients

I M =Mn could/shouldgrowwithn

Question: How to choose wi,ci andM givenSn ?

(9)

Non-linear/non-parametric learning

fb(x) = XM

i=1

ciq(x, wi)

I q non linear function

I wi∈Rd centers

I ci∈Rcoefficients

I M =Mn could/shouldgrowwithn

Question: How to choose wi,ci andM givenSn ?

(10)

Non-linear/non-parametric learning

fb(x) = XM

i=1

ciq(x, wi)

I q non linear function

I wi∈Rd centers

I ci∈Rcoefficients

I M =Mn could/shouldgrowwithn

Question: How to choose wi,ci andM givenSn ?

(11)

Non-linear/non-parametric learning

fb(x) = XM

i=1

ciq(x, wi)

I q non linear function

I wi∈Rd centers

I ci∈Rcoefficients

I M =Mn could/shouldgrowwithn

Question: How to choose wi,ci andM givenSn ?

(12)

Non-linear/non-parametric learning

fb(x) = XM

i=1

ciq(x, wi)

I q non linear function

I wi∈Rd centers

I ci∈Rcoefficients

I M =Mn could/shouldgrowwithn

Question: How to choose wi,ci andM givenSn ?

(13)

Non-linear/non-parametric learning

fb(x) = XM

i=1

ciq(x, wi)

I q non linear function

I wi∈Rd centers

I ci∈Rcoefficients

I M =Mn could/shouldgrowwithn

Question: How to choose wi,ci andM givenSn ?

(14)

Learning with Positive Definite Kernels

There is anelegantanswer if:

I q issymmetric

I all the matricesQbij =q(xi, xj)arepositive semi-definite1

Representer Theorem(Kimeldorf, Wahba ’70; Sch¨olkopf et al. ’01) I M =n,

I wi=xi,

I ci byconvex optimization!

1They have non-negative eigenvalues

(15)

Learning with Positive Definite Kernels

There is anelegantanswer if:

I q issymmetric

I all the matricesQbij =q(xi, xj)arepositive semi-definite1

Representer Theorem(Kimeldorf, Wahba ’70; Sch¨olkopf et al. ’01) I M =n,

I wi=xi,

I ci byconvexoptimization!

1They have non-negative eigenvalues

(16)

Kernel Ridge Regression (KRR)

a.k.a. Penalized Least Squares

fbλ= argmin

f∈H

1 n

Xn

i=1

(yi−f(xi))2+λkfk2

where

H={f | f(x) = XM

i=1

ciq(x, wi), ci∈R, wi∈Rd

| {z }

any center!

, M| {z }∈N

any length!

}

Solution

fbλ= Xn

i=1

ciq(x,xi) with c= (Qb+λnI)1yb

(17)

Kernel Ridge Regression (KRR)

a.k.a. Penalized Least Squares

fbλ= argmin

f∈H

1 n

Xn

i=1

(yi−f(xi))2+λkfk2 where

H={f | f(x) = XM

i=1

ciq(x, wi), ci∈R, w| {z }i ∈Rd

any center!

, M| {z }∈N

any length!

}

Solution

fbλ= Xn

i=1

ciq(x,xi) with c= (Qb+λnI)1yb

(18)

Kernel Ridge Regression (KRR)

a.k.a. Penalized Least Squares

fbλ= argmin

f∈H

1 n

Xn

i=1

(yi−f(xi))2+λkfk2 where

H={f | f(x) = XM

i=1

ciq(x, wi), ci∈R, w| {z }i ∈Rd

any center!

, M| {z }∈N

any length!

}

Solution

fbλ= Xn

i=1

ciq(x,xi) with c= (Qb+λnI)1yb

(19)

KRR: Statistics

Well understoodstatistical properties:

Classical Theorem

Iff∈ H, then λ= 1

√n E(fbλ(x)−f(x))2 . 1

√n

Remarks

1. Optimal nonparametric bound

2. Results forgeneralkernels (e.g. splines/Sobolev etc.)

λ=n2s+11 , E(fbλ(x)−f(x))2 . n2s+12s

3. Adaptive tuning via cross validation

(20)

KRR: Statistics

Well understoodstatistical properties:

Classical Theorem

Iff∈ H, then λ= 1

√n E(fbλ(x)−f(x))2 . 1

√n

Remarks

1. Optimal nonparametric bound

2. Results forgeneralkernels (e.g. splines/Sobolev etc.)

λ=n2s+11 , E(fbλ(x)−f(x))2 . n2s+12s

3. Adaptive tuning via cross validation

(21)

KRR: Statistics

Well understoodstatistical properties:

Classical Theorem

Iff∈ H, then λ= 1

√n E(fbλ(x)−f(x))2 . 1

√n

Remarks

1. Optimal nonparametric bound

2. Results forgeneralkernels (e.g. splines/Sobolev etc.)

λ=n2s+11 , E(fbλ(x)−f(x))2 . n2s+12s

3. Adaptive tuning via cross validation

(22)

KRR: Statistics

Well understoodstatistical properties:

Classical Theorem

Iff∈ H, then λ= 1

√n E(fbλ(x)−f(x))2 . 1

√n

Remarks

1. Optimal nonparametric bound

2. Results forgeneralkernels (e.g. splines/Sobolev etc.)

λ=n2s+11 , E(fbλ(x)−f(x))2 . n2s+12s

3. Adaptive tuning via cross validation

(23)

KRR: Statistics

Well understoodstatistical properties:

Classical Theorem

Iff∈ H, then λ= 1

√n E(fbλ(x)−f(x))2 . 1

√n

Remarks

1. Optimal nonparametric bound

2. Results forgeneralkernels (e.g. splines/Sobolev etc.)

λ=n2s+11 , E(fbλ(x)−f(x))2 . n2s+12s

3. Adaptive tuning via cross validation

(24)

KRR: Statistics

Well understoodstatistical properties:

Classical Theorem

Iff∈ H, then λ= 1

√n E(fbλ(x)−f(x))2 . 1

√n

Remarks

1. Optimal nonparametric bound

2. Results forgeneralkernels (e.g. splines/Sobolev etc.)

λ=n2s+11 , E(fbλ(x)−f(x))2 . n2s+12s

3. Adaptive tuning via cross validation

(25)

KRR: Optimization

fbλ= Xn

i=1

ciq(x, xi) with c= (Qb+λnI)−1yb Linear System

Qb

c

= by

Complexity

I SpaceO(n2)

I TimeO(n3)

BIG DATA?

Running out of space before running out of time... Can this be fixed?

(26)

KRR: Optimization

fbλ= Xn

i=1

ciq(x, xi) with c= (Qb+λnI)−1yb Linear System

Qb

c

= by

Complexity

I SpaceO(n2)

I TimeO(n3)

BIG DATA?

Running out of space before running out of time...

Can this be fixed?

(27)

Outline

Learning with kernels

Data Dependent Subsampling

(28)

Subsampling

1. pickwi at random...

from training set

(Smola, Scholk¨opf ’00)

˜

w1, . . . ,w˜M ⊂x1, . . . xn M n

2. perform KRR on

HM ={f | f(x) = XM

i=1

ciq(x,w˜i), ci∈R,wi∈Rd,M ∈N}.

Linear System

b y c b =

QM

Complexity

I Space

O(n2)→O(nM)

I Time

O(n3)→O(nM2)

What aboutstatistics? What’s theprice for efficient computations?

(29)

Subsampling

1. pickwi at random... from training set

(Smola, Scholk¨opf ’00)

˜

w1, . . . ,w˜M ⊂x1, . . . xn M n

2. perform KRR on

HM ={f | f(x) = XM

i=1

ciq(x,w˜i), ci∈R,wi∈Rd,M ∈N}.

Linear System

b y c b =

QM

Complexity

I Space

O(n2)→O(nM)

I Time

O(n3)→O(nM2)

What aboutstatistics? What’s theprice for efficient computations?

(30)

Subsampling

1. pickwi at random... from training set

(Smola, Scholk¨opf ’00)

˜

w1, . . . ,w˜M ⊂x1, . . . xn M n

2. perform KRR on

HM ={f | f(x) = XM

i=1

ciq(x,w˜i), ci∈R,wi∈Rd,M ∈N}.

Linear System

b y c b =

QM

Complexity

I Space

O(n2)→O(nM)

I Time

O(n3)→O(nM2)

What aboutstatistics? What’s theprice for efficient computations?

(31)

Subsampling

1. pickwi at random... from training set

(Smola, Scholk¨opf ’00)

˜

w1, . . . ,w˜M ⊂x1, . . . xn M n

2. perform KRR on

HM ={f | f(x) = XM

i=1

ciq(x,w˜i), ci∈R,wi∈Rd,M ∈N}.

Linear System

b y c b =

QM

Complexity

I Space

O(n2)→O(nM)

I Time

O(n3)→O(nM2)

What aboutstatistics? What’s theprice for efficient computations?

(32)

Subsampling

1. pickwi at random... from training set

(Smola, Scholk¨opf ’00)

˜

w1, . . . ,w˜M ⊂x1, . . . xn M n

2. perform KRR on

HM ={f | f(x) = XM

i=1

ciq(x,w˜i), ci∈R,wi∈Rd,M ∈N}.

Linear System

b y c b =

QM

Complexity

I Space

O(n2)→O(nM)

I Time

O(n3)→O(nM2)

What aboutstatistics? What’s the price for efficient computations?

(33)

Putting our Result in Context

I *Many*different subsampling schemes

(Smola, Scholkopf ’00; Williams, Seeger ’01; . . . 20+)

I Theoretical guaranteesmainly onmatrix approximation

(Mahoney and Drineas ’09; Cortes et al ’10, Kumar et al.’12 . . . 10+)

kQb−QbMk. 1

√M

I Few prediction guarantees either suboptimalor inrestricted setting (Cortes et al. ’10; Jin et al. ’11, Bach ’13, Alaoui, Mahoney ’14)

(34)

Putting our Result in Context

I *Many*different subsampling schemes

(Smola, Scholkopf ’00; Williams, Seeger ’01; . . . 20+)

I Theoretical guaranteesmainly onmatrix approximation

(Mahoney and Drineas ’09; Cortes et al ’10, Kumar et al.’12 . . . 10+)

kQb−QbMk. 1

√M

I Few prediction guarantees either suboptimalor inrestricted setting (Cortes et al. ’10; Jin et al. ’11, Bach ’13, Alaoui, Mahoney ’14)

(35)

Putting our Result in Context

I *Many*different subsampling schemes

(Smola, Scholkopf ’00; Williams, Seeger ’01; . . . 20+)

I Theoretical guaranteesmainly onmatrix approximation

(Mahoney and Drineas ’09; Cortes et al ’10, Kumar et al.’12 . . . 10+)

kQb−QbMk. 1

√M

I Few prediction guarantees either suboptimalor in restricted setting (Cortes et al. ’10; Jin et al. ’11, Bach ’13, Alaoui, Mahoney ’14)

(36)

Main Result

Theorem

Iff∈ H, then λ= 1

√n ,M= 1

λ, E(fbλ,M(x)−f(x))2 . 1

√n

Remarks

1. Subsampling achivesoptimalbound. . . 2. . . . withM∼√

n !! 3. More generally,

λ=n2s+11 , M= 1

λ, Ex(fbλ,M(x)−f(x))2 . n2s+12s Note: An interesting insight is obtained rewriting the result. . .

(37)

Main Result

Theorem

Iff∈ H, then λ= 1

√n ,M= 1

λ, E(fbλ,M(x)−f(x))2 . 1

√n

Remarks

1. Subsampling achivesoptimalbound. . . 2. . . . withM∼√

n !! 3. More generally,

λ=n2s+11 , M= 1

λ, Ex(fbλ,M(x)−f(x))2 . n2s+12s Note: An interesting insight is obtained rewriting the result. . .

(38)

Main Result

Theorem

Iff∈ H, then λ= 1

√n ,M= 1

λ, E(fbλ,M(x)−f(x))2 . 1

√n

Remarks

1. Subsampling achivesoptimalbound. . .

2. . . . withM∼√ n !! 3. More generally,

λ=n2s+11 , M= 1

λ, Ex(fbλ,M(x)−f(x))2 . n2s+12s Note: An interesting insight is obtained rewriting the result. . .

(39)

Main Result

Theorem

Iff∈ H, then λ= 1

√n ,M= 1

λ, E(fbλ,M(x)−f(x))2 . 1

√n

Remarks

1. Subsampling achivesoptimalbound. . . 2. . . . withM∼√

n !!

3. More generally,

λ=n2s+11 , M= 1

λ, Ex(fbλ,M(x)−f(x))2 . n2s+12s Note: An interesting insight is obtained rewriting the result. . .

(40)

Main Result

Theorem

Iff∈ H, then λ= 1

√n ,M= 1

λ, E(fbλ,M(x)−f(x))2 . 1

√n

Remarks

1. Subsampling achivesoptimalbound. . . 2. . . . withM∼√

n !!

3. More generally,

λ=n2s+11 , M= 1

λ, Ex(fbλ,M(x)−f(x))2 . n2s+12s

Note: An interesting insight is obtained rewriting the result. . .

(41)

Main Result

Theorem

Iff∈ H, then λ= 1

√n ,M= 1

λ, E(fbλ,M(x)−f(x))2 . 1

√n

Remarks

1. Subsampling achivesoptimalbound. . . 2. . . . withM∼√

n !!

3. More generally,

λ=n2s+11 , M= 1

λ, Ex(fbλ,M(x)−f(x))2 . n2s+12s Note: An interesting insight is obtained rewriting the result. . .

(42)

Computational Regularization (CoRe)

A simple idea: “swap” the role ofλandM. . .

Theorem

Iff∈ H, then

M=n2s+11 , λ= 1

M, Ex(fbλ,M(x)−f(x))2 . n2s+12s

I λandM play the same role. . .

. . . new interpretation: subsampling regularizes!

I New natural incrementalalgorithm...

Algorithm

1. Pick a center +compute solution 2. Pick another center +rank one update 3. Pick another center . . .

(43)

Computational Regularization (CoRe)

A simple idea: “swap” the role ofλandM. . .

Theorem

Iff∈ H, then

M=n2s+11 , λ= 1

M, Ex(fbλ,M(x)−f(x))2 . n2s+12s

I λandM play the same role. . .

. . . new interpretation: subsampling regularizes!

I New natural incrementalalgorithm...

Algorithm

1. Pick a center +compute solution 2. Pick another center +rank one update 3. Pick another center . . .

(44)

Computational Regularization (CoRe)

A simple idea: “swap” the role ofλandM. . .

Theorem

Iff∈ H, then

M=n2s+11 , λ= 1

M, Ex(fbλ,M(x)−f(x))2 . n2s+12s

I λandM play the same role. . .

. . . new interpretation: subsampling regularizes!

I New natural incrementalalgorithm...

Algorithm

1. Pick a center +compute solution 2. Pick another center +rank one update 3. Pick another center . . .

(45)

Computational Regularization (CoRe)

A simple idea: “swap” the role ofλandM. . .

Theorem

Iff∈ H, then

M=n2s+11 , λ= 1

M, Ex(fbλ,M(x)−f(x))2 . n2s+12s

I λandM play the same role. . .

. . . new interpretation: subsampling regularizes!

I New natural incrementalalgorithm...

Algorithm

1. Pick a center +compute solution 2. Pick another center +rank one update 3. Pick another center . . .

(46)

Computational Regularization (CoRe)

A simple idea: “swap” the role ofλandM. . .

Theorem

Iff∈ H, then

M=n2s+11 , λ= 1

M, Ex(fbλ,M(x)−f(x))2 . n2s+12s

I λandM play the same role. . .

. . . new interpretation: subsampling regularizes!

I New natural incrementalalgorithm...

Algorithm

1. Pick a center +compute solution

2. Pick another center +rank one update 3. Pick another center . . .

(47)

Computational Regularization (CoRe)

A simple idea: “swap” the role ofλandM. . .

Theorem

Iff∈ H, then

M=n2s+11 , λ= 1

M, Ex(fbλ,M(x)−f(x))2 . n2s+12s

I λandM play the same role. . .

. . . new interpretation: subsampling regularizes!

I New natural incrementalalgorithm...

Algorithm

1. Pick a center +compute solution 2. Pick another center +rank one update

3. Pick another center . . .

(48)

Computational Regularization (CoRe)

A simple idea: “swap” the role ofλandM. . .

Theorem

Iff∈ H, then

M=n2s+11 , λ= 1

M, Ex(fbλ,M(x)−f(x))2 . n2s+12s

I λandM play the same role. . .

. . . new interpretation: subsampling regularizes!

I New natural incrementalalgorithm...

Algorithm

1. Pick a center +compute solution 2. Pick another center +rank one update 3. Pick another center . . .

(49)

CoRe Illustrated

n, λare fixed

0 50 100 150 200 250 300

Validation Error

0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11

Computation controls stability!

Time/space requirement tailored to generalization

(50)

Experiments

comparable/better w.r.t. the state of the art

Dataset ntr d Incremental Standard Standard Random Fastfood

CoRe KRLS Nystr¨om Features RF

Ins. Co. 5822 85 0.23180±4×10−5 0.231 0.232 0.266 0.264 CPU 6554 21 2.8466±0.0497 7.271 6.758 7.103 7.366 CT slices 42800 384 7.1106±0.0772 NA 60.683 49.491 43.858 Year Pred. 463715 90 0.10470±5×10−5 NA 0.113 0.123 0.115

Forest 522910 54 0.9638±0.0186 NA 0.837 0.840 0.840

I Random Features (Rahimi, Recht ’07) I Fastfood (Le et al. ’13)

(51)

Contributions

I Optimallearning with data dependent subsampling

I Beyond uniform sampling – come to the poster!

Some questions:

I Beyond ridge regression– SGD and early stopping

I Data independent sampling– random features

I Beyond randomization–non convex optimization?

Some perspectives:

I Computational regularization: subsampling regularizes!

I Algorithm design: Control statistics with computations

(52)

Contributions

I Optimallearning with data dependent subsampling

I Beyond uniform sampling – come to the poster!

Some questions:

I Beyond ridge regression– SGD and early stopping

I Data independent sampling– random features

I Beyond randomization–non convex optimization?

Some perspectives:

I Computational regularization: subsampling regularizes!

I Algorithm design: Control statistics with computations

(53)

Contributions

I Optimallearning with data dependent subsampling

I Beyond uniform sampling – come to the poster!

Some questions:

I Beyond ridge regression– SGD and early stopping

I Data independent sampling– random features

I Beyond randomization–non convex optimization?

Some perspectives:

I Computational regularization: subsampling regularizes!

I Algorithm design: Control statistics with computations

(54)

Thank you!

Come to poster N.63 for the details!!

CODE: lcsl.github.io/NystromCoRe

Alessandro Rudi [email protected]

Laboratory for Computational and Statistical Learning - lcsl.mit.edu

Références

Documents relatifs

On a obtenu face : quelle est la probabilité de chaque type

Combien de mots sont expliqués dans cette page.. Cite deux verbes qui commencent

2s) Ecris dans la case V pour Vrai et F pour Faux. ●Le mammouth veux écraser Cromignon. ●Le mammouth mange un radis. ●Il fait nuit quand Cromignon sort de sa cachette. ●La maman

Scapin, même jeu : N’y a-t-il personne qui puisse me dire où est le seigneur Géronte?. Géronte : Qu’y

Scène d’exposition 1 L’idée générale de la pièce formulée en bref.. Aparté 1 Personnages

[r]

2- Le récit du narrateur a mis en valeur la personne de la femme : la grand-mère , la mère , la sœur , la tante ; l’épouse, la fille et la petite amie cousine.. Quelle

Des phrases ou expressions écrites, hors dialogue, donnant des indications sur le décor, sur le comportement, ou sur la manière de parler des personnages : 1.5p9. 