
Méthodes pour l’apprentissage de données massives

Nathalie Villa-Vialaneix (http://www.nathalievilla.org)
in collaboration with Fabrice Rossi

Journées d’Études en Statistique, 2-7 October 2016, Fréjus

Apprentissage statistique et données massives

Outline

1 Introduction

2 Subsampling
Bag of Little Bootstrap (BLB)
Nyström approximation

3 Divide & Conquer

4 Online learning
Online k-nearest neighbors
Online kernel ridge regression
Online bagging


Big Data?

Reference to the fast and recent increase of worldwide data storage:

[Figure: growth of worldwide data storage over time. Image taken from Wikipedia Commons, CC BY-SA 3.0, author: Myworkforwiki. 1 exabyte = 10^18 bits.]

Standard sizes (in bits):

42×10^6: the complete works of Shakespeare
6.4×10^9: capacity of the human genome (2 bits per base pair)
4.5×10^16: capacity of HD space in the Google server farm in 2004
2×10^17: storage space of Megaupload when it was shut down (2012)
2.4×10^18: storage space of the Facebook data warehouse in 2014, with an increase of 0.6×10^15 per day
1.2×10^20: storage space of the Google data warehouse in 2013

Source: https://en.wikipedia.org/wiki/Orders_of_magnitude_(data)

The 3V:

Volume: amount of data
Velocity: speed at which new data is generated
Variety: different types of data (text, images, videos, networks...)


Why are Big Data seen as an opportunity?

economic opportunity: advertisements, recommendations...
social opportunity: better job profiling
find new solutions to existing problems: open data websites with challenges or publication of re-use, e.g. https://www.data.gouv.fr, https://ressources.data.sncf.com or https://data.toulouse-metropole.fr

When should we consider data as “big”?

We deal with Big Data when:

data are at Google scale (rare)
data are big compared to our computing capacities
... and depending on what we need to do with them

[R Core Team, 2015, Kane et al., 2013]:

“R is not well-suited for working with data structures larger than about 10–20% of a computer’s RAM. Data exceeding 50% of available RAM are essentially unusable because the overhead of all but the simplest of calculations quickly consumes all available RAM. Based on these guidelines, we consider a data set large if it exceeds 20% of the RAM on a given machine and massive if it exceeds 50%.”


Big Data and Statistics

Evolution of problems posed to statistics:

“small n, small p” problems
large dimensional problems: p > n or p ≫ n
big data problems: n is very large

[Jordan, 2013]: scale statistical methods originally designed to deal with small n and take advantage of parallel or distributed computing environments to deal with large/big n, while ensuring good properties (consistency, good approximations...).

This requires a closer cooperation between statisticians and computer scientists.


Purpose of this presentation

What we will discuss

Standard approaches used to scale statistical methods, with examples of applications to the learning methods discussed in previous presentations.

What we will not discuss

Practical implementations on various computing environments or programs in which these approaches can be used.


Organization of the talk

[Diagram outlining the organization of the talk.]


Overview of BLB

[Kleiner et al., 2012, Kleiner et al., 2014]

method used to scale any bootstrap estimation
consistency result demonstrated for a bootstrap estimation

Here: we describe the approach in the simplified case of bagging (illustration for random forests).

Framework: (X_i, Y_i)_{i=1,...,n} is a learning set. We want to define a predictor of Y ∈ ℝ from X given the learning set.


Standard bagging

From the learning set (X_i, Y_i)_{i=1,...,n}, subsample with replacement B times to obtain bootstrap samples (X_i, Y_i)_{i∈τ_1}, ..., (X_i, Y_i)_{i∈τ_B}; train one estimator f̂_b on each bootstrap sample and aggregate:

$$\hat{f}_{\text{bootstrap}} = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b$$

Advantage for Big Data: the bootstrap estimators can be learned in parallel.
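
As a concrete illustration of this parallelism, here is a minimal sketch (not code from the talk; the tree base learner, joblib for parallelism and all names such as `fit_one_bootstrap` are my own choices) in which the B bootstrap estimators are fitted in parallel and their predictions averaged.

```python
# Minimal bagging sketch: B bootstrap samples drawn with replacement,
# one regression tree per sample, fitted in parallel, predictions averaged.
import numpy as np
from joblib import Parallel, delayed
from sklearn.tree import DecisionTreeRegressor

def fit_one_bootstrap(X, y, seed):
    """Draw one bootstrap sample (with replacement) and fit a tree on it."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(y), size=len(y))   # n indices drawn with replacement
    return DecisionTreeRegressor().fit(X[idx], y[idx])

def bagging_predict(X_train, y_train, X_new, B=50):
    """Fit B bootstrap estimators in parallel and average their predictions."""
    models = Parallel(n_jobs=-1)(
        delayed(fit_one_bootstrap)(X_train, y_train, b) for b in range(B)
    )
    return np.mean([m.predict(X_new) for m in models], axis=0)
```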


Problem with standard bagging

When n is big, the number of distinct observations in τ_b is ∼ 0.63 n ⇒ still BIG! (The expected fraction of distinct observations in a bootstrap sample is 1 − (1 − 1/n)^n ≈ 1 − e^{−1} ≈ 0.632; a quick empirical check is sketched below.)

First solution: [Bickel et al., 1997] propose the “m-out-of-n” bootstrap: bootstrap samples have size m with m ≪ n.

But: the quality of the estimator strongly depends on m!

Idea behind BLB

Use bootstrap samples of size n but with a very small number of distinct observations in each of them.
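
A quick empirical check of the ≈ 0.63 n figure (a sketch of mine, using numpy; the value 10^6 for n is an arbitrary choice):

```python
# Fraction of distinct observations in one bootstrap sample of size n:
# expected value is 1 - (1 - 1/n)^n, which tends to 1 - exp(-1) ~ 0.632.
import numpy as np

n = 1_000_000
rng = np.random.default_rng(0)
idx = rng.integers(0, n, size=n)      # one bootstrap sample (with replacement)
print(len(np.unique(idx)) / n)        # empirical fraction, close to 0.632
print(1 - (1 - 1 / n) ** n)           # theoretical value
```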


Presentation of BLB

[Diagram: from the learning set (X_1, Y_1), ..., (X_n, Y_n), B_1 subsamples (X_1^{(b)}, Y_1^{(b)}), ..., (X_m^{(b)}, Y_m^{(b)}) of size m ≪ n are drawn without replacement. Each subsample b is then “over-sampled” B_2 times by drawing weights n_1^{(b,j)}, ..., n_m^{(b,j)}; an estimator f̂^{(b,j)} is trained for each draw. Averaging the f̂^{(b,j)} within subsample b gives f̂_b, and averaging the f̂_b gives the final estimator f̂_BLB.]

What is over-sampling and why is it working?

BLB steps:

1 create B_1 samples (without replacement) of size m ∼ n^γ, with γ ∈ [0.5, 1]: for n = 10^6 and γ = 0.6, a typical m is about 4 000, compared to about 630 000 distinct observations for a standard bootstrap sample

2 for every subsample τ_b, repeat B_2 times:
over-sampling: assign weights (n_1, ..., n_m), simulated as M(n, (1/m) 1_m) (multinomial), to the observations in τ_b
estimation step: train an estimator with weighted observations (if the learning algorithm allows a genuine processing of weights, the computational cost is low because m is small)

3 aggregate by averaging

Remark: the final sample size, Σ_{i=1}^{m} n_i, is equal to n (with replacement), as in standard bootstrap samples.
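
A minimal sketch of these BLB steps in the bagging case (my own illustration, not the authors’ implementation), assuming a regression setting and a base learner that handles observation weights natively (here a scikit-learn tree); B1, B2 and gamma follow the notations above.

```python
# Bag of Little Bootstraps (sketch): B1 subsamples of size m ~ n^gamma drawn
# WITHOUT replacement; for each, B2 multinomial weight vectors summing to n
# are drawn ("over-sampling"), a weighted estimator is fitted, and everything
# is aggregated by averaging.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def blb_predict(X, y, X_new, B1=10, B2=20, gamma=0.6, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    m = int(n ** gamma)                                  # subsample size
    per_subsample = []
    for _ in range(B1):
        sub = rng.choice(n, size=m, replace=False)       # subsample without replacement
        preds = []
        for _ in range(B2):
            w = rng.multinomial(n, np.full(m, 1.0 / m))  # weights (n_1, ..., n_m), sum = n
            est = DecisionTreeRegressor().fit(X[sub], y[sub], sample_weight=w)
            preds.append(est.predict(X_new))
        per_subsample.append(np.mean(preds, axis=0))     # average within the subsample
    return np.mean(per_subsample, axis=0)                # average across subsamples
```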


Large scale kernel methods

Given a training set (X_i, Y_i)_{i=1,...,n} (with X ∈ 𝒳 and Y ∈ ℝ or {0, ..., K−1}), kernel methods (such as SVM) are based on the definition of a kernel K : 𝒳 × 𝒳 → ℝ that is symmetric (K(x, x′) = K(x′, x)) and positive (Σ_{i,i′=1}^{N} α_i α_{i′} K(x_i, x_{i′}) ≥ 0).

This is equivalent to embedding 𝒳 into an (implicit) Hilbert space in which standard (often linear) statistical methods can be used, thanks to the kernel trick:

K(x, x′) = ⟨φ(x), φ(x′)⟩_H

Standard complexity (number of elementary operations) of kernel learning methods: O(n²) or even O(n³) (e.g., kernel PCA).
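
To see where this quadratic cost comes from, here is a sketch of mine (assuming a Gaussian kernel and numpy, neither of which is imposed by the slides) that builds the full n × n kernel matrix; storage and computation both grow as O(n²) before any learning takes place.

```python
# Full Gaussian kernel matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2).
# For n observations this is an n x n dense matrix, hence O(n^2) memory
# and at least O(n^2) operations.
import numpy as np

def gaussian_kernel_matrix(X, gamma=1.0):
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))   # clip tiny negative values

X = np.random.default_rng(0).normal(size=(2_000, 10))
K = gaussian_kernel_matrix(X)   # already a 2000 x 2000 dense matrix (~32 MB)
```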


Low rank approximation solutions

Aim: approximate K = (K(X_i, X_{i′}))_{i,i′=1,...,n} with a low rank matrix (a matrix of rank k ≪ n).

Then, use the approximation to train your predictor and correct it to “re-scale” it to n. Typical computational cost: O(nk²).


Sketch of Nyström approximation

[Williams and Seeger, 2000, Drineas and Mahoney, 2005]

Pick at random m observations in {1, ..., n} (without loss of generality, suppose that the first m ones have been chosen).

Re-write
$$K = \begin{bmatrix} K^{(m)} & K^{(m,n-m)} \\ K^{(n-m,m)} & K^{(n-m,n-m)} \end{bmatrix} \quad \text{and} \quad K^{(n,m)} = \begin{bmatrix} K^{(m)} \\ K^{(n-m,m)} \end{bmatrix},$$
with K^{(n−m,m)} = (K^{(m,n−m)})^⊤, and use K^{(m)} instead of K.


Approximate spectral decomposition of K

Notations:
K: eigenvectors (v_j)_{j=1,...,n}, eigenvalues (λ_j)_{j=1,...,n} (positive, in decreasing order)
K^{(m)}: eigenvectors (v_j^{(m)})_{j=1,...,m}, eigenvalues (λ_j^{(m)})_{j=1,...,m} (positive, in decreasing order)

$$\forall\, j = 1, \dots, m, \qquad \lambda_j \simeq \frac{n}{m}\, \lambda_j^{(m)} \qquad \text{and} \qquad v_j \simeq \sqrt{\frac{m}{n}}\, \frac{1}{\lambda_j^{(m)}}\, K^{(n,m)} v_j^{(m)}$$

complexity of the direct calculation: O(n³)
complexity of the approximate solution: O(m³) + O(nm²)

Remark: when the rank of K is smaller than m, the approximation is exact.

Notations: λ_j^{(n,m)} = (n/m) λ_j^{(m)} and v_j^{(n,m)} = √(m/n) (1/λ_j^{(m)}) K^{(n,m)} v_j^{(m)}.
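
These formulas translate directly into a few lines of linear algebra. The sketch below (the variable names and the use of numpy are my own choices) computes the approximate eigenvalues λ_j^{(n,m)} and eigenvectors v_j^{(n,m)} from a random m × m principal submatrix of K.

```python
# Nystrom approximate eigen-decomposition of an n x n kernel matrix K:
#   lambda_j^(n,m) = (n/m) * lambda_j^(m)
#   v_j^(n,m)      = sqrt(m/n) * (1 / lambda_j^(m)) * K^(n,m) v_j^(m)
import numpy as np

def nystrom_eigen(K, m, seed=0):
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=m, replace=False)   # the m sampled observations
    K_nm = K[:, idx]                             # K^(n,m), shape n x m
    K_m = K_nm[idx, :]                           # K^(m),   shape m x m
    lam_m, V_m = np.linalg.eigh(K_m)             # eigenpairs, increasing order
    lam_m, V_m = lam_m[::-1], V_m[:, ::-1]       # reorder: decreasing eigenvalues
    lam_approx = (n / m) * lam_m                         # lambda_j^(n,m)
    V_approx = np.sqrt(m / n) * (K_nm @ V_m) / lam_m     # v_j^(n,m) in column j
    return lam_approx, V_approx   # in practice only the leading pairs are kept
```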

Low rank approximation of K

A rank-k (for k ≤ m) approximation of K is obtained with
$$\widetilde{K} = K^{(n,m)} \left(K^{(m)}_k\right)^{+} K^{(m,n)},$$
in which K^{(m)}_k is the best rank-k approximation of K^{(m)} and
$$\left(K^{(m)}_k\right)^{+} = \sum_{j=1}^{k} \left(\lambda_j^{(m)}\right)^{-1} v_j^{(m)} \left(v_j^{(m)}\right)^{\top}$$
is its pseudo-inverse. Equivalently,
$$\widetilde{K} = \sum_{j=1}^{k} \lambda_j^{(n,m)}\, v_j^{(n,m)} \left(v_j^{(n,m)}\right)^{\top}.$$

Quality of the approximation [Cortes et al., 2010, Bach, 2013]

With probability at least 1 − δ,
$$\left\|\widetilde{K} - K\right\|_2 \leq \left\|K_k - K\right\|_2 + \frac{n}{\sqrt{m}} \max_{i=1,\dots,n} K_{ii} \left(2 + \log\frac{1}{\delta}\right),$$
where K_k is the best rank-k approximation of K and the K_{ii} are the diagonal terms of K.
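
As an illustration (mine, not from the talk), the sketch below builds the rank-k Nyström approximation via the pseudo-inverse formula and compares its spectral-norm error with the best rank-k approximation on a small toy kernel matrix; the Gaussian kernel and all the sizes are arbitrary choices made only so that the check runs quickly.

```python
# Rank-k Nystrom approximation K_tilde = K^(n,m) (K^(m)_k)^+ K^(m,n), and a
# check of ||K_tilde - K||_2 against the best rank-k error ||K_k - K||_2.
import numpy as np

def nystrom_low_rank(K, m, k, seed=0):
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=m, replace=False)
    K_nm = K[:, idx]
    lam, V = np.linalg.eigh(K_nm[idx, :])            # eigenpairs of K^(m)
    lam, V = lam[::-1][:k], V[:, ::-1][:, :k]        # keep the k leading ones
    K_m_k_pinv = (V / lam) @ V.T                     # (K^(m)_k)^+
    return K_nm @ K_m_k_pinv @ K_nm.T                # K_tilde, rank <= k

rng = np.random.default_rng(1)
X = rng.normal(size=(1_000, 5))
K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # toy Gaussian kernel
K_tilde = nystrom_low_rank(K, m=100, k=20)
lam, V = np.linalg.eigh(K)
K_best = (V[:, -20:] * lam[-20:]) @ V[:, -20:].T                   # best rank-20 approx.
print(np.linalg.norm(K_tilde - K, 2), np.linalg.norm(K_best - K, 2))
```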

Application to kernel methods

Example: kernel ridge regression (KRR) [Saunders et al., 1998]

KRR is a kernel regression method which simply performs a linear regression in the feature space, regularized by the L² norm:
$$\arg\min_{w \in \mathcal{H}} \sum_{i=1}^{n} \left(Y_i - \langle w, \phi(X_i)\rangle_{\mathcal{H}}\right)^2 + \frac{\lambda}{2}\, \|w\|^2_{\mathcal{H}}$$

Solution:
$$\hat{f}: x \in \mathcal{X} \mapsto \langle w, \phi(x)\rangle_{\mathcal{H}} = \sum_{i=1}^{n} \alpha_i K(X_i, x), \qquad \begin{pmatrix} \alpha_1 \\ \vdots \\ \alpha_n \end{pmatrix} = (K + \lambda I_n)^{-1} \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}$$

The inverse of (K + λI_n) is computationally very costly!

Nyström approximation: replace K by its Nyström rank-k approximation... α then has the explicit form
$$\tilde{\alpha} = \frac{1}{\lambda} \left(Y - V_k \left(\lambda I_k + \Lambda_k V_k^{\top} V_k\right)^{-1} \Lambda_k V_k^{\top} Y\right),$$
where V_k and Λ_k collect the k leading Nyström eigenvectors and eigenvalues, so that only a k×k system has to be solved.

Quality of the prediction is also controlled under this approximation (see [Cortes et al., 2010, Bach, 2013]).

Similar methods exist to use the Nyström approximation in SVM.
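
A sketch of mine showing how the explicit formula for α̃ can be evaluated without ever inverting an n × n matrix; the Nyström eigenpairs are recomputed inline as in the earlier sketch, and lam_ridge, m, k and all names are illustrative choices, not part of the slides.

```python
# Kernel ridge regression with a rank-k Nystrom approximation of K:
# alpha_tilde = (Y - V_k (lambda I_k + L_k V_k^T V_k)^{-1} L_k V_k^T Y) / lambda,
# which only requires solving a k x k linear system.
import numpy as np

def krr_nystrom_alpha(K, Y, lam_ridge, m, k, seed=0):
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=m, replace=False)
    K_nm = K[:, idx]
    ev, U = np.linalg.eigh(K_nm[idx, :])             # eigenpairs of K^(m)
    ev, U = ev[::-1][:k], U[:, ::-1][:, :k]          # top-k, decreasing order
    Lk = np.diag((n / m) * ev)                       # Nystrom eigenvalues Lambda_k
    Vk = np.sqrt(m / n) * (K_nm @ U) / ev            # Nystrom eigenvectors V_k (n x k)
    rhs = Lk @ Vk.T @ Y
    inner = np.linalg.solve(lam_ridge * np.eye(k) + Lk @ Vk.T @ Vk, rhs)
    return (Y - Vk @ inner) / lam_ridge              # alpha_tilde

# Predictions at new points: f_hat = K_cross @ alpha_tilde, where
# K_cross[i, j] = K(x_new_i, X_j) is the cross kernel matrix.
```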

