Méthodes pour l’apprentissage de données massives
Nathalie Villa-Vialaneix http://www.nathalievilla.org
in collaboration with Fabrice Rossi
Journées d’Études en Statistique
2-7 octobre 2016 - Fréjus
Statistical learning and massive data
Nathalie Villa-Vialaneix |Méthodes pour l’apprentissage de données massives 1/46
Outline
1 Introduction
2 Subsampling
  Bag of Little Bootstrap (BLB)
  Nyström approximation
3 Divide & Conquer
4 Online learning
  Online k-nearest neighbors
  Online kernel ridge regression
  Online bagging
Big Data?
Reference to the fast and recent increase of worldwide data storage.
[Figure: growth of worldwide data storage. Image from Wikimedia Commons, CC BY-SA 3.0, author: Myworkforwiki. One exabyte = 10^18 bytes.]
The 3V:
Volume: amount of data
Velocity: speed at which new data is generated
Variety: different types of data (text, images, videos, networks...)
Big Data?
Standard sizes (in bits):
42×10^6: the complete works of Shakespeare
6.4×10^9: capacity of the human genome (2 bits per base pair)
4.5×10^16: capacity of HD space in the Google server farm in 2004
2×10^17: storage space of Megaupload when it was shut down (2012)
2.4×10^18: storage space of the Facebook data warehouse in 2014, with an increase of 0.6×10^15 per day
1.2×10^20: storage space of the Google data warehouse in 2013
Source: https://en.wikipedia.org/wiki/Orders_of_magnitude_(data)
Why are Big Data seen as an opportunity?
economic opportunity: advertisements, recommendations, ...
social opportunity: better job profiling
find new solutions to existing problems: open data websites with challenges or publication of re-use, e.g., https://www.data.gouv.fr, https://ressources.data.sncf.com or https://data.toulouse-metropole.fr
When should we consider data as “big”?
We deal with Big Data when:
data are at Google scale (rare)
data are big compared to our computing capacities
... and depending on what we need to do with them
[R Core Team, 2015, Kane et al., 2013]:
“R is not well-suited for working with data structures larger than about 10–20% of a computer’s RAM. Data exceeding 50% of available RAM are essentially unusable because the overhead of all but the simplest of calculations quickly consumes all available RAM. Based on these guidelines, we consider a data set large if it exceeds 20% of the RAM on a given machine and massive if it exceeds 50%.”
Big Data and Statistics
Evolution of the problems posed to statistics:
“small n, small p” problems
large dimensional problems: $p > n$ or $p \gg n$
big data problems: n is very large
[Jordan, 2013]: scale statistical methods originally designed to deal with small n, and take advantage of parallel or distributed computing environments to deal with large/big n, while ensuring good properties (consistency, good approximations...).
This requires a closer cooperation between statisticians and computer scientists.
Purpose of this presentation
What we will discuss
Standard approaches used to scale statistical methods, with examples of applications to learning methods discussed in previous presentations.
What we will not discuss
Practical implementations on various computing environments, or programs in which these approaches can be used.
Organization of the talk
Overview of BLB
[Kleiner et al., 2012, Kleiner et al., 2014]
method used to scale any bootstrap estimation
consistency result demonstrated for bootstrap estimation
Here: we describe the approach in the simplified case of bagging (illustration with random forests)
Framework: a learning set $(X_i, Y_i)_{i=1,\dots,n}$. We want to define a predictor of $Y \in \mathbb{R}$ from $X$ given the learning set.
Standard bagging
Subsample with replacement B times from $(X_i, Y_i)_{i=1,\dots,n}$, giving samples $(X_i, Y_i)_{i\in\tau_1}, \dots, (X_i, Y_i)_{i\in\tau_B}$.
Train an estimator $\hat f_b$ on each sample, then aggregate:
$\hat f_{\mathrm{bootstrap}} = \frac{1}{B} \sum_{b=1}^{B} \hat f_b$
Advantage for Big Data: bootstrap estimators can be learned in parallel.
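The parallel structure of bagging can be sketched as follows — a minimal illustration with a hypothetical toy estimator (least-squares fit of a line); any learner could be plugged in, and the thread pool stands in for a real parallel or distributed environment:

```python
# Sketch of standard bagging: B independent bootstrap fits run in parallel,
# then the predictors are averaged. Toy estimator: simple linear regression.
import random
from concurrent.futures import ThreadPoolExecutor

def fit_linear(sample):
    # Closed-form least squares for y = a*x + b on one bootstrap sample.
    xs, ys = zip(*sample)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in sample) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def bagging(data, B, seed=0):
    rng = random.Random(seed)
    # Draw B bootstrap samples (size n, with replacement); fits are independent.
    samples = [rng.choices(data, k=len(data)) for _ in range(B)]
    with ThreadPoolExecutor() as pool:
        fits = list(pool.map(fit_linear, samples))
    # Aggregate: the bagged predictor averages the B individual predictors.
    def f_bagged(x):
        return sum(a * x + b for a, b in fits) / len(fits)
    return f_bagged

data = [(x / 10, 2 * (x / 10) + 1) for x in range(50)]  # noiseless y = 2x + 1
f = bagging(data, B=25)
print(round(f(1.0), 3))  # prints 3.0
```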
Problem with standard bagging
When n is big, the number of different observations in $\tau_b$ is $\simeq 0.63\,n$ ⇒ still BIG!
First solution: [Bickel et al., 1997] propose the “m-out-of-n” bootstrap: bootstrap samples have size m with $m \ll n$.
But: the quality of the estimator strongly depends on m!
Idea behind BLB
Use bootstrap samples of size n, but with a very small number of different observations in each of them.
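The $0.63\,n$ figure is the expected number of distinct observations in a size-n bootstrap sample, $n\left(1-(1-1/n)^n\right) \to n\,(1-e^{-1}) \approx 0.632\,n$. A quick check:

```python
# Expected fraction of distinct observations in a size-n bootstrap sample:
# 1 - (1 - 1/n)^n, which tends to 1 - 1/e as n grows.
import math
import random

n = 100_000
theory = 1 - (1 - 1 / n) ** n
print(round(theory, 4))                  # prints 0.6321, i.e. ~ 1 - exp(-1)

rng = random.Random(0)
boot = rng.choices(range(n), k=n)        # one bootstrap sample
print(round(len(set(boot)) / n, 3))      # empirical fraction, ~0.632
```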
Presentation of BLB
[Diagram: from the learning set $(X_1,Y_1),\dots,(X_n,Y_n)$, draw $B_1$ subsamples of size $m \ll n$ by sampling without replacement. Each subsample is over-sampled $B_2$ times by drawing multinomial weights $(n_1^{(b_1,b_2)},\dots,n_m^{(b_1,b_2)})$; each weighted sample yields an estimator $\hat f^{(b_1,b_2)}$. Averaging over $b_2$ (mean) gives $\hat f_{b_1}$, and averaging the $\hat f_{b_1}$ over $b_1$ (mean) gives $\hat f_{\mathrm{BLB}}$.]
What is over-sampling and why does it work?
BLB steps:
1 create $B_1$ samples (without replacement) of size $m \sim n^{\gamma}$ with $\gamma \in [0.5, 1]$: for $n = 10^6$ and $\gamma = 0.6$, a typical m is about 4 000, compared to about 630 000 different observations for a standard bootstrap sample
2 for every subsample $\tau_b$, repeat $B_2$ times:
  - over-sampling: assign weights $(n_1, \dots, n_m)$, simulated as $\mathcal{M}\left(n, m^{-1}\mathbf{1}_m\right)$, to the observations in $\tau_b$
  - estimation step: train an estimator with the weighted observations (if the learning algorithm allows a genuine processing of weights, the computational cost is low because m is small)
3 aggregate by averaging
Remark: the final sample size ($\sum_{i=1}^{m} n_i$) is equal to n (with replacement), as in standard bootstrap samples.
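The steps above can be sketched with a hypothetical toy statistic, the mean of Y, whose weighted version costs only O(m) per fit: each weighted sample is fully described by the m distinct points of the subsample and their multinomial counts.

```python
# Minimal BLB sketch (toy statistic: the mean of Y). Multinomial weights
# summing to n make each weighted sample behave like a size-n bootstrap
# sample while touching only m distinct observations.
import random

def blb_mean(y, B1, B2, gamma=0.6, seed=0):
    rng = random.Random(seed)
    n = len(y)
    m = round(n ** gamma)                       # subsample size m ~ n^gamma
    estimates = []
    for _ in range(B1):
        sub = rng.sample(y, m)                  # size m, WITHOUT replacement
        inner = []
        for _ in range(B2):
            # Weights (n_1, ..., n_m) ~ Multinomial(n, (1/m, ..., 1/m)).
            counts = [0] * m
            for _ in range(n):
                counts[rng.randrange(m)] += 1
            # Weighted mean: uses only the m distinct points and their counts.
            inner.append(sum(c * v for c, v in zip(counts, sub)) / n)
        estimates.append(sum(inner) / B2)       # aggregate over B2
    return sum(estimates) / B1                  # aggregate over B1

y = [float(i) for i in range(10_000)]           # true mean: 4999.5
print(round(blb_mean(y, B1=5, B2=3), 1))        # close to 4999.5
```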
Large scale kernel methods
Given a training set $(X_i, Y_i)_{i=1,\dots,n}$ ($X \in \mathcal{X}$ and $Y \in \mathbb{R}$ or $\{0, \dots, K-1\}$), kernel methods (such as SVM) are based on the definition of a kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ that is symmetric ($K(x, x') = K(x', x)$) and positive ($\sum_{i,i'=1}^{n} \alpha_i \alpha_{i'} K(x_i, x_{i'}) \geq 0$).
This is equivalent to embedding $\mathcal{X}$ into an (implicit) Hilbert space, in which standard (often linear) statistical methods can be used thanks to the kernel trick:
$K(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}$
Standard complexity (number of elementary operations) of kernel learning methods: $O(n^2)$ or even $O(n^3)$ (e.g., K-PCA).
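A toy numerical check of the two defining properties, using the Gaussian kernel as an example (numpy assumed): the resulting kernel matrix is symmetric and positive semidefinite.

```python
# The Gaussian kernel K(x, x') = exp(-|x - x'|^2) is symmetric and positive,
# so its kernel matrix on any sample is symmetric positive semidefinite.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
K = np.exp(-(x[:, None] - x[None, :]) ** 2)        # 50 x 50 kernel matrix

print(bool(np.allclose(K, K.T)))                   # True: symmetry
print(bool(np.linalg.eigvalsh(K).min() > -1e-8))   # True: positivity
```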
Low rank approximation solutions
Aim: approximate $\mathbf{K} = (K(X_i, X_{i'}))_{i,i'=1,\dots,n}$ with a low rank matrix (a matrix with rank $k \ll n$).
Then, use the approximation to train your predictor and correct it to “re-scale” it to n. Typical computational cost: $O(nk^2)$.
Sketch of Nyström approximation
[Williams and Seeger, 2000, Drineas and Mahoney, 2005]
Pick at random m observations in $\{1, \dots, n\}$ (without loss of generality, suppose that the first m ones have been chosen).
Re-write:
$\mathbf{K} = \begin{pmatrix} \mathbf{K}^{(m)} & \mathbf{K}^{(m,n-m)} \\ \mathbf{K}^{(n-m,m)} & \mathbf{K}^{(n-m,n-m)} \end{pmatrix}$ and $\mathbf{K}^{(n,m)} = \begin{pmatrix} \mathbf{K}^{(m)} \\ \mathbf{K}^{(n-m,m)} \end{pmatrix}$,
with $\mathbf{K}^{(n-m,m)} = (\mathbf{K}^{(m,n-m)})^{\top}$, and use $\mathbf{K}^{(m)}$ instead of $\mathbf{K}$.
Approximate spectral decomposition of K
Notations:
$\mathbf{K}$: eigenvectors $(v_j)_{j=1,\dots,n}$, eigenvalues $(\mu_j)_{j=1,\dots,n}$ (positive, in decreasing order)
$\mathbf{K}^{(m)}$: eigenvectors $(v_j^{(m)})_{j=1,\dots,m}$, eigenvalues $(\mu_j^{(m)})_{j=1,\dots,m}$ (positive, in decreasing order)
$\forall\, j = 1, \dots, m$: $\mu_j \simeq \frac{n}{m}\,\mu_j^{(m)}$ and $v_j \simeq \sqrt{\frac{m}{n}}\,\frac{1}{\mu_j^{(m)}}\,\mathbf{K}^{(n,m)} v_j^{(m)}$
complexity of the direct calculation: $O(n^3)$
complexity of the approximate solution: $O(m^3) + O(nm^2)$
Remark: when the rank of $\mathbf{K}$ is $< m$, the approximation is exact.
Notations: $\mu_j^{(n,m)} = \frac{n}{m}\,\mu_j^{(m)}$ and $v_j^{(n,m)} = \sqrt{\frac{m}{n}}\,\frac{1}{\mu_j^{(m)}}\,\mathbf{K}^{(n,m)} v_j^{(m)}$
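A numerical illustration of the eigenvalue approximation $\mu_j \simeq \frac{n}{m}\mu_j^{(m)}$ on a toy Gaussian kernel matrix (numpy assumed; the kernel and sizes are arbitrary choices for the sketch):

```python
# Nyström eigenvalue approximation: the top eigenvalues of the full n x n
# kernel matrix are estimated at O(m^3) cost from the m x m block K^(m).
import numpy as np

rng = np.random.default_rng(0)
n, m = 500, 100
x = rng.normal(size=n)
K = np.exp(-(x[:, None] - x[None, :]) ** 2)    # full n x n kernel matrix

idx = rng.choice(n, size=m, replace=False)     # the m sampled observations
K_m = K[np.ix_(idx, idx)]                      # m x m block K^(m)

mu = np.linalg.eigvalsh(K)[::-1]               # exact eigenvalues, decreasing
mu_approx = (n / m) * np.linalg.eigvalsh(K_m)[::-1]

print(np.round(mu[:3], 1))                     # O(n^3) computation
print(np.round(mu_approx[:3], 1))              # O(m^3), close to the above
```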
Low rank approximation of K
A rank k (for $k \leq m$) approximation of $\mathbf{K}$ is obtained with:
$\widetilde{\mathbf{K}} = \mathbf{K}^{(n,m)} \left(\mathbf{K}_k^{(m)}\right)^{+} \mathbf{K}^{(m,n)}$
in which $\mathbf{K}_k^{(m)}$ is the best rank k approximation of $\mathbf{K}^{(m)}$ and $\left(\mathbf{K}_k^{(m)}\right)^{+} = \sum_{j=1}^{k} \left(\mu_j^{(m)}\right)^{-1} v_j^{(m)} \left(v_j^{(m)}\right)^{\top}$ is its pseudo-inverse. Equivalently:
$\widetilde{\mathbf{K}} = \sum_{j=1}^{k} \mu_j^{(n,m)}\, v_j^{(n,m)} \left(v_j^{(n,m)}\right)^{\top}$
Quality of the approximation [Cortes et al., 2010, Bach, 2013]: with probability at least $1 - \delta$,
$\|\widetilde{\mathbf{K}} - \mathbf{K}\|_2 \leq \|\mathbf{K}_k - \mathbf{K}\|_2 + \frac{n}{\sqrt{m}} \max_{i=1,\dots,n} K_{ii} \left(2 + \log\frac{1}{\delta}\right)$
where $\mathbf{K}_k$ is the best rank k approximation of $\mathbf{K}$ and the $K_{ii}$ are the diagonal terms of $\mathbf{K}$.
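A sketch of the construction $\widetilde{\mathbf{K}} = \mathbf{K}^{(n,m)} (\mathbf{K}_k^{(m)})^{+} \mathbf{K}^{(m,n)}$ on a toy Gaussian kernel (numpy assumed; sizes are illustrative). Since $\mathbf{K}_k$ is the optimal rank k approximation, its spectral error $\mu_{k+1}$ lower-bounds the Nyström error, as in the quality bound:

```python
# Rank-k Nyström approximation K_tilde = K^(n,m) (K_k^(m))^+ K^(m,n),
# compared with the best rank-k error ||K_k - K||_2 = mu_{k+1}.
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 300, 60, 8
x = rng.uniform(-3, 3, size=n)
K = np.exp(-(x[:, None] - x[None, :]) ** 2)

idx = rng.choice(n, size=m, replace=False)
C = K[:, idx]                                    # K^(n,m), n x m
lam, v = np.linalg.eigh(K[np.ix_(idx, idx)])
lam, v = lam[::-1][:k], v[:, ::-1][:, :k]        # top-k spectrum of K^(m)
lam = np.maximum(lam, 1e-12)                     # numerical guard
W_pinv = (v / lam) @ v.T                         # pseudo-inverse (K_k^(m))^+
K_tilde = C @ W_pinv @ C.T                       # rank <= k

err_nystrom = np.linalg.norm(K - K_tilde, 2)     # spectral-norm error
err_best = np.linalg.eigvalsh(K)[::-1][k]        # ||K_k - K||_2 = mu_{k+1}
print(bool(err_best <= err_nystrom))             # True: K_k is optimal
```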
Application to kernel methods
Example: kernel ridge regression [Saunders et al., 1998]
KRR is a kernel regression method which simply performs a linear regression in the feature space, regularized by the $L_2$ norm:
$\arg\min_{w \in \mathcal{H}} \sum_{i=1}^{n} \left(Y_i - \langle w, \phi(X_i) \rangle_{\mathcal{H}}\right)^2 + \frac{\lambda}{2} \|w\|_{\mathcal{H}}^2$
Solution:
$\hat f : x \in \mathcal{X} \mapsto \langle w, \phi(x) \rangle_{\mathcal{H}} = \sum_{i=1}^{n} \alpha_i K(X_i, x)$ with $(\alpha_1, \dots, \alpha_n)^{\top} = (\mathbf{K} + \lambda \mathbf{I}_n)^{-1} (Y_1, \dots, Y_n)^{\top}$
The inverse of $(\mathbf{K} + \lambda \mathbf{I}_n)$ is computationally very costly!
Nyström approximation: replace $\mathbf{K}$ by its Nyström rank k approximation $\mathbf{V}_k \boldsymbol{\Lambda}_k \mathbf{V}_k^{\top}$; $\alpha$ then has the explicit form:
$\tilde\alpha = \frac{1}{\lambda} \left( Y - \mathbf{V}_k \left(\lambda \mathbf{I}_k + \boldsymbol{\Lambda}_k \mathbf{V}_k^{\top} \mathbf{V}_k\right)^{-1} \boldsymbol{\Lambda}_k \mathbf{V}_k^{\top} Y \right)$
The quality of the prediction is also controlled under this approximation (see [Cortes et al., 2010, Bach, 2013]).
Similar methods exist to use the Nyström approximation in SVM.
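A sketch of Nyström-accelerated KRR on toy data (numpy assumed; the data, kernel, and sizes are illustrative choices): the exact solution inverts an $n \times n$ matrix, while the explicit formula for $\tilde\alpha$ only solves a $k \times k$ system.

```python
# Exact KRR vs. Nyström KRR: alpha = (K + lambda I_n)^-1 y needs an n x n
# solve; alpha_tilde only needs the k x k matrix lambda I_k + L_k V_k^T V_k.
import numpy as np

rng = np.random.default_rng(1)
n, m, k, lam_reg = 400, 80, 8, 1.0
x = rng.uniform(-2, 2, size=n)
y = np.sin(x) + 0.1 * rng.normal(size=n)
K = np.exp(-(x[:, None] - x[None, :]) ** 2)

# Exact KRR coefficients -- O(n^3).
alpha = np.linalg.solve(K + lam_reg * np.eye(n), y)

# Nyström rank-k spectrum: L_k (approx eigenvalues), V_k (approx eigenvectors).
idx = rng.choice(n, size=m, replace=False)
w, v = np.linalg.eigh(K[np.ix_(idx, idx)])
w, v = w[::-1][:k], v[:, ::-1][:, :k]
Lk = (n / m) * w
Vk = np.sqrt(m / n) * (K[:, idx] @ v) / w

# alpha_tilde = (1/lambda) (y - V_k (lambda I_k + L_k V_k^T V_k)^-1 L_k V_k^T y)
inner = lam_reg * np.eye(k) + Lk[:, None] * (Vk.T @ Vk)
alpha_tilde = (y - Vk @ np.linalg.solve(inner, Lk * (Vk.T @ y))) / lam_reg

# Fitted values from the two coefficient vectors stay close.
print(round(float(np.abs(K @ alpha - K @ alpha_tilde).mean()), 3))
```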