• Aucun résultat trouvé

Random Forest for Big Data

N/A
N/A
Protected

Academic year: 2022

Partager "Random Forest for Big Data"

Copied!
86
0
0

Texte intégral

(1)

Random Forest for Big Data

Nathalie Villa-Vialaneix http://www.nathalievilla.org

& Robin Genuer, Jean-Michel Poggi & Christine Tuleau-Malot

13ème Journées de Statistique de Rennes Rennes, 21 octobre 2016

Nathalie Villa-Vialaneix |RF for Big Data 1/39

(2)

Sommaire

1 Random Forest

2 Strategies to use random forest with big data Bag of Little Bootstrap (BLB)

Map Reduce Online learning

3 Application

(3)

Sommaire

1 Random Forest

2 Strategies to use random forest with big data Bag of Little Bootstrap (BLB)

Map Reduce Online learning

3 Application

Nathalie Villa-Vialaneix |RF for Big Data 3/39

(4)

A short introduction to random forest

Introduced by[Breiman, 2001], they areensemble methods

[Dietterich, 2000], similarly asBagging, Boosting, Randomizing

Outputs, Random Subspace

Statistical learning algorithm that can be used forclassificationand regression. It has been used in many situations involving real data with success:

I microarray[Díaz-Uriarte and Alvarez de Andres, 2006]

I ecology[Prasad et al., 2006]

I pollution forecasting[Ghattas, 1999]

I la génomique[Goldstein et al., 2010,Boulesteix et al., 2012]

I for more references,[Verikas et al., 2011]

(5)

Description of RF

Ln={(X1,Y1), . . . ,(Xn,Yn)}i.i.d. observations of a random pair of variables(X,Y)st

X ∈Rp (explanatory variables)

Y ∈ Y(target variable). Ycan beR(regression) or{1, . . . ,M} (classification).

Purpose: define a predictorˆf : Rp → YfromLn. Random forest from[Breiman, 2001]

f(.,Θb), 1≤b≤Bo

is a set of regression or classification trees. (Θb)1≤b≤Bare i.i.d. random variables, independent ofLn.

The random forest is obtained by aggregation of the set.

Aggregation:

regression:ˆf(x) = B1 PB

b=1ˆf(x,Θb) classification:ˆf(x) =arg max1≤c≤MPB

b=11{ˆf(x,Θ

b)=c}

Nathalie Villa-Vialaneix |RF for Big Data 5/39

(6)

Description of RF

Ln={(X1,Y1), . . . ,(Xn,Yn)}i.i.d. observations of a random pair of variables(X,Y)st

X ∈Rp (explanatory variables)

Y ∈ Y(target variable). Ycan beR(regression) or{1, . . . ,M} (classification).

Purpose: define a predictorˆf : Rp → YfromLn. Random forest from[Breiman, 2001]

f(.,Θb), 1≤b ≤Bo

is a set of regression or classification trees.

b)1≤b≤B are i.i.d. random variables, independent ofLn.

The random forest is obtained by aggregation of the set. Aggregation:

regression:ˆf(x) = B1 PB

b=1ˆf(x,Θb) classification:ˆf(x) =arg max1≤c≤MPB

b=11{ˆf(x,Θ

b)=c}

(7)

Description of RF

Ln={(X1,Y1), . . . ,(Xn,Yn)}i.i.d. observations of a random pair of variables(X,Y)st

X ∈Rp (explanatory variables)

Y ∈ Y(target variable). Ycan beR(regression) or{1, . . . ,M} (classification).

Purpose: define a predictorˆf : Rp → YfromLn. Random forest from[Breiman, 2001]

f(.,Θb), 1≤b ≤Bo

is a set of regression or classification trees.

b)1≤b≤B are i.i.d. random variables, independent ofLn. The random forest is obtained by aggregation of the set.

Aggregation:

regression:ˆf(x) = B1 PB

b=1ˆf(x,Θb) classification:ˆf(x) =arg max1≤c≤MPB

b=11{ˆf(x,Θ

b)=c}

Nathalie Villa-Vialaneix |RF for Big Data 5/39

(8)

CART

Tree: piecewise constant

predictor obtained with recursive binary partitioning ofRp

Constrains: splits are parallel to the axes

At every step of the binary partitioning, data in the current node are split “at best” (i.e., to have thegreatest decrease in heterogeneityin the two child nodes

(9)

CART partitioning

Nathalie Villa-Vialaneix |RF for Big Data 7/39

(10)

Classification and regression frameworks

(11)

Random forest with Bagging

[Breiman, 1996]

(Xi,Yi)i=1,...,n

(Xi,Yi)i∈τ1 (Xi,Yi)i∈τb (Xi,Yi)i∈τB

ˆf1 ˆfb ˆfB

aggregation:ˆfbag= B1 PB b=1ˆfb CART

subsample with replacementB times

Nathalie Villa-Vialaneix |RF for Big Data 9/39

(12)

Trees in random forests

Variant of CART,

[Breiman et al., 1984]: piecewise constant predictor, obtained by a recursive partitioning ofRp with splits parallel to axes

But: At each step of the partitioning, we seek the “best”

split of data amongmtry randomly picked directions (variables).

No pruning (fully developed trees)

(13)

OOB error and estimation of the prediction error

OOB=OutOfBag OOB error

For predictingYi, only predictorsˆfb such thatib are used⇒bYi OOB error = 1nPn

i=1

Yi−bYi2

(regression) OOB error = 1nPn

i=11{Y

i,bYi} (classification)

estimation similar to standard cross validation estimation

... without splitting the training dataset because it isincluded in the bootstrap sample generation

Warning: a different forest is used for the prediction of eachYi!

Nathalie Villa-Vialaneix |RF for Big Data 11/39

(14)

Variable importance

Definition

Forj ∈ {1, . . . ,p}, for every bootstrap sampleb,permute values of thej variablein the bootstrap sample.

Predict observationi(OOB prediction) for treeb:

in a standard wayˆfb(Xi) after permutationˆfb(Xi(j))

Importance of variablejfor treeˆfb is the average increase in accuracy after permutation of variablej. Regression case:

I(Xj) = 1 B

XB

b=1

1 n

Xn

i=1

(ˆfb(Xi(j))−Yi)2−(ˆfb(Xi)−Yi)2

The greater the increase,

(15)

Why RF and Big Data?

on one hand,baggingis appealing because easily computed in parallel.

on thedark side, each bootstrap sample has the same size than the original dataset (i.e.,n, which is supposed to be LARGE) and contains approximately0.63ndifferent observations(which is also LARGe)!

Nathalie Villa-Vialaneix |RF for Big Data 13/39

(16)

Why RF and Big Data?

on one hand,baggingis appealing because easily computed in parallel.

on thedark side, each bootstrap sample has the same size than the original dataset (i.e.,n, which is supposed to be LARGE) and contains approximately0.63ndifferent observations(which is also LARGe)!

(17)

Sommaire

1 Random Forest

2 Strategies to use random forest with big data Bag of Little Bootstrap (BLB)

Map Reduce Online learning

3 Application

Nathalie Villa-Vialaneix |RF for Big Data 14/39

(18)

Types of strategy to handle big data problems

(19)

Types of strategy to handle big data problems

Here:Bag of Little Bootstrap

Nathalie Villa-Vialaneix |RF for Big Data 15/39

(20)

Types of strategy to handle big data problems

(21)

Types of strategy to handle big data problems

Nathalie Villa-Vialaneix |RF for Big Data 15/39

(22)

Overview of BLB

[Kleiner et al., 2012,Kleiner et al., 2014]

method used to scale any bootstrap estimation

consistency result demonstrated for a bootstrap estimation

Here: we describe the approach in the simplified case ofbagging(as for random forest)

Framework: (Xi,Yi)i=1,...,n a learning set. We want to define a predictor of Y ∈RfromX given the learning set.

(23)

Overview of BLB

[Kleiner et al., 2012,Kleiner et al., 2014]

method used to scale any bootstrap estimation

consistency result demonstrated for a bootstrap estimation

Here: we describe the approach in the simplified case ofbagging(as for random forest)

Framework: (Xi,Yi)i=1,...,n a learning set. We want to define a predictor of Y ∈RfromX given the learning set.

Nathalie Villa-Vialaneix |RF for Big Data 16/39

(24)

Overview of BLB

[Kleiner et al., 2012,Kleiner et al., 2014]

method used to scale any bootstrap estimation

consistency result demonstrated for a bootstrap estimation

Here: we describe the approach in the simplified case ofbagging(as for random forest)

Framework: (Xi,Yi)i=1,...,n a learning set. We want to define a predictor of Y ∈RfromX given the learning set.

(25)

Problem with standard bagging

Whennis big, the number of different observations inτb is∼0.63n⇒still BIG!

First solution...: [Bickel et al., 1997]propose the “m-out-of-n” bootstrap: bootstrap samples have sizemwithmn

But: The quality of the estimator strongly depends onm!

Idea behind BLB

Use bootstrap samples having sizenbut witha very small number of different observations in each of them.

Nathalie Villa-Vialaneix |RF for Big Data 17/39

(26)

Problem with standard bagging

Whennis big, the number of different observations inτb is∼0.63n⇒still BIG!

First solution...: [Bickel et al., 1997]propose the “m-out-of-n” bootstrap:

bootstrap samples have sizemwithmn

But: The quality of the estimator strongly depends onm!

Idea behind BLB

Use bootstrap samples having sizenbut witha very small number of different observations in each of them.

(27)

Problem with standard bagging

Whennis big, the number of different observations inτb is∼0.63n⇒still BIG!

First solution...: [Bickel et al., 1997]propose the “m-out-of-n” bootstrap:

bootstrap samples have sizemwithmn

But: The quality of the estimator strongly depends onm!

Idea behind BLB

Use bootstrap samples having sizenbut witha very small number of different observations in each of them.

Nathalie Villa-Vialaneix |RF for Big Data 17/39

(28)

Problem with standard bagging

Whennis big, the number of different observations inτb is∼0.63n⇒still BIG!

First solution...: [Bickel et al., 1997]propose the “m-out-of-n” bootstrap:

bootstrap samples have sizemwithmn

But: The quality of the estimator strongly depends onm!

Idea behind BLB

Use bootstrap samples having sizenbut witha very small number of different observations in each of them.

(29)

Presentation of BLB

(X1,Y1). . .(Xn,Yn)

(X1(1),Y1(1)). . .(X(1)m,Ym(1))

(X1(B1),Y(B11)). . .(X(Bm1),Ym(B1))

...

n1(1,1). . .n(1,1)m

n(1,B1 2). . .n(1,Bm 2)

n(B11,1). . .n(Bm1,1)

n1(B1,B2). . .n(Bm1,B2)

...

...

ˆf(1,1)

ˆf(1,B2)

ˆf(B1,1)

ˆf(B1,B2) ˆf1

ˆfB1

ˆfBLB sampling, no

replacement (sizemn)

over-sampling

mean

mean

Nathalie Villa-Vialaneix |RF for Big Data 18/39

(30)

Presentation of BLB

(X1,Y1). . .(Xn,Yn)

(X1(1),Y1(1)). . .(X(1)m,Ym(1))

...

n1(1,1). . .n(1,1)m

n(1,B1 2). . .n(1,Bm 2)

n(B11,1). . .n(Bm1,1)

n1(B1,B2). . .n(Bm1,B2)

...

...

ˆf(1,1)

ˆf(1,B2)

ˆf(B1,1)

ˆf(B1,B2) ˆf1

ˆfB1

ˆfBLB

sampling, no replacement (sizemn)

over-sampling

mean

mean

(31)

Presentation of BLB

(X1,Y1). . .(Xn,Yn)

(X1(1),Y1(1)). . .(X(1)m,Ym(1))

(X1(B1),Y(B11)). . .(X(Bm1),Ym(B1))

...

n1(1,1). . .n(1,1)m

n(1,B1 2). . .n(1,Bm 2)

n(B11,1). . .n(Bm1,1)

n1(B1,B2). . .n(Bm1,B2)

...

...

ˆf(1,1)

ˆf(1,B2)

ˆf(B1,1)

ˆf(B1,B2) ˆf1

ˆfB1

ˆfBLB

sampling, no replacement (sizemn)

over-sampling

mean

mean

Nathalie Villa-Vialaneix |RF for Big Data 18/39

(32)

Presentation of BLB

(X1,Y1). . .(Xn,Yn)

(X1(1),Y1(1)). . .(X(1)m,Ym(1))

...

n1(1,1). . .n(1,1)m

n(1,B1 2). . .n(1,Bm 2)

n(B11,1). . .n(Bm1,1)

...

...

ˆf(1,1)

ˆf(1,B2)

ˆf(B1,1)

ˆf1

ˆfB1

ˆfBLB

sampling, no replacement (sizemn)

over-sampling

mean

mean

(33)

Presentation of BLB

(X1,Y1). . .(Xn,Yn)

(X1(1),Y1(1)). . .(X(1)m,Ym(1))

(X1(B1),Y(B11)). . .(X(Bm1),Ym(B1))

...

n1(1,1). . .n(1,1)m

n(1,B1 2). . .n(1,Bm 2)

n(B11,1). . .n(Bm1,1)

n1(B1,B2). . .n(Bm1,B2)

...

...

ˆf(1,1)

ˆf(1,B2)

ˆf(B1,1)

ˆf(B1,B2) ˆf1

ˆfB1

ˆfBLB

sampling, no replacement (sizemn)

over-sampling

mean

mean

Nathalie Villa-Vialaneix |RF for Big Data 18/39

(34)

Presentation of BLB

(X1,Y1). . .(Xn,Yn)

(X1(1),Y1(1)). . .(X(1)m,Ym(1))

...

n1(1,1). . .n(1,1)m

n(1,B1 2). . .n(1,Bm 2)

n(B11,1). . .n(Bm1,1)

...

...

ˆf(1,1)

ˆf(1,B2)

ˆf(B1,1)

ˆf1

ˆfBLB sampling, no

replacement (sizemn)

over-sampling

mean

mean

(35)

What is over-sampling and why is it working?

BLB steps:

1 createB1samples (without replacement)of sizem∼nγ (with γ∈[0.5,1]: forn=106andγ =0.6, typicalmis about 4000, compared to 630 000 for standard bootstrap

2 for every subsampleτb, repeatB2times:

I over-sampling: affect weights(n1, . . . ,nm)simulated asM n,m11m

to observations inτb

I estimation step: train an estimator with weighted observations

3 aggregate by averaging

Remark: Final sample size (Pm

i=1ni) is equal ton(with replacement) as in standard bootstrap samples.

Nathalie Villa-Vialaneix |RF for Big Data 19/39

(36)

What is over-sampling and why is it working?

BLB steps:

1 createB1samples (without replacement)of sizem∼nγ (with γ∈[0.5,1]

2 for every subsampleτb, repeatB2times:

I over-sampling: affect weights(n1, . . . ,nm)simulated asM

n,m11m to observations inτb

I estimation step: train an estimator with weighted observations

3 aggregate by averaging

Remark: Final sample size (Pm

i=1ni) is equal ton(with replacement) as in standard bootstrap samples.

(37)

What is over-sampling and why is it working?

BLB steps:

1 createB1samples (without replacement)of sizem∼nγ (with γ∈[0.5,1]

2 for every subsampleτb, repeatB2times:

I over-sampling: affect weights(n1, . . . ,nm)simulated asM

n,m11m to observations inτb

I estimation step: train an estimator with weighted observations (if the learning algorithm allows a genuine processing of weights,

computational cost is low because of the small size ofm)

3 aggregate by averaging

Remark: Final sample size (Pm

i=1ni) is equal ton(with replacement) as in standard bootstrap samples.

Nathalie Villa-Vialaneix |RF for Big Data 19/39

(38)

What is over-sampling and why is it working?

BLB steps:

1 createB1samples (without replacement)of sizem∼nγ (with γ∈[0.5,1]

2 for every subsampleτb, repeatB2times:

I over-sampling: affect weights(n1, . . . ,nm)simulated asM

n,m11m to observations inτb

I estimation step: train an estimator with weighted observations

3 aggregate by averaging

Remark: Final sample size (Pm

i=1ni) is equal ton(with replacement) as in standard bootstrap samples.

(39)

What is over-sampling and why is it working?

BLB steps:

1 createB1samples (without replacement)of sizem∼nγ (with γ∈[0.5,1]

2 for every subsampleτb, repeatB2times:

I over-sampling: affect weights(n1, . . . ,nm)simulated asM

n,m11m to observations inτb

I estimation step: train an estimator with weighted observations

3 aggregate by averaging

Remark: Final sample size (Pm

i=1ni) is equal ton(with replacement) as in standard bootstrap samples.

Nathalie Villa-Vialaneix |RF for Big Data 19/39

(40)

Overview of Map Reduce

Map Reduceis a generic method to deal with massive datasets stored on a distributed filesystem.

It has been developped by GoogleTM [Dean and Ghemawat, 2004](see also

[Chamandy et al., 2012]for example of use at Google).

(41)

Overview of Map Reduce

Data

Data 1

Data 2

Data 3

The data are brokeninto several bits.

Nathalie Villa-Vialaneix |RF for Big Data 20/39

(42)

Overview of Map Reduce

Data

Data 1

Data 2

Data 3

Map

Map

Map

{(key,value)}

{(key,value)}

{(key,value)}

(43)

Overview of Map Reduce

Data

Data 1

Data 2

Data 3

Map

Map

Map

{(key,value)}

{(key,value)}

{(key,value)}

Map jobs must beindependent! Result: indexed data.

Nathalie Villa-Vialaneix |RF for Big Data 20/39

(44)

Overview of Map Reduce

Data

Data 1

Data 2

Data 3

Map

Map

Map

{(key,value)}

{(key,value)}

{(key,value)}

Reduce key=keyk

Reduce key=key1

OUTPUT

(45)

MR implementation of random forest

A Map/Reduce implementation of random forest is included inMahout (Apache scalable machine learning library) which works as

[del Rio et al., 2014]:

data aresplit betweenQ bitssent to each Map job;

aMap jobtrain a random forest with a small number of trees in it;

there isno Reduce step(the final forest is the combination of all trees learned in the Map jobs).

Note that this implementationis not equivalent to the original random forest algorithmbecause the forests are not built on bootstrap samples of the original data set.

Nathalie Villa-Vialaneix |RF for Big Data 21/39

(46)

MR implementation of random forest

A Map/Reduce implementation of random forest is included inMahout (Apache scalable machine learning library) which works as

[del Rio et al., 2014]:

data aresplit betweenQ bitssent to each Map job;

aMap jobtrain a random forest with a small number of trees in it;

there isno Reduce step(the final forest is the combination of all trees learned in the Map jobs).

Note that this implementationis not equivalent to the original random forest algorithmbecause the forests are not built on bootstrap samples of the original data set.

(47)

Drawbacks of MR implementation of random forest

Locality of datacan yield to biased random forests in the different Map jobs⇒the combined forest might have poor prediction performances

OOB errorcannot be computed precisely because Map job are independent. A proxy of this quantity is given by the average of OOB errors obtained from the different Map tasks⇒again this quantity must be biased due to data locality (similar problem with VI).

Nathalie Villa-Vialaneix |RF for Big Data 22/39

(48)

Drawbacks of MR implementation of random forest

Locality of datacan yield to biased random forests in the different Map jobs⇒the combined forest might have poor prediction performances

OOB errorcannot be computed precisely because Map job are independent. A proxy of this quantity is given by the average of OOB errors obtained from the different Map tasks⇒again this quantity must be biased due to data locality (similar problem with VI).

(49)

Another MR implementation of random forest

... usingPoisson bootstrap[Chamandy et al., 2012]which is based on the fact that (for largen):

Binom n,1 n

!

'Poisson(1)

1 Map step∀r =1, . . . ,Q (chunk of dataτr). ∀i ∈τr, generateB random i.i.d. random variables from Poisson(1)nbi (b =1, . . . ,B).

Output:(key,value)are(b, (i, nbi))for all pairs(i,b)stnbi ,0 (indicesi stnbi ,0 are in bootstrap sample numberb nbi times);

2 Reduce stepproceeds bootstrap sample numberb: a tree is built from indicesi stnbi ,0 repeatednbi times.

Output: A tree... All trees are collected in a forest.

Closer to using RF directly on the entire datasetBut: every Reduce job should deal with approximately 0.63×ndifferent observations... (only the bootstrap part is simplified)

Nathalie Villa-Vialaneix |RF for Big Data 23/39

(50)

Another MR implementation of random forest

... usingPoisson bootstrap[Chamandy et al., 2012]which is based on the fact that (for largen):

Binom n,1 n

!

'Poisson(1)

1 Map step∀r =1, . . . ,Q (chunk of dataτr). ∀i ∈τr, generateB random i.i.d. random variables from Poisson(1)nbi (b =1, . . . ,B).

Output:(key,value)are(b, (i, nbi))for all pairs(i,b)stnbi ,0 (indicesistnbi ,0 are in bootstrap sample numberb nbi times);

2 Reduce stepproceeds bootstrap sample numberb: a tree is built from indicesi stnbi ,0 repeatednbi times.

Output: A tree... All trees are collected in a forest.

Closer to using RF directly on the entire datasetBut: every Reduce job should deal with approximately 0.63×ndifferent observations... (only the bootstrap part is simplified)

(51)

Another MR implementation of random forest

... usingPoisson bootstrap[Chamandy et al., 2012]which is based on the fact that (for largen):

Binom n,1 n

!

'Poisson(1)

1 Map step∀r =1, . . . ,Q (chunk of dataτr). ∀i ∈τr, generateB random i.i.d. random variables from Poisson(1)nbi (b =1, . . . ,B).

Output:(key,value)are(b, (i, nbi))for all pairs(i,b)stnbi ,0 (indicesistnbi ,0 are in bootstrap sample numberb nbi times);

2 Reduce stepproceeds bootstrap sample numberb: a tree is built from indicesi stnbi ,0 repeatednbi times.

Output: A tree... All trees are collected in a forest.

Closer to using RF directly on the entire datasetBut: every Reduce job should deal with approximately 0.63×ndifferent observations... (only the bootstrap part is simplified)

Nathalie Villa-Vialaneix |RF for Big Data 23/39

(52)

Another MR implementation of random forest

... usingPoisson bootstrap[Chamandy et al., 2012]which is based on the fact that (for largen):

Binom n,1 n

!

'Poisson(1)

1 Map step∀r =1, . . . ,Q (chunk of dataτr). ∀i ∈τr, generateB random i.i.d. random variables from Poisson(1)nbi (b =1, . . . ,B).

Output:(key,value)are(b, (i, nbi))for all pairs(i,b)stnbi ,0 (indicesistnbi ,0 are in bootstrap sample numberb nbi times);

2 Reduce stepproceeds bootstrap sample numberb: a tree is built from indicesi stnbi ,0 repeatednbi times.

Output: A tree... All trees are collected in a forest.

Closer to using RF directly on the entire datasetBut: every Reduce job should deal with approximately 0.63×ndifferent observations... (only the bootstrap part is simplified)

(53)

Another MR implementation of random forest

... usingPoisson bootstrap[Chamandy et al., 2012]which is based on the fact that (for largen):

Binom n,1 n

!

'Poisson(1)

1 Map step∀r =1, . . . ,Q (chunk of dataτr). ∀i ∈τr, generateB random i.i.d. random variables from Poisson(1)nbi (b =1, . . . ,B).

Output:(key,value)are(b, (i, nbi))for all pairs(i,b)stnbi ,0 (indicesistnbi ,0 are in bootstrap sample numberb nbi times);

2 Reduce stepproceeds bootstrap sample numberb: a tree is built from indicesi stnbi ,0 repeatednbi times.

Output: A tree... All trees are collected in a forest.

Closer to using RF directly on the entire datasetBut: every Reduce job should deal with approximately 0.63×ndifferent observations... (only the bootstrap part is simplified)

Nathalie Villa-Vialaneix |RF for Big Data 23/39

(54)

Another MR implementation of random forest

... usingPoisson bootstrap[Chamandy et al., 2012]which is based on the fact that (for largen):

Binom n,1 n

!

'Poisson(1)

1 Map step∀r =1, . . . ,Q (chunk of dataτr). ∀i ∈τr, generateB random i.i.d. random variables from Poisson(1)nbi (b =1, . . . ,B).

Output:(key,value)are(b, (i, nbi))for all pairs(i,b)stnbi ,0 (indicesistnbi ,0 are in bootstrap sample numberb nbi times);

2 Reduce stepproceeds bootstrap sample numberb: a tree is built from indicesi stnbi ,0 repeatednbi times.

Output: A tree... All trees are collected in a forest.

Closer to using RF directly on the entire datasetBut: every Reduce job .

(55)

Online learning framework

Data stream: Observations(Xi,Yi)i=1,...,n have been used to obtain a predictorˆfn

New data arrive(Xi,Yi)i=n+1,...,n+m:How to obtain a predictor from the entire dataset(Xi,Yi)i=1,...,n+m?

Naive approach: re-train a model from(Xi,Yi)i=1,...,n+m

More interesting approach:updateˆfn with the new information (Xi,Yi)i=n+1,...,n+m

Why is it interesting?

computational gainif the update has a small computational cost (it can even be interesting to deal directly with big data which do not arrive in stream)

storage gain

Nathalie Villa-Vialaneix |RF for Big Data 24/39

(56)

Online learning framework

Data stream: Observations(Xi,Yi)i=1,...,n have been used to obtain a predictorˆfn

New data arrive(Xi,Yi)i=n+1,...,n+m:How to obtain a predictor from the entire dataset(Xi,Yi)i=1,...,n+m?

Naive approach: re-train a model from(Xi,Yi)i=1,...,n+m

More interesting approach:updateˆfn with the new information (Xi,Yi)i=n+1,...,n+m

Why is it interesting?

computational gainif the update has a small computational cost (it can even be interesting to deal directly with big data which do not arrive in stream)

storage gain

(57)

Online learning framework

Data stream: Observations(Xi,Yi)i=1,...,n have been used to obtain a predictorˆfn

New data arrive(Xi,Yi)i=n+1,...,n+m:How to obtain a predictor from the entire dataset(Xi,Yi)i=1,...,n+m?

Naive approach: re-train a model from(Xi,Yi)i=1,...,n+m

More interesting approach:updateˆfn with the new information (Xi,Yi)i=n+1,...,n+m

Why is it interesting?

computational gainif the update has a small computational cost (it can even be interesting to deal directly with big data which do not arrive in stream)

storage gain

Nathalie Villa-Vialaneix |RF for Big Data 24/39

(58)

Framework of online bagging

ˆfn = 1 B

B

X

b=1

ˆfb

n

in which ˆfb

n has been built from a bootstrap sample in{1, . . . , n} we know how to updateˆfb

n with new data online

Question: Can we update the bootstrap samples online when new data (Xi,Yi)i=n+1,...,n+marrive?

(59)

Framework of online bagging

ˆfn = 1 B

B

X

b=1

ˆfb

n

in which ˆfb

n has been built from a bootstrap sample in{1, . . . , n} we know how to updateˆfb

n with new data online

Question: Can we update the bootstrap samples online when new data (Xi,Yi)i=n+1,...,n+m arrive?

Nathalie Villa-Vialaneix |RF for Big Data 25/39

(60)

Online bootstrap using Poisson bootstrap

1 generate weights for every bootstrap samples and every new observation:nbi ∼Poisson(1)fori =n+1, . . . ,n+mand b=1, . . . , B

2 updateˆfb

n with the observationsXisuch thatnib ,0, each repeated nbi times

3 update the predictor:

ˆfn+m = 1 B

XB

b=1

ˆfb

n+m.

(61)

Online bootstrap using Poisson bootstrap

1 generate weights for every bootstrap samples and every new observation:nbi ∼Poisson(1)fori =n+1, . . . ,n+mand b=1, . . . , B

2 updateˆfb

n with the observationsXisuch thatnib ,0, each repeated nbi times

3 update the predictor:

ˆfn+m = 1 B

XB

b=1

ˆfb

n+m.

Nathalie Villa-Vialaneix |RF for Big Data 26/39

(62)

Online bootstrap using Poisson bootstrap

1 generate weights for every bootstrap samples and every new observation:nbi ∼Poisson(1)fori =n+1, . . . ,n+mand b=1, . . . , B

2 updateˆfb

n with the observationsXisuch thatnib ,0, each repeated nbi times

3 update the predictor:

ˆfn+m = 1 B

XB

b=1

ˆfb

n+m.

(63)

PRF

InPurely Random Forest[Biau et al., 2008], the splits are generated independentlyfrom the data

splits are obtained by randomly choosing a variable and a splitting point within the range of this variable

decision is made in a standard way

Nathalie Villa-Vialaneix |RF for Big Data 27/39

(64)

PRF

InPurely Random Forest[Biau et al., 2008], the splits are generated independentlyfrom the data

splits are obtained by randomly choosing a variable and a splitting point within the range of this variable

decision is made in a standard way

(65)

Online PRF

PRF is described by:

∀b =1, . . . ,B,ˆfb

n: PR tree for bootstrap sample numberb

∀b =1, . . . ,B, for all terminal leaflinˆfb

n, obsbn,lis the number of observations in(Xi)i=1, ...,nwhich falls in leafl and valb,ln is the average Y for these observations (regression framework)

Online update with Poisson bootstrap:

∀b =1, . . . ,B,∀i ∈ {n+1, . . . , n+m}stnbi ,0 and for the terminal leafl ofXi:

valb,li = val

b,l

i1×obsb,li1+nib×Yi obsbi,l1+nbi (online update of the mean...)

obsb,li =obsb,li−1+nbi

Nathalie Villa-Vialaneix |RF for Big Data 28/39

(66)

Online PRF

PRF is described by:

∀b =1, . . . ,B,ˆfb

n: PR tree for bootstrap sample numberb

∀b =1, . . . ,B, for all terminal leaflinˆfb

n, obsbn,lis the number of observations in(Xi)i=1, ...,nwhich falls in leafl and valb,ln is the average Y for these observations (regression framework)

Online update with Poisson bootstrap:

∀b =1, . . . ,B,∀i ∈ {n+1, . . . , n+m}stnbi ,0 and for the terminal leafl ofXi:

valb,li = val

b,l

i−1×obsb,li−1+nib×Yi obsbi,l1+nbi (online update of the mean...)

obsb,li =obsb,li−1+nib

(67)

Online PRF

PRF is described by:

∀b =1, . . . ,B,ˆfb

n: PR tree for bootstrap sample numberb

∀b =1, . . . ,B, for all terminal leaflinˆfb

n, obsbn,lis the number of observations in(Xi)i=1, ...,nwhich falls in leafl and valb,ln is the average Y for these observations (regression framework)

Online update with Poisson bootstrap:

∀b =1, . . . ,B,∀i ∈ {n+1, . . . , n+m}stnbi ,0 and for the terminal leafl ofXi:

valb,li = val

b,l

i−1×obsb,li−1+nib×Yi obsbi,l1+nbi (online update of the mean...)

obsb,li =obsb,li−1+nib

Nathalie Villa-Vialaneix |RF for Big Data 28/39

(68)

Online RF

Developed to handle data streams (data arrive sequentially) in an online manner (we can not keep all data from the past):

[Saffari et al., 2009]

Can deal with massive data streams (addressing both Volume and Velocity characteristics), but also to handle massive (static) data, by running through the data sequentially

In depth adaptation of Breiman’s RF: even the tree growing mechanism is changed

Main idea: think only in terms ofproportionsof output classes, instead of observations (classification framework)

(69)

Sommaire

1 Random Forest

2 Strategies to use random forest with big data Bag of Little Bootstrap (BLB)

Map Reduce Online learning

3 Application

Nathalie Villa-Vialaneix |RF for Big Data 30/39

(70)

When should we consider data as “big”?

We deal with Big Data when:

data are atgoogle scale(rare)

data are big compared to ourcomputing capacities

... and depending onwhat we need to do with them

[R Core Team, 2016,Kane et al., 2013]

R is not well-suited for working with data structures larger than about 10–20% of a computer’s RAM. Data exceeding 50% of available RAM are essentially unusable because the overhead of all but the simplest of calculations quickly consumes all available RAM. Based on these guidelines, we consider a data set large if it exceeds 20% of the RAM on a given machine and massive if it exceeds 50%.

(71)

When should we consider data as “big”?

We deal with Big Data when:

data are atgoogle scale(rare)

data are big compared to ourcomputing capacities

... and depending onwhat we need to do with them

[R Core Team, 2016,Kane et al., 2013]

R is not well-suited for working with data structures larger than about 10–20% of a computer’s RAM. Data exceeding 50% of available RAM are essentially unusable because the overhead of all but the simplest of calculations quickly consumes all available RAM. Based on these guidelines, we consider a data set large if it exceeds 20% of the RAM on a given machine and massive if it exceeds 50%.

Nathalie Villa-Vialaneix |RF for Big Data 31/39

(72)

When should we consider data as “big”?

We deal with Big Data when:

data are atgoogle scale(rare)

data are big compared to ourcomputing capacities... and depending onwhat we need to do with them

[R Core Team, 2016,Kane et al., 2013]

R is not well-suited for working with data structures larger than about 10–20% of a computer’s RAM. Data exceeding 50% of available RAM are essentially unusable because the overhead of all but the simplest of calculations quickly consumes all available RAM. Based on these guidelines, we consider a data set large if it exceeds 20% of the RAM on a given machine and massive if it

(73)

Implementation

standardrandomForestR package(done) bigmemorybigrfR package

MapReduce

I at hand(done)

I R packagesrmr,rhadoop

I Mahout implementation

Online RF: Python code from[Denil et al., 2013], and C++ code from

[Saffari et al., 2009].

Nathalie Villa-Vialaneix |RF for Big Data 32/39

(74)

MR-RF in practice: case study

[Genuer et al., 2015]

15,000,000 observations generated from: Y with

P(Y =1) =P(Y =−1) =0.5 and the conditional distribution of the (X(j))j=1,...,7givenY =y

with probability equal to 0.7,X(j)∼ N(jy,1)forj ∈ {1,2,3}and X(j)∼ N(0,1)forj ∈ {4,5,6};

with probability equal to 0.3,Xj∼ N(0,1)forj ∈ {1,2,3}and X(j)∼ N((j−3)y,1)forj ∈ {4,5,6};

X7 ∼ N(0,1).

Comparisonof subsampling, BLB, MR with well distributed data within Map jobs.

(75)

MR-RF in practice: case study

[Genuer et al., 2015]

15,000,000 observations generated from: Y with

P(Y =1) =P(Y =−1) =0.5 and the conditional distribution of the (X(j))j=1,...,7givenY =y

with probability equal to 0.7,X(j)∼ N(jy,1)forj ∈ {1,2,3}and X(j)∼ N(0,1)forj ∈ {4,5,6};

with probability equal to 0.3,Xj∼ N(0,1)forj ∈ {1,2,3}and X(j)∼ N((j−3)y,1)forj ∈ {4,5,6};

X7 ∼ N(0,1).

Comparisonof subsampling, BLB, MR with well distributed data within Map jobs.

Nathalie Villa-Vialaneix |RF for Big Data 33/39

(76)

Airline data

Benchmark data in Big Data articles (e.g.,[Wang et al., 2015])

containing more than124 millions of observationsand29 variables Aim: predictdelay_status(1=delayed, 0=on time) of a flight using 4 explanatory variables (distance,night,week-end,

departure_time).

Not really massive data: 12 Gocsv file Still useful to illustrate some Big Data issues:

I too large to fit in RAM (of most of nowadays laptops)

I R struggles to perform complex computations unless data take less than 10%−20%of RAM (total memory size of manipulated objects cannot exceed RAM limit)

I long computation times to deal with this dataset

Preliminary experiments on a Linux 64 bits server with 8 processors,

(77)

Comparison of different BDRF on a simulation study

sequential forest: took approximately 7 hours and the resulting OOB error was equal to 4.564e−3.

Method Comp. time BDerrForest errForest errTest sampling 10% 3 min 4.622e(-3) 4.381e(-3) 4.300e(-3) sampling 1% 9 sec 4.586e(-3) 4.363e(-3) 4.400e(-3) sampling 0.1% 1 sec 5.600e(-3) 4.714e(-3) 4.573e(-3) sampling 0.01% 0.3 sec 4.666e(-3) 5.957e(-3) 5.753e(-3) BLB-RF 5/20 1 min 4.138e(-3) 4.294e(-3) 4.267e(-3) BLB-RF 10/10 3 min 4.138e(-3) 4.278e(-3) 4.267e(-3) MR-RF 100/1 2 min 1.397e(-2) 4.235 e(-3) 4.006e(-3) MR-RF 100/10 2 min 8.646e(-3) 4.155e(-3) 4.293e(-3) MR-RF 10/10 6 min 8.501e(-3) 4.290e(-3) 4.253e(-3) MR-RF 10/100 21 min 4.556e(-3) 4.249e(-3) 4.260e(-3)

all methods provide satisfactory results comparable with sequential RF

average OOB error over the Map forests can be a bad approximation of true OOB error (sometimes optimistic, sometimes pessimistic)

Nathalie Villa-Vialaneix |RF for Big Data 35/39

(78)

Comparison of different BDRF on a simulation study

sequential forest: took approximately 7 hours and the resulting OOB error was equal to 4.564e−3.

Method Comp. time BDerrForest errForest errTest sampling 10% 3 min 4.622e(-3) 4.381e(-3) 4.300e(-3) sampling 1% 9 sec 4.586e(-3) 4.363e(-3) 4.400e(-3) sampling 0.1% 1 sec 5.600e(-3) 4.714e(-3) 4.573e(-3) sampling 0.01% 0.3 sec 4.666e(-3) 5.957e(-3) 5.753e(-3) BLB-RF 5/20 1 min 4.138e(-3) 4.294e(-3) 4.267e(-3) BLB-RF 10/10 3 min 4.138e(-3) 4.278e(-3) 4.267e(-3) MR-RF 100/1 2 min 1.397e(-2) 4.235 e(-3) 4.006e(-3) MR-RF 100/10 2 min 8.646e(-3) 4.155e(-3) 4.293e(-3) MR-RF 10/10 6 min 8.501e(-3) 4.290e(-3) 4.253e(-3) MR-RF 10/100 21 min 4.556e(-3) 4.249e(-3) 4.260e(-3) all methods provide satisfactory results comparable with sequential RF

Références

Documents relatifs

Nanobiosci., vol. Malik, “Scale-space and edge detection using anisotropic diffusion,” IEEE Trans. Ushida, “CNN-based difference-con- trolled adaptive non-linear image filters,”

My host was Sancharika Samuha (SAS), a feminist non- governmental organization (NGO) dedicated to the promotion of gender equality of women through various media initiatives

Although studies that have linked surface production to benthic biomass and abundance are typically at larger scales than the Placentia Bay study, multiple surveys of Placentia

Molecular mechanics (MMX) modelling has become a valuable tool to assess the importance of steric effects on structure and reactions of inorganic and organometallic

2. Duty to harmonize the special or regional conventions with the basic principles of this Convention. It is a convenient flexibility to allow the amendment of the application of

difference between the “small data” standard framework compared to the Big Data in terms of statistical accuracy when approximate versions of the original approach are used to deal

More precisely, one variant relies on subsampling while three others are related to parallel implementations of random forests and involve either various adaptations of bootstrap to

As indicated in [4], the standard MR version of RF, denoted by MR-RF in the re- maining of this paper, is based on the parallel construction of trees obtained on bootstrap subsamples