Random Forest for Big Data

(1)

Random Forest for Big Data

Nathalie Villa-Vialaneix http://www.nathalievilla.org

& Robin Genuer, Jean-Michel Poggi & Christine Tuleau-Malot

13ème Journées de Statistique de Rennes Rennes, 21 octobre 2016

Nathalie Villa-Vialaneix |RF for Big Data 1/39

(2)

Sommaire

1 Random Forest

2 Strategies to use random forest with big data Bag of Little Bootstrap (BLB)

Map Reduce Online learning

3 Application

(3)

Sommaire

1 Random Forest

3 Application

(4)

A short introduction to random forest

Introduced by[Breiman, 2001], they areensemble methods

[Dietterich, 2000], similarly asBagging, Boosting, Randomizing

Outputs, Random Subspace

Statistical learning algorithm that can be used forclassificationand regression. It has been used in many situations involving real data with success:

I microarray[Díaz-Uriarte and Alvarez de Andres, 2006]

I ecology[Prasad et al., 2006]

I pollution forecasting[Ghattas, 1999]

I la génomique[Goldstein et al., 2010,Boulesteix et al., 2012]

I for more references,[Verikas et al., 2011]

(5)

Description of RF

L_n={(X₁,Y₁), . . . ,(X_n,Y_n)}i.i.d. observations of a random pair of variables(X,Y)st

X ∈R^p (explanatory variables)

Y ∈ Y(target variable). Ycan beR(regression) or{1, . . . ,M} (classification).

Purpose: define a predictorˆ_f : R^p → YfromL_n. Random forest from[Breiman, 2001]

nˆ_f(.,Θ_b), 1≤b≤Bo

is a set of regression or classification trees. (Θb)1≤b≤Bare i.i.d. random variables, independent ofL_n.

The random forest is obtained by aggregation of the set.

Aggregation:

regression:ˆ_f(x) = _B¹ PB

b=1ˆ_f(x,Θ_b) classification:ˆ_f(x) =arg max_1≤c≤MP_B

b=11_{_ˆ_f(x,Θ

b)=c}

(6)

Description of RF

nˆ_f(.,Θ_b), 1≤b ≤Bo

is a set of regression or classification trees.

(Θb)1≤b≤B are i.i.d. random variables, independent ofL_n.

The random forest is obtained by aggregation of the set. Aggregation:

regression:ˆ_f(x) = _B¹ PB

b=11_{_ˆ_f(x,Θ

b)=c}

(7)

Description of RF

nˆ_f(.,Θ_b), 1≤b ≤Bo

is a set of regression or classification trees.

(Θb)1≤b≤B are i.i.d. random variables, independent ofL_n. The random forest is obtained by aggregation of the set.

Aggregation:

regression:ˆ_f(x) = _B¹ P_B

b=11_{_ˆ_f(x,Θ

b)=c}

(8)

CART

Tree: piecewise constant

predictor obtained with recursive binary partitioning ofR^p

Constrains: splits are parallel to the axes

At every step of the binary partitioning, data in the current node are split “at best” (i.e., to have thegreatest decrease in heterogeneityin the two child nodes

(9)

CART partitioning

(10)

Classification and regression frameworks

(11)

Random forest with Bagging

[Breiman, 1996]

(X_i,Y_i)i=1,...,n

(X_i,Y_i)_i∈τ1 (X_i,Y_i)i∈τb (X_i,Y_i)_i∈τB

ˆ_f¹ ˆ_f^b ˆ_f^B

aggregation:ˆ_f^bag= _B¹ PB b=1ˆ_f^b CART

subsample with replacementB times

(12)

Trees in random forests

Variant of CART,

[Breiman et al., 1984]: piecewise constant predictor, obtained by a recursive partitioning ofR^p ^with splits parallel to axes

But: At each step of the partitioning, we seek the “best”

split of data amongmtry randomly picked directions (variables).

No pruning (fully developed trees)

(13)

OOB error and estimation of the prediction error

OOB=OutOfBag OOB error

For predictingY_i, only predictorsˆ_f^b _{such that}_i <τ_b are used⇒bY_i OOB error = ¹_nPn

i=1

Y_i−bY_i2

(regression) OOB error = ¹_nP_n

i=11_{Y

i,bYi} (classification)

estimation similar to standard cross validation estimation

... without splitting the training dataset because it isincluded in the bootstrap sample generation

Warning: a different forest is used for the prediction of eachY_i!

(14)

Variable importance

Definition

Forj ∈ {1, . . . ,p}, for every bootstrap sampleb,permute values of thej variablein the bootstrap sample.

Predict observationi(OOB prediction) for treeb:

in a standard wayˆ_f^b(X_i) after permutationˆ_f^b(X_i^(j))

Importance of variablejfor treeˆ_f^b is the average increase in accuracy after permutation of variablej. Regression case:

I(X^j) = ¹ B

XB

b=1

1 n

Xn

i=1

(ˆf^b(X_i^(j))−Y_i)²−(ˆf^b(X_i)−Y_i)²

The greater the increase,

(15)

Why RF and Big Data?

on one hand,baggingis appealing because easily computed in parallel.

on thedark side, each bootstrap sample has the same size than the original dataset (i.e.,n, which is supposed to be LARGE) and contains approximately0.63ndifferent observations(which is also LARGe)!

(16)

Why RF and Big Data?

on one hand,baggingis appealing because easily computed in parallel.

on thedark side, each bootstrap sample has the same size than the original dataset (i.e.,n, which is supposed to be LARGE) and contains approximately0.63ndifferent observations(which is also LARGe)!

(17)

Sommaire

1 Random Forest

3 Application

(18)

Types of strategy to handle big data problems

(19)

Types of strategy to handle big data problems

Here:Bag of Little Bootstrap

(20)

Types of strategy to handle big data problems

(21)

Types of strategy to handle big data problems

(22)

Overview of BLB

[Kleiner et al., 2012,Kleiner et al., 2014]

method used to scale any bootstrap estimation

consistency result demonstrated for a bootstrap estimation

Here: we describe the approach in the simplified case ofbagging(as for random forest)

Framework: (X_i,Y_i)i=1,...,n a learning set. We want to define a predictor of Y ∈R^from^X given the learning set.

(23)

Overview of BLB

(24)

Overview of BLB

(25)

Problem with standard bagging

Whennis big, the number of different observations inτ_b is∼0.63n⇒still BIG!

First solution...: [Bickel et al., 1997]propose the “m-out-of-n” bootstrap: bootstrap samples have sizemwithmn

But: The quality of the estimator strongly depends onm!

Idea behind BLB

Use bootstrap samples having sizenbut witha very small number of different observations in each of them.

(26)

Problem with standard bagging

First solution...: [Bickel et al., 1997]propose the “m-out-of-n” bootstrap:

bootstrap samples have sizemwithmn

Idea behind BLB

(27)

Problem with standard bagging

Idea behind BLB

(28)

Problem with standard bagging

Idea behind BLB

(29)

Presentation of BLB

(X1,Y1). . .(Xn,Yn)

(X₁⁽¹⁾,Y₁⁽¹⁾). . .(X⁽¹⁾_m,Y_m⁽¹⁾)

(X₁^(B¹⁾,Y^(B₁¹⁾). . .(X^(B_m¹⁾,Y_m^(B¹⁾)

...

n₁^(1,1). . .n^(1,1)_m

n^(1,B₁ ²⁾. . .n^(1,B_m ²⁾

n^(B₁¹^,1). . .n^(B_m¹^,1)

n₁^(B¹^,B²⁾. . .n^(B_m¹^,B²⁾

...

ˆf⁽¹^,¹⁾

ˆf⁽¹^,^B²⁾

ˆf⁽^B¹^,¹⁾

ˆf⁽^B¹^,^B²⁾ ˆf¹

ˆf^B¹

ˆf^BLB sampling, no

replacement (sizem_n)

over-sampling

mean

(30)

Presentation of BLB

(X1,Y1). . .(Xn,Yn)

(X₁⁽¹⁾,Y₁⁽¹⁾). . .(X⁽¹⁾_m,Y_m⁽¹⁾)

...

n₁^(1,1). . .n^(1,1)_m

n^(1,B₁ ²⁾. . .n^(1,B_m ²⁾

n^(B₁¹^,1). . .n^(B_m¹^,1)

n₁^(B¹^,B²⁾. . .n^(B_m¹^,B²⁾

...

ˆf⁽¹^,¹⁾

ˆf⁽¹^,^B²⁾

ˆf⁽^B¹^,¹⁾

ˆf⁽^B¹^,^B²⁾ ˆf¹

ˆf^B¹

ˆf^BLB

sampling, no replacement (sizem_n)

over-sampling

mean

(31)

Presentation of BLB

(X1,Y1). . .(Xn,Yn)

(X₁⁽¹⁾,Y₁⁽¹⁾). . .(X⁽¹⁾_m,Y_m⁽¹⁾)

(X₁^(B¹⁾,Y^(B₁¹⁾). . .(X^(B_m¹⁾,Y_m^(B¹⁾)

...

n₁^(1,1). . .n^(1,1)_m

n^(1,B₁ ²⁾. . .n^(1,B_m ²⁾

n^(B₁¹^,1). . .n^(B_m¹^,1)

n₁^(B¹^,B²⁾. . .n^(B_m¹^,B²⁾

...

ˆf⁽¹^,¹⁾

ˆf⁽¹^,^B²⁾

ˆf⁽^B¹^,¹⁾

ˆf⁽^B¹^,^B²⁾ ˆf¹

ˆf^B¹

ˆf^BLB

over-sampling

mean

(32)

Presentation of BLB

(X1,Y1). . .(Xn,Yn)

(X₁⁽¹⁾,Y₁⁽¹⁾). . .(X⁽¹⁾_m,Y_m⁽¹⁾)

...

n₁^(1,1). . .n^(1,1)_m

n^(1,B₁ ²⁾. . .n^(1,B_m ²⁾

n^(B₁¹^,1). . .n^(B_m¹^,1)

...

ˆf⁽¹^,¹⁾

ˆf⁽¹^,^B²⁾

ˆf⁽^B¹^,¹⁾

ˆf¹

ˆf^B¹

ˆf^BLB

over-sampling

mean

(33)

Presentation of BLB

(X1,Y1). . .(Xn,Yn)

(X₁⁽¹⁾,Y₁⁽¹⁾). . .(X⁽¹⁾_m,Y_m⁽¹⁾)

(X₁^(B¹⁾,Y^(B₁¹⁾). . .(X^(B_m¹⁾,Y_m^(B¹⁾)

...

n₁^(1,1). . .n^(1,1)_m

n^(1,B₁ ²⁾. . .n^(1,B_m ²⁾

n^(B₁¹^,1). . .n^(B_m¹^,1)

n₁^(B¹^,B²⁾. . .n^(B_m¹^,B²⁾

...

ˆf⁽¹^,¹⁾

ˆf⁽¹^,^B²⁾

ˆf⁽^B¹^,¹⁾

ˆf⁽^B¹^,^B²⁾ ˆf¹

ˆf^B¹

ˆf^BLB

over-sampling

mean

(34)

Presentation of BLB

(X1,Y1). . .(Xn,Yn)

(X₁⁽¹⁾,Y₁⁽¹⁾). . .(X⁽¹⁾_m,Y_m⁽¹⁾)

...

n₁^(1,1). . .n^(1,1)_m

n^(1,B₁ ²⁾. . .n^(1,B_m ²⁾

n^(B₁¹^,1). . .n^(B_m¹^,1)

...

ˆf⁽¹^,¹⁾

ˆf⁽¹^,^B²⁾

ˆf⁽^B¹^,¹⁾

ˆf¹

ˆf^BLB sampling, no

replacement (sizem_n)

over-sampling

mean

(35)

What is over-sampling and why is it working?

BLB steps:

1 createB₁samples (without replacement)of sizem∼n^γ (with γ∈[0.5,1]: forn=10⁶andγ =0.6, typicalmis about 4000, compared to 630 000 for standard bootstrap

2 for every subsampleτ_b, repeatB₂times:

I over-sampling: affect weights(n₁, . . . ,n_m)simulated asM n,_m¹1m

to observations inτ_b

I estimation step: train an estimator with weighted observations

3 aggregate by averaging

Remark: Final sample size (P_m

i=1n_i) is equal ton(with replacement) as in standard bootstrap samples.

(36)

What is over-sampling and why is it working?

BLB steps:

1 createB₁samples (without replacement)of sizem∼n^γ (with γ∈[0.5,1]

I over-sampling: affect weights(n1, . . . ,nm)simulated asM

n,_m¹1^m to observations inτb

(37)

What is over-sampling and why is it working?

BLB steps:

I estimation step: train an estimator with weighted observations (if the learning algorithm allows a genuine processing of weights,

computational cost is low because of the small size ofm)

Remark: Final sample size (Pm

(38)

What is over-sampling and why is it working?

BLB steps:

(39)

What is over-sampling and why is it working?

BLB steps:

(40)

Overview of Map Reduce

Map Reduceis a generic method to deal with massive datasets stored on a distributed filesystem.

It has been developped by Google^TM [Dean and Ghemawat, 2004](see also

[Chamandy et al., 2012]for example of use at Google).

(41)

Overview of Map Reduce

Data

Data 1

Data 2

Data 3

The data are brokeninto several bits.

(42)

Overview of Map Reduce

Data

Data 1

Data 2

Data 3

Map

{(key,value)}

(43)

Overview of Map Reduce

Data

Data 1

Data 2

Data 3

Map

{(key,value)}

Map jobs must beindependent! Result: indexed data.

(44)

Overview of Map Reduce

Data

Data 1

Data 2

Data 3

Map

{(key,value)}

Reduce key=key_k

Reduce key=key₁

OUTPUT

(45)

MR implementation of random forest

A Map/Reduce implementation of random forest is included inMahout (Apache scalable machine learning library) which works as

[del Rio et al., 2014]:

data aresplit betweenQ bitssent to each Map job;

aMap jobtrain a random forest with a small number of trees in it;

there isno Reduce step(the final forest is the combination of all trees learned in the Map jobs).

Note that this implementationis not equivalent to the original random forest algorithmbecause the forests are not built on bootstrap samples of the original data set.

(46)

MR implementation of random forest

A Map/Reduce implementation of random forest is included inMahout (Apache scalable machine learning library) which works as

[del Rio et al., 2014]:

data aresplit betweenQ bitssent to each Map job;

aMap jobtrain a random forest with a small number of trees in it;

there isno Reduce step(the final forest is the combination of all trees learned in the Map jobs).

Note that this implementationis not equivalent to the original random forest algorithmbecause the forests are not built on bootstrap samples of the original data set.

(47)

Drawbacks of MR implementation of random forest

Locality of datacan yield to biased random forests in the different Map jobs⇒the combined forest might have poor prediction performances

OOB errorcannot be computed precisely because Map job are independent. A proxy of this quantity is given by the average of OOB errors obtained from the different Map tasks⇒again this quantity must be biased due to data locality (similar problem with VI).

(48)

Drawbacks of MR implementation of random forest

Locality of datacan yield to biased random forests in the different Map jobs⇒the combined forest might have poor prediction performances

OOB errorcannot be computed precisely because Map job are independent. A proxy of this quantity is given by the average of OOB errors obtained from the different Map tasks⇒again this quantity must be biased due to data locality (similar problem with VI).

(49)

Another MR implementation of random forest

... usingPoisson bootstrap[Chamandy et al., 2012]which is based on the fact that (for largen):

Binom n,¹ n

!

'Poisson(1)

1 Map step∀r =1, . . . ,Q (chunk of dataτ_r). ∀i ∈τ_r, generateB random i.i.d. random variables from Poisson(1)n^b_i (b =1, . . . ,B).

Output:(key,value)are(b, (i, n^b_i))for all pairs(i,b)stn^b_i ,⁰ (indicesi stn^b_i ,0 are in bootstrap sample numberb n^b_i times);

2 Reduce stepproceeds bootstrap sample numberb: a tree is built from indicesi stn^b_i ,^{0 repeated}ⁿ^b_i ^times.

Output: A tree... All trees are collected in a forest.

Closer to using RF directly on the entire datasetBut: every Reduce job should deal with approximately 0.63×ndifferent observations... (only the bootstrap part is simplified)

(50)

Another MR implementation of random forest

Binom n,¹ n

!

'Poisson(1)

Output:(key,value)are(b, (i, n^b_i))for all pairs(i,b)stn^b_i ,⁰ (indicesistn^b_i ,0 are in bootstrap sample numberb n^b_i times);

(51)

Another MR implementation of random forest

Binom n,¹ n

!

'Poisson(1)

(52)

Another MR implementation of random forest

Binom n,¹ n

!

'Poisson(1)

(53)

Another MR implementation of random forest

Binom n,¹ n

!

'Poisson(1)

(54)

Another MR implementation of random forest

Binom n,¹ n

!

'Poisson(1)

Closer to using RF directly on the entire datasetBut: every Reduce job .

(55)

Online learning framework

Data stream: Observations(X_i,Y_i)_i=1,...,n have been used to obtain a predictorˆ_f_n

New data arrive(X_i,Y_i)i=n+1,...,n+m:How to obtain a predictor from the entire dataset(X_i,Y_i)i=1,...,n+m?

Naive approach: re-train a model from(X_i,Y_i)i=1,...,n+m

More interesting approach:updateˆ_f_n with the new information (X_i,Y_i)i=n+1,...,n+m

Why is it interesting?

computational gainif the update has a small computational cost (it can even be interesting to deal directly with big data which do not arrive in stream)

storage gain

(56)

Online learning framework

storage gain

(57)

Online learning framework

storage gain

(58)

Framework of online bagging

ˆ_f_n = ¹ B

B

X

b=1

ˆ_f^b

n

in which ˆ_f^b

n has been built from a bootstrap sample in{1, . . . , n} we know how to updateˆ_f^b

n with new data online

Question: Can we update the bootstrap samples online when new data (X_i,Y_i)i=n+1,...,n+marrive?

(59)

Framework of online bagging

ˆ_f_n = ¹ B

B

X

b=1

ˆ_f^b

n

in which ˆ_f^b

n has been built from a bootstrap sample in{1, . . . , n} we know how to updateˆ_f^b

n with new data online

Question: Can we update the bootstrap samples online when new data (X_i,Y_i)i=n+1,...,n+m arrive?

(60)

Online bootstrap using Poisson bootstrap

1 generate weights for every bootstrap samples and every new observation:n^b_i ∼Poisson(1)fori =n+1, . . . ,n+mand b=1, . . . , B

2 updateˆ_f^b

n with the observationsX_isuch thatn_i^b ,0, each repeated n^b_i times

3 update the predictor:

ˆ_f_n+m = ¹ B

XB

b=1

ˆ_f^b

n+m.

(61)

Online bootstrap using Poisson bootstrap

2 updateˆ_f^b

ˆ_f_n+m = ¹ B

XB

b=1

ˆ_f^b

n+m.

(62)

Online bootstrap using Poisson bootstrap

2 updateˆ_f^b

ˆ_f_n+m = ¹ B

XB

b=1

ˆ_f^b

n+m.

(63)

PRF

InPurely Random Forest[Biau et al., 2008], the splits are generated independentlyfrom the data

splits are obtained by randomly choosing a variable and a splitting point within the range of this variable

decision is made in a standard way

(64)

PRF

InPurely Random Forest[Biau et al., 2008], the splits are generated independentlyfrom the data

splits are obtained by randomly choosing a variable and a splitting point within the range of this variable

decision is made in a standard way

(65)

Online PRF

PRF is described by:

∀b =1, . . . ,B,ˆ_f^b

n: PR tree for bootstrap sample numberb

∀b =1, . . . ,B, for all terminal leaflinˆ_f^b

n, obs^b_n^,lis the number of observations in(X_i)i=1, ...,nwhich falls in leafl and val^b,l_n is the average Y for these observations (regression framework)

Online update with Poisson bootstrap:

∀b =1, . . . ,B,∀i ∈ {n+1, . . . , n+m}stn^b_i ,0 and for the terminal leafl ofX_i:

val^b,l_i = ^val

b,l

i−1×obs^b,l_i₋₁+n_i^b×Y_i obs^b_i₋^,l₁+n^b_i (online update of the mean...)

obs^b,l_i =obs^b,l_i−1+n^b_i

(66)

Online PRF

∀b =1, . . . ,B,ˆ_f^b

val^b,l_i = ^val

b,l

i−1×obs^b,l_i−1+n_i^b×Y_i obs^b_i₋^,l₁+n^b_i (online update of the mean...)

obs^b,l_i =obs^b,l_i−1+n_i^b

(67)

Online PRF

∀b =1, . . . ,B,ˆ_f^b

val^b,l_i = ^val

b,l

i−1×obs^b,l_i−1+n_i^b×Y_i obs^b_i₋^,l₁+n^b_i (online update of the mean...)

obs^b,l_i =obs^b,l_i−1+n_i^b

(68)

Online RF

Developed to handle data streams (data arrive sequentially) in an online manner (we can not keep all data from the past):

[Saffari et al., 2009]

Can deal with massive data streams (addressing both Volume and Velocity characteristics), but also to handle massive (static) data, by running through the data sequentially

In depth adaptation of Breiman’s RF: even the tree growing mechanism is changed

Main idea: think only in terms ofproportionsof output classes, instead of observations (classification framework)

(69)

Sommaire

1 Random Forest

3 Application

(70)

When should we consider data as “big”?

We deal with Big Data when:

data are atgoogle scale(rare)

data are big compared to ourcomputing capacities

... and depending onwhat we need to do with them

[R Core Team, 2016,Kane et al., 2013]

R is not well-suited for working with data structures larger than about 10–20% of a computer’s RAM. Data exceeding 50% of available RAM are essentially unusable because the overhead of all but the simplest of calculations quickly consumes all available RAM. Based on these guidelines, we consider a data set large if it exceeds 20% of the RAM on a given machine and massive if it exceeds 50%.

(71)

When should we consider data as “big”?

data are big compared to ourcomputing capacities

... and depending onwhat we need to do with them

R is not well-suited for working with data structures larger than about 10–20% of a computer’s RAM. Data exceeding 50% of available RAM are essentially unusable because the overhead of all but the simplest of calculations quickly consumes all available RAM. Based on these guidelines, we consider a data set large if it exceeds 20% of the RAM on a given machine and massive if it exceeds 50%.

(72)

When should we consider data as “big”?

data are big compared to ourcomputing capacities... and depending onwhat we need to do with them

R is not well-suited for working with data structures larger than about 10–20% of a computer’s RAM. Data exceeding 50% of available RAM are essentially unusable because the overhead of all but the simplest of calculations quickly consumes all available RAM. Based on these guidelines, we consider a data set large if it exceeds 20% of the RAM on a given machine and massive if it

(73)

Implementation

standardrandomForestR package(done) bigmemorybigrfR package

MapReduce

I at hand(done)

I R packagesrmr,rhadoop

I Mahout implementation

Online RF: Python code from[Denil et al., 2013], and C++ code from

[Saffari et al., 2009].

(74)

MR-RF in practice: case study

[Genuer et al., 2015]

15,000,000 observations generated from: Y with

P(Y =1) =P(Y =−1) =0.5 and the conditional distribution of the (X^(j))j=1,...,7givenY =y

with probability equal to 0.7,X^(j)∼ N(jy,1)forj ∈ {1,2,3}and X^(j)∼ N(0,1)forj ∈ {4,5,6};

with probability equal to 0.3,X^j∼ N(0,1)forj ∈ {1,2,3}and X^(j)∼ N((j−3)y,1)forj ∈ {4,5,6};

X⁷ ∼ N(0,1).

Comparisonof subsampling, BLB, MR with well distributed data within Map jobs.

(75)

MR-RF in practice: case study

[Genuer et al., 2015]

15,000,000 observations generated from: Y with

P(Y =1) =P(Y =−1) =0.5 and the conditional distribution of the (X^(j))j=1,...,7givenY =y

with probability equal to 0.7,X^(j)∼ N(jy,1)forj ∈ {1,2,3}and X^(j)∼ N(0,1)forj ∈ {4,5,6};

with probability equal to 0.3,X^j∼ N(0,1)forj ∈ {1,2,3}and X^(j)∼ N((j−3)y,1)forj ∈ {4,5,6};

X⁷ ∼ N(0,1).

Comparisonof subsampling, BLB, MR with well distributed data within Map jobs.

(76)

Airline data

Benchmark data in Big Data articles (e.g.,[Wang et al., 2015])

containing more than124 millions of observationsand29 variables Aim: predictdelay_status(1=delayed, 0=on time) of a flight using 4 explanatory variables (distance,night,week-end,

departure_time).

Not really massive data: 12 Gocsv file Still useful to illustrate some Big Data issues:

I too large to fit in RAM (of most of nowadays laptops)

I R struggles to perform complex computations unless data take less than 10%−20%of RAM (total memory size of manipulated objects cannot exceed RAM limit)

I long computation times to deal with this dataset

Preliminary experiments on a Linux 64 bits server with 8 processors,

(77)

Comparison of different BDRF on a simulation study

sequential forest: took approximately 7 hours and the resulting OOB error was equal to 4.564e⁻³.

Method Comp. time BDerrForest errForest errTest sampling 10% 3 min 4.622e(-3) 4.381e(-3) 4.300e(-3) sampling 1% 9 sec 4.586e(-3) 4.363e(-3) 4.400e(-3) sampling 0.1% 1 sec 5.600e(-3) 4.714e(-3) 4.573e(-3) sampling 0.01% 0.3 sec 4.666e(-3) 5.957e(-3) 5.753e(-3) BLB-RF 5/20 1 min 4.138e(-3) 4.294e(-3) 4.267e(-3) BLB-RF 10/10 3 min 4.138e(-3) 4.278e(-3) 4.267e(-3) MR-RF 100/1 2 min 1.397e(-2) 4.235 e(-3) 4.006e(-3) MR-RF 100/10 2 min 8.646e(-3) 4.155e(-3) 4.293e(-3) MR-RF 10/10 6 min 8.501e(-3) 4.290e(-3) 4.253e(-3) MR-RF 10/100 21 min 4.556e(-3) 4.249e(-3) 4.260e(-3)

all methods provide satisfactory results comparable with sequential RF

average OOB error over the Map forests can be a bad approximation of true OOB error (sometimes optimistic, sometimes pessimistic)

(78)

Comparison of different BDRF on a simulation study

sequential forest: took approximately 7 hours and the resulting OOB error was equal to 4.564e⁻³.

Method Comp. time BDerrForest errForest errTest sampling 10% 3 min 4.622e(-3) 4.381e(-3) 4.300e(-3) sampling 1% 9 sec 4.586e(-3) 4.363e(-3) 4.400e(-3) sampling 0.1% 1 sec 5.600e(-3) 4.714e(-3) 4.573e(-3) sampling 0.01% 0.3 sec 4.666e(-3) 5.957e(-3) 5.753e(-3) BLB-RF 5/20 1 min 4.138e(-3) 4.294e(-3) 4.267e(-3) BLB-RF 10/10 3 min 4.138e(-3) 4.278e(-3) 4.267e(-3) MR-RF 100/1 2 min 1.397e(-2) 4.235 e(-3) 4.006e(-3) MR-RF 100/10 2 min 8.646e(-3) 4.155e(-3) 4.293e(-3) MR-RF 10/10 6 min 8.501e(-3) 4.290e(-3) 4.253e(-3) MR-RF 10/100 21 min 4.556e(-3) 4.249e(-3) 4.260e(-3) all methods provide satisfactory results comparable with sequential RF