Regression: data (X

(1)

1/39

Random forests Purely random forests Toy forests Hold-out random forests Conclusion

Analyse du risque de forêts purement aléatoires

Sylvain Arlot¹ (collaboration avec Robin Genuer²)

1Université Paris-Sud

2ISPED, Université Bordeaux 2

Séminaire AgroParisTech, Paris 1er Octobre 2018

Références: arXiv:1407.3939 arXiv:1604.01515

(2)

Outline

1 Random forests

2 Purely random forests

3 Toy forests in one dimension

4 Hold-out random forests

(3)

3/39

Outline

1 Random forests

(4)

Regression: data (X

1

, Y

1

), . . . , (X

n

, Y

n

)

−3

−2

−1 0 1 2 3 4

(5)

5/39

Goal: find the signal (denoising)

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

−3

−2

−1 0 1 2 3 4

(6)

Regression

Data Dn: (X1,Y1), . . . ,(Xn,Yn)∈R^d×R (i.i.d. ∼P) Y_i =s^?(X_i) +ε_i

with s^?(X) =E[Y|X] (regression function).

Goal: learn f measurable function X →Rs.t. the quadratic risk

E(X,Y)∼P

h

f(X)−s^?(X)²ⁱ is minimal.

(7)

6/39

Regression

Data Dn: (X1,Y1), . . . ,(Xn,Yn)∈R^d×R (i.i.d. ∼P) Y_i =s^?(X_i) +ε_i

with s^?(X) =E[Y|X] (regression function).

Goal: learn f measurable function X →Rs.t. the quadratic risk

E(X,Y)∼P

h f(X)−s^?(X)²ⁱ is minimal.

(8)

Regression tree (Breiman et al, 1984)

Tree: piecewise-constant

predictor, obtained by partitioning recursivelyR^d.

Restriction: splits parallel to the axes.

1 Choice of the partition U (tree structure)

2 For eachλ∈U(tree leaf), choice of the estimationβ^b_λ ofs^?(x) whenx ∈λ. Here,β^bλ=Yλ average of the (Y_i)_X_i∈λ.

(9)

7/39

Regression tree (Breiman et al, 1984)

Usually, at each step, one looks for the best split of the data into two groups

(minimize sum of

within-group variances)Dn.

2 For eachλ∈U(tree leaf), choice of the estimationβ^b_λ ofs^?(x) whenx ∈λ. Here,β^bλ=Yλ average of the (Yi)_X_i∈λ.

(10)

Regression tree (Breiman et al, 1984)

2 For eachλ∈U(tree leaf), choice of the estimationβ^b_λ ofs^?(x) whenx ∈λ.

Here,β^bλ=Yλ average of the (Yi)_X_i∈λ.

(11)

8/39

Random forest (Breiman, 2001)

Definition (Random forest (Breiman, 2001)) n

bsΘj,16j 6q^o collection of tree predictors, (Θj)_16j6q i.i.d. r.v.

independent fromDn.

Random forest predictor_bs obtained by aggregating the tree collection.

bs(x) = 1 q

q

X

j=1

bs_Θ_j(x)

ensemble method (Dietterich, 1999, 2000)

powerfulstatistical learningalgorithm, for both classification andregression.

(12)

Bagging (“bootstrap aggregating”)

Bootstrap(Efron, 1979): draw n i.i.d. r.v., uniform over (X_i,Y_i)/i = 1, . . . ,n (sampling with replacement)

⇒ resampleD_n^b

Bootstrapping a tree: _bs_tree^b =_bs_tree(D_n^b)

Bagging: bootstrap (q independent resamples) then aggregation

bs_bagging(x) = 1 q

q

X

j=1

bs_tree^b,j (x)

(13)

10/39

Random Forest-Random Inputs (Breiman, 2001)

Definition (RI tree)

In a RI tree, at each node,mtry variables are randomly chosen.

Then, the best cut direction is chosen only among the chosen variables.

Definition (Random forest RI)

A random forest RI (RF-RI) is obtained byaggregating RI trees built on independentbootstrap resamples.

RF-RI ⇔ bagging on RI trees

(14)

Random Forest-Random Inputs

Dn

Bootstrap

uu {{ ((

Dn^b,1

RI tree

Dn^b,2

. . . . . . Dn^b,q

. . . . . .

bs_Θ₁

Aggregation ₎₎ bs_Θ₂

##

. . . . . . _bs_Θ_q

vv

bsRF−RI

(15)

12/39

Example of application of random forests: Kinect

Depth image ⇒ depth comparison features at each pixel

⇒body part at each pixel ⇒body part positions ⇒ · · ·

Figures from Shotton et al (2011)

(16)

Theoretical results on RF-RI

Few theoretical results on Breiman’s original RF-RI Most results:

focus on aspecific partof the algorithm (resampling, split criterion),

modifythe algorithm (eg, subsampling instead of resampling) makestrong assumptionsons^?

References (see survey paperby Biau and Scornet, 2016):

Mentch & Hooker (2014), Scornet, Biau & Vert (2015), Wager & Athey (2015), ...

⇒ Here, we consider simplified RF models, for which a precise analysis is possible: purely random forests

(17)

13/39

Theoretical results on RF-RI

Few theoretical results on Breiman’s original RF-RI Most results:

focus on aspecific partof the algorithm (resampling, split criterion),

modifythe algorithm (eg, subsampling instead of resampling) makestrong assumptionsons^?

References (see survey paperby Biau and Scornet, 2016):

Mentch & Hooker (2014), Scornet, Biau & Vert (2015), Wager & Athey (2015), ...

⇒ Here, we consider simplified RF models, for which a precise analysis is possible: purely random forests

(18)

Outline

1 Random forests

(19)

15/39

Purely random forests

Definition (Purely random tree) bs_U(x) =^X

λ∈U

Y_λ(D_n)1x∈λ

whereYλ(Dn) is the average of (Yi)_X_i∈λ ,(Xi,Yi)∈Dn and the partitionUis independent fromDn.

Definition (Purely random forest)

bs(x) = 1 q

q

X

j=1

bs_Uj(x) withU¹, . . . ,U^q i.i.d., independent fromD_n.

Example (“hold-out RF” model): use someextra dataD⁰_nfor building the trees: U^j =URI(Dn^0?^j)) (can be done by splitting the sample into two subsamplesD_n andD_n⁰).

B

From now on, Dn is the sample used for computing the Y_λ(Dn), and we assume its size is n.

(20)

Purely random forests

bs(x) = 1 q

q

X

j=1

bs_U^j(x) = 1 q

q

X

j=1

X

λ∈U^j

Yλ(Dn)1x∈λ

withU¹, . . . ,U^q i.i.d., independent fromDn.

Example (“hold-out RF” model): use someextra data D_n⁰ for building the trees: U^j =URI(Dn^0?^j)) (can be done by splitting the sample into two subsamplesDn andD_n⁰).

B

(21)

15/39

Purely random forests

bs(x) = 1 q

q

X

j=1

bs_U^j(x) = 1 q

q

X

j=1

X

λ∈U^j

Yλ(Dn)1x∈λ

withU¹, . . . ,U^q i.i.d., independent fromDn.

Example (“hold-out RF” model): use someextra data D_n⁰ for building the trees: U^j =URI(Dn^0?^j)) (can be done by splitting the sample into two subsamplesDn andD_n⁰).

B

(22)

Purely random forests

U¹

U²

. . . . . . U^q

UsingDn, with or without resampling

Independent fromD_n

. . . . . .

bs_U¹

Aggregation ₍₍

bs_U²

""

. . . . . . bs_U^q

vv

bs_PRF

(23)

17/39

Purely random forests: theory

Consistency: Biau, Devroye & Lugosi (2008), Scornet (2014)

Rates of convergence: Breiman (2004), Biau (2012)

Some adaptivity to dimension reduction(sparse framework):

Biau (2012)

Forests decrease the estimation error(Biau, 2012; Genuer, 2012)

⇒ What about approximation error? Almost the same for a forest and a tree?

(24)

Purely random forests: theory

Consistency: Biau, Devroye & Lugosi (2008), Scornet (2014)

Rates of convergence: Breiman (2004), Biau (2012)

Some adaptivity to dimension reduction(sparse framework):

Biau (2012)

Forests decrease the estimation error(Biau, 2012; Genuer, 2012)

⇒ What about approximation error?

Almost the same for a forest and a tree?

(25)

18/39

Risk of a single tree (regressogram)

Given the partitionU, regressogram estimator bs_U(x) := ^X

λ∈U

Y_λ1x∈λ

whereY_λ is the average of (Y_i)_X_i_∈λ.

bs_U∈argmin

f∈S_U

(1 n

n

X

i=1

Yi −f(Xi)² )

whereS_U is the vector space of functions which are constant over eachλ∈U.

Define:

˜s_U(x) := ^X

λ∈U

β_λ1x∈λ whereβ_λ :=E[s^?(X)|X ∈λ] .

⇒˜s_U∈argmin_f_∈S

UE

h f(X)−s^?(X)²ⁱ and ˜s_U(x) =Ebs_U(x)|U

(26)

Risk of a single tree (regressogram)

Given the partitionU, regressogram estimator bs_U(x) := ^X

λ∈U

Y_λ1x∈λ

whereY_λ is the average of (Y_i)_X_i_∈λ.

bs_U∈argmin

f∈S_U

(1 n

n

X

i=1

Yi −f(Xi)² )

whereS_U is the vector space of functions which are constant over eachλ∈U.

Define:

˜s_U(x) := ^X

λ∈U

β_λ1x∈λ whereβ_λ :=E[s^?(X)|X ∈λ] .

(27)

19/39

Risk decomposition: single tree

E h

bs_U(X)−s^?(X)²ⁱ

=E

h ˜s_U(X)−s^?(X)²ⁱ+E h

bs_U(X)−˜s_U(X)²ⁱ

= Approximation error + Estimation error

Ifs^? is smooth,X ∼ U([0,1]) andUregular partition intoD pieces, then

E

h ˜s_U(X)−s^?(X)²ⁱ∝ 1 D²

If var(Y|X) =σ² does not depend onX, then E

h

bs_U(X)−˜s_U(X)²ⁱ≈ σ²D n

(28)

Risk decomposition: single tree

E h

=E

= Approximation error + Estimation error Ifs^? is smooth,X ∼ U([0,1]) andUregular partition intoD pieces, then

E

h ˜s_U(X)−s^?(X)²ⁱ∝ 1 D²

h

(29)

19/39

Risk decomposition: single tree

E h

=E

= Approximation error + Estimation error Ifs^? is smooth,X ∼ U([0,1]) andUregular partition intoD pieces, then

E

h ˜s_U(X)−s^?(X)²ⁱ∝ 1 D²

h

(30)

Approximation and estimation errors

0 20 40 60 80 100

0 0.05 0.1 0.15 0.2 0.25 0.3 0.35

Approx. error Estim. error E[Risk]

(31)

21/39

Risk decomposition: purely random forest

(U^j)_16j6q finite partitions, i.i.d. ∼ U Estimator (forest): _bs_U^1···q(x) := 1

q

X

j=1

bs_U^j(x) Ideal forest: ˜s_U^1···q(x) := 1

q

X

j=1

˜s_Uj(x) =Ebs_U^1···q(x)|U^1···q

Quadratic risk decomposition (givenX =x) E

h

bs_U^1···q(x)−s^?(x)²ⁱ=E

h ˜s_U^1···q(x)−s^?(x)²ⁱ +E

h

bs_U^1···q(x)−˜s_U^1···q(x)²ⁱ Approximation error: B_U,q(x) :=E

h

˜

s_U^1···q(x)−s^?(x)²ⁱ

(32)

Risk decomposition: purely random forest

(U^j)_16j6q finite partitions, i.i.d. ∼ U Estimator (forest): _bs_U^1···q(x) := 1

q

X

j=1

bs_U^j(x) Ideal forest: ˜s_U^1···q(x) := 1

q

X

j=1

˜s_Uj(x) =Ebs_U^1···q(x)|U^1···q

Quadratic risk decomposition (givenX =x) E

h

bs_U^1···q(x)−s^?(x)²ⁱ=E

h ˜s_U^1···q(x)−s^?(x)²ⁱ +E

h

bs_U^1···q(x)−˜s_U^1···q(x)²ⁱ Approximation error: B (x) :=E

h

˜

s (x)−s^?(x)²ⁱ

(33)

22/39

Bias decomposition (given X = x )

B_U,q(x) =B_U,∞(x) + V_U(x) q where B_U,∞(x) :=E˜s_U(x)−s^?(x)²

and V_U(x) := var ˜s_U(x)

B_U,∞(x) is the approx. error of the infinite forest: ˜sU,∞(x) :=E˜s_U(x) to be compared with theapproximation error of a single tree

BU,1(x) =BU,∞(x) +VU(x)

(34)

Outline

1 Random forests

(35)

24/39

Toy forests in one dimension

Assume: X = [0,1) andX uniform over [0,1)

U∼ U_k^toy defined by:

U=

0,1−T k

,

1−T

k ,2−T k

, . . . ,

k−T

k ,1

whereT has uniform distribution over [0,1].

(36)

Interpretation of the ideal infinite forest

Proposition (A. & Genuer, 2014)

For any x∈^h¹_k,1−¹_kⁱ, the ideal infinite forest at x satisfies:

˜

sU,∞(x) = (s^?∗h_k)(x) = Z 1

0

s^?(t)h_k(x−t)dt where

hk(u) =











k(1−ku) if 06u6 ¹_k k(1 +ku) if − ¹_k 6u 60 0 if |u|> ¹_k

(37)

26/39

Interpretation of the ideal infinite forest: proof

I_U(x) := the interval ofU to whichx belongs

˜

s_U(x) = 1

|I_U(x)|

Z

I_U(x)

s^?(t)dt

Ifx∈^h¹_k,1−¹_kⁱ, I_U(x) =^hx+^V^x_k⁻¹,x+^V_k^x whereV_x has uniform distribution over [0,1].

˜

sU,∞(x) =EU

˜s_U(x)

=k Z 1

0

s^?(t)P

x+Vx−1

k 6t<x+ Vx

k

dt

=k Z 1

0

s^?(t)P k(t−x)<V_x 6k(t−x) + 1dt

(38)

Interpretation of the ideal infinite forest: proof

I_U(x) := the interval ofU to whichx belongs

˜

s_U(x) = 1

|I_U(x)|

Z

I_U(x)

s^?(t)dt

Ifx∈^h¹_k,1−¹_kⁱ, I_U(x) =^hx+^V^x_k⁻¹,x+^V_k^x whereV_x has uniform distribution over [0,1].

˜

sU,∞(x) =EU

˜s_U(x)

=k Z 1

0

s^?(t)P

x+Vx−1

k 6t<x+ Vx

k

dt

=k Z 1

0

s^?(t)P k(t−x)<Vx 6k(t−x) + 1dt

(39)

27/39

Analysis of the approximation error

(H2) s^? twice differentiable over (0,1) ands^?00 bounded Taylor-Lagrange formula: for everyt ∈(0,1), some ct,x ∈(0,1) exists such that

s^?(t)−s^?(x) =s^?0(x)(t−x) +1

2s^?00(c_t,x)(t−x)²

Therefore,

˜

s_U(x)−s^?(x) =k Z x+^Vx_k

x+^Vx_k⁻¹

(s^?(t)−s^?(x))dt

=k s^?0(x) Z x+^Vx_k

x+^Vx_k⁻¹

(t−x)dt+R₁(x)

= s^?0(x) k

Vx−1

2

+R1(x) whereR₁(x) = ^k₂ ^R^x+

Vx k

x+^Vx−1_k s^?00(c_t,x)(t−x)²dt

(40)

Analysis of the approximation error

(H2) s^? twice differentiable over (0,1) ands^?00 bounded Taylor-Lagrange formula: for everyt ∈(0,1), some ct,x ∈(0,1) exists such that

s^?(t)−s^?(x) =s^?0(x)(t−x) +1

2s^?00(c_t,x)(t−x)² Therefore,

˜

s_U(x)−s^?(x) =k Z x+^Vx_k

x+^Vx_k⁻¹

(s^?(t)−s^?(x))dt

=k s^?0(x) Z x+^Vx_k

x+^Vx−1_k

(t−x)dt+R₁(x)

= s^?0(x) k

Vx−1

2

+R1(x)

x+^Vx

(41)

28/39

Analysis of the approximation error

E_U˜s_U(x)−s^?(x)² 6

k⁴ V_U(x) ∼

k→+∞

k² Proposition (A. & Genuer, 2014)

Assuming (H2), for every x ∈^h_k¹,1−¹_kⁱ, B_U^toy

k ,1(x) ∼

k→+∞

k² B_U^toy

k ,∞(x)6

k⁴ Z 1−¹

k 1 k

B_U^toy

k ,1(x)dx ∼

k→+∞

k²

Z 1−¹

k 1 k

B_U^toy

k ,∞(x)dx 6

k⁴ Ratek⁻⁴ is tight assuming:

(H3) s^? three times differentiable over (0,1) ands^?000 bounded

(42)

Estimation error

General fact (Jensen’s inequality):

E h

bs_U,∞(X)−˜s_U,∞(X)²ⁱ6E h

For the toy forest, without any resampling for computing labels and assuming that var(Y|X) =σ²:

E h

bs_U(X)−˜s_U(X)²ⁱ≈ σ²k n E

h

bs_U,∞(X)−˜s_U,∞(X)²ⁱ≈ 2 3

σ²k n (A. & Genuer, 2016)

(43)

29/39

Estimation error

General fact (Jensen’s inequality):

E h

bs_U,∞(X)−˜s_U,∞(X)²ⁱ6E h

bs_U(X)−˜s_U(X)²ⁱ For the toy forest, without any resampling for computing labels and assuming that var(Y|X) =σ²:

E h

bs_U(X)−˜s_U(X)²ⁱ≈ σ²k n E

h

bs_U,∞(X)−˜s_U,∞(X)²ⁱ≈ 2 3

σ²k n (A. & Genuer, 2016)

(44)

Summary: risk analysis

Single tree Infinite forest

(q= 1) (q=∞)

E h

bs_U^1···q(x)−s^?(x)²ⁱ≈ c1(s^?,x) k² +σ²k

n

c2(s^?,x)

k⁴ +2σ²k 3n

where c1(s^?,x) =s^?0(x)²

12 and c2(s^?,x) =s^?00(x)² 144 . Assumptions:

x ∈(0,1) far from boundary

(H3) s^? three times differentiable over (0,1) ands^?000 bounded X uniform over [0,1]

var(Y|X) =σ²

(45)

31/39

Rates of convergence

Corollary: risk convergence rates (far from boundaries, withk =k_n^? optimal):

Tree> n^−2/3

Infinite forest6 n^−4/5 ⇒ minimax C²

Remarks:

q > (k_n^?)² is sufficient to get an “infinite” forest with subsampling a out ofn for computing labels: estimation error of a single tree ^σ_a²^k instead of ^σ_n²^k; no change for infinite forest

(46)

Rates of convergence

Corollary: risk convergence rates (far from boundaries, withk =k_n^? optimal):

Tree> n^−2/3

Infinite forest6 n^−4/5 ⇒ minimax C²

Remarks:

q > (k_n^?)² is sufficient to get an“infinite” forest with subsampling a out ofn for computing labels:

estimation error of a single tree ^σ_a²^k instead of ^σ_n²^k; no change for infinite forest

(47)

32/39

Outline

1 Random forests

(48)

Definition (Biau, 2012)

SplitD_n into D_n₁ andD_n₂ U¹

U²

. . . U^q

UsingD_n₂, no resampling here

RI partitions, usingD_n₁

. . . . . .

bs_U¹

Aggregation ₎₎ bs_U²

##

. . . _bs_U^q

{{

bsHO−RF

(49)

34/39

Numerical experiments: framework

Data generation:

Xi ∼ U([0,1]^d) Yi =s^?(Xi) +εi

εi ∼ N(0, σ²) σ²= 1/16 s^? :x∈[0,1]^d 7→ 1

10×^h10 sin(πx₁x₂) + 20(x₃−0.5)²+ 10x₄+ 5x₅ⁱ. Data split: n1 = 1 280 n2= 25 600

Forests definition:

nodesize= 1 k ∈ {2⁵,2⁶,2⁷,2⁸}

“Large” forests are made ofq =k trees.

Compute integrated approximation/estimation errors

(50)

Numerical experiments: results (d = 5)

Single tree Large forest No bootstrap

mtry=d

0.13

k^0.17 +1.04σ²k n₂

0.13

k^0.17+ 1.04σ²k n₂ Bootstrap

mtry=d

0.14

k^0.17 +1.06σ²k n2

0.15

k^0.29+ 0.08σ²k n2

No bootstrap mtry=bd/3c

0.23

k^0.19 +1.01σ²k n2

0.06

k^0.31+ 0.06σ²k n2

Bootstrap mtry=bd/3c

0.25

k^0.20 +1.02σ²k n2

0.06

k^0.34+ 0.05σ²k n2

2

d + 2 ≈0.286 4

d+ 4 ≈0.444

(51)

36/39

Numerical experiments: results (d = 10)

Single tree Large forest No bootstrap

mtry=d

0.11

k^0.12 +1.03σ²k n₂

0.11

k^0.12+ 1.03σ²k n₂ Bootstrap

mtry=d

0.11

k^0.11 +1.05σ²k n2

0.10

k^0.19+ 0.04σ²k n2

No bootstrap mtry=bd/3c

0.21

k^0.18 +1.08σ²k n2

0.08

k^0.25+ 0.04σ²k n2

Bootstrap mtry=bd/3c

0.20

k^0.16 +1.05σ²k n2

0.07

k^0.26+ 0.03σ²k n2

2

d + 2 ≈0.167 4

d+ 4 ≈0.286

(52)

Conclusion

Forests improve the order of magnitudeof the approximation error, compared to a single tree

Estimation error seems to change only by a constant factor (at least for toy forests);

not contradictory with literature: here, we fix k; different picture if nodesizeis fixed (+subsampling)

Randomization:

randomization of labels seems to have no impact;

strong impact of randomization of partitions(hold-out RF: both bootstrap and mtry)

(53)

37/39

Conclusion

Forests improve the order of magnitudeof the approximation error, compared to a single tree

Estimation error seems to change only by a constant factor (at least for toy forests);

not contradictory with literature: here, we fix k; different picture if nodesizeis fixed (+subsampling)

Randomization:

randomization of labels seems to have no impact;

strong impact of randomization of partitions(hold-out RF:

both bootstrap and mtry)

(54)

Approximation error: generalization

General result on the approximation errorunder (H2)/(H3):

e.g., roughly, if x iscentered in its cell (on average overU), tree approx. error∝M₂ infinite forestapprox. error∝M²₂ whereM₂ ≈averagesquare distance fromx to the boundary of its cell (∝k⁻² for toy forests)

toy forests in dimension d: approximation error∝k^−2/d vs. k^−4/d (infinite forest reachesminimaxC² rates)

purely uniformly random forests in dimension 1(split a random cell, chosen with probability equal to its volume): ≈toy balanced purely random forests(full binary tree, uniform splits) in dimension d: k^−α (tree) vs. k^−2α (forest) where α=−log₂1− _2d¹ ⇒ not minimax rates!

Mondrian forests(Mourtada, Gaïffas & Scornet 2018).

(55)

38/39

Approximation error: generalization

General result on the approximation errorunder (H2)/(H3):

e.g., roughly, if x iscentered in its cell (on average overU), tree approx. error∝M₂ infinite forestapprox. error∝M²₂ whereM₂ ≈averagesquare distance fromx to the boundary of its cell (∝k⁻² for toy forests)

toy forests in dimension d: approximation error∝k^−2/d vs.

k^−4/d (infinite forest reachesminimaxC² rates)

purely uniformly random forests in dimension 1(split a random cell, chosen with probability equal to its volume): ≈toy balanced purely random forests(full binary tree, uniform splits) in dimension d: k^−α (tree) vs. k^−2α (forest) where α=−log₂1− _2d¹ ⇒ not minimax rates!

Mondrian forests(Mourtada, Gaïffas & Scornet 2018).

(56)

Open problems / future work

Theory on approximation error of hold-out RF?

⇒ understand the typical shape of the cell that contains x, for a RI tree

(x centered on average? square distance to boundary?)

Theory on estimation errorof other models (beyond toy)?

of hold-out RF?

Extensive numerical experiments? (other functions s^?, ...)