Analysis of purely random forests

(1)

Random forests Purely random forests Toy forests Hold-out random forests Conclusion

Analysis of purely random forests

Sylvain Arlot^1,2 (joint work with Robin Genuer³)

1Universit´e Paris-Saclay

2Inria Saclay, Celeste project-team

3ISPED, Universit´e de Bordeaux

Colloquium, Department of Statistics and Actuarial Science, The University of Iowa

March 18, 2021

(2)

Outline

1 Random forests

2 Purely random forests

3 Toy forests

4 Hold-out random forests

(3)

Outline

1 Random forests

3 Toy forests

(4)

Regression: data (X

1

, Y

1

), . . . , (X

n

, Y

n

)

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

−3

−2

−1 0 1 2 3 4

(5)

Goal: find the signal (denoising)

−2

−1 0 1 2 3 4

(6)

Regression

Data Dn: (X1,Y1), . . . ,(Xn,Yn)∈R^p×R (i.i.d. ∼P) Yi =s^?(Xi) +εi

with s^?(X) =E[Y|X] (regression function).

Goal: learn f measurable function X →Rs.t. the quadratic risk

E(X,Y)∼P

h f(X)−s^?(X)²ⁱ is minimal.

(7)

Regression

Data Dn: (X1,Y1), . . . ,(Xn,Yn)∈R^p×R (i.i.d. ∼P) Yi =s^?(Xi) +εi

with s^?(X) =E[Y|X] (regression function).

Goal: learn f measurable function X →Rs.t. the quadratic risk

E(X,Y)∼P

h f(X)−s^?(X)²ⁱ is minimal.

(8)

Regression tree (Breiman et al, 1984)

Tree: piecewise-constant

predictor, obtained by partitioning recursivelyR^p.

Restriction: splits parallel to the axes.

1 Choice of the partition U (tree structure)

2 For eachλ∈U(tree leaf), choice of the estimationβ^b_λ ofs^?(x) whenx ∈λ. Here,β^bλ=Yλ average of the (Yi)_X_i∈λ.

(9)

Regression tree (Breiman et al, 1984)

Usually, at each step, one looks for the best split of the data into two groups

(minimize sum of

within-group variances)Dn.

2 For eachλ∈U(tree leaf), choice of the estimationβ^b_λ ofs^?(x) whenx ∈λ. Here,β^b_λ=Y_λ average of the (Yi)_X_i∈λ.

(10)

Regression tree (Breiman et al, 1984)

2 For eachλ∈U(tree leaf), choice of the estimationβ^b_λ ofs^?(x) whenx ∈λ.

Here,β^bλ=Yλ average of the (Yi)_X_i∈λ.

(11)

Random forest (Breiman, 2001)

Definition (Random forest (Breiman, 2001)) n

bsΘj,16j 6q^o collection of tree predictors, (Θj)_16j6q i.i.d. r.v.

independent fromDn.

Random forest predictor_bs obtained by aggregating the tree collection.

bs(x) = 1 q

q

X

j=1

bs_Θ_j(x)

ensemble method (Dietterich, 1999, 2000)

(12)

Bagging (“bootstrap aggregating”)

Bootstrap(Efron, 1979): draw n i.i.d. r.v., uniform over (X_i,Y_i)/i = 1, . . . ,n (sampling with replacement)

⇒ resampleD_n^b

Bootstrapping a tree: _bs_tree^b =_bs_tree(D_n^b)

Bagging: bootstrap (q independent resamples) then aggregation

bs_bagging(x) = 1 q

q

X

j=1

bs_tree^b,j (x)

(13)

Random Forest-Random Inputs (Breiman, 2001)

Definition (RI tree)

In a RI tree, at each node,mtry variables are randomly chosen.

Then, the best cut direction is chosen only among the chosen variables.

Definition (Random forest RI)

A random forest RI (RF-RI) is obtained byaggregating RI trees built on independentbootstrap resamples.

(14)

Random Forest-Random Inputs

Dn

Bootstrap

uu {{ ((

D_n^b,1

RI tree

D_n^b,2

. . . . . . D_n^b,q

. . . . . .

bsΘ1

Aggregation ₎₎ bsΘ2

##

. . . . . . _bsΘq

vv

bsRF−RI

(15)

Theoretical results on RF-RI

Few theoretical results on Breiman’s original RF-RI, despite their excellent numerical performance (eg, Fern´andez-Delgado et al, 2014)

Most results:

focus on aspecific partof the algorithm (resampling, split criterion),

modifythe algorithm (eg, subsampling instead of resampling) makestrong assumptionsons^?

References (see survey paperby Biau and Scornet, 2016):

Scornet, Biau & Vert (2015), Mentch & Hooker (2016), Wager & Athey (2018), Genuer & Poggi (2019), ...

⇒ Here, we consider simplified RF models, for which a precise analysis is possible: purely random forests

(16)

Theoretical results on RF-RI

Few theoretical results on Breiman’s original RF-RI, despite their excellent numerical performance (eg, Fern´andez-Delgado et al, 2014)

Most results:

focus on aspecific partof the algorithm (resampling, split criterion),

modifythe algorithm (eg, subsampling instead of resampling) makestrong assumptionsons^?

References (see survey paperby Biau and Scornet, 2016):

Scornet, Biau & Vert (2015), Mentch & Hooker (2016), Wager & Athey (2018), Genuer & Poggi (2019), ...

⇒ Here, we consider simplified RF models, for which a precise analysis is possible: purely random forests

(17)

Outline

1 Random forests

3 Toy forests

(18)

Purely random forests

Definition (Purely random tree) bs_U(x) =^X

λ∈U

Y_λ(D_n)1x∈λ

whereYλ(Dn) is the average of (Yi)_X_i∈λ ,(Xi,Yi)∈D_n and the partitionUis independent fromD_n.

Definition (Purely random forest)

bs(x) = 1 q

q

X

j=1

bs_U^j(x) withU¹, . . . ,U^q i.i.d., independent fromDn.

Example (“hold-out RF” model): use someextra dataD⁰_nfor building the trees: U^j =URI(Dn^0?^j)) (can be done by splitting the sample into two subsamplesDn andD_n⁰).

B

From now on, D_n is the sample used for computing (Y_λ(D_n))λ∈U, and we assume its size isn.

(19)

Purely random forests

bs(x) = 1 q

q

X

j=1

bs_Uj(x) = 1 q

q

X

j=1

X

λ∈U^j

Y_λ(D_n)1x∈λ

withU¹, . . . ,U^q i.i.d., independent fromD_n.

Example (“hold-out RF” model): use someextra data D_n⁰ for building the trees: U^j =URI(Dn^0?^j)) (can be done by splitting the sample into two subsamplesD_n andD_n⁰).

B

From now on, Dn is the sample used for computing (Y_λ(Dn))λ∈U, and we assume its size isn.

(20)

Purely random forests

bs(x) = 1 q

q

X

j=1

bs_Uj(x) = 1 q

q

X

j=1

X

λ∈U^j

Y_λ(D_n)1x∈λ

withU¹, . . . ,U^q i.i.d., independent fromD_n.

Example (“hold-out RF” model): use someextra data D_n⁰ for building the trees: U^j =URI(Dn^0?^j)) (can be done by splitting the sample into two subsamplesD_n andD_n⁰).

B

From now on, Dn is the sample used for computing (Y_λ(Dn))λ∈U, and we assume its size isn.

(21)

Purely random forests

U¹

U²

. . . . . . U^q

UsingDn, with or without resampling

Independent fromD_n

. . . . . .

bs_U1

Aggregation ₍₍

bs_U2

""

. . . . . . _bs_U^q

vvs

(22)

Purely random forests: theory

Consistency: Biau, Devroye & Lugosi (2008), Scornet (2014)

Rates of convergence: Breiman (2004), Biau (2012), Klusowski (2018), Duroux & Scornet (2018), Mourtada, Gaiffas & Scornet (2017 & 2020)

Some adaptivity to dimension reduction(sparse framework):

Biau (2012), Klusowski (2018)

Forests decrease the estimation error(Biau, 2012; Genuer, 2012)

⇒ What about approximation error? Almost the same for a forest and a tree?

(23)

Purely random forests: theory

Consistency: Biau, Devroye & Lugosi (2008), Scornet (2014)

Rates of convergence: Breiman (2004), Biau (2012), Klusowski (2018), Duroux & Scornet (2018), Mourtada, Gaiffas & Scornet (2017 & 2020)

Some adaptivity to dimension reduction(sparse framework):

Biau (2012), Klusowski (2018)

Forests decrease the estimation error(Biau, 2012; Genuer, 2012)

⇒ What about approximation error?

(24)

Risk of a single tree (regressogram)

Given the partitionU, regressogram estimator bs_U(x) := ^X

λ∈U

Yλ1x∈λ

whereY_λ is the average of (Y_i)_X_i∈λ.

bs_U∈argmin

f∈S_U

(1 n

n

X

i=1

Yi −f(Xi)² )

whereS_U is the vector space of functions which are constant over eachλ∈U.

Define:

˜s_U(x) := ^X

λ∈U

β_λ1x∈λ whereβ_λ :=E[s^?(X)|X ∈λ] .

⇒˜s_U∈argmin_f_∈S

UE

h f(X)−s^?(X)²ⁱ and ˜s_U(x) =Ebs_U(x)|U

(25)

Risk of a single tree (regressogram)

Given the partitionU, regressogram estimator bs_U(x) := ^X

λ∈U

Yλ1x∈λ

whereY_λ is the average of (Y_i)_X_i∈λ.

bs_U∈argmin

f∈S_U

(1 n

n

X

i=1

Yi −f(Xi)² )

whereS_U is the vector space of functions which are constant over eachλ∈U.

Define:

˜s_U(x) := ^Xβ_λ1x∈λ whereβ_λ :=E[s^?(X)|X ∈λ] .

(26)

Risk decomposition: single tree

E

h s^?(X)−_bs_U(X)²ⁱ

=E

h s^?(X)−˜s_U(X)²ⁱ+E

h ˜s_U(X)−_bs_U(X)²ⁱ

= Approximation error + Estimation error

Ifs^? is smooth,X ∼ U([0,1]^p) andUregular partition intoD pieces, then

E h

s^?(X)−˜s_U(X)²ⁱ∝ 1 D^2/p

If var(Y|X) =σ² does not depend onX, then E

h ˜s_U(X)−_bs_U(X)²ⁱ≈ σ²D n

(27)

Risk decomposition: single tree

E

=E

E h

s^?(X)−˜s_U(X)²ⁱ∝ 1 D^2/p

(28)

Risk decomposition: single tree

E

=E

E h

s^?(X)−˜s_U(X)²ⁱ∝ 1 D^2/p

(29)

Approximation and estimation errors, p = 1

0.05 0.1 0.15 0.2 0.25 0.3 0.35

Approx. error Estim. error E[Risk]

(30)

Risk decomposition: purely random forest

(U^j)16j6q finite partitions, i.i.d. ∼ U Estimator (forest): _bs_U^1···q(x) := 1

q

X

j=1

bs_U^j(x) Ideal forest: ˜s_U^1···q(x) := 1

q

X

j=1

˜s_U^j(x) =Ebs_U^1···q(x)|U^1···q

Quadratic risk decomposition (givenX =x) E

h s^?(x)−_bs_U^1···q(x)²ⁱ=E

h s^?(x)−˜s_U^1···q(x)²ⁱ +E

h ˜s_U^1···q(x)−_bs_U^1···q(x)²ⁱ+δ_U^1···q(x) Approximation error: B_U,q(x) :=E

h s^?(x)−˜s_U^1···q(x)²ⁱ

(31)

Risk decomposition: purely random forest

(U^j)16j6q finite partitions, i.i.d. ∼ U Estimator (forest): _bs_U^1···q(x) := 1

q

X

j=1

bs_U^j(x) Ideal forest: ˜s_U^1···q(x) := 1

q

X

j=1

˜s_U^j(x) =Ebs_U^1···q(x)|U^1···q

Quadratic risk decomposition (givenX =x) E

h s^?(x)−_bs_U^1···q(x)²ⁱ=E

h s^?(x)−˜s_U^1···q(x)²ⁱ +E

h ˜s_U^1···q(x)−_bs_U^1···q(x)²ⁱ+δ_U^1···q(x)

(32)

Approximation error decomposition (given X = x )

B_U,q(x) =B_U,∞(x) + V_U(x) q where B_U,∞(x) :=Es^?(x)−˜s_U(x)²

and V_U(x) := var ˜s_U(x)

B_U,∞(x) is the approx. error of the infinite forest: ˜s_U,∞(x) :=E˜s_U(x) to be compared with theapproximation error of a single tree

B_U_,1(x) =B_U,∞(x) +V_U(x)

(33)

Outline

1 Random forests

3 Toy forests

(34)

Toy forests

Assume: X = [0,1)^p andX uniform over [0,1)^p

Ifp= 1,U∼ U_k^toy defined by:

U=

0,1−T k

,

1−T

k ,2−T k

, . . . ,

k−T

k ,1

whereT has uniform distribution over [0,1].

T/k T/k T/k T/k

1 3/k

2/k 1/k

0

Ifp>1,T_j for each coordinate j = 1, . . . ,p, independent

(35)

Toy forests

Assume: X = [0,1)^p andX uniform over [0,1)^p

Ifp= 1,U∼ U_k^toy defined by:

U=

0,1−T k

,

1−T

k ,2−T k

, . . . ,

k−T

k ,1

whereT has uniform distribution over [0,1].

T/k T/k T/k T/k

1 3/k

2/k 1/k

0

(36)

Toy forest, p = 2: example

0.0 0.2 0.4 0.6 0.8 1.0

0.00.20.40.60.81.0

1

X2

(37)

Interpretation of the ideal infinite forest (p = 1)

Proposition (A. & Genuer, 2014–2020) For any x ∈^h_k¹,1− ¹_kⁱ, the ideal infinite forest at x satisfies:

˜

s_U,∞(x) = (s^?∗h_k)(x) = Z 1

0

s^?(t)h_k(x−t)dt where

h_k(u) =







k(1−ku) if 06u6 _k¹ k(1 +ku) if − ¹_k 6u 60

h_k(u) 0k

(38)

Interpretation of the ideal infinite forest (p = 1): proof

I_U(x) := the interval ofU to whichx belongs

˜

s_U(x) = 1

|I_U(x)|

Z

I_U(x)

s^?(t)dt

Ifx∈^h¹_k,1−¹_kⁱ, I_U(x) =^hx+^V^x_k⁻¹,x+^V_k^x whereV_x has uniform distribution over [0,1].

˜

s_U,∞(x) =E_U˜s_U(x)

=k Z 1

0

s^?(t)P

x+Vx −1

k 6t <x+Vx

k

dt

=k Z 1

0

s^?(t) P k(t−x)<Vx 6k(t−x) + 1

| {z }

=hk(x−t) if 1/k6x61−1/k

dt

(39)

Interpretation of the ideal infinite forest (p = 1): proof

I_U(x) := the interval ofU to whichx belongs

˜

s_U(x) = 1

|I_U(x)|

Z

I_U(x)

s^?(t)dt

Ifx∈^h¹_k,1−¹_kⁱ, I_U(x) =^hx+^V^x_k⁻¹,x+^V_k^x whereV_x has uniform distribution over [0,1].

˜

s_U,∞(x) =E_U˜s_U(x)

=k Z 1

0

s^?(t)P

x+Vx −1

k 6t<x+Vx

k

dt

=k Z 1

s^?(t) P k(t−x)<V 6k(t−x) + 1 dt

(40)

Analysis of the approximation error, p = 1, x ∈

^h¹_k

, 1 −

_k¹ⁱ

(H2) s^? twice differentiable over (0,1) ands^?00 bounded Taylor-Lagrange formula: for everyt ∈(0,1), some ct,x ∈(0,1) exists such that

s^?(t)−s^?(x) =s^?0(x)(t−x) +1

2s^?00(c_t,x)(t−x)²

Therefore,

˜

s_U(x)−s^?(x) =k Z x+^Vx_k

x+^Vx_k⁻¹

(s^?(t)−s^?(x))dt

=k s^?0(x) Z x+^Vx_k

x+^Vx_k⁻¹

(t−x)dt+R₁(x)

= s^?0(x) k

Vx−1

2

+R1(x) whereR₁(x) = ^k₂ ^R^x+

Vx k

x+^Vx−1_k s^?00(c_t,x)(t−x)²dt.

(41)

Analysis of the approximation error, p = 1, x ∈

^h¹_k

, 1 −

_k¹ⁱ

(H2) s^? twice differentiable over (0,1) ands^?00 bounded Taylor-Lagrange formula: for everyt ∈(0,1), some ct,x ∈(0,1) exists such that

s^?(t)−s^?(x) =s^?0(x)(t−x) +1

2s^?00(c_t,x)(t−x)² Therefore,

˜

s_U(x)−s^?(x) =k Z x+^Vx_k

x+^Vx_k⁻¹

(s^?(t)−s^?(x))dt

=k s^?0(x) Z x+^Vx_k

x+^Vx−1_k

(t−x)dt+R₁(x)

= s^?0(x) k

Vx−1

2

+R1(x)

(42)

Analysis of the approximation error, p = 1, x ∈

^h¹_k

, 1 −

_k¹ⁱ

(H2) s^? twice differentiable over (0,1) ands^?00 bounded

˜

s_U(x)−s^?(x) = s^?0(x) k

V_x−1

2

+R₁(x) whereR1(x) = ^k₂ ^R^x+

Vx k

x+^Vx_k⁻¹s^?00(ct,x)(t−x)²dt.

Hence, B_Utoy

k ,∞(x) =E_Us^?(x)−˜s_U(x)² =E_UR₁(x)² 6 k⁴ and

V_Utoy

k (x) = var

s^?0(x) k

Vx−1

2

+R1(x)

k→+∞∼

s^?0(x)²var(Vx)

k² .

(43)

Analysis of the approximation error, p > 1

(Hθ)s^? ∈ C^1,(θ−1)(X) H¨older space,θ∈[1,2]

B_Utoy

k ,∞(x) =E_Us^?(x)−˜s_U(x)² 6

k^2θ/p V_Utoy k

(x) ∼

k→+∞

k^2/p

Proposition (A. & Genuer, 2014–2020) Assuming (Hθ),θ∈[1,2],∀x ∈¹_k,1−¹_k^p,

B_Utoy

k ,1(x) ∼

k→+∞

k^2/p B_Utoy

k ,∞(x)6

k^2θ/p Z

[¹^,1−¹]^pB_Utoy

k ,1(x)dx ∼

k→+∞

k^2/p

Z

[¹^,1−¹]^pB_Utoy

k ,∞(x)dx 6

k^2θ/p

(44)

Estimation error

General fact (Jensen’s inequality):

E

h ˜s_U,∞(X)−_bs_U,∞(X)²ⁱ6E

For the toy forest, without any resampling for computing labels and assuming that var(Y|X) =σ²:

E

h ˜s_U(X)−_bs_U(X)²ⁱ≈ σ²k n E

h ˜s_U,∞(X)−_bs_U,∞(X)²ⁱ≈ 2 3

σ²k n (A. & Genuer, 2016)

(45)

Estimation error

General fact (Jensen’s inequality):

E

h ˜s_U,∞(X)−_bs_U,∞(X)²ⁱ6E

For the toy forest, without any resampling for computing labels and assuming that var(Y|X) =σ²:

E

h ˜s_U(X)−_bs_U(X)²ⁱ≈ σ²k n E

h ˜s_U,∞(X)−_bs_U_,∞(X)²ⁱ≈ 2 3

σ²k n

(46)

Summary: risk analysis

Single tree Infinite forest

(q = 1) (q =∞)

E

h s^?(x)−_bs_U^1···q(x)²ⁱ ≈ c₁(s^?,x) k^2/p +σ²k

n 6 c_θ⁰(s^?,x)

k^2θ/p +2σ²k 3n

where c1(s^?,x) = ^s^?0₁₂^(x)² . Assumptions:

x ∈(0,1)^p far from boundary (Hθ) s^?∈ C^1,(θ−1)(X),θ∈[1,2]

X uniform over [0,1)^p var(Y|X) =σ²

no resampling for computing labels

(47)

Rates of convergence

Corollary: risk convergence rates (far from boundaries, withk =k_n^? optimal), under (Hθ),θ∈[1,2]:

Tree risk> n^−2/(2+p) if s^? not constant,θ >1 Infinite forest risk6 n^{−2θ/(2θ+p)} ⇒ minimax C^θ, θ∈[1,2]

Remarks:

q > (k_n^?)² is sufficient to get an “infinite” forest with subsampling a out ofn for computing labels: estimation error of a single tree ^σ_a²^k instead of ^σ_n²^k; no change for infinite forest

(48)

Rates of convergence

Corollary: risk convergence rates (far from boundaries, withk =k_n^? optimal), under (Hθ),θ∈[1,2]:

Tree risk> n^−2/(2+p) if s^? not constant,θ >1 Infinite forest risk6 n^{−2θ/(2θ+p)} ⇒ minimax C^θ, θ∈[1,2]

Remarks:

q > (k_n^?)² is sufficient to get an“infinite” forest with subsampling a out ofn for computing labels:

estimation error of a single tree ^σ_a²^k instead of ^σ_n²^k; no change for infinite forest

(49)

Outline

1 Random forests

3 Toy forests

(50)

Definition (Biau, 2012)

SplitDn into Dn1 andDn2

U¹

U²

. . . U^q

UsingD_n₂, no resampling here

RI partitions, usingDn1

. . . . . .

bs_U1

Aggregation ₎₎ bs_U2

##

. . . _bs_U^q

{{

bsHO−RF

⇒ purely random forest closely related to double-sample trees (Wager & Athey, 2018)

(51)

Numerical experiments: framework

Data generation:

Xi ∼ U([0,1]^p) Yi =s^?(Xi) +εi

ε_i ∼ N(0, σ²) σ²= 1/16 s^? :x∈[0,1]^p 7→ 1

10×^h10 sin(πx₁x₂) + 20(x₃−0.5)²+ 10x₄+ 5x₅ⁱ . Data split: n₁ = 1 280 n₂= 25 600

Forests definition:

nodesize= 1

k ∈ {2⁵,2⁶,2⁷,2⁸,2⁹}

“Large” forests are made ofq =k trees.

(52)

Numerical experiments: results (p = 5)

Single tree Large forest No bootstrap

mtry=p

0.13

k^0.17 +1.04σ²k n₂

0.13

k^0.17+ 1.04σ²k n₂ Bootstrap

mtry=p

0.14

k^0.17 +1.06σ²k n₂

0.15

k^0.29+ 0.08σ²k n₂ No bootstrap

mtry=bp/3c 0.23

k^0.19 +1.01σ²k n₂

0.06

mtry=bp/3c 0.25

k^0.20 +1.02σ²k n₂

0.06

k^0.34+ 0.05σ²k n₂ 2

2 +p ≈0.286 4

4 +p ≈0.444

(53)

Numerical experiments: results (p = 10)

Single tree Large forest No bootstrap

mtry=p

0.11

k^0.12 +1.03σ²k n₂

0.11

mtry=p

0.11

k^0.11 +1.05σ²k n₂

0.10

k^0.19+ 0.04σ²k n₂ No bootstrap

mtry=bp/3c 0.21

k^0.18 +1.08σ²k n₂

0.08

mtry=bp/3c 0.20

k^0.16 +1.05σ²k n₂

0.07

k^0.26+ 0.03σ²k n₂

(54)

Conclusion

Forests improve the order of magnitudeof the approximation error, compared to a single tree

Estimation error seems to change only by a constant factor (at least for toy forests);

not contradictory with literature: here, we fix k; different picture if nodesizeis fixed (+subsampling)

Randomization:

randomization of labels seems to have no impact;

strong impact of randomization of partitions(hold-out RF: both bootstrap and mtry)

(55)

Conclusion

Forests improve the order of magnitudeof the approximation error, compared to a single tree

Estimation error seems to change only by a constant factor (at least for toy forests);

not contradictory with literature: here, we fix k; different picture if nodesizeis fixed (+subsampling)

Randomization:

randomization of labels seems to have no impact;

strong impact of randomization of partitions(hold-out RF:

(56)

Approximation error: generalization: purf

Purely uniformly random forests:

split a random cell, chosen with probability equal to its volume

⇒in dimension p= 1, rates similar to toy forests

0.0 0.2 0.4 0.6 0.8 1.0

0.00.20.40.60.81.0

X2

(57)

Approximation error: generalization: bprf

Balanced purely random forestsin dimensionp:

full binary tree, uniform splits⇒ k^−α (tree) vs. k^−2α (forest) whereα=−log₂1−_2p¹ ⇒not minimax rates!

0.20.40.60.81.0

X2

(58)

Approximation error: 4 PRFs, different rates

0.0 0.2 0.4 0.6 0.8 1.0

UBPRF

X¹

X2

0.0 0.2 0.4 0.6 0.8 1.0

BPRF

X¹

X2

0.0 0.2 0.4 0.6 0.8 1.0

PURF

X2

0.0 0.2 0.4 0.6 0.8 1.0

TOY

X2

(59)

Approximation error: generalization

General result on the approximation errorunder (Hθ):

e.g., roughly, if x iscentered in its cell (on average overU), tree approx. error∝M₂ infinite forestapprox. error∝M²₂ whereM₂ ≈averagesquare distance fromx to the boundary of its cell (∝k^−2/p for toy forests)

other PRF studied in the literature: Mondrian forests (Mourtada, Ga¨ıffas & Scornet 2017 & 2020), centered random forests (Biau, 2012; Klusowski, 2018), ...

(60)

Open problems / future work

Theory on approximation error of hold-out RF?

⇒ understand the typical shape of the cell that contains x, for a RI tree

(x centered on average? square distance to boundary?) Theory on estimation errorof other PRF (beyond toy and PURF), with lower bounds?

of hold-out RF?

Extensive numerical experiments? (other functions s^?, ...)