Bias-variance decomposition
in Random Forests
Gilles Louppe @glouppe
Motivation
In supervised learning, combining the predictions of several randomized models often achieves better results than a single non-randomized model.
Supervised learning
• The inputs are random variables $X = (X_1, \ldots, X_p)$;
• The output is a random variable $Y$.
• Data comes as a finite learning set
  $\mathcal{L} = \{(x_i, y_i) \mid i = 0, \ldots, N-1\}$,
  where $x_i \in \mathcal{X} = \mathcal{X}_1 \times \ldots \times \mathcal{X}_p$ and $y_i \in \mathcal{Y}$ are randomly drawn from $P_{X,Y}$.
• The goal is to find a model $\varphi_{\mathcal{L}} : \mathcal{X} \mapsto \mathcal{Y}$ minimizing
  $\mathrm{Err}(\varphi_{\mathcal{L}}) = \mathbb{E}_{X,Y}\{L(Y, \varphi_{\mathcal{L}}(X))\}$.
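To make the definition of $\mathrm{Err}$ concrete, here is a minimal Monte Carlo sketch, assuming a hypothetical generative process and the squared error loss defined on the next slide:

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical generative process P_{X,Y}: Y = sin(3*X) + Gaussian noise.
X = rng.uniform(-1, 1, size=(10000, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.3, size=10000)

# A fixed model phi_L (pretend it was learned from L beforehand).
def phi_L(X):
    return np.sin(3 * X[:, 0])

# Err(phi_L) = E_{X,Y}{L(Y, phi_L(X))}, estimated by averaging the
# squared error loss over fresh draws from P_{X,Y}.
print(np.mean((y - phi_L(X)) ** 2))  # ~0.09, the irreducible noise here
```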
Performance evaluation
Classification
• Symbolic output (e.g., $\mathcal{Y} = \{\text{yes}, \text{no}\}$)
• Zero-one loss
  $L(Y, \varphi_{\mathcal{L}}(X)) = 1(Y \neq \varphi_{\mathcal{L}}(X))$
Regression
• Numerical output (e.g., $\mathcal{Y} = \mathbb{R}$)
• Squared error loss
  $L(Y, \varphi_{\mathcal{L}}(X)) = (Y - \varphi_{\mathcal{L}}(X))^2$
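As a sketch, both losses are straightforward to estimate on a sample (a NumPy version; the function names are ours):

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    # Empirical average of 1(Y != phi_L(X)).
    return np.mean(np.asarray(y_true) != np.asarray(y_pred))

def squared_error_loss(y_true, y_pred):
    # Empirical average of (Y - phi_L(X))^2.
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)
```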
Decision trees
[Figure: a decision tree over $\mathcal{X} = \mathcal{X}_1 \times \mathcal{X}_2$, with split nodes testing $X_1 \leq 0.7$ and $X_2 \leq 0.5$ and leaf nodes $t_2, \ldots, t_5$ returning $p(Y = c \mid X = x)$.]
• $t \in \varphi$: nodes of the tree $\varphi$
• $X_t$: split variable at $t$
• $v_t \in \mathbb{R}$: split threshold at $t$
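A sketch of growing such a tree with scikit-learn (the toy data is an illustrative assumption):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy two-variable problem, as in the figure.
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# Recursive axis-aligned splits of the form X_t <= v_t.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# The leaf reached by x returns the class probabilities p(Y = c | X = x).
print(tree.predict_proba(X[:1]))
```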
Bias-variance decomposition in regression
Theorem. For the squared error loss, the bias-variance decomposition of the expected generalization error at $X = x$ is
$\mathbb{E}_{\mathcal{L}}\{\mathrm{Err}(\varphi_{\mathcal{L}}(x))\} = \text{noise}(x) + \text{bias}^2(x) + \text{var}(x)$
where
$\text{noise}(x) = \mathrm{Err}(\varphi_B(x))$,
$\text{bias}^2(x) = (\varphi_B(x) - \mathbb{E}_{\mathcal{L}}\{\varphi_{\mathcal{L}}(x)\})^2$,
$\text{var}(x) = \mathbb{E}_{\mathcal{L}}\{(\mathbb{E}_{\mathcal{L}}\{\varphi_{\mathcal{L}}(x)\} - \varphi_{\mathcal{L}}(x))^2\}$,
and $\varphi_B$ denotes the Bayes model.
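The three terms can be estimated by simulation when $\varphi_B$ is known; a minimal sketch, assuming a synthetic problem $Y = \sin(3X) + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, 0.3^2)$, so that $\varphi_B(x) = \sin(3x)$ and $\text{noise}(x) = 0.09$:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)

def sample_learning_set(n=300):
    X = rng.uniform(-1, 1, size=(n, 1))
    y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.3, size=n)
    return X, y

x0 = np.array([[0.25]])              # the point x of the decomposition
preds = []
for _ in range(200):                 # 200 independent learning sets L
    X, y = sample_learning_set()
    preds.append(DecisionTreeRegressor().fit(X, y).predict(x0)[0])

preds = np.array(preds)
phi_B = np.sin(3 * 0.25)             # Bayes prediction at x
print("noise(x)  =", 0.3 ** 2)
print("bias^2(x) =", (phi_B - preds.mean()) ** 2)
print("var(x)    =", preds.var())
```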
Diagnosing the generalization error of a decision tree
• (Residual error : Lowest achievable error, independent of ϕL.)
• Bias : Decision trees usually havelow bias.
• Variance : They often suffer from high variance.
• Solution : Combine the predictions of several randomized trees into a single model.
Random forests
[Figure: an ensemble $\psi$ of $M$ randomized models $\varphi_1, \ldots, \varphi_M$; the individual estimates $p_{\varphi_m}(Y = c \mid X = x)$ are averaged into $p_\psi(Y = c \mid X = x)$.]
Randomization
• Bootstrap samples
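A sketch of fitting such an ensemble with scikit-learn's RandomForestClassifier (toy data assumed):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# M randomized trees; their estimates p_phi_m(Y = c | X = x) are
# averaged into p_psi(Y = c | X = x).
forest = RandomForestClassifier(
    n_estimators=100,     # M
    bootstrap=True,       # each tree sees a bootstrap sample of L
    max_features="sqrt",  # extra randomization at each split
    random_state=0,
).fit(X, y)

print(forest.predict_proba(X[:1]))
```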
Bias-variance decomposition (cont.)
Theorem. For the squared error loss, the bias-variance decomposition of the expected generalization error $\mathbb{E}_{\mathcal{L}}\{\mathrm{Err}(\psi_{\mathcal{L},\theta_1,\ldots,\theta_M}(x))\}$ at $X = x$ of an ensemble of $M$ randomized models $\varphi_{\mathcal{L},\theta_m}$ is
$\mathbb{E}_{\mathcal{L}}\{\mathrm{Err}(\psi_{\mathcal{L},\theta_1,\ldots,\theta_M}(x))\} = \text{noise}(x) + \text{bias}^2(x) + \text{var}(x)$,
where
$\text{noise}(x) = \mathrm{Err}(\varphi_B(x))$,
$\text{bias}^2(x) = (\varphi_B(x) - \mathbb{E}_{\mathcal{L},\theta}\{\varphi_{\mathcal{L},\theta}(x)\})^2$,
$\text{var}(x) = \rho(x)\sigma^2_{\mathcal{L},\theta}(x) + \frac{1 - \rho(x)}{M}\sigma^2_{\mathcal{L},\theta}(x)$,
with $\sigma^2_{\mathcal{L},\theta}(x)$ the variance of a single randomized model and $\rho(x)$ the correlation between the predictions of two randomized models built on the same learning set.
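The shrinking of the $M$-dependent variance term can be observed numerically; a sketch on the synthetic problem above, with bagged trees standing in for a generic randomized ensemble:

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)

def sample_learning_set(n=300):
    X = rng.uniform(-1, 1, size=(n, 1))
    y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.3, size=n)
    return X, y

x0 = np.array([[0.25]])
for M in (1, 10, 100):
    preds = []
    for _ in range(100):             # independent learning sets L
        X, y = sample_learning_set()
        ens = BaggingRegressor(DecisionTreeRegressor(),
                               n_estimators=M).fit(X, y)
        preds.append(ens.predict(x0)[0])
    # var(x) decreases towards rho(x) * sigma^2 as M grows.
    print(f"M = {M:3d}   var(x) ~ {np.var(preds):.4f}")
```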
Interpretation of ρ(x)
Theorem (Louppe, 2014).
$\rho(x) = \dfrac{\mathbb{V}_{\mathcal{L}}\{\mathbb{E}_{\theta|\mathcal{L}}\{\varphi_{\mathcal{L},\theta}(x)\}\}}{\mathbb{V}_{\mathcal{L}}\{\mathbb{E}_{\theta|\mathcal{L}}\{\varphi_{\mathcal{L},\theta}(x)\}\} + \mathbb{E}_{\mathcal{L}}\{\mathbb{V}_{\theta|\mathcal{L}}\{\varphi_{\mathcal{L},\theta}(x)\}\}}$
In other words, it is the ratio between
• the variance due to the learning set and
• the total variance, accounting for random effects due to both the learning set and the random perturbations.
$\rho(x) \to 1$ when variance is mostly due to the learning set; $\rho(x) \to 0$ when variance is mostly due to the random perturbations.
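$\rho(x)$ can be estimated by a nested Monte Carlo loop, varying the learning set in the outer loop and the randomization $\theta$ in the inner loop; a sketch on the same synthetic problem, using random splits as the source of randomization:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)

def sample_learning_set(n=300):
    X = rng.uniform(-1, 1, size=(n, 1))
    y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.3, size=n)
    return X, y

x0 = np.array([[0.25]])
means, variances = [], []
for _ in range(50):                  # outer loop: learning sets L
    X, y = sample_learning_set()
    preds = [DecisionTreeRegressor(splitter="random", random_state=s)
             .fit(X, y).predict(x0)[0]
             for s in range(50)]     # inner loop: randomizations theta
    means.append(np.mean(preds))     # E_{theta|L}{phi_{L,theta}(x)}
    variances.append(np.var(preds))  # V_{theta|L}{phi_{L,theta}(x)}

rho = np.var(means) / (np.var(means) + np.mean(variances))
print("rho(x) ~", rho)
```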
Diagnosing the generalization error of random forests
• Bias: identical to the bias of a single randomized tree.
• Variance: $\text{var}(x) = \rho(x)\sigma^2_{\mathcal{L},\theta}(x) + \frac{1 - \rho(x)}{M}\sigma^2_{\mathcal{L},\theta}(x)$.
  As $M \to \infty$, $\text{var}(x) \to \rho(x)\sigma^2_{\mathcal{L},\theta}(x)$.
  The stronger the randomization, $\rho(x) \to 0$ and $\text{var}(x) \to 0$. The weaker the randomization, $\rho(x) \to 1$ and $\text{var}(x) \to \sigma^2_{\mathcal{L},\theta}(x)$.
Bias-variance trade-off. Randomization increases bias but makes it possible to reduce the variance of the corresponding ensemble model through averaging.
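The trade-off can be explored by varying the randomization strength, e.g. `max_features` in scikit-learn (a sketch on assumed toy data; smaller values mean stronger randomization):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=1.0,
                       random_state=0)

# Smaller max_features -> stronger randomization -> smaller rho(x):
# more variance is averaged away, at the price of a higher bias.
for k in (10, 3, 1):
    forest = RandomForestRegressor(n_estimators=200, max_features=k,
                                   random_state=0)
    mse = -cross_val_score(forest, X, y,
                           scoring="neg_mean_squared_error").mean()
    print(f"max_features = {k:2d}   CV mean squared error ~ {mse:.2f}")
```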
Bias-variance decomposition in classification
Theorem. For the zero-one loss and binary classification, the expected generalization error $\mathbb{E}_{\mathcal{L}}\{\mathrm{Err}(\varphi_{\mathcal{L}}(x))\}$ at $X = x$ decomposes as follows:
$\mathbb{E}_{\mathcal{L}}\{\mathrm{Err}(\varphi_{\mathcal{L}}(x))\} = P(\varphi_B(x) \neq Y) + \Phi\left(\frac{0.5 - \mathbb{E}_{\mathcal{L}}\{\hat{p}_{\mathcal{L}}(Y = \varphi_B(x))\}}{\sqrt{\mathbb{V}_{\mathcal{L}}\{\hat{p}_{\mathcal{L}}(Y = \varphi_B(x))\}}}\right)\left(2P(\varphi_B(x) = Y) - 1\right)$
where $\Phi$ is the standard normal cumulative distribution function.
• For $\mathbb{E}_{\mathcal{L}}\{\hat{p}_{\mathcal{L}}(Y = \varphi_B(x))\} > 0.5$, $\mathbb{V}_{\mathcal{L}}\{\hat{p}_{\mathcal{L}}(Y = \varphi_B(x))\} \to 0$ makes $\Phi \to 0$ and the expected generalization error tends to the error of the Bayes model.
• Conversely, for $\mathbb{E}_{\mathcal{L}}\{\hat{p}_{\mathcal{L}}(Y = \varphi_B(x))\} < 0.5$, $\mathbb{V}_{\mathcal{L}}\{\hat{p}_{\mathcal{L}}(Y = \varphi_B(x))\} \to 0$ makes $\Phi \to 1$ and the error exceeds the Bayes error by $2P(\varphi_B(x) = Y) - 1$.
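The quantities entering the theorem can likewise be estimated by simulation when $\varphi_B$ is known; a sketch with a hypothetical binary problem where $P(Y = 1 \mid x \geq 0) = 0.75$, so that $\varphi_B(x) = 1(x \geq 0)$ and the Bayes error is 0.25:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

def sample_learning_set(n=200):
    X = rng.uniform(-1, 1, size=(n, 1))
    p1 = np.where(X[:, 0] >= 0, 0.75, 0.25)   # P(Y = 1 | X = x)
    y = (rng.uniform(size=n) < p1).astype(int)
    return X, y

x0 = np.array([[0.5]])                        # here phi_B(x0) = 1
p_hat = [DecisionTreeClassifier(max_depth=3)
         .fit(*sample_learning_set()).predict_proba(x0)[0, 1]
         for _ in range(200)]

# E_L{p_hat_L(Y = phi_B(x))} and V_L{p_hat_L(Y = phi_B(x))} at x = 0.5.
print("mean =", np.mean(p_hat), "  variance =", np.var(p_hat))
```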