Bias-variance decomposition in Random Forests


(1)

Bias-variance decomposition in Random Forests

Gilles Louppe @glouppe

(2)

Motivation

In supervised learning, combining the predictions of several randomized models often achieves better results than a single non-randomized model.

(3)

Supervised learning

• The inputs are random variables $X = X_1, \ldots, X_p$;

• The output is a random variable $Y$.

• Data comes as a finite learning set
$$\mathcal{L} = \{(x_i, y_i) \mid i = 0, \ldots, N-1\},$$
where $x_i \in \mathcal{X} = \mathcal{X}_1 \times \ldots \times \mathcal{X}_p$ and $y_i \in \mathcal{Y}$ are randomly drawn from $P_{X,Y}$.

• The goal is to find a model $\varphi_{\mathcal{L}} : \mathcal{X} \mapsto \mathcal{Y}$ minimizing
$$\text{Err}(\varphi_{\mathcal{L}}) = \mathbb{E}_{X,Y}\{L(Y, \varphi_{\mathcal{L}}(X))\}.$$

(4)

Performance evaluation

Classification

• Symbolic output (e.g., $\mathcal{Y} = \{\text{yes}, \text{no}\}$)

• Zero-one loss
$$L(Y, \varphi_{\mathcal{L}}(X)) = 1(Y \neq \varphi_{\mathcal{L}}(X))$$

Regression

• Numerical output (e.g., $\mathcal{Y} = \mathbb{R}$)

• Squared error loss
$$L(Y, \varphi_{\mathcal{L}}(X)) = (Y - \varphi_{\mathcal{L}}(X))^2$$
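As a quick illustration, both losses can be computed directly from predictions. The arrays below are illustrative toy data, not taken from the slides; this is a minimal NumPy sketch.

```python
import numpy as np

# Classification: zero-one loss, averaged over a toy sample
y_true = np.array(["yes", "no", "yes", "yes"])
y_pred = np.array(["yes", "yes", "yes", "no"])
zero_one = np.mean(y_true != y_pred)  # fraction of misclassified points

# Regression: squared error loss, averaged over a toy sample
y_true_reg = np.array([1.0, 2.0, 3.0])
y_pred_reg = np.array([1.1, 1.8, 3.5])
squared_error = np.mean((y_true_reg - y_pred_reg) ** 2)

print(zero_one)       # 0.5
print(squared_error)  # 0.1
```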

(5)

Decision trees

[Figure: a decision tree built on two inputs $X_1$ and $X_2$, with split nodes testing $X_1 \leq 0.7$ and $X_2 \leq 0.5$ (≤ branches go left, > branches go right), nodes $t_1, \ldots, t_5$ marked as split or leaf nodes, and leaves returning $p(Y = c \mid X = x)$ for the samples $x$ that reach them; the companion panel shows the induced partition of the input space.]

$t \in \varphi$ : nodes of the tree $\varphi$

$X_t$ : split variable at $t$

$v_t \in \mathbb{R}$ : split threshold at $t$
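A tree of this kind can be grown with scikit-learn; the sketch below is illustrative only (the synthetic data, depth and seeds are assumptions, not taken from the slides).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 2)                      # two input variables X1, X2 in [0, 1]
y = (X[:, 0] <= 0.7) & (X[:, 1] <= 0.5)   # labels defined by two axis-aligned splits

# A depth-2 tree recovers split nodes of the form X_t <= v_t
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each leaf t stores the class proportions p(Y = c | X = x) of the samples reaching it
print(tree.predict_proba([[0.2, 0.3]]))   # falls in the X1 <= 0.7, X2 <= 0.5 region
```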

(6)

Bias-variance decomposition in regression

Theorem. For the squared error loss, the bias-variance decomposition of the expected generalization error at $X = x$ is

$$\mathbb{E}_{\mathcal{L}}\{\text{Err}(\varphi_{\mathcal{L}}(x))\} = \text{noise}(x) + \text{bias}^2(x) + \text{var}(x)$$

where, with $\varphi_B$ the Bayes model,

$$\text{noise}(x) = \text{Err}(\varphi_B(x)),$$
$$\text{bias}^2(x) = (\varphi_B(x) - \mathbb{E}_{\mathcal{L}}\{\varphi_{\mathcal{L}}(x)\})^2,$$
$$\text{var}(x) = \mathbb{V}_{\mathcal{L}}\{\varphi_{\mathcal{L}}(x)\}.$$
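These three terms can be estimated empirically by redrawing many learning sets and refitting the model at a fixed point $x$. The sketch below does this for a single decision tree; the data-generating process, sample sizes and DecisionTreeRegressor settings are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
f = np.sin                      # true regression function (plays the role of phi_B)
sigma_noise = 0.1               # std of the additive output noise
x0 = np.array([[1.0]])          # point x at which the decomposition is evaluated

preds = []
for _ in range(300):            # 300 independent learning sets L
    X = rng.uniform(0, 5, size=(50, 1))
    y = f(X).ravel() + rng.normal(0, sigma_noise, size=50)
    model = DecisionTreeRegressor().fit(X, y)
    preds.append(model.predict(x0)[0])
preds = np.array(preds)

noise = sigma_noise ** 2                   # Err(phi_B(x)) for Gaussian output noise
bias2 = (f(x0)[0, 0] - preds.mean()) ** 2  # (phi_B(x) - E_L{phi_L(x)})^2
var = preds.var()                          # V_L{phi_L(x)}
print(noise, bias2, var)                   # their sum approximates E_L{Err(phi_L(x))}
```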

(7)
(8)

Diagnosing the generalization error of a decision tree

• (Residual error: Lowest achievable error, independent of $\varphi_{\mathcal{L}}$.)

• Bias: Decision trees usually have low bias.

• Variance: They often suffer from high variance.

• Solution: Combine the predictions of several randomized trees into a single model.

(9)

Random forests

[Figure: an input $x$ is passed to $M$ randomized trees $\varphi_1, \ldots, \varphi_M$; their individual estimates $p_{\varphi_m}(Y = c \mid X = x)$ are averaged ($\Sigma$) into the ensemble prediction $p_{\psi}(Y = c \mid X = x)$.]

Randomization

• Bootstrap samples
• Random selection of $K \leq p$ split variables at each node
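In scikit-learn terms, such an ensemble corresponds to a random forest. A minimal sketch follows; the dataset and hyper-parameter values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# M randomized trees, each grown on a bootstrap sample of L and restricted to a
# random subset of the input variables at every split; their class probability
# estimates are averaged to form p_psi(Y = c | X = x)
forest = RandomForestClassifier(
    n_estimators=100,        # M
    max_features="sqrt",     # strength of the feature randomization
    bootstrap=True,
    random_state=0,
).fit(X, y)

print(forest.predict_proba(X[:1]))   # averaged p(Y = c | X = x) over the 100 trees
```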

(10)

Bias-variance decomposition (cont.)

Theorem. For the squared error loss, the bias-variance decomposition of the expected generalization error $\mathbb{E}_{\mathcal{L}}\{\text{Err}(\psi_{\mathcal{L},\theta_1,\ldots,\theta_M}(x))\}$ at $X = x$ of an ensemble of $M$ randomized models $\varphi_{\mathcal{L},\theta_m}$ is

$$\mathbb{E}_{\mathcal{L}}\{\text{Err}(\psi_{\mathcal{L},\theta_1,\ldots,\theta_M}(x))\} = \text{noise}(x) + \text{bias}^2(x) + \text{var}(x),$$

where

$$\text{noise}(x) = \text{Err}(\varphi_B(x)),$$
$$\text{bias}^2(x) = (\varphi_B(x) - \mathbb{E}_{\mathcal{L},\theta}\{\varphi_{\mathcal{L},\theta}(x)\})^2,$$
$$\text{var}(x) = \rho(x)\sigma^2_{\mathcal{L},\theta}(x) + \frac{1 - \rho(x)}{M}\sigma^2_{\mathcal{L},\theta}(x),$$

with $\sigma^2_{\mathcal{L},\theta}(x)$ the variance of a single randomized model and $\rho(x)$ the correlation between the predictions of two randomized models built on the same learning set.

(11)

Interpretation of ρ(x)

(Louppe, 2014)

Theorem.
$$\rho(x) = \frac{\mathbb{V}_{\mathcal{L}}\{\mathbb{E}_{\theta|\mathcal{L}}\{\varphi_{\mathcal{L},\theta}(x)\}\}}{\mathbb{V}_{\mathcal{L}}\{\mathbb{E}_{\theta|\mathcal{L}}\{\varphi_{\mathcal{L},\theta}(x)\}\} + \mathbb{E}_{\mathcal{L}}\{\mathbb{V}_{\theta|\mathcal{L}}\{\varphi_{\mathcal{L},\theta}(x)\}\}}$$

In other words, it is the ratio between

• the variance due to the learning set and

• the total variance, accounting for random effects due to both the learning set and the random perturbations.

$\rho(x) \to 1$ when variance is mostly due to the learning set; $\rho(x) \to 0$ when variance is mostly due to the random perturbations.
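This ratio can be estimated with a nested Monte Carlo loop: an outer loop redraws learning sets $\mathcal{L}$, an inner loop redraws randomization seeds $\theta$ for the same $\mathcal{L}$. The sketch below is illustrative only; the data-generating process, the use of ExtraTreeRegressor as the randomized base learner, and the Monte Carlo sizes are assumptions.

```python
import numpy as np
from sklearn.tree import ExtraTreeRegressor

rng = np.random.RandomState(0)
x0 = np.array([[0.5, 0.5]])                    # point x where rho(x) is estimated
n_L, n_theta = 100, 25                         # outer / inner Monte Carlo sizes

means_over_theta, vars_over_theta = [], []
for _ in range(n_L):                           # draw a learning set L
    X = rng.uniform(0, 1, size=(100, 2))
    y = X[:, 0] + rng.normal(0, 0.1, size=100)
    # draw several randomization seeds theta for the same L
    preds = [ExtraTreeRegressor(max_features=1, random_state=t).fit(X, y).predict(x0)[0]
             for t in range(n_theta)]
    means_over_theta.append(np.mean(preds))    # E_{theta|L}{phi_{L,theta}(x)}
    vars_over_theta.append(np.var(preds))      # V_{theta|L}{phi_{L,theta}(x)}

var_due_to_L = np.var(means_over_theta)        # V_L{E_{theta|L}{...}}
var_due_to_theta = np.mean(vars_over_theta)    # E_L{V_{theta|L}{...}}
rho = var_due_to_L / (var_due_to_L + var_due_to_theta)
print(rho)
```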

(12)

Diagnosing the generalization error of random forests

• Bias: Identical to the bias of a single randomized tree.

• Variance:
$$\text{var}(x) = \rho(x)\sigma^2_{\mathcal{L},\theta}(x) + \frac{1 - \rho(x)}{M}\sigma^2_{\mathcal{L},\theta}(x)$$
As $M \to \infty$, $\text{var}(x) \to \rho(x)\sigma^2_{\mathcal{L},\theta}(x)$.
The stronger the randomization, $\rho(x) \to 0$ and $\text{var}(x) \to 0$. The weaker the randomization, $\rho(x) \to 1$ and $\text{var}(x) \to \sigma^2_{\mathcal{L},\theta}(x)$.

Bias-variance trade-off. Randomization increases bias but makes it possible to reduce the variance of the corresponding ensemble model through averaging.
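A small numerical illustration of the variance formula, with assumed values of $\rho(x)$ and $\sigma^2_{\mathcal{L},\theta}(x)$:

```python
import numpy as np

sigma2 = 1.0                                  # sigma^2_{L,theta}(x), illustrative value
for rho in (0.1, 0.5, 0.9):                   # strong ... weak randomization
    for M in (1, 10, 100, np.inf):
        var = rho * sigma2 + (1 - rho) / M * sigma2
        print(f"rho={rho}, M={M}: var(x) = {var:.3f}")
# For M = 1, var(x) = sigma2 (a single randomized tree); as M grows, var(x) -> rho * sigma2.
```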

(13)

Bias-variance decomposition in classification

Theorem. For the zero-one loss and binary classification, the expected generalization error $\mathbb{E}_{\mathcal{L}}\{\text{Err}(\varphi_{\mathcal{L}}(x))\}$ at $X = x$ decomposes as follows:

$$\mathbb{E}_{\mathcal{L}}\{\text{Err}(\varphi_{\mathcal{L}}(x))\} = P(\varphi_B(x) \neq Y) + \Phi\left(\frac{0.5 - \mathbb{E}_{\mathcal{L}}\{\hat{p}_{\mathcal{L}}(Y = \varphi_B(x))\}}{\sqrt{\mathbb{V}_{\mathcal{L}}\{\hat{p}_{\mathcal{L}}(Y = \varphi_B(x))\}}}\right)(2P(\varphi_B(x) = Y) - 1)$$

where $\Phi$ is the cumulative distribution function of the standard normal.

• For $\mathbb{E}_{\mathcal{L}}\{\hat{p}_{\mathcal{L}}(Y = \varphi_B(x))\} > 0.5$, $\mathbb{V}_{\mathcal{L}}\{\hat{p}_{\mathcal{L}}(Y = \varphi_B(x))\} \to 0$ makes $\Phi \to 0$ and the expected generalization error tends to the error of the Bayes model.

• Conversely, for $\mathbb{E}_{\mathcal{L}}\{\hat{p}_{\mathcal{L}}(Y = \varphi_B(x))\} < 0.5$, $\mathbb{V}_{\mathcal{L}}\{\hat{p}_{\mathcal{L}}(Y = \varphi_B(x))\} \to 0$ makes $\Phi \to 1$ and the expected generalization error tends to $1 - P(\varphi_B(x) \neq Y)$, i.e., it becomes as large as possible.
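For concreteness, here is the decomposition evaluated at assumed values of the Bayes probability, $\mathbb{E}_{\mathcal{L}}\{\hat{p}_{\mathcal{L}}\}$ and $\mathbb{V}_{\mathcal{L}}\{\hat{p}_{\mathcal{L}}\}$; all numbers are illustrative, and $\Phi$ is taken as the standard normal CDF.

```python
from scipy.stats import norm

# Illustrative values, assuming p_hat_L(Y = phi_B(x)) is approximately Gaussian
p_bayes = 0.8        # P(phi_B(x) = Y), i.e. 1 - noise at x
mean_p  = 0.6        # E_L{p_hat_L(Y = phi_B(x))}
var_p   = 0.04       # V_L{p_hat_L(Y = phi_B(x))}

phi = norm.cdf((0.5 - mean_p) / var_p ** 0.5)
err = (1 - p_bayes) + phi * (2 * p_bayes - 1)
print(err)   # ~0.385; shrinking var_p drives phi to 0 and err toward the Bayes error 0.2
```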

(14)

References

Louppe, G. (2014). Understanding Random Forests: From Theory to Practice. PhD thesis, University of Liège.
