Embedded Variable Selection in Classification Trees

(1)

HAL Id: hal-01019767

https://hal.archives-ouvertes.fr/hal-01019767

Submitted on 6 Jun 2020

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Embedded Variable Selection in Classification Trees

Servane Gey, Tristan Mary-Huard

To cite this version:

Servane Gey, Tristan Mary-Huard. Embedded Variable Selection in Classification Trees. GfKl 2011: Joint Conference of the German Classification Society (GfKl) and the German Association for Pattern Recognition (DAG), Aug 2011, Frankfurt, Germany. n.p. ⟨hal-01019767⟩

(2)

Embedded Variable Selection in Classification Trees

Servane Gey¹, Tristan Mary-Huard²

¹ MAP5, UMR 8145, Université Paris Descartes, Paris, France
² UMR AgroParisTech/INRA 518, Paris, France

(3)

Overview

Introduction
- Binary classification setting
- Model and variable selection in classification
- Classification trees

Variable selection for CART
- Classes of classification trees
- Theoretical results

(4)

Binary classification

Prediction of the unknown label $Y$ (0 or 1) of an observation $X$.
Use a training sample $D = (X_1, Y_1), \dots, (X_n, Y_n)$, drawn i.i.d. from $P$, to build a classifier
$$\hat{f} : \mathcal{X} \to \{0, 1\}, \qquad X \mapsto \hat{Y}.$$

Quality assessment
- Classification risk and loss (quality of the resulting classifier $\hat{f}$):
$$L(\hat{f}) = \mathbb{P}\bigl(\hat{f}(X) \neq Y \mid D\bigr), \qquad \ell(\hat{f}, f^*) = L(\hat{f}) - L(f^*).$$
- Average loss (quality of the classification algorithm):
$$\mathbb{E}_D\bigl[\ell(\hat{f}, f^*)\bigr].$$

Remark: all these quantities depend on $P$, which is unknown.
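Since $P$ is unknown, these quantities are approximated on data in practice. Below is a minimal sketch (an assumed setup, not part of the slides: the toy data and threshold classifier are placeholders) of how the risk $L(\hat{f})$ is estimated on a held-out sample.

```python
# Monte Carlo estimate of the risk L(f_hat) = P(f_hat(X) != Y | D) on held-out data.
import numpy as np

def empirical_risk(f_hat, X, y):
    """Proportion of misclassified points: the empirical counterpart of L(f_hat)."""
    return float(np.mean(f_hat(X) != y))

rng = np.random.default_rng(0)
X_test = rng.normal(size=(1000, 2))
y_test = (X_test[:, 0] > 0).astype(int)            # toy labels
f_hat = lambda X: (X[:, 0] > 0.1).astype(int)      # some previously fitted classifier
print(empirical_risk(f_hat, X_test, y_test))       # estimates L(f_hat)
```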

(5)

Basics of Vapnik theory: structural risk minimization (SRM)

Consider a collection of classes of classifiers $\mathcal{C}_1, \dots, \mathcal{C}_M$. Define
$$f_m = \arg\min_{f \in \mathcal{C}_m} L(f), \qquad \hat{f}_m = \arg\min_{f \in \mathcal{C}_m} L_n(f), \qquad \hat{f} = \arg\min_m \left[ L_n(\hat{f}_m) + \alpha \, \frac{V_{\mathcal{C}_m}}{n} \right].$$

- Class complexity. If $\mathcal{C}_1, \dots, \mathcal{C}_M$ have finite VC dimensions $V_{\mathcal{C}_1}, \dots, V_{\mathcal{C}_M}$, then
$$\mathbb{E}_D\bigl[\ell(\hat{f}, f^*)\bigr] \leq C \left\{ \inf_m \left( \ell(f_m, f^*) + K \sqrt{\frac{V_{\mathcal{C}_m}}{n}} \right) \right\} + \frac{\lambda}{n} \qquad \text{(Vapnik, 1998).}$$

- Classification task complexity (margin assumption). If there exists $h \in \,]0; 0.5[$ such that
$$\mathbb{P}\bigl(\lvert \eta(X) - 1/2 \rvert \leq h\bigr) = 0, \qquad \text{with } \eta(x) = \mathbb{P}(Y = 1 \mid X = x),$$
then
$$\mathbb{E}_D\bigl[\ell(\hat{f}, f^*)\bigr] \leq C \left\{ \inf_m \left( \ell(f_m, f^*) + K' \, \frac{V_{\mathcal{C}_m}}{n} \right) \right\} + \frac{\lambda'}{n}.$$
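To make the SRM rule concrete, here is a minimal sketch of selecting a class by penalized empirical risk; the penalty constant $\alpha$ and the toy numbers are assumptions, not values from the slides.

```python
# SRM selection: pick the class index m minimizing L_n(f_hat_m) + alpha * VC_m / n.
def srm_select(empirical_risks, vc_dims, n, alpha=1.0):
    """empirical_risks[m] = L_n(f_hat_m); vc_dims[m] = VC dimension of class C_m."""
    scores = [r + alpha * vc / n for r, vc in zip(empirical_risks, vc_dims)]
    return min(range(len(scores)), key=scores.__getitem__)

# Toy example: richer classes fit the data better but pay a larger complexity penalty.
print(srm_select(empirical_risks=[0.30, 0.18, 0.17], vc_dims=[2, 5, 50], n=200))  # -> 1
```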

(6)

Application to variable selection in classification

Assume that $X \in \mathbb{R}^p$. Define
$$f_{m(k)} = \arg\min_{f \in \mathcal{C}_{m(k)}} L(f), \qquad \hat{f}_{m(k)} = \arg\min_{f \in \mathcal{C}_{m(k)}} L_n(f).$$

- Variable selection. Choose $\hat{f}$ such that
$$\hat{f} = \arg\min_{m(k)} \left[ L_n(\hat{f}_{m(k)}) + \alpha \, \frac{V_{\mathcal{C}_{m(k)}}}{n} + \alpha' \, \frac{\log \binom{p}{k}}{n} \right].$$
Then (under the strong margin assumption)
$$\mathbb{E}_D\bigl[\ell(\hat{f}, f^*)\bigr] \leq C \log(p) \left\{ \inf_{m(k)} \left( \ell(f_{m(k)}, f^*) + K' \, \frac{V_{\mathcal{C}_{m(k)}}}{n} \right) \right\} + \frac{\lambda}{n}$$
(Massart, 2000; Mary-Huard et al., 2007).
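A minimal sketch of the penalty above; the exact split between $\alpha$ and $\alpha'$ and their numerical values are assumptions for illustration only.

```python
# Penalty charged to a class of classifiers using k out of p variables:
# alpha * VC / n + alpha' * log(C(p, k)) / n.
from math import comb, log

def vs_penalty(vc_dim, k, p, n, alpha=1.0, alpha_prime=1.0):
    return alpha * vc_dim / n + alpha_prime * log(comb(p, k)) / n

# With many candidate variables, the combinatorial term log C(p, k) / n dominates.
print(vs_penalty(vc_dim=10, k=5, p=1000, n=200))
```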

(7)

Classification trees

General strategy: heuristic approach (CART, Breiman, 1984).
- Find a tree $T_{\max}$ such that $L_n(f_{T_{\max}}) = 0$.
- Prune $T_{\max}$ using the criterion
$$\hat{f} = \arg\min_{T \subseteq T_{\max}} \left[ L_n(f_T) + \alpha \, \frac{|T|}{n} \right].$$
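A minimal sketch of this grow-then-prune strategy using scikit-learn's cost-complexity pruning; the data are a placeholder, and scikit-learn's criterion uses $\alpha \, |T|$ rather than $\alpha \, |T| / n$, which only rescales $\alpha$.

```python
# Grow a maximal tree, then prune it along the cost-complexity path.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = ((X[:, 0] > 0) & (X[:, 1] > 0)).astype(int)

t_max = DecisionTreeClassifier(random_state=0).fit(X, y)     # pure leaves: L_n(f_Tmax) = 0
path = t_max.cost_complexity_pruning_path(X, y)              # candidate alpha values
pruned = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X, y)
          for a in path.ccp_alphas]                           # nested sequence of subtrees
print([t.get_n_leaves() for t in pruned])
```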

(8)

Definitions

Consider a tree $T_{c\ell}$ with
- a given configuration $c$,
- a given list $\ell$ of associated variables.

Remark: the same variable may be associated with several nodes.

Class of tree classifiers. Define
$$\mathcal{C}_{c\ell} = \{ f : f \text{ based on } T_{c\ell} \}, \qquad H_{c\ell} = \text{VC log-entropy of the class } \mathcal{C}_{c\ell},$$
$$f_{c\ell} = \arg\min_{f \in \mathcal{C}_{c\ell}} L(f), \qquad \hat{f}_{c\ell} = \arg\min_{f \in \mathcal{C}_{c\ell}} L_n(f).$$

Remark: two classifiers $f, f' \in \mathcal{C}_{c\ell}$ differ only in their thresholds and labels.

(9)

Risk bound for one class

Proposition. Assume that the strong margin assumption is satisfied. For all $C > 1$, there exist positive constants $K_1$ and $K_2$ depending on $C$ such that
$$\mathbb{E}_D\bigl[\ell(\hat{f}_{c\ell}, f^*)\bigr] \leq C \left\{ \ell(f_{c\ell}, f^*) + K_1 \, \frac{|T_{c\ell}| \log(2n)}{n} \right\} + \frac{K_2}{n}.$$

Idea of proof
- Show that $\mathbb{E}[H_{c\ell}] \leq |T_{c\ell}| \log(2n)$, ...

(10)

Combinatorics for variable selection

To take variable selection into account in the penalized criterion, one needs to count the number of classes sharing the same a priori complexity.

- Parametric case (logistic regression, LDA, ...):
  - one parameter per variable,
  - two classes of classifiers based on $k$ variables have the same a priori complexity,
  - $\binom{p}{k}$ classes of a priori complexity $k$.
- Classification trees:
  - one parameter per internal node (threshold),
  - two classes $\mathcal{C}_{c\ell}$ and $\mathcal{C}_{c'\ell'}$ such that $|T_{c\ell}| = |T_{c'\ell'}|$ have the same a priori complexity.

Count the number of classes based on trees of size $k$!

(11)

Combinatorics for variable selection

A tree $T_{c\ell}$ is defined by
- a configuration,
- a list of variables associated with each node.

Number of configurations of size $k$:
$$N_c^k = \frac{1}{k} \binom{2k-2}{k-1}.$$

Number of variable lists of size $k$:
- the list is ordered: $\{1, 2, 3\} \neq \{2, 1, 3\}$,
- variables are selected with replacement: $\{1, 2, 1\}$ is allowed.
$$N_\ell^k = p^{k-1} \quad \text{instead of} \quad \binom{p}{k}!$$

Number of classes based on trees of size $|T_{c\ell}| = k$:
$$N_k = N_c^k \times N_\ell^k = \frac{1}{k} \binom{2k-2}{k-1} \times p^{k-1}, \qquad \log(N_k) \leq \lambda \, |T_{c\ell}| \log(p).$$
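A quick numerical check of this count; the value of $p$ below is arbitrary, and the constant $\lambda$ is only observed empirically here, not derived.

```python
# N_k = (1/k) * C(2k-2, k-1) * p^(k-1); check that log(N_k) is of order |T| * log(p).
from math import comb, log

def n_classes(k, p):
    n_configs = comb(2 * k - 2, k - 1) // k   # number of tree configurations of size k
    n_var_lists = p ** (k - 1)                # ordered variable lists, with replacement
    return n_configs * n_var_lists

p = 100
for k in (2, 5, 10, 20):
    print(k, n_classes(k, p), log(n_classes(k, p)) / (k * log(p)))  # ratio stays bounded
```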


(14)

Risk bound for tree classifiers

Proposition. Assume that the strong margin assumption is satisfied. If
$$\hat{f} = \arg\min_{c, \ell} \Bigl( L_n(\hat{f}_{c\ell}) + \mathrm{pen}(c, \ell) \Bigr), \qquad \text{where} \quad \mathrm{pen}(c, \ell) = C_1^h \, \frac{|T_{c\ell}| \log(2n)}{n} + C_2^h \, \frac{|T_{c\ell}| \log(p)}{n},$$
with constants $C_1^h$, $C_2^h$ depending on the $h$ appearing in the margin condition, then there exist positive constants $C$, $C'$, $C''$ such that
$$\mathbb{E}_D\bigl[\ell(\hat{f}, f^*)\bigr] \leq C \log(p) \left\{ \inf_{c, \ell} \left\{ \ell(f_{c\ell}, f^*) + C' \, \frac{|T_{c\ell}| \log(2n)}{n} \right\} \right\} + \frac{C''}{n}.$$

Remark:
- Theory: $\mathrm{pen}(c, \ell) = (a_n + b_n \log(p)) \, |T_{c\ell}| = \alpha_{p,n} \, |T_{c\ell}|$.
- Practice (CART): $\mathrm{pen}(c, \ell) = \alpha_{CV} \, |T_{c\ell}|$.
- Does $\alpha_{CV}$ match $\alpha_{p,n}$?
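One way to look at this question empirically; a minimal sketch with scikit-learn, where the data, the cross-validation setup, and dropping the constants $a_n$, $b_n$ are all assumptions made for illustration.

```python
# Compare the CV-selected pruning constant alpha_CV with the theoretical scale
# alpha_{p,n} ~ (log(2n) + log(p)) / n (constants omitted).
import numpy as np
from math import log
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n, p = 500, 50
X = rng.normal(size=(n, p))
y = ((X[:, 0] > 0) & (X[:, 1] > 0)).astype(int)

alphas = DecisionTreeClassifier(random_state=0).fit(X, y) \
             .cost_complexity_pruning_path(X, y).ccp_alphas
cv = GridSearchCV(DecisionTreeClassifier(random_state=0),
                  {"ccp_alpha": list(alphas)}, cv=5).fit(X, y)
alpha_cv = cv.best_params_["ccp_alpha"]
alpha_pn = (log(2 * n) + log(p)) / n    # theoretical scale, constants dropped
print(alpha_cv, alpha_pn)
```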

(15)

Illustration on simulated data (1)

- Variables $X_1, \dots, X_p$ are independent.
- If $X_1 > 0$ and $X_2 > 0$, then $\mathbb{P}(Y = 1) = q$; otherwise $\mathbb{P}(Y = 1) = 1 - q$.

Remark: easy case
- the Bayes classifier belongs to the collection of classes,
- the strong margin assumption is satisfied.
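A sketch of a generator for this design; the sample size, $p$, $q$, and the standard normal marginals are assumed values, not taken from the slides.

```python
import numpy as np

def simulate_easy(n=500, p=10, q=0.9, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))                  # independent variables
    in_quadrant = (X[:, 0] > 0) & (X[:, 1] > 0)  # only X1 and X2 carry signal
    prob1 = np.where(in_quadrant, q, 1 - q)      # P(Y=1) = q inside, 1-q outside
    y = rng.binomial(1, prob1)
    return X, y

X, y = simulate_easy()
print(X.shape, y.mean())
```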

(16)

Illustration on simulated data (2)

- $\mathbb{P}(Y = 1) = 0.5$.
- For $j = 1, 2$: $X_j \mid Y = 0 \sim \mathcal{N}(0, \sigma^2)$ and $X_j \mid Y = 1 \sim \mathcal{N}(1, \sigma^2)$.
- Additional variables are independent and non-informative.

Remark: difficult case
- the Bayes classifier does NOT belong to the collection of classes,
- the strong margin assumption is NOT satisfied.
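A sketch of a generator for this design; the sample size, $p$, $\sigma$, and the Gaussian choice for the non-informative variables are assumptions.

```python
import numpy as np

def simulate_hard(n=500, p=10, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    y = rng.binomial(1, 0.5, size=n)          # P(Y=1) = 0.5
    X = sigma * rng.normal(size=(n, p))       # noise variables (assumed N(0, sigma^2))
    X[:, :2] += y[:, None]                    # X1, X2 ~ N(Y, sigma^2): informative but overlapping
    return X, y

X, y = simulate_hard()
print(X.shape, y.mean())
```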

(17)

Conclusion

Model selection for tree classifiers:
- already investigated (Nobel 02, Gey & Nédélec 06, Gey 10),
- variable selection not investigated so far,
- the pruning step is now validated from this point of view.

Theory vs practice:
- theory: exhaustive search,
- practice: forward strategy,
- nonetheless, the theoretical results are informative!

Extension:
- in this talk: strong margin assumption,
- can be extended to a less restrictive margin assumption,
- manuscript on arXiv.org.

(18)

Bibliography

Breiman L., Friedman J., Olshen R. & Stone C. (1984) Classification And Regression Trees, Chapman & Hall.

Gey S. & Nédélec E. (2005) Model selection for CART regression trees, IEEE Trans. Inform. Theory, 51, 658–670.

Koltchinskii V. (2006) Local Rademacher Complexities and Oracle Inequalities in Risk Minimization, Annals of Statistics, 34, 2593–2656.

Mary-Huard T., Robin S. & Daudin J.-J. (2007) A penalized criterion for variable selection in classification, J. of Mult. Anal., 98, 695–705.

Massart P. (2000) Some applications of concentration inequalities to statistics, Annales de la Faculté des Sciences de Toulouse.

Massart P. & Nédélec E. (2006) Risk Bounds for Statistical Learning, Annals of Statistics, 34, 2326–2366.

Nobel A.B. (2002) Analysis of a complexity-based pruning scheme for classification trees, IEEE Trans. Inform. Theory, 48, 2362–2368.
