Embedded Variable Selection in Classification Trees

(1)

HAL Id: hal-01019767

https://hal.archives-ouvertes.fr/hal-01019767

Submitted on 6 Jun 2020

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Embedded Variable Selection in Classification Trees

Servane Gey, Tristan Mary-Huard

To cite this version:

Servane Gey, Tristan Mary-Huard. Embedded Variable Selection in Classification Trees. GfKl 2011: Joint Conference of the German Classification Society (GfKl) and the German Association for Pattern Recognition (DAG), Aug 2011, Frankfurt, Germany. n.p. ⟨hal-01019767⟩

(2)

Embedded Variable Selection in Classification Trees

Servane Gey¹, Tristan Mary-Huard²

¹ MAP5, UMR 8145, Université Paris Descartes, Paris, France
² UMR AgroParisTech/INRA 518, Paris, France

(3)

Overview

Introduction
- Binary classification setting
- Model and variable selection in classification
- Classification trees

Variable selection for CART
- Classes of classification trees
- Theoretical results

(4)

Binary classification

Prediction of the unknown label $Y$ (0 or 1) of an observation $X$.
Use a training sample $D = (X_1, Y_1), \dots, (X_n, Y_n)$, drawn i.i.d. from $P$, to build a classifier
$$\hat{f} : \mathcal{X} \to \{0, 1\}, \qquad X \mapsto \hat{Y}.$$

Quality assessment
- Classification risk and loss (quality of the resulting classifier $\hat{f}$):
$$L(\hat{f}) = \mathbb{P}\bigl(\hat{f}(X) \neq Y \mid D\bigr), \qquad \ell(\hat{f}, f^*) = L(\hat{f}) - L(f^*).$$
- Average loss (quality of the classification algorithm):
$$\mathbb{E}_D\bigl[\ell(\hat{f}, f^*)\bigr].$$

Remark: all these quantities depend on $P$, which is unknown.
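Since $P$ is unknown, these quantities are approximated on data in practice. Below is a minimal sketch (an assumed setup, not part of the slides: the toy data and threshold classifier are placeholders) of how the risk $L(\hat{f})$ is estimated on a held-out sample.

```python
# Monte Carlo estimate of the risk L(f_hat) = P(f_hat(X) != Y | D) on held-out data.
import numpy as np

def empirical_risk(f_hat, X, y):
    """Proportion of misclassified points: the empirical counterpart of L(f_hat)."""
    return float(np.mean(f_hat(X) != y))

rng = np.random.default_rng(0)
X_test = rng.normal(size=(1000, 2))
y_test = (X_test[:, 0] > 0).astype(int)            # toy labels
f_hat = lambda X: (X[:, 0] > 0.1).astype(int)      # some previously fitted classifier
print(empirical_risk(f_hat, X_test, y_test))       # estimates L(f_hat)
```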

(5)

Basics of Vapnik theory: structural risk minimization (SRM)

Consider a collection of classes of classifiers $\mathcal{C}_1, \dots, \mathcal{C}_M$. Define
$$f_m = \arg\min_{f \in \mathcal{C}_m} L(f), \qquad \hat{f}_m = \arg\min_{f \in \mathcal{C}_m} L_n(f), \qquad \hat{f} = \arg\min_m \left[ L_n(\hat{f}_m) + \alpha \, \frac{V_{\mathcal{C}_m}}{n} \right].$$

- Class complexity. If $\mathcal{C}_1, \dots, \mathcal{C}_M$ have finite VC dimensions $V_{\mathcal{C}_1}, \dots, V_{\mathcal{C}_M}$, then
$$\mathbb{E}_D\bigl[\ell(\hat{f}, f^*)\bigr] \leq C \left\{ \inf_m \left( \ell(f_m, f^*) + K \sqrt{\frac{V_{\mathcal{C}_m}}{n}} \right) \right\} + \frac{\lambda}{n} \qquad \text{(Vapnik, 1998).}$$

- Classification task complexity (margin assumption). If there exists $h \in \,]0; 0.5[$ such that
$$\mathbb{P}\bigl(\lvert \eta(X) - 1/2 \rvert \leq h\bigr) = 0, \qquad \text{with } \eta(x) = \mathbb{P}(Y = 1 \mid X = x),$$
then
$$\mathbb{E}_D\bigl[\ell(\hat{f}, f^*)\bigr] \leq C \left\{ \inf_m \left( \ell(f_m, f^*) + K' \, \frac{V_{\mathcal{C}_m}}{n} \right) \right\} + \frac{\lambda'}{n}.$$
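To make the SRM rule concrete, here is a minimal sketch of selecting a class by penalized empirical risk; the penalty constant $\alpha$ and the toy numbers are assumptions, not values from the slides.

```python
# SRM selection: pick the class index m minimizing L_n(f_hat_m) + alpha * VC_m / n.
def srm_select(empirical_risks, vc_dims, n, alpha=1.0):
    """empirical_risks[m] = L_n(f_hat_m); vc_dims[m] = VC dimension of class C_m."""
    scores = [r + alpha * vc / n for r, vc in zip(empirical_risks, vc_dims)]
    return min(range(len(scores)), key=scores.__getitem__)

# Toy example: richer classes fit the data better but pay a larger complexity penalty.
print(srm_select(empirical_risks=[0.30, 0.18, 0.17], vc_dims=[2, 5, 50], n=200))  # -> 1
```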

(6)

Application to variable selection in classification

Assume that $X \in \mathbb{R}^p$. Define
$$f_{m(k)} = \arg\min_{f \in \mathcal{C}_{m(k)}} L(f), \qquad \hat{f}_{m(k)} = \arg\min_{f \in \mathcal{C}_{m(k)}} L_n(f).$$

- Variable selection. Choose $\hat{f}$ such that
$$\hat{f} = \arg\min_{m(k)} \left[ L_n(\hat{f}_{m(k)}) + \alpha \, \frac{V_{\mathcal{C}_{m(k)}}}{n} + \alpha' \, \frac{\log \binom{p}{k}}{n} \right].$$
Then (under the strong margin assumption)
$$\mathbb{E}_D\bigl[\ell(\hat{f}, f^*)\bigr] \leq C \log(p) \left\{ \inf_{m(k)} \left( \ell(f_{m(k)}, f^*) + K' \, \frac{V_{\mathcal{C}_{m(k)}}}{n} \right) \right\} + \frac{\lambda}{n}$$
(Massart, 2000; Mary-Huard et al., 2007).
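A minimal sketch of the penalty above; the exact split between $\alpha$ and $\alpha'$ and their numerical values are assumptions for illustration only.

```python
# Penalty charged to a class of classifiers using k out of p variables:
# alpha * VC / n + alpha' * log(C(p, k)) / n.
from math import comb, log

def vs_penalty(vc_dim, k, p, n, alpha=1.0, alpha_prime=1.0):
    return alpha * vc_dim / n + alpha_prime * log(comb(p, k)) / n

# With many candidate variables, the combinatorial term log C(p, k) / n dominates.
print(vs_penalty(vc_dim=10, k=5, p=1000, n=200))
```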

(7)

Classification trees

General strategy: heuristic approach (CART, Breiman, 1984).
- Find a tree $T_{\max}$ such that $L_n(f_{T_{\max}}) = 0$.
- Prune $T_{\max}$ using the criterion
$$\hat{f} = \arg\min_{T \subseteq T_{\max}} \left[ L_n(f_T) + \alpha \, \frac{|T|}{n} \right].$$
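A minimal sketch of this grow-then-prune strategy using scikit-learn's cost-complexity pruning; the data are a placeholder, and scikit-learn's criterion uses $\alpha \, |T|$ rather than $\alpha \, |T| / n$, which only rescales $\alpha$.

```python
# Grow a maximal tree, then prune it along the cost-complexity path.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = ((X[:, 0] > 0) & (X[:, 1] > 0)).astype(int)

t_max = DecisionTreeClassifier(random_state=0).fit(X, y)     # pure leaves: L_n(f_Tmax) = 0
path = t_max.cost_complexity_pruning_path(X, y)              # candidate alpha values
pruned = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X, y)
          for a in path.ccp_alphas]                           # nested sequence of subtrees
print([t.get_n_leaves() for t in pruned])
```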

(8)

Definitions

Consider a tree $T_{c\ell}$ with
- a given configuration $c$,
- a given list $\ell$ of associated variables.

Remark: the same variable may be associated with several nodes.

Class of tree classifiers. Define
$$\mathcal{C}_{c\ell} = \{ f : f \text{ based on } T_{c\ell} \}, \qquad H_{c\ell} = \text{VC log-entropy of the class } \mathcal{C}_{c\ell},$$
$$f_{c\ell} = \arg\min_{f \in \mathcal{C}_{c\ell}} L(f), \qquad \hat{f}_{c\ell} = \arg\min_{f \in \mathcal{C}_{c\ell}} L_n(f).$$

Remark: two classifiers $f, f' \in \mathcal{C}_{c\ell}$ differ only in their thresholds and labels.

(9)

Risk bound for one class

Proposition. Assume that the strong margin assumption is satisfied. For all $C > 1$, there exist positive constants $K_1$ and $K_2$ depending on $C$ such that
$$\mathbb{E}_D\bigl[\ell(\hat{f}_{c\ell}, f^*)\bigr] \leq C \left\{ \ell(f_{c\ell}, f^*) + K_1 \, \frac{|T_{c\ell}| \log(2n)}{n} \right\} + \frac{K_2}{n}.$$

Idea of proof
- Show that $\mathbb{E}[H_{c\ell}] \leq |T_{c\ell}| \log(2n)$, ...

(10)

Combinatorics for variable selection

To take variable selection into account in the penalized criterion, one needs to count the number of classes sharing the same a priori complexity.

- Parametric case (logistic regression, LDA, ...):
  - one parameter per variable,
  - two classes of classifiers based on $k$ variables have the same a priori complexity,
  - $\binom{p}{k}$ classes of a priori complexity $k$.
- Classification trees:
  - one parameter per internal node (threshold),
  - two classes $\mathcal{C}_{c\ell}$ and $\mathcal{C}_{c'\ell'}$ such that $|T_{c\ell}| = |T_{c'\ell'}|$ have the same a priori complexity.

Count the number of classes based on trees of size $k$!

(11)

Combinatorics for variable selection

A tree $T_{c\ell}$ is defined by
- a configuration,
- a list of variables associated with each node.

Number of configurations of size $k$:
$$N_c^k = \frac{1}{k} \binom{2k-2}{k-1}.$$

Number of variable lists of size $k$:
- the list is ordered: $\{1, 2, 3\} \neq \{2, 1, 3\}$,
- variables are selected with replacement: $\{1, 2, 1\}$ is allowed.
$$N_\ell^k = p^{k-1} \quad \text{instead of} \quad \binom{p}{k}!$$

Number of classes based on trees of size $|T_{c\ell}| = k$:
$$N_k = N_c^k \times N_\ell^k = \frac{1}{k} \binom{2k-2}{k-1} \times p^{k-1}, \qquad \log(N_k) \leq \lambda \, |T_{c\ell}| \log(p).$$
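A quick numerical check of this count; the value of $p$ below is arbitrary, and the constant $\lambda$ is only observed empirically here, not derived.

```python
# N_k = (1/k) * C(2k-2, k-1) * p^(k-1); check that log(N_k) is of order |T| * log(p).
from math import comb, log

def n_classes(k, p):
    n_configs = comb(2 * k - 2, k - 1) // k   # number of tree configurations of size k
    n_var_lists = p ** (k - 1)                # ordered variable lists, with replacement
    return n_configs * n_var_lists

p = 100
for k in (2, 5, 10, 20):
    print(k, n_classes(k, p), log(n_classes(k, p)) / (k * log(p)))  # ratio stays bounded
```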


(14)

Risk bound for tree classifiers

Proposition. Assume that the strong margin assumption is satisfied. If
$$\hat{f} = \arg\min_{c, \ell} \Bigl( L_n(\hat{f}_{c\ell}) + \mathrm{pen}(c, \ell) \Bigr), \qquad \text{where} \quad \mathrm{pen}(c, \ell) = C_1^h \, \frac{|T_{c\ell}| \log(2n)}{n} + C_2^h \, \frac{|T_{c\ell}| \log(p)}{n},$$
with constants $C_1^h$, $C_2^h$ depending on the $h$ appearing in the margin condition, then there exist positive constants $C$, $C'$, $C''$ such that
$$\mathbb{E}_D\bigl[\ell(\hat{f}, f^*)\bigr] \leq C \log(p) \left\{ \inf_{c, \ell} \left\{ \ell(f_{c\ell}, f^*) + C' \, \frac{|T_{c\ell}| \log(2n)}{n} \right\} \right\} + \frac{C''}{n}.$$

Remark:
- Theory: $\mathrm{pen}(c, \ell) = (a_n + b_n \log(p)) \, |T_{c\ell}| = \alpha_{p,n} \, |T_{c\ell}|$.
- Practice (CART): $\mathrm{pen}(c, \ell) = \alpha_{CV} \, |T_{c\ell}|$.
- Does $\alpha_{CV}$ match $\alpha_{p,n}$?
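One way to look at this question empirically; a minimal sketch with scikit-learn, where the data, the cross-validation setup, and dropping the constants $a_n$, $b_n$ are all assumptions made for illustration.

```python
# Compare the CV-selected pruning constant alpha_CV with the theoretical scale
# alpha_{p,n} ~ (log(2n) + log(p)) / n (constants omitted).
import numpy as np
from math import log
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n, p = 500, 50
X = rng.normal(size=(n, p))
y = ((X[:, 0] > 0) & (X[:, 1] > 0)).astype(int)

alphas = DecisionTreeClassifier(random_state=0).fit(X, y) \
             .cost_complexity_pruning_path(X, y).ccp_alphas
cv = GridSearchCV(DecisionTreeClassifier(random_state=0),
                  {"ccp_alpha": list(alphas)}, cv=5).fit(X, y)
alpha_cv = cv.best_params_["ccp_alpha"]
alpha_pn = (log(2 * n) + log(p)) / n    # theoretical scale, constants dropped
print(alpha_cv, alpha_pn)
```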

(15)

Illustration on simulated data (1)

- Variables $X_1, \dots, X_p$ are independent.
- If $X_1 > 0$ and $X_2 > 0$, then $\mathbb{P}(Y = 1) = q$; otherwise $\mathbb{P}(Y = 1) = 1 - q$.

Remark: easy case
- the Bayes classifier belongs to the collection of classes,
- the strong margin assumption is satisfied.
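A sketch of a generator for this design; the sample size, $p$, $q$, and the standard normal marginals are assumed values, not taken from the slides.

```python
import numpy as np

def simulate_easy(n=500, p=10, q=0.9, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))                  # independent variables
    in_quadrant = (X[:, 0] > 0) & (X[:, 1] > 0)  # only X1 and X2 carry signal
    prob1 = np.where(in_quadrant, q, 1 - q)      # P(Y=1) = q inside, 1-q outside
    y = rng.binomial(1, prob1)
    return X, y

X, y = simulate_easy()
print(X.shape, y.mean())
```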

(16)

Illustration on simulated data (2)

- $\mathbb{P}(Y = 1) = 0.5$.
- For $j = 1, 2$: $X_j \mid Y = 0 \sim \mathcal{N}(0, \sigma^2)$ and $X_j \mid Y = 1 \sim \mathcal{N}(1, \sigma^2)$.
- Additional variables are independent and non-informative.

Remark: difficult case
- the Bayes classifier does NOT belong to the collection of classes,
- the strong margin assumption is NOT satisfied.
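A sketch of a generator for this design; the sample size, $p$, $\sigma$, and the Gaussian choice for the non-informative variables are assumptions.

```python
import numpy as np

def simulate_hard(n=500, p=10, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    y = rng.binomial(1, 0.5, size=n)          # P(Y=1) = 0.5
    X = sigma * rng.normal(size=(n, p))       # noise variables (assumed N(0, sigma^2))
    X[:, :2] += y[:, None]                    # X1, X2 ~ N(Y, sigma^2): informative but overlapping
    return X, y

X, y = simulate_hard()
print(X.shape, y.mean())
```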

(17)

Conclusion

Model selection for tree classifiers:
- already investigated (Nobel 02, Gey & Nédélec 06, Gey 10),
- variable selection not investigated so far,
- the pruning step is now validated from this point of view.

Theory vs practice:
- theory: exhaustive search,
- practice: forward strategy,
- nonetheless, the theoretical results are informative!

Extension:
- in this talk: strong margin assumption,
- can be extended to a less restrictive margin assumption,
- manuscript on arXiv.org.

(18)

Bibliography

Breiman L., Friedman J., Olshen R. & Stone C. (1984) Classification And Regression Trees, Chapman & Hall.

Gey S. & Nédélec E. (2005) Model selection for CART regression trees, IEEE Trans. Inform. Theory, 51, 658–670.

Koltchinskii V. (2006) Local Rademacher Complexities and Oracle Inequalities in Risk Minimization, Annals of Statistics, 34, 2593–2656.

Mary-Huard T., Robin S. & Daudin J.-J. (2007) A penalized criterion for variable selection in classification, J. of Mult. Anal., 98, 695–705.

Massart P. (2000) Some applications of concentration inequalities to statistics, Annales de la Faculté des Sciences de Toulouse.

Massart P. & Nédélec E. (2006) Risk Bounds for Statistical Learning, Annals of Statistics, 34, 2326–2366.

Nobel A.B. (2002) Analysis of a complexity-based pruning scheme for classification trees, IEEE Trans. Inform. Theory, 48, 2362–2368.
