
(1)

Robustness of classification based on clustering

Ch. Ruwet

Department of Mathematics - University of Liège

(2)

Outline

1 Introduction
2 Statistical functionals
3 Error Rate
4 Influence functions
5 Asymptotic Loss
6 Breakdown point
7 Some improvements


(4)

Classification

[Figure: scatter plot of Algèbre versus Analyse scores; observations are labelled Succès (success) or Echec (failure).]

(5)

Clustering

[Figure: the same Algèbre/Analyse data shown without the Succès/Echec labels, as input for clustering.]

(6)

The k-means clustering method

$X_n = \{x_1, \dots, x_n\}$ a dataset in $p$ dimensions;

Aim of clustering: group similar observations in $k$ clusters $C_1, \dots, C_k$;

The k-means algorithm constructs clusters in order to minimize the within-cluster sum of squared distances;

The cluster centers $(T_n^1, \dots, T_n^k)$ are solutions of
\[
\min_{\{t_1,\dots,t_k\} \subset \mathbb{R}^p} \; \sum_{i=1}^{n} \Big( \inf_{1 \le j \le k} \| x_i - t_j \|^2 \Big);
\]

The classification rule: $x \in C_n^j \iff \|x - T_n^j\| = \min_{1 \le i \le k} \|x - T_n^i\|$;

Let us focus on $k = 2$ groups:
\[
C_n^1 = \Big\{ x \in \mathbb{R}^p : (T_n^1 - T_n^2)^t x - \tfrac{1}{2}\big( \|T_n^1\|^2 - \|T_n^2\|^2 \big) > 0 \Big\}.
\]
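To make the alternating minimization concrete, here is a minimal sketch of Lloyd's algorithm for k = 2 (not part of the original slides): the helper names two_means and in_C1, the random initialization and the absence of empty-cluster handling are simplifying assumptions of this sketch.

```python
import numpy as np

def two_means(X, n_iter=100, seed=0):
    """Minimal 2-means (Lloyd's algorithm): alternate assignments and mean updates."""
    rng = np.random.default_rng(seed)
    # start from two observations picked at random as initial centers (no empty-cluster handling)
    T = X[rng.choice(len(X), size=2, replace=False)].astype(float)
    for _ in range(n_iter):
        # assignment step: each x_i goes to the closest center
        d = np.linalg.norm(X[:, None, :] - T[None, :, :], axis=2)   # n x 2 distance matrix
        labels = d.argmin(axis=1)
        # update step: each center becomes the mean of its cluster
        new_T = np.array([X[labels == j].mean(axis=0) for j in (0, 1)])
        if np.allclose(new_T, T):
            break
        T = new_T
    return T, labels

def in_C1(x, T):
    """Half-space classification rule displayed above: x belongs to C_1 iff the inequality holds."""
    T1, T2 = T
    return (T1 - T2) @ x - 0.5 * (T1 @ T1 - T2 @ T2) > 0
```

The in_C1 helper is simply the half-space rule of the last display written out for a single observation.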


(9)

Result of 2-means

[Figure: the 2-means partition of the Algèbre/Analyse data, with the two centers T_1 and T_2 and the corresponding clusters C_1 and C_2.]

(10)

Introduction of contamination

[Figure: the Algèbre/Analyse data after contaminating observations have been introduced.]

(11)

Result of 2-means with contamination

[Figure: the 2-means partition obtained on the contaminated Algèbre/Analyse data.]

(12)

The generalized 2-means clustering method

The cluster centers $(T_n^1, T_n^2)$ are solutions of
\[
\min_{\{t_1, t_2\} \subset \mathbb{R}^p} \; \sum_{i=1}^{n} \Omega\Big( \inf_{1 \le j \le 2} \| x_i - t_j \| \Big)
\]
for an increasing penalty function $\Omega : \mathbb{R}^+ \to \mathbb{R}^+$ such that $\Omega(0) = 0$.

Classical penalty functions:
$\Omega(x) = x^2$ : 2-means method
$\Omega(x) = x$ : 2-medoids method

The classification rule:
\[
x \in C_n^1 \iff \Omega(\|x - T_n^1\|) \le \Omega(\|x - T_n^2\|) \iff \|x - T_n^1\| \le \|x - T_n^2\|.
\]
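As a small illustration of how the choice of Ω drives robustness (this example is not in the slides; the data, the fixed centers and the helper generalized_objective are invented for illustration), the generalized objective can be evaluated for both classical penalties on a sample containing one gross outlier: the squared penalty is dominated by that single point, the 2-medoids penalty much less so.

```python
import numpy as np

def generalized_objective(X, T, omega):
    """Sum over observations of omega(distance to the closest of the two centers)."""
    d = np.linalg.norm(X[:, None, :] - np.asarray(T)[None, :, :], axis=2).min(axis=1)
    return omega(d).sum()

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=(-2, 0), scale=1, size=(50, 2)),
               rng.normal(loc=(+2, 0), scale=1, size=(50, 2)),
               [[50.0, 50.0]]])                        # one gross outlier
T = [np.array([-2.0, 0.0]), np.array([2.0, 0.0])]      # centers held fixed for the comparison

print(generalized_objective(X, T, lambda x: x**2))     # 2-means penalty: dominated by the outlier
print(generalized_objective(X, T, lambda x: x))        # 2-medoids penalty: much less affected
```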

(13)

Classical penalty functions

[Figure: the two penalty functions plotted against x on [0, 5]: Ω(x) = x² (2-means) and Ω(x) = x (2-medoids).]

(14)

Result of 2-medoids

[Figure: the 2-medoids partition of the Algèbre/Analyse data.]

(15)

Result of 2-medoids with contamination

[Figure: the 2-medoids partition obtained on the contaminated Algèbre/Analyse data.]


(17)

Statistical functionals

The empirical distribution $F_n$ is replaced by a cumulative distribution $F \in \mathcal{F}$;

A statistical functional is a map $T : \mathcal{F} \to \mathbb{R}^l : F \mapsto T(F)$ such that $T(F_n) = T_n$.

[Figure: empirical cdf F_n and model cdf F ("fonction de répartition") plotted together.]

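A tiny plug-in illustration, not from the slides: a functional such as the mean, T(F) = ∫ x dF(x), applied to the empirical distribution F_n (mass 1/n on each observation) returns the usual sample statistic T(F_n) = T_n. The helper name T_mean and the simulated data are illustrative.

```python
import numpy as np

def T_mean(points, weights):
    """Mean functional T(F) = integral of x dF(x) for a discrete distribution F."""
    return np.sum(weights[:, None] * points, axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))            # sample from the model F
w = np.full(len(x), 1 / len(x))          # F_n puts mass 1/n on each observation
print(T_mean(x, w))                      # T(F_n) equals the sample mean T_n
```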
(18)

Classification setting

Suppose $X \sim F$ arises from $G_1$ and $G_2$ with $\pi_i(F) = \mathrm{IP}_F[X \in G_i]$; then $F$ is a mixture of two distributions
\[
F = \pi_1(F)\, F_1 + \pi_2(F)\, F_2
\]
with $\pi_1 + \pi_2 = 1$ and where $F_1$ and $F_2$ are the conditional distributions under $G_1$ and $G_2$, with densities $f_1$ and $f_2$.

(19)

Univariate mixture distribution

[Figure: density of a univariate two-component mixture F with mixing weights 0.4 and 0.6.]

(20)

The generalized 2-means statistical functionals

The cluster centers $(T_1(F), T_2(F))$ are solutions of
\[
\min_{\{t_1, t_2\} \subset \mathbb{R}^p} \int \Omega\Big( \inf_{1 \le j \le 2} \| x - t_j \| \Big) \, dF(x)
\]
for a suitable increasing penalty function $\Omega$;

The classification rule is $R_F : x \mapsto j \iff \|x - T_j(F)\| = \min_{1 \le i \le 2} \|x - T_i(F)\|$;

The clusters are
\[
C_1(F) = \big\{ x \in \mathbb{R}^p : A(F)^t x + b(F) > 0 \big\}, \qquad C_2(F) = \mathbb{R}^p \setminus C_1(F)
\]
with $A(F) = T_1(F) - T_2(F)$ and $b(F) = -\tfrac{1}{2}\big( \|T_1(F)\|^2 - \|T_2(F)\|^2 \big)$.


(22)

Mixture of spherically symmetric distributions

$X \sim F_{\mu,\sigma^2}$ if its density is
\[
f_{\mu,\sigma^2}(x) = \frac{K}{\sigma^p}\, g\!\left( \frac{(x - \mu)^t (x - \mu)}{\sigma^2} \right)
\]
where $g$ is a non-increasing generator function and with $K$ a constant such that the honesty condition holds.

Examples:
Multivariate Normal distribution: $g(r) = \exp(-r/2)$
Multivariate Student distribution: $g(r) = \big(1 + \tfrac{r}{\nu}\big)^{-(\nu + p)/2}$

Model (M):
\[
(M) \qquad F_M = \pi_1\, F_{-\mu,\sigma^2} + \pi_2\, F_{\mu,\sigma^2} \quad \text{with } \mu = \mu_1 e_1 \text{ and } \mu_1 > 0.
\]

(23)

Representation

[Figure: the bivariate mixture F_M represented in the (x1, x2) plane.]

(24)

Position of $T_1(F_M)$ and $T_2(F_M)$

2-means:
Proposition (Kurata and Qiu, 2011)
Under the model distribution (M), the 2-means centers are on the first axis.

Generalized 2-means:
Conjecture (Ruwet and Haesbroeck, 2011)
Under the model distribution (M), the generalized 2-means centers are on the first axis.


(27)

Error rate of the 2-means

[Figure: the 2-means partition of the Algèbre/Analyse data shown together with the true Succès/Echec labels.]

(28)

Error rate

Training sample according to $F$: estimation of the rule;
Test sample according to $F_m$: evaluation of the rule;
In ideal circumstances: $F = F_m$.

Probability to misclassify data coming from $F_m$:
\[
\mathrm{ER}(F, F_m) = \pi_1(F_m)\, \mathrm{IP}_{F_m}\big[ R_F(X) \neq 1 \mid G_1 \big] + \pi_2(F_m)\, \mathrm{IP}_{F_m}\big[ R_F(X) \neq 2 \mid G_2 \big]
= \sum_{j=1}^{2} \pi_j(F_m)\, \mathrm{IP}_{F_m}\big[ R_F(X) \neq j \mid G_j \big].
\]

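A hedged Monte Carlo reading of this definition (not in the slides): fix a rule R_F estimated from the training distribution, draw a large test sample from F_m, and average the misclassification indicator. The centers, priors and normal components below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# rule estimated from the training distribution F: here simply "closest center"
T1, T2 = np.array([-2.0, 0.0]), np.array([2.0, 0.0])
def R_F(x):
    return np.where(np.linalg.norm(x - T1, axis=1) <= np.linalg.norm(x - T2, axis=1), 1, 2)

# test sample from F_m: draw group labels first, then from the conditional distributions F_1, F_2
pi1, n = 0.4, 100_000
g = np.where(rng.random(n) < pi1, 1, 2)
x = np.where((g == 1)[:, None], rng.normal((-2, 0), 1, (n, 2)), rng.normal((2, 0), 1, (n, 2)))

er = np.mean(R_F(x) != g)     # Monte Carlo estimate of ER(F, F_m)
print(er)
```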

(31)

Optimality in classification

A classification rule is optimal if the corresponding error rate is minimal;

The optimal classification rule is the Bayes rule:
\[
x \in C_1(F) \iff \pi_1(F)\, f_1(x) > \pi_2(F)\, f_2(x)
\]
(Anderson, 1958).

Proposition (Ruwet and Haesbroeck, 2011)
The 2-means procedure is optimal under the model
\[
F_O = 0.5\, F_{-\mu,\sigma^2} + 0.5\, F_{\mu,\sigma^2} \quad \text{with } \mu = \mu_1 e_1 \text{ and } \mu_1 > 0.
\]
With the Conjecture, the generalized 2-means procedures are also optimal under $F_O$.
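For a concrete univariate example (not from the slides), the Bayes rule can be coded directly from the two component densities; the normal components, the priors and the helper names normal_pdf and bayes_rule are illustrative assumptions.

```python
import numpy as np

def normal_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def bayes_rule(x, pi1, pi2, mu1, mu2, sd):
    """Assign x to group 1 iff pi1 * f1(x) > pi2 * f2(x) (Anderson, 1958)."""
    return np.where(pi1 * normal_pdf(x, mu1, sd) > pi2 * normal_pdf(x, mu2, sd), 1, 2)

x = np.linspace(-4, 4, 9)
print(bayes_rule(x, pi1=0.5, pi2=0.5, mu1=-1.0, mu2=1.0, sd=1.0))
# with equal priors and equal variances the boundary sits at the midpoint 0
```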


(33)

Contaminated distribution

A contaminated distribution is defined by
\[
F_\varepsilon = (1 - \varepsilon)\, F + \varepsilon\, G,
\]
where $G$ is an arbitrary distribution function.

To see the influence of one singular point $x$, take $G = \Delta_x$, leading to
\[
F_{\varepsilon, x} = (1 - \varepsilon)\, F + \varepsilon\, \Delta_x.
\]

(34)

Contaminated mixture

[Figure: density of the contaminated mixture F_{ε,x}: the two components with weights (1−0.05)·0.6 and (1−0.05)·0.4, plus a point mass of weight 0.05 at x.]
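A quick way to simulate from such an ε-contaminated distribution, sketched here with invented component means and the helper name sample_contaminated (only the weights 0.4/0.6 and ε = 0.05 come from the figure):

```python
import numpy as np

def sample_contaminated(n, eps, x_contam, rng):
    """Draw n points from (1 - eps) * F + eps * Delta_x, with F a 0.4/0.6 normal mixture."""
    comp = rng.random(n) < 0.4                          # component with weight 0.4
    clean = np.where(comp, rng.normal(-2.0, 1.0, n), rng.normal(2.0, 1.0, n))
    contaminate = rng.random(n) < eps                   # eps-fraction replaced by the point mass
    return np.where(contaminate, x_contam, clean)

rng = np.random.default_rng(0)
sample = sample_contaminated(10_000, eps=0.05, x_contam=10.0, rng=rng)
print(sample.mean())    # the single point mass already shifts the sample mean noticeably
```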

(35)

Error rate under contamination

Contaminated training sample according to $F_\varepsilon$: estimation of the rule;
Test sample according to $F_m$: evaluation of the rule.
\[
\mathrm{ER}(F_\varepsilon, F_m) = \sum_{j=1}^{2} \pi_j(F_m)\, \mathrm{IP}_{F_m}\big[ R_{F_\varepsilon}(X) \neq j \mid G_j \big]
\]

Robustness of classification based on clustering Ch. Ruwet Introduction Statistical functionals Error Rate Influence functions Asymptotic Loss Breakdown point Some improvements

Outline

1

Introduction

2

Statistical functionals

3

Error Rate

4

Influence functions

5

Asymptotic Loss

6

Breakdown point

7

Some improvements

(37)

Definition and properties of the first order influence function

Hampel et al. (1986): For any statistical functional $T$ and any distribution $F$,
\[
\mathrm{IF}(x; T, F) = \lim_{\varepsilon \to 0} \frac{T\big((1-\varepsilon)F + \varepsilon \Delta_x\big) - T(F)}{\varepsilon}
= \frac{\partial}{\partial \varepsilon} T\big((1-\varepsilon)F + \varepsilon \Delta_x\big) \Big|_{\varepsilon = 0}
\]
(under condition of existence);

$\mathrm{E}_F[\mathrm{IF}(X; T, F)] = 0$;

First order Taylor expansion of $T$ at $F$:
\[
T(F_{\varepsilon,x}) \approx T(F) + \varepsilon\, \mathrm{IF}(x; T, F)
\]
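A small numerical check, not from the slides: for the mean functional the contaminated value is T((1−ε)F_n + εΔ_x) = (1−ε) times the sample mean plus ε x, so a finite-difference quotient recovers the well-known influence function x minus the sample mean. The helper names below are illustrative.

```python
import numpy as np

def T_mean_mixture(sample, eps, x):
    """Mean functional evaluated at the contaminated distribution (1-eps)*F_n + eps*Delta_x."""
    return (1 - eps) * sample.mean() + eps * x

def empirical_IF(sample, x, eps=1e-6):
    """Finite-difference approximation of IF(x; T, F_n) for the mean functional."""
    return (T_mean_mixture(sample, eps, x) - sample.mean()) / eps

rng = np.random.default_rng(0)
sample = rng.normal(size=1000)
print(empirical_IF(sample, x=5.0))   # approximately 5.0 - sample.mean()
print(5.0 - sample.mean())           # exact influence function of the mean at x = 5
```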


(41)

First order influence function of the error rate

Now, the training sample is distributed as $F_{\varepsilon,x}$.

If the model distribution is $F_O$,
\[
\mathrm{ER}(F_{\varepsilon,x}, F_O) \approx \mathrm{ER}(F_O, F_O) + \varepsilon\, \mathrm{IF}(x; \mathrm{ER}, F_O)
\]
\[
\mathrm{ER}(F_{\varepsilon,x}, F_O) \ge \mathrm{ER}(F_O, F_O)
\]
\[
\mathrm{E}_{F_O}[\mathrm{IF}(X; \mathrm{ER}, F_O)] = 0
\]
Since the error rate cannot fall below its optimal value, the first order term must be non-negative for every $x$; combined with its zero expectation, this forces
\[
\mathrm{IF}(x; \mathrm{ER}, F_O) \equiv 0.
\]


(43)

Robustness of classification based on clustering Ch. Ruwet Introduction Statistical functionals Error Rate Influence functions Asymptotic Loss Breakdown point Some improvements

Definition of the second order influence function

For any statistical functional T and any distribution F ,

IF2(

x

;

T,

F

O

) =

2

∂ε

2

T((1

− ε)

F

O

+ ε∆

x

)

ε=

0

(under condition of existence).

Second order Taylor expansion of ER at F

O

:

ER

(

F

ε,x

,

F

O

) ≈

ER

(

F

O

,

F

O

) +

ε

2

2

IF2

(

x

;

ER

,

F

O

)

for

ε

small enough.

(44)

First order influence function of the error rate under $F_M$

Proposition (Ruwet and Haesbroeck, 2011)
Under $F_M$, the first order influence function of the error rate of the generalized 2-means classification procedure is given by
\[
\mathrm{IF}(x; \mathrm{ER}, F_M) = \frac{\pi_2 - \pi_1}{2}\, f_{\mu,\sigma^2}(0)\, \big[ \mathrm{IF}(x; T_1, F_M) + \mathrm{IF}(x; T_2, F_M) \big]^t e_1
\]
for all $x$ such that $A(F_M)^t x + b(F_M) \neq 0$.

This influence function is bounded as soon as the influence functions of the generalized 2-means centers (see next slide) are bounded.

The influence function is also available for any model distribution $F$.

(45)

First order influence function of the generalized 2-means centers

Proposition (García-Escudero and Gordaliza, 1999)
The influence function of the generalized 2-means centers $T_1$ and $T_2$ is given by
\[
\begin{pmatrix} \mathrm{IF}(x; T_1, F_m) \\ \mathrm{IF}(x; T_2, F_m) \end{pmatrix}
= M^{-1} \begin{pmatrix} \omega_1(x) \\ \omega_2(x) \end{pmatrix}
\]
where
\[
\omega_i(x) = -\,\mathrm{grad}_y\, \Omega(\|y\|) \big|_{y = x - T_i(F_m)}\; \mathrm{I}\big(x \in C_i(F_m)\big)
\]
and where the matrix $M$ depends only on the distribution $F_m$.

This influence function is bounded as soon as $M^{-1}$ exists and as soon as the gradient of $\Omega$ is bounded.

(46)

Univariate second order influence function of the error rate under $F_O$

Proposition (Ruwet and Haesbroeck, 2011)
Under $F_O$, the univariate second order influence function of the error rate of the generalized 2-means classification procedure is given by
\[
\mathrm{IF2}(x; \mathrm{ER}, F_O) = -\frac{1}{4}\, f'_{-\mu,\sigma^2}(0)\, \big[ \mathrm{IF}(x; T_1, F_O) + \mathrm{IF}(x; T_2, F_O) \big]^2
\]
for all $x$ such that $A(F_O)\, x + b(F_O) \neq 0$.

The influence function is also available for multivariate distributions.

(47)

Graph of IF(x; ER, F_M)

[Figure: IF(x; ER, F_M) plotted against x for the 2-means and the 2-medoids procedures.]

(48)

Graph of IF(x; ER, F_M)

[Figure: IF(x; ER, F_M) for the 2-means procedure, plotted against x for π_1 = 0.2, 0.4, 0.6 and 0.8.]

(49)

Graph of IF(x; ER, F_M)

[Figure: IF(x; ER, F_M) for the 2-medoids procedure, plotted against x for ∆ = 1, 3 and 5.]

(50)

Graph of IF2(x; ER, F_O)

[Figure: the second order influence function IF2(x; ER, F_O) plotted against x for the 2-means and the 2-medoids procedures.]


(52)

Asymptotic loss

Under optimality ($F_O$), a measure of the expected increase in error rate when estimating the optimal clustering rule from a finite sample with empirical cdf $F_n$ is
\[
\text{A-Loss} = \lim_{n \to \infty} n\, \mathrm{E}_{F_O}\big[ \mathrm{ER}(F_n, F_O) - \mathrm{ER}(F_O, F_O) \big].
\]
As in Croux et al. (2008):

Proposition
Under some regularity conditions on the cluster centers estimators, A-Loss admits an explicit expression in terms of the asymptotic variances and covariances of those estimators (made explicit for the normal model on the next slide).

(53)

Asymptotic loss

Proposition (Ruwet and Haesbroeck, 2011)
Under an optimal mixture of normal distributions, $F_N$, with $\mu = \Delta/2\, e_1$, the asymptotic loss of the generalized 2-means procedure is given by
\[
\text{A-Loss} = \frac{f_{0,1}\!\big(\tfrac{\Delta}{2\sigma}\big)}{16\, \sigma^3 \tau^2}
\Big\{ \tau^2 \big[ \mathrm{ASV}(T_{21}) + \mathrm{ASV}(T_{11}) + 2\, \mathrm{ASC}(T_{11}, T_{21}) \big]
+ \sigma^2 \big[ \mathrm{ASV}(T_{12}) + \mathrm{ASV}(T_{22}) - 2\, \mathrm{ASC}(T_{12}, T_{22}) \big] \Big\}
\]
where ASV and ASC stand for the asymptotic variance and covariance of their component (at the model distribution).

(54)

Graph of the asymptotic loss

[Figure: A-Loss plotted against Delta for the 2-means and the 2-medoids procedures.]


(56)

Asymptotic relative classification efficiencies

A measure of the price one needs to pay in error rate for protection against the outliers when using a robust procedure instead of the classical one is
\[
\mathrm{ARCE}(\text{Robust}, \text{Classical}) = \frac{\text{A-Loss}(\text{Classical})}{\text{A-Loss}(\text{Robust})}.
\]
More generally, the ARCE of a method (Method 1) w.r.t. another one (Method 2) is given by
\[
\mathrm{ARCE}(\text{Method 1}, \text{Method 2}) = \frac{\text{A-Loss}(\text{Method 2})}{\text{A-Loss}(\text{Method 1})}.
\]

(57)

ARCE of 2-medoids w.r.t. 2-means

[Figure: ARCE of the 2-medoids w.r.t. the 2-means plotted against Delta; the ARCE axis runs from 0.60 to 0.70.]

(58)

ARCE of classification procedures w.r.t. 2-means

[Figure: ARCE of the Fisher and Logistic procedures w.r.t. the 2-means, plotted against Delta.]


(60)

Intuitive definition

The breakdown point (BDP) is the minimal fraction of outliers that needs to be added (addition BDP) or replaced (replacement BDP) in order to completely destroy the estimator, i.e. to get an estimate
at infinity (Hampel, 1971);
at the bounds of the support of the estimator (He and Simpson, 1992);
which is restricted to a finite set while it could lie in an infinite set without contamination (Genton and Lucas, 2003).

(61)

BDP of the error rate of the generalized 2-means

Let the observation $x$ of the training sample ($F_{\varepsilon,x}$) tend to infinity:

It becomes the center of a cluster containing this observation only, even if $\Omega(x) = x$ (García-Escudero and Gordaliza, 1999);
One entire group of the test sample ($F$) is badly classified while the other is well classified;
$\mathrm{ER}(F_\varepsilon, F) = \pi_1$ or $\mathrm{ER}(F_\varepsilon, F) = \pi_2$ for any sample;
ER has broken down in the sense of Genton and Lucas (2003);
The BDP of the ER is $1/n$, which tends to zero as $n \to \infty$.
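A rough simulation sketch of this breakdown mechanism (not from the slides; the compact two_means helper with random restarts, the simulated data and the chosen contamination points are assumptions of the sketch): as the single training point is sent further away, it ends up forming a cluster on its own and the whole clean test sample falls into one single cluster.

```python
import numpy as np

def two_means(X, rng, iters=50, restarts=10):
    """Compact 2-means: Lloyd iterations with random restarts, keep the best objective."""
    best, best_obj = None, np.inf
    for _ in range(restarts):
        T = X[rng.choice(len(X), 2, replace=False)].astype(float)
        for _ in range(iters):
            lab = np.linalg.norm(X[:, None] - T[None], axis=2).argmin(axis=1)
            if len(np.unique(lab)) < 2:      # degenerate split, try the next restart
                break
            T = np.array([X[lab == j].mean(axis=0) for j in (0, 1)])
        obj = (np.linalg.norm(X[:, None] - T[None], axis=2).min(axis=1) ** 2).sum()
        if obj < best_obj:
            best, best_obj = T, obj
    return best

rng = np.random.default_rng(2)
clean = np.vstack([rng.normal((-3, 0), 1, (100, 2)), rng.normal((3, 0), 1, (100, 2))])

for far in (10.0, 1e3, 1e6):
    train = np.vstack([clean, [[far, far]]])      # one training observation sent towards infinity
    T = two_means(train, rng)
    lab = np.linalg.norm(clean[:, None] - T[None], axis=2).argmin(axis=1)
    print(far, np.bincount(lab, minlength=2))     # for far enough points, one center sits on the
                                                  # outlier and all clean test points share one cluster
```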


(68)

The trimming approach

Idea: Delete extreme observations!
Problem: How can we detect extreme observations?
Solution: Impartial trimming

$k$ the fixed number of clusters;
$\alpha \in [0, 1[$ the trimming size;
$X_n = \{x_1, \dots, x_n\} \subset \mathbb{R}^p$ a dataset that is not concentrated on $k$ points after removing a mass equal to $\alpha$;
Optimization over partitions $\mathcal{R} = \{R_1, \dots, R_k\}$ of $\{1, \dots, n\}$ with $\lceil n(1 - \alpha) \rceil$ observations;

(69)

The generalized trimmed k-means method

Cuesta-Albertos et al., 1997

The cluster centers $(T_n^1, \dots, T_n^k)$ are solutions of the double minimization problem
\[
\min_{\mathcal{R}} \; \min_{\{t_1, \dots, t_k\} \subset \mathbb{R}^p} \; \sum_{x_i \in \mathcal{R}} \Omega\Big( \inf_{1 \le j \le k} \| x_i - t_j \| \Big);
\]
The classification rule: $x \in C_n^j \iff \|x - T_n^j\| = \min_{1 \le i \le k} \|x - T_n^i\|$.
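A minimal sketch of the impartial-trimming idea, assuming the squared penalty, k = 2 and a simple concentration-step heuristic instead of the exact double minimization; the function trimmed_two_means and the simulated data are invented for illustration.

```python
import numpy as np

def trimmed_two_means(X, alpha=0.1, iters=50, seed=0):
    """Heuristic trimmed 2-means: keep the ceil(n*(1-alpha)) best-fitting points at each step."""
    rng = np.random.default_rng(seed)
    n_keep = int(np.ceil(len(X) * (1 - alpha)))
    T = X[rng.choice(len(X), 2, replace=False)].astype(float)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - T[None], axis=2)   # n x 2 distances to the centers
        closest = d.min(axis=1)
        keep = np.argsort(closest)[:n_keep]                 # impartial trimming: drop the rest
        lab = d[keep].argmin(axis=1)
        T = np.array([X[keep][lab == j].mean(axis=0) for j in (0, 1)])   # no empty-cluster handling
    return T, keep

rng = np.random.default_rng(3)
X = np.vstack([rng.normal((-3, 0), 1, (100, 2)),
               rng.normal((3, 0), 1, (100, 2)),
               rng.normal((30, 30), 1, (10, 2))])           # a small contaminating cluster
T, keep = trimmed_two_means(X, alpha=0.1)
print(T)   # typically the remote points are trimmed and the centers stay near (-3, 0) and (3, 0)
```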


(71)

Properties of the generalized trimmed 2-means

García-Escudero and Gordaliza, 1999
Bounded IF whatever $\Omega$;
Better breakdown behavior.

Ruwet and Haesbroeck (unpublished result)
If the Conjecture also holds for the generalized trimmed 2-means, this procedure is optimal under the model $F_O$.


(73)

Problem of all "k-means type" procedures

[Figure: a simulated dataset and the partition obtained by a k-means type procedure.]

(74)

The TCLUST procedure

García-Escudero et al., 2008

Optimization also over the scatter matrices $S_n^j$ and the weights $p_n^j$ such that $\sum_{j=1}^{k} p_j = 1$;

Maximization of
\[
\sum_{j=1}^{k} \sum_{i \in R_j} \log\Big( p_j\, \varphi\big( x_i ; T_j, S_j \big) \Big)
\]
where $\varphi$ is the pdf of the Gaussian distribution;

Eigenvalues-ratio restriction:
\[
\frac{M_n}{m_n} = \frac{\max_{j=1,\dots,k} \max_{l=1,\dots,p} \lambda_l(S_j)}{\min_{j=1,\dots,k} \min_{l=1,\dots,p} \lambda_l(S_j)} \le c
\]
for a constant $c \ge 1$ and where $\lambda_l(S_j)$ are the eigenvalues of $S_j$, $l = 1, \dots, p$ and $j = 1, \dots, k$.
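A small sketch of the eigenvalue-ratio restriction only (not of the full TCLUST optimization, and not from the slides): given candidate scatter matrices, compute M_n/m_n and compare it with the constant c. The matrices and c = 10 are illustrative.

```python
import numpy as np

def eigen_ratio(scatters):
    """M_n / m_n: ratio of the largest to the smallest eigenvalue over all scatter matrices."""
    eig = np.concatenate([np.linalg.eigvalsh(S) for S in scatters])
    return eig.max() / eig.min()

# two illustrative 2x2 scatter matrices: one nearly spherical, one strongly elongated
S1 = np.array([[1.0, 0.0], [0.0, 1.2]])
S2 = np.array([[8.0, 2.0], [2.0, 0.8]])
ratio = eigen_ratio([S1, S2])
print(ratio, ratio <= 10)   # the restriction M_n/m_n <= c (here c = 10) rules out such degenerate fits
```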

(75)

Improvement with the TCLUST procedure

[Figure: the simulated dataset and the clusters estimated by TCLUST.]

(76)


Current and future work

Robustness properties of the TCLUST procedure:

The influence function (Ruwet et al., Submitted)

The breakdown behavior
