Robustness of classification based on clustering

Ch. Ruwet

Department of Mathematics, University of Liège
Outline

1. Introduction
2. Statistical functionals
3. Error Rate
4. Influence functions
5. Asymptotic Loss
6. Breakdown point
7. Some improvements
Classification

[Figure: scatter plot of Analyse against Algèbre exam scores (axes 0–15), with observations labeled "Succès" (pass) or "Echec" (fail).]
Clustering

[Figure: the same Algèbre/Analyse scores (axes 0–15), now without group labels, to be clustered.]
The k-means clustering method

$X_n = \{x_1, \dots, x_n\}$, a dataset in $p$ dimensions;
Aim of clustering: group similar observations into $k$ clusters $C_1, \dots, C_k$;
The $k$-means algorithm constructs the clusters so as to minimize the within-cluster sum of squared distances;
The cluster centers $(T_n^1, \dots, T_n^k)$ are solutions of
$$\min_{\{t_1,\dots,t_k\} \subset \mathbb{R}^p} \sum_{i=1}^{n} \inf_{1 \le j \le k} \|x_i - t_j\|^2;$$
The classification rule:
$$x \in C_n^j \iff \|x - T_n^j\| = \min_{1 \le i \le k} \|x - T_n^i\|;$$
Focusing on $k = 2$ groups:
$$C_n^1 = \left\{ x \in \mathbb{R}^p : (T_n^1 - T_n^2)^t x - \tfrac{1}{2}\left(\|T_n^1\|^2 - \|T_n^2\|^2\right) > 0 \right\}.$$
Result of 2-means

[Figure: 2-means clustering of the Algèbre/Analyse data (axes 0–20), showing the two clusters $C_1$ and $C_2$ and their centers $T_1$ and $T_2$.]
Introduction of contamination

[Figure: the Algèbre/Analyse data (axes 0–20) with contaminating observations added.]
Result of 2-means with contamination

[Figure: 2-means clustering of the contaminated Algèbre/Analyse data (axes 0–20); the contamination distorts the clusters.]
The generalized 2-means clustering method

The cluster centers $(T_n^1, T_n^2)$ are the solution of
$$\min_{\{t_1, t_2\} \subset \mathbb{R}^p} \sum_{i=1}^{n} \Omega\!\left( \inf_{1 \le j \le 2} \|x_i - t_j\| \right)$$
for an increasing penalty function $\Omega : \mathbb{R}^+ \to \mathbb{R}^+$ such that $\Omega(0) = 0$.

Classical penalty functions:
$\Omega(x) = x^2$ → 2-means method;
$\Omega(x) = x$ → 2-medoids method.

The classification rule:
$$x \in C_n^1 \iff \Omega(\|x - T_n^1\|) \le \Omega(\|x - T_n^2\|) \iff \|x - T_n^1\| \le \|x - T_n^2\|.$$
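The alternating minimization behind this criterion can be sketched in a few lines. The code below is a minimal illustration, not the algorithm of the talk: it restricts candidate centers to observed data points (a medoid-style update), which sidesteps the lack of a closed-form minimizer of the penalized sum for an arbitrary Ω; the function name and the deterministic initialization are choices of this sketch.

```python
import numpy as np

def generalized_2means(X, penalty, n_iter=50):
    """Alternating-minimization sketch of the generalized 2-means criterion
        min_{t1,t2} sum_i penalty( min_j ||x_i - t_j|| ),
    with centers restricted to data points (medoid-style update)."""
    # Deterministic start: the first point and the point farthest from it.
    i1 = np.linalg.norm(X - X[0], axis=1).argmax()
    centers = X[[0, i1]].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assignment step: each observation goes to its nearest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: within each cluster, pick the member minimizing the
        # summed penalty of distances to the other members.
        new_centers = centers.copy()
        for j in range(2):
            members = X[labels == j]
            if len(members) == 0:
                continue
            pair_d = np.linalg.norm(
                members[:, None, :] - members[None, :, :], axis=2
            )
            new_centers[j] = members[penalty(pair_d).sum(axis=0).argmin()]
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

With `penalty=lambda r: r ** 2` this mimics the 2-means method (up to the restriction of centers to data points); with `penalty=lambda r: r` it is the 2-medoids method.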
Classical penalty functions

[Figure: $\Omega(x)$ for $x \in [0, 5]$, comparing the 2-means penalty $x^2$ with the 2-medoids penalty $x$.]
Result of 2-medoids

[Figure: 2-medoids clustering of the Algèbre/Analyse data (axes 0–20).]
Result of 2-medoids with contamination

[Figure: 2-medoids clustering of the contaminated Algèbre/Analyse data (axes 0–20); the clusters are much less affected than with 2-means.]
Statistical functionals

The empirical distribution $F_n$ is replaced by a cumulative distribution $F \in \mathcal{F}$;
A statistical functional $T : \mathcal{F} \to \mathbb{R}^l : F \mapsto T(F)$ is such that $T(F_n) = T_n$.

[Figure: the empirical cdf $F_n$ against the model cdf $F$.]
Classification setting

Suppose $X \sim F$ arises from groups $G_1$ and $G_2$ with $\pi_i(F) = \mathrm{P}_F[X \in G_i]$. Then $F$ is a mixture of two distributions,
$$F = \pi_1(F)\, F_1 + \pi_2(F)\, F_2,$$
with $\pi_1 + \pi_2 = 1$ and where $F_1$ and $F_2$ are the conditional distributions under $G_1$ and $G_2$, with densities $f_1$ and $f_2$.
Univariate mixture distribution

[Figure: density of $F$, a univariate two-component mixture with weights 0.4 and 0.6.]
The generalized 2-means statistical functionals

The cluster centers $(T_1(F), T_2(F))$ are the solution of
$$\min_{\{t_1, t_2\} \subset \mathbb{R}^p} \int \Omega\!\left( \inf_{1 \le j \le 2} \|x - t_j\| \right) dF(x)$$
for a suitable increasing penalty function $\Omega$;
The classification rule is
$$R_F : x \mapsto j \iff \|x - T_j(F)\| = \min_{1 \le i \le 2} \|x - T_i(F)\|;$$
The clusters are
$$C_1(F) = \{ x \in \mathbb{R}^p : A(F)^t x + b(F) > 0 \}, \qquad C_2(F) = \mathbb{R}^p \setminus C_1(F),$$
with $A(F) = T_1(F) - T_2(F)$ and $b(F) = -\tfrac{1}{2}\left( \|T_1(F)\|^2 - \|T_2(F)\|^2 \right)$.
Mixture of spherically symmetric distributions

$X \sim F_{\mu,\sigma^2}$ if its density is
$$f_{\mu,\sigma^2}(x) = \frac{K}{\sigma^p}\, g\!\left( \frac{(x-\mu)^t (x-\mu)}{\sigma^2} \right)$$
where $g$ is a non-increasing generator function and $K$ is a constant ensuring that the density integrates to one.

Examples:
Multivariate Normal distribution: $g(r) = \exp(-r/2)$;
Multivariate Student distribution: $g(r) = (1 + r/\nu)^{-(\nu+p)/2}$.

Model (M):
$$\text{(M)} \qquad F_M = \pi_1 F_{-\mu,\sigma^2} + \pi_2 F_{\mu,\sigma^2} \quad \text{with } \mu = \mu_1 e_1 \text{ and } \mu_1 > 0.$$
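Draws from model (M) are easy to simulate in the Gaussian case, which corresponds to the generator $g(r) = \exp(-r/2)$. A small sketch (the function name and parameter defaults are illustrative, not from the talk):

```python
import numpy as np

def sample_model_M(n, pi1, mu1, sigma=1.0, p=2, seed=0):
    """Draw n points from model (M): a two-component mixture of spherical
    Gaussians centered at -mu and +mu, with mu = mu1 * e1.
    Returns the sample and a boolean mask (True = component at -mu)."""
    rng = np.random.default_rng(seed)
    groups = rng.random(n) < pi1          # component indicator, P = pi1
    mu = np.zeros(p)
    mu[0] = mu1                           # mean shift along the first axis
    signs = np.where(groups, -1.0, 1.0)
    X = signs[:, None] * mu[None, :] + rng.normal(0.0, sigma, (n, p))
    return X, groups
```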
Representation

[Figure: contour plot of the mixture $F_M$ in the $(x_1, x_2)$ plane.]
Position of $T_1(F_M)$ and $T_2(F_M)$

2-means:
Proposition (Kurata and Qiu, 2011). Under the model distribution (M), the 2-means centers lie on the first axis.

Generalized 2-means:
Conjecture (Ruwet and Haesbroeck, 2011). Under the model distribution (M), the generalized 2-means centers lie on the first axis.
Error rate of the 2-means

[Figure: 2-means classification of the Algèbre/Analyse data (axes 0–20) against the true labels "Succès" (pass) and "Echec" (fail); the misclassified points determine the error rate.]
Error rate

Training sample drawn from $F$: estimation of the rule;
Test sample drawn from $F_m$: evaluation of the rule;
In ideal circumstances, $F = F_m$.

Probability of misclassifying data coming from $F_m$:
$$\mathrm{ER}(F, F_m) = \pi_1(F_m)\, \mathrm{P}_{F_m}[R_F(X) \ne 1 \mid G_1] + \pi_2(F_m)\, \mathrm{P}_{F_m}[R_F(X) \ne 2 \mid G_2] = \sum_{j=1}^{2} \pi_j(F_m)\, \mathrm{P}_{F_m}[R_F(X) \ne j \mid G_j].$$
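The quantity $\mathrm{ER}(F, F_m)$ can be checked by simulation. The sketch below, for an assumed univariate model $F = F_m = 0.5\,N(-\mu_1, 1) + 0.5\,N(\mu_1, 1)$ with $\mu_1 = 1.5$ (choices of this illustration), estimates the 2-means rule on a clean training sample and evaluates the error rate by Monte Carlo; for this symmetric mixture the Bayes error is $\Phi(-\mu_1) \approx 0.067$, and the estimated rule should come close to it.

```python
import numpy as np

def two_means_centers(x, n_iter=100):
    # Lloyd's algorithm for k = 2 on univariate data,
    # deterministically initialized at the sample extremes.
    centers = np.array([x.min(), x.max()])
    for _ in range(n_iter):
        labels = np.abs(x[:, None] - centers[None, :]).argmin(axis=1)
        new = np.array([x[labels == j].mean() for j in range(2)])
        if np.allclose(new, centers):
            break
        centers = new
    return np.sort(centers)

def error_rate(centers, mu1, n_test=200_000, seed=0):
    # Monte Carlo ER(F, F_m): balanced test sample from
    # F_m = 0.5 N(-mu1, 1) + 0.5 N(mu1, 1); rule = nearest center.
    rng = np.random.default_rng(seed)
    g = rng.integers(0, 2, n_test)                  # true group labels
    x = rng.normal((2 * g - 1) * mu1, 1.0)
    pred = (np.abs(x - centers[1]) < np.abs(x - centers[0])).astype(int)
    return np.mean(pred != g)

rng = np.random.default_rng(1)
mu1 = 1.5
train = rng.normal((2 * rng.integers(0, 2, 1000) - 1) * mu1, 1.0)
er = error_rate(two_means_centers(train), mu1)
# For this symmetric normal mixture the Bayes error is Phi(-mu1) ~ 0.0668;
# the 2-means rule estimated on a clean sample comes close to it.
```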
Optimality in classification

A classification rule is optimal if the corresponding error rate is minimal;
The optimal classification rule is the Bayes rule:
$$x \in C_1(F) \iff \pi_1(F)\, f_1(x) > \pi_2(F)\, f_2(x)$$
(Anderson, 1958).

Proposition (Ruwet and Haesbroeck, 2011). The 2-means procedure is optimal under the model
$$F_O = 0.5\, F_{-\mu,\sigma^2} + 0.5\, F_{\mu,\sigma^2} \quad \text{with } \mu = \mu_1 e_1 \text{ and } \mu_1 > 0.$$
Under the Conjecture, the generalized 2-means procedures are also optimal under $F_O$.
Contaminated distribution

A contaminated distribution is defined by
$$F_\varepsilon = (1 - \varepsilon)\, F + \varepsilon\, G,$$
where $G$ is an arbitrary distribution function.
To see the influence of one singular point $x$, take $G = \Delta_x$, leading to the contaminated distribution $F_{\varepsilon,x} = (1 - \varepsilon)\, F + \varepsilon\, \Delta_x$.
Contaminated mixture

[Figure: density of $F_{\varepsilon,x}$ with $\varepsilon = 0.05$: the two mixture components reweighted to $(1-0.05) \cdot 0.6$ and $(1-0.05) \cdot 0.4$, plus a point mass of $0.05$ at $x$.]
Error rate under contamination

Contaminated training sample drawn from $F_\varepsilon$: estimation of the rule;
Test sample drawn from $F_m$: evaluation of the rule;
$$\mathrm{ER}(F_\varepsilon, F_m) = \sum_{j=1}^{2} \pi_j(F_m)\, \mathrm{P}_{F_m}[R_{F_\varepsilon}(X) \ne j \mid G_j].$$
Definition and properties of the first order influence function

Hampel et al. (1986): for any statistical functional $T$ and any distribution $F$,
$$\mathrm{IF}(x; T, F) = \lim_{\varepsilon \to 0} \frac{T((1-\varepsilon)F + \varepsilon \Delta_x) - T(F)}{\varepsilon} = \left. \frac{\partial}{\partial \varepsilon}\, T((1-\varepsilon)F + \varepsilon \Delta_x) \right|_{\varepsilon=0}$$
(under conditions of existence);
$$\mathrm{E}_F[\mathrm{IF}(X; T, F)] = 0;$$
First order Taylor expansion of $T$ at $F$:
$$T(F_{\varepsilon,x}) \approx T(F) + \varepsilon\, \mathrm{IF}(x; T, F).$$
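The limit in this definition can be illustrated numerically with the simplest functional, the mean, whose influence function $\mathrm{IF}(x; T, F) = x - \mathrm{E}_F[X]$ is classical. In the sketch below, $F$ is approximated by a large sample, so the contaminated functional $T((1-\varepsilon)F + \varepsilon\Delta_x)$ is just a weighted mean; all numerical choices here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(2.0, 1.0, 100_000)        # stand-in for F (mean ~ 2)

def T_eps(x, eps):
    # Mean of the contaminated distribution (1 - eps) F + eps * Delta_x
    return (1.0 - eps) * sample.mean() + eps * x

x, eps = 7.0, 1e-6
# Finite-difference version of the definition of IF(x; T, F):
if_numeric = (T_eps(x, eps) - T_eps(x, 0.0)) / eps
if_exact = x - sample.mean()                  # known IF of the mean
```

Since the mean is linear in the distribution, the finite difference reproduces the influence function here without any limit error; for nonlinear functionals the quotient only converges as $\varepsilon \to 0$.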
First order influence function of the error rate

Now the training sample is distributed as $F_{\varepsilon,x}$. If the model distribution is $F_O$,
$$\mathrm{ER}(F_{\varepsilon,x}, F_O) \approx \mathrm{ER}(F_O, F_O) + \varepsilon\, \mathrm{IF}(x; \mathrm{ER}, F_O)$$
while, by optimality,
$$\mathrm{ER}(F_{\varepsilon,x}, F_O) \ge \mathrm{ER}(F_O, F_O).$$
Combining this with $\mathrm{E}_{F_O}[\mathrm{IF}(X; \mathrm{ER}, F_O)] = 0$ yields $\mathrm{IF}(x; \mathrm{ER}, F_O) \equiv 0$.
Definition of the second order influence function

For any statistical functional $T$ and any distribution $F$,
$$\mathrm{IF2}(x; T, F) = \left. \frac{\partial^2}{\partial \varepsilon^2}\, T((1-\varepsilon)F + \varepsilon \Delta_x) \right|_{\varepsilon=0}$$
(under conditions of existence).
Second order Taylor expansion of $\mathrm{ER}$ at $F_O$, where the first order term vanishes:
$$\mathrm{ER}(F_{\varepsilon,x}, F_O) \approx \mathrm{ER}(F_O, F_O) + \frac{\varepsilon^2}{2}\, \mathrm{IF2}(x; \mathrm{ER}, F_O)$$
for $\varepsilon$ small enough.
First order influence function of the error rate under $F_M$

Proposition (Ruwet and Haesbroeck, 2011). Under $F_M$, the first order influence function of the error rate of the generalized 2-means classification procedure is given by
$$\mathrm{IF}(x; \mathrm{ER}, F_M) = \frac{\pi_2 - \pi_1}{2}\, f_{\mu,\sigma^2}(0) \left[ \mathrm{IF}(x; T_1, F_M) + \mathrm{IF}(x; T_2, F_M) \right]^t e_1$$
for all $x$ such that $A(F_M)^t x + b(F_M) \ne 0$.

This influence function is bounded as soon as the influence functions of the generalized 2-means centers (see next slide) are bounded.
The influence function is also available for any model distribution $F$.
First order influence function of the generalized 2-means centers

Proposition (García-Escudero and Gordaliza, 1999). The influence function of the generalized 2-means centers $T_1$ and $T_2$ is given by
$$\begin{pmatrix} \mathrm{IF}(x; T_1, F_m) \\ \mathrm{IF}(x; T_2, F_m) \end{pmatrix} = M^{-1} \begin{pmatrix} \omega_1(x) \\ \omega_2(x) \end{pmatrix}$$
where
$$\omega_i(x) = -\left. \mathrm{grad}_y\, \Omega(\|y\|) \right|_{y = x - T_i(F_m)} \mathrm{I}(x \in C_i(F_m))$$
and where the matrix $M$ depends only on the distribution $F_m$.

This influence function is bounded as soon as $M^{-1}$ exists and the gradient of $\Omega$ is bounded.
Univariate second order influence function of the error rate under $F_O$

Proposition (Ruwet and Haesbroeck, 2011). Under $F_O$, the univariate second order influence function of the error rate of the generalized 2-means classification procedure is given by
$$\mathrm{IF2}(x; \mathrm{ER}, F_O) = -\frac{1}{4}\, f'_{-\mu,\sigma^2}(0) \left[ \mathrm{IF}(x; T_1, F_O) + \mathrm{IF}(x; T_2, F_O) \right]^2$$
for all $x$ such that $A(F_O)\, x + b(F_O) \ne 0$.
The influence function is also available for multivariate distributions.
Graph of $\mathrm{IF}(x; \mathrm{ER}, F_M)$

[Figure: $\mathrm{IF}(x; \mathrm{ER}, F_M)$ for $x \in [-4, 4]$, comparing 2-means and 2-medoids.]
Graph of $\mathrm{IF}(x; \mathrm{ER}, F_M)$

[Figure: $\mathrm{IF}(x; \mathrm{ER}, F_M)$ for the 2-means with $\pi_1 = 0.2, 0.4, 0.6, 0.8$.]
Graph of $\mathrm{IF}(x; \mathrm{ER}, F_M)$

[Figure: $\mathrm{IF}(x; \mathrm{ER}, F_M)$ for the 2-medoids with $\Delta = 1, 3, 5$.]
Graph of $\mathrm{IF2}(x; \mathrm{ER}, F_O)$

[Figure: $\mathrm{IF2}(x; \mathrm{ER}, F_O)$ for $x \in [-4, 4]$, comparing 2-means and 2-medoids.]
Asymptotic loss

Under optimality ($F_O$), a measure of the expected increase in error rate when estimating the optimal clustering rule from a finite sample with empirical cdf $F_n$ is
$$\text{A-Loss} = \lim_{n \to \infty} n\, \mathrm{E}_{F_O}[\mathrm{ER}(F_n, F_O) - \mathrm{ER}(F_O, F_O)].$$
As in Croux et al. (2008):
Proposition. Under some regularity conditions on the cluster-center estimators,
$$\text{A-Loss} = \frac{1}{2}\, \mathrm{E}_{F_O}[\mathrm{IF2}(X; \mathrm{ER}, F_O)].$$
Asymptotic loss

Proposition (Ruwet and Haesbroeck, 2011). Under an optimal mixture of normal distributions $F_N$ with $\mu = (\Delta/2)\, e_1$, the asymptotic loss of the generalized 2-means procedure is given by
$$\text{A-Loss} = \frac{\Delta}{16 \sigma^3 \tau^2}\, f_{0,1}\!\left( \frac{\Delta}{2\sigma} \right) \Big\{ \tau^2 \left[ \mathrm{ASV}(T_{21}) + \mathrm{ASV}(T_{11}) + 2\, \mathrm{ASC}(T_{11}, T_{21}) \right] + \sigma^2 \left[ \mathrm{ASV}(T_{12}) + \mathrm{ASV}(T_{22}) - 2\, \mathrm{ASC}(T_{12}, T_{22}) \right] \Big\}$$
where ASV and ASC stand for the asymptotic variance and covariance of their arguments (at the model distribution).
Graph of the asymptotic loss

[Figure: A-Loss as a function of $\Delta \in [0, 8]$ for 2-means and 2-medoids.]
Asymptotic relative classification efficiencies

A measure of the price one pays in error rate, in exchange for protection against outliers, when using a robust procedure instead of the classical one is
$$\mathrm{ARCE}(\text{Robust}, \text{Classical}) = \frac{\text{A-Loss}(\text{Classical})}{\text{A-Loss}(\text{Robust})}.$$
More generally, the ARCE of a method (Method 1) w.r.t. another one (Method 2) is given by
$$\mathrm{ARCE}(\text{Method 1}, \text{Method 2}) = \frac{\text{A-Loss}(\text{Method 2})}{\text{A-Loss}(\text{Method 1})}.$$
ARCE of 2-medoids w.r.t. 2-means

[Figure: ARCE(2-medoids, 2-means) as a function of $\Delta \in [0, 8]$, ranging roughly between 0.60 and 0.70.]
ARCE of classification procedures w.r.t. 2-means

[Figure: ARCE of the Fisher and logistic classification rules w.r.t. 2-means, as a function of $\Delta$.]
Intuitive definition

The breakdown point (BDP) is the minimal fraction of outliers that needs to be added (addition BDP) or replaced (replacement BDP) in order to destroy the estimator completely, i.e. to drive the estimate
to infinity (Hampel, 1971);
to the bounds of the support of the estimator (He and Simpson, 1992);
to a finite set, while it could lie in an infinite set without contamination (Genton and Lucas, 2003).
BDP of the error rate of the generalized 2-means

Let an observation $x$ of the training sample ($F_{\varepsilon,x}$) tend to infinity:
⇒ it becomes the center of a cluster containing this observation only, even if $\Omega(x) = x$ (García-Escudero and Gordaliza, 1999);
⇒ one entire group of the test sample ($F$) is misclassified while the other is well classified;
⇒ $\mathrm{ER}(F_\varepsilon, F) = \pi_1$ or $\mathrm{ER}(F_\varepsilon, F) = \pi_2$ for any sample;
⇒ ER has broken down in the sense of Genton and Lucas (2003);
⇒ the BDP of the ER is $1/n$, which tends to zero as $n$ tends to infinity.
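This mechanism can be reproduced with ordinary 2-means ($\Omega(x) = x^2$): a single observation sent far away captures one of the two centers, so every genuine observation ends up in the other cluster. A minimal simulation (the initialization and sample sizes are choices of this sketch):

```python
import numpy as np

def two_means(X, n_iter=100):
    # Lloyd's algorithm for k = 2 (penalty r^2), deterministically
    # initialized at the first point and the point farthest from it.
    i1 = np.linalg.norm(X - X[0], axis=1).argmax()
    centers = X[[0, i1]].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        labels = np.linalg.norm(
            X[:, None] - centers[None], axis=2
        ).argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) for j in range(2)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels

rng = np.random.default_rng(0)
clean = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
X = np.vstack([clean, [[1e6, 1e6]]])          # one extreme outlier
centers, labels = two_means(X)
# The outlier captures one center entirely: all 200 genuine observations
# land in the other cluster, so one whole group would be misclassified.
```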
The trimming approach

Idea: delete extreme observations!
Problem: how can we detect extreme observations?
Solution: impartial trimming.

$k$, the fixed number of clusters;
$\alpha \in [0, 1[$, the trimming size;
$X_n = \{x_1, \dots, x_n\} \subset \mathbb{R}^p$, a dataset that is not concentrated on $k$ points after removing a mass equal to $\alpha$;
Optimization over partitions $R = \{R_1, \dots, R_k\}$ of $\{1, \dots, n\}$ with $\lceil n(1-\alpha) \rceil$ observations.
The generalized trimmed k-means method (Cuesta-Albertos et al., 1997)

The cluster centers $(T_n^1, \dots, T_n^k)$ are solutions of the double minimization problem
$$\min_{R}\; \min_{\{t_1,\dots,t_k\} \subset \mathbb{R}^p} \sum_{x_i \in R} \Omega\!\left( \inf_{1 \le j \le k} \|x_i - t_j\| \right);$$
The classification rule:
$$x \in C_n^j \iff \|x - T_n^j\| = \min_{1 \le i \le k} \|x - T_n^i\|.$$
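A Lloyd-style sketch of impartial trimming for $k = 2$ with $\Omega(x) = x^2$ (the initialization and the restriction to the squared-distance penalty are simplifications of this illustration): at each iteration the $\lceil n\alpha \rceil$ observations farthest from their nearest center are discarded before the centers are re-estimated.

```python
import numpy as np

def trimmed_2means(X, alpha, n_iter=100):
    """Impartially trimmed 2-means sketch: repeat (assign, trim the
    ceil(n*alpha) farthest observations, update centers) to convergence."""
    n = len(X)
    keep = n - int(np.ceil(n * alpha))
    # Crude but outlier-resistant start: the points at the lower and
    # upper quartile positions of the first coordinate.
    order = np.argsort(X[:, 0])
    centers = X[order[[n // 4, 3 * n // 4]]].astype(float)
    trimmed = np.zeros(n, dtype=bool)
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        nearest = d.min(axis=1)
        # Keep the `keep` observations closest to their nearest center.
        cutoff = np.partition(nearest, keep - 1)[keep - 1]
        trimmed = nearest > cutoff
        new = np.array([
            X[(labels == j) & ~trimmed].mean(axis=0)
            if ((labels == j) & ~trimmed).any() else centers[j]
            for j in range(2)
        ])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels, trimmed
```

On data with a small fraction of far-away contamination and $\alpha$ at least that fraction, the trimming step removes the outliers, so the centers stay near the genuine groups.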
Properties of the generalized trimmed 2-means

García-Escudero and Gordaliza (1999):
bounded IF whatever $\Omega$;
better breakdown behavior.

Ruwet and Haesbroeck (unpublished result): if the Conjecture also holds for the generalized trimmed 2-means, this procedure is optimal under the model $F_O$.
Problem of all "k-means type" procedures

[Figure: a simulated dataset and its "k-means type" estimation, which groups the data incorrectly because spherical clusters of equal weight are implicitly assumed.]
The TCLUST procedure (García-Escudero et al., 2008)

Optimization also over the scatter matrices $S_n^j$ and the weights $p_n^j$ such that $\sum_{j=1}^{k} p_j = 1$;
Maximization of
$$\sum_{j=1}^{k} \sum_{i \in R_j} \log\left[ p_j\, \varphi(x_i; T_j, S_j) \right]$$
where $\varphi$ is the pdf of the Gaussian distribution;
Eigenvalue-ratio restriction:
$$\frac{M_n}{m_n} = \frac{\max_{j=1,\dots,k}\, \max_{l=1,\dots,p}\, \lambda_l(S_j)}{\min_{j=1,\dots,k}\, \min_{l=1,\dots,p}\, \lambda_l(S_j)} \le c$$
for a constant $c \ge 1$, where the $\lambda_l(S_j)$ are the eigenvalues of $S_j$, $l = 1, \dots, p$ and $j = 1, \dots, k$.
Improvement with the TCLUST procedure

[Figure: the simulated dataset alongside its estimation by TCLUST, which recovers the groups correctly.]