• Aucun résultat trouvé

The breakdown bahavior of the TCLUST procedure

N/A
N/A
Protected

Academic year: 2021

Partager "The breakdown bahavior of the TCLUST procedure"

Copied!
52
0
0

Texte intégral

(1)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

The breakdown behavior of the TCLUST

procedure

Joint work with L.A. García-Escudero, A. Gordaliza and A. Mayo-Iscar from the University of Valladolid (Spain)

Ch. Ruwet

University of Liège

(2)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Introduction - Simulated dataset

−20 −10 0 10 20 30

−20

−10

0

(3)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Introduction - Estimation by trimmed k -means

−20 −10 0 10 20 30

−20

−10

0

(4)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Outline

Definition of the TCLUST procedure Choice of the different parameters A real example

Breakdown behavior Conclusions

(5)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Outline

Definition of the TCLUST procedure

Choice of the different parameters A real example

Breakdown behavior Conclusions

(6)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Notations

k the fixed number of clusters;

α ∈ [0,1[the trimming size;

(PR) Xn= {x1, . . . ,xn} ∈ Rp a dataset that is not

concentrated on k points after removing a mass equal toα;

R0,R1, . . . ,Rk a partition of{1, . . . ,n}with|R0| = ⌊nα⌋; ϕ (·; µ, Σ)the probability density function (pdf) of the

p-variate normal distribution with meanµand

(7)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Clustering procedures based on trimming (1)

The trimmed k -means:k centers T1, . . . ,Tk that

minimize k X j=1 X iRj kxiTjk2 (Cuesta-Albertos et al., 1997)

The trimmed determinant criterion: k centers T1, . . . ,Tk

and a p×p scatter matrix S that maximize

k X j=1 X iRj logϕ xi;Tj,S 

(8)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Clustering procedures based on trimming (2)

Heterogeneous clustering: k centers T1, . . . ,Tk and k

p×p scatter matrices S1, . . . ,Sk that maximize

k X j=1 X iRj logϕ xi;Tj,Sj 

under the constraint det(S1) = . . . =det(Sk)

(9)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

TCLUST procedure E

c,kGarcía-Escudero et al., 2008

k centers T1, . . . ,Tk, k p×p scatter matrices

S1, . . . ,Sk and k weights pj ∈ [0,1],j =1, . . . ,k with

Pk j=1pj =1 that maximize k X j=1 X iRj log pjϕ xi;Tj,Sj 

Eigenvalues-ratio restriction (ER):

Mn

mn

= maxj=1,...,kmaxl=1,...,pλl(Sj) minj=1,...,kminl=1,...,pλl(Sj)

c

for a constant c ≥1 and whereλl(Sj)are the

eigenvalues of Sj, l =1, . . . ,p and j =1, . . . ,k . Θc = n θ ={pj}kj=1, {Tj}kj=1, {Sj}kj=1  :(ER) is OK o

(10)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

An

R

package

> library(tclust) > tclust(data, k = 3 , alpha = 0.05, restr = "eigen", restr.fact = 12, equal.weights = FALSE)

restris the type of restriction to be applied: "eigen"

(default), "deter" and "sigma"

restr.factis the constant c that constrains the

allowed differences among group scatters

equal.weightsleads to a model without estimation

(11)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Example - Simulated dataset

−20 −10 0 10 20 30

−20

−10

0

(12)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Example - Trimmed k -means

restr = "eigen", restr.fact = 1, equal.weights = TRUE −20 −10 0 10 20 30 −20 −10 0 10

(13)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Example - Trimmed determinant criterion

restr = "sigma", restr.fact = 1, equal.weights = TRUE −20 −10 0 10 20 30 −20 −10 0 10

(14)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Example - Heterogeneous clustering

restr = "deter", restr.fact = 1, equal.weights = TRUE −20 −10 0 10 20 30 −20 −10 0 10

(15)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Example - TCLUST

restr = "eigen", restr.fact = 50, equal.weights = FALSE −20 −10 0 10 20 30 −20 −10 0 10

(16)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Outline

Definition of the TCLUST procedure

Choice of the different parameters

A real example Breakdown behavior Conclusions

(17)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Choice of c

The choice of c should depend on prior knowledge of type of clusters we are searching for;

Large values of c lead to rather unrestricted solutions; Small values of c yield similarly structured clusters; This constant can be viewed as a "robustness" constant.

(18)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Classification trimmed likelihood function

García-Escudero et al., 201x For fixed c≥1, For k ≥1 andα ∈ [0,1[, Lc(α,k) := max {Rj}kj=0,θ∈Θc k X j=1 X iRj log pjϕ xi;Tj,Sj  ∆c(α,k) = Lc(α,k +1) − Lc(α,k) ≥0 is the "gain" achieved by increasing the number of clusters from k to k +1

kshould be the smallest value of k such that ∆c(α,k) ≈0, except for small values ofα α∗should be the smallest value ofαsuch that ∆c(α,k∗) ≈0 for allα ≥ α∗

(19)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Classification trimmed likelihood function

García-Escudero et al., 201x For fixed c≥1, For k ≥1 andα ∈ [0,1[, Lc(α,k) := max {Rj}kj=0,θ∈Θc k X j=1 X iRj log pjϕ xi;Tj,Sj  ∆c(α,k) = Lc(α,k +1) − Lc(α,k) ≥0 is the "gain"

achieved by increasing the number of clusters from k to

k +1

kshould be the smallest value of k such that ∆c(α,k) ≈0, except for small values ofα α∗should be the smallest value ofαsuch that ∆c(α,k∗) ≈0 for allα ≥ α∗

(20)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Classification trimmed likelihood function

García-Escudero et al., 201x For fixed c≥1, For k ≥1 andα ∈ [0,1[, Lc(α,k) := max {Rj}kj=0,θ∈Θc k X j=1 X iRj log pjϕ xi;Tj,Sj  ∆c(α,k) = Lc(α,k +1) − Lc(α,k) ≥0 is the "gain"

achieved by increasing the number of clusters from k to

k +1

kshould be the smallest value of k such thatc(α,k) ≈0, except for small values ofα

α∗should be the smallest value ofαsuch that ∆c(α,k∗) ≈0 for allα ≥ α∗

(21)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Classification trimmed likelihood function

García-Escudero et al., 201x For fixed c≥1, For k ≥1 andα ∈ [0,1[, Lc(α,k) := max {Rj}kj=0,θ∈Θc k X j=1 X iRj log pjϕ xi;Tj,Sj  ∆c(α,k) = Lc(α,k +1) − Lc(α,k) ≥0 is the "gain"

achieved by increasing the number of clusters from k to

k +1

kshould be the smallest value of k such thatc(α,k) ≈0, except for small values ofα

α∗should be the smallest value ofαsuch that ∆c(α,k∗) ≈0 for allα ≥ α∗

(22)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

The

R

package

R > ctl <- ctlcurves (data, k = 1:4, alpha = seq (0, 0.2, length = 6),restr.fact = 50) R > plot(ctl)

(23)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Example - Simulated dataset

0.00 0.05 0.10 0.15 −14000 −13000 −12000 −11000 CTL−Curves Objectiv e Function V alue α Restriction Factor = 50 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3 3 4 4 4 4 4 4 4 5 5 5 5 5 5 5

(24)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Outline

Definition of the TCLUST procedure Choice of the different parameters

A real example

Breakdown behavior Conclusions

(25)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Swiss bank notes data (1)

Flury and Riedwyl, 1988

6 variables (measurements on the bank notes)

200 observations divided in 2 groups: 100 genuine and 100 forged old Swiss 1000-franc bank notes

(26)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Swiss bank notes data (2)

R > plot(ctlcurves(Swiss, k = 1:4 , alpha = seq(0,0.3,by=0.025)) 0.00 0.05 0.10 0.15 0.20 0.25 0.30 −900 −800 −700 −600 −500 −400 −300 CTL−Curves Objectiv e Function V alue α Restriction Factor = 50 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4

(27)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Swiss bank notes data (3)

R > plot(tclust(Swiss, k = 2 , alpha = 0.15, restr.fact = 50)

Classification

First discriminant coord.

Second discr

iminant coord.

(28)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Outline

Definition of the TCLUST procedure Choice of the different parameters A real example

Breakdown behavior

(29)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Intuition about Breakdown

Breakdown point (BDP): the fraction of outliers needed to bring the estimator to its bounds

Replacement BDP (RBDP): observations are replaced by outliers

Addition BDP (ABDP): outliers are added

Explosion of the centers ˆ

pj =0 (sign of a badly chosen k )

Implosion or explosion of the scatter matrices

Some of them : impossible due to (ER)

All of them : impossible due to existence under (PR) (García-Escudero et al., 2008)

(30)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Replacement breakdown point

Proposition

The replacement breakdown point of the TCLUST procedure satisfies the optimistic relation

RBDP≤min ⌊nα⌋ +1 n ,j=min1,...,k |Cj| n  . Data dependent

Same upper bound as the trimmed k -means (García-Escudero and Gordaliza, 1999) even if we expect a smaller RBDP for the TCLUST (estimation of weights and scatters)

(31)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Ideal model of "well-clustered" data sets

Hennig, 2004

n1< . . . <nk and Ajm= {x(nj−1+1),m, . . . ,xnj,m},

j =1, . . . ,k ;

Xm =Skj=1Ajm is said to be"well k -clustered"if∃b< ∞

s.t.,∀m∈ N, (1) max 1≤jkxi max ,m,xl,mA j m kxi,mxl,mk <b (2) lim m→∞x min i,mA h m,xl,mA j m,j6=h kxi,mxl,mk = ∞; Addition of r outliers y1,m, . . . ,yr,m: (3) lim m→∞minkyi,mxl,mk = ∞ (4) lim m→∞mini6=l kyi,myl,mk = ∞.

(32)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Addition breakdown point

Proposition

Let Xm,m∈ N,be an ideal sequence of data sets inRpthat

are "well k -clustered" in clusters A1m, . . . ,Akm verifying

conditions(1)and(2). The addition of r ≤ ⌊nα⌋outliers

verifying conditions(3)and(4)does not break down the

TCLUST procedure with trimming sizeα:

ABDP≥ ⌊nα⌋

n+ ⌊nα⌋.

Better than fitting mixtures of t distributions or adding a noise component in normal mixtures (Hennig, 2004).

(33)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Dissolution point

Hennig, 2008

Ec,k,α(Xn)the TCLUST clustering of Xn;

Ec,k(Xn+g)the clustering of Xninduced by

Ec,k,α(Xn+g); P a partition of Xn; For C ∈ P1and D∈ P2,γ(C,D) = |CD| |CD|; A cluster C ∈ P1is dissolved inP2if max D∈P2 γ(C,D) ≤ 1 2.

(34)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Example (1)

3 4 5 6 7 3 4 5 6

(35)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Example (2)

3 4 5 6 7 3 4 5 6 C

(36)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Example (3)

P 3 4 5 6 7 3 4 5 6

(37)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Example (4)

|CD′| =6 and |CD′| =10+8 3 4 5 6 7 3 4 5 6 D’

(38)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Example (5)

|CD′′| =4 and |CD′′| =10+13 3 4 5 6 7 3 4 5 6 D’’

(39)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Example (6)

|CD′′′| =0 and |CD′′′| =10+8 3 4 5 6 7 3 4 5 6 D’’’

(40)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Example (7)

max D∈P γ(C,D) =1/3<1/2 → C is dissolved inP 3 4 5 6 7 3 4 5 6

(41)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Dissolution point

Hennig, 2008

For CEc,k,α(Xn), thedissolution point of C is given by

∆(Ec,k,α,Xn,C) =min g ( g |C| +g : ∃xn+1, . . . ,xn+g : max DEc,k,α(Xn+g) γ(C,D) ≤1/2 ) .

(42)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Intuition about the dissolution point theorem

g ≤ ⌊nα⌋

Xna dataset for which there is no high concentration in

Xn+gwhatever the g added outliers

CEc,k,α(Xn)with|C| >g

If there are g points among the trimmed observations that are fitted well enough by the TCLUST clustering, then the cluster C can not be dissolved by the addition of g outliers.

(43)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Isolation robustness

Hennig, 2008

A clustering procedure is said to beisolation robustif for any dataset Xnand for any "well-isolated" cluster C of the

partition,

C is be stable under the addition of points, i.e. for all g,

any cluster of the partition of Xn+g should not join

observations of C and Xn\C

and

there is at least one cluster in the new partition containing some observations of C.

(44)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Example

−10 −5 0 5 10 15 −10 −5 0 5 10 15 X1 X2

(45)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Choice of k and

α

0.00 0.05 0.10 0.15 0.20 −7000 −6000 −5000 −4000 −3000 CTL−Curves Objectiv e Function V alue α Restriction Factor = 12 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4

(46)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

k

=

3,

α =

0

.

1, c

=

12

−10 −5 0 5 10 15 −10 −5 0 5 10 15 X1 X2

(47)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

The TCLUST E

c,k

is not isolation robust

0 20 40 60 80 100 −100 −80 −60 −40 −20 0 X1 X2

(48)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

The "2-steps" procedure E

c

is isolation robust !

0.00 0.05 0.10 0.15 0.20 −10000 −8000 −6000 −4000 CTL−Curves Objectiv e Function V alue α Restriction Factor = 12 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 0 20 40 60 80 100 −100 −80 −60 −40 −20 0 X1 X2

(49)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Outline

Definition of the TCLUST procedure Choice of the different parameters A real example

Breakdown behavior

(50)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

Conclusions

A flexible clustering procedure; A completeRpackage;

A graphical tool to chose the parameters;

Good breakdown behavior under the ideal model of "well-clustered" dataset;

Isolation robustness of the "2-steps" procedure. Moreover, the influence functions (not presented here) are bounded.

(51)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

References (1)

Cuesta-Albertos J.A., Gordaliza A. and Matrán C. (1997) Trimmed k-means: an attempt to robustify quantizers. Ann. Statist., 25(2):553-576

Gallegos M.T. (2002) Maximum likelihood clustering with outliers. Classification, clustering, and data analysis (Cracow, 2002), Stud. Classification Data Anal. Knowledge Organ., pages 247-255

Gallegos M.T. and Ritter G. (2005) A robust method for cluster analysis. Ann. Statist., 33(1):347-380

García-Escudero L.A. and Gordaliza A. (1999) Robustness properties of k -means and trimmed k -means. J. Amer. Statist. Assoc., 94(447):956-969

García-Escudero L.A., Gordaliza A., Matrán C. and

Mayo-Iscar A. (2008) A general trimming approach to robust cluster analysis. Ann. Statist., 36(3):1324-1345

(52)

The breakdown behavior of the TCLUST procedure Ch. Ruwet Definition Parameters A real example Breakdown Conclusions

References (2)

García-Escudero L.A., Gordaliza A., Matrán C. and Mayo-Iscar A. (201x) Exploring the number of groups in robust model-based clustering. Stat Comput

Flury B. and Riedwyl H. (1988) Multivariate Statistics. A practical approach. Chapman and Hall, London

Hennig C. (2004) Breakdown points for maximum likelihood estimators of location-scale mixtures. Ann. Statist.,

32(4):1313-1340

Hennig C. (2008) Dissolution point and isolation robustness: robustness criteria for general cluster analysis methods. J. Multivariate Anal., 99(6):1154-1176

RDevelopment Core Team (2010). R: A language and

environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.

Références

Documents relatifs

Coercive investigative measures with or without prior judicial authorisation may be appealed in accordance with Rules 31 (general rules for coercive measures without prior

This coarse registration process is carried out in three steps: (1) Creating preliminary line matching A ini between Data and Model sets; (2) Applying the

For this we consider the set of trimmed distributions obtained from the initial distribution of the data and look for an automatic rule that reduces the classification error of

This is done using a simple description of the structure through.. the establishment of two tables giving respectively the values of the geometrical and elastic parameters related

of Educational Psychology 6-138f Education II Building University of Alberta Edmonton, Alberta T6G 2G5 Associate Editor/ Adjoint au Rédacteur Lloyd W. of Educational

Using the programming environment of our choice, we can process the information in this file to produce a transformed image in two steps: compute the locations of the distorted

Using this information, we develop a procedure to estimate the drift rate of thermistors deployed in the field, where no reference temperature measurements

More precisely, at each node of the search tree, we solve a robust Basic Cyclic Scheduling Problem (BCSP), which corresponds to the CJSP without resource constraints, using a