Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Detection of influential observations on
the error rate based on the generalized
k-means clustering procedure
Joint work with G. HaesbroeckCh. Ruwet
Department of Mathematics - University of Liège
14 October 2009
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Outline
1 Introduction 2 Error rate3 Influence function of the error rate 4 Diagnostic plot
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Outline
1 Introduction 2 Error rate3 Influence function of the error rate 4 Diagnostic plot
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Statistical clustering
LetX ∼ F arises from G1 and G2 with πi(F ) = IPF[X ∈ Gi]
then
F is a mixture of two distributions
F = π1(F )F1+ π2(F )F2
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
The generalized 2-means clustering method
Aim of clustering : Find estimations C1(F ) and C2(F )
(called clusters) of the two underlying groups. The clusters’ centers (T1(F ), T2(F )) are solutions of
min {t1,t2}⊂Rp Z Ω inf 1≤j≤2kx − tjk dF(x)
for a suitable nondecreasing penalty function Ω : R+→ R+.
Classical penalty functions :
Ω(x) = x2 → 2-means method Ω(x) = x → 2-medoids method
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Classification rule
The classification rule is
RF(x) = Cj(F ) ⇔ Ω(kx −Tj(F )k) = min
1≤i≤2Ω(kx −Ti(F )k).
The clusters are half spaces delimited by the hyperplane :
C =nx ∈ Rp: A(F )Tx + b(F ) = 0o with A(F ) = T1(F ) − T2(F ) b(F ) = −1 2 kT1(F )k 2− kT 2(F )k2 .
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Example: Exams results in 1BM (n = 40)
0 5 10 15 20 0 5 10 15 20
Exams results in 1BM
Analysis EnglishDetection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Example: Exams results in 1BM (n = 40)
0 5 10 15 20 0 5 10 15 20
Exams results in 1BM
Analysis English 2−means 2−medoidsDetection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Outline
1 Introduction 2 Error rate3 Influence function of the error rate 4 Diagnostic plot
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Example: Exams results in 1BM (n = 40)
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 5 10 15 20 0 5 10 15 20
Exams results in 1BM
Analysis English 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Example: Exams results in 1BM (n = 40)
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 10 15 20 0 5 10 15 20
Exams results in 1BM
Analysis English 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2−means 2−medoidsDetection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Example: Exams results in 1BM (n = 40)
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 10 15 20 0 5 10 15 20
Exams results in 1BM
Analysis English 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2−means 2−medoids 0 0 0 0 ==> ER=0.1Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Error rate (ER)
Training sample according to F : estimation of the rule Test sample according to Fm : evaluation of the rule
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Error rate (ER)
Training sample according to F : estimation of the rule Test sample according to Fm : evaluation of the rule
In ideal circumstances : F = Fm ER(F , Fm) = 2 X j=1 πj(Fm)IPFm[RF(X ) 6= Cj(F )| Gj]
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Outline
1 Introduction 2 Error rate3 Influence function of the error rate 4 Diagnostic plot
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Contaminated mixture : F
ε,x= (1 − ε)F + ε∆
x −4 −2 0 2 4 0.00 0.05 0.10 0.15 0.20 x 0.6 0.4 F −4 −2 0 2 4 0.00 0.05 0.10 0.15 0.20 x (1−0.05) 0.6 (1−0.05) 0.4 0.05 F0.05,3Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Definition and properties of the IF
Hampel et al (1986) : For any statistical functional T and any distribution F , IF(x; T, F ) = lim ε→0 T((1 − ε)F + ε∆x) − T(F ) ε = ∂ ∂εT((1 − ε)F + ε∆x) ε=0
(under condition of existence);
EF[IF(X ; T, F )] = 0;
First order Taylor expansion of T at F :
T((1 − ε)F + ε∆x) ≈ T(F ) + εIF(x; T, F )
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Definition and properties of the IF
Hampel et al (1986) : For any statistical functional T and any distribution F , IF(x; T, F ) = lim ε→0 T((1 − ε)F + ε∆x) − T(F ) ε = ∂ ∂εT((1 − ε)F + ε∆x) ε=0
(under condition of existence);
EF[IF(X ; T, F )] = 0;
First order Taylor expansion of T at F :
T((1 − ε)F + ε∆x) ≈ T(F ) + εIF(x; T, F )
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Influence function of the error rate (1)
Now, the training sample is distributed as
Fε,x = (1 − ε)F + ε∆x which is a contaminated mixture.
ER(Fε,x, Fm) = 2 X j=1 πj(Fm)IPFmRFε,x(X ) 6= Cj(Fε,x)| Gj = π1(Fm)IPFm,1[X T A(Fε,x) + b(Fε,x) < 0] + π2(Fm)IPFm,2[X TA(F ε,x) + b(Fε,x) > 0]
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Influence function of the error rate (1)
Now, the training sample is distributed as
Fε,x = (1 − ε)F + ε∆x which is a contaminated mixture.
ER(Fε,x, Fm) = 2 X j=1 πj(Fm)IPFmRFε,x(X ) 6= Cj(Fε,x)| Gj = π1(Fm)IPFm,1[X T A(Fε,x) + b(Fε,x) < 0] + π2(Fm)IPFm,2[X TA(F ε,x) + b(Fε,x) > 0]
Taylor expansion : for ε small enough,
ER(Fε,x, Fm) ≈ ER(Fm, Fm) + εIF(x; ER, Fm)
IF(x; ER, Fm) ≥ 0 ⇔ ER(Fε,x, Fm) ≥ ER(Fm, Fm)
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Influence function of the error rate (2)
Decomposition y = (y1, y2T)T with y1 ∈ R;
Notations Ti(Fm) = τi, A(Fm) = (α1, αT2)T
and b(Fm) = β;
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Influence function of the error rate (3)
Proposition
Under these hypotheses and with these notations, the IF of the ER of the generalized2-means classification procedure is given by IF(x; ER, Fm) = Z IF(x; b, F m) α1 + y2T IF(x; A2, Fm) α1 [π1fm,1(0, y2) − π2fm,2(0, y2)] dy2 with
IF(x; b, Fm) = t[IF(x; T21, Fm) + IF(x; T11, Fm)]
IF(x; A2, Fm) = IF(x; T12, Fm) − IF(x; T22, Fm)
Expressions for IF(x; T1, F ) and IF(x; T2, F ) were computed by
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Influence functions of T
1and T
2IF(x; T1, F ) IF(x; T2, F ) = M−1 ω1(x) ω2(x)
where ωi(x) = − gradyΩ(ky k)
y=x−Ti(F )I(x ∈ Ci(F )) and with
M independent of x.
2-means method ωi(x) = −2(x − Ti(F ))I(x ∈ Ci(F ));
2-medoids method ωi(x) = −
x − Ti(F )
kx − Ti(F )k
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Example of a coefficient of the matrix M in 2-D
For the 2-means procedure, M11=
2(T21(F )−T11(F )) kT2(F )−T1(F )k4 R {v ∈R2:v1<0}[(T21(F )−T11(F ))v1+(T12(F )−T22(F ))v2] [f (cv)+(cv 1−T11(F ))f ′ 1(cv)]dv +R {v ∈R2:v1<0}[(T22(F )−T12(F ))v1+(T21(F )−T11(F ))v2](c v 1−T11(F ))f2′(cv)dv − 1 kT2(F )−T1(F )k2 R {v ∈R2:v1<0}(v1+1/2)[f (c v)+(cv 1−T11(F ))f ′ 1(cv)]dv −R {v ∈R2:v1<0}v2(c1v−T11(F ))f ′ 2(cv)dv − R {v ∈R2:v1<0}f(cv)dv with cv= 1 kT2(F )−T1(F )k2 T21(F )−T11(F ) T12(F )−T22(F ) T22(F )−T12(F ) T21(F )−T11(F ) v1 v2 +T1(F )+T2(F )2 .
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
F
m= 0.4 N
2((−2, 0), I
2) + 0.6 N
2((2, 0), I
2)
x1 x2 IF ERIF of ER of the 2−means method
x1
x2
IF ER
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Some comments on the IF of the ER
The IF of the ER is bounded as soon as the gradient of the penalty function is bounded and the first moment of the model distribution exists;
Outliers have a bigger influence in the smallest group; The closer the two groups are, the bigger the influence of some contamination is.
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
IF(x; ER, F
N) ≡ 0
For some models F , IF(x; ER, F ) ≡ 0; It is the case for
(N) FN = 0.5 Np(−µ, Ip) + 0.5 Np(µ, Ip)
with µ = (µ1, 0, . . . , 0)T;
One needs to go one step further in the Taylor expansion;
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
F
N= 0.5 N
2((−2, 0), I
2) + 0.5 N
2((2, 0), I
2)
x1 x2 IF ERDetection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Outline
1 Introduction 2 Error rate3 Influence function of the error rate 4 Diagnostic plot
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Hair-plot in 1-D (Genton and Ruiz-Gazen, 2009)
Let T be an estimator of interest;
Let S = (x1, . . . , xn) be a n-vector of observations;
Define n functions of ξ ∈ R:
S[i, ξ] = S + ξ ei
where ei is the i-th unit vector ∈ Rn;
The hair-plot is the representation of n curves
R → R : ξ 7→ T (S[i, ξ]) for i = 1, . . . , n.
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Empirical influence function (EIF)
Let T be an estimator of interest;
Let S = (x1, . . . , xn) be a dataset of size n;
The EIF with replacement is defined by
R → R : ξ 7→ T(x1, . . . , xn−1, ξ) − T (S)
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Diagnostic plot in 1-D (1)
Let T be an estimator of interest;
Let S = (x1, . . . , xn) be a n-vector of observations;
Define n functions of ξ ∈ R:
S[i, ξ] = S + ξ ei
where ei is the ieth unit vector in Rn;
The diagnostic plot is the representation of n EIF’s
R → R : ξ 7→ T(S[i, ξ]) − T (S) 1/n
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Diagnostic plot in 1-D (2)
Our estimator of interest is ER.
To compute ER, information about memberships is necessary: It is available ⇒ OK
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Robust estimations ?
The 2-medoids method has a bounded IFbutits BDP is null;
It is the same for all generalized 2-means procedures !!! (García-Escudero and Gordaliza, 1999);
Cuesta-Albertos, Gordaliza and Matrán (1997) introduced the trimmed 2-means method :
min Xα min {t1,...,tk}⊂Rp X xi∈Xα Ω inf 1≤j≤kkxi − tjk
where Xα ranges on the set of the subsets of {x1, . . . , xn}
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Example: Long jumps in Olympic games
(García-Escudero, and Gordaliza, 1999)
Long jump (in meters) Men (n=33)
Women (n=25)
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Example: Long jumps in Olympic games
aaaaaaaa aaaaaaaa a aaaa −10 −5 0 5 10 0 5 10 15 20 25 EIF bbbbbbb bbbbb bbbb bbbbb ccccc cc ccc ccccc c cccc c ddddddd ddddd dddd ddddd eeeeeee eeeeeeeee eeeee f f f f f f f f f f f f f f f f f f f f f gggggg ggggg ggggg ggggg hhhhhh hhhhh hhhhh hhhhh i i i i i i i i i i i i i i i i i i i i i j j j j j j j j j j j j j j j j j j j j j kkkkk k kkkk k kkkk k kkkk k l l l l l l l l l l l l l l l l l l l l l m m m m m m m m m m mm m m m m m m m m m nnnnnn nnnnnnnnn nnnnnn oooooo ooooooooo oooooo pppppp ppppp pppp pppppp qqqqqq qqqqq qqqq qqqqqq r r r r r r r r r r r r r r r r r r r r r sssss s ssss sssss sssss s t t t t t t t t t t t t t t t t t t t t t uuuuuu uuuuuuuuu uuuuuu vvvvv v vvvv vvvvv vvvvv v w w w w w w w w w ww w w w w w w w w w w xxxxx x xxxx xxxxx xxxxx x yyyyy y yyyy yyyyy yyyyy y zzzzz zz zzz z zzzz z zzzz z AAAAAA AAAAA AAAA AAAAAA BBBBBB BBBBB BBBB BBBBBB C C C C C C C C C C C C C C C C C C C C C D D D D D D D D D D D D D D D D D D D D D EEEEEE EEEE EEEEE EEEEEE FFFFFF FFFF FFFFF FFFFFF G G G G G G G G G G G G G G G G G G G G G H H H H H H H H H H H H H H H H H H H H H I I I I I I I I I I I I I I I I I I I I I JJJJJ JJJJJ JJJJJ JJJJJ J KKKKK KKKKK KKKKK KKKKKK LLLLL LLLLL LLLLL LLLLLL M M M M M M M M M M M M M M M M M M M M M N N N N N N N N N N N N N N N N N N N N N O O O O O O O O O O O O O O O O O O O O O PPPPP PPPPP PPPPP PPPPPP Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q R R R R R R R R R R R R R R R R R R R R R SSSSS SSSSS SSSSS SSSSSS TTTTT TTTTT TTTT TTTTTTT U U U U U U U U U U U U U U U U U U U U U VVVVV VVVVV VVVV VVVVVVV W W W W W W W W W W W W W W W W W W W W W XXXXX XXXXX XXXX XXXXXXX YYYYY YYYYY YYYY YYYYYYY ZZZZZ ZZZZZ ZZZZ ZZZZZZZ 11111 11111 1111 1111111 22222 22222 2222 2222222 33333 33333 3333 3333333 44444 44444 4444 4444444 55555 55555 5555 5555555 66666 6666 66666 6666666 ξ aaaaaaaa aaaaaaaa a aaaa −10 −5 0 5 10 0 5 10 15 20 25 30 35 EIF bbbbbbb bbbbbbbbb bbbbb cccc ccc cc ccccc cc ccc cc ddddddd ddddddddd ddddd eeeeeee eeeeeeeee eeeee f f f f f f f f f f f f f f f f f f f f f gggggg gggggggggg ggggg hhhhhh hhhhhhhhhh hhhhh i i i i i i i i i i i i i i i i i i i i i j j j j j j j j j j j j j j j j j j j j j kkkk kk kkk kkkkk kk kkk kk l l l l l l l l l l l l l l l l l l l l l m m m m m m m m m m mm m m m m m m m m m nnnnnn nnnnnnnnn nnnnnn oooooo ooooooooo oooooo pppppp ppppppppp pppppp qqqqqq qqqqqqqqq qqqqqq r r r r r r r r r r r r r r r r r r r r r ssss ss sss sssss s ssss ss t t t t t t t t t t t t t t t t t t t t t uuuuuu uuuuuuuuu uuuuuu vvvv vv vvv vvvvv v vvvv vv w w w w w w w w w w w w w w w w w w w w w xxxx xx xxx x xxxx x xxxx xx yyyy yy yyy y yyyy y yyyy yy zzzz zzz zz zzzzz zz zzz zz AAAAAA AAAAAAAAA AAAAAA BBBBBB BBBBBBBBB BBBBBB C C C C C C C C C C C C C C C C C C C C C D D D D D D D D D D D D D D D D D D D D D EEEEEE EEEE EEEEE EEEEEE FFFFFF FFFF FFFFF FFFFFF G G G G G G G G G G G G G G G G G G G G G H H H H H H H H H H H H H H H H H H H H H I I I I I I I I I I I I I I I I I I I I I JJJJ J JJJJ J JJJJ J JJJJ JJ KKKKK KKKKK KKKKK KKKKKK LLLLL LLLLL LLLLL LLLLLL M M M M M M M M M M M M M M M M M M M M M N N N N N N N N N N N N N N N N N N N N N O O O O O O O O O O O O O O O O O O O O O PPPPP PPPPP PPPPP PPPPPP Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q R R R R R R R R R R R R R R R R R R R R R SSSSS SSSSS SSSSS SSSSSS TTTTT TTTTT TTTT TTTTTTT U U U U U U U U U U U U U U U U U U U U U VVVVV VVVVV VVVV VVVVVVV W W W W W W W W W W W W W W W W W W W W W XXXXX XXXXX XXXX XXXXXXX YYYYY YYYYY YYYY YYYYYYY ZZZZZ ZZZZZ ZZZZ ZZZZZZZ 11111 11111 1111 1111111 22222 22222 2222 2222222 33333 33333 3333 3333333 44444 44444 4444 4444444 55555 55555 5555 5555555 66666 6666 66666 6666666 ξ
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Example: Long jumps in Olympic games with one
outliers
a a a a a a a a a a a a a a a a a a a a a −10 −5 0 5 10 −20 −15 −10 −5 0 5 10 EIF b b b b b b b b b b b b b b b b b b b b b c c c c c c c c c c c c c c c c c c c c c d d d d d d d d d d d d d d d d d d d d d e e e e e e e e e e e e e e e e e e e e e f f f f f f f f f f f f f f f f f f f f f g g g g g g g g g g g g g g g g g g g g g h h h h h h h h h h h h h h h h h h h h h i i i i i i i i i i i i i i i i i i i i i j j j j j j j j j j j j j j j j j j j j j k k k k k k k k k k k k k k k k k k k k k l l l l l l l l l l l l l l l l l l l l l m m m m m m m mm m m m m m m m m m m m m n n n n n n n n n n n n n n n n n n n n n o o o o o o o o o o o o o o o o o o o o o p p p p p p p p p p p p p p p p p p p p p q q q q q q q q q q q q q q q q q q q q q r r r r r r r r r r r r r r r r r r r r r s s s s s s s s s s s s s s s s s s s s s t t t t t t t t t t t t t t t t t t t t t u u u u u u u u u u u u u u u u u u u u u v v v v v v v v v v v v v v v v v v v v v w w w w w w w w w w w w w w w w w w w w w x x x x x x x x x x x x x x x x x x x x x y y y y y y y y y y y y y y y y y y y y y z z z z z z z z z z z z z z z z z z z z z A A A A A A A A A A A A A A A A A A A A A B B B B B B B B B B B B B B B B B B B B B C C C C C C C C C C C C C C C C C C C C C D D D D D D D D D D D D D D D D D D D D D E E E E E E E E E E E E E E E E E E E E E F F F F F F F F F F F F F F F F F F F F F G G G G G G G G G G G G G G G G G G G G G H H H H H H H H H H H H H H H H H H H H H I I I I I I I I I I I I I I I I I I I I I J J J J J J J J J J J J J J J J J J J J J K K K K K K K K K K K K K K K K K K K K K L L L L L L L L L L L L L L L L L L L L L M M M M M M M M M M M M M M M M M M M M M N N N N N N N N N N N N N N N N N N N N N O O O O O O O O O O O O O O O O O O O O O P P P P P P P P P P P P P P P P P P P P P Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q Q R R R R R R R R R R R R R R R R R R R R R S S S S S S S S S S S S S S S S S S S S S T T T T T T T T T T T T T T T T T T T T T U U U U U U U U U U U U U U U U U U U U U V V V V V V V V V V V V V V V V V V V V V W W W W W W W W W W W W W W W W W W W W W X X X X X X X X X X X X X X X X X X X X X Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Y Z Z Z Z Z Z Z Z Z Z Z Z Z Z Z Z Z Z Z Z Z 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7597 ξDetection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Example: Long jumps in Olympic games with two
outliers
a a a a a a a a a a a −10 −5 0 5 10 0 2 4 6 8 EIF b b b b b b b b b b b c c c c c c c c c c c d d d d d d d d d d d e e e e e e e e e e e f f f f f f f f f f f g g g g g g g g g g g h h h h h h h h h h h i i i i i i i i i i i j j j j j j j j j j j k k k k k k k k k k k l l l l l l l l l l l m m m m m m m m m m m n n n n n n n n n n n o o o o o o o o o o o p p p p p p p p p p p q q q q q q q q q q q r r r r r r r r r r r s s s s s s s s s s s t t t t t t t t t t t u u u u u u u u u u u v v v v v v v v v v v w w w w w w w w w w w x x x x x x x x x x x y y y y y y y y y y y z z z z z z z z z z z A A A A A A A A A A A B B B B B B B B B B B C C C C C C C C C C C D D D D D D D D D D D E E E E E E E E E E E F F F F F F F F F F F G G G G G G G G G G G H H H H H H H H H H H I I I I I I I I I I I J J J J J J J J J J J K K K K K K K K K K K L L L L L L L L L L L M M M M M M M M M M M N N N N N N N N N N N O O O O O O O O O O O P P P P P P P P P P P Q Q Q Q Q Q Q Q Q Q Q R R R R R R R R R R R S S S S S S S S S S S T T T T T T T T T T T U U U U U U U U U U U V V V V V V V V V V V W W W W W W W W W W W X X X X X X X X X X X Y Y Y Y Y Y Y Y Y Y Y Z Z Z Z Z Z Z Z Z Z Z 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 598 8 608 ξDetection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Adaptation to multiple influential observations
Let S = (x1, . . . , xn) be a dataset of size n;
Define Ck
n functions of ξ ∈ R:
S[i, . . . , j, ξ] = S + ξ (ei + . . . + ej)
where el is the leth unit vector in Rn;
Our diagnostic plot is the representation of Ck n EIF’s
R → R : ξ 7→ ER(S[i, . . . , j, ξ]) − ER(S) 1/n
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Example: Long jumps in Olympic games with two
outliers
−10 −5 0 5 10 −25 −20 −15 −10 −5 0 5 10 EIF 59,60 ξDetection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Outline
1 Introduction 2 Error rate3 Influence function of the error rate 4 Diagnostic plot
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Conclusions
The generalized 2-means procedure can give a more robust estimator of the error rate with a bounded penalty function; But it is not robust w.r.t. all kind of contamination ⇒ introduction of the trimmed 2-means method;
The diagnostic plot presented here can be useful to detect single influential observation or multiple influential
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Future research
Adaptation of the diagnostic plot to multivariate cases; Error rate of the trimmed 2-means procedure : for α ∈ [0, 1], (T1(F ), T2(F )) are solutions of
min
{A:F (A)=1−α}{t1min,t2}⊂R
Z A Ω inf 1≤j≤2kx − tjk dF(x)
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Bibliography
Anderson T.W., An Introduction to Multivariate Statistical Analysis, Wiley, New-York, 1958, pp. 126-133
Croux C., Filzmoser P., and Joossens K. (2008), Classification efficiencies for robust linear discriminant analysis, Statistica Sinica 18, pp. 581-599
Croux C., Haesbroeck G., and Joossens K. (2008), Logistic discrimination using robust estimators : an influence function approach, The Canadian Journal of Statistics, 36, pp. 157-174
Cuesta-Albertos J.A., Gordaliza A., and Matrán C. (1997), Trimmed k-means: an attempt to robustify quantizers, The Annals os Statistics25, pp.553-576
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Bibliography
Fernholz L. T. (2001), On multivariate higher order von Mises expansions, Metrika 53, pp. 123-140
García-Escudero L. A., and Gordaliza A. (1999),
Robustness properties of k-means and trimmed k-means, Journal of the American Statistical Association 94, pp. 956-969
García-Escudero L. A., Gordaliza A., and Matrán C. (2003), Trimming tools in exploratory data analysis, Journal of Computational and Graphical Statistics 12, pp. 434-449 Hampel F.R., Ronchetti E.M., Rousseeuw P.J., and Stahel W.A., Robust statistics: The approach based on influence functions, John Wiley and Sons, New-York, 1986
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Bibliography
Pollard D. (1981), Strong Consistency of k-Means Clustering, The Annals of Probability 9, pp.919-926 Pollard D. (1982), A Central Limit Theorem for k-Means Clustering, The Annals of Probability 10, pp.135-140 Qiu D. and Tamhane A. C. (2007), A comparative study of the k-means algorithm and the normal mixture model for clustering : Univariate case, Journal of Statistical Planning and Inference, 137, pp. 3722-3740.
Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Graph of IF(x; ER, F )
−4 −2 0 2 4 −2.0 −1.0 0.0 0.5
2−means
x IF ER π1=0.4 π1=0.6 π1=0.2 π1=0.8Detection of influential observations on the error rate based on the generalized k-means clustering procedure Ch. Ruwet Introduction Error rate IF of the ER Diagnostic plot Conclusion
Graph of IF(x; ER, F )
−4 −2 0 2 4 −0.2 −0.1 0.0 0.1 0.2