HAL Id: hal-00688250
https://hal.inria.fr/hal-00688250
Submitted on 17 Apr 2012
Gaussian Parsimonious Clustering Models Scale Invariant and Stable by Projection
Christophe Biernacki, Alexandre Lourme
To cite this version:
Christophe Biernacki, Alexandre Lourme. Gaussian Parsimonious Clustering Models Scale Invariant and Stable by Projection. Statistics and Computing, Springer Verlag (Germany), 2013, pp.21. ⟨hal-00688250⟩
ISSN 0249-6399, ISRN INRIA/RR--7932--FR+ENG
RESEARCH REPORT N° 7932
April 2012
RESEARCH CENTRE LILLE – NORD EUROPE
Parc scientifique de la Haute-Borne
Christophe Biernacki*, Alexandre Lourme
Project-Team Modal
Research Report n° 7932, April 2012, 21 pages
Abstract: Gaussian mixture model-based clustering is now a standard tool to determine
a hypothetical underlying structure in continuous data. However, many usual parsimonious
models, despite their appealing geometrical interpretation, suffer from major drawbacks such as
scale dependence or unsustainability of the constraints by projection. In this work we present a new
family of parsimonious Gaussian models based on a variance-correlation decomposition of the
covariance matrices. These new models are stable by projection into the canonical planes and, so,
faithfully representable in low dimension. They are also stable by modification of the measurement
units of the data, and such a modification does not change the model selection based on likelihood
criteria. We highlight all these stability properties by a specific geometrical representation of each
model. A detailed GEM algorithm is also provided for every model inference. Then, on biological
and geological data, we compare our stable models to standard geometrical ones.
Key-words: Correlation, EM algorithm, Faithful projection, Maximum-Likelihood, Standard
deviation, Unit independence
* University Lille 1 & CNRS & Inria
IUT département de Génie Biologique, Université de Pau et des Pays de l'Adour
Summary: Gaussian mixture model-based clustering is now a standard tool for determining
a hypothetical hidden structure in a continuous data set. Yet many usual parsimonious models,
despite their appealing geometrical interpretation, suffer from major flaws such as dependence on
the measurement units or violation of the constraints by projection. In this work, we present a new
family of parsimonious Gaussian models relying on a variance-correlation decomposition of the
covariance matrices. These new models are stable by projection onto the canonical planes and,
consequently, faithfully representable in low dimension. They are also independent of the
measurement units of the data, which means that this sometimes arbitrary choice has no
consequence on model selection relying on likelihood-based criteria. We highlight all these
stability properties through a geometrical representation specific to each model. A GEM algorithm
is also given in detail for estimating their parameters. Finally, we compare our stable models with
the standard geometrical models on real data from biology and geology.
Keywords: EM algorithm, correlation, standard deviation, unit independence, faithful
projection, maximum likelihood
1 INTRODUCTION
Nowadays Gaussian mixture models are commonly used for classifying continuous data. They
make it possible both (i) to determine unambiguously the structure of a data set by rigorously
defining the concept of homogeneous subgroups and (ii) to provide a meaningful interpretation of
the inferred partition. In order to gradually reduce the variability of the general heteroscedastic
model, Celeux and Govaert (1995), inspired by Banfield and Raftery (1993), define some geometrical
parsimonious Gaussian mixtures based on a spectral decomposition of the covariance matrices. These
models have had a seminal influence in recent years (see Biernacki 1997; Biernacki, Celeux,
Govaert, and Langrognet 2006; Bouveyron 2006; Baudry 2009; Greselin, Ingrassia, and Punzo 2011)
and nowadays they are very widespread. They enable Bouveyron, Girard, and Schmid (2007),
for example, to detect classes in the chemical composition of Mars soil. They are employed
by Michel (2008) to classify production curves and to determine the nature of oil fields. They
are also used by Maugis et al. (2009) for selecting variables intended to clarify gene functions.
However, some of these geometrical models suffer from multiple drawbacks. Projecting a
model onto a canonical subspace, for example, may break the model structure. Hence some of the
geometric models cannot be represented faithfully in low dimension. In addition, the geometric
models are not stable under a change of measurement units: such a modification may again
infringe the model structure. Another consequence is that the model selected within the geometric
family by a classical likelihood criterion like AIC (Akaike 1974) or BIC (Schwarz 1978)
depends on the measurement units. Thus the retained model does not really reflect an intrinsic
property of the data.
We present in this work a new family of parsimonious Gaussian mixtures based on a
variance-correlation decomposition of the covariance matrices. The parsimony of our models refers to
parameters with a statistical interpretation (standard deviation, correlation, coefficient of variation)
instead of a geometric interpretation (volume, orientation, shape). They have multiple stability
properties which make them mathematically consistent and facilitate their interpretation.
Firstly, the characteristic constraints of each model still hold in every canonical plane. This
ensures that each parsimonious mixture can be represented faithfully in dimension 2. Secondly,
changing the measurement units does not alter the constraints inherent to the models. In
addition, the choice of particular units does not even have any effect on the model selection
based on many classical criteria. In particular, raw data and reduced data (scaled to unit
variance) lead to selecting the same model.
We recall in Subsection 2.1 the general framework of the Gaussian mixture model-based
clustering method and then, in Subsection 2.2, the standard geometric models of Celeux
and Govaert (1995). Then we define our new Gaussian mixtures based on a statistical
interpretation of the classes (Section 3); a geometrical representation of them is proposed at the same
time. Section 4 highlights the stability properties of our new model family which are lacking
in the geometric family of Celeux and Govaert (1995). We show in Subsection 4.1 that any
mixture of this family can be faithfully represented in any canonical plane. Then we establish in
Subsection 4.2 that our models are stable under a change of measurement units and that such a
modification has no effect on the model selection when the latter is based on classical likelihood
criteria like AIC (Akaike 1974), BIC (Schwarz 1978) or ICL (Biernacki, Celeux, and Govaert
2000). Within this model family, the Maximum Likelihood parameter estimation relies on a GEM
algorithm which is detailed in Section 5. In Section 6 we compare our models with
the standard geometrical ones on real data. First, on a very famous data set concerning eruptions
of the Old Faithful geyser, we illustrate (Subsection 6.1) the scale invariance (resp. the scale
dependence) of the model selection within the new family (resp. within the geometric family). In
this geological context we will see that the new models both (i) improve the fit of the geometrical
models and (ii) lead to a more convincing interpretation of the properties of the conditional data.
Then in Subsection 6.2 the new models are used to classify a sample of seabirds described by
morphological features. The new family retrieves the bird subspecies better than the
geometrical family does; moreover the selected new model allows the bird subspecies to be
interpreted as arising stochastically from some common reference population. Finally, we mention in
Section 7 some results from additional experiments and we consider several perspectives for our new
Gaussian mixtures.
2 GEOMETRICAL PARSIMONIOUS MODELS
2.1 General model-based clustering principle
Unsupervised classification aims to (i) decide whether the data within some sample $x = \{x_i;\ i = 1, \dots, n\} \subset \mathbb{R}^d$ are homogeneous and otherwise (ii) to detect some underlying partition of $x$.
So a matrix $z = (z_1, \dots, z_n)'$ has to be determined, where the $i$th row $z_i = (z_{i1}, \dots, z_{iK})$ indicates
whether $x_i$ belongs to the class $k$ ($z_{ik} = 1$) or not ($z_{ik} = 0$). $K \in \mathbb{N}^*$ represents the (unknown) number of clusters.
Gaussian model-based clustering assumes that the couples $(x_i, z_i)$ are realizations of independent random vectors identically distributed as $(X, Z)$. The $k$th component $Z_k$ of $Z \in \{0,1\}^K$ equals 1 (and the other ones 0) with probability $\pi_k$ ($0 < \pi_k < 1$ and $\sum_{k=1}^K \pi_k = 1$) and the conditional vector $(X \mid Z_k = 1)$ is normal, non-degenerate, with center $\mu_k \in \mathbb{R}^d$ and with covariance matrix $\Sigma_k \in \mathbb{R}^{d \times d}$. So the observed data $x_i$ are assumed to be distributed according to the Gaussian mixture:
$$f(\bullet; \psi) = \sum_{k=1}^{K} \pi_k \Phi_d(\bullet; \mu_k, \Sigma_k), \tag{1}$$
$\Phi_d(\bullet; \mu_k, \Sigma_k)$ denoting the normal density with center $\mu_k$ and covariance matrix $\Sigma_k$, and $\psi = \{(\pi_k, \mu_k, \Sigma_k);\ k = 1, \dots, K\}$ denoting the parameter of the model. In addition the missing data $z_i$ are assumed to be distributed according to the $K$-dimensional multinomial distribution of order 1 and parameter $(\pi_1, \dots, \pi_K)$.
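Noting $\hat\psi$ the Maximum Likelihood estimate of $\psi$, the $n$ data are classified by Maximum A
Posteriori (MAP): $\hat z_{ik} = 1 \Leftrightarrow \forall j \in \{1, \dots, K\},\ t_{ik} \ge t_{ij}$, where $t_{ik}$ is the conditional probability that $x_i$ arises from class $k$:
$$t_{ik} = \hat\pi_k \Phi_d(x_i; \hat\mu_k, \hat\Sigma_k) / f(x_i; \hat\psi). \tag{2}$$
So clustering based on Gaussian mixtures consists of two steps: (i) the inference of a model
from the observed data $x_i$ and then (ii) the assessment of classes by estimating the missing data $z_i$.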
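The two steps can be sketched in a few lines for d = 2. This is only an illustration under stated assumptions: the function names (`gaussian_pdf_2d`, `posteriors`, `map_label`) and the values in `params` are hypothetical, and the parameter is taken as given rather than estimated by maximum likelihood. The snippet evaluates the mixture density (1), the conditional probabilities (2) and the MAP rule.

```python
import math

def gaussian_pdf_2d(x, mu, sigma):
    """Bivariate normal density Phi_2(x; mu, sigma); sigma is 2x2 positive definite."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # (x - mu)' sigma^{-1} (x - mu), using the closed-form inverse of a 2x2 matrix
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))

def posteriors(x, params):
    """Conditional probabilities of equation (2) for one observation x."""
    weighted = [pi_k * gaussian_pdf_2d(x, mu_k, sigma_k)
                for pi_k, mu_k, sigma_k in params]
    mixture_density = sum(weighted)  # f(x; psi), equation (1)
    return [w / mixture_density for w in weighted]

def map_label(x, params):
    """0-based index of the class with the largest conditional probability."""
    t = posteriors(x, params)
    return max(range(len(t)), key=lambda k: t[k])

# Hypothetical two-component parameter, one tuple (pi_k, mu_k, Sigma_k) per class
params = [
    (0.5, (0.0, 0.0), ((1.0, 0.0), (0.0, 1.0))),
    (0.5, (4.0, 4.0), ((1.0, 0.5), (0.5, 1.0))),
]
labels = [map_label(x, params) for x in [(0.2, -0.1), (3.8, 4.3)]]
```

Each observation is assigned to the component whose center it is closest to here, because the two hypothetical classes are well separated; with overlapping classes the conditional probabilities would be less clear-cut.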
Step (i) is an opportunity to put several parsimonious hypotheses in competition, that is, to
consider diverse restrictions of the parameter space $\Psi$. This step also makes it possible to propose several
values of the mixture order $K$. The BIC criterion (Schwarz 1978) makes it possible to choose both a
parsimonious mixture model and a number of clusters $K$. This criterion is defined by:
$$\mathrm{BIC} = (\eta/2) \log n - \ell(\hat\psi; x), \tag{3}$$
where $\eta$ denotes the dimension of the parameter $\psi$ and $\ell(\hat\psi; x)$ its maximized log-likelihood computed on $x$. As BIC sometimes leads to strongly overlapping groups which are difficult to interpret, one may prefer $\mathrm{ICL} = \mathrm{BIC} - \sum_{i=1}^{n} \sum_{k=1}^{K} \hat z_{ik} \log t_{ik}$ (see Biernacki, Celeux, and Govaert 2000).
Indeed this other likelihood-based criterion favours well-separated groups and more interpretable
structures.
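As a small numerical companion to (3), the sketch below computes both criteria from a maximized log-likelihood and a table of conditional probabilities. The function names `bic` and `icl` and the argument layout (`t` as a list of rows, one per observation) are assumptions made for the illustration. With the sign convention of (3), the penalty is added and the log-likelihood subtracted, so smaller values indicate a better trade-off.

```python
import math

def bic(max_loglik, eta, n):
    """BIC as written in (3): (eta / 2) * log(n) - maximized log-likelihood."""
    return 0.5 * eta * math.log(n) - max_loglik

def icl(max_loglik, eta, n, t):
    """ICL = BIC - sum_i sum_k zhat_ik * log(t_ik), zhat being the MAP assignment.

    t is a list of n rows; row i holds the conditional probabilities of x_i.
    """
    classification_term = 0.0
    for t_i in t:
        # zhat_ik equals 1 only for the class with the largest t_ik
        classification_term += math.log(max(t_i))
    return bic(max_loglik, eta, n) - classification_term
```

Since every `log(max(t_i))` is non-positive, ICL is never smaller than BIC under this convention, and the gap grows when the conditional probabilities are far from 0 or 1, which is how overlapping groups are penalized.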
2.2 Spectral decomposition
As the Gaussian components are non-degenerate, each covariance matrix $\Sigma_k$ is symmetric
positive definite. Then $\Sigma_k$ can be decomposed as:
$$\Sigma_k = \lambda_k S_k \Lambda_k S_k', \tag{4}$$
where: (i) $\lambda_k = |\Sigma_k|^{1/d}$ (volume of the class $k$), (ii) $S_k$ is an orthogonal matrix whose
columns are eigenvectors of $\Sigma_k$ (orientation of the class $k$) and (iii) $\Lambda_k$ is a diagonal positive
definite matrix with determinant 1 and with diagonal coefficients in decreasing order (shape of the
class $k$).
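For d = 2 the three terms of (4) have closed forms, which allows a dependency-free sketch. The helper names `spectral_decomposition_2d` and `rebuild_2d` are hypothetical, and the orientation $S_k$ is encoded by a single rotation angle `theta`, which is an assumption that only makes sense in dimension 2.

```python
import math

def spectral_decomposition_2d(sigma):
    """Return (lam, theta, shape) such that sigma = lam * S * diag(shape) * S',
    where S is the rotation of angle theta (2x2 symmetric positive definite input)."""
    (a, b), (_, c) = sigma
    half_trace = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    l1, l2 = half_trace + radius, half_trace - radius  # eigenvalues, l1 >= l2 > 0
    lam = math.sqrt(l1 * l2)                  # volume |sigma|^(1/d) with d = 2
    theta = 0.5 * math.atan2(2.0 * b, a - c)  # direction of the first eigenvector
    return lam, theta, (l1 / lam, l2 / lam)   # shape: decreasing, determinant 1

def rebuild_2d(lam, theta, shape):
    """Recompose lam * S * diag(shape) * S' with S = [[cos, -sin], [sin, cos]]."""
    c, s = math.cos(theta), math.sin(theta)
    a1, a2 = shape
    off_diagonal = lam * (a1 - a2) * c * s
    return ((lam * (a1 * c * c + a2 * s * s), off_diagonal),
            (off_diagonal, lam * (a1 * s * s + a2 * c * c)))

lam, theta, shape = spectral_decomposition_2d(((2.0, 1.0), (1.0, 2.0)))
rebuilt = rebuild_2d(lam, theta, shape)
```

For this example matrix the eigenvalues are 3 and 1, so the volume is $\sqrt{3}$, the orientation is $\pi/4$ and the shape is $(\sqrt{3}, 1/\sqrt{3})$.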
A Gaussian mixture of Celeux and Govaert (1995) is a combination of parsimonious
hypotheses on the parameters $\lambda_k$, $S_k$ and $\Lambda_k$. For example the geometric model denoted $[\lambda S_k \Lambda S_k']$
(illustrated by Figure 1 in the case $K = 2$) assumes that the Gaussian components have identical
shapes, identical volumes and free orientations (this model is called homometroscedastic in Greselin
et al. 2011). But the constraints of this model do not hold in the canonical subspaces, as
shown by Figure 1.
Figure 1: A major drawback of the geometrical models: unsustainability of the structure by
projection into the canonical planes.
As another example, $[\lambda_k S \Lambda_k S']$ assumes that the orientations of the classes are homogeneous whereas the volumes and the shapes are free (this model is called homotroposcedastic in Greselin
et al. 2011). Figure 2a represents, in an orthonormal basis, two Gaussian components inferred
under these assumptions on the famous Old Faithful data (described in Subsection 6.1). But
Figure 2b shows that a non-isotropic axis rescaling infringes the hypothesis of homogeneously
oriented classes. This illustrates a second drawback of the geometrical models: they are not
scale invariant.
Figure 2: Another drawback of the geometrical models: unsustainability of the structure by non-isotropic axis rescaling. (a) Gaussian classes with the same orientation in an orthonormal basis. (b) A modification of the x-axis scale infringes the assumption of homogeneous orientations. Axes: Waiting (hour) and Duration (min).
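The effect illustrated by Figure 2 can be reproduced numerically. The sketch below uses two made-up covariance matrices (not the Old Faithful estimates) that share the same orientation, then rescales one axis only. The helper names `orientation` and `rescale` are hypothetical, and the orientation is summarized by the angle of the first eigenvector, which is specific to d = 2.

```python
import math

def orientation(sigma):
    """Angle of the first eigenvector of a 2x2 symmetric positive definite matrix."""
    (a, b), (_, c) = sigma
    return 0.5 * math.atan2(2.0 * b, a - c)

def rescale(sigma, s1, s2):
    """Covariance of (s1 * X1, s2 * X2) when (X1, X2) has covariance sigma."""
    (a, b), (_, c) = sigma
    return ((s1 * s1 * a, s1 * s2 * b), (s1 * s2 * b, s2 * s2 * c))

# Two illustrative classes with a common orientation but different volumes and shapes
sigma_1 = ((2.0, 1.0), (1.0, 2.0))
sigma_2 = ((5.0, 4.0), (4.0, 5.0))

gap_before = abs(orientation(sigma_1) - orientation(sigma_2))
# Non-isotropic change of units: only the first variable is rescaled
gap_after = abs(orientation(rescale(sigma_1, 0.5, 1.0))
                - orientation(rescale(sigma_2, 0.5, 1.0)))
# Isotropic change of units: both variables are rescaled by the same factor
gap_isotropic = abs(orientation(rescale(sigma_1, 2.0, 2.0))
                    - orientation(rescale(sigma_2, 2.0, 2.0)))
```

The orientations coincide before rescaling and after the isotropic rescaling, but they differ after the non-isotropic one, so the equal-orientation constraint depends on the chosen units.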
3 NEW PARSIMONIOUS MODELS
3.1 Variance-correlation decomposition
As they are symmetric positive definite, the covariance matrices can also be decomposed as:
$$\Sigma_k = T_k R_k T_k \tag{5}$$
where $T_k$ is the corresponding diagonal matrix of conditional standard deviations and $R_k$ the
associated matrix of conditional correlations. So $T_k(i, j) = \sqrt{\Sigma_k(i, j)}$ if $i = j$ and 0
otherwise, and $R_k = (T_k)^{-1} \Sigma_k (T_k)^{-1}$. Contrary to many other decompositions such as Cholesky's, (5) is canonical since both matrices $T_k$ and $R_k$ are unique.
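A minimal sketch of decomposition (5), valid in any dimension d, is given below. The function names `variance_correlation` and `change_units` and the numerical values are assumptions for the illustration. The second helper applies a change of measurement units (each variable multiplied by a positive factor), which turns $\Sigma_k$ into $D \Sigma_k D$ with $D$ diagonal.

```python
import math

def variance_correlation(sigma):
    """Split sigma into (t, r): t lists the diagonal of T (standard deviations),
    r is the correlation matrix R = T^{-1} sigma T^{-1}."""
    d = len(sigma)
    t = [math.sqrt(sigma[i][i]) for i in range(d)]
    r = [[sigma[i][j] / (t[i] * t[j]) for j in range(d)] for i in range(d)]
    return t, r

def change_units(sigma, scales):
    """Covariance after multiplying variable i by scales[i] > 0, that is D sigma D."""
    d = len(sigma)
    return [[scales[i] * scales[j] * sigma[i][j] for j in range(d)]
            for i in range(d)]

sigma = [[4.0, 1.2], [1.2, 9.0]]
t, r = variance_correlation(sigma)
t_new, r_new = variance_correlation(change_units(sigma, [60.0, 0.001]))
```

Here `t` is `[2.0, 3.0]` and the off-diagonal correlation is 0.2. After the change of units the standard deviations are multiplied by the unit factors while the correlation matrix is unchanged up to rounding, which is the elementary fact behind the unit stability claimed for constraints on $R_k$.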
The decomposition (5) makes it possible to consider several models by combining meaningful constraints
on the parameters $T_k$ and $R_k$, but also on the centers $\mu_k$:
- The matrices $T_k$ ($k = 1, \dots, K$) are diagonal positive definite. We consider three possible states of the standard deviations: free (no additional constraint on the matrices $T_k$), isotropically transformed ($\forall (k, k'): T_{k'} = a_{k,k'} T_k$; $a_{k,k'} \in \mathbb{R}_+^*$) or homogeneous ($T_k = T$).
- The matrices $R_k$ ($k = 1, \dots, K$) are symmetric positive definite and their diagonal coefficients equal 1. We consider two possible states of the correlations: free (no additional constraint on the matrices $R_k$) or homogeneous ($R_k = R$).
- The vectors $V_k = T_k^{-1} \mu_k$ ($k = 1, \dots, K$), whose components are conditional first-order standardized moments, are free or equal ($V_k = V$). When the components of $\mu_k$ are non-zero, the inverses of the components of $V_k$ are conditional coefficients of variation. So $V_k = V$ also means that the conditional coefficients of variation are supposed to be homogeneous.
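The constraint $V_k = V$ can be illustrated on two made-up classes. In the sketch below, the function name `standardized_moments` and all numerical values are hypothetical; the second class has means and standard deviations three times larger than the first, so its standard deviations are isotropically transformed and its standardized moments are the same.

```python
import math

def standardized_moments(mu, sigma):
    """V = T^{-1} mu: each mean divided by the standard deviation of its variable."""
    return [m / math.sqrt(sigma[i][i]) for i, m in enumerate(mu)]

# Class 1: standard deviations (1, 2); class 2: standard deviations (3, 6)
mu_1, sigma_1 = [10.0, 20.0], [[1.0, 0.5], [0.5, 4.0]]
mu_2, sigma_2 = [30.0, 60.0], [[9.0, -2.0], [-2.0, 36.0]]

v_1 = standardized_moments(mu_1, sigma_1)
v_2 = standardized_moments(mu_2, sigma_2)
# Coefficients of variation are the inverses of the components of V (non-zero means)
cv_1 = [1.0 / v for v in v_1]
```

Both classes have the same vector of standardized moments although their centers and correlations differ, so they share the same coefficients of variation (0.1 for each variable in this example).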
The so-called RTV family consists of eleven Gaussian mixture models obtained by combining
the previous constraints on the conditional correlations, standard deviations and first-order
standardized moments. The family does not include the model assuming all parameters $T_k$, $R_k$
and $V_k$ to be homogeneous because this combination amounts to merging all the components of the mixture.
Let us note two meaningful differences between the parameters $T_k$ and $R_k$. Firstly, a constraint on the matrices $T_k$ postulates a model intrinsic to each variable whereas a constraint on the matrices $R_k$
involves a model on couples of variables. Secondly, $(V_k, R_k)$ is a Gaussian parameter
obtained by normalizing $(\mu_k, \Sigma_k)$ by means of $T_k$. Indeed the normal vector of reduced variables $(T_k)^{-1}(X \mid Z_k = 1)$ has center $V_k$ and covariance matrix $R_k$.
The most general RTV model assumes the parameters $R_k$, $T_k$ and $V_k$ to be free. It is denoted $[R_k, T_k, V_k]$ and it corresponds to a standard heteroscedastic Gaussian mixture.
In the homoscedastic case the matrices $\Sigma_k$ ($k = 1, \dots, K$) are equal, and so are the matrices $T_k$ and $R_k$
since the decomposition (5) is unique. The homoscedastic model is then denoted $[R, T, V_k]$.
Table 1, where $[\bullet, a_k T, \bullet]$ denotes a model of isotropically transformed standard deviations, indicates the parameter dimension of each model within the RTV family.

model                               dimension
[R_k, T_k, V_k] (general)           Kd + Kd(d+1)/2
[R_k, T_k, V]                       d + Kd(d+1)/2
[R_k, a_k T, V_k]                   Kd + d + (K-1) + Kd(d-1)/2
[R_k, a_k T, V]                     2d + (K-1) + Kd(d-1)/2
[R_k, T, V_k]                       Kd + d + Kd(d-1)/2
[R_k, T, V]                         2d + Kd(d-1)/2
[R, T_k, V_k]                       2Kd + d(d-1)/2
[R, T_k, V]                         Kd + d(d+1)/2
[R, a_k T, V_k]                     Kd + (K-1) + d(d+1)/2
[R, a_k T, V]                       (K-1) + d(d+3)/2
[R, T, V_k] (homoscedastic)         Kd + d(d+1)/2

Table 1: Dimension of the Gaussian parameter of the RTV models.
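The entries of Table 1 can be recovered by adding the contributions of the correlations, the standard deviations and the standardized moments. The sketch below follows this additive reading; the function name `rtv_dimension` and the state labels `"free"`, `"isotropic"` and `"common"` are hypothetical. As in Table 1, only the Gaussian parameter is counted, so the mixing proportions are not included.

```python
def rtv_dimension(r, t, v, K, d):
    """Dimension of the Gaussian parameter of an RTV model.

    r: "free" or "common"                (correlation matrices R_k)
    t: "free", "isotropic" or "common"   (standard deviations T_k)
    v: "free" or "common"                (standardized moments V_k)
    """
    if (r, t, v) == ("common", "common", "common"):
        raise ValueError("this combination merges all components; not in the family")
    n_corr = d * (d - 1) // 2  # free correlations in one d x d correlation matrix
    dim_r = K * n_corr if r == "free" else n_corr
    # isotropic: one reference T (d values) plus K - 1 positive factors a_k
    dim_t = {"free": K * d, "isotropic": d + (K - 1), "common": d}[t]
    dim_v = K * d if v == "free" else d
    return dim_r + dim_t + dim_v

# The family: every combination except the all-homogeneous one
family = [(r, t, v)
          for r in ("free", "common")
          for t in ("free", "isotropic", "common")
          for v in ("free", "common")
          if (r, t, v) != ("common", "common", "common")]
```

The list `family` contains eleven triples, matching the eleven rows of Table 1, and for example `rtv_dimension("free", "free", "free", K, d)` equals Kd + Kd(d+1)/2 for the general model.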
3.2 Graphical representations
In this section we propose a specific representation of Gaussian mixtures, which makes it possible
to highlight the homogeneity (or the heterogeneity) of the statistical parameters involved in the RTV
models.
We refer now to Figure 3. The Gaussian parameter $(\mu, \Sigma)$ of some normal random vector $Y$
in $\mathbb{R}^2$ can be represented by: $\Gamma(\rho, \mu, \Sigma) = \{x \in \mathbb{R}^2;\ (x - \mu)' \Sigma^{-1} (x - \mu) = \rho\}$. The latter is
an ellipse whose points are at a distance $\rho$ from $\mu$, according to the Mahalanobis metric
$\Sigma^{-1}$. The smallest rectangle containing $\Gamma$, plotted in dashed line, indicates the dispersion of the variables of $Y$, and makes it possible to compare the dispersion of the corresponding variables of several Gaussian random vectors. The correlation of the variables of $Y$ is represented by a solid segment centered at $\mu$. The angle between the latter and the horizontal is proportional to the correlation of the variables, and the coefficient of proportionality is $\pi/2$. Thus, the variables of $Y$ are closer to independence (resp. more correlated) as the solid segment is closer to horizontal (resp. to vertical). The solid arrow with origin $\mu$ represents the first-order standardized moments of $Y$, that is $V = T^{-1} \mu$ where $T$ is the diagonal matrix of the standard deviations of $Y$. The drawn vector is $V/\gamma$ where
$$\gamma = \frac{\|V\|_2}{\rho \sqrt{\sum_{i=1}^{2} T(i, i)^2 / 2}}$$
is a coefficient of graphical normalization by which the dimensions of the solid arrow and the dashed rectangle are close.
Figure 3: Representation of the first-order standardized moments (arrow), of the standard deviations (dashed rectangle) and of the correlation (solid segment), for a random vector in $\mathbb{R}^2$. Plot annotations: $\mu$, $\Gamma(\rho, \mu, \Sigma)$, $V = T^{-1}\mu$, $R$.
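The quantities needed to draw such a representation can be computed as in the sketch below. The function name `glyph_2d` and the numerical values are hypothetical. One assumption should be stated loudly: the half-sides of the dashed rectangle are obtained by taking the defining equation of $\Gamma(\rho, \mu, \Sigma)$ literally (quadratic form equal to $\rho$), which gives $\sqrt{\rho}\, T(i, i)$; the sketch also assumes $V \neq 0$ so that $\gamma$ is non-zero.

```python
import math

def glyph_2d(mu, sigma, rho):
    """Return (half_sides, angle, arrow) for a bivariate normal parameter.

    half_sides: half-width and half-height of the smallest rectangle containing
                {x : (x - mu)' sigma^{-1} (x - mu) = rho}
    angle:      angle of the solid segment with the horizontal (correlation * pi / 2)
    arrow:      the drawn vector V / gamma
    """
    (a, b), (_, c) = sigma
    t = (math.sqrt(a), math.sqrt(c))  # standard deviations T(1,1), T(2,2)
    corr = b / (t[0] * t[1])
    half_sides = (math.sqrt(rho) * t[0], math.sqrt(rho) * t[1])
    angle = corr * math.pi / 2.0
    v = (mu[0] / t[0], mu[1] / t[1])  # V = T^{-1} mu
    gamma = math.hypot(v[0], v[1]) / (rho * math.sqrt((t[0] ** 2 + t[1] ** 2) / 2.0))
    arrow = (v[0] / gamma, v[1] / gamma)
    return half_sides, angle, arrow

half_sides, angle, arrow = glyph_2d((350.0, 135.0), ((4.0, 3.0), (3.0, 9.0)), 2.0)
```

With a correlation of 0.5 the segment makes an angle of $\pi/4$ with the horizontal, and by construction the drawn arrow has length $\rho \sqrt{\sum_i T(i, i)^2 / 2}$ whatever the norm of $V$.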
Then Figures 4a to 4k display the eleven models of the RTV family (for $d = 2$ and $K = 2$).
The dashed rectangles make it possible to compare the conditional standard deviations, the solid
segments the correlations of the variables, and the arrows the first-order standardized moments.
The vectors representing the latter are $V_k/\gamma$ where the graphical normalization coefficient is now
$$\gamma = \frac{\sum_{k=1}^{K} \|V_k\|_2 / K}{\rho \sqrt{\sum_{k=1}^{K} \sum_{i=1}^{2} T_k(i, i)^2 / (2K)}}.$$