
Gaussian Parsimonious Clustering Models Scale Invariant and Stable by Projection


HAL Id: hal-00688250

https://hal.inria.fr/hal-00688250

Submitted on 17 Apr 2012


Christophe Biernacki, Alexandre Lourme

To cite this version:

Christophe Biernacki, Alexandre Lourme. Gaussian Parsimonious Clustering Models Scale Invariant and Stable by Projection. Statistics and Computing, Springer Verlag (Germany), 2013, pp.21. ⟨hal-00688250⟩

ISSN 0249-6399, ISRN INRIA/RR--7932--FR+ENG

RESEARCH REPORT N° 7932

April 2012


RESEARCH CENTRE LILLE – NORD EUROPE

Parc scientifique de la Haute-Borne

Christophe Biernacki*, Alexandre Lourme†

Project-Team Modal

Research Report n° 7932, April 2012, 21 pages

Abstract: Gaussian mixture model-based clustering is now a standard tool to determine a hypothetical underlying structure in continuous data. However, many usual parsimonious models, despite their appealing geometrical interpretation, suffer from major drawbacks such as scale dependence or unsustainability of the constraints by projection. In this work we present a new family of parsimonious Gaussian models based on a variance-correlation decomposition of the covariance matrices. These new models are stable by projection into the canonical planes and are therefore faithfully representable in low dimension. They are also stable under a modification of the measurement units of the data, and such a modification does not change the model selection based on likelihood criteria. We highlight all these stability properties by a specific geometrical representation of each model. A detailed GEM algorithm is also provided for the inference of every model. Then, on biological and geological data, we compare our stable models to standard geometrical ones.

Key-words: Correlation, EM algorithm, Faithful projection, Maximum-Likelihood, Standard deviation, Unit independence

* University Lille 1 & CNRS & Inria

† IUT département de Génie Biologique, Université de Pau et des Pays de l'Adour




1 INTRODUCTION

Nowadays Gaussian mixture models are commonly used for classifying continuous data. They make it possible both (i) to unambiguously determine the structure of a dataset by rigorously defining the concept of homogeneous subgroups and (ii) to provide a meaningful interpretation of the inferred partition. In order to gradually reduce the variability of the general heteroscedastic model, Celeux and Govaert (1995), inspired by Banfield and Raftery (1993), define some geometrical parsimonious Gaussian mixtures based on a spectral decomposition of the covariance matrices. These models have had a seminal influence in recent years (see Biernacki 1997; Biernacki, Celeux, Govaert, and Langrognet 2006; Bouveyron 2006; Baudry 2009; Greselin, Ingrassia, and Punzo 2011) and nowadays they are very widespread. They enable Bouveyron, Girard, and Schmid (2007), for example, to detect classes in the chemical composition of Mars soil. They are employed by Michel (2008) to classify production curves and to determine the nature of oil fields. They are also used by Maugis et al. (2009) for selecting variables intended to clarify gene functions.

However, some of these geometrical models suffer from multiple drawbacks. Projecting a model onto a canonical subspace, for example, may break the model structure. Consequently, some of the geometric models cannot be represented faithfully in low dimension. In addition, the geometric models are not stable under a change of the measurement units: such a modification may again infringe the model structure. Another consequence is that the model selected within the geometric family by a classical likelihood criterion such as AIC (Akaike 1974) or BIC (Schwarz 1978) depends on the measurement units. Thus the retained model does not really reflect an intrinsic property of the data.

We present in this work a new family of parsimonious Gaussian mixtures based on a variance-correlation decomposition of the covariance matrices. The parsimony of our models refers to parameters with a statistical interpretation (standard deviation, correlation, coefficient of variation) instead of a geometric interpretation (volume, orientation, shape). They have multiple stability properties which make them mathematically consistent and facilitate their interpretation. Firstly, the characteristic constraints of each model still hold in every canonical plane. This ensures that each parsimonious mixture can be represented faithfully in dimension 2. Secondly, changing the measurement units does not alter the constraints inherent to the models. In addition, the choice of particular units does not have any effect on the model selection based on many classical criteria. In particular, raw data and reduced data lead to the selection of the same model.

We recall in Subsection 2.1 the general framework of the Gaussian mixture model-based clustering method and then, in Subsection 2.2, the standard geometric models of Celeux and Govaert (1995). Then we define our new Gaussian mixtures based on a statistical interpretation of the classes (Section 3); a geometrical representation of them is proposed at the same time. Section 4 highlights the stability properties of our new model family which are lacking in the geometric family of Celeux and Govaert (1995). We show in Subsection 4.1 that any mixture of this family can be faithfully represented in any canonical plane. Then we establish in Subsection 4.2 that our models are stable under a change of the measurement units and that such a modification has no effect on the model selection when the latter is based on classical likelihood criteria such as AIC (Akaike 1974), BIC (Schwarz 1978) or ICL (Biernacki, Celeux, and Govaert 2000). Within this model family, the Maximum Likelihood parameter estimation relies on a GEM algorithm which is detailed in Section 5. In Section 6 we compare our models with the standard geometrical ones on real data. First, on a very famous dataset concerning eruptions of the Old Faithful geyser, we illustrate (Subsection 6.1) the scale invariance (resp. the scale dependence) of the model selection within the new family (resp. within the geometric family). In this geological context we will see that the new models both (i) improve the fit of the geometrical models and (ii) lead to a more convincing interpretation of the properties of the conditional data. Then in Subsection 6.2 the new models are used to classify a sample of seabirds described by morphological features. The new family retrieves the bird subspecies better than the geometrical family does; moreover, the selected new model allows the bird subspecies to be interpreted as arising stochastically from some common reference population. Finally, we mention in Section 7 some results from additional experiments and we consider several perspectives for our new Gaussian mixtures.

2 GEOMETRICAL PARSIMONIOUS MODELS

2.1 General model-based clustering principle

Unsupervised classification aims (i) to decide whether the data within some sample $x = \{x_i;\ i = 1, \ldots, n\} \subset \mathbb{R}^d$ are homogeneous and otherwise (ii) to detect some underlying partition of $x$. So a matrix $z = (z_1, \ldots, z_n)$ has to be determined, where the $i$th row $z_i = (z_{i1}, \ldots, z_{iK})$ indicates whether $x_i$ belongs to the class $k$ ($z_{ik} = 1$) or not ($z_{ik} = 0$). $K \in \mathbb{N}$ represents the (unknown) number of clusters.

Gaussian model-based clustering assumes that the couples $(x_i, z_i)$ are realizations of independent random vectors identically distributed to $(X, Z)$. The $k$th component $Z_k$ of $Z \in \{0,1\}^K$ equals 1 (and the other ones 0) with probability $\pi_k$ ($0 < \pi_k < 1$ and $\sum_{k=1}^{K} \pi_k = 1$), and the conditional vector $(X \mid Z_k = 1)$ is normal, non-degenerate, with center $\mu_k \in \mathbb{R}^d$ and with covariance matrix $\Sigma_k \in \mathbb{R}^{d \times d}$. So the observed data $x_i$ are assumed to be distributed according to the Gaussian mixture:

$$f(\bullet;\psi) = \sum_{k=1}^{K} \pi_k \Phi_d(\bullet;\mu_k,\Sigma_k), \tag{1}$$

$\Phi_d(\bullet;\mu_k,\Sigma_k)$ denoting the normal density of center $\mu_k$ and covariance matrix $\Sigma_k$, and $\psi = \{(\pi_k,\mu_k,\Sigma_k);\ k = 1, \ldots, K\}$ denoting the parameter of the model. In addition the missing data $z_i$ are assumed to be distributed according to the $K$-dimensional multinomial distribution of order 1 and parameter $(\pi_1, \ldots, \pi_K)$.
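As a minimal illustration of the mixture density (1), the following Python sketch evaluates $f$ for bivariate data ($d = 2$) with the standard library only. The function names `normal_pdf_2d` and `mixture_pdf` and all numerical values are illustrative assumptions, not taken from the report.

```python
import math

def normal_pdf_2d(x, mu, sigma):
    """Bivariate normal density Phi_2(x; mu, sigma); sigma is a 2x2 SPD matrix."""
    (a, b), (_, c) = sigma
    det = a * c - b * b
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # Squared Mahalanobis distance, using the closed-form inverse of a 2x2 matrix.
    q = (c * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))

def mixture_pdf(x, pis, mus, sigmas):
    """Mixture density f(x; psi) = sum_k pi_k * Phi_2(x; mu_k, Sigma_k)."""
    return sum(p * normal_pdf_2d(x, m, s) for p, m, s in zip(pis, mus, sigmas))

# Hypothetical two-component parameter psi (values chosen for illustration only).
pis = [0.4, 0.6]
mus = [(0.0, 0.0), (3.0, 3.0)]
sigmas = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.5], [0.5, 1.0]]]
density_at_origin = mixture_pdf((0.0, 0.0), pis, mus, sigmas)
```

The closed-form 2x2 inverse keeps the sketch dependency-free; for general $d$ a linear algebra library would be the natural choice.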

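Noting $\hat\psi$ the Maximum Likelihood estimate of $\psi$, the data are classified by Maximum A Posteriori (MAP): $\hat z_{ik} = 1 \Leftrightarrow \forall j \in \{1, \ldots, K\},\ t_{ik} \geq t_{ij}$, where $t_{ik}$ is the conditional probability

$$t_{ik} = \hat\pi_k \Phi_d(x_i; \hat\mu_k, \hat\Sigma_k) / f(x_i; \hat\psi). \tag{2}$$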
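The MAP rule built on (2) can be sketched as follows. To keep the example short it assumes univariate data ($d = 1$); the helper names `posteriors` and `map_label` and the parameter values are hypothetical.

```python
import math

def normal_pdf(x, mu, var):
    """Univariate normal density with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def posteriors(x, pis, mus, variances):
    """Conditional probabilities of equation (2) for one observation x."""
    weighted = [p * normal_pdf(x, m, v) for p, m, v in zip(pis, mus, variances)]
    total = sum(weighted)  # this is the mixture density f(x; psi)
    return [w / total for w in weighted]

def map_label(x, pis, mus, variances):
    """Index k of the class with the largest conditional probability."""
    t = posteriors(x, pis, mus, variances)
    return max(range(len(t)), key=t.__getitem__)

# Hypothetical estimated parameters for K = 2 classes.
pis, mus, variances = [0.5, 0.5], [0.0, 5.0], [1.0, 1.0]
labels = [map_label(x, pis, mus, variances) for x in (0.1, 4.9)]
```

Ties are broken here in favour of the smallest class index, which is an arbitrary choice of this sketch.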

So clustering based on Gaussian mixtures consists of two steps: (i) the inference of a model from the observed data $x_i$ and then (ii) the assessment of classes by estimating the missing data $z_i$.

Step (i) is an opportunity to make several parsimonious hypotheses compete, that is, to consider diverse restrictions of the parameter space $\Psi$. This step also makes it possible to propose several values of the mixture order $K$. The BIC criterion (Schwarz 1978) enables the choice of both a parsimonious mixture model and a number of clusters $K$. This criterion is defined by:

$$\mathrm{BIC} = (\eta/2) \log n - \ell(\hat\psi; x), \tag{3}$$

where $\eta$ denotes the dimension of the parameter $\psi$ and $\ell(\hat\psi; x)$ its maximized log-likelihood computed on $x$. As BIC sometimes leads to strongly overlapping groups which are difficult to interpret, one may prefer $\mathrm{ICL} = \mathrm{BIC} - \sum_{i=1}^{n} \sum_{k=1}^{K} \hat z_{ik} \log t_{ik}$ (see Biernacki, Celeux, and Govaert 2000). Indeed this other likelihood-based criterion favours well separated groups and more interpretable structures.
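The two criteria can be computed from quantities already available after estimation. The sketch below assumes that (3) reads as the penalty $(\eta/2)\log n$ minus the maximized log-likelihood, so that smaller values are preferred, and that ICL subtracts the sum of $\hat z_{ik} \log t_{ik}$ from BIC; the function names and inputs are illustrative assumptions.

```python
import math

def bic(loglik, eta, n):
    """BIC = (eta / 2) * log(n) - maximized log-likelihood (smaller is better)."""
    return 0.5 * eta * math.log(n) - loglik

def icl(loglik, eta, n, t):
    """ICL = BIC - sum_i sum_k z_hat[i][k] * log t[i][k].

    t[i][k] is the conditional probability that observation i belongs to
    class k; z_hat is the MAP assignment, so only the largest entry of each
    row contributes to the double sum.
    """
    assigned_log_prob = sum(math.log(max(row)) for row in t)
    return bic(loglik, eta, n) - assigned_log_prob

# Hypothetical values: 100 observations, 5 free parameters.
crisp = [[1.0, 0.0]] * 100      # perfectly separated classes
fuzzy = [[0.5, 0.5]] * 100      # completely overlapping classes
icl_crisp = icl(-100.0, 5, 100, crisp)
icl_fuzzy = icl(-100.0, 5, 100, fuzzy)
```

With equal likelihoods, the overlapping partition receives the larger (worse) ICL value, which is how the criterion favours well separated groups.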

2.2 Spectral decomposition

As the Gaussian components are non-degenerate, each covariance matrix $\Sigma_k$ is symmetric positive definite. Then $\Sigma_k$ can be decomposed as:

$$\Sigma_k = \lambda_k S_k \Lambda_k S_k', \tag{4}$$

where: (i) $\lambda_k = |\Sigma_k|^{1/d}$ (volume of the class $k$), (ii) $S_k$ is an orthogonal matrix the columns of which are eigenvectors of $\Sigma_k$ (orientation of the class $k$) and (iii) $\Lambda_k$ is a diagonal positive definite matrix with determinant 1 and with diagonal coefficients in decreasing order (shape of the class $k$).
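For $d = 2$ the three factors of (4) have closed forms, which the following sketch computes and then recombines. It is only an illustration under that assumption; the function names are hypothetical and the orientation is returned as a rotation matrix whose first column is the eigenvector of the largest eigenvalue.

```python
import math

def spectral_decomposition_2x2(sigma):
    """Return (volume, orientation, shape) of a 2x2 SPD matrix as in (4)."""
    (a, b), (_, c) = sigma
    half_trace = 0.5 * (a + c)
    radius = math.hypot(0.5 * (a - c), b)
    l1, l2 = half_trace + radius, half_trace - radius  # eigenvalues, l1 >= l2
    volume = math.sqrt(a * c - b * b)                  # |Sigma|^(1/d) with d = 2
    theta = 0.5 * math.atan2(2.0 * b, a - c)           # angle of the main axis
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    orientation = [[cos_t, -sin_t], [sin_t, cos_t]]    # columns are eigenvectors
    shape = [l1 / volume, l2 / volume]                 # diagonal of Lambda, det 1
    return volume, orientation, shape

def recompose(volume, orientation, shape):
    """Rebuild Sigma = volume * S * diag(shape) * S'."""
    (s11, s12), (s21, s22) = orientation
    d1, d2 = volume * shape[0], volume * shape[1]
    off = d1 * s11 * s21 + d2 * s12 * s22
    return [[d1 * s11 * s11 + d2 * s12 * s12, off],
            [off, d1 * s21 * s21 + d2 * s22 * s22]]

sigma = [[2.0, 1.0], [1.0, 2.0]]
volume, orientation, shape = spectral_decomposition_2x2(sigma)
rebuilt = recompose(volume, orientation, shape)
```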

A Gaussian mixture of Celeux and Govaert (1995) is a combination of parsimonious hypotheses on the $\lambda_k$, $S_k$ and $\Lambda_k$ parameters. For example the so-denoted $[\lambda S_k \Lambda S_k']$ geometric model (illustrated by Figure 1 in the case $K = 2$) assumes that the Gaussian components have identical shapes, same volumes and free orientations (this model is called homometroscedastic in Greselin et al. 2011). But the constraints of this model do not remain in the canonical subspaces, as shown by Figure 1.

Figure 1: A major drawback of the geometrical models: unsustainability of the structure by projection into the canonical planes.

For another example, $[\lambda_k S \Lambda_k S']$ assumes that the orientations of the classes are homogeneous whereas the volumes and the shapes are free (this model is called homotroposcedastic in Greselin et al. 2011). Figure 2a represents in an orthonormal basis two Gaussian components inferred under these assumptions on the famous Old Faithful data (described in Subsection 6.1). But Figure 2b shows that a non-isotropic axis rescaling infringes the hypothesis of homogeneously orientated classes. This illustrates a second drawback of the geometrical models: they are not scale invariant.

Figure 2: Another drawback of the geometrical models: unsustainability of the structure by non-isotropic axis rescaling. Axes: Waiting (hour) and Duration (min). (a) Gaussian classes with same orientation in an orthonormal basis. (b) A modification of the x-axis scale infringes the assumption of homogeneous orientations.
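The effect shown in Figure 2 can be reproduced numerically. In the sketch below, two hypothetical covariance matrices (not estimated from the Old Faithful data) share the same principal direction; after rescaling the first axis only, their principal directions differ. The function names are assumptions made for the illustration.

```python
import math

def principal_angle(sigma):
    """Angle (radians) between the main axis of a 2x2 SPD matrix and the x-axis."""
    (a, b), (_, c) = sigma
    return 0.5 * math.atan2(2.0 * b, a - c)

def rescale(sigma, scales):
    """Covariance of D x when x has covariance sigma and D = diag(scales)."""
    return [[scales[i] * sigma[i][j] * scales[j] for j in range(2)]
            for i in range(2)]

# Two classes with the same orientation (main axis at 45 degrees) but different shapes.
sigma_1 = [[2.0, 1.0], [1.0, 2.0]]
sigma_2 = [[5.0, 4.0], [4.0, 5.0]]
gap_before = abs(principal_angle(sigma_1) - principal_angle(sigma_2))

# Non-isotropic rescaling: only the first variable changes unit (factor 2).
scales = (2.0, 1.0)
gap_after = abs(principal_angle(rescale(sigma_1, scales))
                - principal_angle(rescale(sigma_2, scales)))
```

Here `gap_before` is zero while `gap_after` is not, so the homogeneous-orientation constraint holds in one unit system and fails in the other.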

3 NEW PARSIMONIOUS MODELS

3.1 Variance-correlation decomposition

As they are symmetric positive definite, the covariance matrices can also be decomposed as:

$$\Sigma_k = T_k R_k T_k \tag{5}$$

where $T_k$ is the corresponding diagonal matrix of conditional standard deviations and $R_k$ the associated matrix of conditional correlations. So $T_k(i,j) = \sqrt{\Sigma_k(i,j)}$ if $i = j$ and 0 otherwise, and $R_k = (T_k)^{-1} \Sigma_k (T_k)^{-1}$. Contrary to many other decompositions such as Cholesky's, (5) is canonical since both $T_k$ and $R_k$ matrices are unique.
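Decomposition (5) is straightforward to compute. The sketch below, written for any dimension $d$ with nested lists, extracts the standard deviations and the correlation matrix and then rebuilds $\Sigma_k$; the function names and the example matrix are illustrative assumptions.

```python
import math

def variance_correlation(sigma):
    """Return (std, corr): the diagonal of T and the matrix R of decomposition (5)."""
    d = len(sigma)
    std = [math.sqrt(sigma[i][i]) for i in range(d)]
    corr = [[sigma[i][j] / (std[i] * std[j]) for j in range(d)] for i in range(d)]
    return std, corr

def recompose(std, corr):
    """Rebuild Sigma = T R T from the standard deviations and correlations."""
    d = len(std)
    return [[std[i] * corr[i][j] * std[j] for j in range(d)] for i in range(d)]

sigma = [[4.0, 2.0], [2.0, 9.0]]   # hypothetical covariance matrix
std, corr = variance_correlation(sigma)
rebuilt = recompose(std, corr)
```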

The decomposition (5) makes it possible to consider several models by combining meaningful constraints on the $T_k$ and $R_k$ parameters but on the $\mu_k$ centers as well:

• $T_k$ ($k = 1, \ldots, K$) matrices are diagonal positive definite. We consider three possible states of standard deviations: free (no additional constraint on $T_k$ matrices), isotropically transformed ($\forall (k,k'): T_k = a_{k,k'} T_{k'}$; $a_{k,k'} \in \mathbb{R}^{+}$) or homogeneous ($T_k = T$).

• $R_k$ ($k = 1, \ldots, K$) matrices are symmetric positive definite and their diagonal coefficients equal 1. We consider two possible states of the correlations: free (no additional constraint on $R_k$ matrices) or homogeneous ($R_k = R$).

• Vectors $V_k = T_k^{-1} \mu_k$ ($k = 1, \ldots, K$), the components of which are conditional first-order standardized moments, are free or equal ($V_k = V$). When the $\mu_k$ components are non-zero, the inverses of the $V_k$ components are conditional coefficients of variation. So $V_k = V$ also means that the conditional coefficients of variation are supposed to be homogeneous.
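The three parameters listed above can be computed from $(\mu_k, \Sigma_k)$ in a few lines. The sketch below also checks, on hypothetical values, the property announced in the introduction that a change of measurement units leaves $R_k$ and $V_k$ unchanged while $T_k$ is rescaled. Function names and numbers are assumptions for illustration.

```python
import math

def rtv_parameters(mu, sigma):
    """Return (R, T, V): correlations, standard deviations, and V = T^{-1} mu."""
    d = len(mu)
    t = [math.sqrt(sigma[i][i]) for i in range(d)]
    r = [[sigma[i][j] / (t[i] * t[j]) for j in range(d)] for i in range(d)]
    v = [mu[i] / t[i] for i in range(d)]
    return r, t, v

def change_units(mu, sigma, scales):
    """Express (mu, sigma) in new units: x -> D x with D = diag(scales), scales > 0."""
    d = len(mu)
    new_mu = [scales[i] * mu[i] for i in range(d)]
    new_sigma = [[scales[i] * sigma[i][j] * scales[j] for j in range(d)]
                 for i in range(d)]
    return new_mu, new_sigma

# Hypothetical class parameters: first variable in hours, second in minutes.
mu, sigma = [1.2, 4.3], [[0.04, 0.03], [0.03, 0.16]]
r, t, v = rtv_parameters(mu, sigma)

# Convert the first variable from hours to minutes.
r2, t2, v2 = rtv_parameters(*change_units(mu, sigma, [60.0, 1.0]))
```

Since `r2` and `v2` agree with `r` and `v` up to rounding, constraints such as $R_k = R$ or $V_k = V$ do not depend on the chosen units in this example.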

The so-called RTV family consists of eleven Gaussian mixture models obtained by combining the previous constraints on the conditional correlations, standard deviations and first-order standardized moments. The family does not include the model assuming all parameters $T_k$, $R_k$ and $V_k$ as homogeneous because this combination amounts to merging all the components of the mixture.

Let us note two meaningful differences between the $T_k$ and $R_k$ parameters. Firstly, a constraint on the $T_k$ matrices postulates a model intrinsic to each variable, whereas a constraint on the $R_k$ matrices involves a model on couples of variables. Secondly, $(V_k, R_k)$ is a Gaussian parameter obtained by normalizing $(\mu_k, \Sigma_k)$ thanks to $T_k$. Indeed the normal vector of reduced variables $(T_k)^{-1}(X \mid Z_k = 1)$ has center $V_k$ and covariance matrix $R_k$.

The most general RTV model assumes the $R_k$, $T_k$ and $V_k$ parameters to be free. It is noted $[R_k, T_k, V_k]$ and it corresponds to a standard heteroscedastic Gaussian mixture.

In the homoscedastic case the $\Sigma_k$ ($k = 1, \ldots, K$) matrices are equal and so are the $T_k$ and $R_k$ matrices since the decomposition (5) is unique. Then the homoscedastic model is denoted $[R, T, V_k]$.

Table 1, where $[\bullet, a_k T, \bullet]$ denotes a model of isotropically transformed standard deviations, indicates the parameter dimension of each model within the RTV family.

| model | dimension |
| [Rk, Tk, Vk] (general) | Kd + Kd(d+1)/2 |
| [Rk, Tk, V] | d + Kd(d+1)/2 |
| [Rk, akT, Vk] | Kd + d + (K-1) + Kd(d-1)/2 |
| [Rk, akT, V] | 2d + (K-1) + Kd(d-1)/2 |
| [Rk, T, Vk] | Kd + d + Kd(d-1)/2 |
| [Rk, T, V] | 2d + Kd(d-1)/2 |
| [R, Tk, Vk] | 2Kd + d(d-1)/2 |
| [R, Tk, V] | Kd + d(d+1)/2 |
| [R, akT, Vk] | Kd + (K-1) + d(d+1)/2 |
| [R, akT, V] | (K-1) + d(d+3)/2 |
| [R, T, Vk] (homoscedastic) | Kd + d(d+1)/2 |

Table 1: Dimension of the Gaussian parameter of the RTV models.
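The entries of Table 1 follow from adding the free parameters of each block: $d(d-1)/2$ correlations per free $R$, $d$ standard deviations per free $T$ (or $d + (K-1)$ when the $T_k$ are proportional), and $d$ components per free $V$. The sketch below encodes this count; the state labels `"free"`, `"proportional"` and `"common"` and the function name are assumptions introduced for the example.

```python
from itertools import product

def rtv_dimension(r, t, v, K, d):
    """Dimension of the Gaussian parameter of an RTV model (mixing proportions excluded).

    r and v are "free" or "common"; t is "free", "proportional" or "common".
    """
    if (r, t, v) == ("common", "common", "common"):
        raise ValueError("the all-homogeneous combination is not part of the RTV family")
    dim_r = (K if r == "free" else 1) * d * (d - 1) // 2
    dim_t = {"free": K * d, "proportional": d + (K - 1), "common": d}[t]
    dim_v = (K if v == "free" else 1) * d
    return dim_r + dim_t + dim_v

# Enumerate the eleven admissible combinations.
models = [m for m in product(("free", "common"),
                             ("free", "proportional", "common"),
                             ("free", "common"))
          if m != ("common", "common", "common")]

K, d = 3, 4  # hypothetical number of classes and of variables
dimensions = {m: rtv_dimension(*m, K, d) for m in models}
```

For instance, with $K = 3$ and $d = 4$ the general model $[R_k, T_k, V_k]$ has dimension $Kd + Kd(d+1)/2 = 42$ and the homoscedastic model $[R, T, V_k]$ has dimension $Kd + d(d+1)/2 = 22$.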


3.2 Graphical representations

In this section we propose a specific representation of Gaussian mixtures, which highlights the homogeneity (or the heterogeneity) of the statistical parameters involved in the RTV models.

We refer now to Figure 3. The Gaussian parameter $(\mu, \Sigma)$ of some normal random vector $Y$ in $\mathbb{R}^2$ can be represented by: $\Gamma(\rho, \mu, \Sigma) = \{x \in \mathbb{R}^2;\ (x - \mu)' \Sigma^{-1} (x - \mu) = \rho\}$. The latter is an ellipse the points of which are at a distance $\rho$ from $\mu$, according to the Mahalanobis metric $\Sigma^{-1}$. The smallest rectangle containing $\Gamma$, plotted in dashed line, indicates the dispersion of the $Y$ variables, and allows (possibly) the comparison of the dispersion of the corresponding variables of several Gaussian random vectors. The correlation of the $Y$ variables is represented by a solid segment centered in $\mu$. The angle between the latter and the horizontal is proportional to the correlation of the variables, and the coefficient of proportionality is $\pi/2$. Thus, the $Y$ variables are all the closer to independence (resp. all the more correlated) as the solid segment is close to horizontal (resp. to vertical). The solid arrow with origin $\mu$ represents the first-order standardized moments of $Y$, that is $V = T^{-1} \mu$ where $T$ is the diagonal matrix of the standard deviations of $Y$. The drawn vector is $V/\gamma$ where

$$\gamma = \frac{\|V\|_2}{\rho \sqrt{\sum_{i=1}^{2} T(i,i)^2 / 2}}$$

is a coefficient of graphical normalization by which the dimensions of the solid arrow and the dashed rectangle are close.

Figure 3: Representation of the first-order standardized moments (arrow), of the standard deviations (dashed rectangle) and of the correlation (solid segment), for a random vector in $\mathbb{R}^2$.

Then Figures 4a to 4k display the eleven models of the RTV family (for $d = 2$ and $K = 2$). The dashed rectangles make it possible to compare the conditional standard deviations, the solid segments the correlations of the variables, and the arrows the first-order standardized moments. The vectors representing the latter are $V_k/\gamma$ where the graphical normalization coefficient is now

$$\gamma = \frac{\sum_{k=1}^{K} \|V_k\|_2 / K}{\rho \sqrt{\sum_{k=1}^{K} \sum_{i=1}^{2} T_k(i,i)^2 / (2K)}}.$$
