

2 High-dimensional clustering

Christophe Biernacki and Cathy Maugis-Rabusseau

2.1 Introduction
2.2 HD clustering: Curse or blessing?
2.2.1 HD density estimation: Curse
2.2.2 HD clustering: A mix of curse and blessing
2.2.3 Intermediate conclusion
2.3 Non-canonical models
2.3.1 Gaussian mixture of factor analysers
2.3.2 HD Gaussian mixture models
2.3.3 Functional data
2.3.4 Intermediate conclusion
2.4 Canonical models
2.4.1 Parsimonious mixture models
2.4.2 Variable selection through regularization
2.4.3 Variable role modelling
2.4.4 Co-clustering
2.4.5 Intermediate conclusion
2.5 Future methodological challenges
Bibliography


High-dimensional clustering

Christophe Biernacki and Cathy Maugis-Rabusseau

2.1 Introduction

High-dimensional (HD) data sets are now frequent, mostly for technological reasons: automation in variable acquisition, cheaper data storage and more powerful standard computers allowing quick data management. All fields are impacted by this general inflation of the number of variables, only the definition of "high" being domain dependent. In marketing, this number can be of order $10^2$, in microarray gene expression between $10^2$ and $10^4$, in text mining $10^3$ or more, and of order $10^6$ for single nucleotide polymorphism (SNP) data, etc. Note also that sometimes many more variables can be involved, as is typically the case with discretized curves, for instance curves coming from temporal sequences.

Here are two related illustrations. Figure 2.1(a) displays a text mining example. It mixes Medline (1033 medical abstracts) and Cranfield (1398 aeronautical abstracts), making a total of 2431 documents. Furthermore, all the words (excluding stop words) are considered as features, making a total of 9275 unique words. The data matrix consists of documents on the rows and words on the columns, each entry giving the term frequency, that is the number of occurrences of the corresponding word in the corresponding document. Figure 2.1(b) displays a curve example. This kneading data set comes from the Danone Vitapole Paris Research Center and concerns the quality of cookies and its relationship with the flour kneading process (Lévéder et al. [2004]). It is composed of 115 different flours for which the dough resistance is measured during the kneading process for 480 seconds. We notice that the number of equispaced instants of time in the interval $[0, 480]$ (here 241 measures) could be much larger than 241 if measures were more frequently recorded.


Figure 2.1: Examples of high-dimensional data sets: (a) Text mining: $n = 2431$ documents and the frequency with which $d = 9275$ unique words occur in each document (a whiter cell indicates a higher frequency); (b) Curves: $n = 115$ kneading curves observed at $d = 241$ equispaced instants of time in the interval $[0, 480]$.

Such a technological revolution has a huge impact on other scientific fields, societal as well as mathematical ones. In particular, high-dimensional data management brings some new challenges to statisticians, since standard (low-dimensional) data analysis methods struggle to apply directly to the new (high-dimensional) data sets. The reason can be twofold, the two aspects being sometimes linked: either combinatorial difficulties or a disastrously large increase in estimate variance. Data analysis methods are essential for providing a synthetic view of data sets, allowing data summary and data exploration for future decision making, for instance. This need is even more acute in the high-dimensional setting since, on the one hand, the large number of variables suggests that a lot of information is conveyed by the data but, on the other hand, such information may be hidden behind its volume.

Cluster analysis is one of the main data analysis methods. It aims at partitioning a data set $\mathbf{x} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$, composed of $n$ individuals lying in a space $\mathcal{X}$ of dimension $d$, into $K$ groups $G_1, \ldots, G_K$. This partition is denoted by $\mathbf{z} = (\mathbf{z}_1, \ldots, \mathbf{z}_n)$, lying in a space $\mathcal{Z}$, where $\mathbf{z}_i = (z_{i1}, \ldots, z_{iK})$ is a vector of $\{0,1\}^K$ such that $z_{ik} = 1$ if individual $\mathbf{x}_i$ belongs to the $k$th group $G_k$, and $z_{ik} = 0$ otherwise ($i = 1, \ldots, n$, $k = 1, \ldots, K$). Figure 2.2 gives an illustration of this principle when $d = 2$. Model-based clustering makes it possible to reformulate cluster analysis as a well-posed estimation problem, both for the partition $\mathbf{z}$ and for the number $K$ of groups. It considers the data $\mathbf{x}_1, \ldots, \mathbf{x}_n$ as $n$ i.i.d. realizations of a mixture pdf
$$f(\cdot; \theta_K) = \sum_{k=1}^{K} \pi_k f(\cdot; \alpha_k),$$
where $f(\cdot; \alpha_k)$ denotes the pdf, parameterized by $\alpha_k$, associated with group $k$, $\pi_k$ denotes the mixing proportion of this component ($\sum_{k=1}^{K} \pi_k = 1$, $\pi_k \geq 0$) and $\theta_K = (\pi_k, \alpha_k,\; k = 1, \ldots, K)$ denotes the whole mixture parameter. From the whole data set $\mathbf{x}$ it is then possible to obtain a mixture parameter estimate $\hat{\theta}_K$ and to deduce a partition estimate $\hat{\mathbf{z}}$ from the conditional probability $f(\mathbf{z} \mid \mathbf{x}; \hat{\theta}_K)$.


It is also possible to derive an estimate $\hat{K}$ of the number of groups from an estimate of the marginal probability $\hat{f}(\mathbf{x} \mid K)$. More details on mixture models and on the related estimation of $\theta_K$, $\mathbf{z}$ and $K$ are given in Chapter ??.
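As a minimal illustration of this estimation scheme, here is a sketch (not the authors' implementation) assuming scikit-learn's GaussianMixture, with BIC used as a stand-in for the model selection criteria discussed in Chapter ??:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data: n = 300 individuals in d = 2, drawn from 3 Gaussian groups.
x = np.vstack([rng.normal(m, 1.0, size=(100, 2)) for m in (-3, 0, 3)])

# Fit mixtures for several candidate K and keep the best one by BIC,
# a common proxy for maximizing the marginal probability f(x | K).
models = [GaussianMixture(n_components=K, random_state=0).fit(x)
          for K in range(1, 7)]
best = min(models, key=lambda m: m.bic(x))
K_hat = best.n_components

# Partition estimate z_hat: maximum a posteriori component membership,
# derived from the conditional probabilities f(z | x; theta_hat).
z_hat = best.predict(x)
print(K_hat, np.bincount(z_hat))
```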

[Figure: two scatter plots in the $(X_1, X_2)$ plane illustrating the mapping $\mathbf{x} = (\mathbf{x}_1, \ldots, \mathbf{x}_n) \longrightarrow \hat{\mathbf{z}} = (\hat{\mathbf{z}}_1, \ldots, \hat{\mathbf{z}}_n)$, $\hat{K} = 3$.]

Figure 2.2: The clustering purpose illustrated in the two-dimensional setting.

Beyond the nice mathematical background it provides, model-based clustering has also led to numerous and significant practical successes in the low-dimensional setting, as Chapter ?? relates, with references therein. Extending the general framework of model-based clustering to the high-dimensional setting is thus a natural and desirable purpose. In principle, the more information we have about each individual, the better a clustering method is expected to perform. However, the structure of interest may often be contained in a subset of the available variables, and many variables may be useless or even harmful for detecting a reasonable clustering structure. It is thus important to select the relevant variables from the cluster analysis viewpoint. This is a recent research topic, in contrast to variable selection in regression and classification models (Kohavi and John [1997]; Guyon and Elisseeff [2003]; Miller [1990]). This new interest in variable selection for clustering comes from the increasingly frequent use of these methods on high-dimensional data sets, such as transcriptome data sets.

Three types of approaches dealing with variable selection in clustering have been proposed. The first one includes clustering methods with weighted variables (see for instance Friedman and Meulman [2004]) and dimension reduction methods. For the latter, McLachlan et al. [2002] use a mixture of factor analyzers to reduce the extremely high dimensionality of a gene expression problem. A suitable Gaussian mixture family is considered in Bouveyron et al. [2007] to take into account the dimension reduction and the data clustering simultaneously. In contrast to this first method type, the last two approaches explicitly select relevant variables. The so-called filter approaches select the variables before a clustering analysis (see for instance Dash et al. [2002]; Jouve and Nicoloyannis [2005]). Their main weakness is that the variable selection step is carried out independently of the clustering. For distance-based methods, one can cite Fowlkes et al. [1988] for a forward selection approach with complete linkage hierarchical clustering, Devaney and Ram [1997], who propose a stepwise algorithm where the quality of the feature subsets is measured with the Cobweb algorithm, or the method of Brusco and Cradit [2001], based on the adjusted Rand index for K-means clustering. There also exist wrapper methods in the model-based clustering setting. When the number of variables is greater than the number of individuals, Tadesse et al. [2005] propose a fully Bayesian method using a reversible jump algorithm to simultaneously choose the number of mixture components and select variables. Kim et al. [2006] use a similar approach by formulating clustering in terms of Dirichlet process mixtures. In Gaussian mixture model clustering, Law et al. [2004] propose to evaluate the importance of the variables in the clustering process via feature saliencies and use the Minimum Message Length criterion. Raftery and Dean [2006] recast the problem of comparing two nested variable subsets as a model comparison problem and address it using a Bayes factor. An interesting aspect of their model formulation is that irrelevant variables are not required to be independent of the clustering variables. They thus avoid the unrealistic independence assumption between the relevant and irrelevant variables for the clustering considered in Tadesse et al. [2005], Kim et al. [2006] and Law et al. [2004]. In their model, the whole irrelevant variable subset depends on the whole relevant variable subset through a linear regression equation. However, some relevant variables are not necessarily required to explain all irrelevant variables in the linear regression, and their introduction involves additional parameters without a significant increase of the loglikelihood. The related extensions proposed by Maugis et al. [2009a,b] follow from this remark.

Many model proposals already exist, including associated parameter estimation and, sometimes, specific model selection strategies. We will divide these models into canonical and non-canonical ones, indicating whether parameter constraints are defined relative to the initial data space or relative to a transformation of it (typically a factorial mapping). Before presenting such models, and their related model selection process, we outline the pros (blessing) and the cons (curse) of having many variables when performing a cluster analysis.

2.2 HD clustering: Curse or blessing?

2.2.1 HD density estimation: Curse

In the previous section, we provided some examples of high-dimensional data sets. In the present section, the aim is to give a somewhat more theoretical definition of what a high-dimensional data set should be in a density estimation context, distinguishing the non-parametric and the parametric cases. It also relies on some asymptotic arguments. Recall that we consider a data set $\mathbf{x} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$, $\mathbf{x}_i$ being described by $d$ variables.

Non-parametric case

In the non-parametric situation, $\mathbf{x}_i$ is usually considered to lie in a high-dimensional space as soon as $n = o(e^d)$, thus as soon as the logarithm of the sample size, $\ln n$, is negligible compared with the space dimension $d$. A first justification of this claim is given by Bellman [1961]: to approximate within error $\epsilon > 0$ a (Lipschitz) function of $d$ variables, about $(1/\epsilon)^d$ evaluations (provided by the sample size $n$...) on a grid are required. A second justification is given by Silverman [1986]: approximating a Gaussian distribution with fixed Gaussian kernels and with an approximation error of about 10% requires a sample size $n(d)$ with $\log_{10} n(d) \approx 0.6\,(d - 0.25)$. For instance, with $d = 10$, $n(10) \approx 7 \cdot 10^5$, implying already a huge sample size for a quite moderate dimensional setting.
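A quick numerical check of Silverman's rule (a short Python sketch; the formula is the one just quoted):

```python
import math

# Sample size required to estimate a Gaussian density within ~10% error
# with fixed Gaussian kernels (Silverman, 1986): log10 n(d) ~= 0.6 (d - 0.25).
for d in (1, 2, 5, 10, 20):
    n = 10 ** (0.6 * (d - 0.25))
    print(f"d = {d:2d}  ->  n(d) ~ {n:,.0f}")
```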

Parametric case

In the parametric situation, let $S_m$ be a model described by $D_m$ continuous parameters, possibly depending on the dimension $d$. In such a case, the data set $\mathbf{x}$ is said to lie in a high-dimensional space as soon as $n$ is small in comparison to a particular function $g$ of $D_m$, namely $n = o(g(D_m))$. As an illustration for $g$, we consider the heteroscedastic Gaussian mixture with true parameter $\theta$ and $K$ components. We denote by $\hat{\theta}_K$ the Gaussian MLE with $K$ components. In that situation, $g$ is a linear function, from the following result (Maugis and Michel [2012]): there exist positive constants $\kappa$ and $A$ such that
$$\mathbb{E}_{\mathbf{x}}\left[d_H^2\left(f(\cdot;\theta), f(\cdot;\hat{\theta}_{\hat{K}})\right)\right] \leq \kappa\left(\inf_K\left\{\mathrm{KL}\left(f(\cdot;\theta), f(\cdot;\hat{\theta}_K)\right) + \mathrm{pen}(K)\right\} + \frac{1}{n}\right),$$
where $d_H$ denotes the Hellinger distance, $\mathrm{KL}$ the Kullback-Leibler divergence and
$$\mathrm{pen}(K) \geq \kappa\,\frac{D_K}{n}\left[2A\ln d + 1 - \ln\left(1 \wedge \frac{D_K}{n}A\ln d\right)\right].$$

Thus the HD non-parametric and parametric situations are drastically different in magnitude. However, in practice, $D_K$ can be high since $D_K \sim d^2/2$ in this Gaussian situation, combined with potentially large constants. To highlight this fact, consider the following two-component multivariate Gaussian mixture:
$$\pi_1 = \pi_2 = \tfrac{1}{2}, \qquad X_1 \mid Z_{11} = 1 \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \qquad X_1 \mid Z_{12} = 1 \sim \mathcal{N}(\mathbf{1}, \mathbf{I}), \qquad (2.1)$$
where $\mathbf{a} = (a \ldots a)'$ denotes the real vector of size $d$ with all coordinates equal to $a$. An illustration of this setting is displayed in Figure 2.3(a). Note that the two components are more and more separated when $d$ grows, since $\|\mathbf{1} - \mathbf{0}\|_{\mathbf{I}} = \sqrt{d}$. However, the quality of the mixture density estimate degrades (the Kullback-Leibler divergence increases) when the dimension increases, as illustrated in Figure 2.3(b) with a homoscedastic model and with equal mixing proportions.
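The flavor of this experiment can be reproduced with the following sketch. It assumes scikit-learn's GaussianMixture as the maximum likelihood routine (note one deviation: scikit-learn estimates the mixing proportions rather than fixing them at $1/2$) and approximates the Kullback-Leibler divergence by Monte Carlo:

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
n = 200  # fixed sample size; the dimension d varies

def true_logpdf(x, d):
    # Model (2.1): equal-proportion mixture of N(0, I) and N(1, I) in dimension d.
    p0 = multivariate_normal(np.zeros(d), np.eye(d)).pdf(x)
    p1 = multivariate_normal(np.ones(d), np.eye(d)).pdf(x)
    return np.log(0.5 * p0 + 0.5 * p1)

for d in (1, 2, 5, 10):
    # Draw a training sample from the true mixture.
    z = rng.integers(0, 2, size=n)
    x = rng.normal(size=(n, d)) + z[:, None]
    # MLE of a two-component homoscedastic ("tied") Gaussian mixture.
    gm = GaussianMixture(n_components=2, covariance_type="tied",
                         random_state=0).fit(x)
    # Monte Carlo estimate of KL(f || f_hat) on a large fresh sample.
    z_t = rng.integers(0, 2, size=20000)
    x_t = rng.normal(size=(20000, d)) + z_t[:, None]
    kl = np.mean(true_logpdf(x_t, d) - gm.score_samples(x_t))
    print(f"d = {d:2d}  KL ~ {kl:.3f}")
```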

Figure 2.3: HD curse in the parametric density estimation context: (a) a bivariate data set example with the isodensity of each component and (b) the Kullback-Leibler divergence of the density estimate when $d$ increases.

2.2.2 HD clustering: A mix of curse and blessing

Contrary to density estimation, where increasing the dimension has a clear negative effect, dimension may have both positive and negative effects on the clustering task. We now distinguish which factors favor such "blessing" or "curse" outcomes.

Blessing factors

We return to the model design (2.1). We display again a corresponding sample in Figure 2.4(a). We have already mentioned that the two components are more and more separated when $d$ increases. The reason is that each variable uniformly provides its own separation information, such that the associated theoretical error rate decreases when $d$ grows. Indeed, this error is equal to $\mathrm{err}_{\mathrm{theo}} = \Phi(-\sqrt{d}/2)$, where $\Phi$ is the cdf of $\mathcal{N}(0,1)$. We can see this decrease with $d$ in the dashed line of Figure 2.4(b). An interesting consequence is that the empirical error rate also decreases with $d$, as can be noticed in the continuous line of Figure 2.4(b). It means that increasing the dimension may have a positive effect on the clustering task as soon as all variables convey meaningful information on the hidden partition.
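A sketch of this blessing effect under the design (2.1), comparing $\mathrm{err}_{\mathrm{theo}} = \Phi(-\sqrt{d}/2)$ with an empirical error obtained from a fitted two-component Gaussian mixture (assumptions: scikit-learn as the fitting routine, labels matched up to switching):

```python
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
n = 500

for d in (1, 2, 5, 10):
    # Sample from model (2.1): components N(0, I) and N(1, I), pi = 1/2.
    z = rng.integers(0, 2, size=n)
    x = rng.normal(size=(n, d)) + z[:, None]
    gm = GaussianMixture(n_components=2, covariance_type="tied",
                         random_state=0).fit(x)
    z_hat = gm.predict(x)
    # Empirical error, minimized over the two possible label matchings.
    err_emp = min(np.mean(z_hat != z), np.mean(z_hat != 1 - z))
    err_theo = norm.cdf(-np.sqrt(d) / 2)
    print(f"d = {d:2d}  theo = {err_theo:.3f}  emp = {err_emp:.3f}")
```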

Figure 2.4: HD blessing in the clustering context when most variables convey independent partitioning information: (a) a bivariate data set example with the isodensity of each component and (b) the theoretical (dashed line) and empirical (continuous line) error rates when $d$ increases.

We now propose to illustrate this positive effect more drastically through a mixture of three Gaussians, all more and more separated when $d$ increases:

$$\pi_1 = \pi_2 = \pi_3 = \tfrac{1}{3}, \qquad X_1 \mid Z_{11} = 1 \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \quad X_1 \mid Z_{12} = 1 \sim \mathcal{N}(\mathbf{2}, \mathbf{I}), \quad X_1 \mid Z_{13} = 1 \sim \mathcal{N}(-\mathbf{2}, \mathbf{I}).$$

Figure 2.5(a)-(d) then displays a related sample of size $n = 1000$ for different dimensions, projected on the main two axes of the Factorial Discriminant Analysis (FDA) mapping. It clearly appears that components are more and more easily recognized when the dimension increases, although this is a simple visualization process. In the limit, no complex clustering algorithm would be needed to identify the clusters...
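A sketch of such a visualization, using scikit-learn's LinearDiscriminantAnalysis as a stand-in for the FDA mapping (the true labels are used only to build the projection):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
n = 1000

for d in (2, 20, 200, 400):
    # Three equiprobable components N(0, I), N(2, I), N(-2, I) in dimension d.
    z = rng.integers(0, 3, size=n)
    means = np.array([0.0, 2.0, -2.0])[z]
    x = rng.normal(size=(n, d)) + means[:, None]
    # Project on the two main discriminant axes (the FDA mapping).
    proj = LinearDiscriminantAnalysis(n_components=2).fit_transform(x, z)
    # The between-class separation grows with d on the projected axes.
    centers = np.array([proj[z == k].mean(axis=0) for k in range(3)])
    print(f"d = {d:3d}  center spread ~ {np.linalg.norm(centers.std(axis=0)):.2f}")
```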

Curse factors

In fact, increasing the dimension may have a positive effect on clustering retrieval only if the variables inject some partitioning information. In addition, such information has to be non-redundant. We now illustrate these two particular features. Firstly, we consider many variables which provide no separation information. We retrieve the same parameter setting as (2.1), except that the components are not more separated when $d$ grows, since $\|\mu_2 - \mu_1\|_{\mathbf{I}} = 1$, where $\mu_1 = \mathbf{0}$ is the center of the first Gaussian and $\mu_2 = (1\; 0 \ldots 0)'$ that of the second; thus ($k = 1, 2$)
$$X_1 \mid Z_{1k} = 1 \sim \mathcal{N}(\mu_k, \mathbf{I}). \qquad (2.2)$$
A sample is displayed in Figure 2.6(a). Figure 2.6(b) shows with a dashed line that the theoretical error rate is constant (it corresponds to $\mathrm{err}_{\mathrm{theo}} = \Phi(-1/2)$) when the dimension increases, as expected. Consequently, the empirical error rate increases with $d$, as shown by the continuous line in Figure 2.6(b).

Figure 2.5: Factorial Discriminant Analysis (FDA) on the main two factorial axes of three Gaussian components more and more separated when the space dimension increases: (a) $d = 2$, (b) $d = 20$, (c) $d = 200$ and (d) $d = 400$.

Secondly, we consider a case where many variables provide separation but redundant information, in the following sense: the parameter setting is the same as before for the first dimension, except that for all other dimensions
$$X_{1j} = X_{11} + \varepsilon_j, \quad \text{where } \varepsilon_j \overset{\text{iid}}{\sim} \mathcal{N}(0, 1) \quad (j = 2, \ldots, d). \qquad (2.3)$$
See a data example in Figure 2.7(a). Thus, components are not more separated when $d$ grows, since $\|\mu_2 - \mu_1\|_{\Sigma} = 1$, $\Sigma$ denoting the common covariance matrix of each Gaussian component and $\mu_k$ denoting the center of component $k = 1, 2$ (note that both $\mu_k$ and $\Sigma$ can easily be computed from Equations (2.2) and (2.3)). Consequently, $\mathrm{err}_{\mathrm{theo}} = \Phi(-1/2)$ is constant and the empirical error increases with $d$, as illustrated in Figure 2.7(b) with the previous conventions.
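Both curse designs can be simulated jointly; the sketch below (same scikit-learn assumptions as before) contrasts the no-separation design (2.2) with the redundancy design (2.3):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
n = 500

def empirical_error(x, z):
    # Fit a two-component Gaussian mixture and match labels to the truth.
    z_hat = GaussianMixture(n_components=2, random_state=0).fit(x).predict(x)
    return min(np.mean(z_hat != z), np.mean(z_hat != 1 - z))

for d in (1, 2, 5, 10):
    z = rng.integers(0, 2, size=n)
    # Model (2.2): only the first variable separates the two components.
    x_noise = rng.normal(size=(n, d))
    x_noise[:, 0] += z
    # Model (2.3): the other variables copy the first one plus N(0, 1) noise.
    x_redund = rng.normal(size=(n, d)) + x_noise[:, [0]]
    x_redund[:, 0] = x_noise[:, 0]
    print(f"d = {d:2d}  noise err = {empirical_error(x_noise, z):.3f}"
          f"  redundant err = {empirical_error(x_redund, z):.3f}")
```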

2.2.3 Intermediate conclusion

Figure 2.6: HD curse in the clustering context when variables convey no partitioning information: (a) a bivariate data set example with the isodensity of each component and (b) the theoretical (dashed line) and empirical (continuous line) error rates when $d$ increases.

Figure 2.7: HD curse in the clustering context when variables convey redundant partitioning information: (a) a bivariate data set example with the isodensity of each component and (b) the theoretical (dashed line) and empirical (continuous line) error rates when $d$ increases.

When variables have important blessing consequences for the clustering task, the whole variable space should be kept. In particular, filter methods performing variable selection before the clustering task have to be excluded, the risk of removing discriminant features being too large. The remaining question is then: which wrapper methods should be used? Such methods should manage with priority the fact that some variables have negative effects for clustering. The general answer is to design specific parsimonious models for clustering, the most emblematic ones relying on some variable selection principle. We will also see several alternative strategies, in particular variable clustering (not to be confused with individual clustering, our primary task), aiming at assigning different roles (clusters) to the variables.
