2 High-dimensional clustering
Christophe Biernacki and Cathy Maugis-Rabusseau

2.1 Introduction
2.2 HD clustering: Curse or blessing?
    2.2.1 HD density estimation: Curse
    2.2.2 HD clustering: A mix of curse and blessing
    2.2.3 Intermediate conclusion
2.3 Non-canonical models
    2.3.1 Gaussian mixture of factor analysers
    2.3.2 HD Gaussian mixture models
    2.3.3 Functional data
    2.3.4 Intermediate conclusion
2.4 Canonical models
    2.4.1 Parsimonious mixture models
    2.4.2 Variable selection through regularization
    2.4.3 Variable role modelling
    2.4.4 Co-clustering
    2.4.5 Intermediate conclusion
2.5 Future methodological challenges
Bibliography
High-dimensional clustering
Christophe Biernacki and Cathy Maugis-Rabusseau

2.1 Introduction
High-dimensional (HD) datasets are now frequent, mostly for technological reasons: automation in variable acquisition, cheaper data storage and more powerful standard computers allowing quick data management. All fields are impacted by this general inflation of the number of variables, only the definition of "high" being domain dependent. In marketing, this number can be of order $10^2$, in microarray gene expression between $10^2$ and $10^4$, in text mining $10^3$ or more, of order $10^6$ for single nucleotide polymorphism (SNP) data, etc. Note also that many more variables can sometimes be involved, as is typically the case with discretized curves, for instance curves coming from temporal sequences.
Here are two related illustrations. Figure 2.1(a) displays a text mining example. It mixes Medline (1033 medical abstracts) and Cranfield (1398 aeronautical abstracts), making a total of 2431 documents. Furthermore, all the words (excluding stop words) are considered as features, making a total of 9275 unique words. The data matrix consists of documents on the rows and words on the columns, each entry giving the term frequency, that is the number of occurrences of the corresponding word in the corresponding document. Figure 2.1(b) displays a curve example. This Kneading dataset comes from the Danone Vitapole Paris Research Center and concerns the quality of cookies and its relationship with the flour kneading process (Lévéder et al. [2004]). It is composed of 115 different flours for which the dough resistance is measured during the kneading process for 480 seconds. We notice that the number of equispaced instants of time in the interval $[0, 480]$ (here 241 measures) could be much larger than 241 if measures were recorded more frequently.
Figure 2.1: Examples of high-dimensional datasets: (a) Text mining: $n = 2431$ documents and the frequencies with which $d = 9275$ unique words occur in each document (a whiter cell indicates a higher frequency); (b) Curves: $n = 115$ kneading curves observed at $d = 241$ equispaced instants of time in the interval $[0, 480]$.
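As an aside, here is a minimal sketch of how such a document-term frequency matrix can be built, assuming scikit-learn is available (the toy corpus below is hypothetical, standing in for the 2431 abstracts):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus standing in for the Medline/Cranfield abstracts
docs = [
    "aerodynamic flow over a thin supersonic wing",
    "clinical evaluation of a new antibiotic treatment",
    "boundary layer separation on swept wings",
]

# Term-frequency matrix: rows = documents, columns = unique words,
# entries = number of occurrences (stop words excluded, as in the text)
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)   # sparse (n x d) count matrix

print(X.shape)                       # (3, number of unique words)
print(vectorizer.get_feature_names_out())
```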
Such a technological revolution has a huge impact on other scientific fields, whether societal or also mathematical ones. In particular, high-dimensional data management brings some new challenges to statisticians, since standard (low-dimensional) data analysis methods struggle to apply directly to the new (high-dimensional) datasets. The reason can be twofold, with both causes sometimes linked: either combinatorial difficulties or a disastrously large increase of the estimate variance. Data analysis methods are essential for providing a synthetic view of datasets, allowing data summary and data exploration for future decision making, for instance. This need is even more acute in the high-dimensional setting since, on the one hand, the large number of variables suggests that a lot of information is conveyed by the data but, on the other hand, such information may be hidden behind their volume.
Cluster analysis is one of the main data analysis methods. It aims at partitioning a dataset $x = (x_1, \ldots, x_n)$, composed of $n$ individuals lying in a space $\mathcal{X}$ of dimension $d$, into $K$ groups $G_1, \ldots, G_K$. This partition is denoted by $z = (z_1, \ldots, z_n)$, lying in a space $\mathcal{Z}$, where $z_i = (z_{i1}, \ldots, z_{iK})'$ is a vector of $\{0,1\}^K$ such that $z_{ik} = 1$ if individual $x_i$ belongs to the $k$th group $G_k$, and $z_{ik} = 0$ otherwise ($i = 1, \ldots, n$, $k = 1, \ldots, K$). Figure 2.2 gives an illustration of this principle when $d = 2$. Model-based clustering allows one to reformulate cluster analysis as a well-posed estimation problem, both for the partition $z$ and for the number $K$ of groups. It considers the data $x_1, \ldots, x_n$ as $n$ i.i.d. realizations of a mixture pdf $f(\cdot;\theta_K) = \sum_{k=1}^K \pi_k f(\cdot;\alpha_k)$, where $f(\cdot;\alpha_k)$ denotes the pdf, parameterized by $\alpha_k$, associated with group $k$, where $\pi_k$ denotes the mixture proportion of this component ($\sum_{k=1}^K \pi_k = 1$, $\pi_k \geq 0$) and where $\theta_K = (\pi_k, \alpha_k,\ k = 1, \ldots, K)$ denotes the whole mixture parameter. From the whole dataset $x$ it is then possible to obtain a mixture parameter estimate $\hat\theta_K$ and to deduce a partition estimate $\hat z$ from the conditional probability $f(z|x;\hat\theta_K)$. It is also possible to derive an estimate $\hat K$ from an estimate of the marginal probability $f(x|K)$. More details on mixture models and the related estimation of $\theta_K$, $z$ and $K$ are given throughout Chapter ??.
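To make these three estimation steps concrete, here is a minimal sketch, not the authors' code, assuming scikit-learn and using BIC as a standard stand-in for the marginal probability $f(x|K)$:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy dataset: two well-separated Gaussian groups in dimension d = 2
x = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])

# Estimate theta_K by maximum likelihood (EM) for several candidate K,
# then pick K_hat by minimizing BIC, a proxy for the marginal probability
fits = {K: GaussianMixture(n_components=K, random_state=0).fit(x)
        for K in range(1, 6)}
K_hat = min(fits, key=lambda K: fits[K].bic(x))

# Partition estimate z_hat: maximum a posteriori component for each x_i,
# i.e. the argmax over k of the conditional probabilities f(z_ik = 1 | x_i)
z_hat = fits[K_hat].predict(x)
print(K_hat, np.bincount(z_hat))
```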
$$x = (x_1, \ldots, x_n) \;\longrightarrow\; \hat z = (\hat z_1, \ldots, \hat z_n), \quad \hat K = 3$$

Figure 2.2: The clustering purpose illustrated in the two-dimensional setting (left: the raw sample on axes $X_1$ and $X_2$; right: the same sample partitioned into $\hat K = 3$ groups).
Beyond the nice mathematical background it provides, model-based clustering has also led to numerous and significant practical successes in the low-dimensional setting, as Chapter ?? relates, with references therein. Extending the general framework of model-based clustering to the high-dimensional setting is thus a natural and desirable purpose. In principle, the more information we have about each individual, the better a clustering method is expected to perform. However, the structure of interest may often be contained in a subset of the available variables, and many variables may be useless or even harmful for detecting a reasonable clustering structure. It is thus important to select the relevant variables from the cluster analysis viewpoint. This is a recent research topic, in contrast to variable selection in regression and classification models (Kohavi and John [1997]; Guyon and Elisseeff [2003]; Miller [1990]). This new interest in variable selection for clustering comes from the increasingly frequent use of these methods on high-dimensional datasets, such as transcriptome datasets.
Three types of approaches dealing with variable selection in clustering have been proposed. The first one includes clustering methods with weighted variables (see for instance Friedman and Meulman [2004]) and dimension reduction methods. For the latter, McLachlan et al. [2002] use a mixture of factor analyzers to reduce the extremely high dimensionality of a gene expression problem. A suitable Gaussian mixture family is considered in Bouveyron et al. [2007] to take into account the dimension reduction and the data clustering simultaneously. In contrast to this first method type, the last two approaches explicitly select relevant variables. The so-called filter approaches select the variables before a clustering analysis (see for instance Dash et al. [2002]; Jouve and Nicoloyannis [2005]). Their main weakness is that the variable selection step is performed independently of the clustering step. For distance-based methods, one can cite Fowlkes et al. [1988] for a forward selection approach with complete linkage hierarchical clustering, Devaney and Ram [1997] who propose a stepwise algorithm where the quality of the feature subsets is measured with the Cobweb algorithm, or the method of Brusco and Cradit [2001] based on the adjusted Rand index for K-means clustering. Wrapper methods also exist in the model-based clustering setting. When the number of variables is greater than the number of individuals, Tadesse et al. [2005] propose a fully Bayesian method using a reversible jump algorithm to simultaneously choose the number of mixture components and select variables. Kim et al. [2006] use a similar approach by formulating clustering in terms of Dirichlet process mixtures. In Gaussian mixture model clustering, Law et al. [2004] propose to evaluate the importance of the variables in the clustering process via feature saliencies and use the Minimum Message Length criterion. Raftery and Dean [2006] recast the problem of comparing two nested variable subsets as a model comparison problem and address it using Bayes factors. An interesting aspect of their model formulation is that irrelevant variables are not required to be independent of the clustering variables. They thus avoid the unrealistic independence assumption between the relevant and irrelevant variables for the clustering, considered in Tadesse et al. [2005], Kim et al. [2006] and Law et al. [2004]. In their model, the whole irrelevant variable subset depends on the whole relevant variable subset through a linear regression equation. However, not all relevant variables are necessarily required to explain all irrelevant variables in the linear regression, and their introduction involves additional parameters without a significant increase of the loglikelihood. The related extensions proposed by Maugis et al. [2009a,b] follow this remark.
Many model proposals already exist, including associated parameter estimation and, sometimes, specific model selection strategies. We will divide these models into canonical and non-canonical ones, indicating whether parameter constraints are defined relative to the initial data space or relative to a transformation of it (typically a factorial mapping). Before presenting such models and their related model selection processes, we outline the pros (blessing) and the cons (curse) of having many variables when performing a cluster analysis.
2.2 HD clustering: Curse or blessing?

2.2.1 HD density estimation: Curse
In the previous section, we provided some examples of high-dimensional datasets. In the present section, the aim is to give a somewhat more theoretical definition of what a high-dimensional dataset should be in a density estimation context, for both the non-parametric and parametric cases. It relies on some asymptotic arguments. Recall that we consider a dataset $x = (x_1, \ldots, x_n)$, each $x_i$ being described by $d$ variables.
Non-parametric case

In the non-parametric situation, $x_i$ is usually considered to lie in a high-dimensional space as soon as $n = o(e^d)$, thus as soon as the logarithm of the sample size, $\ln n$, is negligible compared with the space dimension $d$. A first justification of this claim is given by Bellman [1961]: to approximate within error $\epsilon > 0$ a (Lipschitz) function of $d$ variables, about $(1/\epsilon)^d$ evaluations (provided by the sample size $n$...) on a grid are required. A second justification is given by Silverman [1986]: approximating a Gaussian distribution with fixed Gaussian kernels and with an approximation error of about 10% requires a sample size $n(d)$ with $\log_{10} n(d) \approx 0.6(d - 0.25)$. For instance, with $d = 10$, $n(10) \approx 7 \cdot 10^5$, already implying a huge sample size for a quite moderate dimensional setting.
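As a quick numeric check of Silverman's rule, a minimal sketch that only restates the formula above:

```python
def silverman_sample_size(d: int) -> float:
    """Sample size needed to approximate a Gaussian density within ~10% error
    with fixed Gaussian kernels: log10 n(d) ~ 0.6 (d - 0.25)."""
    return 10 ** (0.6 * (d - 0.25))

for d in (1, 5, 10, 20):
    print(d, f"{silverman_sample_size(d):.2g}")
# d = 10 gives ~7.1e+05, the figure quoted above; d = 20 would already
# require about 7e+11 observations.
```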
Parametric case

In the parametric situation, let $S_m$ be a model described by $D_m$ continuous parameters, likely depending on the dimension $d$. In such a case, the dataset $x$ is said to lie in a high-dimensional space as soon as $n$ is small in comparison with a particular function $g$ of $D_m$, namely $n = o(g(D_m))$. As an illustration for $g$, we consider the heteroscedastic Gaussian mixture with true parameter $\theta^*$ and $K$ components. We denote by $\hat\theta_K$ the Gaussian MLE with $K$ components. In that situation, $g$ is a linear function, from the following result (Maugis and Michel [2012]): there exist positive constants $\kappa$ and $A$ such that

$$\mathbb{E}_x\big[d_H^2\big(f(\cdot;\theta^*), f(\cdot;\hat\theta_{\hat K})\big)\big] \leq \kappa \left( \inf_K \big\{ \mathrm{KL}\big(f(\cdot;\theta^*), f(\cdot;\hat\theta_K)\big) + \mathrm{pen}(K) \big\} + \frac{1}{n} \right)$$

where $d_H$ denotes the Hellinger distance, $\mathrm{KL}$ the Kullback-Leibler divergence and

$$\mathrm{pen}(K) \geq \kappa \frac{D_K}{n} \left[ 2A \ln d + 1 - \ln\!\left( 1 \wedge \frac{D_K}{n} A \ln d \right) \right].$$
Thus the HD non-parametric and parametric situations are drastically different in magnitude. However, in practice, $D_K$ can be high since $D_K \sim d^2/2$ in this Gaussian situation, combined with potentially large constants. To highlight this fact, consider the following two-component multivariate Gaussian mixture:

$$\pi_1 = \pi_2 = \tfrac{1}{2}, \quad X_1|Z_{11}=1 \sim \mathcal{N}(\mathbf{0}, I), \quad X_1|Z_{12}=1 \sim \mathcal{N}(\mathbf{1}, I), \tag{2.1}$$

where $\mathbf{a} = (a, \ldots, a)'$ denotes the constant real vector of size $d$ with value $a$. An illustration of this setting is displayed in Figure 2.3(a). Note that the two components are more and more separated when $d$ grows since $\|\mathbf{1} - \mathbf{0}\|_I = \sqrt{d}$. However, the quality of the mixture density estimate degrades (the Kullback-Leibler divergence increases) when the dimension increases, as illustrated in Figure 2.3(b) with a homoscedastic model and with equal mixing proportions.
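A minimal simulation sketch of this degradation, assuming scikit-learn; the Kullback-Leibler divergence is approximated by Monte Carlo on a fresh sample rather than computed exactly:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n = 500

for d in (1, 2, 5, 10):
    # Sample from the true mixture (2.1): N(0, I) and N(1, I), equal weights
    z = rng.integers(0, 2, n)
    x = rng.normal(0, 1, (n, d)) + z[:, None]   # shift all coordinates when z = 1

    # Homoscedastic fit: common ("tied") covariance across the two components
    # (mixing proportions are estimated here rather than fixed to 1/2)
    gmm = GaussianMixture(n_components=2, covariance_type="tied",
                          random_state=0).fit(x)

    # Monte Carlo estimate of KL(f* || f_hat) using a fresh sample from f*
    z2 = rng.integers(0, 2, 5 * n)
    y = rng.normal(0, 1, (5 * n, d)) + z2[:, None]
    log_f_star = (np.logaddexp(-0.5 * (y ** 2).sum(axis=1),
                               -0.5 * ((y - 1) ** 2).sum(axis=1))
                  - np.log(2) - 0.5 * d * np.log(2 * np.pi))
    kl = (log_f_star - gmm.score_samples(y)).mean()
    print(d, round(kl, 3))   # KL grows with d for fixed n
```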
Figure 2.3: HD curse in the parametric density estimation context: (a) a bivariate dataset example with isodensities of each component and (b) the Kullback-Leibler divergence of the density estimate when $d$ increases.
2.2.2 HD clustering: A mix of curse and blessing

Contrary to density estimation, where increasing dimension has a clear negative effect, dimension may have both positive and negative effects on the clustering task. We now distinguish which factors favor such "blessing" or "curse" outcomes.
Blessing factors

We return to the model design (2.1) and display a corresponding sample again in Figure 2.4(a). We have already mentioned that the two components are more and more separated when $d$ increases. The reason is that each variable uniformly provides its own separation information, so that the associated theoretical error decreases when $d$ grows. Indeed, this error is equal to $\mathrm{err}_{\mathrm{theo}} = \Phi(-\sqrt{d}/2)$, where $\Phi$ is the cdf of $\mathcal{N}(0,1)$. We can see this decrease with $d$ as the dashed line in Figure 2.4(b). An interesting consequence is that the empirical error rate also decreases with $d$, as can be noticed from the continuous line in Figure 2.4(b). It means that increasing dimension may have a positive effect on the clustering task as soon as all variables convey meaningful information on the hidden partition.
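A minimal sketch reproducing this blessing effect under model (2.1), assuming scikit-learn; the empirical error is taken as the misclassification rate of a fitted two-component Gaussian mixture, minimized over label switching:

```python
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
n = 400

for d in (1, 2, 5, 10):
    z = rng.integers(0, 2, n)
    x = rng.normal(0, 1, (n, d)) + z[:, None]    # sample from model (2.1)

    z_hat = GaussianMixture(2, random_state=0).fit_predict(x)
    # Empirical error, minimized over the two possible label matchings
    err_emp = min((z_hat != z).mean(), (z_hat == z).mean())
    err_theo = norm.cdf(-np.sqrt(d) / 2)         # err_theo = Phi(-sqrt(d)/2)
    print(d, round(err_emp, 3), round(err_theo, 3))  # both decrease with d
```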
Figure 2.4: HD blessing in the clustering context when most variables convey independent partitioning information: (a) a bivariate dataset example with isodensities of each component and (b) the theoretical (dashed line) and the empirical (continuous line) error rate when $d$ increases.

We now propose to illustrate this positive effect more drastically through a mixture of three Gaussians, all more and more separated when $d$ increases:

$$\pi_1 = \pi_2 = \pi_3 = \tfrac{1}{3}, \quad X_1|Z_{11}=1 \sim \mathcal{N}(\mathbf{0}, I), \quad X_1|Z_{12}=1 \sim \mathcal{N}(\mathbf{2}, I), \quad X_1|Z_{13}=1 \sim \mathcal{N}(-\mathbf{2}, I).$$
Figure 2.5(a)-(d) then displays a related sample of size $n = 1000$, for different dimensions, on the main two axes of the Factorial Discriminant Analysis (FDA) mapping. It clearly appears that components are more and more easily recognized when dimension increases, although FDA is a simple visualization process. In the limit, no complex clustering algorithm would be needed to identify the clusters...
Curse factors

In fact, increasing dimension may have a positive effect on clustering retrieval only if the variables inject some partitioning information. In addition, such information has to be non-redundant. We now illustrate these two particular features. Firstly, we consider many variables which provide no separation information. We retrieve the same parameter setting as (2.1), except that the components are no more separated when $d$ grows, since $\|\mu_2 - \mu_1\|_I = 1$, where $\mu_1 = \mathbf{0}$ is the center of the first Gaussian and $\mu_2 = (1, 0, \ldots, 0)'$ is that of the second; thus ($k = 1, 2$)

$$X_1|Z_{1k} = 1 \sim \mathcal{N}(\mu_k, I). \tag{2.2}$$

A sample is displayed in Figure 2.6(a). Figure 2.6(b) shows as a dashed line that the theoretical error rate is constant (it corresponds to $\mathrm{err}_{\mathrm{theo}} = \Phi(-\tfrac{1}{2})$) when the dimension increases, as expected. Consequently, the empirical error rate increases with $d$, as shown by the continuous line in Figure 2.6(b).
Figure 2.5: Factorial Discriminant Analysis (FDA) on the main two factorial axes of three Gaussian components more and more separated when the space dimension increases: (a) $d = 2$, (b) $d = 20$, (c) $d = 200$ and (d) $d = 400$.
Secondly, we consider a case where many variables provide separation, but redundant, information, in the following sense: it is the same parameter setting as before for the first dimension, except for all the other ones:

$$X_{1j} = X_{11} + \varepsilon_j, \quad \text{where } \varepsilon_j \overset{\mathrm{iid}}{\sim} \mathcal{N}(0, 1) \ (j = 2, \ldots, d). \tag{2.3}$$

See a data example in Figure 2.7(a). Thus, components are not more separated when $d$ grows since $\|\mu_2 - \mu_1\|_\Sigma = 1$, $\Sigma$ denoting the common covariance matrix of each Gaussian component, and $\mu_k$ denoting the center of component $k = 1, 2$ (note that both $\mu_k$ and $\Sigma$ can easily be computed from Equations (2.2) and (2.3)). Consequently, $\mathrm{err}_{\mathrm{theo}} = \Phi(-\tfrac{1}{2})$ is constant and the empirical error increases with $d$, as illustrated in Figure 2.7(b) with the previous conventions.
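A minimal sketch of this redundancy curse under (2.2)-(2.3), with the same assumptions and conventions as the previous sketch:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
n = 400

for d in (2, 5, 10, 20):
    z = rng.integers(0, 2, n)
    x1 = rng.normal(0, 1, n) + z                     # first variable: model (2.2)
    noise = rng.normal(0, 1, (n, d - 1))
    x = np.column_stack([x1, x1[:, None] + noise])   # X_1j = X_11 + eps_j, (2.3)

    z_hat = GaussianMixture(2, random_state=0).fit_predict(x)
    err_emp = min((z_hat != z).mean(), (z_hat == z).mean())
    print(d, round(err_emp, 3))  # theoretical error stays Phi(-1/2) ~ 0.309
```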
Figure 2.6: HD curse in the clustering context when variables convey no partitioning information: (a) a bivariate dataset example with isodensities of each component and (b) the theoretical (dashed line) and the empirical (continuous line) error rate when $d$ increases.

Figure 2.7: HD curse in the clustering context when variables convey redundant partitioning information: (a) a bivariate dataset example with isodensities of each component and (b) the theoretical (dashed line) and the empirical (continuous line) error rate when $d$ increases.

2.2.3 Intermediate conclusion

In cases where variables have important blessing consequences for the clustering, it could be harmful to reduce a priori the variable space.
In particular, filter methods performing variable selection before the clustering task have to be excluded, the risk of removing discriminant features being too large. The remaining question is then which wrapper methods should be used. Such methods should manage with priority the fact that some variables have negative effects on the clustering. The general answer is to design specific parsimonious models for clustering, the most emblematic ones relying on some variable selection principle. We will also see several alternative strategies, in particular variable clustering (not to be confused with individual clustering, our primary task), aiming at assigning different roles (clusters) to the variables.