2 High-dimensional clustering
Christophe Biernacki and Cathy Maugis-Rabusseau

2.1 Introduction
2.2 HD clustering: Curse or blessing?
    2.2.1 HD density estimation: Curse
    2.2.2 HD clustering: A mix of curse and blessing
    2.2.3 Intermediate conclusion
2.3 Non-canonical models
    2.3.1 Gaussian mixture of factor analysers
    2.3.2 HD Gaussian mixture models
    2.3.3 Functional data
    2.3.4 Intermediate conclusion
2.4 Canonical models
    2.4.1 Parsimonious mixture models
    2.4.2 Variable selection through regularization
    2.4.3 Variable role modelling
    2.4.4 Co-clustering
    2.4.5 Intermediate conclusion
2.5 Future methodological challenges
Bibliography
High-dimensional clustering
Christophe Biernacki and Cathy Maugis-Rabusseau

2.1 Introduction
High-dimensional (HD) datasets are now frequent, mostly for technological reasons: automation in variable acquisition, cheaper data storage and more powerful standard computers allowing quick data management. All fields are impacted by this general inflation of the number of variables, only the definition of "high" being domain dependent. In marketing, this number can be of order $10^2$, in microarray gene expression between $10^2$ and $10^4$, in text mining $10^3$ or more, of order $10^6$ for single nucleotide polymorphism (SNP) data, etc. Note also that many more variables can sometimes be involved, as is typically the case with discretized curves, for instance curves coming from temporal sequences.
Here are two related illustrations. Figure 2.1(a) displays a text mining example. It mixes Medline (1033 medical abstracts) and Cranfield (1398 aeronautical abstracts), making a total of 2431 documents. Furthermore, all the words (excluding stop words) are considered as features, making a total of 9275 unique words. The data matrix consists of documents on the rows and words on the columns, each entry giving the term frequency, that is the number of occurrences of the corresponding word in the corresponding document. Figure 2.1(b) displays a curve example. This Kneading dataset comes from the Danone Vitapole Paris Research Center and concerns the quality of cookies and its relationship with the flour kneading process (Lévéder et al. [2004]). It is composed of 115 different flours for which the dough resistance is measured during the kneading process for 480 seconds. We notice that the number of equispaced instants of time in the interval $[0, 480]$ (here 241 measures) could be much larger than 241 if measures were recorded more frequently.
Figure 2.1: Examples of high-dimensional datasets: (a) Text mining: $n = 2431$ documents and the frequencies with which $d = 9275$ unique words occur in each document (a whiter cell indicates a higher frequency); (b) Curves: $n = 115$ kneading curves observed at $d = 241$ equispaced instants of time in the interval $[0, 480]$.
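As an aside, here is a minimal sketch of how such a document-term frequency matrix can be built, assuming scikit-learn is available (the toy corpus below is hypothetical, standing in for the 2431 abstracts):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus standing in for the Medline/Cranfield abstracts
docs = [
    "aerodynamic flow over a thin supersonic wing",
    "clinical evaluation of a new antibiotic treatment",
    "boundary layer separation on swept wings",
]

# Term-frequency matrix: rows = documents, columns = unique words,
# entries = number of occurrences (stop words excluded, as in the text)
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)   # sparse (n x d) count matrix

print(X.shape)                       # (3, number of unique words)
print(vectorizer.get_feature_names_out())
```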
Such a technological revolution has a huge impact on other scientific fields, whether societal or also mathematical ones. In particular, high-dimensional data management brings some new challenges to statisticians, since standard (low-dimensional) data analysis methods struggle to apply directly to the new (high-dimensional) datasets. The reason can be twofold, with both causes sometimes linked: either combinatorial difficulties or a disastrously large increase of the estimate variance. Data analysis methods are essential for providing a synthetic view of datasets, allowing data summary and data exploration for future decision making, for instance. This need is even more acute in the high-dimensional setting since, on the one hand, the large number of variables suggests that a lot of information is conveyed by the data but, on the other hand, such information may be hidden behind their volume.
Cluster analysis is one of the main data analysis methods. It aims at partitioning a dataset $x = (x_1, \ldots, x_n)$, composed of $n$ individuals lying in a space $\mathcal{X}$ of dimension $d$, into $K$ groups $G_1, \ldots, G_K$. This partition is denoted by $z = (z_1, \ldots, z_n)$, lying in a space $\mathcal{Z}$, where $z_i = (z_{i1}, \ldots, z_{iK})'$ is a vector of $\{0,1\}^K$ such that $z_{ik} = 1$ if individual $x_i$ belongs to the $k$th group $G_k$, and $z_{ik} = 0$ otherwise ($i = 1, \ldots, n$, $k = 1, \ldots, K$). Figure 2.2 gives an illustration of this principle when $d = 2$. Model-based clustering allows one to reformulate cluster analysis as a well-posed estimation problem, both for the partition $z$ and for the number $K$ of groups. It considers the data $x_1, \ldots, x_n$ as $n$ i.i.d. realizations of a mixture pdf $f(\cdot;\theta_K) = \sum_{k=1}^K \pi_k f(\cdot;\alpha_k)$, where $f(\cdot;\alpha_k)$ denotes the pdf, parameterized by $\alpha_k$, associated with group $k$, where $\pi_k$ denotes the mixture proportion of this component ($\sum_{k=1}^K \pi_k = 1$, $\pi_k \geq 0$) and where $\theta_K = (\pi_k, \alpha_k,\ k = 1, \ldots, K)$ denotes the whole mixture parameter. From the whole dataset $x$ it is then possible to obtain a mixture parameter estimate $\hat\theta_K$ and to deduce a partition estimate $\hat z$ from the conditional probability $f(z|x;\hat\theta_K)$. It is also possible to derive an estimate $\hat K$ from an estimate of the marginal probability $f(x|K)$. More details on mixture models and the related estimation of $\theta_K$, $z$ and $K$ are given throughout Chapter ??.
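To make these three estimation steps concrete, here is a minimal sketch, not the authors' code, assuming scikit-learn and using BIC as a standard stand-in for the marginal probability $f(x|K)$:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy dataset: two well-separated Gaussian groups in dimension d = 2
x = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])

# Estimate theta_K by maximum likelihood (EM) for several candidate K,
# then pick K_hat by minimizing BIC, a proxy for the marginal probability
fits = {K: GaussianMixture(n_components=K, random_state=0).fit(x)
        for K in range(1, 6)}
K_hat = min(fits, key=lambda K: fits[K].bic(x))

# Partition estimate z_hat: maximum a posteriori component for each x_i,
# i.e. the argmax over k of the conditional probabilities f(z_ik = 1 | x_i)
z_hat = fits[K_hat].predict(x)
print(K_hat, np.bincount(z_hat))
```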
$$x = (x_1, \ldots, x_n) \;\longrightarrow\; \hat z = (\hat z_1, \ldots, \hat z_n), \quad \hat K = 3$$

Figure 2.2: The clustering purpose illustrated in the two-dimensional setting (left: the raw sample on axes $X_1$ and $X_2$; right: the same sample partitioned into $\hat K = 3$ groups).
Beyond the nice mathematical background it provides, model-based clustering has also led to numerous and significant practical successes in the low-dimensional setting, as Chapter ?? relates, with references therein. Extending the general framework of model-based clustering to the high-dimensional setting is thus a natural and desirable purpose. In principle, the more information we have about each individual, the better a clustering method is expected to perform. However, the structure of interest may often be contained in a subset of the available variables, and many variables may be useless or even harmful for detecting a reasonable clustering structure. It is thus important to select the relevant variables from the cluster analysis viewpoint. This is a recent research topic, in contrast to variable selection in regression and classification models (Kohavi and John [1997]; Guyon and Elisseeff [2003]; Miller [1990]). This new interest in variable selection for clustering comes from the increasingly frequent use of these methods on high-dimensional datasets, such as transcriptome datasets.
Three types of approaches dealing with variable selection in clustering have been proposed. The first one includes clustering methods with weighted variables (see for instance Friedman and Meulman [2004]) and dimension reduction methods. For the latter, McLachlan et al. [2002] use a mixture of factor analyzers to reduce the extremely high dimensionality of a gene expression problem. A suitable Gaussian mixture family is considered in Bouveyron et al. [2007] to take into account the dimension reduction and the data clustering simultaneously. In contrast to this first method type, the last two approaches explicitly select relevant variables. The so-called filter approaches select the variables before a clustering analysis (see for instance Dash et al. [2002]; Jouve and Nicoloyannis [2005]). Their main weakness is that the variable selection step is performed independently of the clustering step. For distance-based methods, one can cite Fowlkes et al. [1988] for a forward selection approach with complete linkage hierarchical clustering, Devaney and Ram [1997] who propose a stepwise algorithm where the quality of the feature subsets is measured with the Cobweb algorithm, or the method of Brusco and Cradit [2001] based on the adjusted Rand index for K-means clustering. Wrapper methods also exist in the model-based clustering setting. When the number of variables is greater than the number of individuals, Tadesse et al. [2005] propose a fully Bayesian method using a reversible jump algorithm to simultaneously choose the number of mixture components and select variables. Kim et al. [2006] use a similar approach by formulating clustering in terms of Dirichlet process mixtures. In Gaussian mixture model clustering, Law et al. [2004] propose to evaluate the importance of the variables in the clustering process via feature saliencies and use the Minimum Message Length criterion. Raftery and Dean [2006] recast the problem of comparing two nested variable subsets as a model comparison problem and address it using Bayes factors. An interesting aspect of their model formulation is that irrelevant variables are not required to be independent of the clustering variables. They thus avoid the unrealistic independence assumption between the relevant and irrelevant variables for the clustering, considered in Tadesse et al. [2005], Kim et al. [2006] and Law et al. [2004]. In their model, the whole irrelevant variable subset depends on the whole relevant variable subset through a linear regression equation. However, not all relevant variables are necessarily required to explain all irrelevant variables in the linear regression, and their introduction involves additional parameters without a significant increase of the loglikelihood. The related extensions proposed by Maugis et al. [2009a,b] follow this remark.
Many model proposals already exist, including associated parameter estimation and, sometimes, specific model selection strategies. We will divide these models into canonical and non-canonical ones, indicating whether parameter constraints are defined relative to the initial data space or relative to a transformation of it (typically a factorial mapping). Before presenting such models and their related model selection processes, we outline the pros (blessing) and the cons (curse) of having many variables when performing a cluster analysis.
2.2 HD clustering: Curse or blessing?

2.2.1 HD density estimation: Curse
In the previous section, we provided some examples of high-dimensional datasets. In the present section, the aim is to give a somewhat more theoretical definition of what a high-dimensional dataset should be in a density estimation context, for both the non-parametric and parametric cases. It relies on some asymptotic arguments. Recall that we consider a dataset $x = (x_1, \ldots, x_n)$, each $x_i$ being described by $d$ variables.
Non-parametric case

In the non-parametric situation, $x_i$ is usually considered to lie in a high-dimensional space as soon as $n = o(e^d)$, thus as soon as the logarithm of the sample size, $\ln n$, is negligible compared with the space dimension $d$. A first justification of this claim is given by Bellman [1961]: to approximate within error $\epsilon > 0$ a (Lipschitz) function of $d$ variables, about $(1/\epsilon)^d$ evaluations (provided by the sample size $n$...) on a grid are required. A second justification is given by Silverman [1986]: approximating a Gaussian distribution with fixed Gaussian kernels and with an approximation error of about 10% requires a sample size $n(d)$ with $\log_{10} n(d) \approx 0.6(d - 0.25)$. For instance, with $d = 10$, $n(10) \approx 7 \cdot 10^5$, already implying a huge sample size for a quite moderate dimensional setting.
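As a quick numeric check of Silverman's rule, a minimal sketch that only restates the formula above:

```python
def silverman_sample_size(d: int) -> float:
    """Sample size needed to approximate a Gaussian density within ~10% error
    with fixed Gaussian kernels: log10 n(d) ~ 0.6 (d - 0.25)."""
    return 10 ** (0.6 * (d - 0.25))

for d in (1, 5, 10, 20):
    print(d, f"{silverman_sample_size(d):.2g}")
# d = 10 gives ~7.1e+05, the figure quoted above; d = 20 would already
# require about 7e+11 observations.
```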
Parametric case

In the parametric situation, let $S_m$ be a model described by $D_m$ continuous parameters, likely depending on the dimension $d$. In such a case, the dataset $x$ is said to lie in a high-dimensional space as soon as $n$ is small in comparison with a particular function $g$ of $D_m$, namely $n = o(g(D_m))$. As an illustration for $g$, we consider the heteroscedastic Gaussian mixture with true parameter $\theta^*$ and $K$ components. We denote by $\hat\theta_K$ the Gaussian MLE with $K$ components. In that situation, $g$ is a linear function, from the following result (Maugis and Michel [2012]): there exist positive constants $\kappa$ and $A$ such that

$$\mathbb{E}_x\big[d_H^2\big(f(\cdot;\theta^*), f(\cdot;\hat\theta_{\hat K})\big)\big] \leq \kappa \left( \inf_K \big\{ \mathrm{KL}\big(f(\cdot;\theta^*), f(\cdot;\hat\theta_K)\big) + \mathrm{pen}(K) \big\} + \frac{1}{n} \right)$$

where $d_H$ denotes the Hellinger distance, $\mathrm{KL}$ the Kullback-Leibler divergence and

$$\mathrm{pen}(K) \geq \kappa \frac{D_K}{n} \left[ 2A \ln d + 1 - \ln\!\left( 1 \wedge \frac{D_K}{n} A \ln d \right) \right].$$
Thus the HD non-parametric and parametric situations are drastically different in magnitude. However, in practice, $D_K$ can be high since $D_K \sim d^2/2$ in this Gaussian situation, combined with potentially large constants. To highlight this fact, consider the following two-component multivariate Gaussian mixture:

$$\pi_1 = \pi_2 = \tfrac{1}{2}, \quad X_1|Z_{11}=1 \sim \mathcal{N}(\mathbf{0}, I), \quad X_1|Z_{12}=1 \sim \mathcal{N}(\mathbf{1}, I), \tag{2.1}$$

where $\mathbf{a} = (a, \ldots, a)'$ denotes the constant real vector of size $d$ with value $a$. An illustration of this setting is displayed in Figure 2.3(a). Note that the two components are more and more separated when $d$ grows since $\|\mathbf{1} - \mathbf{0}\|_I = \sqrt{d}$. However, the quality of the mixture density estimate degrades (the Kullback-Leibler divergence increases) when the dimension increases, as illustrated in Figure 2.3(b) with a homoscedastic model and with equal mixing proportions.
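A minimal simulation sketch of this degradation, assuming scikit-learn; the Kullback-Leibler divergence is approximated by Monte Carlo on a fresh sample rather than computed exactly:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n = 500

for d in (1, 2, 5, 10):
    # Sample from the true mixture (2.1): N(0, I) and N(1, I), equal weights
    z = rng.integers(0, 2, n)
    x = rng.normal(0, 1, (n, d)) + z[:, None]   # shift all coordinates when z = 1

    # Homoscedastic fit: common ("tied") covariance across the two components
    # (mixing proportions are estimated here rather than fixed to 1/2)
    gmm = GaussianMixture(n_components=2, covariance_type="tied",
                          random_state=0).fit(x)

    # Monte Carlo estimate of KL(f* || f_hat) using a fresh sample from f*
    z2 = rng.integers(0, 2, 5 * n)
    y = rng.normal(0, 1, (5 * n, d)) + z2[:, None]
    log_f_star = (np.logaddexp(-0.5 * (y ** 2).sum(axis=1),
                               -0.5 * ((y - 1) ** 2).sum(axis=1))
                  - np.log(2) - 0.5 * d * np.log(2 * np.pi))
    kl = (log_f_star - gmm.score_samples(y)).mean()
    print(d, round(kl, 3))   # KL grows with d for fixed n
```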
Figure 2.3: HD curse in the parametric density estimation context: (a) a bivariate dataset example with isodensities of each component and (b) the Kullback-Leibler divergence of the density estimate when $d$ increases.
2.2.2 HD clustering: A mix of curse and blessing

Contrary to density estimation, where increasing dimension has a clear negative effect, dimension may have both positive and negative effects on the clustering task. We now distinguish which factors favor such "blessing" or "curse" outcomes.
Blessing factors

We return to the model design (2.1) and display a corresponding sample again in Figure 2.4(a). We have already mentioned that the two components are more and more separated when $d$ increases. The reason is that each variable uniformly provides its own separation information, so that the associated theoretical error decreases when $d$ grows. Indeed, this error is equal to $\mathrm{err}_{\mathrm{theo}} = \Phi(-\sqrt{d}/2)$, where $\Phi$ is the cdf of $\mathcal{N}(0,1)$. We can see this decrease with $d$ as the dashed line in Figure 2.4(b). An interesting consequence is that the empirical error rate also decreases with $d$, as can be noticed from the continuous line in Figure 2.4(b). It means that increasing dimension may have a positive effect on the clustering task as soon as all variables convey meaningful information on the hidden partition.
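A minimal sketch reproducing this blessing effect under model (2.1), assuming scikit-learn; the empirical error is taken as the misclassification rate of a fitted two-component Gaussian mixture, minimized over label switching:

```python
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
n = 400

for d in (1, 2, 5, 10):
    z = rng.integers(0, 2, n)
    x = rng.normal(0, 1, (n, d)) + z[:, None]    # sample from model (2.1)

    z_hat = GaussianMixture(2, random_state=0).fit_predict(x)
    # Empirical error, minimized over the two possible label matchings
    err_emp = min((z_hat != z).mean(), (z_hat == z).mean())
    err_theo = norm.cdf(-np.sqrt(d) / 2)         # err_theo = Phi(-sqrt(d)/2)
    print(d, round(err_emp, 3), round(err_theo, 3))  # both decrease with d
```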
Figure 2.4: HD blessing in the clustering context when most variables convey independent partitioning information: (a) a bivariate dataset example with isodensities of each component and (b) the theoretical (dashed line) and the empirical (continuous line) error rate when $d$ increases.

We now propose to illustrate this positive effect more drastically through a mixture of three Gaussians, all more and more separated when $d$ increases:

$$\pi_1 = \pi_2 = \pi_3 = \tfrac{1}{3}, \quad X_1|Z_{11}=1 \sim \mathcal{N}(\mathbf{0}, I), \quad X_1|Z_{12}=1 \sim \mathcal{N}(\mathbf{2}, I), \quad X_1|Z_{13}=1 \sim \mathcal{N}(-\mathbf{2}, I).$$
Figure 2.5(a)-(d) then displays a related sample of size $n = 1000$, for different dimensions, on the main two axes of the Factorial Discriminant Analysis (FDA) mapping. It clearly appears that components are more and more easily recognized when dimension increases, although FDA is a simple visualization process. In the limit, no complex clustering algorithm would be needed to identify the clusters...
Curse factors

In fact, increasing dimension may have a positive effect on clustering retrieval only if the variables inject some partitioning information. In addition, such information has to be non-redundant. We now illustrate these two particular features. Firstly, we consider many variables which provide no separation information. We retrieve the same parameter setting as (2.1), except that the components are no more separated when $d$ grows, since $\|\mu_2 - \mu_1\|_I = 1$, where $\mu_1 = \mathbf{0}$ is the center of the first Gaussian and $\mu_2 = (1, 0, \ldots, 0)'$ is that of the second; thus ($k = 1, 2$)

$$X_1|Z_{1k} = 1 \sim \mathcal{N}(\mu_k, I). \tag{2.2}$$

A sample is displayed in Figure 2.6(a). Figure 2.6(b) shows as a dashed line that the theoretical error rate is constant (it corresponds to $\mathrm{err}_{\mathrm{theo}} = \Phi(-\tfrac{1}{2})$) when the dimension increases, as expected. Consequently, the empirical error rate increases with $d$, as shown by the continuous line in Figure 2.6(b).
Figure 2.5: Factorial Discriminant Analysis (FDA) on the main two factorial axes of three Gaussian components more and more separated when the space dimension increases: (a) $d = 2$, (b) $d = 20$, (c) $d = 200$ and (d) $d = 400$.
Secondly, we consider a case where many variables provide separation, but redundant, information, in the following sense: it is the same parameter setting as before for the first dimension, except for all the other ones:

$$X_{1j} = X_{11} + \varepsilon_j, \quad \text{where } \varepsilon_j \overset{\mathrm{iid}}{\sim} \mathcal{N}(0, 1) \ (j = 2, \ldots, d). \tag{2.3}$$

See a data example in Figure 2.7(a). Thus, components are not more separated when $d$ grows since $\|\mu_2 - \mu_1\|_\Sigma = 1$, $\Sigma$ denoting the common covariance matrix of each Gaussian component, and $\mu_k$ denoting the center of component $k = 1, 2$ (note that both $\mu_k$ and $\Sigma$ can easily be computed from Equations (2.2) and (2.3)). Consequently, $\mathrm{err}_{\mathrm{theo}} = \Phi(-\tfrac{1}{2})$ is constant and the empirical error increases with $d$, as illustrated in Figure 2.7(b) with the previous conventions.
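A minimal sketch of this redundancy curse under (2.2)-(2.3), with the same assumptions and conventions as the previous sketch:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
n = 400

for d in (2, 5, 10, 20):
    z = rng.integers(0, 2, n)
    x1 = rng.normal(0, 1, n) + z                     # first variable: model (2.2)
    noise = rng.normal(0, 1, (n, d - 1))
    x = np.column_stack([x1, x1[:, None] + noise])   # X_1j = X_11 + eps_j, (2.3)

    z_hat = GaussianMixture(2, random_state=0).fit_predict(x)
    err_emp = min((z_hat != z).mean(), (z_hat == z).mean())
    print(d, round(err_emp, 3))  # theoretical error stays Phi(-1/2) ~ 0.309
```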
Figure 2.6: HD curse in the clustering context when variables convey no partitioning information: (a) a bivariate dataset example with isodensities of each component and (b) the theoretical (dashed line) and the empirical (continuous line) error rate when $d$ increases.

Figure 2.7: HD curse in the clustering context when variables convey redundant partitioning information: (a) a bivariate dataset example with isodensities of each component and (b) the theoretical (dashed line) and the empirical (continuous line) error rate when $d$ increases.

2.2.3 Intermediate conclusion

In cases where variables have important blessing consequences for the clustering, it could be harmful to reduce a priori the variable space.
In particular, filter methods performing variable selection before the clustering task have to be excluded, the risk of removing discriminant features being too large. The remaining question is then which wrapper methods should be used. Such methods should manage with priority the fact that some variables have negative effects on the clustering. The general answer is to design specific parsimonious models for clustering, the most emblematic ones relying on some variable selection principle. We will also see several alternative strategies, in particular variable clustering (not to be confused with individual clustering, our primary task), aiming at assigning different roles (clusters) to the variables.