HAL Id: hal-02480318
https://hal.archives-ouvertes.fr/hal-02480318
Submitted on 15 Feb 2020
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Bennani, Antoine Cornuéjols
To cite this version:
Jeremie Sublime, Matei Basarab, Guénaël Cabanes, Nistor Grozavu, Younès Bennani, et al. Entropy Based Probabilistic Collaborative Clustering. Pattern Recognition, Elsevier, 2017, 72, pp.144-157. ⟨10.1016/j.patcog.2017.07.014⟩. ⟨hal-02480318⟩
Entropy Based Probabilistic Collaborative Clustering
Article in Pattern Recognition · December 2017
DOI: 10.1016/j.patcog.2017.07.014
All content following this page was uploaded by Jeremie Sublime on 08 August 2017.
Contents lists available at ScienceDirect
Pattern Recognition
journal homepage: www.elsevier.com/locate/patcog
Entropy based probabilistic collaborative clustering
Jérémie Sublime a,b,∗, Basarab Matei b, Guénaël Cabanes b, Nistor Grozavu b, Younès Bennani b, Antoine Cornuéjols c
a LISITE Laboratory, RDI Team - ISEP, 10 rue de Vanves, 92130 Issy Les Moulineaux, France
b Université Paris 13, Sorbonne Paris Cité, LIPN - CNRS UMR 7030, 99 av. J-B Clément, 93430 Villetaneuse, France
c UMR MIA-Paris, AgroParisTech, INRA, Université Paris-Saclay, 75005 Paris, France
Article info
Article history: Received 17 December 2016; Revised 24 April 2017; Accepted 8 July 2017; Available online xxx
Keywords: Collaborative clustering; EM algorithms; Entropy based methods
Abstract
Unsupervised machine learning approaches involving several clustering algorithms working together to tackle difficult data sets are a recent area of research with a large number of applications such as clustering of distributed data, multi-expert clustering, multi-scale clustering analysis or multi-view clustering. Most of these frameworks can be grouped under the umbrella of collaborative clustering, the aim of which is to reveal the common underlying structures found by the different algorithms while analyzing the data.
Within this context, the purpose of this article is to propose a collaborative framework lifting the limitations of many of the previously proposed methods: our proposed collaborative learning method makes it possible for a wide range of clustering algorithms from different families to work together based solely on their clustering solutions, thus lifting the previous limitation requiring identical prototypes between the different collaborators. Our proposed framework uses a variational EM as its theoretical basis for the collaboration process and can be applied to any of the previously mentioned collaborative contexts.
In this article, we give the main ideas and theoretical foundations of our method, and we demonstrate its effectiveness in a series of experiments on real data sets as well as data sets from the literature.
© 2017 Elsevier Ltd. All rights reserved.
∗ Corresponding author.
E-mail addresses: [email protected] , [email protected] (J. Sublime), [email protected] (B. Matei), [email protected] paris13.fr (G. Cabanes), [email protected] (N. Grozavu), [email protected] (Y. Bennani), [email protected] (A. Cornuéjols).
http://dx.doi.org/10.1016/j.patcog.2017.07.014
0031-3203/© 2017 Elsevier Ltd. All rights reserved.
Please cite this article as: J. Sublime et al., Entropy based probabilistic collaborative clustering, Pattern Recognition (2017),
1. Introduction
Data clustering is a fundamental task in the process of knowledge extraction from databases that aims to discover the intrinsic structures in a set of objects by forming clusters that share similar features. This task is more difficult than supervised classification as the number of clusters to be found is generally unknown and consequently it is difficult to rate the quality of a clustering partition. Over the past two decades, this task has become even more challenging as the available data sets became more complex with the introduction of multi-view data sets, distributed data, and data sets having different scales of structures of interest (e.g. hierarchical clusters). This increased complexity in an already hard problem makes it difficult for lone clustering algorithms to give competitive results with a high degree of confidence. However, very much like in the real world, such problems can be tackled more easily by having several algorithms working together in order to increase both the quality of the results and their reliability.
Approaches based on this idea of several algorithms working together have been widely studied in the case of supervised learning [1–4] where they gave birth to the field of Ensemble Learning.
Ensemble methods are easy to implement in supervised learning for two reasons: First, it is straightforward to define a combination of predictive functions to get an aggregated prediction function (for instance, a linear combination is used in boosting). Second, it is simple to measure both the performance of individual prediction functions and the diversity of the set of functions that are candidates for being part of the combined global decision function. Things are not so straightforward in unsupervised learning. Here, each individual solution is a soft or hard partition of the data set. How to combine these partitions has no obvious answer.
In cooperative clustering, each clustering algorithm produces its result independently. The final clustering is computed in a post-processing step, and the only exchange of information is about when the individual processes are completed, so that post-processing can start. In this case, a set of clustering algorithms are used in parallel on a given data set. Once all local computations are completed, a master algorithm takes control and combines the local results to get a hopefully better overall clustering. The resolution of the possible conflicts between the local solutions requires an algorithm that is able to compare results that may differ in their format (e.g. different numbers of clusters, different degrees of belief associated with the results, ...) and to find a consensus solution that minimizes the overall violation to the local results. The cooperative framework is closely related to the ensemble methods developed for supervised learning. In these approaches, a set of (diverse) classifiers is learned and the classification of new data points is obtained by taking a (weighted) vote of their predictions. Bayesian averaging can be considered as a precursor method. Numerous new ones have been developed, from error-correcting output coding to Bagging and Boosting, and their application in various domains has become routine with often good results.
In collaborative clustering (the subject of the remainder of this paper), the group solves together problems defined and imposed by the central controller, which assigns an individual task to each learner. Interactions are recurrent between team members, responsibility is collective, the action of each teammate is geared to the performance of the group and vice versa. By contrast to the cooperative clustering model, the collaborative model does not seek an overall hopefully better clustering of a given data set through the combination of individual solutions. In the collaborative framework, the goal is that each local computation, quite possibly applied to distinct data sets, benefits from the work done by the other collaborators. This can be done through the exchange of information about the local data, or the current hypothesized local clustering, or the value of one algorithm's parameters. The validity of the approach rests on the assumption that useful information can be shared among the local tasks. This scheme leads naturally to distributed implementations of the computations, but unlike in the cooperative framework, it generally entails several iterations at each local node because convergence of the consensus solution requires several passes of the algorithm. Indeed, in addition to the problem of what information to exchange between collaborators, one question is how to measure the evolution at each node and on a global level.
There are many applications in unsupervised learning for which collaborative clustering can prove useful:
• Multi-scale analysis: In this case several algorithms would be analyzing the same objects, all looking at the same features, but searching for a different number of clusters. That kind of analysis can be beneficial for data sets that have intrinsic multi-scale structures such as satellite images for which a lower level analysis of global landscape areas (urban areas, water bodies, forests) often helps to improve a higher level analysis of smaller details (trees, cars, houses, gardens, streets, etc.).
• Multi-expert analysis: In this case, all algorithms would be working on the same objects and features of a difficult data set. Given the very high number of existing clustering algorithms, all more or less specialized and that may or may not give good results depending on the problem, trying several of them on a data set and having them exchange their information could be justified: merging the information on clusters found only by some clustering algorithms, refining the results based on clusters that are more or less well identified depending on the method, etc.
• Multi-view clustering [5,6]: Different algorithms process different types of attributes for the same objects. For example one algorithm for geometric attributes, one for text attributes, one for colors, one for numerical attributes, etc. The goal of the collaboration in this case would be to have each attribute type processed by a specialized algorithm while giving these algorithms a more global picture of the data set by enabling some exchanges between them.
• Clustering of distributed data [7]: The same objects have their attributes split on several databases that can't exchange their data because of privacy issues. While the name is different, this is in fact very much equivalent to multi-view clustering.
• Big Data Clustering [8]: Data sets that are too large or have too many attributes to be processed efficiently by a single algorithm may be easier to tackle once their attributes are split and processed by several algorithms. This type of clustering is useful in the area of Big Data analysis and would require a high degree of cooperation between the algorithms to get the global picture.
As one can see, all these applications have a lot of similarities: we have several algorithms working on the same data or subsets of the same data, and that will or could at some point try to aggregate or to mutually exploit their respective results. While some of these applications could be considered a field of their own such as multi-view clustering or distributed clustering [5], all of them can be classified as horizontal collaborative clustering frameworks [9–12]: several algorithms working on the same data, possibly looking for a different number of clusters, and not necessarily having access to the same features.
We generally distinguish between two types of collaborative methods [9,11]: Vertical collaboration encompasses all cases where several algorithms are working on different data that have similar clusters or distributions. Horizontal collaboration deals with cases where several algorithms are collaborating on the same objects, possibly described from different views. In this article, we are mostly interested in horizontal collaboration.
Collaborative methods usually follow a two-step procedure [13]:
1. Local step: Each algorithm will individually process the data it has access to and produce a local clustering partition.
2. Collaborative step: The algorithms share their results and try to confirm or improve their models with the goal of achieving better clustering results.
These two steps are sometimes followed by an aggregation step which aims at reaching a consensus with the final results after collaboration. In this work we will not address the aggregation step because it is a field of its own, and because depending on the application it may not always be advisable to aggregate, for instance when the different views, sites or scales have conflicting partitions [14]. We will instead focus on the collaborative step where the algorithms exchange bits of information with a goal of mutual improvement.
From there, the main difference between what is traditionally referred to as “clustering ensemble learning” [15] and collaborative clustering is that clustering ensemble learning methods aim at finding a single consensus partition, while collaborative clustering does not have this final goal. In short, the field of collaborative clustering is concerned with finding algorithms and functions that allow algorithms to share information and to improve their results based on each other's similarities, while the field of ensemble learning is more concerned with finding algorithms and methods to merge the solutions or find a consensus between them. Collaborative clustering can therefore be a task of its own (e.g. multi-view clustering where consensus is not always possible nor advisable), or a preliminary step to an ensemble learning task. The methods and techniques used by both fields are therefore naturally overlapping, and a good collaborative algorithm must respect properties that are very similar to those of a good ensemble learning method:
• Robustness: The collaborative process must lead on average to partitions that are better than the local clustering results.
• Consistency: The updated results must be somehow similar to the original local results.
• Novelty: Collaborative clustering must make it possible to find solutions that would have been otherwise unattainable locally.
• Stability: Results that have a lower sensitivity to noise.
Within this context, in this article we introduce a new and original framework for collaborative clustering that can be applied to the various types of unsupervised collaborative learning tasks that we have previously discussed. Our proposed method lifts several limitations of previous ensemble learning and collaborative frameworks: the data need not be shared between the different algorithms, the number of clusters can be different between the algorithms, and very different types of algorithms can collaborate together.
The theoretical basis of our work is close to the work of Bickel and Scheffer on the estimation of Mixture Models using Co-EM [16,17]. Our proposed method differs from theirs in the following points: in our case we are treating a broader context than multi-view clustering. Our method makes it possible for algorithms from different families to work together, and once again we do not have the limitation that all algorithms should be searching for the same number of clusters. We propose a variational version of their work for multi-view clustering based on the optimization of a different objective function. The core of our proposed approach is a different discretization process based on a particular class of a posteriori distributions called “combination functions” presented in Section 3.4.1.
The remainder of this article is organized as follows:
In Section 2, we review the state of the art, introducing some of the pioneering and earlier proposed methods and frameworks for collaborative learning with their strengths and weaknesses.
In Section 3, we introduce our proposed method for horizontal collaborative clustering. As stated previously, the method that we propose aims at being more generic than the previously proposed frameworks. We begin by explaining the principle of our method and its theoretical basis. Then we study the stopping criterion and parameter tuning of our algorithm. And finally, we demonstrate that our proposed method has good convergence properties similar to those of an EM algorithm.
In Section 4, we show some experimental results. We are mostly interested in showing some potential applications of our proposed method applied to multi-scale clustering and multi-view clustering.
Finally, this work ends with a conclusion and perspectives on future works.
2. State of the art in collaborative clustering
One of the first collaborative clustering algorithms was introduced in 2002 by Pedrycz [13,18] under the name “Collaborative Fuzzy Clustering” (CoFC). This method was designed for the specific case of distributed data where the information cannot be shared between the different sites. This method was based on a modified version of the Fuzzy C-Means algorithm [19].
The main limitation of this approach is that it only enables Fuzzy C-Means algorithms to collaborate together, and furthermore some methods even require that all of them be looking for the same number of clusters.
Similar approaches were used to develop several other collaborative-like methods: CoEM [17], CoFKM [20], and another collaborative EM-like algorithm [21] based on Markov Random Fields.
All these algorithms display similar limitations: the objective functions and sometimes the number of clusters must be identical for all exchanged information. This is due to the fact that they all try to optimize an objective function the form of which is:
$$(S^{opt}, \Theta^{opt}) = \operatorname*{Argmax}_{(S,\Theta)} L_g(S,\Theta) = \operatorname*{Argmax}_{(S,\Theta)} \sum_{i=1}^{J} \Big[ L(X^i \mid S^i, \Theta^i) - \sum_{j \neq i} \tau_{j,i} \cdot \Delta(\Theta^i, \Theta^j) \Big] \qquad (1)$$
where $J$ is the number of collaborators, $S$ contains all algorithms' partitions, $\Theta$ their distributions parameters, $L_g(S,\Theta)$ is the global likelihood of the system, each $L(X^i \mid S^i, \Theta^i)$ is the local log-likelihood of a collaborating algorithm, each $\Delta(\Theta^i, \Theta^j)$, the “collaborative term”, is a custom pairwise penalty that compares the difference between the parameters or prototypes of two algorithms, and the $\tau_{j,i}$, which do not exist in all methods, are weights given to the collaborative penalties. The definition of the local term $L(X^i \mid S^i, \Theta^i)$ based on which algorithms collaborate together makes the main difference between all these methods, while the definition of the penalty $\Delta(\Theta^i, \Theta^j)$ only slightly differs depending on the collaborative method. This latter term is the limiting one since comparing prototypes and parameters requires that the algorithms have the same types of prototypes and some kind of mapping between the clusters of the different algorithms.
The work of Pedrycz on the CoFC algorithm was also extended to be adapted to the Self-Organizing Maps (SOM) [11,22,23] and to the Generative Topographic Maps (GTM) [24].
In [23], the classical SOM objective function is modified by adding a specific extra term for horizontal collaboration and a different one for vertical collaboration. For the collaborative version of the GTM algorithm [24], the principle is the same, with the M-Step of the EM algorithm mapping the neurons to the final clusters being modified.
One problem with these two methods is that they do not really solve the main issue of collaboration between different types of algorithms since their model is once again analogous to the one in Eq. (1). Furthermore, while the number of clusters does not matter in the case of the collaborative SOM and collaborative GTM, in both cases the maps must have the same number of neurons and be topologically similar to each other. This is actually even more restrictive than a requirement on the number of clusters.
The SAMARAH method [25,26] is another type of collaborative framework the strength of which is that it can deal with any kind of hard clustering algorithm and is not concerned with issues such as fitness functions, number of clusters, or prototypes. Unlike the previously introduced methods, SAMARAH only handles horizontal collaboration due to the lack of prototypes, and was designed mostly for clustering applied to image data. Its goal is very simple: given $J$ clustering results for the same data, the idea is to modify these results in an iterative and collaborative way with the aim of reducing their diversity in order to make the finding of a consensus solution easier.
Once the results have been generated during the local step, the SAMARAH method maps the clusters of the different algorithms using probabilistic confusion matrices (PCM). Let $S^i$ and $S^j$ be two clustering results from two algorithms $A_i$ and $A_j$ looking for $K_i$ and $K_j$ clusters respectively.
Then, the probabilistic confusion matrix (PCM) $\Omega^{i,j}$ that maps the clusters from $A_i$ to $A_j$ is defined as shown below:
$$\Omega^{i,j} = \begin{pmatrix} \omega^{i,j}_{1,1} & \cdots & \omega^{i,j}_{1,K_j} \\ \vdots & \ddots & \vdots \\ \omega^{i,j}_{K_i,1} & \cdots & \omega^{i,j}_{K_i,K_j} \end{pmatrix} \quad \text{where } \omega^{i,j}_{a,b} = \frac{|S^i_a \cap S^j_b|}{|S^i_a|} \qquad (2)$$
In Eq. (2), $S^i_a$ denotes the $a$th cluster of algorithm $A_i$, i.e., $S^i_a = \{x;\ x \in X^i,\ x \in a \text{ by } A_i\}$, $|S^i_a|$ is the number of data in this cluster, and $|S^i_a \cap S^j_b|$ is the number of data linked to the $a$th cluster of $A_i$ and the $b$th cluster of $A_j$ at the same time. The PCM $\Omega^{i,j}$ makes it possible to know whether or not the objects of two results have been grouped in a similar way, or if the two clustering results are dissimilar. The matrix has a key role in the comparison of two clustering results, such as detecting agreements and conflicts, and has the major advantage of being independent from the clustering algorithm used to generate the results.
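To make Eq. (2) concrete, the following minimal Python sketch computes a probabilistic confusion matrix from two hard label vectors describing the same data elements. It is an illustration only, not the SAMARAH implementation: the function name `probabilistic_confusion_matrix`, the integer label encoding (clusters numbered from 0) and the convention of returning a zero row for an empty cluster are assumptions made here.

```python
def probabilistic_confusion_matrix(labels_i, labels_j, k_i, k_j):
    """Return the k_i x k_j matrix of omega[a][b] = |S_a^i ∩ S_b^j| / |S_a^i|.

    labels_i[n] and labels_j[n] are the clusters given to element n by the
    two algorithms (hypothetical 0-based encoding).
    """
    if len(labels_i) != len(labels_j):
        raise ValueError("both partitions must describe the same elements")

    # counts[a][b] = number of elements in cluster a of A_i and cluster b of A_j
    counts = [[0] * k_j for _ in range(k_i)]
    for a, b in zip(labels_i, labels_j):
        counts[a][b] += 1

    omega = []
    for row in counts:
        size = sum(row)  # |S_a^i|
        # Assumption: an empty cluster yields a row of zeros.
        omega.append([c / size if size else 0.0 for c in row])
    return omega


# Toy example: 5 elements, two algorithms looking for 2 clusters each.
omega = probabilistic_confusion_matrix([0, 0, 0, 1, 1], [0, 0, 1, 1, 1], 2, 2)
```

Each non-empty row sums to 1, so row $a$ can be read as the way the elements of cluster $a$ of $A_i$ are spread over the clusters of $A_j$.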
The SAMARAH method uses this matrix to detect pairwise conflicts between the different partitions and reduces them by order of perceived importance based on a conflict metric criterion [25] by splitting, merging, or removing clusters. Once the solutions have all been refined, and are consequently quite similar to each other, it proceeds with aggregating them using a process similar to a majority vote [27]. It is therefore a very complete framework that covers all 3 steps of local learning, collaborative learning and result aggregation and does not rely on user-defined parameters.
However, its conflict resolution system certainly is a weak point: it relies on a pairwise conflict criterion, and solves the conflicts one by one by order of perceived importance, and it can lead to sub-optimal results. Finally, while it is also a strong point of the method, the fact that the algorithms' parameters or prototypes do not play any role once the local step is over may constitute a weakness, in the sense that the local model is never rebuilt using the new partitions and does not play any active role in either the collaborative step or the consensus step.
3. Horizontal collaborative clustering guided by diversity
3.1. Formalism
In horizontal collaborative clustering we consider a finite group of algorithms $A = \{A_1, \ldots, A_J\}$ that are working on the same data elements, albeit possibly with access to different features, and also possibly looking for a different number of clusters. No assumptions are made on the algorithms themselves. Let $X = \{x_1, \ldots, x_N\}$, $x_n \in \mathbb{R}^d$ be a data set containing $N$ elements, each of them with $d$ real number features.
Each clustering algorithm $A_i$ has its own parameters to describe either the clusters or its model, and produces its own clustering solution $S^i$ made of $K_i$ clusters, based on the features of the data set $X^i \subseteq X$ it has access to. In the case of hard clustering, $S^i$ can be translated into a solution vector of size $N$, and for fuzzy clustering into a matrix of size $N \times K_i$. We denote this latter matrix $S^i = (s^i_{n,c})$, where $1 \leq n \leq N$ and $1 \leq c \leq K_i$. The solutions $S^i$ output by the algorithms are therefore two-dimensional matrices of size $N \times K_i$ where each element $s^i_{n,c}$ expresses the responsibility (probability) given by algorithm $A_i$ to a cluster $c$ for the data element $x_n$.
Each algorithm $A_i$ computes the solutions $S^i$, as usual, by introducing a latent discrete random vector $Z^i$ defined on some latent space with the range $[1, \ldots, K_i]$, hence computing the a posteriori distribution of the variable $Z^i$ conditionally on $X^i$ and $S^i$.
Finally, in order to quantify the degree of information coming from the collaboration, for a given algorithm $A_i$, we will assume the existence of some weights $\tau_{j,i} \in (0,1)$, which measure the relative external information from the algorithm $j \neq i$ accepted by $A_i$. All weights $\tau_{j,i}$ are stored in a square matrix of size $J \times J$ which therefore contains the strength of all collaboration links. Most notations used in this article are summed up in Table 1 below.
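The notation above can be illustrated with a short Python sketch, given under stated assumptions rather than as part of the proposed method: a hard clustering stored as a label vector of size $N$ is turned into an $N \times K_i$ responsibility matrix whose rows sum to 1. The helper names `hard_to_responsibilities` and `is_valid_partition`, and the 0-based cluster labels, are hypothetical.

```python
def hard_to_responsibilities(labels, k):
    """Turn a hard label vector of size N into an N x k responsibility matrix.

    Row n holds s_{n,c} for c = 0..k-1: 1.0 for the assigned cluster and 0.0
    elsewhere (the boundary case of a fuzzy partition).
    """
    for label in labels:
        if not 0 <= label < k:
            raise ValueError("label out of range")
    return [[1.0 if c == label else 0.0 for c in range(k)] for label in labels]


def is_valid_partition(S, tol=1e-9):
    """Check that every row holds values in [0, 1] that sum to 1."""
    return all(
        all(0.0 <= s <= 1.0 for s in row) and abs(sum(row) - 1.0) <= tol
        for row in S
    )


# Toy example: N = 4 elements, K_i = 3 clusters.
S = hard_to_responsibilities([0, 2, 1, 2], k=3)
```

Storing hard and fuzzy solutions in the same matrix form is what lets algorithms with different numbers of clusters be compared through their partitions alone.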
3.2. Problem formulation
Within the context of horizontal collaboration that we have presented before, the method that we propose takes many advantages of both prototype-based collaborative methods and the SAMARAH method, without their issues.
Our goal in this section is to find a way to modify Eq. (1) so that the collaborative term will not depend on the prototypes. Therefore, we propose a likelihood function, given in Eq. (3), which uses a global consensus term $C(S)$ based on the partitions. The main differences with Eq. (1) are that we use a model based on partitions rather than prototypes, our proposed model is consensus based instead of divergence based, and we use a global term instead of a pairwise one. We chose this global model because unlike the pairwise version, it does not require assuming that the algorithms are independent from each other (which is of course not true).
In this model, $\lambda \in [0,1]$ is a weight parameter to balance between the local and collaborative term. The left term $\sum_{i=1}^{J} L(X^i \mid S^i, \Theta^i)$ is called the local term, and the right term $\lambda \cdot C(S)$ is the collaborative term. Note that the $C(\cdot)$ here stands for “consensus”: we have a collaborative term based on a consensus function.
$$(S^{opt}, \Theta^{opt}) = \operatorname*{Argmax}_{(S,\Theta)} L_g(S,\Theta) = \operatorname*{Argmax}_{(S,\Theta)} \Big[ \sum_{i=1}^{J} L(X^i \mid S^i, \Theta^i) + \lambda \cdot C(S) \Big] \qquad (3)$$
With this model, and using a collaborative term based on different a posteriori distributions instead of a collaborative term based on distributions parameters, our proposed model lifts the limitation that only identical algorithms looking for the same number of clusters can work together. Furthermore, using our model even non-parametric algorithms, for which the distributions parameter $\Theta^i$ cannot be explicitly formulated, can be used in a collaborative setting since our model is based on the partitions (solution matrices or vectors) which are explicit for any clustering algorithm. The penalty factor $\lambda > 0$ regularizes the collaboration part. Please note that in [28], the authors have demonstrated that there is a direct relation between reducing the divergences and maximizing the consensus under mild assumptions. Therefore, both strategies are equivalent.
Analogously to Eq. (3), our idea is to optimize a modified fitness of the log-likelihood function that considers both the local partitions and the information coming from the other algorithms' solutions. By considering only the partitions $S^i$ and not the parameters, very much like in the SAMARAH method [25,26], we ensure that our model is generic.
As we will demonstrate in the next subsection, this change from $\Theta^i$ to $S^i$ is made possible because we use an alternate maximization procedure in which the partitions are computed from the prototypes and then the prototypes are updated based on the partitions and the data. In short, the partitions can be seen as a discretization of the distributions described by the prototypes.
While this improvement will result in a more generic paradigm when it comes to horizontal collaboration, it is worth mentioning that removing the prototypes also makes vertical collaboration (algorithms collaborating on different data sets with similar clusters) impossible whereas some of the earlier methods covered this case of knowledge transfer between similar data sets [11,13,24], albeit only between identical algorithms.
To optimize (3) we use the Expectation Maximization (EM) strategy. The workflow in Algorithm (1) highlights how our algorithm can indeed be considered as an EM algorithm. During the E-Step, the partitions $S$ are updated using fixed values for the distributions parameters $\Theta$. Then, during the M-Step, these parameters are updated based on the new partitions.
The exact form of the functional $L_g$ is explained in the next section, while the stopping criterion is detailed in Section 3.5.
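The alternation between E-Step and M-Step described above can be sketched in Python as follows. This is only a skeleton of the control flow of Algorithm 1 under stated assumptions: the three callables `e_step`, `m_step` and `entropy`, as well as the `max_iter` safeguard, are hypothetical placeholders, and the actual update rules and the global entropy $H$ are not reproduced here.

```python
def collaborative_em(theta0, e_step, m_step, entropy, max_iter=100):
    """Alternate E- and M-steps while the global entropy keeps decreasing.

    theta0  : parameters obtained from the local step (Theta(0))
    e_step  : callable, parameters -> partitions (stands in for the E-Step)
    m_step  : callable, partitions -> parameters (stands in for the M-Step)
    entropy : callable, partitions -> float (stands in for H)
    """
    theta = theta0
    S = e_step(theta)          # partitions computed from the local parameters
    h = entropy(S)
    for _ in range(max_iter):  # max_iter is a safeguard added for this sketch
        theta_new = m_step(S)      # update parameters from current partitions
        S_new = e_step(theta_new)  # update partitions with fixed parameters
        h_new = entropy(S_new)
        if h_new >= h:             # stop as soon as H no longer decreases
            break
        S, theta, h = S_new, theta_new, h_new
    return S, theta


# Toy placeholders: the "entropy" |s - 3| decreases until s reaches 3.
S, theta = collaborative_em(0, lambda th: th, lambda s: s + 1,
                            lambda s: abs(s - 3))
```

In this sketch the last accepted partitions are returned, so an iteration that would increase the entropy is discarded; this is a design choice of the illustration, not a statement about the original algorithm.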
Table 1
Notations.
| Notation | Development | Comment |
|---|---|---|
| $X^i$ | $X^i = \{x^i_1, \ldots, x^i_N\}$, $x^i_n \in \mathbb{R}^d$ | The subset of the data observed by algorithm $A_i$ |
| $X$ | $X = \{X^1, \ldots, X^J\}$ | The full data with all views |
| $\Theta^i$ | | The parameters describing the distributions observed by algorithm $A_i$ |
| $\Theta$ | $\Theta = \{\Theta^1, \ldots, \Theta^J\}$ | The set of distributions parameters for all algorithms |
| $A_i$ | $A_i = \{X^i, S^i, \Theta^i, K_i\}$ | An algorithm looking for $K_i$ clusters of distribution parameters $\Theta^i$ in the subset $X^i$ and finding a partition $S^i$ |
| $\tau_{j,i}$ | $\tau_{j,i} \in [0,1]$ | The weight of the collaboration from $A_j$ to $A_i$ |
| $s^i_{n,c}$ | $s^i_{n,c} \in (0,1)$, $\sum_{c=1}^{K_i} s^i_{n,c} = 1$ | The responsibility given by algorithm $A_i$ to the cluster $c \in [1..K_i]$ for the data $x^i_n$ |
| $S^i$ | $S^i = (s^i_{n,c})_{N \times K_i}$ | The partition found by algorithm $A_i$. For fuzzy clusters, $S^i$ is a matrix. |
| $Z^i$ | $Z^i$ takes values in $[1..K_i]$ | The latent random vector linked to the solutions of algorithm $A_i$ |
| $P(Z^i \mid X^i, \Theta^i)$ | | The a posteriori distribution of $Z^i$ conditionally on $X^i$ and $\Theta^i$ |
| $H$ | See Eq. (16) | The global entropy of the collaborative system for all algorithms |
| $\omega^{i,j}_{a,b}$ | $\omega^{i,j}_{a,b} = P(Z^j_n = b \mid Z^i_n = a, S, X, \Theta)$ | The percentage of data associated to cluster $a$ by $A_i$ that belong in the cluster $b$ of $A_j$ |
| $q$ | $q = \{q_1, \ldots, q_J\}$, $\forall i\ q_i \in [1..K_i]$ | A combination of clusters (see Section 3.4) |
| $g^i(q,c)$ | $g^i(q,c) \in (0,1)$, $c \in [1..K_i]$ | A consensus function assessing the likelihood of having $q_i = c$ knowing the rest of $q$ |
Algorithm 1: Collaborative “EM”.
  Initialize $t = 0$ and $\Theta(0)$ with the local step
  while the global entropy $H$ decreases do
    E-Step: $S(t) = \operatorname{Argmax}_S L_g(S, \Theta(t))$
    M-Step: $\Theta(t+1) = \operatorname{Argmax}_\Theta L_g(S(t), \Theta)$
    $t = t + 1$
  end
  Return $S(t)$
3.3. Objective function
The fundamental question in the horizontal collaborative setting is to find the right functional to optimize so that we can properly answer the problem of having several algorithms working together by exchanging their information with a goal of mutual improvement. To do so, we have the following constraints: We want a functional similar to Eq. (3) based on the partitions instead of distributions prototypes, where we attempt to bias each local solution $S^i_t$ so that $S^i_{t+1}$ takes into account the information from the other partitions without using any prototypes. The problem therefore consists in finding the right local and collaborative terms.
Defining the local term is relatively easy and can be done using any kind of likelihood function for probabilistic algorithms, and an ad-hoc normalized quality criterion for other types of algorithms. The literature is also full of potential functions for the collaborative term that measure the divergence or consensus between two partitions (NMI, entropies, Rand Index, etc.). However, if we add the extra constraint that the partitions are mostly non-binary and that Eq. (3) should be optimized in a reasonable amount of time, we face the following problem: For vector partitions of size $N$, most of these operators have a complexity in $O(N^2)$. Therefore, the final cost of updating all partitions for the $J$ algorithms looking on average for $\bar{K}$ clusters would be equivalent to calling these operators $J \times N \times \bar{K}$ times, hence a final complexity of $O(N^3)$ just to optimize the collaborative term.
Since such complexity obviously does not scale well, in the remainder of this section we explain how we re-designed a likelihood function from scratch using a solid probabilistic model. Then, in Section 3.4, we show how to optimize this new function with a low complexity of $O(N)$. Very much like in Eq. (3), we consider that the functional in the collaborative setting is decoupled into two different terms, the local term $L(S,\Theta)$ computed from all local log-likelihood or quality indexes, and the collaborative term $C(S)$ in the form of a global consensus function between the partitions. More precisely the global likelihood function writes:
$$L_g(S,\Theta) = L(S,\Theta) + \lambda \cdot C(S), \qquad (4)$$
where $X$ is the observed variable, $\Theta$ the set of parameters and $S = (S^1, \ldots, S^J)$ is the set of all partitions.
In the first term $L$ in Eq. (4), just as in Eq. (3), we express the log-likelihood of $S$ based only on the local information and model of each algorithm taken individually and the data $x_n$. We then evaluate the log-likelihood of the completed sample against the a posteriori distribution of $(Z^i \mid X^i_n, \Theta^i)$.
$$L(S,\Theta) = \sum_{i=1}^{J} \sum_{n=1}^{N} P(Z^i_n \mid X^i_n, \Theta^i) \cdot \log P(X^i_n, Z^i_n \mid \Theta^i). \qquad (5)$$
The second term of Eq. (4) is detailed in Eq. (6). It is computed from the likelihood that each element $x_n$ be linked to the right cluster based on the other algorithms' partitions and the choice of cluster for the same data in the local view. The difference between the local likelihood and the likelihood based on the other algorithms gives us the collaborative term. This term $C(S)$ therefore is the likelihood of $S$ based on all the solutions.
$$C(S) = \sum_{i=1}^{J} \sum_{n=1}^{N} \Big[ P(Z^i_n \mid X_n \setminus X^i_n, S) - P(Z^i_n \mid X^i_n, \Theta^i) \Big] \cdot \log P(X^i_n, Z^i_n \mid \Theta^i) \qquad (6)$$
Then using Eqs. (5) and (6), the weight of each $\log P(X^i_n, Z^i_n \mid \Theta^i)$ in Eq. (4) becomes a convex combination of the two posteriors, and we obtain the following a posteriori probability for the completed sample $(X^i_n, Z^i_n)$ corresponding to algorithm $A_i$:
$$P(Z^i_n = c \mid X^i_n, \Theta^i, S) = (1-\lambda) \cdot P(Z^i_n = c \mid X^i_n, \Theta^i) + \lambda \cdot P(Z^i_n = c \mid X_n \setminus X^i_n, S) \qquad (7)$$
Note that due to the lack of independence, $P(Z^i \mid X_n \setminus X^i_n, S)$ is not tractable. Nevertheless, in the next section we show tractable update rules for the responsibilities.
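The relation between Eqs. (4) to (7) can be checked numerically with the small Python sketch below. It is an illustration under stated assumptions, not the authors' implementation: it considers a single algorithm and a single data element, sums over the candidate clusters $c$, and takes the external posterior as a given list of numbers, since that quantity is not tractable in general. The function names and the toy values are hypothetical.

```python
import math


def mix_posteriors(p_local, p_external, lam):
    """Eq. (7): convex combination of the local and external posteriors."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [(1.0 - lam) * p + lam * q for p, q in zip(p_local, p_external)]


def global_term(p_local, p_external, log_joint, lam):
    """Eq. (4) restricted to one element: local term (5) + lam * term (6)."""
    local = sum(p * l for p, l in zip(p_local, log_joint))
    collab = sum((q - p) * l
                 for p, q, l in zip(p_local, p_external, log_joint))
    return local + lam * collab


# Toy values for K_i = 3 candidate clusters (assumed, for illustration only).
p_local = [0.7, 0.2, 0.1]     # P(Z = c | X^i, Theta^i)
p_external = [0.1, 0.8, 0.1]  # P(Z = c | other views, S), assumed given
log_joint = [math.log(0.5), math.log(0.3), math.log(0.2)]
lam = 0.4

mixed = mix_posteriors(p_local, p_external, lam)
lhs = global_term(p_local, p_external, log_joint, lam)
rhs = sum(m * l for m, l in zip(mixed, log_joint))
```

Here `lhs` and `rhs` agree up to rounding, which is the regrouping of terms that leads from Eqs. (5) and (6) to Eq. (7); the mixed posterior also still sums to 1.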
3.4. Update rules
In this section, we will proceed with the practical description of the update rules for the responsibilities $s^i_{n,c}$ so that we can actually compute the partitions that are solutions of the functional from Eq. (7). For fuzzy clustering we then infer that the update rule for the responsibility for all data $x_n$ and all clusters $c$ from iteration $t$ to iteration $t+1$ during the E-step of Algorithm (1) is the following:
s^i_{n,c}(t+1) = (1-\lambda) \cdot s^i_{n,c}(t) + \lambda \cdot \sum_{q \in Q \mid q_i = c} P