Entropy Based Probabilistic Collaborative Clustering


HAL Id: hal-02480318

https://hal.archives-ouvertes.fr/hal-02480318

Submitted on 15 Feb 2020



To cite this version:

Jeremie Sublime, Matei Basarab, Guénaël Cabanes, Nistor Grozavu, Younès Bennani, et al. Entropy Based Probabilistic Collaborative Clustering. Pattern Recognition, Elsevier, 2017, 72, pp.144-157. ⟨10.1016/j.patcog.2017.07.014⟩. ⟨hal-02480318⟩


Contents lists available at ScienceDirect

Pattern Recognition

journal homepage: www.elsevier.com/locate/patcog

Entropy based probabilistic collaborative clustering

Jérémie Sublime^a,b,∗^, Basarab Matei^b^, Guénaël Cabanes^b^, Nistor Grozavu^b^, Younès Bennani^b^, Antoine Cornuéjols^c^

^a^ LISITE Laboratory, RDI Team - ISEP, 10 rue de Vanves, 92130 Issy Les Moulineaux, France

^b^ Université Paris 13, Sorbonne Paris Cité, LIPN - CNRS UMR 7030, 99 av. J-B Clément, 93430 Villetaneuse, France

^c^ UMR MIA-Paris, AgroParisTech, INRA, Université Paris-Saclay, 75005 Paris, France

Article info

Article history: Received 17 December 2016; Revised 24 April 2017; Accepted 8 July 2017; Available online xxx

Keywords: Collaborative clustering; EM algorithms; Entropy based methods

Abstract

Unsupervised machine learning approaches involving several clustering algorithms working together to tackle difficult data sets are a recent area of research with a large number of applications such as clustering of distributed data, multi-expert clustering, multi-scale clustering analysis or multi-view clustering. Most of these frameworks can be grouped under the umbrella of collaborative clustering, the aim of which is to reveal the common underlying structures found by the different algorithms while analyzing the data.

Within this context, the purpose of this article is to propose a collaborative framework lifting the limitations of many of the previously proposed methods: Our proposed collaborative learning method makes it possible for a wide range of clustering algorithms from different families to work together based solely on their clustering solutions, thus lifting the previous limitation requiring identical prototypes between the different collaborators. Our proposed framework uses a variational EM as its theoretical basis for the collaboration process and can be applied to any of the previously mentioned collaborative contexts.

In this article, we give the main ideas and theoretical foundations of our method, and we demonstrate its effectiveness in a series of experiments on real data sets as well as data sets from the literature.

∗ Corresponding author.

Please cite this article as: J. Sublime et al., Entropy based probabilistic collaborative clustering, Pattern Recognition (2017),
http://dx.doi.org/10.1016/j.patcog.2017.07.014

1. Introduction

Data clustering is a fundamental task in the process of knowledge extraction from databases that aims to discover the intrinsic structures in a set of objects by forming clusters that share similar features. This task is more difficult than supervised classification, as the number of clusters to be found is generally unknown and consequently it is difficult to rate the quality of a clustering partition. Over the past two decades, this task has become even more challenging as the available data sets became more complex with the introduction of multi-view data sets, distributed data, and data sets having different scales of structures of interest (e.g. hierarchical clusters). This increased complexity in an already hard problem makes it difficult for lone clustering algorithms to give competitive results with a high degree of confidence. However, very much like in the real world, such problems can be tackled more easily by having several algorithms working together in order to increase both the quality of the results and their reliability.

Approaches based on this idea of several algorithms working together have been widely studied in the case of supervised learning [1–4], where they gave birth to the field of Ensemble Learning.

Ensemble methods are easy to implement in supervised learning for two reasons: First, it is straightforward to define a combination of predictive functions to get an aggregated prediction function (for instance, a linear combination is used in boosting). Second, it is simple to measure both the performance of individual prediction functions and the diversity of the set of functions that are candidates for being part of the combined global decision function. Things are not so straightforward in unsupervised learning. Here, each individual solution is a soft or hard partition of the data set. How to combine these partitions has no obvious answer.

In cooperative clustering, each clustering algorithm produces its result independently. The final clustering is computed in a post-processing step, and the only exchange of information is about when the individual processes are completed, so that post-processing can start. In this case, a set of clustering algorithms is used in parallel on a given data set. Once all local computations are completed, a master algorithm takes control and combines the local results to get a hopefully better overall clustering. The resolution of the possible conflicts between the local solutions requires an algorithm that is able to compare results that may differ in their format (e.g. different numbers of clusters, different degrees of belief associated with the results, ...) and to find a consensus solution that minimizes the overall violation of the local results. The cooperative framework is closely related to the ensemble methods developed for supervised learning. In these approaches, a set of (diverse) classifiers is learned and the classification of new data points is obtained by taking a (weighted) vote of their predictions. Bayesian averaging can be considered as a precursor method. Numerous new ones have been developed, from error-correcting output coding to Bagging and Boosting, and their application in various domains has become routine, often with good results.

In collaborative clustering (the subject of the remainder of this paper), the group jointly solves problems defined and imposed by the central controller, which assigns an individual task to each learner. Interactions are recurrent between team members, responsibility is collective, and the action of each teammate is geared to the performance of the group and vice versa. In contrast to the cooperative clustering model, the collaborative model does not seek an overall, hopefully better, clustering of a given data set through the combination of individual solutions. In the collaborative framework, the goal is that each local computation, quite possibly applied to distinct data sets, benefits from the work done by the other collaborators. This can be done through the exchange of information about the local data, or the current hypothesized local clustering, or the value of one algorithm's parameters. The validity of the approach rests on the assumption that useful information can be shared among the local tasks. This scheme leads naturally to distributed implementations of the computations, but unlike in the cooperative framework, it generally entails several iterations at each local node because convergence of the consensus solution requires several passes of the algorithm. Indeed, in addition to the problem of what information to exchange between collaborators, one question is how to measure the evolution at each node and on a global level.

There are many applications in unsupervised learning for which collaborative clustering can prove useful:

• Multi-scale analysis: In this case several algorithms would be analyzing the same objects, all looking at the same features, but searching for a different number of clusters. That kind of analysis can be beneficial for data sets that have intrinsic multi-scale structures, such as satellite images, for which a lower level analysis of global landscape areas (urban areas, water bodies, forests) often helps to improve a higher level analysis of smaller details (trees, cars, houses, gardens, streets, etc.).

• Multi-expert analysis: In this case, all algorithms would be working on the same objects and features of a difficult data set. Given the very high number of existing clustering algorithms, all more or less specialized and that may or may not give good results depending on the problem, trying several of them on a data set and having them exchange their information could be justified: merging the information on clusters found only by some clustering algorithms, refining the results based on clusters that are more or less well identified depending on the method, etc.

• Multi-view clustering [5,6]: Different algorithms process different types of attributes for the same objects. For example, one algorithm for geometric attributes, one for text attributes, one for colors, one for numerical attributes, etc. The goal of the collaboration in this case would be to have each attribute type processed by a specialized algorithm while giving these algorithms a more global picture of the data set by enabling some exchanges between them.

• Clustering of distributed data [7]: The same objects have their attributes split on several databases that can't exchange their data because of privacy issues. While the name is different, this is in fact very much equivalent to multi-view clustering.

• Big Data Clustering [8]: Data sets that are too large or have too many attributes to be processed efficiently by a single algorithm may be easier to tackle once their attributes are split and processed by several algorithms. This type of clustering is useful in the area of Big Data analysis and would require a high degree of cooperation between the algorithms to get the global picture.

As one can see, all these applications have a lot of similarities: we have several algorithms working on the same data or subsets of the same data, and that will or could at some point try to aggregate or to mutually exploit their respective results. While some of these applications could be considered a field of their own, such as multi-view clustering or distributed clustering [5], all of them can be classified as horizontal collaborative clustering frameworks [9–12]: several algorithms working on the same data, possibly looking for a different number of clusters, and not necessarily having access to the same features.

We generally distinguish between two types of collaborative methods [9,11]: Vertical collaboration encompasses all cases where several algorithms are working on different data that have similar clusters or distributions. Horizontal collaboration deals with cases where several algorithms are collaborating on the same objects, possibly described from different views. In this article, we are mostly interested in horizontal collaboration.

Collaborative methods usually follow a two-step procedure [13]:

1. Local step: Each algorithm individually processes the data it has access to and produces a local clustering partition.
2. Collaborative step: The algorithms share their results and try to confirm or improve their models with the goal of achieving better clustering results.

These two steps are sometimes followed by an aggregation step which aims at reaching a consensus with the final results after collaboration. In this work we will not address the aggregation step because it is a field of its own, and because, depending on the application, it may not always be advisable to aggregate, for instance when the different views, sites or scales have conflicting partitions [14]. We will instead focus on the collaborative step, where the algorithms exchange bits of information with a goal of mutual improvement.

From there, the main difference between what is traditionally referred to as "clustering ensemble learning" [15] and collaborative clustering is that clustering ensemble learning methods aim at finding a single consensus partition, while collaborative clustering does not have this final goal. In short, the field of collaborative clustering is concerned with finding algorithms and functions that allow algorithms to share information and to improve their results based on each other's similarities, while the field of ensemble learning is more concerned with finding algorithms and methods to merge the solutions or find a consensus between them. Collaborative clustering can therefore be a task of its own (e.g. multi-view clustering, where consensus is not always possible or advisable), or a preliminary step to an ensemble learning task. The methods and techniques used by both fields therefore naturally overlap, and a good collaborative algorithm must respect properties that are very similar to those of a good ensemble learning method:

• Robustness: The collaborative process must lead on average to partitions that are better than the local clustering results.

• Consistency: The updated results must be somehow similar to the original local results.

• Novelty: Collaborative clustering must make it possible to find solutions that would have been otherwise unattainable locally.

• Stability: The results must have a lower sensitivity to noise.

Within this context, in this article we introduce a new and original framework for collaborative clustering that can be applied to the various types of unsupervised collaborative learning tasks that we have previously discussed. Our proposed method lifts several limitations of previous ensemble learning and collaborative frameworks: the data need not be shared between the different algorithms, the number of clusters can be different between the algorithms, and very different types of algorithms can collaborate together.

The theoretical basis of our work is close to the work of Bickel and Scheffer on the estimation of Mixture Models using Co-EM [16,17]. Our proposed method differs from theirs in the following points: in our case we are treating a broader context than multi-view clustering. Our method makes it possible for algorithms from different families to work together, and once again we do not have the limitation that all algorithms should be searching for the same number of clusters. We propose a variational version of their work for multi-view clustering based on the optimization of a different objective function. The core of our proposed approach is a different discretization process based on a particular class of a posteriori distributions called "combination functions", presented in Section 3.4.1.

The remainder of this article is organized as follows:

In Section 2, we propose a state of the art in which we introduce some of the pioneering and earlier proposed methods and frameworks for collaborative learning with their strengths and weaknesses.

In Section 3, we introduce our proposed method for horizontal collaborative clustering. As stated previously, the method that we propose aims at being more generic than the previously proposed frameworks. We begin by explaining the principle of our method and its theoretical basis. Then we study the stopping criterion and parameter tuning of our algorithm. Finally, we demonstrate that our proposed method has good convergence properties similar to those of an EM algorithm.

In Section 4, we show some experimental results. We are mostly interested in showing some potential applications of our proposed method applied to multi-scale clustering and multi-view clustering.

Finally, this work ends with a conclusion and perspectives on future works.

2. State of the art in collaborative clustering

One of the first collaborative clustering algorithms was introduced in 2002 by Pedrycz [13,18] under the name "Collaborative Fuzzy Clustering" (CoFC). This method was designed for the specific case of distributed data where the information cannot be shared between the different sites. It is based on a modified version of the Fuzzy C-Means algorithm [19].

The main limitation of this approach is that it only enables Fuzzy C-Means algorithms to collaborate together, and furthermore some methods even require that all of them be looking for the same number of clusters.

Similar approaches were used to develop several other collaborative-like methods: CoEM [17], CoFKM [20], and another collaborative EM-like algorithm [21] based on Markov Random Fields.

All these algorithms display similar limitations: the objective functions and sometimes the number of clusters must be identical for all exchanged information. This is due to the fact that they all try to optimize an objective function the form of which is:

$$(S^{opt}, \Theta^{opt}) = \operatorname*{Argmax}_{(S,\Theta)} L_g(S,\Theta) = \operatorname*{Argmax}_{(S,\Theta)} \sum_{i=1}^{J} \Big[ L(X^i \mid S^i, \Theta^i) - \sum_{j \neq i} \tau_{j,i} \cdot \Delta(\Theta^i, \Theta^j) \Big] \tag{1}$$

where $J$ is the number of collaborators, $S$ contains all algorithms' partitions, $\Theta$ their distribution parameters, $L_g(S,\Theta)$ is the global likelihood of the system, each $L(X^i \mid S^i, \Theta^i)$ is the local log-likelihood of a collaborating algorithm, each $\Delta(\Theta^i, \Theta^j)$, the "collaborative term", is a custom pairwise penalty that compares the difference between the parameters or prototypes of two algorithms, and the $\tau_{j,i}$, which do not exist in all methods, are weights given to the collaborative penalties. The definition of the local term $L(X^i \mid S^i, \Theta^i)$, which depends on which algorithms collaborate together, makes the main difference between all these methods, while the definition of the penalty $\Delta(\Theta^i, \Theta^j)$ only slightly differs depending on the collaborative method. This latter term is the limiting one, since comparing prototypes and parameters requires that the algorithms have the same types of prototypes and some kind of mapping between the clusters of the different algorithms.

The work of Pedrycz on the CoFC algorithm was also extended to be adapted to the Self-Organizing Maps (SOM) [11,22,23] and to the Generative Topographic Maps (GTM) [24].

In [23], the classical SOM objective function is modified by adding a specific extra term for horizontal collaboration and a different one for vertical collaboration. For the collaborative version of the GTM algorithm [24], the principle is the same, with the M-Step of the EM algorithm mapping the neurons to the final clusters being modified.
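To make the structure of Eq. (1) more concrete, the following minimal sketch evaluates an objective of that shape on toy values. It is only an illustration under stated assumptions: the pairwise penalty is taken here to be a sum of squared Euclidean distances between prototypes matched by index, which is one possible choice and not the penalty of any specific method cited above, and all function names are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two prototype vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def prototype_penalty(protos_i, protos_j):
    """Hypothetical pairwise penalty: prototypes are compared index by index,
    so both algorithms must expose the same number (and type) of prototypes."""
    if len(protos_i) != len(protos_j):
        raise ValueError("prototype-based penalties need matching prototypes")
    return sum(squared_distance(p, q) for p, q in zip(protos_i, protos_j))


def global_objective(local_loglik, prototypes, tau):
    """Objective shaped like Eq. (1):
    sum_i [ L_i - sum_{j != i} tau[j][i] * penalty(i, j) ]."""
    n_algos = len(prototypes)
    total = 0.0
    for i in range(n_algos):
        penalty = sum(
            tau[j][i] * prototype_penalty(prototypes[i], prototypes[j])
            for j in range(n_algos)
            if j != i
        )
        total += local_loglik[i] - penalty
    return total


# Toy values (assumptions): two collaborators with two 2-D prototypes each.
local_loglik = [-10.0, -12.0]
prototypes = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 1.0]],
]
tau = [[0.0, 0.5], [0.5, 0.0]]
value = global_objective(local_loglik, prototypes, tau)
```

The `ValueError` branch mirrors the limitation discussed in the text: as soon as two collaborators do not share the same number of comparable prototypes, a penalty of this kind is no longer defined.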

One problem with these two methods is that they do not really solve the main issue of collaboration between different types of algorithms, since their model is once again analogous to the one in Eq. (1). Furthermore, while the number of clusters does not matter in the case of the collaborative SOM and collaborative GTM, in both cases the maps must have the same number of neurons and be topologically similar to each other. This is actually even more restrictive than a requirement on the number of clusters.

The SAMARAH method [25,26] is another type of collaborative framework, the strength of which is that it can deal with any kind of hard clustering algorithm and is not concerned with issues such as fitness functions, number of clusters, or prototypes. Unlike the previously introduced methods, SAMARAH only handles horizontal collaboration due to the lack of prototypes, and was designed mostly for clustering applied to image data. Its goal is very simple: given $J$ clustering results for the same data, the idea is to modify these results in an iterative and collaborative way with the aim of reducing their diversity in order to make the finding of a consensus solution easier.

Once the results have been generated during the local step, the SAMARAH method maps the clusters of the different algorithms using probabilistic confusion matrices (PCM). Let $S^i$ and $S^j$ be two clustering results from two algorithms $A^i$ and $A^j$ looking for $K_i$ and $K_j$ clusters respectively.

Then, the probabilistic confusion matrix (PCM) $\Omega^{i,j}$ that maps the clusters from $A^i$ to $A^j$ is defined as shown below:

$$\Omega^{i,j} = \begin{pmatrix} \omega^{i,j}_{1,1} & \cdots & \omega^{i,j}_{1,K_j} \\ \vdots & \ddots & \vdots \\ \omega^{i,j}_{K_i,1} & \cdots & \omega^{i,j}_{K_i,K_j} \end{pmatrix} \quad \text{where} \quad \omega^{i,j}_{a,b} = \frac{|S^i_a \cap S^j_b|}{|S^i_a|} \tag{2}$$

In Eq. (2), $S^i_a$ denotes the $a$-th cluster of algorithm $A^i$, i.e., $S^i_a = \{x \in X^i : x \text{ is assigned to cluster } a \text{ by } A^i\}$, $|S^i_a|$ is the number of data in this cluster, and $|S^i_a \cap S^j_b|$ is the number of data linked to the $a$-th cluster of $A^i$ and the $b$-th cluster of $A^j$ at the same time. The PCM $\Omega^{i,j}$ makes it possible to know whether or not the objects of two results have been grouped in a similar way, or if the two clustering results are dissimilar. The matrix has a key role in the comparison of two clustering results (such as detecting agreements and conflicts), and has the major advantage of being independent from the clustering algorithm used to generate the results.
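The entries of Eq. (2) can be computed directly from two hard label vectors describing the same objects. The sketch below is a minimal illustration and not the SAMARAH implementation; the function name, the toy labels and the convention of returning a row of zeros for an empty cluster are assumptions made for the example.

```python
from collections import Counter


def probabilistic_confusion_matrix(labels_i, labels_j, k_i, k_j):
    """Return the K_i x K_j matrix whose entry (a, b) is
    |S^i_a ∩ S^j_b| / |S^i_a|, as in Eq. (2)."""
    if len(labels_i) != len(labels_j):
        raise ValueError("both partitions must describe the same objects")
    joint = Counter(zip(labels_i, labels_j))  # |S^i_a ∩ S^j_b|
    sizes = Counter(labels_i)                 # |S^i_a|
    pcm = []
    for a in range(k_i):
        row = []
        for b in range(k_j):
            # Assumed convention: an empty cluster a yields a row of zeros.
            row.append(joint[(a, b)] / sizes[a] if sizes[a] else 0.0)
        pcm.append(row)
    return pcm


# Toy labels (assumptions): six objects, K_i = 2 and K_j = 3.
labels_i = [0, 0, 0, 1, 1, 1]
labels_j = [0, 0, 1, 1, 2, 2]
pcm = probabilistic_confusion_matrix(labels_i, labels_j, 2, 3)
```

Each non-empty row sums to one, so row $a$ can be read as the distribution of the objects of cluster $a$ of $A^i$ over the clusters of $A^j$. Note that the matrix is not symmetric: mapping $A^j$ to $A^i$ normalizes by the cluster sizes of $A^j$ instead.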

The SAMARAH method uses this matrix to detect pairwise conflicts between the different partitions and reduces them by order of perceived importance based on a conflict metric criterion [25] by splitting, merging, or removing clusters. Once the solutions have all been refined, and are consequently quite similar to each other, it proceeds with aggregating them using a process similar to a majority vote [27]. It is therefore a very complete framework that covers all 3 steps of local learning, collaborative learning and result aggregation, and does not rely on user parameters.

However, its conflict resolution system certainly is a weak point: it relies on a pairwise conflict criterion and solves the conflicts one by one by order of perceived importance, which can lead to sub-optimal results. Finally, while it is also a strong point of the method, the fact that the algorithms' parameters or prototypes do not play any role once the local step is over may constitute a weakness, in the sense that the local model is never rebuilt using the new partitions and does not play any active role in either the collaborative step or the consensus step.

3. Horizontal collaborative clustering guided by diversity

3.1. Formalism

In horizontal collaborative clustering we consider a finite group of algorithms $A = \{A^1, ..., A^J\}$ that are working on the same data elements, albeit possibly with access to different features, and also possibly looking for a different number of clusters. No assumptions are made on the algorithms themselves. Let $X = \{x_1, ..., x_N\}$, $x_n \in \mathbb{R}^d$, be a data set containing $N$ elements, each of them with $d$ real-valued features.

Each clustering algorithm $A^i$ has its own parameters to describe either the clusters or its model, and produces its own clustering solution $S^i$ made of $K_i$ clusters, based on the features of the data set $X^i \subseteq X$ it has access to. In the case of hard clustering, $S^i$ can be translated into a solution vector of size $N$, and for fuzzy clustering into a matrix of size $N \times K_i$. We denote this latter matrix $S^i = (s^i_{n,c})$, where $1 \le n \le N$ and $1 \le c \le K_i$. The solutions $S^i$ output by the algorithms are therefore two-dimensional matrices of size $N \times K_i$ where each element $s^i_{n,c}$ expresses the responsibility (probability) given by algorithm $A^i$ to a cluster $c$ for the data element $x_n$.

Each algorithm $A^i$ computes the solutions $S^i$, as usual, by introducing a latent discrete random vector $Z^i$ defined on some latent space with the range $[1, ..., K_i]$, hence computing the a posteriori distribution of the variable $Z^i$ conditionally on $X^i$ and $S^i$.

Finally, in order to quantify the degree of information coming from the collaboration, for a given algorithm $A^i$, we will assume the existence of some weights $\tau_{j,i} \in (0,1)$, which measure the relative external information from the algorithm $j \neq i$ accepted by $A^i$. All weights $\tau_{j,i}$ are stored in a square matrix of size $J \times J$ which therefore contains the strength of all collaboration links. Most notations used in this article are summed up in Table 1 below.
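As an illustration of the two representations described in this subsection, the sketch below turns a hard solution vector of size $N$ into an $N \times K_i$ matrix and checks that a matrix is a valid set of responsibilities. The helper names, the zero-based cluster indices and the toy values are assumptions made for the example.

```python
def hard_to_matrix(labels, k):
    """One-hot encode a hard partition: row n has a 1 in column labels[n]."""
    return [[1.0 if c == label else 0.0 for c in range(k)] for label in labels]


def is_valid_partition(s, tol=1e-9):
    """Check that every row of an N x K_i matrix is a probability vector."""
    return all(
        all(-tol <= v <= 1.0 + tol for v in row) and abs(sum(row) - 1.0) <= tol
        for row in s
    )


# Toy solutions (assumptions): N = 4 objects, K_i = 3 clusters.
hard = [0, 2, 1, 2]
s_hard = hard_to_matrix(hard, 3)
s_fuzzy = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.6, 0.1],
    [0.25, 0.25, 0.5],
]
```

With this encoding, hard and fuzzy solutions share the same $N \times K_i$ shape, which is what allows algorithms of different families to exchange partitions without exchanging prototypes.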

3.2. Problem formulation

Within the context of horizontal collaboration that we have presented before, the method that we propose takes many advantages of both prototype-based collaborative methods and the SAMARAH method, without their issues.

Our goal in this section is to find a way to modify Eq. (1) so that the collaborative term will not depend on the prototypes. Therefore, we propose a likelihood function based on Eq. (3), which uses a global consensus term $C(S)$ based on the partitions. The main differences with Eq. (1) are that we use a model based on partitions rather than prototypes, our proposed model is consensus based instead of divergence based, and we use a global term instead of a pairwise one. We chose this global model because, unlike the pairwise version, it does not require assuming that the algorithms are independent from each other (which is of course not true).

In this model, $\lambda \in [0,1]$ is a weight parameter to balance between the local and collaborative terms. The left term $\sum_{i=1}^{J} L(X^i \mid S^i, \Theta^i)$ is called the local term, and the right term $\lambda \cdot C(S)$ is the collaborative term. Note that the $C(\cdot)$ here stands for "consensus": we have a collaborative term based on a consensus function.

$$(S^{opt}, \Theta^{opt}) = \operatorname*{Argmax}_{(S,\Theta)} L_g(S,\Theta) = \operatorname*{Argmax}_{(S,\Theta)} \sum_{i=1}^{J} L(X^i \mid S^i, \Theta^i) + \lambda \cdot C(S) \tag{3}$$

With this model, and using a collaborative term based on different a posteriori distributions instead of a collaborative term based on distribution parameters, our proposed model lifts the limitation that only identical algorithms looking for the same number of clusters can work together. Furthermore, using our model, even non-parametric algorithms (for which the distribution parameters $\Theta^i$ cannot be explicitly formulated) can be used in a collaborative setting, since our model is based on the partitions (solution matrices or vectors), which are explicit for any clustering algorithm. The penalty factor $\lambda > 0$ regularizes the collaboration part. Please note that in [28], the authors have demonstrated that there is a direct relation between reducing the divergences and maximizing the consensus under mild assumptions. Therefore, both strategies are equivalent.

Analogously to Eq. (3), our idea is to optimize a modified fitness of the log-likelihood function that considers both the local partitions and the information coming from the other algorithms' solutions. By considering only the partitions $S^i$ and not the parameters, very much like in the SAMARAH method [25,26], we ensure that our model is generic.

As we will demonstrate in the next subsection, this change from $\Theta^i$ to $S^i$ is made possible because we use an alternating maximization procedure in which the partitions are computed from the prototypes and then the prototypes are updated based on the partitions and the data. In short, the partitions can be seen as a discretization of the distributions described by the prototypes.

While this improvement will result in a more generic paradigm when it comes to horizontal collaboration, it is worth mentioning that removing the prototypes also makes vertical collaboration (algorithms collaborating on different data sets with similar clusters) impossible, whereas some of the earlier methods covered this case of knowledge transfer between similar data sets [11,13,24], albeit only between identical algorithms.

To optimize Eq. (3) we use the Expectation Maximization (EM) strategy. The workflow in Algorithm 1 highlights how our algorithm can indeed be considered as an EM algorithm. During the E-Step, the partitions $S$ are updated using fixed values for the distribution parameters $\Theta$. Then, during the M-Step, these parameters are updated based on the new partitions.

The exact form of the functional $L_g$ is explained in the next section, while the stopping criterion is detailed in Section 3.5.
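The alternation between E-Step and M-Step described above can be sketched as a generic loop. This is only a skeleton under stated assumptions: `e_step`, `m_step` and `entropy` are hypothetical callables supplied by the user (the actual entropy criterion is the one detailed in Section 3.5), and the loop keeps the last state for which the entropy decreased, which is one possible reading of the stopping rule.

```python
def collaborative_em(theta0, e_step, m_step, entropy, max_iter=100):
    """Alternate E- and M-steps while the global entropy keeps decreasing.

    e_step(theta) -> partitions, m_step(partitions) -> theta and
    entropy(partitions) -> float are assumed to be provided by the caller.
    """
    theta = theta0
    partitions = e_step(theta)
    best_h = entropy(partitions)
    for _ in range(max_iter):
        new_theta = m_step(partitions)       # M-Step: update the parameters
        new_partitions = e_step(new_theta)   # E-Step: update the partitions
        h = entropy(new_partitions)
        if h >= best_h:                      # entropy stopped decreasing
            break
        theta, partitions, best_h = new_theta, new_partitions, h
    return partitions, theta


# Toy callables (assumptions) that only exercise the control flow.
def toy_e_step(theta):
    return theta


def toy_m_step(partitions):
    return partitions / 2 if partitions > 1 else partitions


result = collaborative_em(8.0, toy_e_step, toy_m_step, abs)
```

Passing the three steps as callables keeps the skeleton independent of the family of each collaborating algorithm, in line with the goal of a generic framework.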

Table 1. Notations.

| Notation | Development | Comment |
|---|---|---|
| $X^i$ | $X^i = \{x^i_1, ..., x^i_N\}$, $x^i_n \in \mathbb{R}^d$ | The subset of the data observed by algorithm $A^i$ |
| $X$ | $X = \{X^1, ..., X^J\}$ | The full data with all views |
| $\Theta^i$ | | The parameters describing the distributions observed by algorithm $A^i$ |
| $\Theta$ | $\Theta = \{\Theta^1, ..., \Theta^J\}$ | The set of distribution parameters for all algorithms |
| $A^i$ | $A^i = \{X^i, S^i, \Theta^i, K_i\}$ | An algorithm looking for $K_i$ clusters of distribution parameters $\Theta^i$ in the subset $X^i$ and finding a partition $S^i$ |
| $\tau_{j,i}$ | $\tau_{j,i} \in [0, 1]$ | The weight of the collaboration from $A^j$ to $A^i$ |
| $s^i_{n,c}$ | $s^i_{n,c} \in (0, 1)$, $\sum_{c=1}^{K_i} s^i_{n,c} = 1$ | The responsibility given by algorithm $A^i$ to the cluster $c \in [1..K_i]$ for the data $x^i_n$ |
| $S^i$ | $S^i = (s^i_{n,c})_{N \times K_i}$ | The partition found by algorithm $A^i$. For fuzzy clusters, $S^i$ is a matrix. |
| $Z^i$ | $Z^i : \Omega \to [1..K_i]$ | The latent random vector linked to the solutions of algorithm $A^i$ |
| $P(Z^i \mid X^i, \Theta^i)$ | | The a posteriori distribution of $Z^i$ conditionally on $X^i$ and $\Theta^i$ |
| $H$ | See Eq. (16) | The global entropy of the collaborative system for all algorithms |
| $\omega^{i,j}_{a,b}$ | $\omega^{i,j}_{a,b} = P(Z^j_n = b \mid Z^i_n = a, S, X, \Theta)$ | The percentage of data associated to cluster $a$ by $A^i$ that belong in the cluster $b$ of $A^j$ |
| $q$ | $q = \{q_1, ..., q_J\}$, $\forall i \; q_i \in [1..K_i]$ | A combination of clusters (see Section 3.4) |
| $g^i(q, c)$ | $g^i(q, c) \in (0, 1)$, $c \in [1..K_i]$ | A consensus function assessing the likelihood of having $q_i = c$ knowing the rest of $q$ |

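Algorithm 1: Collaborative "EM".

Initialize $t = 0$ and $\Theta^{(0)}$ with the local step
while the global entropy $H$ decreases do
    E-Step: $S^{(t)} = \operatorname{Argmax}_{S} L_g(S, \Theta^{(t)})$
    M-Step: $\Theta^{(t+1)} = \operatorname{Argmax}_{\Theta} L_g(S^{(t)}, \Theta)$
    $t = t + 1$
end
Return $S^{(t)}$

3.3. Objective function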

The fundamental question in the horizontal collaborative setting is to find the right functional to optimize so that we can properly answer the problem of having several algorithms working together by exchanging their information with a goal of mutual improvement. To do so, we have the following constraints: We want a functional similar to Eq. (3) based on the partitions instead of distribution prototypes, where we attempt to bias each local solution $S^i_t$ so that $S^i_{t+1}$ takes into account the information from the other partitions without using any prototypes. The problem therefore consists in finding the right local and collaborative terms.

Defining the local term is relatively easy and can be done using any kind of likelihood function for probabilistic algorithms, and an ad-hoc normalized quality criterion for other types of algorithms. The literature is also full of potential functions for the collaborative term that measure the divergence or consensus between two partitions (NMI, entropies, Rand Index, etc.). However, if we add the extra constraint that the partitions are mostly non-binary and that Eq. (3) should be optimized in a reasonable amount of time, we face the following problem: For vector partitions of size $N$, most of these operators have a complexity in $O(N^2)$. Therefore, the final cost of updating all partitions for the $J$ algorithms looking on average for $\bar{K}$ clusters would be equivalent to calling these operators $J \times N \times \bar{K}$ times, hence a final complexity of $O(N^3)$ (treating $J$ and $\bar{K}$ as constants) just to optimize the collaborative term.

Since such complexity obviously does not scale well, in the remainder of this section we explain how we re-designed a likelihood function from scratch using a solid probabilistic model. Then, in Section 3.4, we show how to optimize this new function with a low complexity of $O(N)$. Very much like in Eq. (3), we consider that the functional in the collaborative setting is decoupled into two different terms, the local term $L(S,\Theta)$ computed from all local log-likelihoods or quality indexes, and the collaborative term $C(S)$ in the form of a global consensus function between the partitions. More precisely, the global likelihood function writes:

$$L_g(S,\Theta) = L(S,\Theta) + \lambda \cdot C(S), \tag{4}$$

where $X$ is the observed variable, $\Theta$ the set of parameters and $S = (S^1, ..., S^J)$ is the set of all partitions.

In the first term $L$ in Eq. (4), just as in Eq. (3), we express the log-likelihood of $S$ based only on the local information and model of each algorithm taken individually and the data $x_n$. We then evaluate the log-likelihood of the completed sample against the a posteriori distribution of $(Z^i \mid X^i_n, \Theta^i)$.

$$L(S,\Theta) = \sum_{i=1}^{J} \sum_{n=1}^{N} P(Z^i_n \mid X^i_n, \Theta^i) \cdot \log P(X^i_n, Z^i_n \mid \Theta^i). \tag{5}$$
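As a rough numerical illustration of Eq. (5), the sketch below evaluates a local term from precomputed probabilities. It rests on several assumptions: the expectation over the latent variable is written as an explicit sum over the cluster values $c$, the posteriors and joint probabilities are given as nested lists indexed by algorithm, data element and cluster, and the function name and toy values are hypothetical.

```python
import math


def local_term(posteriors, joint_probs):
    """Sum over algorithms i, data n and clusters c of
    P(Z = c | x, Theta) * log P(x, Z = c | Theta)."""
    total = 0.0
    for post_i, joint_i in zip(posteriors, joint_probs):  # per algorithm
        for post_n, joint_n in zip(post_i, joint_i):      # per data element
            for p_post, p_joint in zip(post_n, joint_n):  # per cluster
                if p_post > 0.0:  # assumed convention: 0 * log(0) = 0
                    total += p_post * math.log(p_joint)
    return total


# Toy values (assumptions): one algorithm, one data element, two clusters.
posteriors = [[[0.5, 0.5]]]
joint_probs = [[[0.25, 0.25]]]
value = local_term(posteriors, joint_probs)
```

Skipping the terms with a zero posterior avoids evaluating the logarithm of a zero joint probability for clusters that carry no responsibility.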

The second term of Eq. (4) is detailed in Eq. (6). It is computed from the likelihood that each element $x_n$ be linked to the right cluster based on the other algorithms' partitions and the choice of cluster for the same data in the local view. The difference between the local likelihood and the likelihood based on the other algorithms gives us the collaborative term. This term $C(S)$ therefore is the likelihood of $S$ based on all the solutions.

$$C(S) = \sum_{i=1}^{J} \sum_{n=1}^{N} \Big[ P(Z^i_n \mid X_n \setminus X^i_n, S) - P(Z^i_n \mid X^i_n, \Theta^i) \Big] \cdot \log P(X^i_n, Z^i_n \mid \Theta^i) \tag{6}$$

Then, using Eqs. (5) and (6), we obtain the following a posteriori probability for the completed sample $X^i_n, Z^i_n$ corresponding to algorithm $A^i$:

$$P(Z^i_n = c \mid X^i_n, \Theta^i, S) = (1 - \lambda) \cdot P(Z^i_n = c \mid X^i_n, \Theta^i) + \lambda \cdot P(Z^i_n = c \mid X_n \setminus X^i_n, S) \tag{7}$$

Note that due to the lack of independence, $P(Z^i \mid X_n \setminus X^i_n, S)$ is not tractable. Nevertheless, in the next section we show tractable update rules for the responsibilities.
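The step from Eqs. (4)-(6) to Eq. (7) can be spelled out as follows. This is a short worked derivation, assuming that the difference of probabilities in Eq. (6) multiplies the logarithm as a whole, and using three shorthands introduced here only for readability: $p^{loc}_{i,n} = P(Z^i_n \mid X^i_n, \Theta^i)$, $p^{col}_{i,n} = P(Z^i_n \mid X_n \setminus X^i_n, S)$ and $\ell_{i,n} = \log P(X^i_n, Z^i_n \mid \Theta^i)$.

```latex
\begin{align*}
L_g(S,\Theta)
  &= L(S,\Theta) + \lambda \cdot C(S) \\
  &= \sum_{i=1}^{J}\sum_{n=1}^{N} p^{loc}_{i,n}\,\ell_{i,n}
     + \lambda \sum_{i=1}^{J}\sum_{n=1}^{N}
       \bigl(p^{col}_{i,n} - p^{loc}_{i,n}\bigr)\,\ell_{i,n} \\
  &= \sum_{i=1}^{J}\sum_{n=1}^{N}
     \bigl[(1-\lambda)\,p^{loc}_{i,n} + \lambda\,p^{col}_{i,n}\bigr]\,\ell_{i,n}
\end{align*}
```

The bracketed weight in the last line has exactly the form of the right-hand side of Eq. (7). For $\lambda \in [0,1]$ it is a convex combination of two distributions over the clusters, so it is itself a distribution over the clusters.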

3.4. Update rules

In this section, we will proceed with the practical description of the update rules for the responsibilities $s^i_{n,c}$ so that we can actually compute the partitions that are solutions of the functional from Eq. (7). For fuzzy clustering we then infer that the update rule for the responsibility for all data $x_n$ and all clusters $c$ from iteration $t$ to iteration $t+1$ during the E-step of Algorithm 1 is the following:

$$s^i_{n,c}(t+1) = (1 - \lambda) \cdot s^i_{n,c}(t) + \lambda \cdot \sum_{q \in Q \mid q_i = c} P\bigl(q \mid X_n \setminus X^i_n, \Theta_t \setminus \Theta^i(t)\bigr) \cdot P(Z^i_n = q_i \mid q) \tag{8}$$
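The overall shape of this update, a convex combination of the current local responsibilities and a collaborative estimate, can be sketched as follows. The sketch deliberately treats the collaborative part (the sum over the combinations $q$) as an input that is already computed, since its evaluation is not covered at this point of the text; the function name and the toy values are assumptions.

```python
def update_responsibilities(s_local, p_collab, lam):
    """Blend local responsibilities with a collaborative estimate:
    s(t+1) = (1 - lam) * s(t) + lam * p_collab, element by element."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam is expected in [0, 1]")
    return [
        [(1.0 - lam) * s + lam * p for s, p in zip(s_row, p_row)]
        for s_row, p_row in zip(s_local, p_collab)
    ]


# Toy values (assumptions): N = 2 data elements, K_i = 2 clusters.
s_local = [[0.9, 0.1], [0.4, 0.6]]
p_collab = [[0.5, 0.5], [0.0, 1.0]]
s_next = update_responsibilities(s_local, p_collab, 0.2)
```

With `lam = 0` the local solution is left untouched, and with `lam = 1` it is fully replaced by the collaborative estimate. As long as both inputs have rows summing to one, so do the updated rows.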