To link to this article : DOI :
10.1016/j.ipm.2016.05.002
URL :
http://dx.doi.org/10.1016/j.ipm.2016.05.002
To cite this version :
Soulier, Laure and Tamine, Lynda and Shah, Chirag
MineRank: Leveraging Users' latent roles for Unsupervised Collaborative
Information Retrieval. (2016) Information Processing & Management, vol.
52 (n° 6). pp. 1122-1141. ISSN 0306-4573
O
pen
A
rchive
T
OULOUSE
A
rchive
O
uverte (
OATAO
)
OATAO is an open access repository that collects the work of Toulouse researchers and
makes it freely available over the web where possible.
This is an author-deposited version published in :
http://oatao.univ-toulouse.fr/
Eprints ID : 18772
Any correspondence concerning this service should be sent to the repository
administrator:
staff-oatao@listes-diff.inp-toulouse.fr
MineRank:
Leveraging
users’
latent
roles
for
unsupervised
collaborative
information
retrieval
Laure
Soulier
a,∗,
Lynda
Tamine
b,
Chirag
Shah
ca Sorbonne Universités, UPMC Univ Paris 06, CNRS UMR 7606, LIP6, F-75005, Paris, France b Toulouse University UPS IRIT 118 route de Narbonne, 31062 Toulouse Cedex 9, France
c School of Communication & Information (SC&I) Rutgers University 4 Huntington St, New Brunswick, NJ 08901, USA
Keywords:
Collaborative information retrieval Unsupervised role mining Latent role
User study
a b s t r a c t
Research on collaborativeinformation retrieval (CIR)has shown positiveimpacts of col-laborationonretrievaleffectivenessinthecaseofcomplexand/orexploratorytasks.The synergiceffectofaccomplishingsomethinggreaterthanthesumofitsindividual compo-nents is reachedthrough the gatheringof collaborators’ complementaryskills. However, these approachesoften lacktheconsiderationthat collaboratorsmight refinetheirskills and actions throughoutthesearch session, and thata flexible systemmediation guided bycollaborators’behaviorsshoulddynamicallyadapttothissituationinordertooptimize searcheffectiveness.Inthisarticle,weproposeanewunsupervisedcollaborativeranking algorithm whichleveragescollaborators’ actionsfor(1)miningtheirlatentrolesinorder toextracttheircomplementarysearchbehaviors;and(2)rankingdocumentswithrespect to thelatentrole ofcollaborators. Experimentsusing twouser studieswith respectively 25and10pairsofcollaboratorsdemonstratethebenefitofsuchanunsupervisedmethod drivenbycollaborators’ behaviorsthroughoutthesearchsession.Also,aqualitative anal-ysisoftheidentifiedlatentroleisproposedtoexplainanover-learningnoticedinoneof thedatasets.
1. Introduction
In recent years, researchers have argued that in addition to creating better algorithms and systems for individualized
search and retrieval, a substantial leap can be taken by incorporating collaborative aspects in Information Retrieval (IR)
(Twidale, Nichols,& Paice, 1997), referredto as CIR (Fidelet al., 2000). However, simplyallowing multiple people
collab-orate on a search task does not guarantee any advantages over a single searcher. To gainadvantages, one needs to look
deeperinto theaspects ofcollaboration that make itsuccessful and investigate how those aspectscan beincorporatedin
asearchsetting. Manyhave foundthat whenthecollaborators bringadiverse set ofskillstoaproject, theycouldachieve
something morethan what they could usingtheir individualskills and contributions (Soulier, Shah,& Tamine, 2014). But
how does one ensure theuse ofsuch diverse skills in search?One approach could be askingthe searchers involved in a
CIR project about theroles (e.g.,query constructor, informationassessor)they wouldlike toplay. However, these
collabo-rators may eithernot knowabout suchskillsthey have ormay beunable to specify any preferences. Therefore, one may
needto automaticallymine theirskillsthrough behavioral features.Recently,thisapproach (Soulieretal., 2014),hasbeen
∗ Corresponding author.
proposed,aimingatdynamicallyidentifying,throughsearchfeatures,thepossiblerolesofcollaborators accordingtoarole
taxonomy(Golovchinsky,Qvarfordt,&Pickens,2009).However,thelabeledrolesarepredefinedregardlessoftheusers,and
accordingly, one shortcoming of the proposed approach is its inability to ensure that the identified roles exactly fit with
collaborators’searchskills.
Totacklethisgap,thiscurrentarticlepresentsanewapproach– calledMineRank– thatminesinrealtimetheunlabeled
role that collaborators play in aCIR context.The objective isto leverage thediversity of thecollaborators’searchskills in
ordertoensurethedivisionoflaborpolicy andtooptimize theoverallperformance.Insteadoffollowingpredefinedlabels
orataxonomyofroles, MineRankisanunsupervisedalgorithm that(1) learnsaboutthecomplementarity ofcollaborators’
unlabeledrolesinanunsupervisedmannerusingvarioussearchbehavior-relatedfeaturesforeachindividualinvolved, and
(2) re-injects these unlabeled roles to collaboratively rank documents. The algorithm isused for variousexperiments and
evaluatedusingretrievaleffectivenessatvariouslevels.Theresultsshowthatthemodelisabletoachievesynergiceffectin
CIRbylearningthelatentroleofcollaborators.
The remainderofthearticleisstructuredasfollows.Section2presentstherelatedwork.In Section3, wemotivateour
approachand introduce theproblem definition.Section 4focuses on ourtwo-stepunsupervised CIR modelrelying on the
collaborators’unlabeledroles. The experimentalevaluation and results aredescribedinSection 5.Section 6concludes the
article.
2. Related work
2.1. CollaborativeInformationRetrieval
CIR is definedas asearch process that involves multiple users solving ashared informationneed (Golovchinskyet al.,
2009).Researchhasfoundthatthissettingisparticularlybeneficial inthecaseofcomplexorexploratoryinformationtasks
(Morris &Horvitz, 2007), inwhich a loneindividual wouldsufferfrom insufficientknowledge orskills. Indeed,
collabora-tionin searchcouldimprove inretrieval effectivenessbyprovidingtheopportunity togathercomplementary skillsand/or
knowledgeinordertosolveaninformationneed,aswellasaddressingmutualbenefitsofcollaboratorsthroughthesynergic
effectofthecollaboration(Shah&González-Ibáñez,2011b).
Collaborationbetween usersissupportedbythreemainprinciples: (1)avoidingredundancybetweenusers’actions
(di-visionoflabor),eitheratthedocumentlevel (Foley&Smeaton,2009),ortherolelevel (Pickens,Golovchinsky,Shah,
Qvar-fordt,&Back,2008;Shah,Pickens,&Golovchinsky,2010);(2)favoringtheinformationflowamongusers(sharingof
knowl-edge), either implicitlybysearch inference (Foley& Smeaton,2010) or explicitly bycollaborative-based interfaces(Morris
&Horvitz,2007); and(3) informingusersofothercollaborators’actions(awareness) (Dourish&Bellotti,1992).Supporting
thesethreeprinciplesremainsachallenge inCIR,oftentackledwithadapted interfaces,revisitedIRtechniques, or
collabo-rativedocumentrankingmodels(Joho,Hannah,&Jose,2009).
In thisarticle, wefocusmore specificallyon thethirdaspect when dealing withCIR models.In previouswork, the
fo-cushasbeenon themediationofcollaborators’actionsand complementarityskillsinordertoenhancethesynergic effect
withincollaboration towardsthesatisfactionoftheshared informationneed(Shah & González-Ibáñez,2011b). Henceforth,
divisionoflaborisapivotalissueforcoordinatingcollaboratorsintermsofsearchactionswithrespecttotheir
complemen-taryskills.Assigningroles isonewayoftacklingthischallengebecauserolesgiveastructuretothesearchprocess(Kelly&
Payne,2013).Beyondsimplyconsideringcollaboratorsaspeersand focusingoninferringtheglobalrelevanceofdocuments
towards allcollaborators (Foley &Smeaton,2009),orpersonalizing document scoresfor eachcollaborator(Morris, Teevan,
& Bush,2008),several works(Pickens etal., 2008; Shah etal., 2010;Soulier,Tamine, &Bahsoun, 2013)propose assigning
asymmetricroles tousers inordertooptimize thecollaborative searcheffectiveness.Golovchinskyetal.(2009),suggested
theserolesinaroletaxonomy.
Pickens etal. (2008), proposed a pair of roles, namelyProspector-Miner, that involved splitting a search taskbetween
the collaborators.The Prospector was responsible for formulatinga search requestthat ensured search diversity,whereas
theMiner wasdevotedtoidentifyinghighlyrelevantdocuments.Similarly,Shahetal.(2010),proposedaCIRmodelrelying
ontheGatherer-Surveyor relationship,inwhich theformer’sgoalwastoquickly scansearchresultsand thelatter focused
ondiversity.Inthesemodels,users’rolesensuredatask-baseddivisionoflabor.
Different fromtheseworks, Soulieretal.(2013),ensured thedivisionoflabor amongcollaborators byconsideringtheir
respective domain expertise asthe core source of a collaborator’s role when aiming to solve a multi-facetedinformation
need.Forthispurpose,theauthorsstructuredcollaborators’actionsbyassigningdocumentstothemostlikelysuitedusers,
aswellasallowinguserstosimultaneouslyexploredistinctdocumentsubsets.
Anewrole-basedapproachhasbeenproposedin(Soulieretal.,2014),whichconsidersthatcollaborators’search
behav-iorswere dynamic and that their role mightevolve throughout thesearch session.This statement hasbeen also outlined
in (Tamine & Soulier, 2015).With thisin mind, collaborators’ predefinedand labeled roles, namely Prospector-Miner and
Gatherer/Surveyor,wereidentified inrealtimeassumingatask-baseddivisionoflaborpolicy basedontheirsearch
behav-ioroppositions. Then, documents are rankedaccording tothe CIR modelsassociated withthe minedroles (Pickens et al.,
2.2. Userbehaviormodelsfordocument retrieval
Theuserbehaviormodelingdomain focuseson theunderstandingoftheusermodel withinthesearchsession.Onone
hand, some work(Evans &Chi, 2010;Yue, Han,& He,2014), only focuses on theuser modelingin ahighly abstractlevel
inorder tobuild generativebehavioral models. For instance, Yue etal.(2014), analyzetemporalsequential data of
collab-orativesearch through a hidden Markovmodel. On the other hand, other research attemptsto model userbehaviors and
to re-inject them into a retrieval model in order to enhance the search effectiveness (Agichtein, Brill, Dumais, & Ragno,
2006). In the latter research domain, which iscloser to our contribution, we distinguishthree main lines ofwork based
on feature-baseddocument relevance prediction models(Agichtein etal.,2006; Radinsky etal.,2013), personalization
ap-proachesthroughusers’preferences(Bennettetal.,2012;Teevan,Dumais,&Horvitz,2005),orroleextraction-basedranking
models(Hendersonetal.,2012;McCallum,Wang,&Corrada-Emmanuel,2007).
Thefirstcategoryof researchthatdeals withprediction modelsanalyzesseveral dimensionsof userbehaviors.Inmost
of these works, a simplistic approach is usually followed that consists of integrating click-through data within the
docu-ment scoring(Joachims, 2002), since thissource of evidence expresses users’ search behaviors.In addition, some authors
(Agichtein et al., 2006), suggest a further abstraction level by proposing a robust user behavior model which takes the
collective behaviors for reducing noise within an individual search session into account. Instead of smoothing individual
behaviorswithcollectivesearchlogs, Radinskyetal.(2013),proposed anotherdimensionofanalysisthat refinesindividual
searchlogsthrough atemporalaspectthat predictsqueriesand click frequencieswithinsearchbehaviors.Thisusermodel
reliedontime-seriesanddynamicallyextracted topicaltrendsre-injectedwithintherankingorthequeryauto-suggestion.
Beyondanalyzing searchbehaviors for thepurpose of document ranking, another line ofwork (Heath & White, 2008;
White&Dumais,2009),exploitssearchbehaviorsforpredictingsearchengineswitchingevents.Theseworks’findingsmay
beused toenhancetheretrieval effectivenessand coverageofaninformation need,thusdiscouraging switchingactivities.
Forinstance, dealingwiththepersonalizationapproach,userprofiles mightbeextracted consideringusers’relevance
feed-back(Bennett etal., 2012;Leung, Lee, Ng, & Fung, 2008; Soulier etal., 2013;Teevanet al.,2005).In an individualsearch
setting,Bennettetal.(2012),proposedtocombineshort-termandlong-termsearchbehaviorstomineusers’interests.They
builtamulti-featureprofilebasedonsearchhistory,querycharacteristics,documenttopic,andusers’searchactions.In
con-trast, Leung etal.(2008),modeled users’profiles through aconcept-basedrepresentation inferred from clickthroughdata.
Each profile is thenused to learn users’ preferenceswith an SVMalgorithm that personalizes their searchresults. Search
personalizationisalsoproposedincollaborativesearchsettings(Morrisetal.,2008;Soulieretal.,2013).Forinstance,Morris
etal.(2008),integratedapersonalizedscore(Teevanetal.,2005),within(1)adocumentsmart-splitting overcollaborators’
rankingsforretrievingindividualrankings; and(2)arelevance summationofrelevancefeedbackforbuildingthefinal
doc-umentlistleveragingthecollectiverelevance.
Inthelast category,previousworkhasproposed tomodel and/ormineusers’roles fromtheirsearch behaviors.In this
context, contributions aim at either statistically identifying predefined roles (Golder & Donath, 2004; Kwak, Lee, Park, &
Moon,2010), ormininglatentroles through probabilisticmodels(Hendersonetal.,2012; McCallumetal.,2007).Thefirst
perspectiverelieson socialnetworkinteractionsforidentifyinglabeledorpredefinedroles,suchas“CelebritiesorRanters,”
throughastatisticalanalysis,(Golder&Donath,2004),orforidentifyingthe“NetworkLeaders” usingaPageRank-like
algo-rithm(Pal&Counts,2011),oraclusteringmethod(Kwaketal.,2010).Thesecondperspectiveoffersaformalwaytoidentify
latentrolesthroughtheanalysisofuserinteractions’similaritiesanddissimilarities(Hendersonetal.,2012;McCallumetal.,
2007). For instance,Henderson etal.(2012), focusedon the transformationofafeature-based multidimensionalmatrix to
identifytheusers’behaviormodelwhileMcCallumetal.(2007),revisedtheLDAalgorithmwithinacommunicationsocial
networktominetheevolvingroles ofusersaccordingtomessagecontents.
2.3. Researchobjectives
Fromtheliterature review, one caninfer that the keychallenge inCIR concernsthe difficultyofranking documents in
order to satisfy both individual and mutual goals with respect to the shared information need. Therefore, this challenge
assumesthat users aredifferentand guided bycomplementary skillsorknowledge (Sonnenwald, 1996). Onepossible way
toconsider users’differences mightbeto assign differentroles withrespectto theirskills.However, CIRmodels basedon
predefinedroles (Pickensetal.,2008;Shahetal.,2010;Soulieretal.,2013)raisetwomainconcerns(Soulieretal.,2014):
1. The roleassignmentassumes thatusersbehavethesamewaythroughout thesessionbyassigningroles tousersatthe
beginning ofthesearchsession.
2. A role might not bein accordancewith a user’sintrinsic skills, and moreparticularly might not consider theways in
whichthese skillsaremosteffective.
Onesolutionistoderiveusers’rolesfromthedifferencesandsimilaritiestheydemonstrateintheirinteractionsinorder
toexploittheseroleswithintheranking.Forthispurpose,twomainapproachescanbetraced,which,unlikeworksfocusing
on user behavior models that mainly deal with users’ intrinsic values (Agichtein et al.,2006; Bennett et al.,2012; Leung
etal.,2008),consider usersrelativetotheirpeersinordertohighlighthowtheyarethemosteffective.Thefirst approach
operateson apoolofpredefinedroles,and consistsofadynamicroleassignmentmonitored bysupervisedlearning,which
Fig. 1. Unsupervised latent role learning methodology.
been identified,the associatedstate-of-the-artCIR modelsareused to solvethequery. However,one limitationthat could
beraised fromthisworkisthat thelabeledroles arepredefinedregardlessoftheusers,thusrestrictingthelikelihoodthat
theidentified roles exactlymatchcollaborators’search skills.Therefore, two mainchallengescould beraised: (1) woulda
user beassigned to the role that best aligns withhis/her behavior, even if it isnot in particular accordancewith his/her
skills?;and (2)whatif usersalignwithmultipleroles?
Thesecondapproach,whichweutilizeinthisarticle,dynamicallycharacterizeslatentroleswithanunsupervised
learn-ing method, rather than classifying roles in a taxonomy. More particularly, in contrast to well-know CIR models (Pickens
etal.,2008;Shah etal.,2010)and inaccordancewith thelimitations raised bySoulieretal.(2014), weapproachherethe
problem ofpredefinedroles, and propose todynamically minethe unlabeledroles of collaborators throughout thesearch
sessioninanunsupervised manner,and subsequently adaptthecollaborative documentranking. As showninFig.1,
unla-beledrolesofbothcollaboratorsareminedeachtimeausersubmitsaquery.Then,featuresmodelingtheseunlabeledroles
arere-injectedinthecollaborativedocumentrankingmodelinordertodisplayarankedlist ofdocumentstothisuser.
In orderto ensure ourtwofoldobjective of (1) miningunlabeled roles withrespect tocollaborators’ behaviorsand (2)
collaborativelyrankingdocuments,werelyonafeaturesetestimatedatthedocumentlevel.Therefore,weintuitthat,ifwe
consideradocument’sreceiptofrelevancefeedbackasagoodindicatorofusers’searchbehaviorsandpreferences(Agichtein
etal.,2006),thosefeaturesassignedtotherelevantdocumentset wouldenableusto(1)minelatentrolesofcollaborators
and(2)re-inject theminedlatentroleswithinaCIRmodel.
Therefore,weaimtoaddressthefollowingresearchquestionsinthisarticle:
• RQ1:Howtoinfercollaborators’unlabeledrolesthrough thedifferencesand complementaritiesintheirbehaviors?
• RQ2:Howtoleveragetheseunlabeledrolesforcollaborativelyrankingdocumentswithrespecttothesharedinformation
need?
Weintroducetheconceptof“latentrole,” whichcapturescollaborators’rolesinrealtimeaccordingtothe
complementar-ityoftheirsearchskills,withoutanyassumptionsofpredefinedroleslabeledorbelongingtoaroletaxonomy(Golovchinsky
etal.,2009).Moreparticularly,guidedbythedivisionoflaborpolicy,theusers’latentrolesleveragetheskillsinwhich
col-laborators are different–the most complementary and the most effective for enhancing the retrieval effectiveness of the
search session. Also, we assume that collaborators’ search skills might be inferred within the persistence of their search
behaviors,sincecollaborators mighthavenoisysearchactionsthat couldbeduetothetask,thetopic,theinterfacedesign,
or collaborators’ engagement within the task. To this end,we are aware that this concept requires the searchsession to
besynchronous,requiring theusersto coordinate their actionsand toexhibit their skillsatthe sametime.Moreover, we
assume that the users are constantly engaged in both generating/modifying the information needs aswell as addressing
them,andthus notbeingactive,whichcouldlead tonoisysearchactionsorbehaviorsassociatedwithinactivity.
3. Themodel
We present here our model based on the latent role of collaborators, assuming that users might refine their search
strategiesandbehaviorsthroughoutthesessionwhiletheyinteract withtheircollaboratorsorassesssearchresults.
Our model, calledMineRank, considers search features modeling collaborators’ behaviorsand aims to rank documents
inacollaborative manner ateachquerysubmission byleveraging collaborators’search skillcomplementarities.For
conve-nience, we callan iteration associated to timestamp tl, the time-window beginning at eachtime user u submits queryq
and endingwhile document list Dtl
u isretrieved to useru. Moreparticularly, an iterationofMineRank relies on two main
stepsillustratedin Fig.2:(1) learningacross timethemost discriminantfeatureset,which maximizes thedifferences
be-tweenusers’behaviorsinsearchresultsinordertodynamicallyminethelatentrole ofcollaborators (Section3.2), and(2)
re-injecting latent roles for collaboratively ranking documents. For thispurpose, we aim atpredicting, through alearning
Fig. 2. Minerank methodology for an iteration. Table 1
Search behavior features.
Feature Description
Query features TitleOverlap (TiO) Fraction of shared words between query and page title TextOverlap (TeO) Fraction of shared words between query and page content AnnotationOverlap (AO) Fraction of shared words between query and page annotation SnippetOverlap (SO) Fraction of shared words between query and snippet of the page VisitedPosition (VP) Position of the URL in visited page order for the query
Page features TimeQueryToPage (TQTP) Time between the query submission and the visit of the page
TimeOnPage (TOP) Page dwell time
TimeOnDomain (TOD) Cumulative time for this domain Readability (Read) Document content readability Specificity (Spec) Document content specificity
Rating (Ra) Rate of this page
3.1. Notations
Weconsiderasynchronouscollaborative searchsessionS involvingapairofusersu1 and u2 forsolvingashared
infor-mation needI duringa timeinterval T. Each useru browses separatelyand formulates his/her queries for accessingtheir
respective documentresultsets. As shownin Fig.1, usershave thepossibilityto perform differentactions throughout the
searchsession.Beyond formulatingaquery, theyinteract withthe retrieveddocumentsbyvisitingtheir content,
annotat-ingwebpageswithcomments,book-markingdocuments,orsnipping piecesofinformation.Therefore,users’actionsmight
becharacterized bysearchbehavior-basedfeatures,noted F=
{
f1,. . .,fk,...,fn}
. Thelatter expresses theset ofnfeaturescaptured duringthesearchsession, detailedinTable1.These features,basedon theliterature (Agichtein etal.,2006), fall
undertwocategories:
• Submitted queryfeatures that capture collaborators’searchexperience withrespectto thequery topic. Forinstance, we
integratefeaturesbasedontheoverlapbetweenthequeryandpiecesofadocument(title,content,annotations/snippets
generatedbyauser).
• Selectedpagefeaturesthatcapturecollaborators’browsingbehaviorsinthesearchsessioninordertohighlighttimespent
on webpages/onaspecific domainaswellasthespecificityorreadability ofdocumentsvisited/annotated/snipped,and
bookmarkedbyagiven user.
Wehighlightthatthefeaturesetisslightlydifferentfromtheoneusedin(Soulieretal.,2014),sincetheintuitionofthis
proposedmodel istore-injectthebehavioralfeatureswithinthecollaborative documentranking.Furthermore, thefeature
setcanbeextendedwithnoimpacton themodel.
FollowingSoulieretal.(2014),werepresentatemporalfeature-baseduser’sbehaviormatrixS(tl)
u ∈Rtl×n,wheretlisthe
timestamp.EachelementS(tl)
u
(
tj,fk)
representstheaveragevalueoffeaturefkforuseruaggregatedoverthesetD(
uj)
(tj) ofdocumentsvisited/annotated/snipped/book-markedduringthetimeinterval [0...tj].Assumingthat users’searchbehaviors
mightberefinedthroughoutthesession,thetemporalmodelingenablesthecharacterizationoftheoverallbehaviorofthe
userattime-stamp tl avoidingthebiasinducedbynoisysearchactions.
AccordingtoSoulier etal.(2014),users’searchskilldifference towardaparticularsearchfeaturefk ∈F isreferredtoas
1
(tl) 1,2(
fk)
, where1
( tl) 1,2(
fk)
=S( tl) u1(
fk)
−S (tl)u2
(
fk)
.Inaddition,to ensurethat bothbehaviorsarefounddifferentinbothusers,andinordertoidentifythosesearchbehaviorsthateachuserisbetteratthantheother,wewouldliketoknowinwhichof
thosesearchbehaviorseachuser isbetter than theother andhighlighttheactsinwhichhe/sheisthemosteffective with
respecttohis/herintrinsicskillsaswellashis/hercollaborator’sskills.Withthisinmind,correlationsbetweencollaborators’
searchfeaturedifferences were estimated,pairbypair, byadding theconstraintthat thedifference between usersfor the
implied featuresissignificant, usingtheKolmogorov-Smirnov test(p-value p
(1
(tl)1,2
(
fk))
). Therefore,complementaritiesandsimilarities between collaborators u1 and u2 with respectto their search behaviorsare emphasized through acorrelation
matrixC(tl)
1,2 ∈Rp×pinwhicheachelementC
(tl)
1,2
(
fk,fk′)
isestimatedas(Soulieretal.,2014):C(tl) 1,2
(
fk,fk′)
=½
ρ
(1
(tl) 1,2(
fk)
,1
(1t,l2)(
fk′))
if p(1
1(t,l2)(
fk))
<θ
and p(1
1(t,l2)(
fk′))
<θ
0 otherwise (1)Fig. 3. Unsupervised latent role learning methodology.
As the goal is to focus on search behavior complementarities between unlabeled roles of collaborators u1 and u2, we
assumethat two featuresfk and fk′ behavesimilarlyif thecorrelation
ρ(1
(1t,l2)(
fk)
,1
(tl)
1,2
(
fk′))
oftheirdifference iscloseto1.Thecloserto−1thecorrelationis,themorecomplementaryarecollaborators’skillstowardsfeaturesfk and fk′.Focusing
on users’difference
1
(tl)1,2
(
fk)
towards searchfeaturefk isnot enough,since itdoesnotensurethat collaborators’roles arecomplementarywithrespecttotwosearchfeatures:oneusercouldbebetterforbothfeatures(Soulieretal.,2014),and,in
thiscase,thereisno needtoleveragetheothercollaboratorasadivisionoflaboractor.
With thisinmind,weintroducetheconceptoflatentrolebasedon thefollowinghypothesis:
• H1: Alatentrole models themostsignificantsimilarities and complementaritiesbetween collaboratorswith respectto
their search behaviorsor skillsthroughout the searchsession inorder to identify skillsin which collaborators are the
mosteffective.
• H2: Complementaritiesand similaritiesare respectivelyexpressedbynegative andpositive correlationsbetweensearch
behaviorfeatures.
Therefore, ateach time-stamp, thelatent role LR(tl)
1,2 highlights, for a pair of collaborators u1 and u2, their search skill
differences and complementarities during the time period [0..tl]; where search skills of a user are inferred from his/her
temporal feature-based behavior matrix S(tl)
u to highlight the persistence of unlabeled roles with respect to thetask, the
topic,theinterfacedesign orusers’engagementwithinthetask.Accordingly,thelatentroleLR(tl)
1,2 involves:
• AkernelK(1tl)
,2 ofasubsetF
(tl)
k1,2⊂F ofpbehavioralfeaturesF,wherep isautomaticallydefinedbythelatentrole mining
algorithm(see Section3.2).Inotherwords, pexpressesthenumberofthemostsignificantfeaturesusedtocharacterize
thelatentroleaccordingtohypothesisH1andH2.
• AcorrelationmatrixC1(tl)
,2 ∈Rp×p thatemphasizes complementaritiesandsimilarities betweenunlabeledrolesof
collab-oratorsu1 andu2withrespecttotheirsearchbehaviors.
3.2. Learningusers’latentrolesincollaborativesearch
The underlying issue ofthe latent role miningconsists ofidentifying themost discriminant featuresfor characterizing
collaborators’searchbehaviorswhichmaximize,forapairofcollaborators,theircomplementarity.Thisleadsustopropose
acollaboration-orientedlatent rolemining approachbasedon afeatureselection. Theintuitionbehind ourcontributionis
illustratedinFig.3.Thefeatureselectionoperatesontheanalysisofusers’searchbehaviors.Onceusershavebeenidentified
asbehavingdifferentlytowardssearchfeatures,theircomplementarybehaviorsaremodeledthroughaweightednetworkin
ordertoidentifythemostimportantand discriminantfeaturesforcharacterizing collaborators’latentrolesover thesearch
session.
Inwhatfollows,wefirstexpresstheoptimizationproblemframeworkandtheunderlyingassumptions.Then,wepropose
amethodtosolvetheproblem.
3.2.1. Latentroledesign
Inspired from workproposed byGeng, Liu, Qin, andLi (2007),and adapted to ourcollaborative latentrole mining, the
featureselection consistsof buildingthelatent role kernel K(tl)
1,2 byidentifying thesmallest subsetF
(tl)
k1,2 of p features(p is
undefined)accordingto threeassumptions:
• A1:theimportanceRec(1t,l2)
(
fk)
offeatures fk∈Fk(1tl),2isdependentontheirabilitiestoprovidegoodindicatorsofthe
doc-umentassignmentto userswithinthecollaborative documentranking. Weassumethat aCIRmodelmightsupport the
divisionoflabor,andthatadocumentmightbeassignedtothemostlikelysuitedcollaborator.AsproposedbyShahetal.
(2010), we formalize thisprinciple through adocument classification relying on relevance feedback collected
collaborators. With thisgoal,we propose to cluster,using a2-means classification, theset D(tl) of selecteddocuments
(throughannotations/snippets/bookmarks)bybothcollaboratorsuntil time-stamptl,accordingtothevalueoffeaturefk.
Theclusterwiththehighestcentroidisassignedtothecollaboratoruj withthehighestvalueS(
tl)
u1
(
fk)
whereastheothercluster is assigned to the other collaborator. We measure the quality of the classification based on feature fk towards
eachcollaboratoru1 andu2usingtherecallmeasureRec(tl)
1,2
(
fk)
: Rec(tl) 1,2(
fk)
= TP(tl) fk TP(tl) fk +FN (tl) fk (2) where TP(tl)fk isthenumber ofdocumentsassigned tothecluster associatedto theuserwhoselected those documents
using the2-means classificationbased onfeature fk.For instance, if u1 selected document d1 before time-stamp tl, we
consider a “TruePositive” action if theclassificationalgorithm attributesdocument d1 to user u1.Accordingly, TPf(tl)
k is
incrementedby 1. Inversely,FN(tl)
fk expressesthe numberofdocuments notassigned to theclusterassociated with the
userwhoselectedthosedocuments.Forinstance,ifdocumentd1isattributedtoclusterofuseru2,FN(ftl)
k isincremented
of1.
• A2: the redundancy between features might be avoided in order to consider only the most discriminant ones for
characterizing latent roles through complementary search behaviors among users, modeled using feature correlations
C(tl)
1,2
(
fk,fk′)
.Weinvestigateherehowtoidentifythemostdiscriminantfeaturesforcharacterizingusers’roles,andmoreparticularlyfeatureshighlighting complementary searchbehaviorsamong users.The mainassumptionisto identifyfor
which skillscollaborators arethemost suited with respectto theirco-collaboratorsfor solving theshared information
need. Weused the correlationC(tl)
1,2
(
fk,fk′)
betweencollaborators’ differences1
(1t,l2)(
fk)
and1
(tl)
1,2
(
fk′)
towards apair ofsearchbehaviorfeaturesfk and fk′.
• A3:thefeatureselectionmustmaximizetheimportanceRec1(t,l2)
(
fk)
oftheselected features fk∈F(tl)k1,2 withinthe
collab-orative document rankingand minimize theredundancyC(tl)
1,2
(
fk,fk′)
between thepairwise selectedfeatures. Thus,weformalizethefeatureselectionalgorithmasthefollowingoptimizationproblem:
maxα n X k=1 Rec(tl) 1,2
(
fk)
·α
k minα n X k=1 n X k′=1 C(tl) 1,2(
fk,fk′)
·α
k·α
k′ sub ject toα
k={
0,1}
; k=1,...,n (3) and n X k=1α
k=pwhere
α
isthevectorofsizenwhereeachelementα
k isaBoolean indicatorspecifyingwhether featurefk isincludedinthefeaturesubsetF(tl)
k1,2 attime-stamptl.
This optimization problem with multi-objectives might betransformed asa uniqueobjective optimization problem by
linearlycombiningbothoptimization functions.
maxα n X k=1 Rec(tl) 1,2
(
fk)
·α
k−γ
Ã
n X k=1 n X k′=1 C(tl) 1,2(
fk,fk′)
·α
k·α
k′!
sub ject toα
k={
0,1}
; k=1,...,n (4) and n X k=1α
k=pwhere
γ
isa decay parameter expressing the level of behavior complementarity takeninto account in the latent rolemining algorithm. This parameter is fixed over the session since we hypothesize that the ratio between the feature
importance andcomplementaritydoesnotdependonthecollaborators’currentlatentroles attime-stamptl.
3.2.2. Latentroleoptimization
Ouroptimizationproblem definedinEq. (4)might beresolvedbyundertakingallthepossible featurecombinations of
sizep,where p=2,...,n.Althoughoptimal,thismethodistime-consumingwithacomplexityofuptoO
(
Pnp=1C p n
)
.We propose, here, a graph-based resolution algorithm attempting to identify the best feature subset which may
Table 2 Notations.
Notation Description
C The feature graph representing the growing clique
P The evolving candidate graph
K The maximum clique satisfying the optimization problem
Nbhd ( C ) The function that returns in a decreasing order the neighboring features of all features belonging to C with non null weight Nds ( K ) The function that returns all the features belonging to K
The main objective is to extractthe smallest featurenode set which enhances the importance ofthe set of retained
fea-tureswhilemaximizingthedifferencesofcollaboratorswithintheirsearchbehaviors.Inthiscontext,werepresent features
throughacollaboration-basedgraphG(tl)
1,2 modelingsearchbehaviorsofcollaboratorsu1and u2attime-stamptl.The graph
G(tl)
1,2=
(
A(tl)
1,2,C
(tl)
1,2
)
, illustratedin Fig.A.10, involvesnodes A(tl)
1,2 which represent eachfeaturefk ∈F,weighted byan
impor-tancemeasureRec(tl)
1,2
(
fk)
withinthecollaborativedocumentranking,andundirectedweightededgesC(tl)
1,2 :RF×F which
rep-resentcollaborators’search behaviorsimilarities orcomplementarities byconsidering thecorrelationC(tl)
1,2
(
fk,fk′)
betweendifferencesofpairwisefeaturesfk and fk′.
The usednotationsaredetailedinTable2.Inwhatfollows,wedescribethealgorithm,calledColl-Clique, forsolvingthe
optimizationproblem(Algorithms1 and2).NotethatanillustrationofouralgorithmispresentedinAppendixA.
Algorithm1:Main Data:G(tl) 1,2=
(
A (tl) 1,2,C (tl) 1,2)
,γ
Result:F(tl) sel begin C={}
; K={}
; P=G(tl) 1,2; K=Coll−Clique(
C,P,γ
,K)
; F(tl) sel =Nds(
K)
; ReturnF(tl) sel ;In orderto solvethe optimization problem,weextended themaximum clique algorithm (Carraghan&Pardalos, 1990),
inordertofitwithourfeatureselectionprobleminacollaborative context.Ourintuitionisthataweightedgraphis
com-plete since it models search behavior complementarities through correlations between pairwise features. The Coll-Clique
algorithm,ratherthanfocusingonanodelevelforidentifyingthebiggestcompletesubgraph(Carraghan&Pardalos,1990),
alsocalledthemaximumclique,aimsatextractingthesubgraphwhichmaximizesthenodeweights,namelythesearch
fea-tureimportances(assumptionA1),andminimizestherelationshipweightsbetweennodes,namelypairwisesearchbehavior
correlationsforbothcollaborators (assumptionA2).
As showninAlgorithm1,Coll-Cliquerelieson twofeaturegraphs:
• ThegrowingcliqueC,candidatetobethemaximum cliqueK.
• The feature graph P, which includes candidatefeatures to be added to the growingclique C. Nodes in P are obtained
throughthefunctionNds(P).
Initially, C isempty and P is the graphincluding allthe features. The algorithm, asshown in Algorithm 2, recursively
incrementsthegrowing cliqueCusing featuresfh involvedwithin graphP builtupon thefunctionNbhd(C),which creates
anew candidatefeaturegraph P that onlyretrieves inadecreasing orderfeaturescharacterized byapositive depreciated
weight.ThisoperationisnotedCfh.Ateachrecursion,theweightRec(tl)
1,2
(
fk′)
oftheother remainingfeatures fk′ isdepre-ciatedbythecorrelationC(tl)
1,2
(
fj,fk′)
withrespecttothelastselectedfeaturefj.Let usdenote W(K)asthesumoffeatureweightswithinthemaximumclique K.Weassumethatthissumreferstothe
indicatorwe wouldlike tomaximize (Eq. (4)) sincethefeature weight(importanceRec(tl)
1,2
(
fk′)
)isrecursively depreciatedwithrespect to theadjacent edgeweight(correlationC(tl)
1,2
(
fj,fk′)
). If theweightW(
C)
+W(
P)
of featureswithinC and Pislower than the weightW(K)within K identifieduntil thecurrent iteration,there isno wayto build aclique from Cby
adding featuresfrom P witha higher weightthan the weightoffeatures inK. Finally, the set ofselected featuresF(tl)
k1,2 of
sizepinferredfromthepnodeswithinthemaximum cliqueKbuildsthelatentrolekernelK(tl)
Algorithm2:Coll-Clique Data:C,P,
γ
,K Result:K begin forallthe fj∈P do W(
C)
=P fk∈Nds(C)Rec (tl) 1,2(
fk)
W(
P)
=P fk∈Nds(P)Rec (tl) 1,2(
fk)
W(
K)
=P fk∈Nds(K)Rec (tl) 1,2(
fk)
if(
W(
C)
+W(
P)
≤W(
K))
then/* Return the maximum clique */
ReturnK
/* Increment the growing clique C */
C=Cfj
/* Depreciate node weights */
forallthe fk′∈Pdo
Rec(tl)
1,2
(
fk′)
=Rec(1t,l2)(
fk′)
−C1(t,l2)(
fj,fk′)
∗2γ
/* Build the candidatenode set */
P′=Nbhd
(
C)
if(P′=
{}
andW(
C)
>W(
K)
)then/* Save the local optimum */
K=C
ifP′6=
{}
then/* Launch new recursion */
Coll-Clique(C,P’,
γ
,K)/* Remove node for new recursion */
C=C
\
fj P=P\
fjWehighlight that our algorithm strictly follows theframework of themaximum clique extraction algorithm proposed
byCarraghanand Pardalos(1990),which istheinitialversion ofthebranch-and-bound algorithmcategory,well-knownto
ensuretheguaranteeoftheoptimalsolution(Wu&Hao,2015).Weaddoneheuristicinordertoconsideraweightedgraph
aimingatsolvinganoptimizationproblem,initiallyproposedbyGengetal.(2007),and adaptedtoourproblem.Therefore,
giventhatthecandidatecliqueCinincrementedbypositively-weightednodesinadecreasingorder,onecouldassumethat
theequation
(
W(
C)
+W(
P)
≤W(
K))
aimsatmaximizingtheweight ofthemaximum cliqueK whereitsweightcould beestimatedasfollows: W
(
K)
= |K| X k=1 Rec(tl) 1,2(
fk)
−γ
Ã
|K| X k=1 |K| X k′6=k;k′=1 C(tl) 1,2(
fk,fk′)
!
(5)Thefirst partoftheequationrefers totheinitialweightofnodesfk intheinitialgraphG whilethesecondpartexpresses
thedepreciationofthenodeweightwithrespecttothenodesbelongingtothemaximumclique.Therefore,maximizingthe
weightofthemaximumcliqueKisequivalenttosolvingtheoptimizationproblempresentedinEq.(4).
3.3. Latentrole-basedcollaborativedocumentranking
In this section, we re-inject the latent role kernel identified in the previous section in order to collaboratively rank
documents to users. The idea is to use the most discriminant features–characterizing search behavior complementarities
of both collaborators–to first assign documents to the most likely suited collaborator, in order to ensure the division of
labor, and then rank thedocuments assigned to eachuser. For this purpose,we usea classifier learningalgorithm which
operateson thedocumentrepresentationrestrictedtofeaturesimplied withinthelatentroleM(tl)
1,2∈Rm×m attimestamptl.
We highlightthat only thedocument list Dtl
u associated withthe classof theuseru whosubmittedquery qis displayed.
Indeed, asexplained in Fig.1,we only attempt tosatisfy theinformation needof theuser whosubmittedthe query. We
assume that the other collaborator u′ might not be interested in query q and is already examining a document list Dtl′
u′
retrievedwithrespecttoapreviouslysubmittedqueryattimestampt′
l<tl.WechoosetousetheLogisticRegressionasthe
Fig. 4. Overview of the collaborative document ranking using latent role of collaborators.
• Stage 1. The learning step considers the set D(tl) of snipped/book-marked/annotated documents by either collaborator
u1 oru2 before time-stamp tl.Documents selected byboth collaborators areremoved fromthisset,since theyarenot
discriminant for the collaborative-based document allocation to collaborators.Each document di∈D(tl) is modeled by
a feature vectorx(tl)
i ∈Rm, estimatedaccording to value of the feature fk∈K
(tl)
1,2 for document di with respectto
col-laborators’ actionsand time-stamp th ofitsassessment, withth≤ tl.Documentdi alsoreceives aclassificationvariable
c(tl)
i ∈
{
0;1}
where values 0 and 1 express the class of collaborators, respectively u1 and u2, who have selected thisdocument.
The objective ofthe document ranking learningfunction isto identify thepredictor weightvector
β
(tl)j ∈Rm in order
to estimate the probability ofallocating documents to collaboratoruj ∈ {u1, u2}. The logistic regression aimsat
maxi-mizing thelikelihoodldetailedinEq.(6)which relieson thelogitfunctionformalizedinEq.(7).Thelatter modelsthe
probabilityPj
(
x(tl)i
)
fordocumentdi belongingtouserclassci∈{0; 1}withrespecttofeaturevectorx(tl) i . max β(tl) j ,β (tl) j X di∈D(tl)
(
ci·ln(
Pj(
x(itl))))
+(
1−ci)
ln(
1−Pj(
xi(tl)))
(6) wherePj(
xi(tl))
= exp(
x(tl) i ·β
(tl) j)
1+exp(
X(tl) j ·β
(tl) j)
(7)• Stage2.The testingstepconsiders thesetDnsel(tl) ofdocumentsnotselected bybothcollaboratorsu1 andu2 before
time-stamp tl. The feature vector x(tl)
i is estimated according to feature average values of document di with respectto the
search logscollected before time-stamp tl, notnecessarilybybothcollaborators. Indeed, thereisno availablevalue for
action-based features, suchas AnnotationOverlap, for thepair of collaborators considering that thedocument hasnot
beencollectedbythepairofusers.Thefittedmodellearntthroughthelogisticregressionalgorithmestimatesthe
prob-ability Pj
(
x(tl)i
)
ofassigning document di′∈Dnsel(tl) to thecollaborator classci with respectto the predictor weightβ
(tl)
j .
Documentdi′ isallocatedtothecollaboratorclasscj withthehighestprobabilityPj
(
x(tl)
i
)
;∀
j∈{
0,1}
,whichisalsousedforrankingdocumentswithinthecollaboratorclass.
Moreover, we add a supplementary layerof division of labor by ensuring that result lists Dtl
u′ and D t′
l
u′ simultaneously
displayed(even ifretrievedatdifferenttime-stampstlandt′
l)tocollaboratorsuandu′includedistinctdocuments.
4. Experimental evaluation
Weperformedanexperimentalevaluationinvestigatingtheimpactofmininglatentrolesofcollaboratorsontheretrieval
effectivenessofacollaborativedocumentrankingmodel.Thefollowingarethehypothesesthat guideourinvestigation:
1. ACIRmodelshouldfitwithcollaborators’complementary skillsinwhichtheyarethemosteffectivewithrespecttothe
collaborative setting,takingintoaccounttheirwholebehaviorsregardlessoftheirskillsorpredefinedroles.
2. ACIRmodelshouldachieve agreatereffectivenessthanusersworkingseparately.
3. ACIR modelshoulddynamicallyassign collaborators’roles inanunsupervisedmanner inthesearchsessioninsteadof
assigning rolesregardlessoftheirskills.
Inwhatfollows,wedescribetheexperimentalprotocolandpresenttheobtainedresults.
4.1. Protocoldesign
4.1.1. Userstudies
Since no well-established benchmark exists in the CIR domain, we used search logs collected from two different
Fig. 5. Coagmento system.
Table 3
Collaborative tasks in user studies US1 and US2. Tasks Guidelines
US1 The mayor of your countryside village must choose between building a huge industrial complex or developing a nature reserve for animal conservation. As forest preservationists, you must raise awareness about the possibility of wildlife extinction surrounding such an industrial complex. Yet, before warning all citizens, including the mayor, you must do extensive research and collect all the facts about the matter. Your objective is to collaboratively create a claim report, outlining all the possible outcomes for wildlife should the industrial complex be built. Your focus is on wildlife extinction. You must investigate the animal species involved, the efforts done by other countries and the association worldwide to protect them and the reasons we, as humans, must protect our environment in order to survive. You must identify all relevant documents, facts, and pieces of information by using bookmarks, annotations, or saving snippets. If one document discusses several pieces of useful information, you must save each piece separately using snippets. Please assume that this research task is preliminary to your writing, enabling you to provide all relevant information to support your claims in your report . US2 A leading newspaper agency has hired your team to create a comprehensive report on the causes, effects, and consequences of the climate
change taking place due to global warming. As a part of your contract, you are required to collect all the relevant information from any available online sources that you can find. To prepare this report, search and visit any website that you want and look for specific aspects as given in the guideline below. As you find useful information, highlight and save relevant snippets. Later, you can use these snippets to compile your report, no longer than 200 lines. Your report on this topic should address the following: Description of global warming, scientific evidence of global warming affecting climate change, causes of global warming, consequences of global warming causing climate change, and measures that different countries around the globe have taken over the years to address this issue, including recent advancements. Also describe different viewpoints people have about global warming (specify at least three different viewpoints you find) and relate those to the the aspects of controversies surrounding this topic.
basedon user mediation.The system allowsusersto browsetheweb and submit queries onindependent searchengines,
mainlyGoogle.Thesystemincludesatool-barandaside-bar,providingafunctionalityinwhichuserscaninteractwiththeir
peersthroughaninstantmessagingsystemaswellasbook-marking,annotatingandsnippingwebpages.Moreover,the
side-barensuresawarenessbyshowingauserwhathe/she,aswellashis/hercollaborator,havebook-marked/snipped/annotated
duringthesession.Inaddition,thesystemtrackscollaborators’activitiesandrecordstheirsearchlogs–suchasvisitedpages,
submittedqueries,and relevance feedback–alloverthesession.Anoverviewoftheused system isillustratedinFig.5. We
outlinethat thissystemensures theawarenessparadigmsincetheside-bar allowscollaborators tobeaware ofother
rele-vancefeedback.
Theuserstudies,US1andUS2,involvedrespectively25non-nativeand10nativeEnglishuserpairs(atotalof70people)
who were recruited from university campuses and received compensation for their involvement within the experiments
($20 per person, with an additional $50 for thethree best performing groups). Accordingly, these participants performed
thetaskofanexploratorysearchproblemwithin30minutesinaco-locatedsetupintheirmothertongue.Duringthetask,
thecollaborators interacted witheachother inorderto identifyasmanyrelevantdocuments aspossible. Theirinteractive
behaviorsgenerally involved discussions about theirsearchstrategies orlinkedsharingexchange, always throughthechat
system. For user study US2, participants also had to write areport usingweb pages saved throughout thesearch session,
whichallowedthemless timetobrowseon theweb.Tasktopicsare, respectively,“tropicalstorms” and “global warming”.
Table 4
Statistics of user studies US1 and US2.
US1 US2
Topic Tropical storm Global warming
Number of dyads 25 10
Total number of visited pages 4734 1935
Total number of book-marked/rated pages 333 –
Total number of snipped pages 306 208
Total number of submitted queries 1174 313
Average number of terms by query 3 .65 4 .73
Statistics ofbothuser studiesare shownin Table4.Ananalysis ofparticipants’ submittedqueries showsthat they are
mainlyareformulationoftopics,aseachtopicalwordoftenoccursinthequeries.Indeed,amongthe1174submittedqueries
forUS1,theterms“tropical” and“storm” areused1077and1023times,respectively,whereastheterms“global” and
“warm-ing” areused254and247timesoverthe313submittedqueriesinUS2.Duringthetask,participantsexamined,respectively,
91and73web-pages forUS1and US2. Wehighlightthat thenumberofsubmittedqueriesand visited pagesby
collabora-tivegroupsarehigherforUS1;thiscouldbeexplainedbytheadditionalobjectiveofparticipantsinUS2,whichconsistedof
writingareport.Thelatter leftparticipantslesstimeforbrowsingtheweb.
4.1.2. Data
Given thedistinct languagesofboth userstudies,we buildtwo separate documentindexes. Respectively,for eachuser
studyUS1andUS2,weaggregatedtherespectivewebpagesseenbythewholeparticipantsetaswellasthetop100search
engine result pages (SERPs) from Google for the submitted query set. We highlight that the SERPs were extracted later
in orderto avoid processing overload. Each web page wasprocessed for extracting< title > and < p > tags.In orderto
increasethe size ofdocument indexes, wecarried outthis protocolfor other proprietary user studies performedin other
collaborativesettings,notconsideredforourexperiments(Shah&González-Ibáñez,2011a).Intheend,theindexesincluded
24,226and74,844documentsforUS1and US2,respectively.
4.1.3. Evaluationprotocol
In orderto avoidabiasthat couldbeinvolvedinare-ranking approachthatreliesonfeaturesmeasuringthesimilarity
of the document with respect to the query, we highlight that the collaborative ranking step is carried out on the whole
documentcollection.Withthisinmind,weconsidertwoversionsofourmodel:
• MineRank(q):ourproposedunsupervised rankingmodel inwhichthelatent rolemining andthecollaborative ranking
stepsaresuccessivelylaunchedateachquerysubmission.
• MineRank(t):ourproposedunsupervisedmodelinwhichthecollaborators’latentrolemining(Section3.2)isperformed
atregulartime-stampstsimilartoSoulieretal.(2014),whilethecollaborativerankingstep (Section3.3)islaunchedat
eachquerysubmissionbyconsideringthelatentroleminedatthelast time-stampt.
Wehighlightthat,similartoSoulieretal.(2014),theBM25modelislaunchedwhennosearchskilldifferencesbetween
collaboratorsaredetected.
For effectivenesscomparison,weranthefollowingbaselinesateachquerysubmission:
• BM25:the BM25 rankingmodel which refers to anindividualsetting. This setting simulates asearchsession inwhich
collaborators performtheirsearchtaskonindependentsearchengines.
TheBM25 rankingmodel(Robertson&Walker,1994),estimatesthesimilarityscorebetweendocumentdi andqueryqh
as: RSV
(
di,qh)
= X tv∈qh N−nv+0.5 nk+0.5 fiv·(
k1+1)
fiv+k1·(
1−b+b· |di| avgdl)
(8)where N expresses the collection size, nv the number of documents including term tv. The frequency of term tv in
documentdiisnotedfiv.|di|representsthelengthofdocumentdiwhereastheaveragedocumentlengthisnotedavgdl.
k1 andbaremodelparameters.
• Logit:theCIRmodelwhichonlyinvolvesthelaststepofouralgorithm(Section3.3)consideringthewholesetoffeatures
inordertomeasuretheeffectivenessofapersonalizedsearchwithoutany considerationofthelatentroles.
• PM: the CIR model refers to a system-based mediation guided by predefined fixed roles of Prospector-Miner (Pickens
etal.,2008).
Accordingto collaborators’roles,thismodelrelieson twodifferentrankingfunctions:
1. Thequerytermsuggestion functionaimsatfavoringtheProspector’ssearchdiversity.For eachtermtk belongingto
documentspreviouslyretrievedinlistsL,itsscoreisestimatedas:
score
(
tk)
=X
Lh∈L
wr
(
Lh)
wf(
Lh)
rl f(
tk,Lh)
(9)2. The document ranking function ensures the relevance of documents not examined by the Prospector towards the
topic.Thescoreofdocumentdi isestimatedasfollows:
score
(
di)
=X
Lh∈L
wr
(
Lh)
wf(
Lh)
borda(
di,Lh)
(10)whereborda
(
di,Lh)
isavotingfunction.Thesetwofunctions arebasedon relevanceandfreshnessfactorsestimatedasfollows:
– The relevance factorwr(Lh)which estimatesthe ratio ofrelevant documentsin list Lh retrieved forquery qh, noted
|rel∈ Lh|withthenumberofnon-relevantdocumentsinthesamelist, noted
|
nonrel∈Lh|
:wr
(
Lh)
=|
rel∈Lh|
|
nonrel∈Lh|
(11)– The freshnessfactorwf(Lh)which estimates theratio ofdocuments notvisited in list Lh, noted|nonvisit ∈ Lh|, with
thenumberofvisiteddocumentsinLh,noted
|
visit∈Lh|
:wf
(
Lh)
=|
nonvisit∈Lh|
|
visit∈Lh|
(12)• GS:theCIRmodelreferstoasystem-basedmediationguidedbypredefinedfixedrolesofGatherer-Surveyor(Shahetal.,
2010).
Thismodelislaunchedafterquerysubmissions(qhandqh′)ofeachcollaborator(respectively,ujanduj′)andconsistsof
two steps:
1. TheMergingstepinwhichdocumentlistsaremergedusingtheCombSUMfunction.
2. The Splitting step in which documents in the merging list are classified usinga 2-means algorithm. Each cluster is
assignedto acollaboratorusingthefollowing criteria:theclusterwith thehighest gravitycenterisassigned to the
GathererwhereastheremainingclusterisassignedtotheSurveyor.
• RoleMining:theuser-drivensystem-mediatedCIRmodel whichminespredefinedrolesof collaboratorsin realtimeand
ranksdocumentsaccordingtotheassociatedstateoftheartCIRmodels(Soulieretal.,2014).Incontrasttoourproposed
approach, thissetting considers roles (namely Gatherer/Surveyorand Prospector/Miner)predefined ina role taxonomy
Golovchinskyetal.(2009),thatmightnotexactlyfitwithusers’skills.
This model exploitsthe correlation matrix denoting collaborators’ behaviorsin order to assign userspredefined roles,
modeled througharole pattern. Inparticular, accordingtoarole pattern pool,therole-basedidentificationassigns the
rolepatterncorrelationmatrixFR1,2 whichisthemostsimilartothecollaborators’correlationmatrixC(tl)
u1,u2,obtainedfor
thepairofusers(u1,u2)atgiventime-stamptl.
argminR1 ,2
||
F R1,2C(tl) u1,u2||
sub ject to:∀
(f j,fk)∈KR1,2F R1,2(
f j,fk)
−Cu(1tl,)u2(
fj,fk))
>−1 (13)where||.|| representstheFrobeniusnormandistheminusoperatordefinedas:
FR1,2
(
f j,fk)
Cu(t1l,)u2(
fj,fk)
=(
FR1,2(
f j,fk)
Cu(t1l,)u2(
fj,fk)
ifFR1,2(
f j,fk)
∈{
−1;1}
0 otherwise4.1.4. Groundtruthandmetrics
Webuiltthegroundtruthusingclick-throughdatafollowingtheassumptionthat implicitrelevancederived fromclicks
is reasonably accurate (Joachims, Granka, Pan, Hembrooke, & Gay, 2005).Specifically, theground truth only relieson the
clickeddocumentsandincludes anagreementlevel, assuggestedin(Shah &González-Ibáñez,2011b).However,incontrast
toSoulieretal.(2014),whoconsideranagreementlevelinvolvingtwousers,wereinforcetheagreementlevelconditionby
accounting for thebias ofintra-group collaboration interactionsviathe notion that participants might belongto different
groups. Indeed, collaboratorsare likelytointeract through thechatsystem inordersharedocument links,assuggestedin
Section4.1.1.Thisresultsinasmallrelevantdocumentset,namely38and20 foruserstudiesUS1andUS2,respectively.
Using(Shah&González-Ibáñez,2011b),weusedwell-knowncollaborative-basedmetricsproposedtoevaluatethesearch
outcomesofcollaborativesearch.Thesemetrics areprecision-and recall-oriented,and areestimatedatthegrouplevel. As
such,theyconsider documentsselected bycollaborators throughoutthesearchsessionin whichsubmittedqueries
consti-tute a collective action, rather than independent actions. In order to evaluate the retrieval effectiveness of our proposed
model,themetricsare appliedtoadocumentset that aggregatesrankingsretrievedthroughoutthesession.Thesemetrics
areestimated atthe group level (as done byShah &González-Ibáñez (2011b)), across allqueries submitted byall
collab-orators. In particular, we (a) considered rankings with respect to their top 20 ranked documents as usually done in the
Fig. 6. Parameter tuning methodology.
retrievedwith respecttoqueries submittedthroughout the searchsessionbyallcollaborators,and then(c)estimated the
collaborativemetricsofthismergeddocument set,namelyatthegrouplevel.
In orderto estimatethecollaborative-based metrics, weadapted theuniverse, relevant universe, coverageand relevant
coveragesetsdefinedin(Shah& González-Ibáñez,2011b):
• Theuniverse Uofweb pagesrepresentsthedocumentdataset.
• Therelevantuniverse Urrefers tothegroundtruth,withUr⊂U.
• ThecoverageCov(g)ofacollaborativegroupgexpressesthetotalnumberofdistinctdocumentsretrievedforallqueries
submittedbycollaborators ofgroupgthroughout thesearchsession.
• The relevant coverage RelCov(g) of a collaborative group g refers to the total number of distinct relevant documents
retrievedforallqueriessubmittedbycollaborators ofgroupgthroughoutthesearchsession.
With thisinmind,weusedthefollowingcollaborativemetricstomeasurethesynergiceffectofacollaborativegroupg:
• TheprecisionPrec
(
g)
estimatedforcollaborative groupg:Prec
(
g)
= RelCov(
g)
Cov
(
g)
(14)• TherecallRecall
(
g)
estimatedforcollaborativegroup g:Recall
(
g)
=RelCov(
g)
Ur
(15)
• TheF-measureF(g)estimatedforcollaborative groupgwhichcombinesbothprecisionand recallmetrics:
F
(
g)
= 2∗Prec(
g)
∗Recall(
g)
Prec
(
g)
+Recall(
g)
(16)Finally,thesemeasuresareaveragedoverthecollaborativegroupsofeachuserstudy,namelythe25collaborativegroups
(US1)onone handandthe10(US2)on theother hand.
Note that thecomputation of these collaborative-oriented metrics is as similar as possible to theprecision and recall
measuresinclassicalinformationretrieval.However,whileclassical IRconsidersarankinganevidence source,werelyhere
on a set built by a merging of the top 20 documents retrieved for all queries submitted over the search session by all
collaborators.
4.2. Results
Thissection reportstheobtainedresultswithrespecttoseveralscopes.First, wepresenttheparametertuningstepand
then,weanalyzetheretrievaleffectiveness.
4.2.1. Parametertuning
Inordertohighlighttheconsistencyofourmodelregardlessofspecificusers,tasks,andtopics,weperformeda
learning-testingapproachintwosteps,illustratedinFig.6:(1)thelearningstep,whichoptimizesthemodelparameter(s)usingone
of our datasets, e.g. US1, and (2) the testing step, which estimates the retrieval effectiveness of our model on the other
Fig. 7. Parameter tuning of MineRank(q).
(a) M ineRank(t) model for US1 (b) M ineRank(t) model for US2
Fig. 8. Parameter tuning of MineRank(t).
Notethatourmodelusesafeaturesettodynamicallyminecollaborators’latentrolesbyleveragingfeatures’importance
withinthecollaborative rankingmodelandcollaborators’complementarities.Withthisinmind,weexpressedthe
assump-tions that the decay parameter
γ
combining these two aspects in the optimization problem (see Eq. (4)) might be fixedover thesession.Accordingly,thetuningphase mainlyconcernsthe
γ
parameter,whichexpresses thecollaborators’com-plementarity.Both versions ofourproposed model, namelyMineRank(q) and MineRank(t), areconcerned withthis tuning,
and weconsidertheF-measureindicatortobethetuned effectivenessmetricbecauseitisacombination ofprecisionand
recall.
Thefirstversionofourmodel,namelyMineRank(q),launchedateachquerysubmissiononly dependsontheparameter
γ
,usedwithinthelatentroleminingstep(Eq.(4)).Thelatterwastunedwithavaluerangeγ
∈[0..1],asillustratedinFig.7.Wecanseethattheoptimalvalueforparameter
γ
isreachedat0.5and0.2for,respectively,userstudiesUS1andUS2,withanF-measurevalue respectivelyequalto0.074and 0.060.Thisdifference suggeststhat theconstraintofthereport-writing
inUS2doesnotallowcollaboratorstofullyemphasizetheirsearchbehaviorcomplementarity.
Thesecondversion ofourmodel,namelyMineRank(t),requires fixingtheparameter
γ
∈[0..1],aspreviouslydone, andalsotimestamptwherethelatentroleminingstepislaunched.Weconsiderthata1-to-5-minutetimewindow formining
latent roles isareasonable range forour experiments.Thisbetter fits with ourmodel’sassumption that search behaviors
evolvethroughout thesearch session.Figs. 8a and 8b illustratethe variationof theF-measure for ourmodel MineRank(t)
withrespecttobothparameters
γ
∈[0..1]andt∈[1..5],for,respectively,userstudiesUS1andUS2.TheF-measureisoptimal(F=0.069)when
γ
=0.5andt=2fordatasetUS1,whileindatasetUS2,itreaches0.056whenγ
=0.1andt=3.Theoptimalvalues
γ
obtainedforbothversionshighlightthattheconsiderationofsearchskillcomplementaritieswithintherole mining approachishigher inuser study US1than inuser study US2. Moreover, scores obtained forbothversions
highlightthattheeffectivenessofthemodelMineRank(t)islowerthanthatofmodelMineRank(q)inbothuserstudies.This
is consistent with thefact that our model relieson relevance feedbackexpressed after each submittedquery rather than
afterregulartime-stamps.Therefore,intheremainingexperiments, weonly considertheversion MineRank(q).
4.2.2. Analyzingthedynamicsofcollaborators’latentroles
Inthissection, ourgoalistoidentifytheevolutionofsearchskillsused bycollaboratorsthroughout thesearchsession.
For this purpose, we analyze, first, the average number of selected features for characterizing the latent role kernels of
collaborators,andsecond,theaverageoverlapbetweenthefeatureset selectedfortwo successivecollaborators’latentrole