A probabilistic model to exploit user expectations in XML information retrieval


To link to this article: DOI: 10.1016/j.ipm.2016.06.008

URL: https://doi.org/10.1016/j.ipm.2016.06.008

To cite this version: Dahak, Fouad and Boughanem, Mohand and Balla, Amar. A probabilistic model to exploit user expectations in XML information retrieval. (2017) Information Processing & Management, vol. 53 (n° 1). pp. 87-105. ISSN 0306-4573

Open Archive TOULOUSE Archive Ouverte (OATAO)

OATAO is an open access repository that collects the work of Toulouse researchers and makes it freely available over the web where possible.

This is an author-deposited version published in :

http://oatao.univ-toulouse.fr/

Eprints ID : 18776

Any correspondence concerning this service should be sent to the repository administrator: staff-oatao@listes-diff.inp-toulouse.fr


A probabilistic model to exploit user expectations in XML information retrieval

Fouad Dahak a,∗, Mohand Boughanem b, Amar Balla a

a National Computer Sciences Engineering School (ESI), BP 68M Oued Smar, 16270, Algiers, Algeria

b IRIT, University of Paul Sabatier, 118 Route de Narbonne, 31062 Toulouse, France

Keywords: Priors; Element importance; User browsing map; Language model

Abstract

The main objective of this paper is to exploit a new source of evidence derived from the document hierarchical structure for XML information retrieval. We consider that the structure of an XML document is an important source of prior knowledge, and that the structural features of an element may influence the user to consider that element as relevant. We build a probabilistic model to estimate the probability that the structural characteristics of an element attract the user to explore the content of this element and consider it as relevant. This probability reflects the context importance. We propose a simple, well-motivated probabilistic model to estimate the context importance. Finally, we demonstrate the effectiveness of the context importance through comprehensive experimental studies carried out on the IEEE XML document collection. Experimental results show that the proposed approach outperforms models exploiting other sources of evidence.

1. Introduction

XML (eXtensible Markup Language) is a well-known standard for data representation and exchange on the Internet. An XML document contains textual information as well as logical structures that highlight the underlying semantics. The main challenge for content-oriented XML retrieval is to select highly relevant elements that would satisfy the user information needs (Lalmas, 2009). These elements must not only be relevant, but must also be at the right level of granularity. To address this issue, information retrieval models leverage query-dependent features related to element characteristics, such as query-term frequency within the element content. They also exploit query-independent features, called priors, such as PageRank (Page, Brin, Motwani, & Winograd, 1999), document length (Kraaij, Westerveld, & Hiemstra, 2002; Miller, Leek, & Schwartz, 1998), and clickthrough data (Bao et al., 2007; Joachims, 2002; Kirsch, Gnasa, & Cremers, 2006) to enhance the retrieval performance. In XML content-oriented information retrieval, element characteristics including element length (number of terms in the element content) (Kamps, Rijke, & Sigurbjörnsson, 2004), element label frequency in the collection (Ashoori, Lalmas, & Tsikrika, 2007; Ogilvie and Callan, n.d.), element path length (Huang, Watt, Harper, & Clark, 2006) and node position in the document (Huang, 2007) are also exploited.

In this paper, we present a probabilistic model that exploits user expectations to enhance XML retrieval performance. Our objective is to estimate the context importance. To fulfill this objective, we first define the notion of element type and its structural context. Then, we propose a probabilistic model to estimate the structural context importance of an element, which we use as a prior in a language model approach for XML retrieval.

∗ Corresponding author.


To achieve our objectives, we make the following assumptions:

• As for web information retrieval, where exploiting the hyperlink structure of the web significantly improves retrieval effectiveness (Kamps, Kaptein, & Koolen, 2010; Page et al., 1999; Westerveld, Kraaij, & Hiemstra, 2001), the hierarchical structure of an XML document would be an important source of prior knowledge in XML information retrieval. We believe that structural characteristics of an element, such as its position in the document and its surrounding context, influence the user while browsing a document. Therefore, we are interested in estimating the probability that a user focuses on a particular element while browsing a document, by exploiting the document hierarchical structure. This probability, named element importance, may reflect the user expectations about where to find relevant information.

• The document elements are classified into several types according to their labels (tags) and their hierarchical level in the document tree. Each element type belongs to a structural context defined by its surrounding element types. This structural context may influence the user to focus on a given element at a given level in the document tree.

We evaluate the context importance prior model (CPrior) on the IEEE XML document collection of INEX and compare our results with the length prior model and some models exploiting other sources of evidence.

The remainder of this paper is organized as follows: in Section 2, we review some related work focusing on the different sources of evidence used as priors in XML information retrieval. Section 3 describes our probabilistic model with an approach for estimating the element context importance. Finally, we present the results of the experiments in Section 4, and conclude this work and list some perspectives in Section 5.

2. Related work

Several query-independent features have been successfully used in information retrieval models in order to enhance the retrieval effectiveness. More particularly, features such as the number of incoming links to a document (Kamps et al., 2010; Kraaij et al., 2002; Westerveld et al., 2001), the PageRank (Page et al., 1999), the type of the URL associated with a document (Kraaij et al., 2002; Westerveld et al., 2001), the document publication time (Peetz & Rijke, 2013) and the anchor text of outlinks (Kamps et al., 2010) are extensively exploited in web information retrieval. Earlier works such as Hiemstra et al. (2002) have shown that the document length significantly improves the retrieval effectiveness in document information retrieval. In the same line, Miller et al. (1998) have combined the document length with the average word length.

Huurdeman, Kamps, Koolen, and Wees (2012) have exploited the number of reviews, the rating average and the user tag frequencies in social information retrieval. Damak, Pinel-Sauvagnat, Boughanem, and Cabanac (2013) have introduced the language quality, and Badache and Boughanem (2015) have exploited social signals such as the number of user likes and shares.

In recent years, several studies such as Beckers and Korbar (2010), Jay, Stevens, Glencross, Chalmers, and Yang (2007), Tran and Fuhr (2012) and Velásquez (2013) have exploited the eye-tracking technique in order to determine the user reading process on the Internet by understanding how people look at web pages. The user reading process may reveal something about the salience and the importance of elements in the document. Since it consolidates our intuition, the study carried out in Buscher, Cutrell, and Morris (2009) seems to be the most interesting for our work. The authors have mapped the gaze data, which reflects the user attention, to DOM (Document Object Model) elements to build up a salience map of important elements. Their objective is to create a model based on the DOM of Web pages that can predict the user attention on single elements on a page. According to this study, we can deduce that a user explores a document by following a top-down method going from the general aspects towards the detail. Initially, the user is attracted by generic elements, which are at the top of the tree structure. Then he goes in depth into more specific elements. Buscher et al. (2009) clearly demonstrate that the first look at a document expresses user expectations about where to find relevant information.

The structural information of XML elements was earlier integrated by Guo, Shao, Botev, and Shanmugasundaram (2003) in the XRANK retrieval process by considering a two-dimensional proximity metric involving both the keyword distance and the element ancestor distance. The element length (number of tokens in the element textual content) seems to be the most used source of evidence (Banerjee & Han, 2009; Blanco & Barreiro, 2008; Ganguly et al., 2010; Kamps et al., 2004; Lalmas, 2009; Ogilvie & Callan, 2004, 2005; Pehcevski, Thom, & Tahaghoghi, 2005; Sigurbjörnsson, 2006; Sigurbjörnsson, Kamps, & Rijke, 2004). The element length prior influences the relative ranking by favoring the longest elements. However, it was clearly shown by Kamps et al. (2004) that the effectiveness of the length prior should not be interpreted as a general claim that long elements are inherently more relevant than short ones. Ashoori et al. (2007) explore another source of evidence, namely the number of topic shifts in the node content. The idea is that, since the exhaustivity¹ and the specificity² are both expressed in terms of the "quantity" of topics discussed within each element, the number of topic shifts within an element reflects its relevance. This approach tends to favor elements that cover several topics. However, two nodes having different structural characteristics and different content can easily have the same number of topic shifts. Mihajlovic et al. (2005) exploit information about relevant elements issued from the user's relevance judgments to update the priors, in order to discover the characteristics of relevant elements and update the priors in such a way that elements with similar characteristics are favored. However, this kind of information depends on the query and cannot actually be considered as a prior. Two other properties are used by Huang et al. (2006) and Huang (2007), namely the element position in the document and its path length. The idea behind this is that relevant elements tend to appear at the beginning of the document and are not likely to be nested in depth. This approach favors the nodes with the shortest path and closest to the document root. As with topic shifts, elements can be found with the same location at the same level in the document tree, but completely different in structure and content. The element types are used in Termehchy and Winslett (2011) to measure the strength of the relationships in a candidate answer and rank the candidate answers according to their strengths. The intuition is that the closer the association between types is in a subtree, the more the subtree represents a meaningful and coherent object.

¹ The exhaustivity measures how exhaustively an element discusses the topic of the user query (Lalmas, 2009).

² The specificity measures the extent to which an element focuses on the topic of the query (Lalmas, 2009).

Among the works exploiting priors, two have particularly attracted our attention: those presented by Beigbeder, Géry, Largeron, and Seck (2010) and Géry and Largeron (2012), and by Arvola, Kekäläinen, and Junkkari (2011). The work described in Beigbeder et al. (2010) is the closest one to our work. Indeed, the authors propose an extension of BM25 (Robertson, Zaragoza, & Taylor, 2004) by introducing structural features in the relevance estimation formula. The authors assume that the capacity of a tag to highlight relevant terms is intrinsic to the tag itself and is therefore not dependent on the content terms. The objective then is to evaluate whether a word featuring in a title is more important than a word taken from a section/paragraph, regardless of the word itself. A weight is computed for each tag and estimates the probability that the tag marks a relevant term. We share almost the same intuition; however, there are two major differences with our approach. First, the way Géry and Largeron (2012) estimate the tag weights depends on the element content, while in our approach we do it independently of the content: we assume that the structural characteristics of an element can give better results. Second, the tag is considered once for the entire collection, while we distinguish the different levels in the document structure. For us, a paragraph in the document summary should not have the same influence as a paragraph at a section or a conclusion level.

The second work (Arvola et al., 2011) develops contextualization models. The main idea is to explore the effect of different contextualization models on different hierarchical levels. The authors hypothesize that the retrieval of short and focused elements would benefit from structural contextualization more than the retrieval of broader ones. We are particularly interested in the different ways of considering the structural context of an element and its influence on the retrieval.

In summary, we reviewed in this section various sources of evidence. We note that they are often used to favor a specific type of element, such as long elements or elements with more topic shifts. An important source of prior knowledge in XML information retrieval is the hierarchical document structure. The document structure does not only organize the document content but also plays a considerable role during the browsing of the document by the user. The structural characteristics of an element, such as its position in the document and its surrounding context, draw the attention of users. By understanding the probability that a user chooses a given element at a given level in the document structure, we can find a new source of evidence that allows benefiting from the document hierarchical structure in order to determine which types of elements would be useful to favor during the retrieval.

3. Context importance

We propose a novel query-independent feature for XML information retrieval. We use the structural characteristics of elements as a source of evidence in order to quantify the element importance that reflects user expectations about where to find relevant information. First, we present the document model, then we define the structural context and finally, we present our approach for estimating the context importance.

3.1. Structural context modeling

In this section, we present our contextualization approach, which consists of a document representation and a context model.

3.1.1. Document model

Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. It is defined by the W3C's XML 1.0 specification. XML documents have a hierarchical structure and can conceptually be interpreted as a tree structure, called an XML tree. XML documents must contain a root element (one that is the parent of all other elements). All elements in an XML document can contain subelements, text and attributes. The tree represented by an XML document starts at the root element and branches to the lowest level of elements.

An example of an XML document is given in Fig. 1.

An XML document model can represent both tree-shaped and graph-structured data. In this paper, we consider only tree-shaped XML files. We model the document structure as a tree, where each node represents an XML element identified by its XPath (a path-based language for finding information in an XML document). Fig. 2 presents the document tree of the XML document shown in Fig. 1 without text nodes.

We use the following functional notations for relationships among elements. Let e and f be two XML elements.


Fig. 1. Example of an XML document.

• label(e): Label (tag) of the element e.

• parent(e): Parent node of e in the document tree.

• ancestors(e): An ancestor of an element e is an element in the same document as e belonging to the hierarchical path of e in the document tree (its parent, grandparent, great-grandparent, etc.). The function ancestors(e) gives all the ancestors of e.

• siblings(e): List of elements at the same level as e.

• distance(e,f): Represents the difference between the level of the element e and that of the element f. It is estimated as follows:

\[
\operatorname{distance}(e, f) =
\begin{cases}
\operatorname{level}(f) - \operatorname{level}(e) & \text{if } e \in \operatorname{ancestors}(f) \\
0 & \text{otherwise}
\end{cases}
\tag{1}
\]

3.1.2. Context model

The context of an element consists of elements that have a relationship with the contextualized element at a given distance. To define the context of an element in the XML hierarchy, we need to define two concepts: the structural relationship between elements and the distance, which refers to the structural remoteness between elements.

We first consider that elements sharing the same characteristics, such as label and hierarchical level, belong to the same context. Therefore, we have to define a class of elements and construct a context around it. Secondly, instead of using the distance between elements, we consider a probability between element classes. In the following, we define the element type, the relationship between the element types and then the context of an element type.

Fig. 2. XML document tree.

Definition 1 (Element type). An element type T is a class of elements in a document having the same label l at a given level v, denoted as:

\[
T\langle l, v \rangle = \{\, e \mid \operatorname{label}(e) = l \wedge \operatorname{level}(e) = v \,\}
\]

The elements satisfying the element type definition are called instances of that element type. For example, the instances of the element type T = <p,4> in the XML document of Fig. 1 are those accessible by the following paths:

/article[1]/abstract[1]/section[1]/p[1];

/article[1]/abstract[1]/section[1]/p[2];

/article[1]/body[1]/section[1]/p[1];

/article[1]/body[1]/section[1]/p[2];

/article[1]/body[1]/section[1]/p[3];

/article[1]/body[1]/section[2]/p[1].
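The grouping of elements into element types can be illustrated with a minimal Python sketch. It assumes that every element is identified by an absolute XPath with positional predicates, as in the list above, and that the root element is at level 1; the function names `element_type` and `group_by_type` are hypothetical and not part of the original model.

```python
import re
from collections import defaultdict


def element_type(xpath):
    """Return the element type (label, level) of an absolute XPath."""
    steps = [s for s in xpath.strip("/").split("/") if s]
    label = re.sub(r"\[\d+\]$", "", steps[-1])  # drop the positional predicate
    return label, len(steps)  # assumption: the root element is at level 1


def group_by_type(xpaths):
    """Map each element type to the list of its instances (XPaths)."""
    groups = defaultdict(list)
    for path in xpaths:
        groups[element_type(path)].append(path)
    return dict(groups)


# The six paths listed in the text, plus three of their ancestors.
paths = [
    "/article[1]",
    "/article[1]/abstract[1]",
    "/article[1]/body[1]",
    "/article[1]/abstract[1]/section[1]/p[1]",
    "/article[1]/abstract[1]/section[1]/p[2]",
    "/article[1]/body[1]/section[1]/p[1]",
    "/article[1]/body[1]/section[1]/p[2]",
    "/article[1]/body[1]/section[1]/p[3]",
    "/article[1]/body[1]/section[2]/p[1]",
]
types = group_by_type(paths)
print(len(types[("p", 4)]))
```

Under these assumptions, the six paragraph paths all fall into the single element type <p,4>, whatever their parent section.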

The element types represent classes of elements defined by a label appearing at a given level of the document tree. As the elements of an XML document are linked by hierarchical relationships, their respective element types should also be linked by an aggregation of hierarchical relationships. We thus define the relationship between two element types as follows.

Definition 2 (Hierarchical relationship between element types). Given two element types T and U, let dist be a positive integer. The hierarchical relationship between T and U at distance dist, denoted H_dist(T,U), determines the existence of elements e and f (respectively instances of T and U) which are at distance dist in the same document of the collection. Formally, H_dist(T,U) is defined as follows:

\[
H_{dist}(T, U) =
\begin{cases}
1 & \text{if } \exists e \in T \wedge \exists f \in U \text{ where } \operatorname{distance}(e, f) = dist \\
0 & \text{otherwise}
\end{cases}
\tag{2}
\]

The hierarchical relationship between element types represents the parent/child relationship when dist=1, the grandparent/grandson relationship when dist=2 and the ancestor/descendant relationship when dist>2.

The element types of a document can be represented as a directed graph where nodes represent the element types and oriented edges express the hierarchical relationships between these element types. Such a graph illustrates the possible relations between the different element types in an XML document.

Definition 3 (Element type graph). The element type graph G_d=(T,E,dist) of document d is the directed graph where T={T | T is an element type} is a set of nodes, each node representing an element type, and E is a set of edges, which are ordered pairs of elements of T and represent the hierarchical relationships between element types. The edges are constructed as follows:

\[
E = \{\, (T, U) \mid T \text{ and } U \text{ are element types} \wedge H_{dist}(T, U) = 1 \,\}
\]

Example. Let d be the XML document represented in Fig. 1. Fig. 3a, b and c represent respectively the element type graph with dist=1 (parent/child relationship), with dist=2 (grandfather/grandson relationship) and with the root/leaf relationship, where dist is the leaf level − 1.


Fig. 3. Element type graph.

We note that the graph differs according to the considered distance. Some hierarchical relationships between element types disappear depending on the considered distance.
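The dependence of the edge set on the distance can be sketched in Python as follows. This is only an illustration under stated assumptions: elements are given as absolute XPaths, the root is at level 1, and the helper names `type_chain` and `type_graph_edges` as well as the small document are hypothetical. An edge (T, U) is produced whenever some instance of U has an ancestor of type T exactly `dist` levels above it, which is how Formula (2) is read here.

```python
import re


def type_chain(xpath):
    """Element types (label, level) along the root-to-element path."""
    steps = xpath.strip("/").split("/")
    return [(re.sub(r"\[\d+\]$", "", s), i + 1) for i, s in enumerate(steps)]


def type_graph_edges(xpaths, dist=1):
    """Edges (T, U) such that H_dist(T, U) = 1 for the given elements."""
    edges = set()
    for path in xpaths:
        chain = type_chain(path)
        if len(chain) > dist:
            # The ancestor located `dist` levels above the element itself.
            edges.add((chain[-1 - dist], chain[-1]))
    return edges


# Hypothetical document, listed element by element.
paths = [
    "/article[1]",
    "/article[1]/abstract[1]",
    "/article[1]/abstract[1]/p[1]",
    "/article[1]/body[1]",
    "/article[1]/body[1]/section[1]",
    "/article[1]/body[1]/section[1]/p[1]",
]
for d in (1, 2, 3):
    print(d, sorted(type_graph_edges(paths, d)))
```

With this toy document, the graph has five edges at dist=1, three at dist=2 and a single root/leaf edge at dist=3, which mirrors the observation that relationships disappear as the distance grows.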

We can now define the structural context of an element type (Definition 4) according to its position in the element type graph and the other nodes that are connected to it.

Definition 4 (Intrinsic structural context). An intrinsic structural context of an element type T is defined by a set composed of T and its predecessors. The intrinsic structural context is denoted:

\[
C_T = \{\, U \mid U \text{ is an element type} \wedge (U \in R_T \vee U = T) \,\}
\]

Here, R_T is the set of all the predecessors of the element type T in the element type graph G_d=(T,E,dist), knowing that a predecessor of an element type in the element type graph is an element type having a hierarchical relationship with the considered element type: R_T={U | U is an element type ∧ H_dist(U,T)=1}.

The instances of the element type T are elements belonging to its context C_T.

The intrinsic structural context is defined over the element type graph. Thus, it depends on the distance considered in the hierarchical relationship between element types. Consequently, an element type may have different contexts according to the given hierarchical relationship.

For example, the element type T=<p,4> in the element type graph of Fig. 3 has three structural contexts, shown in Fig. 4, according to the given hierarchical relationship.

The intrinsic structural context of the element type T=<p,4> is then, for each considered distance, as follows:

- Parent/child relationship: C_T={<p,4>,<article,1>,<body,2>,<abstract,2>,<section,3>}

- Grandfather/grandson relationship: C_T={<p,4>,<article,1>,<body,2>,<abstract,2>}

- Root/leaf relationship: C_T={<p,4>,<article,1>}

3.1.3. Contextimportance

As mentioned above, the user choice is guided by his expectations about which elements may contain relevant information. We assume that the user explores the document tree level by level and, at each level, chooses between elements with different contexts. The choice between elements with the same context is made arbitrarily. We may quantify these expectations and consider them as the context importance.

Definition 5 (Context importance). Given two element types T and U, let H_dist(T,U) be the hierarchical relationship between T and U at a given distance dist. The context importance of the element type U compared to the element type T according to the hierarchical relationship H_dist, denoted CI_dist(T,U), is the probability that a user explores an element f belonging to


Fig. 4. Intrinsic structural context of an element type.

The context importance is defined as a transition weight from one element type to another. It can therefore be used as a weighting of the element type graph edges and thus give a weighted graph. The resulting graph shows a map of the various possible paths that the user can follow to get the desired information.

Definition 6 (User browsing map). Given a document d, a user browsing map BMap(d) is the element type graph of d where edges are weighted according to the context importance: G^w_d = (E, V, d, w), where w: E→[0,1] is a function that maps directed edges to their context importance value.

Fig. 5 represents the user browsing map of the document tree shown in Fig. 2 with dist=1.

The user browsing map allows predicting the user browsing process of a given document. The edges represent the possible paths that the user may follow to access an element type from another one. The context importance weighting each edge reflects the probability that a user chooses that target edge.

3.2. Context importance estimation

When the user explores an element type in the user browsing map of a given document, the context importance represents the probability that he expects to find a given element type among the successors of the current element. This can be interpreted as a conditional probability and depends on the hierarchical relationships between element types in the user browsing map. First, we define the context importance by considering the parent/child relationship between element types (dist=1), which means that the user selects an element from the children of the element that he is exploring, and then we define the generalized formula.

3.2.1. Parent/child context importance estimation

Given two element types T, U and the parent/child hierarchical relationship H_1(T,U), the context importance CI_1(T,U) of U knowing T according to H_1 is estimated by a conditional probability as follows:

\[
CI_1(T, U) = P(U \mid T)
\tag{3}
\]

CI_1(T,U) indicates the probability that a user selects an element f in the document tree belonging to the context C_U knowing that he is exploring an element e belonging to the context C_T.

According to Bayes' formula, the probability of Formula (3) is calculated by:

\[
P(U \mid T) = \frac{P(U) \cdot P(T \mid U)}{P(T)}
\tag{4}
\]

Fig. 5. User browsing map.

The precedence likelihood P(T|U) expresses the probability that, when the user is exploring an element type U, he has just explored an element type T among the predecessors of U. An element type in a user browsing map may have several predecessors, and each element type is represented just once. The precedence likelihood therefore translates the probability that one of the predecessors of U is T. Thus, it is estimated as follows:

\[
P(T \mid U) = \frac{1}{|R_U|}
\tag{5}
\]

Consequently, the context importance CI_1(T,U) is calculated as follows:

\[
CI_1(T, U) = \frac{P(U)}{|R_U| \cdot P(T)}
\tag{6}
\]

3.2.2. Estimation of the context probability P(T)

Given an element type T, P(T) is the probability that an element e belonging to C_T is explored by the user. Since a user explores the document level by level, and chooses the element type to explore at each level, we estimate the probability P(T) by using the maximum likelihood at each level of the user browsing map.

To know the number of elements in a document belonging to a given context, we define the context cardinality as follows.

Definition 8 (Context cardinality). Let C_T be the intrinsic structural context of element type T. The cardinality of C_T, denoted |C_T|_d, is the number of elements in document d belonging to context C_T.

Probability P(T) is estimated by the following formula:

\[
P(T) = \frac{|C_T|_d}{\sum_{U:\, U.level = T.level} |C_U|_d}
\tag{7}
\]

where |C_T|_d is the cardinality of the context C_T and the denominator, the sum of |C_U|_d over the element types U with U.level = T.level, is the sum of cardinalities of all the contexts at the same level as C_T.
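As an illustration of Formula (7), the following Python sketch computes the maximum likelihood estimate from per-document context cardinalities. The cardinalities are made-up values and the function name `context_probability` is hypothetical; the sketch also assumes that the cardinality |C_T|_d is already available for each element type, keyed by (label, level).

```python
def context_probability(cardinality, target):
    """Formula (7): |C_T|_d over the sum of |C_U|_d for types U at T's level."""
    _, level = target
    same_level_total = sum(
        count for (_, lvl), count in cardinality.items() if lvl == level
    )
    if same_level_total == 0:
        return 0.0  # no context at this level occurs in the document
    return cardinality.get(target, 0) / same_level_total


# Hypothetical cardinalities |C_T|_d for one document d.
card_d = {
    ("article", 1): 1,
    ("abstract", 2): 1,
    ("body", 2): 3,
    ("section", 3): 4,
    ("p", 3): 2,
}
print(context_probability(card_d, ("body", 2)))
print(context_probability(card_d, ("section", 3)))
```

By construction, the estimates of all element types located at the same level sum to 1, and a type that does not occur in the document receives probability 0.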

Two exceptional cases can arise in this estimation of the context importance. First, maximum likelihood will assign zero probability to contexts not occurring in a document. For example, let C_T={<section,3>,<article,1>,<body,2>} be an intrinsic structural context in the user browsing map of a document collection. If the context of a given element type T=<section,3> does not have any occurrence in a given document d, its cardinality is |C_T|_d=0. Thus, the probability P(T) in the document is null. We use a smoothing method to avoid the null probability. The context importance of an element type is thus not only the result of the document structure but also of the occurrences of that element type in the whole collection.

Fig. 6. Estimation of the context importance.

Several classes of smoothing strategies have been proposed. Zhai and Lafferty (2004) studied three approaches to smoothing: Jelinek-Mercer smoothing, Dirichlet priors and absolute discounting, as well as the backoff versions of these methods. The effects of each of these smoothing mechanisms were examined on five different test collections. There was a clear ordering among the methods in terms of precision results: Dirichlet priors performed better than absolute discounting, which performed better than Jelinek-Mercer. Following Zhai and Lafferty (2004), we use the Dirichlet smoothing technique to calculate P(T) as follows:

\[
P(T) = \frac{|C_T|_d + \mu_s \cdot \dfrac{|C_T|_c}{\sum_{U:\, U.level = T.level} |C_U|_c}}{\mu_s + \sum_{U:\, U.level = T.level} |C_U|_d}
\tag{8}
\]

where c denotes the whole collection and μ_s is the smoothing parameter.
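The effect of Formula (8) on a context that is absent from a document can be sketched as follows. The cardinalities, the value of the smoothing parameter `mu` and the function names are illustrative assumptions only; none of them comes from the experiments.

```python
def level_total(cardinality, level):
    """Sum of the context cardinalities of all element types at a level."""
    return sum(count for (_, lvl), count in cardinality.items() if lvl == level)


def smoothed_context_probability(card_d, card_c, target, mu=10.0):
    """Formula (8): Dirichlet-smoothed estimate of P(T).

    card_d holds the cardinalities |C_T|_d in the document, card_c the
    cardinalities |C_T|_c in the whole collection.
    """
    _, level = target
    total_c = level_total(card_c, level)
    collection_prob = card_c.get(target, 0) / total_c if total_c else 0.0
    return (card_d.get(target, 0) + mu * collection_prob) / (
        mu + level_total(card_d, level)
    )


# Hypothetical counts: <abstract,2> never occurs in document d.
card_d = {("article", 1): 1, ("body", 2): 2}
card_c = {("article", 1): 100, ("body", 2): 150, ("abstract", 2): 50}

p_abstract = smoothed_context_probability(card_d, card_c, ("abstract", 2))
p_body = smoothed_context_probability(card_d, card_c, ("body", 2))
print(p_abstract, p_body)
```

In this toy setting the unseen type <abstract,2> receives a small non-zero probability borrowed from the collection, while the estimates at level 2 still sum to 1.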

On the other hand, as we estimate the probability of an element type by using the maximum likelihood at every level of the document tree (Formula (7)), it is possible, when estimating the context importance in Formula (6), that P(U) is greater than P(T) because of the nature of the document structure. That can make CI_1(T,U) greater than 1. Therefore, to avoid such a case, Formula (6) becomes:

\[
CI_1(T, U) = \frac{P(U)}{1 + |R_U| \cdot P(T)}
\tag{9}
\]

By applying Formula (9) to the user browsing map of Fig. 5, corresponding to the document of Fig. 1, we obtain the context importance illustrated in Fig. 6. We annotate nodes with the corresponding element type probability and edges with the structural context importance.
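To see why the adjustment matters, the following short sketch evaluates Formulas (6) and (9) side by side. The probabilities are made-up values chosen so that P(U) exceeds P(T); they are not read from Fig. 6, and the function names are hypothetical.

```python
def ci1_bayes(p_u, p_t, n_pred_u):
    """Formula (6): P(U) / (|R_U| * P(T))."""
    return p_u / (n_pred_u * p_t)


def ci1_bounded(p_u, p_t, n_pred_u):
    """Formula (9): P(U) / (1 + |R_U| * P(T))."""
    return p_u / (1 + n_pred_u * p_t)


# Hypothetical values: U dominates its level, T does not, U has one predecessor.
p_u, p_t, n_pred_u = 0.8, 0.25, 1
print(ci1_bayes(p_u, p_t, n_pred_u))    # exceeds 1, so it is not a probability
print(ci1_bounded(p_u, p_t, n_pred_u))  # stays below P(U), hence below 1
```

Since the denominator of Formula (9) is at least 1, the resulting weight can never exceed P(U), which keeps every edge weight of the user browsing map within [0,1].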

3.2.3. Generalized context importance estimation

The generalization of the context importance estimation allows considering any distance dist. It can be considered as a propagation of the context importance throughout the user browsing map. This leads us to consider two cases. First, we estimate the context importance propagation along a simple path, then throughout the whole user browsing map.

3.2.3.1. Propagation of CI_dist along a simple path. Fig. 7 shows the importance of the element type <p,5> at different distances compared to its respective predecessors along a simple path.

Fig. 7. Context importance along a simple path.

According to the distance dist, a user may access an element belonging to <p,5> from <subsec,3> if dist=2, from <section,2> if dist=3 and from <article,1> if dist=4. Thus, the context importance of the element type <p,5> differs according to the given distance. In Formula (6), we defined it for distance dist=1 (parent/child relationship). We assume conditional independence over the different levels of the simple path and now see how to estimate the context importance no matter what the value of the distance dist is.

With distance dist=2, the hierarchical relationship H_2(T,U) between the element types T and U means that there exist two elements e and f in the document tree belonging respectively to the contexts C_T and C_U, where e is the grandfather of f. Thus, there exists an element g such that g is the child of e and the parent of f. According to the definition of the hierarchical relationship in Formula (2), and knowing that an element in the document tree has only one parent, this means that e cannot be the grandfather of f without the existence of an intermediate element g. Consequently, the relationship H_2(T,U) cannot exist without the existence of the two relationships H_1(T,V) and H_1(V,U) at the same time, where V is the element type of g.

On the other hand, the user choice of the next node to explore at any fixed time during the browsing process over the browsing map of a given XML document does not depend on the history of all the visited nodes but only on the current node. Therefore, we can consider the user browsing process as Markovian with respect to a filtration {F_t}. Thus, with the Markov property of the browsing process, the context importance at distance dist=2 is estimated as the product of the two context importances at distance dist=1 as follows:

\[
CI_2(T, U) = CI_1(T, V) \cdot CI_1(V, U)
\tag{10}
\]

By replacing the context importances in Formula (10) with their estimation in Formula (6), we obtain:

\[
CI_2(T, U) = \frac{P(V)}{|R_V| \cdot P(T)} \cdot \frac{P(U)}{|R_U| \cdot P(V)} = \frac{P(U)}{P(T) \cdot |R_V| \cdot |R_U|}
\tag{11}
\]

We note that what propagates along a simple path is just the precedence likelihood. This means that the intermediate element types between two element types along a simple path do not matter, whatever the distance between T and U. In general, the context importance CI_dist along a simple path is calculated as follows:

\[
CI_{dist}(T, U) = \frac{P(U)}{P(T)} \cdot \prod_{V \in path(T,U)} \frac{1}{|R_V|}
\tag{12}
\]

where path(T,U) is a function giving the set of element types belonging to the path going from T to U in the user browsing map.
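As a worked illustration of Formula (12), consider a simple path T → V → W → U at distance dist=3 (the intermediate type W is introduced here only for the example). Assuming, as for dist=2, that the Markov property lets the three parent/child steps multiply, and applying Formula (6) to each step, the intermediate probabilities P(V) and P(W) cancel:

```latex
\[
CI_3(T, U)
  = \frac{P(V)}{|R_V| \cdot P(T)} \cdot \frac{P(W)}{|R_W| \cdot P(V)} \cdot \frac{P(U)}{|R_U| \cdot P(W)}
  = \frac{P(U)}{P(T)} \cdot \frac{1}{|R_V| \cdot |R_W| \cdot |R_U|}
\]
```

Only the predecessor counts of the types visited after T remain, which is consistent with the product over path(T,U) in Formula (12).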

3.2.3.2. Estimating CI_dist in a user browsing map. Fig. 8 shows the context importance of the element type T=<p,6> at different distances compared to its predecessors in a user browsing map. When the distance dist is greater than 1, the same element type can be accessed from different paths.

With distance dist=2, a user may access the element type T_6=<p,6> from two possible paths. He may go from T_4=<subsec,4> or from T_5=<citation,5>. Thus, the probability of exploring an element of type U just after exploring an element of type T_3=<section,3> can be estimated along two simple paths. Consequently, the user may choose the first path or the second one. Thus the probability of exploring an element of type U is the sum of the two probabilities, that of the path passing by V and that of the path passing by W. However, according to Formula (9), the intermediate element types between U and T do not matter:

\[
CI_2(T, U) = \frac{P(U)}{2 \cdot P(T)} + \frac{P(U)}{2 \cdot P(T)} = \frac{P(U)}{P(T)}
\tag{13}
\]

Fig. 8. Context importance on all the user browsing map.

In general, given two element types T and U, the context importance of T compared to U at a given distance dist is estimated with the generalized formula:

\[
CI_{dist}(T, U) = \frac{P(U)}{P(T)} \cdot \sum_{P \in paths(T,U)} \prod_{V \in P} \frac{1}{|R_V|}
\tag{14}
\]

where paths(T,U) is the set of all possible paths from T to U in the user browsing map.

To avoid the same exceptional case mentioned for the context importance estimation (see Formula (6)), Formula (14) is then estimated as follows:

\[
CI_{dist}(T, U) = \frac{P(U)}{1 + P(T)} \cdot \sum_{P \in paths(T,U)} \prod_{V \in P} \frac{1}{|R_V|}
\tag{15}
\]

Formula (15) represents the generalized formula for estimating the context importance of any element type in the user browsing map, whatever the distance dist.
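A possible reading of Formula (15) is sketched below in Python. Several points are assumptions made for the example: the browsing map is given as a successor dictionary of the dist=1 graph, the product over a path runs over every element type of the path except the starting type T (as in Formula (11)), and the element types, probabilities and function names are hypothetical.

```python
from math import prod


def all_paths(successors, start, goal):
    """Enumerate every directed path from start to goal (acyclic graph)."""
    if start == goal:
        return [[start]]
    return [
        [start] + rest
        for nxt in successors.get(start, ())
        for rest in all_paths(successors, nxt, goal)
    ]


def generalized_ci(successors, prob, t, u):
    """Formula (15): P(U) / (1 + P(T)) * sum over paths of prod 1/|R_V|."""
    # |R_V|: number of predecessors of each type in the dist = 1 graph.
    n_pred = {}
    for targets in successors.values():
        for dst in targets:
            n_pred[dst] = n_pred.get(dst, 0) + 1
    # Assumption: the product skips the starting type T itself.
    path_weight = sum(
        prod(1 / n_pred[v] for v in path[1:])
        for path in all_paths(successors, t, u)
    )
    return prob[u] / (1 + prob[t]) * path_weight


# Hypothetical map: U is reachable from T through V or through W.
T, V, W, U = ("section", 3), ("subsec", 4), ("citation", 4), ("p", 5)
successors = {T: [V, W], V: [U], W: [U]}
prob = {T: 0.5, V: 0.6, W: 0.4, U: 0.3}
print(generalized_ci(successors, prob, T, U))
```

With two paths of weight 1/2 each, the path term sums to 1, so the intermediate types V and W have no influence on the result, in line with Formula (13).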

4. Experiments and results

4.1. Methodology

In this section, we describe our methodology to evaluate the intrinsic structural context importance in XML retrieval. Our aims are:

(a) Examining the characteristics of XML elements reflected by their context importance (Section 4.2);

(b) Comparing the context importance with the length prior, which is the most used source of evidence in information retrieval (Section 4.3.3);

(c) Comparing our results with four different works using different sources of evidence (Section 4.4).

We conducted extensive experiments on the INEX IEEE collection to investigate our three aims. Section 4.1.1 describes the IEEE collection used in our experiments, Section 4.1.2 presents the baseline retrieval model that we use, and Section 4.1.3 discusses our experimental settings.

4.1.1. Collection and metrics

INEX (the Initiative for the Evaluation of XML Retrieval) provides a benchmark for the evaluation of XML information retrieval. This includes a document collection, topics, relevance assessments and metrics. The document collection used in the INEX experiments has changed several times over the years. Before 2006, the collection used was the IEEE XML document collection, which consisted of 16,819 articles, marked up in XML, from 24 magazines of the IEEE Computer Society's publications, covering the period 1995–2004, totaling 764 MB in size and over 11 million elements. On average, an article contains 1532 XML nodes, where the average depth of a node is 6.9. In 2007, INEX introduced the Wikipedia XML document collection. The 2007 document collection was approximately 5.6 GB in size. On average, an article contains 161.35 XML nodes, where the average depth of an element is 6.72. In 2009, INEX provided a new Wikipedia collection, which is approximately 60 GB in size and contains 2.7 million articles with over 30,000 unique tags.

In order to measure the effectiveness of the element context importance, we need deep documents and a large number of elements per document and per level, so that our maximum-likelihood-based probability takes its full meaning. In our experiments, we use the IEEE XML document collection version 1.8 and the related topics with relevance assessments. In the INEX (IEEE) collection, the granularity levels are relatively easy to distinguish, since the collection provides a reasonably clear and standard division into article-section-subsection-paragraph levels, similar to many other XML standards for structured text. In addition, the two smoothing parameters of our formula that we must set compel us to perform multiple tests and adjustments, hence the need for a relatively small collection.

Two types of queries are used in INEX: content only (CO), and content and structure (CAS). Queries of the first type are formed by simple terms without any information on the node structure. The second type of query specifies the desired content and the desired structure. In our studies, we focus on CO queries because we want to show the difference between the node types through their importance.

Until 2005, the relevance assessments were collected along two dimensions, specificity and exhaustivity. Since 2006, only specificity is considered. Exhaustivity, which is defined as the extent to which the document component (XML element) discusses the topic of request, is assumed to be a constant factor bearing no effect on the relevance score of an XML element. Specificity, which is defined as the extent to which a document component focuses on the topic of request, is calculated automatically as the ratio of the number of highlighted characters contained within the XML element to the length of the element (rsize/size). Specificity hence can take any value in [0,1]. Since exhaustivity is a constant, the relevance score of an XML element is a function of the specificity score only (Fuhr, Lalmas, & Trotman, 2007).

The INEX benchmark proposes 29 content-only queries for the IEEE collection. We used all of these queries in our tests. The metrics used are the XCG (eXtended Cumulated Gain) metrics, which are an extension of Cumulated Gain (CG) (Järvelin & Kekäläinen, 2002) that takes into account the dependencies between XML elements. Two metrics are included in the XCG metrics: nxCG (normalized extended Cumulated Gain) and ep/gr (effort-precision/gain-recall). In our experiments, we use nxCG at the cutoffs 10, 25 and 50. The value nxCG[i] represents the ratio between the gain cumulated by the user at rank i and the gain that could be cumulated if the system were optimal. We also use the MAep measure, which is the average of the effort-precision values obtained at each rank where a relevant element is returned.
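The ratio behind nxCG[i] can be made concrete with a short Python sketch. It is a simplified illustration only: the gain values are hypothetical, the ideal ranking is assumed to be the assessed gains sorted in decreasing order, and the handling of overlapping elements that the full XCG metrics perform is deliberately left out.

```python
from typing import List


def nxcg(run_gains: List[float], assessed_gains: List[float], cutoff: int) -> float:
    """Simplified nxCG[cutoff]: cumulated gain of the run divided by the
    cumulated gain of an ideal ranking at the same rank.

    ASSUMPTION: the ideal ranking is the assessed gains sorted in
    decreasing order; overlap between elements is ignored.
    """
    ideal = sorted(assessed_gains, reverse=True)[:cutoff]
    ideal_cumulated = sum(ideal)
    if ideal_cumulated == 0:
        return 0.0
    return sum(run_gains[:cutoff]) / ideal_cumulated


# Hypothetical specificity-based gains of the elements returned by a system,
# in ranked order, and the gains of all assessed elements for the query.
run = [1.0, 0.0, 0.5]
assessed = [1.0, 1.0, 0.5, 0.0]
score_at_2 = nxcg(run, assessed, 2)
score_at_3 = nxcg(run, assessed, 3)
```

A value of 1 would mean that the system cumulated as much gain at that rank as an optimal ranking would have.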

4.1.2. Baseline retrieval model

In order to show the contribution of our approach, we need a model that integrates element priors into the relevance estimation. We can then use different sources of evidence as priors and compare the results on the same platform, which makes the comparison more meaningful. The question that arises at this level is: for a given user query, which query-independent characteristics of an element can improve retrieval? Therefore, as in (Kaptein & Kamps, 2013; Sigurbjörnsson, 2006), our baseline retrieval model is a standard language model, because it allows us to use query-independent characteristics such as a prior probability. The relevance of an element in a language model is computed as follows:

$$P(e|q) = P(e) \cdot P(q|e) \tag{16}$$

where e is an element and q is a query considered as a sequence of terms t1, t2, …, tn. P(e) is the prior probability of element e and P(q|e) is the probability of generating query q from element e. We consider a unigram language model where the terms composing the element content are produced randomly and independently from each other. It is therefore a multinomial distribution over the terms ti of the indexing vocabulary V, with freq(t,q) the frequency of term t in the query q. We assume that the content of an element e is obtained by gathering its own content with the content of all its descendants. The likelihood of generating the query from element e is obtained by the following formula:

$$P(q|e) = \prod_{t \in q} P(t|e)^{freq(t,q)} \tag{17}$$

where the conditional probability P(t|e) represents the probability that term t occurs given the language model of element e. It is calculated using maximum likelihood with Dirichlet smoothing, following the formula:

$$P(t|e) = \frac{freq(t,e) + \mu_m \ast \frac{freq(t,C)}{|C|}}{\mu_m + |e|} \tag{18}$$

where freq(t,C) is the frequency of term t in collection C and |C| is the sum of term frequencies in the collection.

4.1.3. Parameter settings

We used Dirichlet smoothing at two levels. The first parameter, µm, is used in the query likelihood estimation (Formula (18)). The second parameter, µs, is used in the context probability estimation (Formula (8)). To determine the parameter values that give the optimal results, we used cross-validation in a series of experiments. We divided the 29 queries into two groups, Group A and Group B. A grid search (from 50 to 3000 in steps of 50) is used to find the optimal parameter values on Group A, which are then tested on Group B, and vice versa. The results are shown in Fig. 9. The optimal values for the smoothing parameters are µm=360 and µs=10. We also noticed that the two smoothing parameters are completely independent.

Fig. 9. Smoothing parameters estimation for MAep metric.

Fig. 10. Most important element labels in the collection.
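To make the role of the smoothing parameter µm concrete, the baseline scoring of Formulas (16) to (18) can be sketched in Python as follows. This is an illustrative sketch rather than the system used in the experiments: the function name, the token lists and the collection statistics are hypothetical, the score is computed in log space for numerical stability, and |e| is assumed to be the number of term occurrences in the element.

```python
import math
from collections import Counter
from typing import Dict, List


def log_relevance(query: List[str], element: List[str],
                  coll_freq: Dict[str, int], coll_len: int,
                  prior: float, mu_m: float = 360.0) -> float:
    """log P(e) + sum over query terms of freq(t,q) * log P(t|e),
    with P(t|e) smoothed as in Formula (18).

    ASSUMPTION: |e| is the number of term occurrences in the element.
    """
    q_freq = Counter(query)
    e_freq = Counter(element)
    e_len = sum(e_freq.values())
    score = math.log(prior)
    for term, f_tq in q_freq.items():
        p_tc = coll_freq.get(term, 0) / coll_len            # freq(t,C) / |C|
        p_te = (e_freq[term] + mu_m * p_tc) / (mu_m + e_len)
        if p_te == 0.0:
            return float("-inf")  # term unseen in the whole collection
        score += f_tq * math.log(p_te)
    return score


# Hypothetical toy statistics, chosen only to keep the arithmetic readable.
coll_freq = {"xml": 10, "retrieval": 10}
score = log_relevance(["xml"], ["xml", "retrieval"], coll_freq,
                      coll_len=100, prior=0.5, mu_m=2.0)
```

Any element prior, such as a length prior or the context importance, can be passed through the `prior` argument, which is what allows different sources of evidence to be compared on the same platform.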

4.2. Characteristics of context importance

This section discusses the results of the experiments we conducted to investigate the characteristics of XML elements reflected by their context importance. First, we examine the relation between element types and their intrinsic structural context importance (Section 4.2.1). We discuss in Section 4.2.2 the relation between element context importance and element length. We then examine whether the level of an element in the document tree influences its context importance (Section 4.2.3).

4.2.1. Element type vs. context importance

This section discusses the relation between element types and their context importance. First, we present the most important element types in the collection according to their context importance. We then study the correlation between element type frequency and context importance.

Fig. 10 shows the list of the top 20 most important element types in the collection according to the average of their context importance, considering the parent/child relationship whatever the context level.

We note that the element type sec (section) is the most important in this collection, followed by the document root element article and then snm (second name), whereas the element type ti (title) is placed in 28th position. For a collection of scientific articles such as IEEE, sections carry the most information. They are thus more frequent in the documents, especially compared to their hierarchical parent in the document tree structure. We also notice that the element type paragraph (p) is at the 9th position while its parents (section, bb, bdy) are at the top of the list. This means that the context importance can distinguish between an element and its descendants or ancestors.

Fig. 11. Element type frequency vs. context importance.

Fig. 12. Elements length vs. context importance.

Fig. 11 presents context importance compared to element type frequency in the collection. We note that there is no direct relation between the two features. The higher values of context importance tend to be observed for the most frequent element types in the collection, but the relationship is not proportional. This can be explained by the distribution of the element types over their hierarchical contexts. Thus, what makes an element type more important than another is not only its frequency in the collection but also its frequency compared to its siblings.

Fig. 11 clearly shows that the most frequent element type is not the most important one. Element type p (paragraph), for example, is the most frequent in the collection, but it is not more important than section, whose frequency is much smaller. The elements containing elements of type p also contain other elements of different types, which is not the case for section. The structural context of element type section is thus more important than that of p, even if the latter is the most frequent.

4.2.2. Element length vs. context importance

Fig. 12 shows the relation between element length and context importance. We note that the context importance does not depend directly on element length. However, the length and the context importance evolve almost proportionally: the longer the element, the more important it seems to be. Indirectly, the context importance confirms that the longest elements are likely to be relevant and that the smallest ones can be neglected. This also shows that the structure of a document is not arbitrary but brings semantics to the textual content, because the same elements considered relevant according to their length are also relevant according to their context importance, which is obtained only from the structural characteristics, without considering the element content.

This result can be used to improve retrieval effectiveness, for instance by removing the smaller elements according to their importance. It seems that the greater the element length, the more important its context.

Fig. 13. Elements level vs. context importance.

Table 1
PCPrior vs. RCPrior results according to nxCG and ep-gr metrics.

          nxCG@10   nxCG@25   nxCG@50   MAep
PCPrior   0.2678    0.2437    0.2192    0.06868
RCPrior   0.2710    0.2433    0.2164    0.08040

4.2.3. Element level vs. context importance

Fig. 13 shows the relation between the level of the element types and the average context importance at each level of the collection tree structure.

Fig. 13 shows that the context importance is proportional to the element level and inversely proportional to the number of elements per level. This means that the deeper the element, the more important it is, and the greater the number of its siblings, the lower its importance. The first observation can be explained by the fact that the deeper we go, the more specific the elements we meet and the fewer the element types. Moreover, the lower the diversity of element types at a given level, the greater the importance. The second observation is directly related to the nature of P(T): the greater the number of siblings, the lower the element context importance. We deduce that the most important elements in structured documents are those at the top of the tree structure, because they are more general, and those that are deepest, because they hold the required information. The intermediate elements are thus less important.

4.3. Context importance as prior probability

In this section, we present the results obtained with the context importance model CPrior used as prior probability in the baseline model, with the smoothing parameters set to the optimal values found in our experiments. The results are presented according to the nxCG metric at cutoffs 10, 25 and 50 and the MAep metric. First, we carried out our experiments on all the elements of the collection without any distinction. Second, based on the relationship between context importance and element length studied in Section 4.2, we improved our retrieval effectiveness by removing small elements.

4.3.1. Variants of the context importance prior model

According to the distance considered in the user browsing map, the context importance can be estimated by comparison with the direct predecessors of the considered element type or with the root node. The first distance (dist=1) gives a local estimation of the context importance, whereas the second (dist=T.level−1, where T.level is the level of the element type T) gives a global estimation. According to these values, we distinguish two models: the parent-child context prior (PCPrior), where the importance is estimated according to Formula (9), and the root context prior (RCPrior), where the importance is estimated according to Formula (15).

Table 1 shows the results obtained with these two models according to the nxCG metric at cutoffs 10, 25 and 50 and the MAep metric. The RCPrior model achieved significantly better results than the PCPrior model (0.2710 vs. 0.2678) at cutoff 10. This means that the relevant elements are returned earlier by the RCPrior model, whereas the PCPrior model returns these elements only by cutoff 25. The MAep metric clearly shows that RCPrior gives better results than PCPrior (0.08040 vs. 0.06868), improving the retrieval effectiveness by 17.06%.
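The improvement percentages quoted in this section are relative differences between two metric values. As a small worked check, assuming that each percentage is computed as (new − base) / base, the MAep improvement of RCPrior over PCPrior can be reproduced from the MAep values reported in Table 1 (the function name is hypothetical):

```python
def relative_improvement(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100.0


# MAep values reported for RCPrior and PCPrior.
maep_gain = relative_improvement(0.08040, 0.06868)
print(f"{maep_gain:.2f}%")
```

With these inputs the result is approximately 17.06%, which agrees with the figure stated above.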

Fig. 14. CPrior improvement by removing small elements.

Table 2
CPrior vs. BaseLM and BaseLP according to nxCG and ep-gr metrics.

           nxCG[10]   nxCG[25]   MAep
CPrior     0.2710     0.2433     0.08040
BaseLM     0.1832     0.1921     0.06280
Diff (%)   47.92      26.65      28.03
BaseLP     0.2261     0.2199     0.06590
Diff (%)   19.86      10.64      22

The PCPrior model considers elements appearing deep in the tree with a reduced number of siblings as important even if they are not relevant. The local estimation of element importance can cause one element to be ranked as more important than another even though each was estimated relative to its own direct parent. Therefore, not estimating the importance of two elements according to the same context can bias the ranking.

On the other hand, the RCPrior model estimates the importance of all the elements according to the document root. Therefore, elements that are considered important are important for the entire document, not only for their local contexts. This means that relevant elements are more important according to the document root context than in their local context.

For the remaining experiments, we retain the RCPrior model, which we simply call CPrior.

4.3.2. Length improvement of CPrior

As shown in Section 4.2.2, the relationship between context importance and element length can be used to improve the retrieval effectiveness of the CPrior model by removing small elements. We conducted experiments to evaluate the evolution of MAep according to the minimal element length to be considered. Fig. 14 shows that by removing elements under 10 keywords (element length is calculated after stemming and removal of stop words), the retrieval effectiveness is significantly improved, by 21.87%.
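The length filter described above can be sketched in a few lines of Python. The sketch rests on stated assumptions: the stop-word list is a tiny hypothetical one, tokenization is a simple regular expression, and stemming is omitted because it changes the form of the keywords but not their number.

```python
import re
from typing import List, Tuple

# Hypothetical, deliberately tiny stop-word list used only for illustration.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for", "on"}


def keyword_length(text: str) -> int:
    """Number of tokens left once stop words are removed."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return sum(1 for tok in tokens if tok not in STOP_WORDS)


def drop_small_elements(elements: List[Tuple[str, str]],
                        min_keywords: int = 10) -> List[Tuple[str, str]]:
    """Keep only the (id, text) pairs with at least `min_keywords` keywords."""
    return [(eid, text) for eid, text in elements
            if keyword_length(text) >= min_keywords]


# Hypothetical elements: a short title and a longer paragraph.
elements = [
    ("ti[1]", "the title of the paper"),
    ("p[1]", "xml retrieval models rank elements using language models "
             "priors smoothing structure context importance"),
]
kept = drop_small_elements(elements)
```

The threshold of 10 keywords is the one reported for Fig. 14; it is exposed as a parameter because the experiments vary the minimal element length.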

4.3.3. Context prior vs. length prior

In this series of experiments, the results obtained by the CPrior model are compared with the basic language model BaseLM (the baseline model without the prior probability) and the length prior language model BaseLP presented in Ramírez, Westerveld, and Vries (2005), where the prior probability is considered as a function of element length.

Table 2 shows the results obtained with the context importance model CPrior on the IEEE document collection according to the nxCG metric at cutoffs 10 and 25 and the MAep metric, compared with the baseline model BaseLM and the length prior model BaseLP.

Compared with the BaseLM model, which uses no information as priors, CPrior presents a clear improvement of the retrieval effectiveness from the first returned elements: 47.92% at cutoff 10 and 26.65% at cutoff 25. The MAep measure also shows that the structural importance used as priors improves the results by 28.03%, which is a considerable improvement. This first comparison clearly shows that the structural importance can be used as priors to improve retrieval effectiveness and that elements considered important seem to be the most relevant.

The CPrior model also presents a considerable improvement compared with the BaseLP model (which uses the element length as priors) from the first returned elements: 19.86% at cutoff 10 and 10.64% at cutoff 25. The improvement according to the MAep measure, which is 22%, is also considerable, and it is statistically significant according to a t-test. We can deduce that the use of element content characteristics as priors improves the retrieval effectiveness, but our structural context importance (which is estimated exclusively from element structural characteristics) improves it further. In addition, it shows that an element with a high importance has a high probability of being relevant whatever its content.

Fig. 15. Comparison with other source of evidence according to the ep-gr metric.

The experiments carried out in this section clearly show that the use of the structural properties of the elements as a new source of evidence proved to be effective. The context importance improves the retrieval effectiveness more than the length prior model.

4.4. Context prior vs. other sources of evidence

In this section, we compare our approach with works using different sources of evidence as priors and different contextualization models, namely:

1. Börkur et al. (Sigurbjörnsson, 2006): The principal retrieval model is a language model in which the prior probability is estimated by the ratio of the element length over the length of the collection.

2. BM25t (Géry & Largeron, 2012): This model is an extension of the well-known BM25 model (Jones, Walker, & Robertson, 2000) adapted to XML retrieval. Information about element length is integrated into the relevance estimation with a tag weighting function, which estimates the capacity of a tag to reinforce a relevant term.

3. RMIT (Pehcevski et al., 2005): In addition to the element length, which is a content characteristic, this probabilistic model includes a structural feature consisting of the element absolute path length.

4. Dopichaj et al. (Dopichaj, 2006): In addition to the element length, this model exploits the element position in the document tree.

5. Paavo Arvola et al. (Arvola et al., 2011): The model takes into account the hierarchical distance between elements and the element position in the hierarchical structure of an XML document. Two contextualization models have been considered:

A. Arvola-V: A vertical contextualization taking into account the ancestors of an element to estimate its relevance.

B. Arvola-H: A horizontal contextualization model that considers elements at the same level as the considered element as context.

Fig. 15 shows the comparison of the CPrior model with the mentioned models according to the ep-gr metric.

All of the models shown in this comparison use a combination of element content and structural characteristics as priors. Our model gives better results than all the others, and this from the first returned elements. Notice that models directly integrating structural characteristics such as tags (BM25t), element absolute path length (RMIT) and element position (Dopichaj) present better results than those considering content features, such as Börkur. This strengthens our intuition and demonstrates that the document structure is an important source of information that allows improving retrieval effectiveness.

On the other hand, the curves of CPrior and Dopichaj evolve in almost the same way. We recall that the Dopichaj model uses the element position in the document tree as a source of evidence. It exploits patterns that strengthen the score of certain element types, such as those containing title elements. Besides the element position, information concerning the structural context of the element is also exploited. The fact that CPrior gives better results shows that our approach to priors implicitly integrates the element location in the document when estimating its importance. We do not need to specify which element type to boost; the intrinsic structural characteristics of the element are sufficient to determine its importance. The BM25t model (which takes into account the impact of tags by estimating the probability that a tag distinguishes relevant terms from the others) and the RMIT model (which exploits the element absolute path length) achieve lower results than CPrior. Because these models exploit only specific structural characteristics, and only one at a time (tags for BM25t, element position for Dopichaj and the element absolute path length for RMIT), they do not benefit from the full power of the document structure. CPrior, in contrast, integrates it through the structural context importance concept, which allows estimating the probability that an element contains relevant information.

We conclude that our model allows a good combination and a better exploitation of various element structural characteristics through the element type concept. The context importance reflects the impact of the element structure on the capacity of an element to contain relevant content.

5. Conclusion and perspectives

Content-oriented XML retrieval identifies highly relevant XML elements that would satisfy user information needs. To this end, several sources of evidence are exploited, the best known of which seems to be the element length. In this article, we hypothesize that the location of an element in the document structure has a considerable impact on the user's exploration process.

What makes a user consider an element relevant is the ability of that element to reflect user expectations about where to find relevant information. We therefore exploited a new source of evidence, the structural context importance, in order to quantify user expectations. This new measure is content-independent, as it only requires the structural information for a given element. We then defined a theoretically driven probabilistic model to estimate the structural context importance.

Using this probabilistic model, we first studied the characteristics of XML elements as reflected by their structural context importance. We then compared context importance to the length prior by incorporating each of them as features in a retrieval setting in order to compare their effect on XML retrieval effectiveness. Finally, we proposed a context importance smoothing process within the language modeling framework and investigated whether using context importance as a prior probability is effective for XML information retrieval. Our research objectives were investigated by carrying out extensive experiments on the IEEE XML document collection.

Regarding the comparison between length prior and context importance prior, the results indicate that even if the latter is completely independent of the former, it nevertheless seems that the most important elements are not small (content length is more than ten keywords). Our analysis further indicates that, contrary to the length prior, context importance does not exclude the smallest elements when these are relevant.

The comparison with other models using different sources of evidence showed that our model exploits a better combination of element structural characteristics. Our approach to estimating the importance of an element integrates at the same time the element position, the level of the element in the document tree and tags. The concepts of element type and structural context allow benefiting from the document structure as much as possible, and make our probabilistic model strong by reflecting the importance of elements as probably intended by the document writers. We can also deduce that the hierarchical structure of a document, where textual content is intentionally placed at precise places, is a deliberate construct of the author in order to convey information and attract readers.

Several perspectives remain open. First of all, since our model gives a content-independent and query-independent estimation of element importance, it can be generalized by incorporating links between XML elements in the same document and between elements in different documents. This may give a weighting approach for links in a specific collection such as the Wikipedia document collection. In another direction, the intrinsic structural context importance may be improved by integrating element content features. We could, for example, integrate the element length or the term weight in order to benefit at the same time from the structure and from the content.

Our model has been evaluated using the IEEE XML document collection, and an a posteriori analysis is presented, but with a risk of overfitting. The documents of this collection all share the same structure; it would be interesting to study the behavior of our model on a collection with heterogeneous document structures.

References

Arvola, P., Kekäläinen, J., & Junkkari, M. (2011). Contextualization models for XML retrieval. Information Processing & Management, 47, 762–776.

Ashoori, E., Lalmas, M., & Tsikrika, T. (2007). Examining topic shifts in content-oriented XML retrieval. International Journal of Digital Libraries, 8, 39–60.

Badache, I., & Boughanem, M. (2015). Document priors based on time-sensitive social signals. In Advances in information retrieval - 37th European conference on IR research, ECIR 2015, Vienna, Austria (pp. 617–622). March 29 - April 2, 2015. Proceedings. doi: 10.1007/978-3-319-16354-3_68.

Banerjee, P., & Han, H. (2009). Language modeling approaches to information retrieval. JCSE, 3, 143–164.

Bao, S., Xue, G.-R., Wu, X., Yu, Y., Fei, B., & Su, Z. (2007). Optimizing web search using social annotations. In WWW (pp. 501–510).

Beckers, T., & Korbar, D. (2010). Using eye-tracking for the evaluation of interactive information retrieval. In INEX (pp. 236–240).

Beigbeder, M., Géry, M., Largeron, C., & Seck, H. (2010). ENSM-SE and UJM at INEX 2010: scoring with proximity and tag weights. In Comparative evaluation of focused retrieval - 9th international workshop of the initiative for the evaluation of XML retrieval, INEX 2010, Vugh, The Netherlands (pp. 44–53). December 13-15, 2010, revised selected papers. doi: 10.1007/978-3-642-23577-1_3.


Buscher, G., Cutrell, E., & Morris, M. R. (2009). What do you see when you're surfing?: using eye tracking to predict salient regions of web pages. In Proceedings of the 27th international conference on human factors in computing systems, CHI 2009, Boston, MA, USA (pp. 21–30). April 4-9, 2009. doi: 10.1145/1518701.1518705.

Damak, F., Pinel-Sauvagnat, K., Boughanem, M., & Cabanac, G. (2013). Effectiveness of state-of-the-art features for microblog search. In SAC (pp. 914–919).

Dopichaj, P. (2006). The University of Kaiserslautern at INEX 2006. In Comparative evaluation of XML information retrieval systems, 5th international workshop of the initiative for the evaluation of XML retrieval, INEX 2006, Dagstuhl Castle, Germany (pp. 223–232). December 17-20, 2006, revised and selected papers. doi: 10.1007/978-3-540-73888-6_22.

Fuhr, N., Lalmas, M., & Trotman, A. (2007). Comparative evaluation of XML information retrieval systems. 5th international workshop of the initiative for the evaluation of XML retrieval, INEX 2006, Dagstuhl Castle, Germany December 17-20, 2006, revised and selected papers, lecture notes in computer science. Springer.

Ganguly, D., Leveling, J., Jones, G. J. F., Palchowdhury, S., Pal, S., & Mitra, M. (2010). DCU and ISI@INEX 2010: adhoc and data-centric tracks. In INEX (pp. 182–193).

Géry, M., & Largeron, C. (2012). BM25t: a BM25 extension for focused information retrieval. Knowledge and Information Systems, 32, 217–241. doi: 10.1007/s10115-011-0426-0.

Guo, L., Shao, F., Botev, C., & Shanmugasundaram, J. (2003). XRANK: ranked keyword search over XML documents. In Proceedings of the 2003 ACM SIGMOD international conference on management of data (pp. 16–27). ACM.

Huang, F. (2007). Using language models and topic models for XML retrieval. In INEX (pp. 94–102).

Huang, F., Watt, S. N. K., Harper, D. J., & Clark, M. (2006). Compact representations in XML retrieval. In INEX (pp. 64–72).

Huurdeman, H. C., Kamps, J., Koolen, M., & Wees, J. van (2012). Using collaborative filtering in social book search. CLEF (online working notes/labs/workshop).

Järvelin, K., & Kekäläinen, J. (2002). Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20, 422–446.

Jay, C., Stevens, R., Glencross, M., Chalmers, A., & Yang, C. (2007). How people use presentation to search for a link: expanding the understanding of accessibility on the web. Universal Access in the Information Society, 6, 307–320.

Joachims, T. (2002). Optimizing search engines using clickthrough data. In KDD (pp. 133–142).

Jones, K. S., Walker, S., & Robertson, S. E. (2000). A probabilistic model of information retrieval: development and comparative experiments - part 1. Information Processing & Management, 36, 779–808. doi: 10.1016/S0306-4573(00)00015-7.

Kamps, J., Kaptein, R., & Koolen, M. (2010). Using anchor text, spam filtering and Wikipedia for web search and entity ranking. TREC.

Kamps, J., Rijke, M. de, & Sigurbjörnsson, B. (2004). Length normalization in XML retrieval. In SIGIR (pp. 80–87).

Kaptein, R., & Kamps, J. (2013). Exploiting the category structure of Wikipedia for entity ranking. Artificial Intelligence, 194, 111–129.

Kirsch, S. M., Gnasa, M., & Cremers, A. B. (2006). Beyond the web: retrieval in social information spaces. In ECIR (pp. 84–95).

Kraaij, W., Westerveld, T., & Hiemstra, D. (2002). The importance of prior probabilities for entry page search. In SIGIR 2002: proceedings of the 25th annual international ACM SIGIR conference on research and development in information retrieval, August 11-15, 2002, Tampere, Finland (pp. 27–34). doi: 10.1145/564376.564383.

Lalmas, M. (2009). XML retrieval. Morgan & Claypool Publishers.

Mihajlovic, V., Ramírez, G., Westerveld, T., Hiemstra, D., Blok, H. E., & Vries, A. P. de (2005). TIJAH scratches INEX 2005: vague element selection, image search, overlap, and relevance feedback. In INEX (pp. 72–87).

Miller, D. R. H., Leek, T., & Schwartz, R. M. (1998). BBN at TREC7: using hidden Markov models for information retrieval. In Proceedings of the seventh text rEtrieval conference, TREC 1998, Gaithersburg, Maryland, USA (pp. 80–89). November 9-11, 1998.

Ogilvie, P., & Callan, J. (2005). Parameter estimation for a simple hierarchical generative model for XML retrieval. In INEX (pp. 211–224).

Ogilvie, P., & Callan, J. (2004). Hierarchical language models for XML component retrieval. In INEX (pp. 224–237).

Ogilvie, P., & Callan, J. (2007). n.d. Using Language Models for Flat Text Queries in XML Retrieval.

Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: bringing order to the web. (Technical Report No. 1999–66). Stanford InfoLab.

Peetz, M.-H., & Rijke, M. de (2013). Cognitive temporal document priors. In ECIR (pp. 318–330).

Pehcevski, J., Thom, J. A., & Tahaghoghi, S. M. M. (2005). RMIT University at INEX 2005: Ad Hoc track. In Advances in XML information retrieval and evaluation, 4th international workshop of the initiative for the evaluation of XML retrieval, INEX 2005, Dagstuhl Castle, Germany (pp. 306–320). November 28-30, 2005, revised selected papers. doi: 10.1007/11766278_23.

Ramírez, G., Westerveld, T., & Vries, A. P. de (2005). Structural features in content oriented XML retrieval. In CIKM (pp. 291–292).

Robertson, S. E., Zaragoza, H., & Taylor, M. J. (2004). Simple BM25 extension to multiple weighted fields. In Proceedings of the 2004 ACM CIKM international conference on information and knowledge management, Washington, DC, USA (pp. 42–49). November 8-13, 2004. doi: 10.1145/1031171.1031181.

Sigurbjörnsson, B. (2006). Focused information access using XML element retrieval. Universiteit Amsterdam.

Sigurbjörnsson, B., Kamps, J., & Rijke, M. de (2004). Processing content-and-structure queries for XML retrieval. In TDM (pp. 35–41).

Termehchy, A., & Winslett, M. (2011). Using structural information in XML keyword search effectively. ACM Transactions on Database Systems TODS, 36, 4.

Tran, V. T., & Fuhr, N. (2012). Using eye-tracking with dynamic areas of interest for analyzing interactive information retrieval. In SIGIR (pp. 1165–1166).

Velásquez, J. D. (2013). Combining eye-tracking technologies with web usage mining for identifying website keyobjects. Engineering Applications of AI, 26, 1469–1478.

Westerveld, T., Kraaij, W., & Hiemstra, D. (2001). Retrieving web pages using content, links, URLs and anchors. TREC.

Zhai, C., & Lafferty, J. (2004). A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems