• Aucun résultat trouvé

Semantics of Links and Document Structure Discovery

N/A
N/A
Protected

Academic year: 2022

Partager "Semantics of Links and Document Structure Discovery"

Copied!
8
0
0

Texte intégral

(1)

John R. Punin

[email protected]

http://www.cs.rpi.edu/~ puninj

DepartmentofComputerScience,RPI,Troy,NY

12180.

M. S. Kr ishnamoorthy

[email protected]

http://www.cs.rpi.edu/~ moorthy

DepartmentofComputerScience,RPI,Troy,NY

12180.

ABSTRACT

Thispaperpresentsanovelalgorithmtodiscoverthehier-

archicaldocumentstructurebyclassifyingthelinksbetween

thedocumentpages. Thislinkclassicationaddsmetadata

tothelinks thatcanbeexpressed usingResource Descrip-

tion Framework Syntax [7]. Several well-known programs

automaticallygenerateHTMLwebpagesfromdierentdoc-

umentformatssuchasLaTeX,Powerpoint,Word,etc. Our

interest is in the intertwined HTMLweb pages generated

bytheLaTeX2HTMLprogram[6]. Weusethewebrobotof

theWWWPalSystem[11]tosavethestructureoftheweb

documentinawebgraph. Thenthewebanalyzerofthesys-

temapplies ouralgorithm todiscoverthesemanticsofthe

linksandinferthehierarchicalstructureofthedocument.

1. INTRODUCTION

Thesemanticinformation of adocumentis conveyedby

itslogical structure. Suppose we are given a collectionof

URL'sofawebdocument,alongwiththeup,previous,and

nextlinks, and we are askedto determine the hierarchical

structureof thedocument. Thistask caneasily beaccom-

plishedusingsomestandardtraversalofthewebdocument.

However, ifallthelinksinthe webdocumentareclassied

thesame,theproblemofndingthedocument'shierarchical

structureis not trivial. Our hierarchical structure nding

algorithm solvesthis nontrivial problem and discovers the

logicalstructure.

Severalwell-knownprogramsautomaticallygenerateHTML

webpagesfromdierentdocumentformatssuchasLaTeX,

Powerpoint, Word,etc. Ourinterestis inthe HTMLweb

pagesgeneratedbytheLaTeX2HTMLprogram[6]. Oural-

gorithmisalsoapplicabletoHTMLdocumentsproducedby

Powerpoint.

This hierarchical discoveryproblem is related to nding

a Hamiltonian path in a webgraph. Linearization result-

ingfromfollowingthenextlinksfromthestartingwebpage

correspondstoaHamiltonianpath. OfcoursenotallHamil-

tonianpathsare meaningfullinearizations.

Permissiontomakedigitalorhardcopiesofallorpartofthisworkfor

personalorclassroomuseisgrantedwithoutfeeprovidedthatcopiesare

notmadeordistributedforprotorcommercialadvantageandthatcopies

bearthisnoticeandthefullcitationontherstpage.Tocopyotherwise,to

republish,topostonserversortoredistributetolists,requirespriorspecic

permissionbytheauthors.

SemanticWebWorkshop2002Hawaii,USA

Inrecent years,anumberofsoftware systemshavebeen

developed for the analysis of web pages, such as Mapuc-

cino [8][9],MicrosoftSiteAnalyst[12]andWWWPal[11].

WWWPal is dierent because it can handle large graphs

(called webgraphs), has a better display that usescluster-

ing algorithms, and has a skeletal graph browser. One of

WWWPal's functions is to visualize the output XGMML

le from the web robot into a graph that represents the

structureof the website. Severaldrawing algorithms have

beenimplementedtovisualizewebgraphsthatproducenice

graphlayouts. Thegraphvisualization for webdocuments

generated usingthe LaTeX2HTMLprogramis not able to

layoutthehierarchicalstructureoftheunderlyingwebdoc-

ument. SeeFigure1.Ouralgorithmdiscoversthestructure

byclassifying thelinksandthenusingthesemanticsofthe

links for discovering the hierarchical structure of the web

document.

Inthefollowing sections,wedescribethe problemstate-

ment,ouralgorithm,examplesandthereportoflinkclassi-

cationusingRDF.

2. PROBLEM STATEMENT AND ANALY­

SIS

2.1 Common Characters in LaTeX2HTML

websites

AfterseveralLaTeX2HTMLwebsites(webdocumentsgen-

erated by the LaTeX2HTML program) are examined and

the XGMML lesare generated by aweb robot, some in-

sightsfortheLaTeX2HTMLwebsitesareobtained. Mostof

theLaTeX2HTMLwebsitescontainthefollowingnodes/pages:

1. Table ofContents Node: Most of theLaTeX2HTML

documents have this node at the top, which we can

call the root. It lists all the sections the site covers

muchlikeabook'stableofcontents.

2. IndexNode: Thisnodefunctionslikethebook'sindex

pages,onecangotothisnodeandndtheterminolo-

giesoneis interestedin. Thenonecangodirectlyto

thecontentrelatedtothetermthroughthelinkfrom

theIndexNode.

3. Bibliography Node: Thisnode contains the informa-

tionoftheauthororsomerelatedmaterialsandlinks.

4. StartPageNode:ThisisnotthesameastheTableof

ContentsNode,butit includesthe samelinks as the

Table ofContents Node. It is the initial pageof the

wholedocument.

(2)

TUREDISCOVERY

Edges in a webgraph represent the links between web

pages. These links can have a type such as: start, next,

prev, chapter, etc. A group of these standard types has

beenrecommendedfor HTML4.0 [10]. Whenawebgraph

hasahierarchical structure, we would like to visualizethe

webgraphasaradialtreedrawingsothehierarchicalstruc-

tureofthedocumentcanbeclearlyseen. Forexample,the

webgraphofFigure1andFigure2havethesamestructure.

Wecanseethatthehierarchical structureofthewebgraph

isnotevidentinFigure1, butitiseasily visible inFigure

2. Theonlydierence betweenthese twodrawingsis that

inFigure1thetypeofthe edgesarenotconsideredfor vi-

sualization, and in Figure 2the links of type chapter and

sectionarevisualizedas treeedges. InFigure2wecanno-

tice that node 5 is a chapter of Node 1, and nodes 16 to

20are sections ofnode5. We canalsonotice thateachof

the children of node 1 have subtrees, and they are linked

togetherthroughnext andprev links. Figure3 shows how

thesubtreesofnode1arelinkedwiththelinksoftypenext

andprev.

Therearemanywebgraphinstanceswherethetypeofthe

linksismissing. Sowehavetoapplyanalgorithmtoassign

atypeto eachlinkofthe webgraph. Inthissubsection we

willexplainanovelalgorithmtoclassifythelinksofaweb-

graphthatrepresentsthestructureofawebdocument.Our

algorithmwillonlyassignthesefourtypesoflinks: chapter,

section, next and prevThe following rules assign typesto

theedgesofthewebgraph:

The children of the root node that also have a back

linkareconsiderednodechapters. Thereforetheedges

from theroot nodetothosechildren areclassied as

chapteredges.

Ouralgorithmndsthe subtreeofthechapternodes.

Thelinksbetweenthesesubtreesareclassiedasnext

andprevlinks.

Ouralgorithmordersthesesubtreessothereisalinear

orderofall nodesof thewebgraph. Thislinearorder

can be considered the Hamiltonian pathof the web-

graph. Thisordercanbeused,for example,to make

alinearprintoutofthewebdocument.

Thesubtreesofthewebgraphareconsideredchapters

of the document and each of the chapters have sec-

tions. Hencethepreviousrulescanbeappliedtothese

chaptersinordertondthesubtreesections.

Theseruleswillbeapplieduntilalledgesareclassiedso

thechaptersandsectionsarefullyresolved. Forthepurpose

ofvisualizationthechapterandsectionedgesaremappedto

treeedgesandaradialtreedrawingisapplied(Figure2).

The classication of the edges (or links) algorithm has

two steps: rst, the ordering of the children nodes of the

roottree,andsecond,thetraversalofthetreebyretrieving

thenextnode. Thersttaskhelpsustondtherstchild

nodeandndthechapterandsectionedges;thesecondtask

linearizesthewebgraphandndsthenextandprevedges.

Thersttaskisimplementedinthefunctiongetfirst child.

Thisfunctionreceivesthe root node (StartPage Node) of

Figure 1: Radial Tree Drawing of a Web Docu-

ment/webgraph

Figure 2: Typed Links Drawing of a Web Docu-

ment/webgraph

Figure3: SubtreesofNode1

(3)

Figure4: Example ofthe Typed Links Drawing

Node Neighbor Count Count Neighbor

2 8 1 | |

8 2 3 12 1

12 17 1 8 2

17 12 2 22 1

22 17 4 | |

Table1: Neighbor nodes ofthechildren oftheroot

nodeand the counts

foundthatifweapplyBreadthFirstSearch(BFS)starting

atanyofthechildrennodesoftherootnode,thenextchild

nodeisvisitedatmostonetimeandthepreviousnodeisvis-

itedatleastonetime.Forexample,inFigure4,thechildren

oftherootnode1arenodes2,8,12,17and22. Ifweapply

BFS startinginchildnode12, node8is visitedtwotimes

andnode17isvisitedonce. Wecanconcludethatnode8is

thepreviousnodeandnode17isthenextnodeofnode12.

Withthisinformationatableisconstructed(Table 1)with

thenumberoftimesthatBFSvisitedthenextandprevious

nodes. Usingthistablewecanconcludewhatneighborsare

previousandnexttoeachchildrennodeoftherootnode1.

Previousnodesare theoneswhosecountsaregreater than

oneand nextnodes are the ones whosecountsare exactly

one. Table2showsthenextandpreviousnodeforthechil-

drenoftherootnode1(Figure 4). Wecaneasilyseethat

therstchildofnode1isnode2. Noticethattherstnode

2doesnothaveapreviousnode. However,wehavetoverify

thatthereisalinearorderingbetweenallthechildrenofthe

rootnode1toconcludethatnode2isindeedtherstnode.

If arst nodeis notfound, the functionget firstchild

returnsanullnodeandafailureerrorcode.

Thealgorithmofthesecondstepusesthegetfirst child

functiontoretrievetherstnodeofthecurrentvisitednode.

Ifrstnodeisnotfound,itmeansthat thecurrentnodeis

aleafnode. Hence,thenextnodeistheneighbornodethat

hasnotbeenvisitedyet. Thisalgorithm producesa linear

Node Previous Next

2 | 8

8 2 12

12 8 17

17 12 22

22 17 |

Table 2: Table of previous and next nodes for the

children ofthe root node

intlineartraversal(PNODEroot)

f

interror=0;

PNODEnext=root;

//chapteredges

classifychapteredges(root);

//Lineartraversal

while(next&&error)f

next=get nextvertex(next,&error);

g

returnerror;

g

//Getnextnodeinlineartraversal

PNODEget nextvertex(PNODEv,int*error)

f

PNODEnext;

mark(v);

//getrstchildofchildrennodes

next=get rstchild(v,error);

//sectionedges

if(next&&*error)

classifysectionedges(v);

//leafnodes

if(!next&&*error)f

next=get unmarkedneighbor(v,error);

//nextandpreviousedges

if(next&&*error)

classifynextprevedges(v,next);

g

returnnext;

g

Table 3: Algorithm for classicationofthe edgesof

awebgraph

orderingofthenodes ofthewebgraph. Theedgesbetween

thecurrentnodeandthenot-visitedneighborsareclassied

asnextandprevedges. Theedgesbetweenthecurrentnode

and itschildrenare classiedas chapterandsection edges.

Table3showsthealgorithmofthesecondstepoftheweb-

graphtraversal. Onceall edgesof thewebgraph havebeen

assignedatype,thechapterandsectionnodesare mapped

totreeedges(blackcolor),thenextedgesaremappedtofor-

wardedges(magentacolor),andtheprevedgesaremapped

to backwardedges (greencolor). Afterwards, atreeor ra-

dialtreedrawingcanbeappliedtovisualizethehierarchical

structurethewebgraph. (Figures2and4).

Our algorithm fails when all the counts of the table of

neighbor nodes(Table 1)are onesincewewill notbeable

toconstructthetableofpreviousandnextnodes(Table2).

Without that table we cannot conclude whichnode is the

rstnode. Forexample,Figure5showsasimplehierarchical

webgraphwherewecannotinferwhatnodeistherstnode;

it is either node 2 or 8. When we construct the table of

neighbornodesallcountsareoneasshowninTable4.

Ouralgorithm isof lineartimecomplexityas eachnode

getsvisitedaconstantnumberoftimes.

4. REPORTOFLINKCLASSIFICATIONUS­

(4)

wherethe Algorithm fails.

Node Neighbor Count Count Neighbor

2 3 1 | |

3 4 1 2 1

4 3 1 5 1

5 4 1 6 1

6 5 1 7 1

7 8 1 6 1

8 7 1 | |

Table4: Neighbor nodes ofthechildren oftheroot

node and the counts - All being equal to 1 makes

thealgorithm fail

Thealgorithmdescribedintheprevioussection attaches

semanticsto thelinksand hencethewebpages. Onemay

questiontheusefulnessofthealgorithmifthesemanticsof

hyperlinks, suchas up,nextandprevisknownapriori. We

envisage that explicitly specifying the links in a common

semantic vocabulary will be useful to both the users and

theagents.

Asanexample,theRDFdescriptionfornode9

(http://cbl.leeds.ac.uk/ nikos/doc/www94/subsection3 41.html)

ofthewebgraphoftheFigure3isasfollows:

(5)

xmlns:rs=''http://purl.org/net/rdf/papers/sitemap#''

xmlns:dc=''http://purl.org/dc/elements/1.1/''>

<rdf:Description rdf:about=''http://cbl.leeds.ac.uk/~nikos/doc/www94/subsection3_4_1.html''>

<dc:title>CommonObjections toAutomatic Conversion</dc:title>

<rs:prevrdf:resource=''http://cbl.leeds.ac.uk/~nikos/doc/www94/section3_4.html''/>

<rs:nextrdf:resource=''http://cbl.leeds.ac.uk/~nikos/doc/www94/subsubsection3_4_1_1.html''/>

<rs:section rdf:resource=''http://cbl.leeds.ac.uk/~nikos/doc/www94/subsubsection3_4_1_1.html''/>

<rs:section rdf:resource=''http://cbl.leeds.ac.uk/~nikos/doc/www94/subsubsection3_4_1_2.html''/>

<rs:section rdf:resource=''http://cbl.leeds.ac.uk/~nikos/doc/www94/subsubsection3_4_1_3.html''/>

<rs:section rdf:resource=''http://cbl.leeds.ac.uk/~nikos/doc/www94/subsubsection3_4_1_4.html''/>

<rs:section rdf:resource=''http://cbl.leeds.ac.uk/~nikos/doc/www94/subsubsection3_4_1_5.html''/>

<rs:section rdf:resource=''http://cbl.leeds.ac.uk/~nikos/doc/www94/subsubsection3_4_1_6.html''/>

<rs:contentsrdf:resource=''http://cbl.leeds.ac.uk/~nikos/doc/www94/tableofcontents3_1.html''/>

</rdf:Description>

</rdf:RDF>

TheRDFtriplesinN-triplesformat[2]are:

<http://cbl.leeds.ac.uk/~nikos/doc/www94/subsection3_4_1.html>

<http://purl.org/dc/elements/1.1/title>

"CommonObjectionsto AutomaticConversion" .

<http://cbl.leeds.ac.uk/~nikos/doc/www94/subsection3_4_1.html>

<http://purl.org/net/rdf/papers/sitemap#prev>

<http://cbl.leeds.ac.uk/~nikos/doc/www94/section3_4.html>.

<http://cbl.leeds.ac.uk/~nikos/doc/www94/subsection3_4_1.html>

<http://purl.org/net/rdf/papers/sitemap#next>

<http://cbl.leeds.ac.uk/~nikos/doc/www94/subsubsection3_4_1_1.html>.

<http://cbl.leeds.ac.uk/~nikos/doc/www94/subsection3_4_1.html>

<http://purl.org/net/rdf/papers/sitemap#section>

<http://cbl.leeds.ac.uk/~nikos/doc/www94/subsubsection3_4_1_1.html>.

<http://cbl.leeds.ac.uk/~nikos/doc/www94/subsection3_4_1.html>

<http://purl.org/net/rdf/papers/sitemap#section>

<http://cbl.leeds.ac.uk/~nikos/doc/www94/subsubsection3_4_1_2.html>.

<http://cbl.leeds.ac.uk/~nikos/doc/www94/subsection3_4_1.html>

<http://purl.org/net/rdf/papers/sitemap#section>

<http://cbl.leeds.ac.uk/~nikos/doc/www94/subsubsection3_4_1_3.html>.

<http://cbl.leeds.ac.uk/~nikos/doc/www94/subsection3_4_1.html>

<http://purl.org/net/rdf/papers/sitemap#section>

<http://cbl.leeds.ac.uk/~nikos/doc/www94/subsubsection3_4_1_4.html>.

<http://cbl.leeds.ac.uk/~nikos/doc/www94/subsection3_4_1.html>

<http://purl.org/net/rdf/papers/sitemap#section>

<http://cbl.leeds.ac.uk/~nikos/doc/www94/subsubsection3_4_1_5.html>.

<http://cbl.leeds.ac.uk/~nikos/doc/www94/subsection3_4_1.html>

<http://purl.org/net/rdf/papers/sitemap#section>

<http://cbl.leeds.ac.uk/~nikos/doc/www94/subsubsection3_4_1_6.html>.

<http://cbl.leeds.ac.uk/~nikos/doc/www94/subsection3_4_1.html>

<http://purl.org/net/rdf/papers/sitemap#contents>

<http://cbl.leeds.ac.uk/~nikos/doc/www94/tableofcontents3_1.html>.

NoticethatweusetheDublincore [DC]andtheRDFSitemapvocabulary[RS].

(6)

Weillustratetheoutputofouralgorithmonawebgraph,

obtainedfromvisitingtheLaTeX2HTMLwebsite\Stateof

theArtReviewonHypermediaIssuesAndApplications"at

http://cbl.leeds.ac.uk/nikos/ tmp/ hypem edia /hyp emedi a.ht ml.

Figure6shows theradial treedrawing ofthewebdocu-

ment, and Figure 7 shows the hierarchical drawing of the

samewebdocument, where the chapter nodes and section

nodescanbeobserved.

6. CONCLUSION

This paperis yetanother important contributionto the

Semantic Web[1][7][3][4]as we provide semanticclassica-

tionoflinksandthewebpagesofaLaTeX2HTMLwebsite.

Thispaperalso provides a visualization tool basedonour

algorithm. Thisvisualizationnotonlydrawsaprettygraph,

butalso semantic content isimparted. Our algorithm can

beextendedforother typesofdocuments.

Printingawebdocument,consistingofanumberofweb

pages,isalaborioustask[1],aseachofthewebpageshasto

betraversedinalinearorder. Thislinearizationisobtained

automaticallyfrom thehierarchical structureorderdiscov-

eredbyouralgorithm. WWWPalusesthislinearizationto

simplifytheprintingofwebdocuments.

7. REFERENCES

[1] T.Berners-Lee,SemanticWebAreaforPlay: Closed

WorldMachine.

http://www.w3.org/2000/10/sw ap/Ov ervi ew.ht ml,

February2001.

[2] T.Berners-Lee,Notation3.

http://www.w3.org/DesignIssu es/No tati on3.h tml,

April2001.

[3] D.Brickley,RDFsitemapsandDublinCoresite

summaries,

http://purl.org/net/rdf/pape rs/si tema p/02),

June1999.

[4] D.Brickley,Biz/edRDFMetadataTestbed,

http://ilrt.org/discovery/20 00/08 /biz ed-me ta,

July2001.

[5] O.DeTroyerandT.Decruyenaere,\Conceptual

ModellingofWebSitesforEndUsers,"WWW

Journal,Vol.3,Issue1,Baltzer SciencePublishers,

2000.

[6] N.Drakos,\Fromtexttohypertext: Apost-hoc

rationalisationofLaTeX2HTML,"InProceedingsof

theFirstWorldWideWebConference,Geneva,

Switzerland,1991.

[7] O.LassilaandR.Swick,W3C,ResourceDescription

Framework(RDF)ModelandSyntaxSpecication.

http://wwww.w3.org/TR/REC-rd f-syn tax, 1999.

[8] Mapuccinohttp://www.ibm.com/java/map ucci no/

[9] Y.MaarekandI.Shaul,\WebCutter: ASystemfor

DynamicandTailorableSiteMapping,"InProceedings

ofWWW6Conference,SantaClara,USA,1997.

[10] D.Raggettetal,HTML4.01Specication,

http://www.w3.org/TR/html401 / ,1999.

[11] J.PuninandM.Krishnamoorthy,\WWWPalSystem

-ASystemforAnalysisandSynthesisofWebPages",

InProceedingsoftheWebNet98Conference,

PressInc,July1999.

(7)
(8)

Références

Documents relatifs

Then, we define the notions of comparability and incomparability graphs of a poset, summarize the recent advances on the question and compute the number of edges of these graphs,

In the present paper, we prove that the average complexity of the tree align- ment algorithm of [10] is in O(nm), as well as the average complexity of the RNA structure

Article 2 of the Schengen Agreement states that “internal borders may be crossed at any point without any checks on persons carried out.” Yet the aboli- tion of internal

Quantitative frameworks use algebraic expressions instead of rules and ex- haustive tables. Amyot et al. In the AHP-based proposals by Liaskos et al. [13], the same aggregation

‫الخــاتمة العامــة ‪:‬‬ ‫يتضق مف خالؿ دراستنا لموضوع االنعكاسات المرتقبة عمى انضماـ الدوؿ النامية في المنظمة‬ ‫العالمية لمتجارة مع اإلشارة إلى

Therefore we consider the situation where the stress free strain E is constant, non zero, in a half space bounded by a wedge, whose edge is along Oxl. We shall name

longitudinal and c) transversal cuts. 136 Figure 88 : a) design of the TEM cell and the MEA-2, b) En field spatial distribution at the surface of MEA-2 microelectrodes at 1.8 GHz,

Keywords: self-dual operators · tree of shapes · vertex-valued graph · well-composed gray-level images · digital topology..