John R. Punin
http://www.cs.rpi.edu/~puninj
Department of Computer Science, RPI, Troy, NY 12180.
M. S. Krishnamoorthy
http://www.cs.rpi.edu/~moorthy
Department of Computer Science, RPI, Troy, NY 12180.
ABSTRACT
This paper presents a novel algorithm to discover the hierarchical
document structure by classifying the links between the document
pages. This link classification adds metadata to the links that can be
expressed using Resource Description Framework Syntax [7]. Several
well-known programs automatically generate HTML web pages from
different document formats such as LaTeX, PowerPoint, Word, etc. Our
interest is in the intertwined HTML web pages generated by the
LaTeX2HTML program [6]. We use the web robot of the WWWPal System [11]
to save the structure of the web document in a webgraph. Then the web
analyzer of the system applies our algorithm to discover the semantics
of the links and infer the hierarchical structure of the document.
1. INTRODUCTION
The semantic information of a document is conveyed by its logical
structure. Suppose we are given a collection of URLs of a web
document, along with the up, previous, and next links, and we are
asked to determine the hierarchical structure of the document. This
task can easily be accomplished using some standard traversal of the
web document. However, if all the links in the web document are
classified the same, the problem of finding the document's
hierarchical structure is not trivial. Our hierarchical structure
finding algorithm solves this nontrivial problem and discovers the
logical structure.
Several well-known programs automatically generate HTML web pages from
different document formats such as LaTeX, PowerPoint, Word, etc. Our
interest is in the HTML web pages generated by the LaTeX2HTML program
[6]. Our algorithm is also applicable to HTML documents produced by
PowerPoint.
This hierarchical discovery problem is related to finding a
Hamiltonian path in a webgraph. The linearization that results from
following the next links from the starting web page corresponds to a
Hamiltonian path. Of course, not all Hamiltonian paths are meaningful
linearizations.
Permission to make digital or hard copies of all or part of this work
for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or commercial advantage
and that copies bear this notice and the full citation on the first
page. To copy otherwise, to republish, to post on servers or to
redistribute to lists, requires prior specific permission by the
authors.
Semantic Web Workshop 2002 Hawaii, USA
In recent years, a number of software systems have been developed for
the analysis of web pages, such as Mapuccino [8][9], Microsoft Site
Analyst [12] and WWWPal [11]. WWWPal is different because it can
handle large graphs (called webgraphs), has a better display that uses
clustering algorithms, and has a skeletal graph browser. One of
WWWPal's functions is to visualize the output XGMML file from the web
robot as a graph that represents the structure of the website. Several
drawing algorithms that produce nice graph layouts have been
implemented to visualize webgraphs. However, the graph visualization
for web documents generated using the LaTeX2HTML program is not able
to lay out the hierarchical structure of the underlying web document
(see Figure 1). Our algorithm discovers the structure by classifying
the links and then using the semantics of the links to discover the
hierarchical structure of the web document.
In the following sections, we describe the problem statement, our
algorithm, examples, and the report of link classification using RDF.
2. PROBLEM STATEMENT AND ANALYSIS
2.1 Common Characteristics of LaTeX2HTML websites
After examining several LaTeX2HTML websites (web documents generated
by the LaTeX2HTML program) and generating their XGMML files with a web
robot, we obtained some insights into these websites. Most of the
LaTeX2HTML websites contain the following nodes/pages:
1. Table of Contents Node: Most of the LaTeX2HTML documents have this
   node at the top, which we can call the root. It lists all the
   sections the site covers, much like a book's table of contents.
2. Index Node: This node functions like a book's index pages. One can
   go to this node and find the terms one is interested in, and then
   go directly to the content related to a term through the link from
   the Index Node.
3. Bibliography Node: This node contains information about the author
   or some related materials and links.
4. Start Page Node: This is not the same as the Table of Contents
   Node, but it includes the same links as the Table of Contents Node.
   It is the initial page of the whole document.
3. STRUCTURE DISCOVERY
Edges in a webgraph represent the links between web pages. These links
can have a type such as start, next, prev, chapter, etc. A group of
these standard types has been recommended for HTML 4.0 [10]. When a
webgraph has a hierarchical structure, we would like to visualize the
webgraph as a radial tree drawing so the hierarchical structure of the
document can be clearly seen. For example, the webgraphs of Figure 1
and Figure 2 have the same structure. We can see that the hierarchical
structure of the webgraph is not evident in Figure 1, but it is easily
visible in Figure 2. The only difference between these two drawings is
that in Figure 1 the types of the edges are not considered for
visualization, and in Figure 2 the links of type chapter and section
are visualized as tree edges. In Figure 2 we can notice that node 5 is
a chapter of node 1, and nodes 16 to 20 are sections of node 5. We can
also notice that each of the children of node 1 has a subtree, and
these subtrees are linked together through next and prev links. Figure
3 shows how the subtrees of node 1 are linked with the links of type
next and prev.
There are many webgraph instances where the type of the links is
missing, so we have to apply an algorithm to assign a type to each
link of the webgraph. In this subsection we explain a novel algorithm
to classify the links of a webgraph that represents the structure of a
web document. Our algorithm assigns only these four types of links:
chapter, section, next and prev. The following rules assign types to
the edges of the webgraph:
• The children of the root node that also have a back link are
  considered chapter nodes. Therefore the edges from the root node to
  those children are classified as chapter edges.
• Our algorithm finds the subtrees of the chapter nodes. The links
  between these subtrees are classified as next and prev links.
• Our algorithm orders these subtrees so there is a linear order of
  all nodes of the webgraph. This linear order can be considered the
  Hamiltonian path of the webgraph. This order can be used, for
  example, to make a linear printout of the web document.
• The subtrees of the webgraph are considered chapters of the
  document, and each of the chapters has sections. Hence the previous
  rules can be applied to these chapters in order to find the section
  subtrees.
These rules are applied until all edges are classified, so the
chapters and sections are fully resolved. For the purpose of
visualization, the chapter and section edges are mapped to tree edges
and a radial tree drawing is applied (Figure 2).
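The first rule above can be illustrated with a minimal sketch in C. It is not the WWWPal implementation: the adjacency-matrix representation and the names WebGraph, LinkType, add_link and this two-argument classify_chapter_edges are assumptions made only for this illustration. A child of the root is treated as a chapter node when it also links back to the root.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 8

/* Link types used by the classification. NO_LINK is 0 so that a
   zero-initialized graph has no links; UNTYPED marks a link that has
   not been classified yet. (Hypothetical names, not from WWWPal.) */
typedef enum { NO_LINK, UNTYPED, CHAPTER, SECTION, NEXT, PREV } LinkType;

/* Hypothetical webgraph: link[u][v] is the type of the link u -> v. */
typedef struct {
    int n;
    LinkType link[MAX_NODES][MAX_NODES];
} WebGraph;

static void add_link(WebGraph *g, int from, int to)
{
    assert(from >= 0 && from < g->n && to >= 0 && to < g->n);
    g->link[from][to] = UNTYPED;
}

/* A child of the root that also has a back link to the root is a
   chapter node, so the edge root -> child becomes a chapter edge.
   Returns the number of chapter edges found. */
static int classify_chapter_edges(WebGraph *g, int root)
{
    int count = 0;
    for (int v = 0; v < g->n; v++) {
        bool is_child = g->link[root][v] != NO_LINK;
        bool has_back_link = g->link[v][root] != NO_LINK;
        if (v != root && is_child && has_back_link) {
            g->link[root][v] = CHAPTER;
            count++;
        }
    }
    return count;
}
```

In this sketch a page that the root links to without a back link (for example, an external page) keeps the type UNTYPED, and the back links themselves are left for the later rules.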
The algorithm for the classification of the edges (or links) has two
steps: first, the ordering of the children nodes of the root of the
tree, and second, the traversal of the tree by retrieving the next
node. The first step helps us find the first child node and find the
chapter and section edges; the second step linearizes the webgraph and
finds the next and prev edges.
The first step is implemented in the function get_first_child.
This function receives the root node (Start Page Node) of
Figure 1: Radial Tree Drawing of a Web Document/webgraph
Figure 2: Typed Links Drawing of a Web Document/webgraph
Figure 3: Subtrees of Node 1
Figure 4: Example of the Typed Links Drawing
Node  Neighbor  Count  Neighbor  Count
2     8         1      -         -
8     2         3      12        1
12    17        1      8         2
17    12        2      22        1
22    17        4      -         -
Table 1: Neighbor nodes of the children of the root node and the
counts
found that if we apply Breadth First Search (BFS) starting at any of
the children nodes of the root node, the next child node is visited at
most one time and the previous node is visited at least one time. For
example, in Figure 4, the children of the root node 1 are nodes 2, 8,
12, 17 and 22. If we apply BFS starting at child node 12, node 8 is
visited two times and node 17 is visited once. We can conclude that
node 8 is the previous node and node 17 is the next node of node 12.
With this information a table is constructed (Table 1) with the number
of times that BFS visited the next and previous nodes. Using this
table we can conclude which neighbors are previous and next to each
child node of the root node 1. Previous nodes are the ones whose
counts are greater than one, and next nodes are the ones whose counts
are exactly one. Table 2 shows the next and previous nodes for the
children of the root node 1 (Figure 4). We can easily see that the
first child of node 1 is node 2. Notice that the first node 2 does not
have a previous node. However, we have to verify that there is a
linear ordering between all the children of the root node 1 to
conclude that node 2 is indeed the first node. If a first node is not
found, the function get_first_child returns a null node and a failure
error code.
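The step from the neighbor counts to the first child can be sketched as follows. This is a minimal illustration under stated assumptions, not the get_first_child function of WWWPal: the NeighborRow structure, the helper row_of and the function find_first_child are hypothetical names, and the BFS counts are assumed to be already available as rows like those of Table 1. The sketch applies the rule that a count greater than one marks the previous node and a count of exactly one marks the next node, and then verifies the linear ordering by following the next links.

```c
#include <assert.h>

#define MAX_CHILDREN 16
#define NONE (-1)

/* One row of the neighbor table: up to two sibling neighbors of a
   child node and how many times BFS visited each of them. */
typedef struct {
    int node;
    int neighbor[2];
    int count[2];
    int n_neighbors;
} NeighborRow;

/* The rows of Table 1 (children of root node 1 in Figure 4). */
static const NeighborRow table1[] = {
    { 2,  { 8,  NONE }, { 1, 0 }, 1 },
    { 8,  { 2,  12 },   { 3, 1 }, 2 },
    { 12, { 17, 8 },    { 1, 2 }, 2 },
    { 17, { 12, 22 },   { 2, 1 }, 2 },
    { 22, { 17, NONE }, { 4, 0 }, 1 },
};

/* Index of the row that describes `node`, or NONE. */
static int row_of(const NeighborRow *rows, int n, int node)
{
    for (int i = 0; i < n; i++)
        if (rows[i].node == node)
            return i;
    return NONE;
}

/* Returns the first child, or NONE when no linear order is found. */
static int find_first_child(const NeighborRow *rows, int n)
{
    int prev[MAX_CHILDREN], next[MAX_CHILDREN];
    int first = NONE;

    if (n <= 0 || n > MAX_CHILDREN)
        return NONE;

    /* Count > 1: previous node. Count == 1: next node. */
    for (int i = 0; i < n; i++) {
        prev[i] = next[i] = NONE;
        for (int k = 0; k < rows[i].n_neighbors; k++) {
            if (rows[i].count[k] > 1)
                prev[i] = rows[i].neighbor[k];
            else if (rows[i].count[k] == 1)
                next[i] = rows[i].neighbor[k];
        }
    }

    /* The first child is the only child without a previous node. */
    for (int i = 0; i < n; i++) {
        if (prev[i] == NONE) {
            if (first != NONE)
                return NONE;    /* more than one candidate: ambiguous */
            first = i;
        }
    }
    if (first == NONE)
        return NONE;

    /* Verify the linear ordering: following the next links from the
       candidate must end after visiting exactly n children. */
    int visited = 0;
    int i = first;
    while (i != NONE && visited < n) {
        visited++;
        i = (next[i] == NONE) ? NONE : row_of(rows, n, next[i]);
    }
    return (visited == n && i == NONE) ? rows[first].node : NONE;
}
```

Calling find_first_child(table1, 5) on the rows of Table 1 yields node 2, in agreement with Table 2. When every count equals one, no row has a previous node, several candidates remain, and the sketch reports NONE.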
Node  Previous  Next
2     -         8
8     2         12
12    8         17
17    12        22
22    17        -
Table 2: Table of previous and next nodes for the children of the root
node

The algorithm of the second step uses the get_first_child function to
retrieve the first child node of the currently visited node. If a
first node is not found, it means that the current node is a leaf
node. Hence, the next node is the neighbor node that has not been
visited yet. This algorithm produces a linear ordering of the nodes of
the webgraph. The edges between the current node and the not-visited
neighbors are classified as next and prev edges. The edges between the
current node and its children are classified as chapter and section
edges. Table 3 shows the algorithm of the second step of the webgraph
traversal. Once all edges of the webgraph have been assigned a type,
the chapter and section edges are mapped to tree edges (black color),
the next edges are mapped to forward edges (magenta color), and the
prev edges are mapped to backward edges (green color). Afterwards, a
tree or radial tree drawing can be applied to visualize the
hierarchical structure of the webgraph (Figures 2 and 4).

    /* Convention assumed in this listing: error == 0 means success,
       and a helper that succeeds resets *error to 0. */
    PNODE get_next_vertex(PNODE v, int *error);

    int linear_traversal(PNODE root)
    {
        int error = 0;
        PNODE next = root;

        /* chapter edges */
        classify_chapter_edges(root);

        /* linear traversal: stop after the last node or on failure */
        while (next && !error) {
            next = get_next_vertex(next, &error);
        }
        return error;
    }

    /* Get next node in linear traversal */
    PNODE get_next_vertex(PNODE v, int *error)
    {
        PNODE next;

        mark(v);

        /* get first child of children nodes */
        next = get_first_child(v, error);

        /* section edges */
        if (next && !*error)
            classify_section_edges(v);

        /* leaf nodes: get_first_child returned a null node and a
           failure error code */
        if (!next && *error) {
            next = get_unmarked_neighbor(v, error);
            /* next and previous edges */
            if (next && !*error)
                classify_next_prev_edges(v, next);
        }
        return next;
    }

Table 3: Algorithm for classification of the edges of a webgraph
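The mapping from link types to drawing edges described in this section can be written as a small lookup. The sketch below is only an illustration in C: the type names LinkType, EdgeClass and EdgeStyle and the function edge_style are hypothetical and not taken from WWWPal; only the edge classes and colors come from the text.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical names; the four link types are those assigned by the
   classification algorithm. */
typedef enum { LINK_CHAPTER, LINK_SECTION, LINK_NEXT, LINK_PREV } LinkType;
typedef enum { EDGE_TREE, EDGE_FORWARD, EDGE_BACKWARD } EdgeClass;

typedef struct {
    EdgeClass edge_class;
    const char *color;
} EdgeStyle;

/* Chapter and section links become tree edges (black), next links
   become forward edges (magenta), prev links become backward edges
   (green). */
static EdgeStyle edge_style(LinkType type)
{
    switch (type) {
    case LINK_CHAPTER:
    case LINK_SECTION:
        return (EdgeStyle){ EDGE_TREE, "black" };
    case LINK_NEXT:
        return (EdgeStyle){ EDGE_FORWARD, "magenta" };
    case LINK_PREV:
        return (EdgeStyle){ EDGE_BACKWARD, "green" };
    }
    /* Not reached for valid LinkType values. */
    return (EdgeStyle){ EDGE_TREE, "black" };
}
```

Only the tree edges take part in the tree or radial tree layout; in this reading the forward and backward edges are drawn on top of that layout.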
Our algorithm fails when all the counts of the table of neighbor nodes
(Table 1) are one, since we will not be able to construct the table of
previous and next nodes (Table 2). Without that table we cannot
conclude which node is the first node. For example, Figure 5 shows a
simple hierarchical webgraph where we cannot infer which node is the
first node; it is either node 2 or 8. When we construct the table of
neighbor nodes, all counts are one, as shown in Table 4.
Our algorithm is of linear time complexity, as each node gets visited
a constant number of times.
Figure 5: Webgraph where the algorithm fails
Node  Neighbor  Count  Neighbor  Count
2     3         1      -         -
3     4         1      2         1
4     3         1      5         1
5     4         1      6         1
6     5         1      7         1
7     8         1      6         1
8     7         1      -         -
Table 4: Neighbor nodes of the children of the root node and the
counts. All counts being equal to 1 makes the algorithm fail.
4. REPORT OF LINK CLASSIFICATION USING RDF
The algorithm described in the previous section attaches semantics to
the links and hence to the web pages. One may question the usefulness
of the algorithm if the semantics of hyperlinks, such as up, next and
prev, are known a priori. We envisage that explicitly specifying the
links in a common semantic vocabulary will be useful to both the users
and the agents.
As an example, the RDF description for node 9
(http://cbl.leeds.ac.uk/~nikos/doc/www94/subsection3_4_1.html)
of the webgraph of Figure 3 is as follows:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rs="http://purl.org/net/rdf/papers/sitemap#"
  xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description rdf:about="http://cbl.leeds.ac.uk/~nikos/doc/www94/subsection3_4_1.html">
  <dc:title>Common Objections to Automatic Conversion</dc:title>
  <rs:prev rdf:resource="http://cbl.leeds.ac.uk/~nikos/doc/www94/section3_4.html"/>
  <rs:next rdf:resource="http://cbl.leeds.ac.uk/~nikos/doc/www94/subsubsection3_4_1_1.html"/>
  <rs:section rdf:resource="http://cbl.leeds.ac.uk/~nikos/doc/www94/subsubsection3_4_1_1.html"/>
  <rs:section rdf:resource="http://cbl.leeds.ac.uk/~nikos/doc/www94/subsubsection3_4_1_2.html"/>
  <rs:section rdf:resource="http://cbl.leeds.ac.uk/~nikos/doc/www94/subsubsection3_4_1_3.html"/>
  <rs:section rdf:resource="http://cbl.leeds.ac.uk/~nikos/doc/www94/subsubsection3_4_1_4.html"/>
  <rs:section rdf:resource="http://cbl.leeds.ac.uk/~nikos/doc/www94/subsubsection3_4_1_5.html"/>
  <rs:section rdf:resource="http://cbl.leeds.ac.uk/~nikos/doc/www94/subsubsection3_4_1_6.html"/>
  <rs:contents rdf:resource="http://cbl.leeds.ac.uk/~nikos/doc/www94/tableofcontents3_1.html"/>
</rdf:Description>
</rdf:RDF>
The RDF triples in N-Triples format [2] are:
<http://cbl.leeds.ac.uk/~nikos/doc/www94/subsection3_4_1.html>
<http://purl.org/dc/elements/1.1/title>
"Common Objections to Automatic Conversion" .
<http://cbl.leeds.ac.uk/~nikos/doc/www94/subsection3_4_1.html>
<http://purl.org/net/rdf/papers/sitemap#prev>
<http://cbl.leeds.ac.uk/~nikos/doc/www94/section3_4.html> .
<http://cbl.leeds.ac.uk/~nikos/doc/www94/subsection3_4_1.html>
<http://purl.org/net/rdf/papers/sitemap#next>
<http://cbl.leeds.ac.uk/~nikos/doc/www94/subsubsection3_4_1_1.html> .
<http://cbl.leeds.ac.uk/~nikos/doc/www94/subsection3_4_1.html>
<http://purl.org/net/rdf/papers/sitemap#section>
<http://cbl.leeds.ac.uk/~nikos/doc/www94/subsubsection3_4_1_1.html> .
<http://cbl.leeds.ac.uk/~nikos/doc/www94/subsection3_4_1.html>
<http://purl.org/net/rdf/papers/sitemap#section>
<http://cbl.leeds.ac.uk/~nikos/doc/www94/subsubsection3_4_1_2.html> .
<http://cbl.leeds.ac.uk/~nikos/doc/www94/subsection3_4_1.html>
<http://purl.org/net/rdf/papers/sitemap#section>
<http://cbl.leeds.ac.uk/~nikos/doc/www94/subsubsection3_4_1_3.html> .
<http://cbl.leeds.ac.uk/~nikos/doc/www94/subsection3_4_1.html>
<http://purl.org/net/rdf/papers/sitemap#section>
<http://cbl.leeds.ac.uk/~nikos/doc/www94/subsubsection3_4_1_4.html> .
<http://cbl.leeds.ac.uk/~nikos/doc/www94/subsection3_4_1.html>
<http://purl.org/net/rdf/papers/sitemap#section>
<http://cbl.leeds.ac.uk/~nikos/doc/www94/subsubsection3_4_1_5.html> .
<http://cbl.leeds.ac.uk/~nikos/doc/www94/subsection3_4_1.html>
<http://purl.org/net/rdf/papers/sitemap#section>
<http://cbl.leeds.ac.uk/~nikos/doc/www94/subsubsection3_4_1_6.html> .
<http://cbl.leeds.ac.uk/~nikos/doc/www94/subsection3_4_1.html>
<http://purl.org/net/rdf/papers/sitemap#contents>
<http://cbl.leeds.ac.uk/~nikos/doc/www94/tableofcontents3_1.html> .
Notice that we use the Dublin Core [DC] and the RDF Sitemap vocabulary [RS].
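Once a link has been classified, reporting it amounts to writing one statement whose predicate is the link type in the RDF Sitemap namespace. The following C sketch shows one way this could be done; the function name format_link_triple and the macro RS_NS are assumptions made for this illustration, and the sketch assumes the URLs need no escaping.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Namespace of the RDF Sitemap vocabulary used in the example above. */
#define RS_NS "http://purl.org/net/rdf/papers/sitemap#"

/* Writes the statement "<subject> <RS_NS + link_type> <object> ."
   into buf. Returns the number of characters written, or -1 if buf
   is too small. Hypothetical helper; no escaping is performed. */
static int format_link_triple(char *buf, size_t size, const char *subject,
                              const char *link_type, const char *object)
{
    int n = snprintf(buf, size, "<%s> <%s%s> <%s> .",
                     subject, RS_NS, link_type, object);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

For example, calling the function with the subject page of the listing above, the link type "prev" and the page section3_4.html as object reproduces the second triple of the listing on a single line.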
We illustrate the output of our algorithm on a webgraph obtained from
visiting the LaTeX2HTML website "State of the Art Review on Hypermedia
Issues And Applications" at
http://cbl.leeds.ac.uk/nikos/tmp/hypemedia/hypemedia.html.
Figure 6 shows the radial tree drawing of the web document, and Figure
7 shows the hierarchical drawing of the same web document, where the
chapter nodes and section nodes can be observed.
6. CONCLUSION
This paper contributes to the Semantic Web [1][7][3][4] by providing a
semantic classification of the links and the web pages of a
LaTeX2HTML website. This paper also provides a visualization tool
based on our algorithm. This visualization not only draws a pretty
graph, but also imparts semantic content. Our algorithm can be
extended to other types of documents.
Printing a web document consisting of a number of web pages is a
laborious task [1], as each of the web pages has to be traversed in a
linear order. This linearization is obtained automatically from the
hierarchical structure order discovered by our algorithm. WWWPal uses
this linearization to simplify the printing of web documents.
7. REFERENCES
[1] T. Berners-Lee, Semantic Web Area for Play: Closed World Machine.
    http://www.w3.org/2000/10/swap/Overview.html, February 2001.
[2] T. Berners-Lee, Notation 3.
    http://www.w3.org/DesignIssues/Notation3.html, April 2001.
[3] D. Brickley, RDF sitemaps and Dublin Core site summaries,
    http://purl.org/net/rdf/papers/sitemap/02, June 1999.
[4] D. Brickley, Biz/ed RDF Metadata Testbed,
    http://ilrt.org/discovery/2000/08/bized-meta, July 2001.
[5] O. De Troyer and T. Decruyenaere, "Conceptual Modelling of Web
    Sites for End Users," WWW Journal, Vol. 3, Issue 1, Baltzer
    Science Publishers, 2000.
[6] N. Drakos, "From text to hypertext: A post-hoc rationalisation of
    LaTeX2HTML," In Proceedings of the First World Wide Web
    Conference, Geneva, Switzerland, 1994.
[7] O. Lassila and R. Swick, W3C, Resource Description Framework (RDF)
    Model and Syntax Specification.
    http://www.w3.org/TR/REC-rdf-syntax, 1999.
[8] Mapuccino, http://www.ibm.com/java/mapuccino/
[9] Y. Maarek and I. Shaul, "WebCutter: A System for Dynamic and
    Tailorable Site Mapping," In Proceedings of WWW6 Conference,
    Santa Clara, USA, 1997.
[10] D. Raggett et al, HTML 4.01 Specification,
    http://www.w3.org/TR/html401/, 1999.
[11] J. Punin and M. Krishnamoorthy, "WWWPal System - A System for
    Analysis and Synthesis of Web Pages", In Proceedings of the
    WebNet98 Conference, Press Inc, July 1999.