Integration of resources on the World Wide Web using XML

(1)

Cahiers

enberg

GUT GUT GUT

m INTEGRATION OF RESOURCES ON THE WORLD WIDE WEB USING XML

P Roberta Faggian

Cahiers GUTenberg, n  35-36 (2000), p. 157-167.

<http://cahiers.gutenberg.eu.org/fitem?id=CG_2000___35-36_157_0>

© Association GUTenberg, 2000, tous droits réservés.

L’accès aux articles des Cahiers GUTenberg (http://cahiers.gutenberg.eu.org/),

implique l’accord avec les conditions générales

d’utilisation (http://cahiers.gutenberg.eu.org/legal.html).

Toute utilisation commerciale ou impression systématique

est constitutive d’une infraction pénale. Toute copie ou impression

de ce fichier doit contenir la présente mention de copyright.

(2)

(3)

Integration of resources on the World

Wide Web using XML

RobertaFaggian

CERN, Genève,Suisse

Abstract. An initiative to explain High Energy Physics to the general public has

beenstartedatCERN.TheuseoftheWebhasbeenidentiedascrucialtothesuccess

of this initiative. An integral part of this project is theconstruction of a Web-based

informationsystem,whichcollectsmanydierentresourcesontheWeb(information

publishedbymanyEuropean,andUS,ParticlePhysicsinstitutes).Thispaperproposes

a solution tothe problemof integration and reuse of heterogeneous information by

enrichingexistingcontentsemanticwithmetadatainordertoimproveunderstanding

anddiscovery.ThemainpartoftheworkisthestudyofRDFstandardforrepresenting

metadata,anditsimplementationusing theXMLsyntax.

1. Introduction

The spread of the Internet and the enormous growth of the World Wide

Web have allowed everybody to publish and distribute electronic documents

throughout the world. At the same time, however, the enormous amount of

informationontheInternetmakesitverydiculttondspecicitems.Thisis

themainreasonwhysearchengineswereinvented:to searchtheWebforpar-

ticular subjects.Search engines,bydenition,are verygood atsearchingthe

Web,but theanswerswearelooking forare oftenburiedinsideanenoumous

amountofirrelevant,atleasttous,information.Thereasonforthisisthatthe

authorsofWebpagesdidnotconsiderstructuringtheirinformationandusing

metadata to help the search engines; furthermore, the language commonly

usedforpublishingontheWeb,HTML,is,ontheonehand,easyto use,but,

ontheotherhand,hasseverallimitations:

noextensibility,itisnotpossibletodenenewtagstobetterclassifythe

information;

poorstructuringofdocuments,makingitinsucientforpublishing com-

plexinformation;

(4)

astandardformattingisassociatedtoeachtag.

ThenewreleasesoftheHTMLlanguage(DHTML,XHTML)trytoovercome

these limitationsandproducedocumentswhichcanbebetterunderstood and

processed. Newapproaches are tryingto oer better alternativesand, at the

moment, XML seems to be a good solution: it enriches document structure,

meaning, comprehensibility, validation, representation and more. But what

should wedowiththemillionsofHTMLdocumentsalreadyontheWeb?

This article proposes how to organisemanyheterogeneus Web resourcesin a

way that they can be classied and searched using appropriatemetadata to

describe their contents, the RDF model to structure the metadata and the

XML syntaxto implementthemodel.

The case of study described here concerns the eld of High Energy Physics

(HEP), butitcouldbeusedin anyothereldofknowledge.

2. The problem of information retrieval

TheWebcontainsinformationaboutahugenumberofsubjectsintendedfora

wide varietyofpurposes, and arestructured and formattedin many dierent

ways.In orderto allowWeb resourcesto be properlysearched and processed

metadataneedsto beused.

2.1. Metadata

Metadata is information about information, i.e. descriptive/semantic infor-

mation about documents which is structured in such a way that it can be

usedforclassicationandretrieval;allowingdocumentstobebetterusedand

shared over the Net. It is unlikely that everyonewill agree to use the same

metadata implementation, nevertheless, metadatafunctionality have alot in

common, even when the metadata are dierent. RDF (Resource Description

Framework) is aneort to identify these common threads and denes away

forwebarchitectsto usethem toprovideusefulWebmetadata.

2.2. RDF

RDF(ResourceDescriptionFramework)denesastandardmechanismforrep-

resentingandinterchangingmetadata,whichissuitablefordescribinginforma-

(5)

anaudiole,amovieclip,etc.)canbereferredtobyanaddressanddescribed

using metadata in the form of Statements. A Statement is a triplet which

associates a specic resource, a named property and the value the property

takesfor thatresource.A Statementcan begraphicallyrepresentedasshown

below:

Asimpleexampleofastatementis:

The title of the page identified by http://mysite/intro.html is

`History of Physics'.

Thecorrespondingtripletis

(http://mysite/intro.html, Title, `Hystory of Physics')

which canberepresentedasfollows:

RDFpropertiesmaybethoughtofasattributes ofresourcesandalso asrela-

tionshipsbetweenresources.

SinceXML seemsto beagood solutionasanexchangeformatontheWebit

hasbeenproposedasthesyntaxfortheimplementationoftheRDFmetadata

(6)

<rdf:RDF>

<rdf:Description about="http://mysite/intro.ht ml">

<s:Title>Hystory of Physics</s:Title>

</rdf:Description>

</rdf:RDF>

that becomes,withthebasicabbreviatedsyntax:

<rdf:RDF>

<rdf:Description about="http://mysite/intro.ht ml"

s:Title="Hystory of Physics">

</rdf:RDF>

RDFcan beusedin avarietyofapplicationareas:

in resourcediscovery,toprovidebettersearchenginecapabilities;

in cataloging,to describethecontentandcontentrelationshipsavailable

ataparticularWebsite,page,ordigitallibrary;

by intelligent software agents, to facilitate knowledge sharing and ex-

change;

in contentrating,forcontentselection;

in describing collections of pages that represent a single logical "docu-

ment";

fordescribingintellectualpropertyrightsofWebpages;

for expressing the privacy preferences of a user as well as the privacy

policiesofaWebsite.

2.3. RDF Schemas

Whendeningameta-representationofinformationitisfundamentaltoavoid

ambiguousdescriptions.Theauthorandthereadermustagreeonthemeaning

of thetermsused forrepresentingproperties andproperty values.Inaglobal

system,liketheWWW,onecannotassumethatalltermswillnecessarlymean

thesamething foreverybody.It isnecessary, therefore,to beprecise andun-

ambiguous.TheRDFSchemaspecicationprovidesamechanismthat canbe

used to denespecial vocabulariesfor avarietyof application domains. This

should helppeopletocommunicateeectively.

Independentcommunitiescandevelopvocabulariesthatsuittheirspecicneeds

andsharevocabularieswithothercommunities.Aschemadenesthemeaning,

characteristics,andrelationshipsofasetofproperties(thevocabulary'sterms),

(7)

propertiesfromotherschemas.TheRDFlanguageallowseachdocumentcon-

tainingmetadatato clarify which vocabulary is being usedby assigningeach

vocabulary aWeb address.If twoapplications usethe sametagnames XML

Namespacesbecomeimportantinresolvingconicts.

One of the best-known schemas is the Dublin Core, invented by the library

community.TheDublin CoreMetadataElementSetwasintendedtodescribe

any electronic document or resource. It is made up of the following items:

Title, Creator, Subject, Description, Publisher, Contributor, Date, Type,

Format, Identier,Source,Language,Relation,Coverage,Rights.

3. The project

The European ParticlePhysics Outreach Groupwasborn at CERN in 1997

asaninitiativeofECFA(EuropeanCommiteeforFutureAccelerators).Itwas

felt that as a publicly funded activity, HEP needed to better explain to the

Europeen tax-payers what exacly they were doing. The group consists of a

networkofpeople,representingtheCERNmemberstates,whoareinvolvedin

thepopularisationofParticlePhysics.

It was identiedat avery earlystage that theWWW would play animpor-

tant role in thework of thegroupand the project to constructaWeb-based

informationsystembegunin 1998.Theproject'saimsweretocollectdataand

publicationsrelatedtoHEP,mergeandorganiseinformationalreadypublished

onotherWebsitesbyEuropean,andUS,HEPresearchinstitites,universities,

museums, laboratories, etc. Today the project includes more than 140 Web

sites.

Inimplementingthissystem,themainproblemwehadtofacewasthatinfor-

mationisheterogeneous(instructureandtype),iscontinuouslychangingand

isscatteredacrosstheWeb.

Thesolutionwearedevelopingisbasedontheideathatalltheinformationwe

wantto organisecanbeconsideredasWebresourceswhich canbeindentied

byan uniqueidentier. Each resourcecanbedescribedwith theappropriate

set ofmetadata(Properties),which wewill dene. A databaseof metadata

can be createdand queriedby users to retrieve thelinks to the resources in

which theyareinterested.

InoursystemweidentifyeachWebresourcebyitslocation,theURL(Uniform

Resource Locator), and we associate to it a description of its contents. In

this waywecanwork (search andreorganisethe Webresources) ata levelof

(8)

Property Value

Institute typeoftheinstitutepublishing theinformation(resear

Category categoryofinformation(news,events,seminars,people,...)

Kind typeofinformation(text,image,slide,video,...)

Where wheretheinstituteissituated

When datereletedto theinformation,ifeventsorconferences...

Contact thepersonsresponsableforthisresource

Expiry expirydate,ifitexists

Related secondarysubjects

Table1:Propertieschosenfordescribingtheresources.

information objects distributed across multiple locations and systems in the

samesimpleway.

4. Arriving at a solution

TherststepisthechoiceofPropertieswhichbestdescribeourresources.We

havedecidedtoadoptthefollowing:

theDublin CoreElementsSet,

otherpropertiescorrespondingtotheneedsofthedomainofinformation

wearegoingtotreat;thesearelistedandbrieydescribedin table1.

This vocabularyof terms willbe usedfor describingthe resourceswhich will

takepartin ourinformationsystem.Wehavetodenetheirmeaningandthe

possiblevaluestheycanassumeinanRDFSchema.

Our model is ready: we have dened a special vocabulary for our appli-

cation domain and a precise meaning forits terms. This means we will have

a common-undestanding of terms and subjects for everybody sharing these

information.

Forinstance, wecanconsider the RDFmodelof the metadatarelated to the

resourcereferredto by theURLhttp://www.pd.infn.it/,itcan begraphi-

cally representedinthis way:

(9)

XMLcanbeusedfortheimplementation:

<?xml version="1.0"?>

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"

xmlns:dc="http://purl.org/metadata/dublin_core#"

xmlns:dcq="http://purl.org/metadata/dublin_core_qualifiers#"

xmlns:otrc="http://outreach.cern.ch/rdf/elements.htm#"

xmlns:ent="http://outreach.cern.ch/rdf/entity.htm#">

xmlns:pla="http://outreach.cern.ch/rdf/place.htm#">

<rdf:Description about="http://www.infn.pd.it/"

<dc:Title>INFN - Sezione di Padova</dc:Title>

<dc:Creator>

<ent:Name>Massimo Gravino</p:name>

<ent:Email>[email protected]</p:Email>

</dc:Creator>

<dc:Subject>

<rdf:Bag>

<rdf:li>experiment</rdf:li>

<rdf:li>accelerator</rdf:li>

<rdf:li>cosmic ray</rdf:li>

<rdf:li>neutrino beam</rdf:li>

<rdf:li>nuclear physics</rdf:li>

<rdf:li>theoretical physics</rdf:li>

(10)

...

</rdf:Bag>

</dc:Subject>

<dc:Description>Istituto Nazionale di Fisica Nucleare,

Sezione di Padova</dc:Description>

<dc:Type>home page</dc:Type>

<dc:Format>text/html</dc:Format>

<dc:Language>en</dc:Language>

<dc:Language>it</dc:Language>

<dc:Relation rdf:parseType="Resource">

<dcq:RelationType rdf:resource=

"http://purl.org/metadata/dublin_core_qualifiers#IsPartOf"/>

<rdf:value resource="http://www.infn.it"/>

</dc:Relation>

<otrc:Institute>research centre</otrc:Institute>

<otrc:Category>

<rdf:Bag>

<rdf:li>research</rdf:li>

<rdf:li>experiment</rdf:li>

<rdf:li>event</rdf:li>

<rdf:li>people</rdf:li>

...

</rdf:Bag>

</otrc:Category>

<otrc:Kind>

<rdf:Bag>

<rdf:li>text</rdf:li>

<rdf:li>reference</rdf:li>

<rdf:li>slide</rdf:li>

</rdf:Bag>

</otrc:Kind>

<otrc:Where>

<pla:Country>Italy</pla:Country>

<pla:City>Padua</pla:City>

</otrc:Where>

<dc:Related>

<rdf:Bag>

<rdf:li>University of Padua</rdf:li>

<rdf:li>department of Physics</rdf:li>

</rdf:Bag>

</dc:Related>

</rdf:Description>

(11)

Datadescribingtheresourcescanbestoredinacentralrepository.Thisrepos-

itorycanbeorganisedasadatabaseholdinginformationaboutpropertiesand

propertyvaluesfor each resource.Thisdatabasestoresameta-representation

of the information we want to organise, and it can be used to execute a

metadata-level query on the resources it stores. In this way we can obtain

high-precisionresults and theintegration of heterogeneus information. XML,

together with the related standards, will be used as the syntax for the

representationof theresults.

Thefollowingschemagivesanideaof thesystemoperatingcycle:

5. The Web interface

How can we dene the metadata describing all the resourcesof our system?

Fortunately wecan count onthecollaborationof thepersons involvedin the

Outreachproject.Peoplein chargeofllingupthemetadatarepositorydon't

needto haveatechnical prole:asimpleWeb interface(passwordprotected)

(12)

representedhere:

Below,weseeanexampleofhowthemeta-informationcanbeusedtohelpthe

user nding resources. An alphabetical list of the subjects can be consulted,

and foreachsubjectthecorrespondingresourceswillbelisted.

(13)

6. Conclusions

Thekeypointsofoursolutionare:

thedenition ofapersonalised,descriptiveRDFmodeloftheresources;

theexibilityoftheXMLimplementation.

Usingoursolution,theproblemofintegrationofresourcesbecomesaproblem

ofdenition oftheirmeta-representations.These meta-representationscanbe

usedtoquerythewholeinformationsystematametadata-levelofabstraction

insteadofworkingontheactualdocuments.

Theadvantagesofoursolutionare:

we havedened acommon basisfor understanding foreveryonesharing

theseresourcesandusingourvocabulary;

metadataisupdatedthroughoutthecollaboration,i.e.theworkofadding

totheknowledgebaseisdistributed;

metadatastoredin therepositoryhasbeenenteredmanuallyandsowill

havesignicantaddedvalueovercomputer-generatedinput;

metadata can be processed independently from the resources to which

theyrefer.Inthiswaytheresourceisnotbound toonlyonedescription;

asinglemetadata-entryintherepositorycanrepresentanentrypointinto

anarbitrarilycomplexresource.Theentrypointto acompleteWeb site

wouldbeitsHomePage;

dierentkindsofresourcecanbeincluded,eachidentiedbyaURL;

theresponsabilityoftherepositorymaintenanceiscentralised;

theRDFmodelcanbeextended,ifnecessary,simplybyaddingnewterms

andpropertiestothevocabulary;

thesolutionisindependentofanyproprietarytechnology;

thesolutioncanbeapplied toanyother domainby redeningtheprop-

ertiesparticulartothismodel;

the use of XML (metadata can be shared, reformatted, etc.) which is

becoming thede-factostandardforsharinginformationontheWeb.