Query Language for Structural Retrieval of Deep Web Information

(1)

Deep Web Information

StefanMuller 1

,Ralf-DieterSchimkat 1

,andRudolfMuller 2

1

UniversityofTubingen, WSIforComputerScience,Sand13, D-72076

Tubingen,Germany

2

UniversityofMaastricht,DepartmentofQuantitativeEconomics,P.O.Box616,

6200 MDMaastricht,TheNetherlands

Abstract. InformationprovidedontheWebisoftenhiddenindatabasesanddy-

namicallyextractedbyscriptpagesaccordingtosomeuserinput.Webinformation

systemsfollowthisconceptwhentheymusthandleahugeamountoffastchanging

datathat issomehowrelated (e.g.shoppers,routeplaners,yellowpages).Incon-

trasttostaticpagesitisveryhardfortraditionalsearchenginestoproperlyindex

pageswithnon-staticcontentinawaythatWebuserscanperformaprecisesearch.

Theinformationthatisprovidedthroughdynamicscriptpagesisoftencalledthe

HiddenorDeepWeb.Wepresentaretrievalapproachthatusesthestructureofthe

databaseschemasandfurtherschemainformationtocalculateasimilaritybetween

a structured query and registered Deep Web information systems. Our retrieval

process is a combinationof structured and keyword-basedretrieval. This combi-

nation shouldovercome thedrawbackof asolelykeyword-basedindexingmethod

whichisinsuÆcienttoexactlydescribethecontentoftheDeepWeb.Dynamically

andindividuallyadjustableretrievalbehaviorfurtherimprovestheuseabilityofour

approach.

1 Introduction

TheWorld-Wide-Web(WWW)asaglobalrepositoryforinformationisabig

improvement for thequalityof life of people aroundthe world: Organizing

ajourney,buyingorsellinggoods,orjustsearchingdetailedand up-to-date

informationbecomescheaper,faster,andeasier.Themainproblemistond

suitablepagestoperformtheindividualtasks. Theresearcheortstakento

solve this problem broughtup alot of dierent technologies, e.g.to better

describethecontentofWebpages(RDF[15]),theWebservices(UDDI[14]),

andelaboratedindexandsearchstrategies(pagerankusedbyGoogle[13]).

TheWWW contains alot ofpages that arenot explicitly made persis-

tent in aWeb-accessiblemanner. These pagesare dynamically constructed

by scriptsorservletsaccordingto someuserinput, e.g.traveldestinations,

or product lists of Web shops. The set of those pages is called the Deep

Web. Informationprovidersusedynamic pageswhentheyhaveto handlea

mass of related information that is rapidly changing overtime. Describing

thisinformationinawaythattraditionalsearchenginescanpresentthemto

an interesteduseris only possible, iftheUniform ResourceLocator(URL)

(2)

keywordslike"ight","from","Frankfurt","to","Hawaii"would probably

not delivera link to a travel agency, because "Frankfurt" and "Hawaii" is

an information of the Deep Web and therefore not available for the index

generation ofsearchengines.A keyworddescriptionofscriptpages(the en-

try points to the Deep Web) can only describe very vague the Deep Web

informationandtherealcontentofaWebsite.Additionallythemainpartof

the informationprovidedby keywordsrelies ontherelations betweenthem

andarenotexpressed.AsearchengineforDeepWebinformationshoulduse

thisstructuralinformation, too.

Itisinterestingtoseethatthestructuralinformation(lostwhenauser's

questionistransformedtoasetofkeywords)isactuallyusedtomanagethe

informationoftheDeepWeb:MostofthesitesoeringdynamicWebpages

store their data within relational databases for which Entity-Relationship

models (ER models [4]) accurately describe the content and relations. ER

models are abstract enough to deliver information about the stored data

withoutusingthedataitself.Imagineasearchenginethatindexestheinter-

nal ER models of Deep Web information systemsand provides afrontend

to formulate queriesin astructural manner like ER modelling. This paper

presentsarealizationofthisapproachbasedonaframeworkforgeneralmodel

retrieval[12].Thepaperisstructuredasfollows:Webrieydescribethecom-

ponentsoftheframeworkandtheirapplicationin thedomainofERmodels

in the nextsection. Section 3evaluates thestructural retrievalapproachin

aclassroomexperiment.Section4showshowthestructuralqueryapproach

can becombinedwiththenecessarykeyword-basedretrieval.Dynamiccon-

trolofretrievalbehaviorisexplainedinsection5.Beforewesummarizeand

concludethispaper,section6givesanoverviewofotherapproachestosearch

forDeepWebinformation.

2 Structural Retrieval of ER models

ThemanagementofcomplexmodelslikeERmodelsisaverypopularscien-

ticeld. Mapping,matching,ormergingmodelsisrequiredinmanysitua-

tions,e.g.whencompanieswantto sharedatacontainedin dierentkindof

databasesordocuments.There isaneedofmoreabstractdatadescriptions

andoperationstoautomaticallyperformthegiventasks.Bernsteinetal.de-

scribedin [1]analgebraonageneralizedviewofcomplexmodels.Matching

of models with respect to their inherent structure is oneof theissues that

havetobetackled.

Firestorm[10]isaframeworkfortheretrievalofmodelsaccordingtotheir

implicitgraphstructure.Thedetails arepresentedin [11]and[12].Thecore

of the framework is a representation of models through three-layergraphs

calledStructuredServiceModels(SSM).Thisrepresentationisbasedonthe

conceptsofStructuredModelling[6],whichisusedtoformulatemathematical

(3)

graphsimilaritybetweendierentSSMs.Afterstatingaqueryinagraphical

waythe systemreturnsan orderedlist of themodelsthat aremost similar

to thequerywithrespecttothegraphstructures.

Customer Account

Transaction

has performs

(1,1)

(0,N)

(1,1) (0,N)

Entity Entity Entity

Relationship Relationship Transaction Account Customer

has performs

(1,1) (0,N) (1,1) (0,N)

ER Model SSM

Fig.1. MappingbetweenERmodelandSSM

Weextended the framework to the eld ofER modelsby thedenition

of aone-to-one mapping betweenER models and SSMs followingthe rules

of [12]:Dierent kindof entities become dierent kindof nodesof the rst

layer.Relationshipsbecomenodesofthethirdlayerandtherelationsbetween

entities and dependencies between entities and relationships become nodes

onthesecondlayerwithtypesderivedfromthecardinalitiesoftherelations.

Figure1showsasimpleERmodelanditscorrespondingSSM.Themapping

rulesenforcethatSSMsalwaysdecomposeintotwobipartitegraphs.Apartof

themappingsweimplementedgraphicaluserinterfacesusedtostatequeries

andtovisualize thestoredSSMsasnaturalERmodels.

Theserverpartof Firestormremains merelyunchanged.ItstoresSSMs

inarelationaldatabaseandprovidesalgorithmstospecifythesimilaritybe-

tweenSSMs.ThissimilarityexploitstheadjacencystructureofSSMgraphs.

GiventwographsG=(V;E)andG 0

=(V 0

;E 0

),welookatpartialmappings

of the nodes from G onto the nodes of G 0

which only map nodes onto

nodes ofthe sametype.The numberq expressesthenumberof edges in G

that areimplicitlymappedonto edgesin G 0

, i.e.edges(v;w)2E suchthat

((v);(w))2E 0

.Thematchingqualityrealizedbythemapping isdened

asd()=2q=(jEj+jE 0

j).Thesimilaritybetweentwographsisthendened

by the maximum overall matching qualities of mappings from G onto G 0

.

Mappingsarealwaysonetoone.Thematchingqualityisarationalnumber

between0and1withavalueof1indicatingthatthereexistsamappingbe-

tweenthelibrarygraphandthequerygraphthatmatchesexactlyalledges.

Aswecan assumew.l.o.g.that therearenoisolatednodesinaSSM,this is

thecaseifandonlyifbothgraphsareisomorphic.

(4)

is NP-hard. But the three-layer structure of SSMs supports the design of

heuristicandexactretrievalalgorithmsaswellasafastlter.Theexactal-

gorithmenumeratesallfeasiblemappingsofnodesfromthesecondlayerand

solvesforeachofthemthematchingproblemsonlayers1and3tooptimality

byusingalgorithmsforweightedbipartitematching. Theexactalgorithmis

polynomialforaxednumberofnodesonlayer2,andareasonablealgorithm

forsmallnumbersofnodesonlayer2.Filterandheuristicalgorithmsspeed

up the retrievaltime tremendouslyby reducing the set of graphsto which

theexactalgorithmhasto beapplied. Furtherdetails ofthealgorithms are

explainedin[11].

Fig.2. AppletinterfaceofFirestorm

As an example, gure 2shows the search for a site that deals with in-

formation in the context of online banking. Customers have accounts and

performtransactions.Thereis actuallynoonlinebankingsystemregistered

in Firestorm and we get a top match for a system that logs car accidents

becausethegraphstructureis identical.Obviouslythisisnotananswerwe

wantto achieve.So thereis aneedto combine thestructuralretrievalwith

(5)

weaddednewfunctionalitytoFirestormtone-tunetheretrievalalgorithms

withrespecttotheretrievalsituationofauser.Thesefunctionalitiesareex-

plainedin section5.Butrstofall,reportingexperienceswithaclassroom

exercisewillindicatethat structuralsimilarityofER modelsisagoodmea-

surefortheclosenessofcorrespondingrealworldsituations(ifthecontextis

predened).

3 Evaluation of Structural Retrieval

ThissectionwillshowhowwellthestructuresofSSMscapturethesemantics

of corresponding ER. Modelling is an art and one may argue that e.g. the

samereal worldsituation canbemodelled in verydierent waysleadingto

dierent SSMs. The structural retrieval approach reduces the detection of

similaritybetweendierentERmodelstothegraphsimilaritybetweentheir

SSMs but the retrievalqualitydepends on semantic similarity betweenER

models. The restrictiveness in structure and available means to create ER

models somehow limits the modelling freedom but a mathematical quality

proofisoutofsight.

Asexplained in theprevious sectionFirestorm wasextended to thedo-

main of ER models. This means that the system canreport the similarity

between two dierent ER models. The focus of our working group at the

Universityof Tubingen (Germany)isondatabaseandinformationsystems.

PartofadatabasecourseistolearnhowtocreateERmodelsthat describe

an aspect of the real world to be managed with adatabase. In this course

FirestormisusedbystudentstocreateandmanagetheirER models.

The course contains three exercises that require students to design ER

models for three dierent real world cases. These cases are formulated as

naturallanguagetext. Thetextimpliesthenaming ofentitiesand relation-

shipssothat wecanconcentrateonlyonthestructure.Somestamembers

evaluatedthesolutionsofthestudentsandassigned creditpointsaccording

tohowwelltheERmodelsrepresenttherealworldcases.Foreachexercisea

stamembercreatedamastermodelrepresentingtheexpectedsolution.This

madeitpossibletoevaluatewhetherhighsimilarityontheSSMs(respectively

the correspondingER models)reects acertain similarity accordingto the

semantics of the modelled real world cases, expressed by a high grade. A

positiveanswerwouldpromisetoapplythestructuralretrievalapproachfor

thedomainofERmodels.

These master models were used as queriesto the system with the cor-

responding student models building the library. The system computed the

(graph) similarity betweeneach studentmodel and themaster model. The

similaritieswereconvertedinto creditpointsbysimplymultiplyingthesim-

ilaritieswithmaximalcredit points.

(6)

thestamemberstothecreditpointsassignedbyFirestorm.Thedierences

in credit point assignment are marginal so it seems that graph similarity

between SSMs can be used to determine semantic similarity between ER

models (aslongasthecontextisclear).

1 2 3 4 5 6 7 8 9 10 11 0

1 2 3 4

credit points

student

Exercise 1

staff

system 1 2 3 4 5 6 7 8 9 10 11 0

1 2 3 4 5 6 7

credit points

student

Exercise 2

staff system

1 2 3 4 5 6 7 8 9 0

1 2 3 4 5 6

credit points

student

Exercise 3

staff system

Fig.3. Creditpointassignment(stavs.Firestorm)

4 Combining Structural and Keyword-based Retrieval

The results of the previous section are verypromising but as the example

usage of section 2shows structural retrieval on its own cannot distinguish

between dierent contexts. The application of Firestorm on the Web with

thousands of sites oering heterogenousinformation needs a mechanismto

inuencethestructuralretrievalbysomethingdescribingthecontext.

Ameaningful context descriptionisgivenby thenamesof the elements

(e.g. Entities, Relationships) that compose an ER model. Tocalculate the

structuralsimilaritythealgorithmsmapnodesonnodesandcountthenum-

berof implicitlyoverlappingedges.Forallalgorithmswecandeneweights

peredgewithout changingthealgorithmstructure. Structuralretrievalcan

thenbecombinedwithsomekindofkeyword-basedretrievalinthefollowing

way:

1. Thesystemlls amatrixthat keepsthecloseness betweennode names

for each combination of nodes. If the system only allows the usage of

namesof a controlled vocabulary ornames with agivensemantic(e.g.

byRDF[15])thenormalizedstringdistances(e.g.editdistance)express

theclosenessbyavaluebetween0and1withavalue1indicatingthat

twonameshaveanidenticalmeaning.

2. Theretrievalalgorithmsusethevaluesofthematrixtoidentifytheweight

ofrealized edge mappingwhich is themean valueof the corresponding

nodemappingmatrixvalues.

The similarity between structural identical graphs that have no node

namesincommonwillgetmuchlower.

(7)

Like in any other retrieval system, the retrieval quality can be measured

in terms of Recall and Precision. Recall is the quotient of the number of

relevantER modelscontainedin theresultsetbythenumberofallrelevant

ER models of thesystem withrespect to aquery. Precisionisthe quotient

of the number of relevant ER models contained in the result set and the

size oftheresultset with respectto agivenquery.Recalland precisionare

noobjectivemeasuresbecauseeach userhasitsowndenition ofrelevance.

In thefollowingweuse recall andprecisionin aweakersense dening that

allER modelsare relevant withrespect to aqueryifthesimilarity ismore

thanacertain threshold.Thethresholdcanbedenedbyauserjust before

starting a retrieval action. Additionally we implemented afunctionality to

controlthebehaviorofthestructuralretrievalalgorithms.

Generic Connection

(*,1)

(*,N) (0,N)

(1,N)

(N,M) (1,1) (0,1) Level 4 Level 3

Level 2 Level 1

Fig.4.Nodetypetreeofsecondlayer

ThecontrolfunctionalityreliesonatypetreeofSSMnodes.Asweknow

from section 2the elementsof ER models are mapped onto SSM nodes of

dierenttypes.Wedeneatypetreethatstartswithagenericroottypeand

andspecializesthetypesalongthetreepaths.Figure4showsthesubtreeof

thesecondlayerSSMnodetypesforthedomainofERmodels.Followingthe

roottypethesecond leveldenesthetypesofallnodesonthesecondlayer.

These nodes represent connections between entities and relationships. The

third level of the tree distinguishes between (*,N) connections and (*,1)

connectionsthat representtwodierent kindsofconnection cardinalitiesin

(min;max)notation.Thefourth levelofthetreecapturesthemostspecic

cardinalitydistinction((0,1),(1,1),(0,N),(1,N)and(N,M)).

Assume that the SSM of the query model and the SSMs of all library

modelsonlycontainnodesofthemostspecictypetreelevels.Beforeauser

triggers a retrieval action the user species for each layer which tree level

should be used by the algorithms for the similarity computation, in other

wordswhichsubtypescanbematchedtoeachother.Thismeanse.g.thatif

ausers chooses the third tree level for thesecond layer thealgorithms can

(8)

userchoosesthefourthtreelevelforthesecondlayer.

The eect of this mechanism is that for a certain threshold choosing a

higher type level reduces the size of the result set. Some of the relevant

models maynotappearin the resultset but theprecisionof theremaining

modelsishigher.Ontheotherhandsettingalowtypelevelleadsto alarge

resultsetwithpossiblyalotofirrelevantmodels.Butirrelevantmodelscan

contain alot ofhintsthat help to specifythe typelevels in awaythat the

resultsmeetauser'sexpectations. Withthis functionalityausercanadjust

theretrievalsystemtowardstheindividualsenseofrelevance andquality.

6 Related Work

Searching for information that is hidden in the Deep Web is not a new

task. Sites providing information generated by scripts out of data stored

in databasesexist nearlyaslongasthe Web exists. There area number of

searchenginesthatindexthosesitesandquerythemaccordingtosomeuser

input. We look at someof those enginesand relate them to the structural

retrievalapproach.

OneofthemostpopularcommercialDeepWebsearchenginesisLexiBot

[3]fromBrightPlanet.Itsearchesamong2200databasesandprovidesaquery

interfaceforsimpletext orbooleanqueries.Theuserscanadjustthesearch

strategiesandresultpresentationtoitsownpreferences.

Alsofrom BrightPlanet is thefree search engine CompletePlanet [2]. It

contains the addresses of about 103000 databases organizedin a directory

structurewithcategoriesandsubcategories.Thesearchinterfaceallowssim-

pletextinput.

TheLII[9]isafreesearchenginewithanannotated,searchablesubject

directoryofabout9000Webresources.Theresourcesareselectedandevalu-

atedbylibrariansfortheirusefulnesstousersofpubliclibraries.Simpletext

andbooleanoperatorsarethemeansto statequeries.

The search engines of IntelliSeek (ProFusion [8] and InvisibleWeb [7])

resemblemuchtheothermentioneddirectory-basedsearchengines.

Neither of the presented search engines use the database structure of

DeepWebinformationsystems.Theydescribethecontentof thosesystems

withkeywordsandorganizethem(byhandorautomated)withinadirectory.

Queryinterfacesrequiresimpletextandbooleanqueries.Theyareeasierto

usethanourqueryinterfacethat isbasedonERmodels.Butespeciallythe

structural dependencies between dierent information units captures much

moresemanticsandthereforeallowsamoreprecisesearch.

(9)

This paperdescribesastructuralretrievalapproachto search forWebsites

that provide information contained in the Deep Web. Therefore weextend

thegeneralmodelretrievalframeworkFirestormtothedomainofERmodels.

Inuenced by thespecial requirements of theWeb wedescribe atechnique

to combine structural and keyword-basedretrieval. A functionality for the

dynamiccontrolofretrievalbehaviorbasedontypetreesprovidestheusers

with meansto adjust the systemwith respectto the individual preferences

accordingto recallandprecision.

Usershaveto statetheirqueriesto thesystembyspecifyingERmodels

todescribethedesiredinformationin structuralcorrelation.Manyusersare

unfamiliar with ER modelling but it is a standardized and powerful way

to express dependenciesbetweeninformation objects.The simplegraphical

user interfaceshould allowusers with basicknowledge to specify their own

ERmodels.

AnotherdrawbackoftheapproachistheabsenceofERmodelsthatWeb

sitesneedtoregisterwiththesystem.Mostofthetargetsitesstoretheirdata

withinrelationaldatabases.IftheydonothaveERmodelsthatdescribetheir

databasesatoolcouldprobablycreateamodel outofthedatabaseschema.

Apart of the sandbox evaluation presented in section 3 the structural

retrieval approach has yet to prove its applicability under the real world

conditions of the Web. Thereforewehopethat many Web sites registerto

oursystemto provideWebusersapromisingwayto searchfor information

oftheDeepWeb.

LookingintothefuturetheongoinghypeoftheSemanticWeb[5]brings

up many applications that requiretools to handle semanticdescriptions of

information.Thesemanticdescriptionismostlydonefollowingtheprinciples

of RDF [15].RDF denitions resemble graphs that show the dependencies

among the individual elements. The management of RDF denitions can

benetalotbyanRDFextensionofFirestorm.

References

1. P.A.Bernstein, A.Y. Halevy,and R.Pottinger. Avisionof management of

complexmodels. SIGMODRecord,29(4):55{63,2000.

2. BrightPlanet. CompletePlanet,2000. http://www.completeplanet.com/.

3. BrightPlanet. LexiBot,2001. http://www.lexibot.com/.

4. P.P.S.Chen. Theentity-relationshipmodel|towardauniedviewofdata.

Proceedingsofthe1thConference onVeryLargeDatabases,MorganKaufman

pubs.(LosAltosCA),Kerr(ed),pp.173, 1975.

5. S. Decker. The Semantic Web Community Portal, 2001. http://-

www.semanticweb.org/.

6. A.M.Georion.Anintroductiontostructuredmodeling.ManagementScience,

33(5):547{588,1987.

(10)

www.invisibleweb.com/.

8. IntelliSeek. ProFusion,2001. http://www.profusion.com/.

9. Library of California. Librarians' Index to the Internet, 2001. http://-

www.lii.org/.

10. S.Muller. FIRESTORM{FIrsta REtrievalSysTemforOperationsResearch

Models,2002. http://restorm.informatik.uni-tuebingen.de/ermodeling.

11. S. Muller and R. Muller. Retrieval of service descriptions using structured

service models. In Proceedings of the 10thAnnual Workshop onInformation

TechnologiesandSystem,pages55{60,2000.

12. S.MullerandR.Schimkat. Ageneral,web-enabledmodelretrievalapproach.

InProceedingsoftheInternationalConferenceforObject-OrientedInformation

Systems(OOIS2001),2001.

13. L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Cita-

tion Ranking: Bringing Order tothe Web, 1998. http://citeseer.nj.nec.com/-

page98pagerank.html.

14. UDDIcommunity. Universal Description,Discoveryand Integration of Busi-

ness fortheWeb(UDDI),2002. http://www.uddi.org/.

15. WorldWideWebConsortium(W3C). Resource DescriptionFramework,Feb.

1999. http://www.w3.org/TR/REC-rdf-syntax/.