Deep Web Information
StefanMuller 1
,Ralf-DieterSchimkat 1
,andRudolfMuller 2
1
UniversityofTubingen, WSIforComputerScience,Sand13, D-72076
Tubingen,Germany
2
UniversityofMaastricht,DepartmentofQuantitativeEconomics,P.O.Box616,
6200 MDMaastricht,TheNetherlands
Abstract. InformationprovidedontheWebisoftenhiddenindatabasesanddy-
namicallyextractedbyscriptpagesaccordingtosomeuserinput.Webinformation
systemsfollowthisconceptwhentheymusthandleahugeamountoffastchanging
datathat issomehowrelated (e.g.shoppers,routeplaners,yellowpages).Incon-
trasttostaticpagesitisveryhardfortraditionalsearchenginestoproperlyindex
pageswithnon-staticcontentinawaythatWebuserscanperformaprecisesearch.
Theinformationthatisprovidedthroughdynamicscriptpagesisoftencalledthe
HiddenorDeepWeb.Wepresentaretrievalapproachthatusesthestructureofthe
databaseschemasandfurtherschemainformationtocalculateasimilaritybetween
a structured query and registered Deep Web information systems. Our retrieval
process is a combinationof structured and keyword-basedretrieval. This combi-
nation shouldovercome thedrawbackof asolelykeyword-basedindexingmethod
whichisinsuÆcienttoexactlydescribethecontentoftheDeepWeb.Dynamically
andindividuallyadjustableretrievalbehaviorfurtherimprovestheuseabilityofour
approach.
1 Introduction
TheWorld-Wide-Web(WWW)asaglobalrepositoryforinformationisabig
improvement for thequalityof life of people aroundthe world: Organizing
ajourney,buyingorsellinggoods,orjustsearchingdetailedand up-to-date
informationbecomescheaper,faster,andeasier.Themainproblemistond
suitablepagestoperformtheindividualtasks. Theresearcheortstakento
solve this problem broughtup alot of dierent technologies, e.g.to better
describethecontentofWebpages(RDF[15]),theWebservices(UDDI[14]),
andelaboratedindexandsearchstrategies(pagerankusedbyGoogle[13]).
TheWWW contains alot ofpages that arenot explicitly made persis-
tent in aWeb-accessiblemanner. These pagesare dynamically constructed
by scriptsorservletsaccordingto someuserinput, e.g.traveldestinations,
or product lists of Web shops. The set of those pages is called the Deep
Web. Informationprovidersusedynamic pageswhentheyhaveto handlea
mass of related information that is rapidly changing overtime. Describing
thisinformationinawaythattraditionalsearchenginescanpresentthemto
an interesteduseris only possible, iftheUniform ResourceLocator(URL)
keywordslike"ight","from","Frankfurt","to","Hawaii"would probably
not delivera link to a travel agency, because "Frankfurt" and "Hawaii" is
an information of the Deep Web and therefore not available for the index
generation ofsearchengines.A keyworddescriptionofscriptpages(the en-
try points to the Deep Web) can only describe very vague the Deep Web
informationandtherealcontentofaWebsite.Additionallythemainpartof
the informationprovidedby keywordsrelies ontherelations betweenthem
andarenotexpressed.AsearchengineforDeepWebinformationshoulduse
thisstructuralinformation, too.
Itisinterestingtoseethatthestructuralinformation(lostwhenauser's
questionistransformedtoasetofkeywords)isactuallyusedtomanagethe
informationoftheDeepWeb:MostofthesitesoeringdynamicWebpages
store their data within relational databases for which Entity-Relationship
models (ER models [4]) accurately describe the content and relations. ER
models are abstract enough to deliver information about the stored data
withoutusingthedataitself.Imagineasearchenginethatindexestheinter-
nal ER models of Deep Web information systemsand provides afrontend
to formulate queriesin astructural manner like ER modelling. This paper
presentsarealizationofthisapproachbasedonaframeworkforgeneralmodel
retrieval[12].Thepaperisstructuredasfollows:Webrieydescribethecom-
ponentsoftheframeworkandtheirapplicationin thedomainofERmodels
in the nextsection. Section 3evaluates thestructural retrievalapproachin
aclassroomexperiment.Section4showshowthestructuralqueryapproach
can becombinedwiththenecessarykeyword-basedretrieval.Dynamiccon-
trolofretrievalbehaviorisexplainedinsection5.Beforewesummarizeand
concludethispaper,section6givesanoverviewofotherapproachestosearch
forDeepWebinformation.
2 Structural Retrieval of ER models
ThemanagementofcomplexmodelslikeERmodelsisaverypopularscien-
ticeld. Mapping,matching,ormergingmodelsisrequiredinmanysitua-
tions,e.g.whencompanieswantto sharedatacontainedin dierentkindof
databasesordocuments.There isaneedofmoreabstractdatadescriptions
andoperationstoautomaticallyperformthegiventasks.Bernsteinetal.de-
scribedin [1]analgebraonageneralizedviewofcomplexmodels.Matching
of models with respect to their inherent structure is oneof theissues that
havetobetackled.
Firestorm[10]isaframeworkfortheretrievalofmodelsaccordingtotheir
implicitgraphstructure.Thedetails arepresentedin [11]and[12].Thecore
of the framework is a representation of models through three-layergraphs
calledStructuredServiceModels(SSM).Thisrepresentationisbasedonthe
conceptsofStructuredModelling[6],whichisusedtoformulatemathematical
graphsimilaritybetweendierentSSMs.Afterstatingaqueryinagraphical
waythe systemreturnsan orderedlist of themodelsthat aremost similar
to thequerywithrespecttothegraphstructures.
Customer Account
Transaction
has performs
(1,1)
(0,N)
(1,1) (0,N)
Entity Entity Entity
Relationship Relationship Transaction Account Customer
has performs
(1,1) (0,N) (1,1) (0,N)
ER Model SSM
Fig.1. MappingbetweenERmodelandSSM
Weextended the framework to the eld ofER modelsby thedenition
of aone-to-one mapping betweenER models and SSMs followingthe rules
of [12]:Dierent kindof entities become dierent kindof nodesof the rst
layer.Relationshipsbecomenodesofthethirdlayerandtherelationsbetween
entities and dependencies between entities and relationships become nodes
onthesecondlayerwithtypesderivedfromthecardinalitiesoftherelations.
Figure1showsasimpleERmodelanditscorrespondingSSM.Themapping
rulesenforcethatSSMsalwaysdecomposeintotwobipartitegraphs.Apartof
themappingsweimplementedgraphicaluserinterfacesusedtostatequeries
andtovisualize thestoredSSMsasnaturalERmodels.
Theserverpartof Firestormremains merelyunchanged.ItstoresSSMs
inarelationaldatabaseandprovidesalgorithmstospecifythesimilaritybe-
tweenSSMs.ThissimilarityexploitstheadjacencystructureofSSMgraphs.
GiventwographsG=(V;E)andG 0
=(V 0
;E 0
),welookatpartialmappings
of the nodes from G onto the nodes of G 0
which only map nodes onto
nodes ofthe sametype.The numberq expressesthenumberof edges in G
that areimplicitlymappedonto edgesin G 0
, i.e.edges(v;w)2E suchthat
((v);(w))2E 0
.Thematchingqualityrealizedbythemapping isdened
asd()=2q=(jEj+jE 0
j).Thesimilaritybetweentwographsisthendened
by the maximum overall matching qualities of mappings from G onto G 0
.
Mappingsarealwaysonetoone.Thematchingqualityisarationalnumber
between0and1withavalueof1indicatingthatthereexistsamappingbe-
tweenthelibrarygraphandthequerygraphthatmatchesexactlyalledges.
Aswecan assumew.l.o.g.that therearenoisolatednodesinaSSM,this is
thecaseifandonlyifbothgraphsareisomorphic.
is NP-hard. But the three-layer structure of SSMs supports the design of
heuristicandexactretrievalalgorithmsaswellasafastlter.Theexactal-
gorithmenumeratesallfeasiblemappingsofnodesfromthesecondlayerand
solvesforeachofthemthematchingproblemsonlayers1and3tooptimality
byusingalgorithmsforweightedbipartitematching. Theexactalgorithmis
polynomialforaxednumberofnodesonlayer2,andareasonablealgorithm
forsmallnumbersofnodesonlayer2.Filterandheuristicalgorithmsspeed
up the retrievaltime tremendouslyby reducing the set of graphsto which
theexactalgorithmhasto beapplied. Furtherdetails ofthealgorithms are
explainedin[11].
Fig.2. AppletinterfaceofFirestorm
As an example, gure 2shows the search for a site that deals with in-
formation in the context of online banking. Customers have accounts and
performtransactions.Thereis actuallynoonlinebankingsystemregistered
in Firestorm and we get a top match for a system that logs car accidents
becausethegraphstructureis identical.Obviouslythisisnotananswerwe
wantto achieve.So thereis aneedto combine thestructuralretrievalwith
weaddednewfunctionalitytoFirestormtone-tunetheretrievalalgorithms
withrespecttotheretrievalsituationofauser.Thesefunctionalitiesareex-
plainedin section5.Butrstofall,reportingexperienceswithaclassroom
exercisewillindicatethat structuralsimilarityofER modelsisagoodmea-
surefortheclosenessofcorrespondingrealworldsituations(ifthecontextis
predened).
3 Evaluation of Structural Retrieval
ThissectionwillshowhowwellthestructuresofSSMscapturethesemantics
of corresponding ER. Modelling is an art and one may argue that e.g. the
samereal worldsituation canbemodelled in verydierent waysleadingto
dierent SSMs. The structural retrieval approach reduces the detection of
similaritybetweendierentERmodelstothegraphsimilaritybetweentheir
SSMs but the retrievalqualitydepends on semantic similarity betweenER
models. The restrictiveness in structure and available means to create ER
models somehow limits the modelling freedom but a mathematical quality
proofisoutofsight.
Asexplained in theprevious sectionFirestorm wasextended to thedo-
main of ER models. This means that the system canreport the similarity
between two dierent ER models. The focus of our working group at the
Universityof Tubingen (Germany)isondatabaseandinformationsystems.
PartofadatabasecourseistolearnhowtocreateERmodelsthat describe
an aspect of the real world to be managed with adatabase. In this course
FirestormisusedbystudentstocreateandmanagetheirER models.
The course contains three exercises that require students to design ER
models for three dierent real world cases. These cases are formulated as
naturallanguagetext. Thetextimpliesthenaming ofentitiesand relation-
shipssothat wecanconcentrateonlyonthestructure.Somestamembers
evaluatedthesolutionsofthestudentsandassigned creditpointsaccording
tohowwelltheERmodelsrepresenttherealworldcases.Foreachexercisea
stamembercreatedamastermodelrepresentingtheexpectedsolution.This
madeitpossibletoevaluatewhetherhighsimilarityontheSSMs(respectively
the correspondingER models)reects acertain similarity accordingto the
semantics of the modelled real world cases, expressed by a high grade. A
positiveanswerwouldpromisetoapplythestructuralretrievalapproachfor
thedomainofERmodels.
These master models were used as queriesto the system with the cor-
responding student models building the library. The system computed the
(graph) similarity betweeneach studentmodel and themaster model. The
similaritieswereconvertedinto creditpointsbysimplymultiplyingthesim-
ilaritieswithmaximalcredit points.
thestamemberstothecreditpointsassignedbyFirestorm.Thedierences
in credit point assignment are marginal so it seems that graph similarity
between SSMs can be used to determine semantic similarity between ER
models (aslongasthecontextisclear).
1 2 3 4 5 6 7 8 9 10 11 0
1 2 3 4
credit points
student
Exercise 1
staff
system 1 2 3 4 5 6 7 8 9 10 11 0
1 2 3 4 5 6 7
credit points
student
Exercise 2
staff system
1 2 3 4 5 6 7 8 9 0
1 2 3 4 5 6
credit points
student
Exercise 3
staff system
Fig.3. Creditpointassignment(stavs.Firestorm)
4 Combining Structural and Keyword-based Retrieval
The results of the previous section are verypromising but as the example
usage of section 2shows structural retrieval on its own cannot distinguish
between dierent contexts. The application of Firestorm on the Web with
thousands of sites oering heterogenousinformation needs a mechanismto
inuencethestructuralretrievalbysomethingdescribingthecontext.
Ameaningful context descriptionisgivenby thenamesof the elements
(e.g. Entities, Relationships) that compose an ER model. Tocalculate the
structuralsimilaritythealgorithmsmapnodesonnodesandcountthenum-
berof implicitlyoverlappingedges.Forallalgorithmswecandeneweights
peredgewithout changingthealgorithmstructure. Structuralretrievalcan
thenbecombinedwithsomekindofkeyword-basedretrievalinthefollowing
way:
1. Thesystemlls amatrixthat keepsthecloseness betweennode names
for each combination of nodes. If the system only allows the usage of
namesof a controlled vocabulary ornames with agivensemantic(e.g.
byRDF[15])thenormalizedstringdistances(e.g.editdistance)express
theclosenessbyavaluebetween0and1withavalue1indicatingthat
twonameshaveanidenticalmeaning.
2. Theretrievalalgorithmsusethevaluesofthematrixtoidentifytheweight
ofrealized edge mappingwhich is themean valueof the corresponding
nodemappingmatrixvalues.
The similarity between structural identical graphs that have no node
namesincommonwillgetmuchlower.
Like in any other retrieval system, the retrieval quality can be measured
in terms of Recall and Precision. Recall is the quotient of the number of
relevantER modelscontainedin theresultsetbythenumberofallrelevant
ER models of thesystem withrespect to aquery. Precisionisthe quotient
of the number of relevant ER models contained in the result set and the
size oftheresultset with respectto agivenquery.Recalland precisionare
noobjectivemeasuresbecauseeach userhasitsowndenition ofrelevance.
In thefollowingweuse recall andprecisionin aweakersense dening that
allER modelsare relevant withrespect to aqueryifthesimilarity ismore
thanacertain threshold.Thethresholdcanbedenedbyauserjust before
starting a retrieval action. Additionally we implemented afunctionality to
controlthebehaviorofthestructuralretrievalalgorithms.
Generic Connection
(*,1)
(*,N) (0,N)
(1,N)
(N,M) (1,1) (0,1) Level 4 Level 3
Level 2 Level 1
Fig.4.Nodetypetreeofsecondlayer
ThecontrolfunctionalityreliesonatypetreeofSSMnodes.Asweknow
from section 2the elementsof ER models are mapped onto SSM nodes of
dierenttypes.Wedeneatypetreethatstartswithagenericroottypeand
andspecializesthetypesalongthetreepaths.Figure4showsthesubtreeof
thesecondlayerSSMnodetypesforthedomainofERmodels.Followingthe
roottypethesecond leveldenesthetypesofallnodesonthesecondlayer.
These nodes represent connections between entities and relationships. The
third level of the tree distinguishes between (*,N) connections and (*,1)
connectionsthat representtwodierent kindsofconnection cardinalitiesin
(min;max)notation.Thefourth levelofthetreecapturesthemostspecic
cardinalitydistinction((0,1),(1,1),(0,N),(1,N)and(N,M)).
Assume that the SSM of the query model and the SSMs of all library
modelsonlycontainnodesofthemostspecictypetreelevels.Beforeauser
triggers a retrieval action the user species for each layer which tree level
should be used by the algorithms for the similarity computation, in other
wordswhichsubtypescanbematchedtoeachother.Thismeanse.g.thatif
ausers chooses the third tree level for thesecond layer thealgorithms can
userchoosesthefourthtreelevelforthesecondlayer.
The eect of this mechanism is that for a certain threshold choosing a
higher type level reduces the size of the result set. Some of the relevant
models maynotappearin the resultset but theprecisionof theremaining
modelsishigher.Ontheotherhandsettingalowtypelevelleadsto alarge
resultsetwithpossiblyalotofirrelevantmodels.Butirrelevantmodelscan
contain alot ofhintsthat help to specifythe typelevels in awaythat the
resultsmeetauser'sexpectations. Withthis functionalityausercanadjust
theretrievalsystemtowardstheindividualsenseofrelevance andquality.
6 Related Work
Searching for information that is hidden in the Deep Web is not a new
task. Sites providing information generated by scripts out of data stored
in databasesexist nearlyaslongasthe Web exists. There area number of
searchenginesthatindexthosesitesandquerythemaccordingtosomeuser
input. We look at someof those enginesand relate them to the structural
retrievalapproach.
OneofthemostpopularcommercialDeepWebsearchenginesisLexiBot
[3]fromBrightPlanet.Itsearchesamong2200databasesandprovidesaquery
interfaceforsimpletext orbooleanqueries.Theuserscanadjustthesearch
strategiesandresultpresentationtoitsownpreferences.
Alsofrom BrightPlanet is thefree search engine CompletePlanet [2]. It
contains the addresses of about 103000 databases organizedin a directory
structurewithcategoriesandsubcategories.Thesearchinterfaceallowssim-
pletextinput.
TheLII[9]isafreesearchenginewithanannotated,searchablesubject
directoryofabout9000Webresources.Theresourcesareselectedandevalu-
atedbylibrariansfortheirusefulnesstousersofpubliclibraries.Simpletext
andbooleanoperatorsarethemeansto statequeries.
The search engines of IntelliSeek (ProFusion [8] and InvisibleWeb [7])
resemblemuchtheothermentioneddirectory-basedsearchengines.
Neither of the presented search engines use the database structure of
DeepWebinformationsystems.Theydescribethecontentof thosesystems
withkeywordsandorganizethem(byhandorautomated)withinadirectory.
Queryinterfacesrequiresimpletextandbooleanqueries.Theyareeasierto
usethanourqueryinterfacethat isbasedonERmodels.Butespeciallythe
structural dependencies between dierent information units captures much
moresemanticsandthereforeallowsamoreprecisesearch.
This paperdescribesastructuralretrievalapproachto search forWebsites
that provide information contained in the Deep Web. Therefore weextend
thegeneralmodelretrievalframeworkFirestormtothedomainofERmodels.
Inuenced by thespecial requirements of theWeb wedescribe atechnique
to combine structural and keyword-basedretrieval. A functionality for the
dynamiccontrolofretrievalbehaviorbasedontypetreesprovidestheusers
with meansto adjust the systemwith respectto the individual preferences
accordingto recallandprecision.
Usershaveto statetheirqueriesto thesystembyspecifyingERmodels
todescribethedesiredinformationin structuralcorrelation.Manyusersare
unfamiliar with ER modelling but it is a standardized and powerful way
to express dependenciesbetweeninformation objects.The simplegraphical
user interfaceshould allowusers with basicknowledge to specify their own
ERmodels.
AnotherdrawbackoftheapproachistheabsenceofERmodelsthatWeb
sitesneedtoregisterwiththesystem.Mostofthetargetsitesstoretheirdata
withinrelationaldatabases.IftheydonothaveERmodelsthatdescribetheir
databasesatoolcouldprobablycreateamodel outofthedatabaseschema.
Apart of the sandbox evaluation presented in section 3 the structural
retrieval approach has yet to prove its applicability under the real world
conditions of the Web. Thereforewehopethat many Web sites registerto
oursystemto provideWebusersapromisingwayto searchfor information
oftheDeepWeb.
LookingintothefuturetheongoinghypeoftheSemanticWeb[5]brings
up many applications that requiretools to handle semanticdescriptions of
information.Thesemanticdescriptionismostlydonefollowingtheprinciples
of RDF [15].RDF denitions resemble graphs that show the dependencies
among the individual elements. The management of RDF denitions can
benetalotbyanRDFextensionofFirestorm.
References
1. P.A.Bernstein, A.Y. Halevy,and R.Pottinger. Avisionof management of
complexmodels. SIGMODRecord,29(4):55{63,2000.
2. BrightPlanet. CompletePlanet,2000. http://www.completeplanet.com/.
3. BrightPlanet. LexiBot,2001. http://www.lexibot.com/.
4. P.P.S.Chen. Theentity-relationshipmodel|towardauniedviewofdata.
Proceedingsofthe1thConference onVeryLargeDatabases,MorganKaufman
pubs.(LosAltosCA),Kerr(ed),pp.173, 1975.
5. S. Decker. The Semantic Web Community Portal, 2001. http://-
www.semanticweb.org/.
6. A.M.Georion.Anintroductiontostructuredmodeling.ManagementScience,
33(5):547{588,1987.
www.invisibleweb.com/.
8. IntelliSeek. ProFusion,2001. http://www.profusion.com/.
9. Library of California. Librarians' Index to the Internet, 2001. http://-
www.lii.org/.
10. S.Muller. FIRESTORM{FIrsta REtrievalSysTemforOperationsResearch
Models,2002. http://restorm.informatik.uni-tuebingen.de/ermodeling.
11. S. Muller and R. Muller. Retrieval of service descriptions using structured
service models. In Proceedings of the 10thAnnual Workshop onInformation
TechnologiesandSystem,pages55{60,2000.
12. S.MullerandR.Schimkat. Ageneral,web-enabledmodelretrievalapproach.
InProceedingsoftheInternationalConferenceforObject-OrientedInformation
Systems(OOIS2001),2001.
13. L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Cita-
tion Ranking: Bringing Order tothe Web, 1998. http://citeseer.nj.nec.com/-
page98pagerank.html.
14. UDDIcommunity. Universal Description,Discoveryand Integration of Busi-
ness fortheWeb(UDDI),2002. http://www.uddi.org/.
15. WorldWideWebConsortium(W3C). Resource DescriptionFramework,Feb.
1999. http://www.w3.org/TR/REC-rdf-syntax/.