HAL Id: hal-00733435
https://hal.archives-ouvertes.fr/hal-00733435
Submitted on 28 Dec 2012
Cost Framework for a Distributed Semi-Structured Environment
Tianxiao Liu, Tuyet-Tram Dang-Ngoc, Dominique Laurent
To cite this version:
Tianxiao Liu, Tuyet-Tram Dang-Ngoc, Dominique Laurent. Cost Framework for a Distributed Semi-Structured Environment. International workshop Database Management and Application over Networks - DBMAN (APWeb/WAIM Workshop), Jun 2007, France. pp.1-11. ⟨hal-00733435⟩
Tianxiao Liu (1), Tuyêt Trâm Dang Ngoc (2), and Dominique Laurent (3)
(1) ETIS Laboratory - University of Cergy-Pontoise & XCalia S.A., France. Tianxiao.Liu@u-cergy.fr; Tianxiao.Liu@xcalia.com
(2) ETIS Laboratory - University of Cergy-Pontoise, France. Tuyet-Tram.Dang-Ngoc@u-cergy.fr
(3) ETIS Laboratory - University of Cergy-Pontoise, France. Dominique.Laurent@u-cergy.fr
Abstract. This paper proposes a generic cost framework for query optimization in an XML-based mediation system called XLive, which integrates distributed, heterogeneous and autonomous data sources. Our approach relies on cost annotation on an XQuery logical representation called Tree Graph View (TGV). A generic cost communication language is used to give an XML-based uniform format for cost communication within the XLive system. This cost framework is suitable for various search strategies to choose the best execution plan for the sake of minimizing the execution cost.
Keywords: mediation system, query optimization, cost model, Tree Graph View, cost annotation
1 Introduction
The mediation system architecture was proposed in [Wie92] to solve the problem of integrating heterogeneous data sources. In such an architecture, users send queries to the mediator, and the mediator processes these queries with the help of wrappers associated with the data sources. Currently, the semi-structured data model represented by the XML format is considered a standard data exchange model. XLive [NJT05], a mediation system based on the XML standard, has a mediator which can accept queries in the form of XQuery [W3C05] and return answers. The wrappers give the mediator XML-based uniform access to heterogeneous data sources.
For a given user query, the mediator can generate various execution plans (referred to as "plan" in the remainder of this paper) to execute it, and these plans can differ widely in execution cost (execution time, price of costly connections, communication cost, etc.). An optimization procedure is thus necessary to determine the most efficient plan with the least execution cost. However, how to choose the best plan based on the cost is still an open issue. In relational or [...] for each operator appearing in the plan. But in a heterogeneous and distributed environment, cost estimation is much more difficult, due to the lack of underlying database statistics and cost formulas.
Various solutions for processing the overall cost estimation at the mediator level have been proposed. In [DKS92], a calibration procedure is described to estimate the coefficients of a generic cost model, which can be specialized for a class of systems. This solution is extended to object database systems in [GGT96][GST96]. The approach proposed in [ACP96] records cost information (results) for every query executed and reuses that information for subsequent queries. [NGT98] uses a cost-based optimization approach which combines a generic cost model with specific cost information exported by wrappers. However, none of these solutions has addressed the problem of overall cost estimation in a semi-structured environment integrating heterogeneous data sources.
In this paper, we propose a generic cost framework for an XML-based mediation system, which integrates distributed, heterogeneous and autonomous data sources. This framework makes it possible to take into account various cost models for different types of data sources with diverse autonomy degrees. These cost models are stored as annotations in an XQuery logical representation called Tree Graph View (TGV) [DNGT04][TDNL06]. Moreover, cost models are exchanged between different components of the XLive system. We apply our cost framework to compare the execution cost of candidate plans in order to choose the best one.
First, we summarize different cost models for different types of data sources (relational, object-oriented and semi-structured) and different autonomy degrees of these sources (proprietary, non-proprietary and autonomous). The overall cost estimation relies on the cost annotation stored in the corresponding components of the TGV. This cost annotation derives from a generic annotation model which can annotate any component (i.e. one or a group of operators) of a TGV.
Second, in order to perform the cost communication within the XLive system during query optimization, we define an XML-based language to express the cost information in a uniform, complete and generic manner. This language, which is generic enough to take into account any type of cost information, is the standard format for the exchange of cost information in XLive.
The paper is organized as follows: In Section 2, we introduce the XLive system with its TGV modeling of XQuery and we motivate our approach to cost-based optimization. In Section 3, we describe the summarized cost models and show how to represent and exchange these cost models using our XML-based generic language. Section 4 provides the description of TGV cost annotation and the procedure for the overall cost estimation at the mediator level. We conclude and give directions for future work in Section 5.
2 Background
XQuery processing in XLive. A user's XQuery submitted to the XLive mediator [...] information on evaluation, such as the data source locations, cost models, functional capabilities of sources, etc. The optimal annotated TGV is then selected based on a cost-based optimization strategy. In this optimization procedure, the TGV is processed as the logical execution plan and the cost estimation of the TGV is performed with cooperation between different components of XLive. This optimal TGV is then transformed into an execution plan using a physical algebra. To this end, we have chosen the XAlgebra [DNG03], which is an extension of the relational algebra to XML. Finally, the physical execution plan is evaluated and an XML result is produced. Fig. 1 depicts the different steps of this processing.
[Figure 1: processing steps in the mediator. XQuery, Canonization, Canonized XQuery, Modeling, Tree Graph Views (TGV), Annotation, Annotated TGV, Cost-based Optimization (search strategy, equivalent rules), Transformation, XAlgebra, Evaluation, Query Result (XML). The mediator information repository holds mediator cost information; the wrapper information repository holds static and dynamic wrapper cost information. Wrappers give access to relational databases, XML data sources and Web services.]
Fig. 1. Cost-based optimization in processing of XQuery in the XLive system
Tree Graph View. TGV is a logical structure model implemented in the XLive mediator for XQuery processing, which can be manipulated, optimized and evaluated [TDNL06]. TGV takes into account the whole functionality of XQuery (collection, XPath, predicate, aggregate, conditional part, etc.) and uses an intuitive representation that provides a global view of the request in a mediation context. Each element in the TGV model has been defined formally using Abstract Data Types in [Tra06] and has a graphical representation. In Fig. 2(a), we give an example of an XQuery which declares two FOR clauses ($a and $b), a join constraint between authors and a contains function; a return clause then projects the title value of the first variable. This query is represented by a TGV in Fig. 2(b). We can distinguish the two domain variables $a and $b of the XQuery, each defining the nodes corresponding to the given XPaths. A join hyperlink links the [...] Return Tree Pattern in projection purposes.
for $a in col("catalogs")/catalog/book
for $b in col("reviews")/reviews/review
where $a/author = $b/author and contains($b/author, "Hobb")
return <books>
  {$a//title}
</books>
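[Figure 2: (a) An XQuery query; (b) TGV representation, showing the domain variables $a (over "catalogs": catalog, book, title, author) and $b (over "reviews": reviews, review, author), the join (=) on author, the contains("Hobb") constraint and the returned books element.]
Fig. 2. An example of XQuery and its TGV representation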
TGV generic annotation. The motivation for annotating a TGV is to allow subsets of elements of a TGV model to be annotated with various information. Precisely, for each arbitrary component (i.e. one or a group of operators of the TGV), we add some additional information such as cost information, system performance information, source localization, etc. Our annotation model is generic and allows annotation of any type of information. The set of annotations based on the same annotation type is called an annotated view. There can be several annotated views for the same TGV, for example, a time-cost annotated view, an algorithm annotated view, a sources-localization annotated view, etc.
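The notion of annotated views can be pictured with a minimal sketch. It assumes that a TGV component is identified by a set of operator numbers; the class `AnnotatedTGV` and its methods are hypothetical names chosen for illustration and are not part of XLive.

```python
from collections import defaultdict


class AnnotatedTGV:
    """Hypothetical container: annotations grouped by type into annotated views."""

    def __init__(self):
        # annotation type -> {component (frozenset of operator ids) -> value}
        self._views = defaultdict(dict)

    def annotate(self, operators, annotation_type, value):
        """Attach a value of the given annotation type to one or more operators."""
        self._views[annotation_type][frozenset(operators)] = value

    def view(self, annotation_type):
        """Return the annotated view: every annotation of one type."""
        return dict(self._views.get(annotation_type, {}))


tgv = AnnotatedTGV()
tgv.annotate({7}, "localization", "mediator")
tgv.annotate({2, 3, 4}, "localization", "wrapper")
tgv.annotate({2, 3, 4}, "time-cost", 12.5)  # illustrative value only
```

Keying each view by annotation type keeps the views independent: the same component can carry a localization and a time cost without either annotation knowing about the other.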
3 Heterogeneous Cost Models and Cost Communication within XLive
3.1 Cost Models for Heterogeneous Autonomous Data Sources
Cost models summary. We summarize different existing cost models for various types of data sources in Fig. 3. This summary is based not only on the types of data sources but also on the autonomy degrees of these sources. In addition, this summary gives some relations between different works on cost-based query optimization. The cost models named "operation" contain accurate cost formulas for calculating the execution cost of operators appearing in the plan. Generally, cost information such as source statistics is necessary for these cost models, because these statistics are used to derive the values of coefficients in cost formulas. It is often data source implementers who are able to give accurate cost [...]
[Figure 3: classification of cost models by data source type (relational, object-oriented, semi-structured data sources) and by autonomy (proprietary data sources, heterogeneous autonomous data sources). Groups shown: extended cost models based on operation implementation; generic cost models; specific methods for obtaining cost (applied when the calibration procedure is unavailable). Labels appearing in the figure: Path [GGT96]; Flora [Flo96]; [Gru96]; Hybrid cost model [NGT98]; Operation [CD92] [BMG93] [DOA+94]; Historical cost [ACP96]; Wrappers [HKWY97] [ROH99]; Calibration [GST96]; Adaptive [Zhu95]; Operation [GP89] [ML86] [SA82]; Calibration [DKS92]; Sampling [ZL98]; Operation [AAN01] [MW99]; XQuery Self-Learning [ZHJGML05].]
Fig. 3. Cost models for heterogeneous sources
When the data sources are autonomous, cost formulas and source statistics are unavailable. To obtain cost models we need special methods that vary with the autonomy degree of the data sources. For example, the calibration method [DKS92] estimates the coefficients of a generic cost model for each type of relational data source. This calibration needs to know the access methods used by the source. This method is extended to object-oriented databases by [GST96]. If this calibration procedure cannot be processed due to data source constraints, a sampling method proposed in [ZL98] can derive a cost model for each type of query. The query classification in [ZL98] is based on a set of common rules adopted by many DBMSs. When no implementation algorithm and no cost information are available, we can use the method described in [ACP96], in which the cost estimation of new queries is based on the history of queries evaluated so far.
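The fallback order described above (operation formulas, then calibration, then sampling, then history) can be written as a small decision function. This is only a sketch: the three boolean flags are assumed inputs, and how XLive actually detects each situation is not specified in the text.

```python
def choose_cost_method(operation_model_available,
                       calibration_possible,
                       sampling_possible):
    """Pick the most accurate available way of obtaining a cost model."""
    if operation_model_available:
        return "operation"      # accurate formulas from the implementer
    if calibration_possible:
        return "calibration"    # [DKS92], [GST96]: needs known access methods
    if sampling_possible:
        return "sampling"       # [ZL98]: one cost model per query class
    return "history"            # [ACP96]: reuse costs of past queries
```

For example, a fully autonomous source that exposes neither formulas nor access methods, and cannot be sampled, falls through to the history-based method.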
Generic cost model. Here, we show how to reuse the summary in Fig. 3 to define our generic cost model used for XQuery optimization in the XLive system. First, a cost model is generally designed for some type of data source (but there are also some methods that can be used for different types of sources, for example, the method by history [ACP96]). Second, this cost model can contain some accurate cost formulas with coefficient values derived from data source statistics, or a specific method for deriving the cost formulas. This cost model may also have only a constant value giving directly the execution cost of operators. The possible attributes of our generic cost model are described in Table 1. This descriptive definition of the cost model is used for TGV cost annotation for the purpose of overall cost estimation at the mediator level (ref. Section 4).
For a cost model, all attributes are optional by reason of generality. We apply [...] costs, but it has a lower accuracy level than cost models based on operation implementation. That means that if the cost models based on operation implementation are available, we use neither the method by calibration nor the method by history.
Attribute         Description
Data source type  This type can be relational, object-oriented, semi-structured,
                  files, Web services, etc.
Method            The specific method stored in this field can be used to derive
                  the practicable cost formulas. These cost formulas may be
                  inaccurate, but can at least roughly estimate the execution
                  cost. This respects our "as accurate as possible" principle.
                  Generally, some APIs corresponding to the specific method are
                  available in this field; these APIs are implemented by the
                  XLive system and can give some useful services such as
                  "provide the value of coefficients in the formulas".
Formulas          This is the core of a cost model, but formulas are often
                  unavailable in a heterogeneous environment. These formulas are
                  given in the form of equations. The values of coefficients
                  appearing in the formulas can also be represented in the form
                  of equations, for example, Cardinality = 10000. All these
                  formulas form a system of equations. For some cost models,
                  only a constant cost value is available. This value can be
                  provided by the data source (stored in the wrapper information
                  repository), or derived from the results of executed queries
                  (historical cost).
Table 1. Definition of generic cost model
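Table 1 can be read as a record whose fields are all optional. The sketch below is one possible encoding under that reading; the class name `GenericCostModel`, the field names and the separate `constant_cost` field are assumptions made for illustration, not the structure used inside XLive.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class GenericCostModel:
    """Hypothetical record mirroring Table 1; every attribute is optional."""
    data_source_type: Optional[str] = None   # e.g. "relational", "semi-structured"
    method: Optional[str] = None              # e.g. "calibration", "history"
    formulas: Dict[str, str] = field(default_factory=dict)  # name -> expression
    constant_cost: Optional[float] = None     # used when no formula is available

    def has_formulas(self) -> bool:
        """True when at least one equation is attached to the model."""
        return bool(self.formulas)


relational = GenericCostModel(
    data_source_type="relational",
    formulas={
        "Cardinality": "10000",  # coefficient given as an equation, as in Table 1
        "CostRe": "CostRestriction + CostProjection",
    },
)
history_only = GenericCostModel(method="history", constant_cost=42.0)  # made-up value
```

Because every field has a default, a model can be as sparse as a single constant cost or as rich as a full system of equations, which matches the generality the table asks for.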
3.2 Generic Language for Cost Communication (GLCC)
XML-based generic language. To perform cost communication within our XLive system, we define a language to express the cost information in a uniform, complete and generic manner. This language fits our XML environment, which avoids costly format conversion. It considers every cost model type and allows wrappers to export their specific cost information. In our XLive context, this language is generic enough to express the cost information of different parts of a TGV and is capable of expressing cost for various optimization goals, for example, response time, price, energy consumption, etc.
Our language extends the MathML language [W3C03], which allows us to define all mathematical functions in XML form. MathML suits cost communication within XLive due to its semi-structured nature. We use the Content Markup in MathML to provide explicit encoding for cost formulas. We just add some rules to MathML to define the grammar of our language. Furthermore, this grammar is extensible so that users can always define their own tags for any type [...]
[Figure 4: cost information flow. The wrapper extracts cost information from the data source and provides it to the parser; the parser derives cost models that are stored in the wrapper information repository and used for TGV cost computation in the mediator. Operator evaluation records historical cost, and the historical records are used for the adjustment of cost models. Information is transferred using GLCC. Example of a cost model in GLCC:]
<cost source="relational">
<apply><eq/>
<apply><ci>CostRe</ci></apply>
<apply><plus/>
<ci>CostRestriction</ci>
<ci>CostProjection</ci>
</apply>
<apply>
...
</cost>
Cost_Re = Cost_Restriction + Cost_Projection
(An example of a cost model and its representation in GLCC)
Fig. 4. Dynamic cost evaluation with GLCC in the XLive system
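To make the role of MathML Content Markup concrete, the following sketch parses and evaluates an equation shaped like the one in Fig. 4. It is an illustration under assumptions: the left-hand side is written as a bare `<ci>` element (the figure wraps it in `<apply>`), only `ci`, `cn`, `plus` and `times` are handled, the function names are hypothetical, and the numeric inputs are made up.

```python
import xml.etree.ElementTree as ET

GLCC_EXAMPLE = """
<cost source="relational">
  <apply><eq/>
    <ci>CostRe</ci>
    <apply><plus/>
      <ci>CostRestriction</ci>
      <ci>CostProjection</ci>
    </apply>
  </apply>
</cost>
"""


def evaluate(node, env):
    """Evaluate a small subset of MathML Content Markup (ci, cn, plus, times)."""
    if node.tag == "ci":                       # identifier: look up its value
        return env[node.text.strip()]
    if node.tag == "cn":                       # numeric literal
        return float(node.text)
    if node.tag == "apply":                    # operator followed by operands
        operator, *operands = list(node)
        values = [evaluate(child, env) for child in operands]
        if operator.tag == "plus":
            return sum(values)
        if operator.tag == "times":
            result = 1.0
            for value in values:
                result *= value
            return result
    raise ValueError(f"unsupported element: {node.tag}")


def solve_equation(cost_xml, env):
    """Return (variable name, value) for the single equation in a <cost> element."""
    equation = ET.fromstring(cost_xml).find("apply")
    eq, lhs, rhs = list(equation)
    if eq.tag != "eq" or lhs.tag != "ci":
        raise ValueError("expected <apply><eq/><ci>name</ci>...</apply>")
    return lhs.text.strip(), evaluate(rhs, env)


name, value = solve_equation(
    GLCC_EXAMPLE, {"CostRestriction": 2.0, "CostProjection": 3.0}
)
```

With the made-up inputs above, `name` is `"CostRe"` and `value` is `5.0`. The variable names carry no built-in meaning here, which is consistent with the language leaving the choice of names to its user.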
Cost formulas are represented in the form of equation sets. Each equation corresponds to a cost function that may be defined by the source or by the mediator. Each component of the TGV is annotated with an equation set in which the number of equations is not fixed. One function in a set may use variables defined in other sets. We define some rules to ensure the consistency of the system of equations. First, every variable should have a definition somewhere. Second, by reason of generality, there are no predefined variable names. For example, in the grammar, we do not define a name "time" for a cost variable because the cost metric can be a price unit. It is the user of the language who gives specific, meaningful names to variables. This gives a much more generic cost definition model compared to the language defined in [NGT98].
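The first consistency rule (every variable must be defined somewhere, possibly in another equation set) lends itself to a simple check. The sketch below assumes that each equation set is a mapping from a variable name to the text of its right-hand side; the function name and the example formulas are hypothetical.

```python
import re

IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def undefined_variables(equation_sets):
    """Return the variables used in some formula but defined in no equation set.

    equation_sets maps a TGV component name to {variable: right-hand side}.
    """
    defined = set()
    used = set()
    for equations in equation_sets.values():
        for variable, expression in equations.items():
            defined.add(variable)
            used.update(IDENTIFIER.findall(expression))
    return used - defined


equation_sets = {
    "restriction": {"CostRestriction": "card * sel", "card": "10000", "sel": "0.1"},
    "projection": {"CostProjection": "card * unit"},  # 'unit' is never defined
    "result": {"CostRe": "CostRestriction + CostProjection"},
}
```

Here `undefined_variables(equation_sets)` reports `unit`, while `card` is accepted even though the projection set borrows it from the restriction set, since definitions may live in other sets.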
Dynamic cost evaluation. Fig. 4 gives a simple example of the expression of a cost model and shows the role of our language in cost communication. After extracting cost information from the data source, the wrapper exports that information using our language to the parser, which derives cost models that will be stored in the wrapper information repository. When the mediator needs to compute the execution cost of a plan (TGV), the wrapper information repository provides the necessary cost information for operators executed on wrappers. We have a cache for storing the historical execution cost of evaluated queries, which can be used to adjust the cost information exported from the wrapper. All these communications are processed in the form of our language. Our language completes [...]
4.1 TGV cost annotation
As mentioned in Section 2, the TGV is the logical execution plan of XQuery within the query processing in XLive. The purpose of our query optimization is to find the optimal TGV with the least execution cost. For estimating the overall cost of a TGV, we annotate different components (one or a group of operators) of the TGV. For an operator or a group of operators appearing in a TGV, the following cost information can be annotated:
- Localization: The operator(s) can be executed on the mediator or on the wrappers (data sources).
- Cost Model: Used to calculate the execution cost of the component.
- Other information: Contains supplementary information that is useful for cost estimation. For example, the implementation of several operators (such as the join operator) allows parallel execution between their related operators.
[Figure 5: the TGV of Fig. 2 with its components numbered (1) to (9) and annotated with cost information. Legend: card = cardinality, sel = selectivity, restr = restriction, proj = projection.]
Fig. 5. An example for TGV cost annotation
Fig. 5 gives an example of TGV cost annotation. In this example, different components of the TGV introduced in Fig. 2 (ref. Section 2) are annotated. We can see that for the operators executed on Source1 (S1), we have only the historical cost to estimate the total execution cost of all these operators; in contrast, for each operator executed on Source2 (S2), we have a cost model for estimating its execution cost. For the join operator (numbered (7)) executed on the mediator, the operators linked to it can be executed in parallel.
4.2 Overall cost estimation
Cost Annotation Tree (CAT). We have seen how to annotate a TGV with [...] annotated component of the TGV generally depends on the cost of other components. For example, for the cost formula annotated in (6), we see that it depends on the cost of (2), (3), (4) and (5). From the cost formulas annotated for each component of the TGV, we obtain a Cost Annotation Tree (CAT). In a CAT, each node represents a component of the TGV annotated with cost information, and the CAT describes the hierarchical relations between these different components. Fig. 6(a) illustrates the CAT of the TGV annotated in Fig. 5.
(a) Cost Annotation Tree (CAT)
[Tree diagram over the annotated components (1) to (9) of Fig. 5. Marked nodes are nodes that need to call APIs for obtaining the necessary coefficients' values.]
(b) Overall cost estimation algorithm
 1 associateCost(node) {
 2   node.analyzeCostModel();
 3   if (node.hasSpecialMethod()) {
 4     node.callAPI();
 5   }
 6   for (each child of node) {
 7     associateCost(child);
 8   }
 9   node.configCostFormula();
10   node.calculateCost();
11 }
Fig. 6. Cost Annotation Tree and the algorithm for overall cost estimation
Overall cost estimation algorithm. We now show how to use the CAT of a TGV to perform the overall cost estimation. We use the recursive breadth-first search algorithm of a tree for performing the cost estimation of each node. For each node of the CAT, we define a procedure called associateCost (Fig. 6(b)) for operating on the cost annotation of a node. This procedure first analyzes the cost annotation of the node and derives its cost model (line 2); if a specific cost method is found, it calls an API implemented by XLive for obtaining the necessary values of coefficients or cost formulas for computing the cost (lines 3-5); if the cost of this node depends on the cost of its child nodes, it recursively executes the associateCost procedure on its child nodes (lines 6-8). When these three steps are completed, a procedure configCostFormula completes the cost formulas with the obtained coefficient values (line 9) and the execution cost of this node is calculated (line 10). By using this algorithm, we can obtain the overall cost of a TGV, which is the cost of the root of the CAT.
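The procedure of Fig. 6(b) can be sketched in a few lines. The sketch rests on assumptions: a node's cost model is reduced to a callable taking the node's coefficients and the costs of its children, the XLive API is replaced by an optional callable returning coefficient values, and the tree and numbers are invented. As in the figure, the recursion finishes all children of a node before computing the node's own cost.

```python
class CATNode:
    """Hypothetical node of a Cost Annotation Tree (CAT)."""

    def __init__(self, name, formula, children=(), coefficient_api=None):
        self.name = name
        self.formula = formula              # callable(coefficients, child_costs)
        self.children = list(children)
        self.coefficient_api = coefficient_api  # callable() -> dict, or None
        self.coefficients = {}
        self.cost = None


def associate_cost(node):
    """Mirror of associateCost in Fig. 6(b); returns the cost of `node`."""
    # Lines 2-5: analyze the cost model and, if a specific method is
    # attached, call the API that supplies the missing coefficient values.
    if node.coefficient_api is not None:
        node.coefficients.update(node.coefficient_api())
    # Lines 6-8: recursively cost the children this node depends on.
    child_costs = [associate_cost(child) for child in node.children]
    # Lines 9-10: complete the formula with the obtained values and compute.
    node.cost = node.formula(node.coefficients, child_costs)
    return node.cost


# Illustrative tree; shapes and numbers are made up, not taken from Fig. 5.
leaf_a = CATNode("scan_a", lambda coef, kids: 4.0)          # constant cost
leaf_b = CATNode("scan_b", lambda coef, kids: coef["card"] * coef["unit"],
                 coefficient_api=lambda: {"card": 100, "unit": 0.5})
root = CATNode("join", lambda coef, kids: 1.0 + sum(kids), [leaf_a, leaf_b])
overall = associate_cost(root)
```

In this toy tree the root adds a fixed overhead to the sum of its children, so `overall` is the cost of the whole plan; a join whose inputs run in parallel could instead combine its children's costs differently, which is the kind of choice the "other information" annotation is meant to inform.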
4.3 Application: plan comparison and generation
It has been shown in [TDNL06] that for processing a given XQuery, a number [...] changing the result. The execution cost of a TGV can be computed by using our generic cost framework, and thus we can compare the costs of these plans to choose the best one to execute the query.
However, as the number of rules is huge, this implies an exponential blow-up of the candidate plans. It is impossible to calculate the cost of all these candidate plans, because the cost computation and the subsequent comparisons would be even more costly than the execution of the plan. Thus, we need a search strategy to reduce the size of the search space containing candidate execution plans. We note in this respect that our cost framework is generic enough to be applied to various search strategies such as exhaustive, iterative, simulated annealing, genetic, etc.
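The way a search strategy plugs into a cost function can be sketched as follows. Both strategies only need a callable that returns the cost of a plan, which is the role the cost framework plays. The toy plan representation, the cost function and the neighbor rule are assumptions made for illustration; they are not the TGV transformation rules of [TDNL06].

```python
def exhaustive_search(plans, cost):
    """Cost every candidate plan and return the cheapest one."""
    return min(plans, key=cost)


def iterative_improvement(start, neighbors, cost, max_steps=100):
    """Greedy local search: move to the cheapest neighbor while it improves."""
    current, current_cost = start, cost(start)
    for _ in range(max_steps):
        candidates = neighbors(current)
        if not candidates:
            break
        best = min(candidates, key=cost)
        best_cost = cost(best)
        if best_cost >= current_cost:
            break                      # local minimum reached
        current, current_cost = best, best_cost
    return current


# Toy plan space: a plan is a tuple of operator weights; cost and rule are made up.
def toy_cost(plan):
    """Charge each weight by its position, so heavy operators should come first."""
    return sum(position * weight for position, weight in enumerate(plan))


def swap_neighbors(plan):
    """Plans reachable by swapping two adjacent operators (a stand-in rule)."""
    return [plan[:i] + (plan[i + 1], plan[i]) + plan[i + 2:]
            for i in range(len(plan) - 1)]
```

The exhaustive variant costs every plan, which is exactly what becomes infeasible when the plan space blows up; the iterative variant costs only the neighbors of the current plan at each step, at the price of possibly stopping in a local minimum.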
5 Conclusion
In this paper, we described our cost framework for the overall cost estimation of candidate execution plans in an XML-based mediation system. The closest related work is the DISCO system [NGT98], which defines a generic cost model for an object-based mediation system. Compared to DISCO and other mediation systems, we make the following contributions: First, to our knowledge, our cost framework is the first approach proposed for addressing the costing problem in XML-based mediation systems. Second, our cost communication language is completely generic and can express any type of cost, which is an improvement over the language proposed in DISCO. Third, our cost framework is generic enough to fit overall cost computation within various mediation systems.
As future work, we plan to define a generic cost model for XML sources with cost formulas that can compute the cost from given parameters that are components in the TGV. This cost model would be generic for all types of XML sources. We will also concentrate on the design of an efficient search strategy that will be used in our cost-based optimization procedure.
Acknowledgment
This work is supported by Xcalia S.A. (France) and by the ANR PADAWAN project.
References
[AAN01] A. Aboulnaga, A. Alameldeen, and J. Naughton. Estimating the Selectivity of XML Path Expressions for Internet Scale Applications. In VLDB, 2001.
[ACP96] S. Adali, K. Candan, and Y. Papakonstantinou. Query Caching and Optimization in Distributed Mediator Systems. In ACM SIGMOD, 1996.
[BMG93] J. A. Blakeley, W. J. McKenna, and G. Graefe. Experiences Building the