Cost Framework for a Distributed Semi-Structured Environment


HAL Id: hal-00733435

https://hal.archives-ouvertes.fr/hal-00733435

Submitted on 28 Dec 2012

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


Cost Framework for a Distributed Semi-Structured Environment

Tianxiao Liu, Tuyet-Tram Dang-Ngoc, Dominique Laurent

To cite this version:

Tianxiao Liu, Tuyet-Tram Dang-Ngoc, Dominique Laurent. Cost Framework for a Distributed Semi-Structured Environment. International workshop Database Management and Application over Networks - DBMAN (APWeb/WAIM Workshop), Jun 2007, France. pp.1-11. hal-00733435

Tianxiao Liu (1), Tuyêt Trâm Dang Ngoc (2), and Dominique Laurent (3)

(1) ETIS Laboratory - University of Cergy-Pontoise & XCalia S.A, France.
Tianxiao.Liu@u-cergy.fr; Tianxiao.Liu@xcalia.com

(2) ETIS Laboratory - University of Cergy-Pontoise, France.
Tuyet-Tram.Dang-Ngoc@u-cergy.fr

(3) ETIS Laboratory - University of Cergy-Pontoise, France.
Dominique.Laurent@u-cergy.fr

Abstract. This paper proposes a generic cost framework for query optimization in an XML-based mediation system called XLive, which integrates distributed, heterogeneous and autonomous data sources. Our approach relies on cost annotation on an XQuery logical representation called Tree Graph View (TGV). A generic cost communication language is used to give an XML-based uniform format for cost communication within the XLive system. This cost framework is suitable for various search strategies to choose the best execution plan for the sake of minimizing the execution cost.

Keywords: mediation system, query optimization, cost model, Tree Graph View, cost annotation

1 Introduction

The mediation system architecture was proposed in [Wie92] to solve the problem of integrating heterogeneous data sources. In such an architecture, users send queries to the mediator, and the mediator processes these queries with the help of wrappers associated with the data sources. Currently, the semi-structured data model represented by the XML format is considered a standard data exchange model. XLive [NJT05], a mediation system based on the XML standard, has a mediator which can accept queries in the form of XQuery [W3C05] and return answers. The wrappers give the mediator an XML-based uniform access to heterogeneous data sources.

For a given user query, the mediator can generate various execution plans (referred to as "plan" in the remainder of this paper) to execute it, and these plans can differ widely in execution cost (execution time, price of costly connections, communication cost, etc.). An optimization procedure is thus necessary to determine the most efficient plan with the least execution cost. However, how to choose the best plan based on the cost is still an open issue. In relational or [...] for each operator appearing in the plan. But in a heterogeneous and distributed environment, cost estimation is much more difficult, due to the lack of underlying database statistics and cost formulas.

Various solutions for processing the overall cost estimation at the mediator level have been proposed. In [DKS92], a calibration procedure is described to estimate the coefficients of a generic cost model, which can be specialized for a class of systems. This solution is extended for object database systems in [GGT96][GST96]. The approach proposed in [ACP96] records cost information (results) for every query executed and reuses that information for the subsequent queries. [NGT98] uses a cost-based optimization approach which combines a generic cost model with specific cost information exported by wrappers. However, none of these solutions has addressed the problem of overall cost estimation in a semi-structured environment integrating heterogeneous data sources.

In this paper, we propose a generic cost framework for an XML-based mediation system, which integrates distributed, heterogeneous and autonomous data sources. This framework makes it possible to take into account various cost models for different types of data sources with diverse autonomy degrees. These cost models are stored as annotations in an XQuery logical representation called Tree Graph View (TGV) [DNGT04][TDNL06]. Moreover, cost models are exchanged between different components of the XLive system. We apply our cost framework to compare the execution cost of candidate plans in order to choose the best one.

First, we summarize different cost models for different types of data sources (relational, object-oriented and semi-structured) and different autonomy degrees of these sources (proprietary, non-proprietary and autonomous). The overall cost estimation relies on the cost annotation stored in the corresponding components of the TGV. This cost annotation derives from a generic annotation model which can annotate any component (i.e. one or a group of operators) of a TGV.

Second, in order to perform the cost communication within the XLive system during query optimization, we define an XML-based language to express the cost information in a uniform, complete and generic manner. This language, which is generic enough to take into account any type of cost information, is the standard format for the exchange of cost information in XLive.

The paper is organized as follows: In Section 2, we introduce the XLive system with its TGV modeling of XQuery and we motivate our approach to cost-based optimization. In Section 3, we describe the summarized cost models and show how to represent and exchange these cost models using our XML-based generic language. Section 4 provides the description of TGV cost annotation and the procedure for the overall cost estimation at the mediator level. We conclude and give directions for future work in Section 5.

2 Background

XQuery processing in XLive. A user's XQuery submitted to the XLive mediator [...] information on evaluation, such as the data source locations, cost models, functional capabilities of sources, etc. The optimal annotated TGV is then selected based on a cost-based optimization strategy. In this optimization procedure, the TGV is processed as the logical execution plan and the cost estimation of the TGV is performed with cooperation between different components of XLive. This optimal TGV is then transformed into an execution plan using a physical algebra. To this end, we have chosen the XAlgebra [DNG03], which is an extension to XML of the relational algebra. Finally, the physical execution plan is evaluated and an XML result is produced. Fig. 1 depicts the different steps of this processing.

[Figure 1: architecture diagram. Users submit an XQuery to the mediator and receive the query result (XML). Inside the mediator, the query goes through canonization (canonized XQuery), modeling (Tree Graph Views), annotation (annotated TGV), cost-based optimization (search strategy, equivalent rules), transformation (XAlgebra) and evaluation. The mediator information repository holds mediator cost information; the wrapper information repository holds static and dynamic wrapper cost information. Wrappers give access to relational databases, XML data sources and Web services.]

Fig. 1. Cost-based optimization in processing of XQuery in the XLive system

Tree Graph View. TGV is a logical structure model implemented in the XLive mediator for XQuery processing, which can be manipulated, optimized and evaluated [TDNL06]. TGV takes into account the whole functionality of XQuery (collection, XPath, predicate, aggregate, conditional part, etc.) and uses an intuitive representation that provides a global view of the request in a mediation context. Each element in the TGV model has been defined formally using Abstract Data Types in [Tra06] and has a graphical representation. In Fig. 2 (a), we give an example of an XQuery which declares two FOR clauses ($a and $b), a join constraint between authors and a contains function; then a return clause projects the title value of the first variable. This query is represented by a TGV in Fig. 2 (b). We can distinguish the two domain variables $a and $b of the XQuery, each defining nodes corresponding to the given XPaths. A join hyperlink links the [...] ReturnTreePattern for projection purposes.

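(a) An XQuery query

    for $a in col("catalogs")/catalog/book
    for $b in col("reviews")/reviews/review
    where $a/author = $b/author and contains($b/author, "Hobb")
    return <books>
      {$a//title}
    </books>

(b) TGV representation

[Figure 2 (b): TGV diagram. Variable $a is bound to catalog/book in "catalogs", with nodes title and author; variable $b is bound to reviews/review in "reviews", with node author and the constraint contains("Hobb"). A join hyperlink (=) relates the author nodes, and the return tree pattern is rooted at books.]

Fig. 2. An example of XQuery and its TGV representation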

TGV generic annotation. The motivation for annotating a TGV is to allow subsets of elements of a TGV model to be annotated with various information. Precisely, for each arbitrary component (i.e. one or a group of operators of the TGV), we add some additional information such as cost information, system performance information, source localization, etc. Our annotation model is generic and allows annotation of any type of information. The set of annotations based on the same annotation type is called an annotated view. There can be several annotated views for the same TGV, for example, a time-cost annotated view, an algorithm annotated view, a sources-localization annotated view, etc.
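The notion of annotated view can be pictured as a mapping from an annotation type to the annotations attached to each component. The following is only a minimal Python sketch under that assumption: the class `AnnotatedTGV`, its methods, the operator numbers and the sample values are hypothetical and are not taken from the XLive implementation.

```python
from collections import defaultdict


class AnnotatedTGV:
    """Hypothetical container: annotations grouped by annotation type."""

    def __init__(self):
        # annotation type -> {component (frozenset of operator ids) -> value}
        self._views = defaultdict(dict)

    def annotate(self, kind, operators, value):
        """Attach `value` of type `kind` to one operator or a group of operators."""
        self._views[kind][frozenset(operators)] = value

    def view(self, kind):
        """Return the annotated view (a copy) for one annotation type."""
        return dict(self._views[kind])


# Illustrative values only (assumed, not from the paper).
tgv = AnnotatedTGV()
tgv.annotate("time-cost", {7}, 12.5)
tgv.annotate("localization", {2, 3, 4, 5}, "S2")
tgv.annotate("localization", {7}, "mediator")
```

Keying a component by the set of its operators lets a single annotation cover either one operator or a whole group, which is the granularity the text describes.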

3 Heterogeneous Cost Models and Cost Communication within XLive

3.1 Cost Models for Heterogeneous Autonomous Data Sources

Cost models summary. We summarize different existing cost models for various types of data sources in Fig. 3. This summary is based not only on the types of data sources but also on the autonomy degrees of these sources. In addition, this summary gives some relations between different works on cost-based query optimization. The cost models with the name "operation" contain accurate cost formulas for calculating the execution cost of operators appearing in the plan. Generally, cost information such as source statistics is necessary for these cost models, because these statistics are used to derive the value of coefficients in cost formulas. It is often the implementers of data sources who are able to give accurate cost [...]

[Figure 3: diagram classifying cost models by type of data source (relational, object-oriented, semi-structured) and by autonomy (proprietary data sources, heterogeneous autonomous data sources). Box labels: generic cost models; extended cost models based on operation implementation; calibration procedure unavailable; specific methods for obtaining cost; applied. Cost models shown: Path [GGT96]; Flora [Flo96]; [Gru96]; Hybrid cost model [NGT98]; Operation [CD92] [BMG93] [DOA+94]; Historical cost [ACP96]; Wrappers [HKWY97] [ROH99]; Calibration [GST96]; Adaptive [Zhu95]; Operation [GP89] [ML86] [SA82]; Calibration [DKS92]; Sampling [ZL98]; Operation [AAN01] [MW99]; XQuery Self-Learning [ZHJGML05].]

Fig. 3. Cost models for heterogeneous sources

When the data sources are autonomous, cost formulas and source statistics are unavailable. To obtain cost models we need special methods that vary with the autonomy degree of the data sources. For example, the calibration method [DKS92] estimates the coefficients of a generic cost model for each type of relational data source. This calibration needs to know the access methods used by the source. This method is extended to object-oriented databases by [GST96]. If this calibration procedure cannot be carried out due to data source constraints, a sampling method proposed in [ZL98] can derive a cost model for each type of query. The query classification in [ZL98] is based on a set of common rules adopted by many DBMSs. When no implementation algorithm and no cost information are available, we can use the method described in [ACP96], in which the cost estimation of new queries is based on the history of queries evaluated so far.
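As an illustration of the history-based idea, one simple possibility is to average the costs observed for earlier executions of comparable queries. The sketch below assumes exactly that; it is not the algorithm of [ACP96], and the class `HistoricalCostCache`, the notion of a query key and the sample costs are hypothetical.

```python
from statistics import mean


class HistoricalCostCache:
    """Hypothetical cache of observed execution costs, keyed by query."""

    def __init__(self):
        self._records = {}  # query key -> list of observed costs

    def record(self, query_key, observed_cost):
        """Store the cost measured after a query has been executed."""
        self._records.setdefault(query_key, []).append(observed_cost)

    def estimate(self, query_key):
        """Average of past observations, or None when there is no history."""
        costs = self._records.get(query_key)
        return mean(costs) if costs else None


cache = HistoricalCostCache()
cache.record("books-by-author", 10.0)  # assumed measurements
cache.record("books-by-author", 14.0)
```

Returning `None` for an unseen query keeps the absence of history explicit, so a caller can fall back to another estimation method.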

Generic cost model. Here, we show how to reuse the summary in Fig. 3 to define our generic cost model used for XQuery optimization in the XLive system. First, a cost model is generally designed for some type of data source (but there are also some methods that can be used for different types of sources, for example, the method by history [ACP96]). Second, this cost model can contain some accurate cost formulas with coefficient values derived from data source statistics, or a specific method for deriving the cost formulas. This cost model may also have only a constant value that directly gives the execution cost of operators. The possible attributes of our generic cost model are described in Table 1. This descriptive definition of a cost model is used for TGV cost annotation for the purpose of overall cost estimation at the mediator level (see Section 4).

For a cost model, all attributes are optional for the sake of generality. We apply [...]al costs, but it has a lower accuracy level than cost models based on operation implementation. That means that if cost models based on operation implementation are available, we use neither the method by calibration nor the method by history.

Attribute | Description
Data source type | This type can be relational, object-oriented, semi-structured, files, Web services, etc.
Method | The specific method stored in this field can be used to derive the practicable cost formulas. These cost formulas may be inaccurate, but can at least roughly estimate the execution cost. This respects our "as accurate as possible" principle. Generally, some APIs corresponding to the specific method are available in this field; these APIs are implemented by the XLive system and can give some useful services such as "provide the value of coefficients in the formulas".
Formulas | This is the core of a cost model, but they are often unavailable in a heterogeneous environment. These formulas are given in the form of equations. The values of coefficients appearing in the formulas can also be represented in the form of equations, for example, Cardinality = 10000. All these formulas form an equation system. For some cost models, only a constant cost value is available. This value can be provided by the data source (stored in the wrapper information repository), or derived from results of executed queries (historical cost).

Table 1. Definition of generic cost model
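The attributes of Table 1 can be read as a record in which every field is optional. The following Python sketch makes that reading concrete; the dataclass `CostModel`, the separate `constant_cost` field, the function `estimation_mode` and the order in which it checks the fields are assumptions made for illustration, guided only by the stated preference for formulas over calibration or history.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CostModel:
    """Hypothetical record mirroring Table 1; every attribute is optional."""
    source_type: Optional[str] = None      # e.g. "relational", "semi-structured"
    method: Optional[str] = None           # e.g. "calibration", "history"
    formulas: dict = field(default_factory=dict)  # variable name -> expression text
    constant_cost: Optional[float] = None  # assumed field for a constant cost value


def estimation_mode(model):
    """Pick how to estimate: formulas first, then a specific method, then a constant."""
    if model.formulas:
        return "formulas"
    if model.method is not None:
        return "method:" + model.method
    if model.constant_cost is not None:
        return "constant"
    return "unknown"


relational = CostModel("relational", formulas={"Cardinality": "10000"})
calibrated = CostModel("relational", method="calibration")
```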

3.2 Generic Language for Cost Communication (GLCC)

XML-based generic language. To perform cost communication within our XLive system, we define a language to express the cost information in a uniform, complete and generic manner. This language fits our XML environment, which avoids costly format conversion. It considers every cost model type and allows wrappers to export their specific cost information. In our XLive context, this language is generic enough to express cost information of different parts of a TGV and is capable of expressing cost for various optimization goals, for example, response time, price, energy consumption, etc.

Our language extends the MathML language [W3C03], which allows us to define all mathematical functions in XML form. MathML fits cost communication within XLive due to its semi-structured nature. We use the Content Markup in MathML to provide explicit encoding for cost formulas. We just add some rules to MathML to define the grammar of our language. Furthermore, this grammar is extensible so that users can always define their own tags for any type [...]

[Figure 4: diagram of dynamic cost evaluation. The wrapper extracts cost information from the data source and provides it to a parser; the resulting cost models are stored in the wrapper information repository and used by the mediator for TGV cost computation. Operator evaluation in the mediator records historical costs, which are used to adjust the cost models. Information is transferred using GLCC.]

An example of cost model and its representation in GLCC:

    Cost_Re = Cost_Restriction + Cost_Projection

    <cost source="relational">
    <apply><eq/>
    <apply><ci>CostRe</ci></apply>
    <apply><plus/>
    <ci>CostRestriction</ci>
    <ci>CostProjection</ci>
    </apply>
    <apply>
    ...
    </cost>

Fig. 4. Dynamic cost evaluation with GLCC in XLive system
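To make the encoding of Fig. 4 more tangible, the sketch below builds the formula Cost_Re = Cost_Restriction + Cost_Projection with Python's standard `xml.etree.ElementTree`. The exact GLCC grammar is not reproduced in this text, so the element layout is an assumption: plain MathML Content Markup (`apply`, `eq`, `plus`, `ci`) wrapped in the `cost` element shown in the figure. The helper name `sum_equation` is hypothetical.

```python
import xml.etree.ElementTree as ET


def sum_equation(result, terms, source):
    """Build <cost> containing `result = term1 + term2 + ...` (assumed layout)."""
    cost = ET.Element("cost", {"source": source})
    equation = ET.SubElement(cost, "apply")
    ET.SubElement(equation, "eq")
    ET.SubElement(equation, "ci").text = result   # left-hand side
    addition = ET.SubElement(equation, "apply")   # right-hand side
    ET.SubElement(addition, "plus")
    for term in terms:
        ET.SubElement(addition, "ci").text = term
    return cost


doc = sum_equation("CostRe", ["CostRestriction", "CostProjection"], "relational")
xml_text = ET.tostring(doc, encoding="unicode")
print(xml_text)
```

Because the output is ordinary XML, it can be parsed back with `ET.fromstring` on the receiving side without any format conversion, which is the property the text attributes to an XML-based cost language.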

Cost formulas are represented in the form of equation sets. Each equation corresponds to a cost function that may be defined by the source or by the mediator. Each component of the TGV is annotated with an equation set in which the number of equations is not fixed. One function in a set may use variables defined in other sets. We define some rules to ensure the consistency of the equation system. First, every variable must be defined somewhere. Second, for the sake of generality, there are no predefined variable names. For example, in the grammar, we do not define a name "time" for a cost variable because the cost metric can be a price unit. It is the user of the language who gives specific, meaningful names to variables. This gives a much more generic cost definition model compared to the language defined in [NGT98].
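The first consistency rule (every variable must be defined somewhere, possibly in another equation set) can be checked mechanically. The sketch below is one hypothetical way to do it in Python; the representation of an equation set as a mapping from a defined variable to the variables its right-hand side uses, and all names in the example, are assumptions for illustration.

```python
def undefined_variables(equation_sets):
    """Return variables that are used in some equation but defined in none.

    `equation_sets` maps a component name to its equation set, where each
    equation set maps a defined variable to the variables it depends on.
    """
    defined = set()
    used = set()
    for equations in equation_sets.values():
        for target, operands in equations.items():
            defined.add(target)
            used.update(operands)
    return used - defined


# Illustrative equation sets (assumed): CostProjection is never defined.
sets = {
    "restriction": {"Cardinality": [], "CostRestriction": ["Cardinality"]},
    "result": {"CostRe": ["CostRestriction", "CostProjection"]},
}
missing = undefined_variables(sets)
```

Definitions are collected across all sets before the comparison, so a variable used in one set and defined in another is accepted, as the text requires.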

Dynamic cost evaluation. Fig. 4 gives a simple example of the expression of a cost model and shows the role of our language in cost communication. After extracting cost information from the data source, the wrapper exports that information using our language to the parser, which derives cost models that will be stored in the wrapper information repository. When the mediator needs to compute the execution cost of a plan (TGV), the wrapper information repository provides the necessary cost information for operators executed on wrappers. We have a cache for storing the historical execution cost of evaluated queries, which can be used to adjust the cost information exported from the wrapper. All these communications are processed in the form of our language. Our language completes [...]

4.1 TGV cost annotation

As mentioned in Section 2, the TGV is the logical execution plan of XQuery within the query processing in XLive. The purpose of our query optimization is to find the optimal TGV with the least execution cost. To estimate the overall cost of a TGV, we annotate different components (one or a group of operators) of the TGV. For an operator or a group of operators appearing in a TGV, the following cost information can be annotated:

- Localization: The operator(s) can be executed on the mediator or on the wrappers (data sources).
- Cost Model: Used to calculate the execution cost of the component.
- Other information: Contains supplementary information that is useful for cost estimation. For example, the implementation of several operators (such as the join operator) allows parallel execution of their related operators.

[Figure 5: the TGV of Fig. 2 (b) with its components numbered (1) to (9) and annotated with cost information. Legend: card: cardinality; sel: selectivity; restr: restriction; proj: projection.]

Fig. 5. An example for TGV cost annotation

Fig. 5 gives an example of TGV cost annotation. In this example, different components of the TGV introduced in Fig. 2 (see Section 2) are annotated. We can see that for the operators executed on Source1 (S1), we have only the historical cost to estimate the total execution cost of all these operators; in contrast, for each operator executed on Source2 (S2), we have a cost model for estimating its execution cost. For the join operator (numbered (7)) executed on the mediator, the operators linked to it can be executed in parallel.
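The parallelism annotation matters because it changes how the costs of linked operators combine. The text does not give the combination rule, so the following Python sketch relies on a common assumption for a response-time metric: inputs executed in parallel contribute the maximum of their costs, while sequential inputs contribute their sum. The function `combined_cost` and the numbers are hypothetical.

```python
def combined_cost(own_cost, input_costs, parallel):
    """Cost of an operator plus its inputs (assumed response-time semantics).

    parallel=True  -> inputs overlap, so only the slowest one counts (max).
    parallel=False -> inputs run one after another (sum).
    """
    inputs = max(input_costs, default=0.0) if parallel else sum(input_costs)
    return own_cost + inputs


# Assumed costs: a join costing 2.0 whose two inputs cost 5.0 and 3.0.
parallel_total = combined_cost(2.0, [5.0, 3.0], parallel=True)
sequential_total = combined_cost(2.0, [5.0, 3.0], parallel=False)
```

For other metrics, such as price, a sum would be the natural choice even when execution is parallel, which is one reason the combination rule is kept as a parameter here rather than fixed.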

4.2 Overall cost estimation

Cost Annotation Tree (CAT). We have seen how to annotate a TGV with [...] annotated component of the TGV generally depends on the cost of other components. For example, for the cost formula annotated in (6), we see that it depends on the cost of (2), (3), (4) and (5). From the cost formulas annotated for each component of the TGV, we obtain a Cost Annotation Tree (CAT). In a CAT, each node represents a component of the TGV annotated with cost information, and the CAT describes the hierarchical relations between these different components. Fig. 6 (a) illustrates the CAT of the TGV annotated in Fig. 5.

(a) Cost Annotation Tree (CAT)

[Figure 6 (a): tree over the annotated components (1) to (9) of Fig. 5. Highlighted nodes are those that need to call APIs for obtaining the necessary coefficients' values.]

(b) Overall cost estimation algorithm

     1 associateCost (node) {
     2   node.analyzeCostModel ( );
     3   if (node.hasSpecialMethod( )) {
     4     node.callAPI( );
     5   }
     6   for (each child of node) {
     7     associateCost(child);
     8   }
     9   node.configCostFormula( );
    10   node.calculateCost( );
    11 }

Fig. 6. Cost Annotation Tree and the algorithm for overall cost estimation

Overall cost estimation algorithm. We now show how to use the CAT of a TGV to perform the overall cost estimation. We use a recursive traversal of the tree to perform the cost estimation of each node: as Fig. 6 (b) shows, the children of a node are processed before the cost of the node itself is calculated (a depth-first, post-order traversal). For each node of the CAT, we define a procedure called associateCost (Fig. 6 (b)) that operates on the cost annotation of a node. This procedure first analyzes the cost annotation of the node and derives its cost model (line 2). If a specific cost method is found, it calls an API implemented by XLive to obtain the necessary values of coefficients or cost formulas for computing the cost (lines 3-5). If the cost of this node depends on the cost of its child nodes, it recursively executes the associateCost procedure on its child nodes (lines 6-8). When these three steps are finished, a procedure configCostFormula completes the cost formulas with the obtained values of coefficients (line 9) and the execution cost of this node is calculated (line 10). By using this algorithm, we can obtain the overall cost of a TGV, which is the cost of the root of the CAT.
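The procedure of Fig. 6 (b) can be sketched in runnable form as follows. This is a simplified Python illustration, not the XLive code: the class `CatNode`, the representation of a cost formula as a callable taking coefficients and child costs, the optional `coefficient_api` callable standing in for the API call, and all numeric values are assumptions.

```python
class CatNode:
    """Hypothetical CAT node: a cost formula plus optional coefficient provider."""

    def __init__(self, name, formula, children=(), coefficient_api=None):
        self.name = name
        self.formula = formula                  # callable(coefficients, child_costs)
        self.children = list(children)
        self.coefficient_api = coefficient_api  # optional callable() -> dict
        self.cost = None


def associate_cost(node):
    """Compute and store the cost of `node` after the costs of its children."""
    coefficients = {}
    if node.coefficient_api is not None:        # lines 3-5: specific method found
        coefficients = node.coefficient_api()
    child_costs = [associate_cost(child) for child in node.children]  # lines 6-8
    node.cost = node.formula(coefficients, child_costs)               # lines 9-10
    return node.cost


# Assumed example: a constant historical cost on one source, a formula with
# coefficients on the other, and a join on the mediator that adds both.
s1 = CatNode("S1 operators", lambda k, ch: 40)
s2 = CatNode("S2 restriction", lambda k, ch: k["card"] * k["unit_cost"],
             coefficient_api=lambda: {"card": 100, "unit_cost": 2})
root = CatNode("join", lambda k, ch: 5 + sum(ch), children=[s1, s2])
overall = associate_cost(root)
```

The overall cost of the plan is the value stored on the root, and every intermediate node keeps its own `cost`, which can be useful when comparing plans that share sub-trees.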

4.3 Application: plan comparison and generation

It has been shown in [TDNL06] that for processing a given XQuery, a number [...] changing the result. The execution cost of a TGV can be computed by using our generic cost framework, and thus we can compare the costs of these plans to choose the best one to execute the query.

However, as the number of rules is huge, this implies an exponential blow-up of the candidate plans. It is impossible to calculate the cost of all these candidate plans, because the cost computation and the subsequent comparisons would be even more costly than the execution of the plan. Thus, we need a search strategy to reduce the size of the search space containing candidate execution plans. We note in this respect that our cost framework is generic enough to be applied to various search strategies such as exhaustive, iterative, simulated annealing, genetic, etc.
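The independence between the cost framework and the search strategy can be illustrated by passing the cost function as a parameter. The Python sketch below contrasts an exhaustive search with a simple iterative improvement; both functions, the toy plan space (integers), the neighbourhood and the cost function are hypothetical stand-ins, since the text does not specify how XLive generates or searches candidate plans.

```python
def exhaustive(plans, cost):
    """Evaluate every candidate plan and keep the cheapest one."""
    return min(plans, key=cost)


def iterative_improvement(start, neighbours, cost):
    """Move to the cheapest strictly better neighbour until none exists."""
    current = start
    while True:
        better = [p for p in neighbours(current) if cost(p) < cost(current)]
        if not better:
            return current  # local minimum of the cost function
        current = min(better, key=cost)


# Toy search space (assumed): plans are the integers 0..9.
def toy_cost(plan):
    return (plan - 6) ** 2


def toy_neighbours(plan):
    return [p for p in (plan - 1, plan + 1) if 0 <= p <= 9]


best_exhaustive = exhaustive(range(10), toy_cost)
best_iterative = iterative_improvement(0, toy_neighbours, toy_cost)
```

On this toy space both strategies reach the same plan, but iterative improvement only guarantees a local minimum in general; in exchange it evaluates far fewer plans than the exhaustive search.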

5 Conclusion

In this paper, we described our cost framework for the overall cost estimation of candidate execution plans in an XML-based mediation system. The closest related work is the DISCO system [NGT98], which defines a generic cost model for an object-based mediation system. Compared to the DISCO work and other mediation systems, we make the following contributions. First, to our knowledge, our cost framework is the first approach proposed for addressing the costing problem in XML-based mediation systems. Second, our cost communication language is completely generic and can express any type of cost, which is an improvement compared to the language proposed in DISCO. Third, our cost framework is generic enough to fit overall cost computation within various mediation systems.

As future work, we plan to define a generic cost model for XML sources with cost formulas that can compute the cost from given parameters that are components in the TGV. This cost model would be generic for all types of XML sources. We will also concentrate on the design of an efficient search strategy that will be used in our cost-based optimization procedure.

Acknowledgment

This work is supported by Xcalia S.A. (France) and by the ANR PADAWAN project.

References

[AAN01] A. Aboulnaga, A. Alameldeen, and J. Naughton. Estimating the Selectivity of XML Path Expressions for Internet Scale Applications. In VLDB, 2001.

[ACP96] S. Adali, K. Candan, and Y. Papakonstantinou. Query Caching and Optimization in Distributed Mediator Systems. In ACM SIGMOD, 1996.

[BMG93] J. A. Blakeley, W. J. McKenna, and G. Graefe. Experiences Building the
