Partitioning XML data, towards distributed and parallel management


HAL Id: tel-00759173

https://tel.archives-ouvertes.fr/tel-00759173v3

Submitted on 8 Dec 2015


Noor Malla


ECOLE DOCTORALE OF INFORMATIQUE

P H D T H E S I S

to obtain the title of

Dr. of Science

(Informatics)

Partitioning XML data, towards

distributed and parallel management

Defended on September 21, 2012 by

Noor Malla

Thesis Advisors:

Prof. Nicole Bidoit-Tollu

University Paris-Sud XI - LRI

Assistant Prof. Dario Colazzo

University Paris-Sud XI - LRI

Jury:

President: Prof. Chantal Reynaud - University Paris-Sud XI

Reviewers: Prof. Amann Bernd - University Paris 6 - LIP6

Prof. Giovanna Guerrini - University of Genova (Italy)


I would like to express my gratitude to all those who made it possible for me to complete this Thesis.

My foremost thanks go to Dr. Dario Colazzo, my thesis adviser. He has offered invaluable suggestions and advice to shape my research skills, which will help me greatly in the rest of my career. His understanding, encouragement and personal guidance have provided a good basis for this Thesis.

Secondly, I owe my most sincere gratitude to Professor Nicole Bidoit, my supervisor, who gave me her time and helped me in various aspects of my graduate study.

Also, I would like to thank Dr. Carlo Sartiani for taking the time to serve on my committee and for providing valuable feedback that helped me improve the dissertation in many ways.

I would also like to thank my colleagues at LRI for their support: Marina, Amine, Federico and Putchi. They helped me in various aspects of my graduate study.

Finally and most importantly, I thank my parents in Syria and all my family for their love and many years of support along the path of my academic pursuits.

Last but not least, I would like to express my deepest gratitude to my best

friends, Yasmin, Yasser and Assem, for their unlimited support, love and


1 Résumé en Français
    1.1 General introduction
    1.2 Contributions
    1.3 Organization of the manuscript
2 Introduction
    2.1 Contributions
    2.2 Structure of the thesis
3 Preliminaries
    3.1 XML
        3.1.1 Textual Representation
        3.1.2 Well-Formedness of XML
    3.2 Querying XML
        3.2.1 XPath Language
        3.2.2 XQuery Language
        3.2.3 XQuery Update Facility
    3.3 Conclusion
4 XML Projection and its Limitations
    4.1 Path-based projection for queries
        4.1.1 Limitations of Standard Projection for Queries
    4.2 Type-based projection for updates
        4.2.1 Limitations of Update Type-based Projection
5 Partitioning and Projecting XML Documents
    5.1 Preliminaries
        5.1.1 Data Model
        5.1.2 Query Language
    5.2 Path extraction
    5.3 Iterative queries and partitioning paths
    5.4 Projection
    5.5 The partitioning algorithm
        5.5.1 The Algorithm
        5.5.2 Dealing with a Workload
    5.6 Streaming implementation
    5.7 Experimental evaluation
        5.7.3 Experiments
        5.7.4 Experiments on Queries $\{N_1, N_2, N_3\}$ and $\{D_1, D_2\}$
        5.7.5 Summing Up
6 Partitioning for XQuery Updates
    6.1 Overview
    6.2 Preliminaries
        6.2.1 Simple XQuery Update Facilities (SXUF)
    6.3 Iterative updates
    6.4 Partitioning for iterative updates
        6.4.1 Partitioning Algorithm
        6.4.2 Fusion Operation
    6.5 Streaming implementation
        6.5.1 Partitioning
        6.5.2 Fusion
    6.6 Experimental evaluation
        6.6.1 Experimental Setup
        6.6.2 Tests Results
        6.6.3 Experiments
        6.6.4 Summing Up
    6.7 Conclusion
7 Parallel Query and Update Evaluation
    7.1 MapReduce
        7.1.1 Logical View
        7.1.2 Execution Overview
    7.2 Parallel evaluation of iterative queries and updates via MapReduce
    7.3 Conclusive remarks
8 Related Works and Conclusion
    8.1 Related works
    8.2 Conclusive remarks and future directions
A XQuery Expressions and XQuery Updates
    A.1 XMark Queries proposed in [SWK+02a]
    A.2 Update Expressions used in [BBC+11]
    A.3 XQuery Update expressions in [Sah11]
    A.4 XQuery Update Facilities 1.0 Use Cases


3.1 Textual representation of an XML fragment
3.2 DTD of addressBook.xml XML document
3.3 A well-formed XML document
3.4 Tree representation of addressBook XML document
3.5 Navigational XPath axes
3.6 The W3C syntax of simple XQuery updates
3.7 Simple XQuery updates
3.8 Complex XQuery updates
4.1 A fragment of the input XMark document $D$
4.2 An XML document fragment
4.3 Loading algorithm of [MS03] for building a projection
4.4 A simple example with type-based projection
4.5 DTD of the XML document $t$ illustrated in Figure 4.4
4.6 Dealing with insertion
4.7 Dealing with string and mixed-contents
5.1 Projecting-partitioning scenario for an input document $D$ and a given query $Q$ and partitioning path $PP$
5.2 Representation of XML trees as stores and projection
5.3 Two possible partitions of the XML tree $t$ of Figure 5.2
5.4 Query language grammar
5.5 Path extraction function
5.6 Scenario of finding candidate paths of Example 2
5.7 Partitioning paths of some iterative XMark queries
5.8 Path /child::a/dos::c transformations
5.9 Partition plus projection
5.10 An input XML tree $t$
5.11 Standard projections $t'_{Q_1}$, $t'_{Q_2}$: XML trees created from the input $t$
5.12 Partitioning scenario on $t$ for a given iterative query $Q_1$
5.13 Global projection $t'$ for the workload ($Q_1$, $Q_2$)
5.14 Partitioning scenario on the global projection $t'$ of workload ($Q_1$, $Q_2$)
5.15 An input document $t$ and its projected parts $t'_1$, $t'_2$
5.16 Projection-partitioning processing: the current open-tag is <doc>
5.17 Projection-partitioning processing: the current open-tag is <a>
5.18 Projection-partitioning processing: the current open-tag is [...]
5.19 Projection-partitioning processing: the current open-tag is <c>
5.20 Projection-partitioning processing: the current close-tag is </c>

5.23 Projection-partitioning processing for parsing the subtree <a><f><c>
5.24 Projection-partitioning processing for parsing the following close-tags </c></f></a>
5.25 Parsing the subtree <a><c>to</c></a>, and create a new projected part $t'_2$
5.26 Final projected parts $t'_1$, $t'_2$ produced by projection+partitioning algorithm
5.27 Projection vs partitioning+projection - with input document 1GB - using Saxon
5.28 Projection vs partitioning+projection - with input document 5GB - using Saxon
5.29 Scalability of the partitioning approach - using Saxon
5.30 Scalability of the partitioning approach: workload - using Saxon
5.31 Projection vs partitioning - with input document 2GB - using Qizx
5.32 Scalability of the partitioning approach: workload - using Qizx
5.34 Scalability of the partitioning approach - using Saxon
5.37 Scalability of the partitioning approach - using Qizx
5.38 Relation between maxSize (pSize) and the performance of the partitioning approach
5.39 Scalability of the partitioning approach - using Qizx
6.1 Partitioning update scenario
6.2 Syntax of SXUF
6.3 An XML document $t$ and a possible partition
6.4 Equivalence between $U_8(t)$ and $U_8(t_1)\ U_8(t_2)$
6.5 Other possible parts $t'_1$, $t'_2$ of the XML document $t$
6.6 [...] $U_8(t)$ and $U_8(t'_1)\ U_8(t'_2)$
6.7 [...] $U_9(t)$ and $U_9(t_1)\ U_9(t_2)$
6.8 An XML document $D$ and two different kinds of partition
6.9 [...] $U_{10}(D)$ and $U_{10}(D_1)\ U_{10}(D_2)$
6.10 Non-equivalence between $U_{10}(D)$ and $U_{10}(D'_1)\ U_{10}(D'_2)$
6.11 Non-equivalent case between $U_{11}(t)$ and $U_{11}(t_1)\ U_{11}(t_2)$
6.12 [...] $U_{11}(t)$ and $U_{11}(t'_1)\ U_{11}(t'_2)$
6.13 [...] $U_{12}(t)$ and $U_{12}(t_1)\ U_{12}(t_2)$
6.14 Path extraction function for updates
6.15 An XML document $t$ and its parts $t_1, t_2, t_3$
6.16 Partitioning update scenario on the input document $t$ and its parts, for a given iterative update $U$
6.17 Fusion scenario on distinct (updated and non-updated) parts

6.20 Partitioning scenario: parsing the open-tags <d><c>
6.21 Partitioning scenario: parsing the close-tags </c></d>
6.22 Partitioning scenario: parsing the subtree <f><c>go
6.23 Partitioning scenario: parsing close-tags </c></f>, and create a new part $t_2$
6.24 Partitioning scenario: parsing open-tags <f><g>to
6.25 Parsing close-tags </g></f>, and create a new part $t_3$
6.26 Final parts $t_1, t_2, t_3$ produced by the partitioning technique
6.27 Updated parts $U(t_1)$, $U(t_2)$, non-updated $t_3$, and the fusion final result
6.28 Projection vs partitioning - with input document 1GB - using Saxon
6.29 Projection vs partitioning - with input document 5GB - using Saxon
6.30 Projection vs partitioning - with input document 1GB - using Qizx
6.31 Projection vs partitioning - with input document 5GB - using Qizx
6.32 Projection vs partitioning - with input document 10GB - using Qizx
6.33 Projection vs partitioning - with input document 15GB - using Qizx
6.34 Scalability of the partitioning+update+fusion approach - using Saxon
6.35 Scalability of the partitioning+update+fusion approach - using Qizx
7.1 Execution overview


4.1 Size of projected documents
4.2 Qizx and Saxon performances on projected DBLP document
4.3 Saxon performance on projected documents
4.4 Qizx performance on projected documents
4.5 The composition of 3-level type projector for 20 updates used in [Sah11]
4.6 Size reduction by type projection
4.7 Qizx and Saxon performances for type-based projected documents
5.1 The function ]
5.2 Rewriting functions $Down(\tau)$ and $Res(\alpha; \tau)$
5.3 Global projections size


With the widespread diffusion of XML as a format for representing data generated and exchanged over the Web, many query and update engines have been designed and implemented in the last decade. One kind of engine that plays a crucial role in many applications is the main-memory system, which is distinguished by being easy to manage and to integrate into a programming environment. On the other hand, main-memory systems have scalability issues, as they load the entire document into main memory before processing.

This Thesis presents an XML partitioning technique that allows main-memory engines to process a class of XQuery expressions (queries and updates), which we dub iterative, on arbitrarily large input documents. We provide a static analysis technique to recognize these expressions. The static analysis is based on paths extracted from the expression and does not need additional schema information. We provide algorithms that use path information to partition the input documents, so that the query or update can be evaluated separately on each part in order to compute the final result. These algorithms admit a streaming implementation, whose effectiveness is experimentally validated.

Besides enabling scalability, our approach can also be easily implemented in a MapReduce framework, thus enabling parallel query/update evaluation on the partitioned data.

Keywords: XML, XQuery, XQuery updates, Projection, Data Partitioning,

Résumé en Français

Contents

1.1 General introduction
1.2 Contributions
1.3 Organization of the manuscript

1.1 General introduction

The last decade has seen the rapid spread of semi-structured data, and in particular of the XML (eXtensible Markup Language) standard, in many applications that rely on the Web to exchange and share data. XML is a successor of SGML; it was quickly adopted as the natural format for representing semi-structured data for which the relational and object models are not appropriate. The great flexibility of XML data has made this format universal and has allowed it to be used to exchange data between different applications on the Web.

To support the spread of XML, several tools have been defined for transforming, querying, manipulating, and modeling XML data. In particular, the World Wide Web Consortium (W3C) introduced XQuery [W3S10] as a query language and XQuery Update [Gro11a, Gro11b] for updating XML documents. Both languages have been intensively studied by the scientific community, in particular with the aim of optimizing the execution of queries and updates.

A major use of XQuery is querying and updating XML data that is simply stored in files or generated as a stream. In these contexts, the complex features that characterize traditional DBMSs are generally not needed. The main requirement is a query and update engine that is easy to install and to integrate into a programming environment. For this reason, many XQuery engines have been developed in recent years, such as Galax [gal], Saxon [sax], Qizx [qiz], and eXist [exi]. These systems are [...] main memory, and then processed (queried or updated). For this reason, these systems are generally classified as main-memory systems.

Quoting Cong et al. [GCL12], main-memory systems are the best choice in

... several domains such as the life sciences (for example, biology), astronomy, and even the management of typical XML documents corresponding to Microsoft Office files (given that PowerPoint presentations and Word and Excel files are currently stored in XML format). In all these domains, XML document management is file-centric and no traditional XML data management system is put in place.

In domains such as the life sciences and astronomy in particular, XML documents are large (several GB), which can make it impossible to use a main-memory engine for query processing.

At present, main-memory systems, although very flexible and easy to install and use, cannot scale.

A partial solution to this problem has been proposed, based on projection. XML projection is an optimization technique designed to overcome the limitations of main-memory engines when querying XML documents. It rests on a simple observation: queries are generally selective, that is, they target only a sub-part of the queried documents. The idea is therefore to identify statically the parts needed to evaluate the queries, and to use this information to load into main memory only the parts of the document that the query accesses. Projection thus makes it possible to process large documents even under tight memory constraints.

Projection was first used in [MS03] and then extended in [BCCN06, KSS08] by taking into account the schema of the queried document. Using schemas reduces the size of the projection, because the data needed to evaluate a query can be inferred precisely. In the techniques of [BCCN06, KSS08], the inferred information consists of the set of labels of the nodes needed to evaluate the queries. This set is called a type projector.

These projection-based approaches provide only a partial solution to the scalability problems of main-memory systems, and projected input documents may still exceed main-memory capacity. This can happen when (i) the input file is huge, (ii) the query has low selectivity (it needs a large part of the document [...] the size of the global projection may exceed the size of main memory. The global projection may even be useless, since the whole input document may be needed for the workload.

It is important to note that scalability problems also depend on the particular engine being used and on its memory settings. In fact, most main-memory systems are implemented in Java, and their scalability depends on the amount of main memory passed as a parameter to the JVM (Java Virtual Machine). In any case, even with large amounts of memory, the scalability problems of standard projection remain, because the size of the projected document grows with the size of the input document.

The main objective of this Thesis is to propose a technique that ensures scalability for queries and updates independently of:

- the type of main-memory system;
- the amount of main memory available;
- the use of schema information.

To this end, in this Thesis we propose an optimization technique based on partitioning XML data. It rests on the observation that, in many practical cases, XQuery queries and updates first select a sequence of subtrees by means of a subquery (for example, an XPath expression), and then evaluate operations on this sequence of subtrees. For example, 13 of the 20 queries of the XMark benchmark [SWK+02b] satisfy this property, and 16 of the 20 updates proposed in [BBC+11, Sah11] are iterative.

In the case of queries, when this property is satisfied by a query $Q$, the input document can be divided into a set of parts $\{D_1, \dots, D_\kappa\}$, so that the evaluation $Q(D)$ of the query $Q$ on the input document $D$ is equal to the concatenation of the evaluations $Q(D_i)$ of the query $Q$ on the parts $D_i$ of the input document $D$.
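The property can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the sample document, the query (the name of every `person`), and the partitioning path (children of the root). It only shows that, for such a query, concatenating the per-part results gives the same answer as evaluating the query on the whole document.

```python
import xml.etree.ElementTree as ET

# Hypothetical input document D (illustrative only).
DOC = (
    "<site>"
    "<person><name>Ann</name></person>"
    "<person><name>Bob</name></person>"
    "<person><name>Cy</name></person>"
    "</site>"
)


def query(root):
    """Hypothetical iterative query Q: the name of every person, in document order."""
    return [person.findtext("name") for person in root.findall("person")]


def partition(root, max_children):
    """Split D into well-formed parts D_1..D_k, each with at most max_children subtrees."""
    children = list(root)
    parts = []
    for start in range(0, len(children), max_children):
        # Each part repeats the root tag so that it is a well-formed document.
        part = ET.Element(root.tag, dict(root.attrib))
        part.extend(children[start:start + max_children])
        parts.append(part)
    return parts


def evaluate_on_parts(root, max_children):
    """Evaluate Q on every part and concatenate the partial results."""
    results = []
    for part in partition(root, max_children):
        results.extend(query(part))
    return results
```

A query that compares subtrees taken from different parts (a join, for instance) would not satisfy this equality, which is why a static analysis is needed to recognize the iterative case.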

For updates, the same strategy can be adopted, except that the partial updates $U(D_i)$ must be recombined to obtain the updated document $U(D)$, whereas for queries a simple concatenation of the partial results suffices. In particular, we use the cat command to merge these partial results and produce the final result. For updates, since additional tags are added when the partitions are created to ensure that the parts are well formed, extra information about these tags is needed in order to recombine the updated parts correctly and to eliminate these tags to [...]

Besides scalability, our partitioning technique can easily be adapted to a MapReduce environment [DG08], which allows the parts to be queried and updated in parallel. This parallel evaluation is possible because, for iterative queries and updates, each part can be evaluated independently of the others. Consequently, the approach can readily be transposed to a MapReduce environment, which plays a very important role in cloud-based platforms.
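The independence of the parts can be sketched as a map step followed by a reduce step. The following Python sketch uses a thread pool as a stand-in for MapReduce workers; it does not use a MapReduce framework, and the parts, the query (prices greater than 5), and all function names are assumptions made for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import xml.etree.ElementTree as ET

# Hypothetical parts, as a partitioning step might have produced them.
PARTS = [
    "<site><item><price>3</price></item><item><price>12</price></item></site>",
    "<site><item><price>20</price></item></site>",
    "<site><item><price>7</price></item></site>",
]


def map_part(part_xml):
    """Map step: evaluate a hypothetical iterative query on one part only."""
    root = ET.fromstring(part_xml)
    prices = (int(item.findtext("price")) for item in root.findall("item"))
    return [price for price in prices if price > 5]


def reduce_results(left, right):
    """Reduce step: for queries, partial results are simply concatenated."""
    return left + right


def run(parts, workers=2):
    """Evaluate every part independently, then combine results in part order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the order of the inputs.
        partial = list(pool.map(map_part, parts))
    return reduce(reduce_results, partial, [])
```

Because no part needs data from another part, the number of workers affects only the running time, not the result.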

1.2 Contributions

This Thesis proposes a new partitioning technique for evaluating XQuery queries and updates.

The first contribution of this Thesis concerns queries. In this context, the main contributions, which are also presented in [Nic12], are the following:

- We first present a formal characterization of the class of queries that satisfy the division property described above; we call these queries iterative queries. Building on this formal characterization, we develop a static analysis technique that extracts paths and information about the variables bound in the query, and then analyses them to detect statically how the input document is navigated by the query. By relying on path information we avoid using schema information, which is not always available.

- We then present a partitioning algorithm that exploits the paths extracted during static analysis to identify the correct partition of the input document. We first present a specification of the algorithm based on the DOM representation, and then use a SAX parser, which makes it possible to perform the partitioning in a streaming fashion using little memory. To further increase the benefits of our approach, we combine partitioning with standard projection, so that subtrees not needed by the query are discarded while the document parts are created. Standard projection is not crucial for ensuring scalability, which is our main objective, since in our approach the maximal size of each part can be set by the user. Projection helps to reduce the cost of partitioning, because it speeds up the execution of the queries on the partition.

- Next, we present an extensive experimental evaluation confirming that, when our partitioning approach is used, main-memory engines can process documents of arbitrary size, at the cost [...] partitioning enables scalability for workloads, because in this case the input document is divided once for all the queries (or updates) of the workload.
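A streaming partitioner combined with projection can be sketched with Python's standard SAX interface. This is a simplified illustration under stated assumptions, not the algorithm of the thesis: the partitioning path is fixed to "children of the root", projection keeps only the child subtrees whose tag is in `keep`, part size is counted in characters of emitted markup, root attributes and text directly under the root are dropped, and parts are collected in a list where a real implementation would write them to files.

```python
import xml.sax
from xml.sax.saxutils import escape, quoteattr

# Hypothetical input document (illustrative only).
DOC = "<doc><a><c>x</c></a><b>skip</b><a><c>y</c></a><a><c>z</c></a></doc>"


class Partitioner(xml.sax.ContentHandler):
    """Cuts a streamed document into well-formed parts of bounded size."""

    def __init__(self, keep, max_size):
        super().__init__()
        self.keep = keep            # tags of root children kept by projection
        self.max_size = max_size    # size threshold that closes the current part
        self.parts = []             # finished parts (a real tool would write files)
        self.buf = []               # markup of the part under construction
        self.size = 0
        self.depth = 0
        self.root = None
        self.skipping = False       # True while inside a pruned subtree

    def _emit(self, text):
        self.buf.append(text)
        self.size += len(text)

    def _close_part(self):
        if self.buf:
            body = "".join(self.buf)
            # Re-add the root tags so that every part is well formed.
            self.parts.append(f"<{self.root}>{body}</{self.root}>")
        self.buf, self.size = [], 0

    def startElement(self, name, attrs):
        self.depth += 1
        if self.depth == 1:
            self.root = name
            return
        if self.depth == 2:
            self.skipping = name not in self.keep
        if not self.skipping:
            attr_text = "".join(f" {k}={quoteattr(v)}" for k, v in attrs.items())
            self._emit(f"<{name}{attr_text}>")

    def characters(self, content):
        if self.depth >= 2 and not self.skipping:
            self._emit(escape(content))

    def endElement(self, name):
        if self.depth >= 2 and not self.skipping:
            self._emit(f"</{name}>")
        if self.depth == 2:
            self.skipping = False
            # A part is only closed between subtrees, never inside one.
            if self.size >= self.max_size:
                self._close_part()
        elif self.depth == 1:
            self._close_part()
        self.depth -= 1


def partition(xml_text, keep, max_size):
    """Parse xml_text once and return the list of projected parts."""
    handler = Partitioner(keep, max_size)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.parts
```

Only the part under construction is buffered, so memory use is bounded by the size threshold plus one subtree rather than by the size of the input document.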

The second contribution of this Thesis concerns updates. In this context, the main contributions are the following:

- We first analyse the cases in which the evaluation of updates can be correctly applied to the partitions, and then provide a static analysis to characterize these updates, which we call iterative updates. This characterization requires restrictions on the query mechanisms used in the source and target expressions of updates. We will show that these restrictions are acceptable, since a large class of updates can be handled by our approach.

- We then present a partitioning technique that differs from the technique for queries in the following respects.

  First aspect: projection is not used, so that the partial updates can be recombined simply and efficiently. This is also justified by the fact that partitioning alone already suffices to generate sufficiently small pieces (parts of the input document). Using projection would require a sophisticated recombination process, since the subtrees pruned during partitioning would have to be recognized and put back into the final result. This kind of operation was performed in [BBC+11], where the use of schema information was crucial to ensure a clear and efficient formalization.

  Second aspect: the paths used during partitioning are inferred by taking into account the particular nature of updates. These paths are used to ensure that the subtrees that may be selected by the Target paths are never split during partitioning. The atomicity of these subtrees is necessary to ensure that the evaluation of the update can be correctly distributed over all the input parts.

- Next, we present the results of extensive tests showing the effectiveness of our technique. Unlike in the case of queries, the overhead due to partitioning is not negligible. However, the test results show that our main objective, scalability, is largely achieved.
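The partition, update, and fusion steps for an iterative update can be sketched in Python as follows. The sketch is illustrative only and makes several assumptions that are not from the thesis: the update appends a `checked` element to every `item`, the target subtrees are children of the root (so they are never split), the extra tags added to each part are a copy of the root element, and fusion simply removes those copies and reassembles the subtrees under a single root.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical input document (illustrative only).
DOC = "<site><item>a</item><other/><item>b</item></site>"


def update(root):
    """Hypothetical iterative update U: append <checked/> to every item."""
    for item in root.findall("item"):
        ET.SubElement(item, "checked")
    return root


def partition(root, max_children):
    """Split the document into parts; target subtrees are kept whole."""
    children = list(root)
    parts = []
    # max(..., 1) guarantees at least one (possibly empty) part.
    for start in range(0, max(len(children), 1), max_children):
        part = ET.Element(root.tag, dict(root.attrib))  # extra wrapper tags
        part.extend([copy.deepcopy(c) for c in children[start:start + max_children]])
        parts.append(part)
    return parts


def fuse(parts):
    """Fusion: drop the per-part wrappers and rebuild a single document."""
    fused = ET.Element(parts[0].tag, dict(parts[0].attrib))
    for part in parts:
        fused.extend(list(part))
    return fused


def update_by_parts(root, max_children):
    """Partition, update each part independently, then fuse the results."""
    return fuse([update(part) for part in partition(root, max_children)])
```

Parts that contain no target subtree are left unchanged by the update and pass through fusion as they are, so only the parts that actually change need to be loaded by the update engine.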

For the tests, we used two leading main-memory engines, Saxon [sax] and Qizx [qiz]. Our choice is motivated by the fact that Saxon is a very popular system, which stands out for its completeness [...] specializes in XQuery querying and updating, and supports sophisticated techniques for optimizing execution time and memory consumption.

The third contribution of this Thesis is to show that the proposed technique can easily be adapted to run in a MapReduce framework [DG08]. To this end, the main notions of this paradigm are introduced, and the architecture of the implementation of our technique on MapReduce is then illustrated and discussed.

1.3 Organization of the manuscript

This manuscript consists of eight chapters, including one chapter containing the summary in French and one introductory chapter.

The six other chapters are organized as follows:

Chapter 3. This preliminary chapter presents the notation and the query languages (XPath and XQuery [Gro03, W3S10]) and update language (XQuery Update Facility [Gro11a]) used throughout this manuscript.

Chapter 4. In this chapter, we examine the main characteristics of the two main approaches proposed for XML projection. The first approach [MS03] concerns queries, and is based on extracting paths from the query and using these paths to project the input document. The second approach for queries was proposed in [BCCN06], and requires information about the data schema. We will not discuss this approach, because this thesis does not use the data schema and, for the XQuery fragment we consider, the performance of [BCCN06] is very close to that of [MS03] in terms of document size reduction.

The second technique we discuss concerns updates [BBC+11, BCMS09a, BCMS09b], and is the only existing projection technique for updates. It is based on schema information and type inference, plus a Merge operation which, as we will see, is needed to recombine the updated projection with the original document.

In this chapter, besides illustrating how projection can be used to process a large class of XML queries and updates on large documents, we will show that these techniques, even if they [...]

The chapter is organized as follows. Section 4.1 introduces the standard XML projection proposed by [MS03], with some main definitions and the path analysis algorithm that extracts the set of projection paths from an arbitrary XQuery query. We then explain the loading algorithm used to build the projection in memory. Section 4.1.1 illustrates the limitations of standard XML projection by testing several queries on XMark documents and on the DBLP database. In Section 4.2, we introduce, through examples, the type-based projection technique proposed by [BBC+11]. Then, in Section 4.2.1, we illustrate the limitations of this technique using updates. Finally, we conclude this chapter in Section 4.3.

Chapter 5. In this chapter, we present a new partitioning-projection technique for XML input documents. This technique generalizes existing path-based approaches and applies to a large class of queries.

The proposed approach analyses an input query and, if the query is iterative, extracts all the relevant paths and uses them to project and partition the input document, obtaining small parts. Our experimental study confirms that, by executing the input query on each part independently and combining the partial results obtained from these parts, any existing main-memory engine can process an iterative query on very large input documents.

This chapter has three main parts. The first part (Sections 5.1, 5.2, 5.3) presents the static analysis technique we use to characterize iterative queries, for which XML data can be partitioned for query evaluation. The second part (Section 5.5) presents our partitioning algorithm. First, a precise specification is formalized using a DOM-based representation of input trees; then a SAX-based version is provided. As stated in the introduction, to increase the benefits of our strategy, projection is used during partitioning. The third part (Sections 5.6, 5.7) explains the implementation of the algorithms based on a SAX parser, and presents the results of experiments we carried out using two main XQuery engines. Finally, we conclude this chapter in Section 5.8.

Chapter 6. In this chapter, we present a partitioning technique for XUF (XQuery Update Facility) updates. As in the case of queries, partitioning makes it possible to process large [...] standard projection based on the technique proposed in [BBC+11].

In this chapter, we characterize a class of updates, called iterative updates, for which partition-based evaluation is possible: first, the documents are partitioned into several parts; then the parts are updated independently; finally, the updated parts are merged using a fusion operation to obtain the final result, that is, the updated input document.

To characterize iterative updates, we use a path-based analysis. The extracted paths are also used for partitioning. Unlike for queries, partitioning does not rely on projection; the paths are used only to ensure that each part contains everything needed for each update operation. Projection is not used, in order to avoid complex fusion operations on the updated parts, which would be needed to recover the pruned subtrees when building the updated global document. The effectiveness of the proposed approach is demonstrated by extensive experiments comparing our partitioning-based approach with the projection proposed in [BBC+11, MS03]. It is worth noting that this latter, type-based approach is the only projection approach for handling XQuery updates.

The chapter is structured as follows. In Section 6.2, we introduce some preliminary notation for the update language used in this approach, and then present our path extraction function. In Section 6.3, we formally describe iterative updates. Then, in Section 6.4, we present our partitioning technique for iterative updates and introduce the formal definitions and DOM-based specifications of partitioning and fusion. In Section 6.5, we provide the (streaming-based) partitioning and fusion algorithms used to run our partitioning scenario for updates. The chapter ends with the test results in Section 6.6 and some conclusions in Section 6.7.

Chapter 7 Besides scalability, the partitioning technique presented in the
previous chapters has another advantage: it makes it possible to execute queries
and updates in parallel. This is possible because a large class of queries and
updates are iterative, and can therefore be evaluated on each part independently
of the others.

In this chapter, we present the essential ideas of a possible parallel
implementation of our partitioning technique using the

Sartiani (assistant professor at the Università della Basilicata, Italy) and
Maurizio Nole (Master's student at the Università della Basilicata, Italy).

We first present the basics of the MapReduce paradigm in Section 7.1, and then
show how our technique can be implemented on a MapReduce platform in Section 7.2.
Finally, we draw our conclusions in Section 7.3.

Chapter 8 Conclusion and perspectives: In this chapter, we have presented a new
partitioning technique for XML documents. This technique generalizes existing
path-based approaches, and applies to a large class of queries and updates.

One distinctive feature of our approach is that it does not use the schema. It
uses path information extracted from the query/update to perform the static
analysis needed to recognize the iterative nature of the query/update, and uses
the same path information to perform the partitioning. Another distinctive feature
of this approach is that it can rely on any main-memory system, since no
intervention in the internal machinery of the system is required. Finally, we have
seen that our approach can easily be implemented on a parallel platform such as
MapReduce, thus enabling parallel querying and updating. For large collections of
documents and large clusters of machines, this substantially reduces execution
time compared with a sequential execution of the queries/updates.

There are several perspectives. First, we plan to extend this approach to other
fragments of XQuery, in particular to queries containing aggregation operators
(such as group-by). In addition, we plan to extend this technique to the case
where queries perform joins. In this case, the tests we performed have shown that
execution time can be significant when using main-memory systems. To allow
partitioning for such a query/update, the static analysis must be redefined to
take join conditions into account, probably resorting to the rewriting of
queries/updates. In our opinion, in this scenario a MapReduce approach could help
reduce execution time.

As a second perspective, we would like to explore the possibility of handling
workloads consisting of queries and updates. Once the path analysis has been
performed to characterize the iterative nature of the workload, partitioning can
be carried out for the whole set of queries and updates that make up the workload.

In particular, we will focus on our implementation, in order to adapt our code to
the MapReduce platform. In this context, we will also focus on experimental tests
in order to determine for which kinds of query/update MapReduce execution is more

Introduction

Contents

2.1 contributions

2.2 structure of the thesis

The last decade has seen the rapid diffusion of the eXtensible Markup Language in
many application fields. XML is a successor of SGML, and was rapidly adopted as a
natural format for representing semi-structured data, whose structure cannot be
easily modeled according to standard relational and object-oriented data models.
The great flexibility of the XML data model made it a universal data
representation format, and allowed the use of XML as a convenient medium for
exchanging data between different Web applications.

To support the diffusion of XML, several tools for transforming, querying,
manipulating, and modeling XML data have been defined. In particular, the World
Wide Web Consortium (W3C) introduced XQuery [W3S10] as the standard query language
for XML data, and, more recently, XQuery Update Facility [Gro11a, Gro11b] as an
extension of XQuery to update XML documents. Since their introduction, both
languages have been intensively studied by the research community, in particular
in directions aiming at optimizing query and update execution.

One of the main uses of XQuery is to query and update XML data that are simply
stored in files or generated by a stream. Generally, in these contexts the complex
functionalities characterizing traditional DBMSs are not needed. The main need in
these contexts is the availability of a query/update engine which is easy to
install and to integrate in a programming environment. With this motivation, many
light-weight XQuery processors have been devised in recent years, like Galax
[gal], Saxon [sax], Qizx [qiz], and eXist [exi]. These systems usually provide
full compliance with the W3C specifications, and process data in main-memory
fashion: data are first entirely loaded in main memory and then processed (queried
or updated). For this reason, these systems are usually classified as main-memory
systems.

Quoting Cong et al. [GCL12], main-memory systems are the best choice in

Office files (since PowerPoint presentations, Word files, and Excel spreadsheets
are all currently stored as XML). In all these domains, the management of XML
documents is file-system centric and no traditional XML data management systems is
yet in place (since non-expert users often find these latter systems to be hard to
use and maintain).

Especially in domains like Life Science and Astronomy, XML documents are likely to
be huge (several GBs), which can jeopardize the possibility of using a main-memory
engine for query processing. In other words, main-memory systems, while very
flexible and easy to set up and use, cannot scale up with document size. A partial
solution to this problem is offered by projection-based techniques [BCCN06, KSS08,
MS03] that allow one to prune out, at loading time, parts of the data that are not
necessary for the query or the workload being processed. For some of the existing
projection techniques, schema information in the form of DTDs or XML Schema
definitions is needed [BCCN06, KSS08].

Projection-based approaches provide only a partial solution to the scalability
issues of main-memory systems, as the projected input documents may still exceed
the main-memory capacity. This may be the case when (i) the input file is huge,
(ii) the query selectivity is low and the query needs a large part of the input,
or (iii) a workload (i.e., a set of queries) has to be evaluated on the document.
In the last case, a single global projection meeting the needs of the whole
workload is likely to exceed the main-memory size, while running one query at a
time, and projecting (and loading) data for each run, would result in a quite
inefficient and still failure-prone process. This is because the global projection
will normally be huge, and in the worst case it will contain the whole input
document in order to satisfy all the queries composing the workload. Therefore,
standard projection can still fail when processing a query workload.

It is worth observing that scalability issues also depend on the particular kind
of engine one wants to use, and on internal memory settings. In fact, most
main-memory systems are implemented in Java, and their scalability depends on the
amount of main memory given to the Java Virtual Machine. In any case, even for
large amounts, the scalability problems of standard projection still persist, as
the size of the document projection increases as the size of the input document
increases.

The main objective of this Thesis is to offer a technique that ensures scalability
for both queries and updates independently of:

the kind of main-memory system.

the amount of available main memory.

the presence of schema information.


cases, XQuery queries and updates first select a sequence of subtrees by means of
a subquery (e.g., an XPath expression), and then iterate operations on this
sequence of subtrees. For instance, concerning queries, 13 out of 20 queries of
the XMark benchmark meet this property, while concerning updates, 16 out of 20
updates in the benchmark adopted in [BBC+11, Sah11] are iterative.

In the case of queries, when this property is satisfied by a query $Q$, the input
document can be split into a collection of parts $\{D_1, \ldots, D_\kappa\}$, so
that the evaluation $Q(D)$ of the query $Q$ over the document $D$ turns out to be
equal to the concatenation of the evaluations $Q(D_i)$ of the query $Q$ over the
document parts $D_i$.

For updates, the same strategy can be adopted, with the difference that the
partial updates $U(D_i)$ have to be recombined so that the updated document $U(D)$
can be obtained. In the case of queries, a simple concatenation of the partial
results is sufficient; in particular, we use the command cat to combine these
partial results in order to produce the final one. For updates, since we use
additional tags during the creation of the partitions in order to preserve the
well-formedness of the created parts, auxiliary information about these additional
tags is needed in order to correctly recombine the updated parts and eliminate
these tags to obtain the final update result $U(D)$. This auxiliary information is
suitably built up during partitioning.
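The splitting property for queries can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, not the partitioning algorithm of this Thesis: the sample document, the function names query and split, and the choice of grouping a fixed number of subtrees per part are all hypothetical. Each part is wrapped in a copy of the root tag so that it remains well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (names and ages are made up for illustration).
DOC = """<addressbook>
  <person><name>Ann</name><age>31</age></person>
  <person><name>Bob</name><age>42</age></person>
  <person><name>Eve</name><age>38</age></person>
</addressbook>"""


def query(root):
    """An iterative query: for each person child, return the text of its age."""
    return [p.findtext("age") for p in root.findall("person")]


def split(root, size):
    """Split the children of root into parts of at most `size` subtrees.

    Each part is wrapped in an element carrying the root tag, so that every
    part is itself a well-formed document.
    """
    children = list(root)
    parts = []
    for start in range(0, len(children), size):
        part = ET.Element(root.tag, root.attrib)
        part.extend(children[start:start + size])
        parts.append(part)
    return parts


document = ET.fromstring(DOC)
parts = split(document, size=2)

whole = query(document)                                  # Q(D)
combined = [r for part in parts for r in query(part)]    # Q(D_1) ++ ... ++ Q(D_k)
print(whole == combined)
```

For this particular query, evaluating each part separately and concatenating the partial results in part order gives the same sequence as evaluating the query on the whole document.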

Besides scalability, our partitioning technique can be easily adapted to a
MapReduce [DG08] framework, enabling parallel querying or updating of the parts
composing a partition. This is due to the fact that iterative queries and updates
enjoy the property that evaluation on each part does not need information coming
from evaluation on another part. The possibility of an easy transposition to a
MapReduce framework plays an important role nowadays, given the current rapid and
wide diffusion of cloud-based platforms based on this paradigm.
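Because the parts can be evaluated independently, the evaluation has a natural map/reduce shape. The following Python sketch uses a thread pool as a stand-in for MapReduce workers; it is an assumption-laden illustration, not the MapReduce implementation discussed later in the Thesis, and the names PARTS, evaluate_part and evaluate_in_parallel are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import xml.etree.ElementTree as ET

# Hypothetical parts of an already partitioned document, one string per part.
PARTS = [
    "<addressbook><person><age>31</age></person>"
    "<person><age>42</age></person></addressbook>",
    "<addressbook><person><age>38</age></person></addressbook>",
]


def evaluate_part(xml_text):
    """Map step: evaluate the query on one part, using no other part."""
    root = ET.fromstring(xml_text)
    return [p.findtext("age") for p in root.findall("person")]


def evaluate_in_parallel(parts, workers=2):
    """Run the map step on every part, then concatenate the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the order of its inputs.
        partial_results = list(pool.map(evaluate_part, parts))
    # Reduce step: concatenation in part order.
    return [item for partial in partial_results for item in partial]


ages = evaluate_in_parallel(PARTS)
print(ages)
```

Keeping the partial results in part order matters, since the final result must respect the order of the original document.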

2.1 contributions

This Thesis proposes a novel technique for partitioning-based evaluation of XQuery
queries and updates.

The first contribution of this Thesis focuses on queries. In this context, the
main contributions are the following ones, and are also reported in [Nic12]:

We first present a formal characterization of the class of queries that enjoy the
above described splitting property: we dub these queries iterative queries. By
relying on this formal characterization, we develop a static analysis technique
that first extracts paths and information about bound variables from the query,
and then analyses them in order to statically detect how the document

We then present a partitioning algorithm that exploits the paths extracted during
the static analysis to identify the correct partitioning for the input document.
We first present a DOM-based specification of the algorithm, and then a SAX-based
one, enabling partitioning to be performed in a streaming fashion, with a very
limited memory footprint. To further improve the benefits of our approach, we
combine partitioning with standard projection, so that during the creation of
document parts, subtrees not needed by the query are pruned out. The use of
projection is not crucial to ensure scalability, which is our main purpose, since
in our approach the maximal size of each part can be tuned by the user. Projection
helps in reducing the overhead of partitioning, since it speeds up query execution
on the partition.

Then, we present an extensive experimental evaluation that corroborates that, when
using our partitioning approach, main-memory engines can process documents of
arbitrary size, at the price of a modest overhead with respect to schema-less
projection techniques; our experiments also show that partitioning allows for a
scalable management of workloads, as the input document is partitioned once and
for all.
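The idea of streaming partitioning with a user-tunable part size can be sketched in Python as follows. This is a deliberately simplified illustration and not the algorithm of this Thesis: it assumes that the size of a part is measured as a number of top-level subtrees, that the root has only element children, and it ignores the paths produced by the static analysis. The names partition_stream, max_children and SAMPLE are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def partition_stream(source, max_children):
    """Yield well-formed parts holding at most max_children top-level subtrees.

    The input is read incrementally; each completed subtree is detached from
    the root so that the whole document is never kept in memory.
    """
    root = None
    current = None
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            if depth == 1:
                root = elem
                current = ET.Element(root.tag)   # wrapper keeps parts well-formed
        else:
            depth -= 1
            if depth == 1:                       # a child subtree of the root is complete
                current.append(elem)
                root.remove(elem)                # release it from the parsed tree
                if len(current) == max_children:
                    yield current
                    current = ET.Element(root.tag)
            elif depth == 0 and len(current) > 0:
                yield current                    # last, possibly smaller, part


SAMPLE = b"<r><p>1</p><p>2</p><p>3</p></r>"
parts = list(partition_stream(io.BytesIO(SAMPLE), max_children=2))
print([[child.text for child in part] for part in parts])
```

A bound expressed in bytes rather than in number of subtrees would follow the same structure, with a running size counter replacing the call to len.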

The second contribution of this Thesis concerns updates. In this context, the main
contributions are the following ones:

We first analyze cases in which update evaluation can be correctly done on
partitions, and then provide a static analysis to characterize such updates, which
we call iterative updates. This characterization requires restrictions on the
querying mechanisms that can be used in source and target expressions of updates.
We will show that these restrictions are mild, in the sense that a wide class of
updates can be dealt with by our approach.

We then present a partitioning technique which differs from that for queries in
the following two respects.

First, projection is not used, in order to have a simple and efficient
recombination process of partial updates. This is also justified by the fact that
partitioning is already sufficient to generate small enough parts. The use of
projection would require a sophisticated recombination process, since subtrees
pruned out during partitioning should be recognized and reported in the final
result of the process. This kind of operation has been done in [BBC+11], where the
use of schema information was crucial to ensure a clear formalization and
efficiency.

Second, the paths used during partitioning are inferred by taking into account the
particular nature of updates. These paths are used in order to ensure that
subtrees eventually selected by target paths are never split during partitioning.

Then, we present extensive test results showing the effectiveness of our
technique. Differently from the case of queries, the overhead due to partitioning
is not negligible. However, test results show that our main goal, scalability, is
largely attained.

Concerning test results, we used two main-memory engines, Saxon [sax] and Qizx
[qiz]. Our choice is motivated as follows. Saxon is a very popular system, which
stands out for its exhaustiveness in covering most W3C standards for XML
processing (e.g., XML Schema, XSLT, XQuery queries and updates). By contrast, Qizx
is specialized in XQuery query and update, and supports sophisticated techniques
to optimize both execution time and memory consumption.

As a third contribution, this Thesis shows that the proposed framework can be
easily adapted in order to be run in a MapReduce framework [DG08]. To this end,
the main notions behind this paradigm are introduced first, and then the
architecture of the MapReduce implementation of our framework is illustrated and
discussed.

2.2 structure of the thesis

The Thesis is organized as follows:

Chapter 3 Introduces XML and XQuery Update Facility and provides some basic
notions and definitions.

Chapter 4 Presents standard projection techniques and shows their limitations in
terms of scalability.

Chapter 5 Presents our partitioning technique for XQuery queries, together with
experimental results.

Chapter 6 Presents our partitioning technique for XQuery updates, together with
experimental results.

Chapter 7 Illustrates how our partitioning techniques can ensure parallel query
and update evaluation by means of the MapReduce paradigm.

Chapter 8 Discusses related work, concluding remarks and directions for

Preliminaries

Contents

3.1 XML

3.1.1 Textual Representation

3.1.2 Well-Formedness of XML

3.2 Querying XML

3.2.1 XPath Language

3.2.2 XQuery Language

3.2.3 XQuery Update Facility

This chapter has two main sections. In the first one, we present some basic
notions about XML data and its characteristics. In the second section, we first
introduce the XML query languages XPath and XQuery, and then introduce the update
extensions provided by the XQuery Update Facility language. All of these languages
are W3C standards [Gro03, Gro11a, W3S10].

3.1 XML

XML (eXtensible Markup Language) is among the most popular data formats for
representing data generated and exchanged by Web applications. In particular, XML
is widely adopted to describe different kinds of data such as HTML (HyperText
Markup Language) data, relational and object databases, multimedia files (audio,
video), and so on.

XML is actually a simplified form of SGML (Standard Generalized Markup Language),
and it has been a W3C standard since 1998 [BPMM08]. The syntax of XML data is very
similar to that of HTML. However, there are some deep differences between them.
The most important one is that HTML has predefined element tags and attributes
whose behavior is well specified, while XML does not. For instance, in XML the
user can adopt a <name> tag, while in HTML the user is obliged to use predefined
tags such as <body>, <head>, <title>, etc.

The possibility of using non-predefined tags makes XML data self-describing. This,
together with the possibility of free element nesting and mixed content, makes

3.1.1 Textual Representation

According to the W3C, the basic component of an XML document is the element, which
consists of a piece of text enclosed by an open-tag and its corresponding
close-tag. The content of each XML element can be a simple text value, a sequence
of elements, or a mixed sequence which includes the two previous forms (text
values and elements). Figure 3.1 shows a simple fragment of an XML document, in
which elements are denoted by markup tags. For example, the open-tag <name> and
the close-tag </name> delimit an XML element, and the text value Jean Scott
included between them is the content of this XML element. Elements with empty
content are called empty elements, and have an abbreviated notation, as indicated
by the empty element <email/>. The element <note> contains a complex sequence
which includes elements such as <telephone> and text values. Elements can be
annotated with attributes that contain metadata about the element and its
contents. For example, the element <person> has a single attribute named gender
with the simple value M.

<name> Jean Scott </name>

<note> The personal phone of Jean is :

</note>

</person>

Figure 3.1: Textual representation of an XML fragment.

3.1.2 Well-Formedness of XML

According to the W3C, an XML document is considered well-formed if the following
constraints are met. We summarize below the main ones.

An XML document must contain at least one element.

Exactly one element must contain the whole XML document; this element is called
the root element.

All element tags must be nested properly, with no overlap between them.

Tags in XML are case sensitive. This means that <Name>, <NAME> and <name> are not
the same.

<name>Jean Scott</lastName>

The open-tag and close-tag do not match.

The element tags are not nested properly.

Due to case sensitivity, open-tag and close-tag do not match.

The attribute value misses quotes.
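These violations can be checked mechanically with any XML parser. The short Python sketch below relies on the standard xml.etree.ElementTree parser, which raises ParseError on input that is not well-formed; the helper name is_well_formed and the sample strings are hypothetical and only mirror the kinds of errors listed above.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical samples, one per kind of violation discussed above.
samples = {
    "matching tags": "<name>Jean Scott</name>",
    "mismatched tags": "<name>Jean Scott</lastName>",
    "bad nesting": "<person><name>Jean</person></name>",
    "case mismatch": "<Name>Jean Scott</name>",
    "unquoted attribute": "<person gender=M></person>",
    "two root elements": "<a/><b/>",
}

for label, text in samples.items():
    print(label, is_well_formed(text))
```

Only the first sample is accepted; every other one breaks one of the constraints summarized in this section.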

As already said, in this Thesis we focus on a schema-less approach, in the sense
that we do not rely on schema information. However, we briefly introduce DTD
(Document Type Definition), which is a widely used schema language. This
introduction will help in understanding related work on updates [BBC+11] that
makes use of schemas in the form of DTDs.

In a nutshell, a DTD schema consists of a set of declarations used for describing
the structure of elements and attributes. The content of each element is described
by means of regular expressions. Elements, attributes and other constructors are
used to describe the formal structure of the content of a well-formed XML
document.

DTD element declarations have the following form:

<!ELEMENT element-name (element-content)>

where element-name represents the name of an element tag in an XML document (such
as person, name, email, etc.) while element-content is either an empty content or
a regular expression over tags and text symbols representing the structure of the
element content.

Each DTD starts with the declaration of the root element, and then continues with
the specification of the other elements. A DTD for our addressBook.xml document is
shown in Figure 3.2. In particular, the declaration of the root element
addressbook says that its content has to be a sequence of zero or more elements
tagged as person. The DTD also specifies that the content of each person element
consists of two elements name and age, followed by two optional telephone and
email elements, and finally a mandatory note element. The value #PCDATA is used to
declare the text content of each element node in the document addressBook.xml.
This text content consists of a sequence of characters (string values) without
interleaved XML element nodes. The declaration of the gender attribute of person
says that two possible values (M and F) are admitted, and that M is the default
one.

In many contexts, it is convenient to have a tree representation of an XML

<!DOCTYPE addressbook [
<!ELEMENT addressbook (person*)>
<!ELEMENT person (name, age, telephone?, email?, note)>
<!ATTLIST person gender (M|F) "M">
<!ELEMENT name (#PCDATA | (firstname, lastname))>
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT age (#PCDATA)>
<!ELEMENT telephone (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT note (#PCDATA | email | telephone)*>
]>

Figure 3.2: DTD of the addressBook.xml XML document.
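Since element content is described by regular expressions, checking the children of an element against its declaration amounts to a regular-expression match over the sequence of child names. The Python sketch below illustrates this for the person declaration of Figure 3.2 only. It is a hand-written illustration, not a DTD validator: the encoding of the content model as a comma-terminated string and the names PERSON_MODEL and matches_person_model are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Content model of person in Figure 3.2: (name, age, telephone?, email?, note).
# Each child name is followed by a comma so that names cannot run together.
PERSON_MODEL = re.compile(r"name,age,(telephone,)?(email,)?note,")


def matches_person_model(person):
    """Check the sequence of child element names against the content model."""
    child_names = "".join(child.tag + "," for child in person)
    return PERSON_MODEL.fullmatch(child_names) is not None


# Hypothetical person elements: the second one has age and note swapped.
valid = ET.fromstring("<person><name>Jean</name><age>30</age><note/></person>")
invalid = ET.fromstring("<person><name>Jean</name><note/><age>30</age></person>")

print(matches_person_model(valid), matches_person_model(invalid))
```

The optional telephone and email children correspond to the two optional groups of the pattern, so a person with or without them is accepted as long as the order is respected.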

the root element, children of this element correspond to sub-elements and textual
nodes, and so on. A tree representation of our addressBook document is given in
Figure 3.4.

In the next chapters, we will mainly focus on documents containing only elements.
This is to simplify the formal treatment; our approaches easily extend to
attributes. As a consequence, figures will be simpler too, as only element nodes
will occur.

Figure 3.4 uses a graphical tree representation to describe the addressBook
document. In this Thesis, we will often rely on graphical tree representations to
illustrate our concepts.

3.2 Querying XML

This section introduces two XML query languages, XPath and XQuery, both W3C
standards. An excellent overview of the XQuery language is presented in [KCD+03],
and an overview of the XPath language is given in [Gro03]. A formal introduction
to these languages is out of the scope of this Thesis. In this section, we only
focus on the basic structures of XPath expressions and of the XQuery language, and
introduce them mainly by means of examples. Subsequent chapters will then provide
formal characterizations of the fragments of these languages we will deal with.

3.2.1 XPath Language

XML Path Language (XPath) is one of the most popular languages used in XML
technologies. It provides support for navigating through XML trees in order to
select nodes satisfying some structural and value-based properties.

<name> Jean Scott </name>
<note>The personal phone of Jean is :
<telephone>+33110203040</telephone>
</note>
</person>
<person>
<name>
<firstname>Steven</firstname>
<lastname>Wesley</lastname>
</name>
<age>38</age>
<telephone>+33155209940</telephone>
<email>steven.wesley@ITcompany.com</email>
<note>
Work administrator, his mobile phone:
his email:<email>steven.boss@speedymail.com</email>
</note>
</person>
</addressbook>

Figure 3.3: A well-formed XML document.

consists of three parts: two mandatory parts, the axis and the node test, and an
optional part, the predicate.

Informally, the three components of a step are defined as follows:

1. an axis defines the relationship between the context node and the nodes
selected by the step.

2. a node test specifies the node type and the expanded-name of the selected
nodes.

3. zero or more predicates, which use arbitrary expressions to further refine the
set of selected nodes.

The evaluation of each step returns a sequence of nodes. The current node over
which a step is evaluated is called the context node, and the value returned by an
XPath expression is the value returned by the last step of this expression.

For example, when the step child::person is evaluated, the axis child selects all
children nodes of the context node. Then, among these nodes,

Figure 3.4: Tree representation of the addressBook XML document.

document order. Also, it is important to note that XPath assumes that navigation
through a document always starts from what is called the document root, which can
be seen as a virtual node having as its only child the document root element. The
document root is selected by the simple expression /, so for our previous
addressBook document /child::addressbook selects the root element addressbook,
while /child::addressbook/child::person selects the sequence of all person
elements.

The following brief description presents some of the axes available in XPath
(Figure 3.5 illustrates these navigational axes):

self axis selects the context node itself.

child axis selects all children of the context node.

descendant axis selects all descendants (children, grandchildren, etc.) of the
context node.

descendant-or-self axis selects all descendants of the context node and the
context node itself.

parent axis selects the parent of the context node, which is either an element
node or the root node (or an empty sequence if the context node is the root node).

ancestor axis selects all ancestors (parent, grandparent, etc.) of the context
node, from its parent to the root node.

ancestor-or-self axis selects all ancestors of the context node, from its parent
to the root, and the context node itself.

node(): selects nodes of any type.

text(): selects text nodes.

tag: selects only nodes that have the element name tag. For example, the element
name age in the step child::age selects only nodes corresponding to elements named
age.

Figure 3.5: Navigational XPath axes.
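The meaning of these axes can be made concrete with a small Python sketch over an in-memory tree. It is only an illustration: the sample tree and the function names are hypothetical, only element nodes are modeled, and the virtual document root of XPath is not represented, so the parent of the root element is the empty sequence here. Since ElementTree elements do not store a reference to their parent, a parent map is built first.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample tree.
root = ET.fromstring("<a><b><c/><d/></b><e/></a>")

# Map every element to its parent (the root element has no entry).
parent_of = {child: parent for parent in root.iter() for child in parent}


def child(node):
    return list(node)


def descendant(node):
    # iter() visits the node itself first, then its descendants in document order.
    return [n for n in node.iter() if n is not node]


def descendant_or_self(node):
    return list(node.iter())


def parent(node):
    return [parent_of[node]] if node in parent_of else []


def ancestor(node):
    """Ancestors listed from the parent up to the root element."""
    result = []
    while node in parent_of:
        node = parent_of[node]
        result.append(node)
    return result


c = root.find("b/c")
print([n.tag for n in child(root)])
print([n.tag for n in descendant(root)])
print([n.tag for n in ancestor(c)])
```

The self, descendant-or-self and ancestor-or-self axes differ from their plain counterparts only by the inclusion of the context node.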

In the following we give some examples of XPath expressions. The next query
selects all email elements that are children of person elements. This is done by
using a specific path to be followed in order to select the requested email
elements:

/child::addressbook/child::person/child::email

which has the following abbreviated version (the child:: part is omitted):

/addressbook/person/email

Another admitted abbreviation is the use of //a instead of
/descendant-or-self::node()/child::a. So the following query selects all email
elements in the document addressBook.xml.

//email
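The difference between the two paths can be tried out in Python with xml.etree.ElementTree, which supports a limited subset of XPath. The sketch is an illustration with a hypothetical, shortened sample document. Since ElementTree evaluates paths relative to an element, the absolute path /addressbook/person/email is written relative to the root element addressbook.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened version of the address book document.
DOC = """<addressbook>
  <person>
    <name>Steven Wesley</name>
    <email>steven.wesley@ITcompany.com</email>
    <note>private email: <email>steven.boss@speedymail.com</email></note>
  </person>
</addressbook>"""

root = ET.fromstring(DOC)          # root is the addressbook element

# /addressbook/person/email, written relative to the root element:
# only email elements that are children of person.
direct = [e.text for e in root.findall("./person/email")]

# //email: email elements at any depth.
anywhere = [e.text for e in root.findall(".//email")]

print(direct)
print(anywhere)
```

The first path misses the email nested inside note, while the descendant form reaches it.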

XPath uses predicates in its query syntax to limit the data extracted from an
input XML document. The following predicate is used to select all person elements

3.2.2 XQuery Language

The XQuery language is a flexible and powerful query language for XML data. XQuery
is built on XPath expressions, and can be used in several tasks, such as:

Extract information from an XML database to use in a Web service.

Generate summary reports about data stored in an XML database.

Search textual documents on the Web for relevant information.

Transform XML data to XHTML to be published on the Web.

In all these contexts, XPath is not sufficient, as mechanisms to select tuples of
nodes and to build new ones are needed. The most used fragment of XQuery consists
of FLWR expressions. The name FLWR comes from the initial letters of the following
clauses:

for-clauses first select a sequence of nodes, and then perform some query
operations on each node;

let-clauses bind a sequence of nodes to a specific variable, which can be used in
another expression;

where-clauses filter nodes depending on a boolean expression;

return-clauses build the values returned by a query.

Most of these clauses are optional, except the return clause. This clause is
always attached to at least one for or let clause. In general, a FLWR expression
may contain many for/let clauses before the return clause.

The simplest FLWR expression containing a for clause has the following form:

for $x in Q1 return Q2

First of all, this query evaluates Q1, and then for each node in the resulting
sequence, it binds this node to the variable $x and evaluates Q2 accordingly. Note
that the evaluations of Q2 are performed according to the sequence order of the
result of Q1. The final result is obtained by concatenating all Q2 results.

The following example illustrates a query that returns the sequence of age
elements of all person elements in the document addressBook.xml presented in
Figure 3.3:

for $x in doc("addressbook.xml")//person

return $x/age
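The evaluation of a for clause can be mimicked in Python to make the "bind, evaluate, concatenate" semantics explicit. This is an informal sketch and not an XQuery engine: the helper for_return, the two lambda functions standing for Q1 and Q2, and the sample data are all hypothetical.

```python
import xml.etree.ElementTree as ET


def for_return(q1, q2, context):
    """Emulate `for $x in Q1 return Q2`.

    Q1 is evaluated once; Q2 is evaluated for each binding of $x, in the
    order of the Q1 result, and all Q2 results are concatenated.
    """
    result = []
    for x in q1(context):
        result.extend(q2(x))
    return result


# Hypothetical sample data.
doc = ET.fromstring(
    "<addressbook>"
    "<person><age>30</age></person>"
    "<person><age>38</age></person>"
    "</addressbook>"
)

ages = for_return(
    lambda d: d.findall(".//person"),   # Q1: //person
    lambda x: x.findall("age"),         # Q2: $x/age
    doc,
)
print([a.text for a in ages])
```

This per-binding evaluation followed by concatenation is the pattern that the notion of iterative query builds on.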


where $x/@gender = "M"

return $x

XQuery also provides if-then-else expressions. For instance, the above query is
equivalent to the following one using this kind of expression:

return

if $x/@gender = "M" then $x else ()

where () denotes the empty sequence.
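The equivalence between the where form and the if-then-else form can be checked informally in Python. The sketch assumes that both queries iterate over all person elements of the document, which is an assumption, as the for clause is not shown above; the sample persons are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data.
DOC = """<addressbook>
  <person gender="M"><name>Jean Scott</name></person>
  <person gender="F"><name>Ann Lee</name></person>
</addressbook>"""

persons = ET.fromstring(DOC).findall(".//person")

# where form: keep $x only when $x/@gender = "M"
with_where = [x for x in persons if x.get("gender") == "M"]

# if-then-else form: return $x, or the empty sequence () otherwise
with_if = []
for x in persons:
    with_if.extend([x] if x.get("gender") == "M" else [])

print([x.findtext("name") for x in with_where])
print(with_where == with_if)
```

Returning the empty sequence contributes nothing to the concatenated result, which is why both forms select the same nodes.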

The following query produces two kinds of elements depending on the gender of
persons:

return

if $x/@gender = "M" then <m/> else <f/>

An example illustrating how multiple for/let clauses can be combined is the

following one:

let $x := doc("addressbook.xml") return

for $y in $x//person

let $w := $y/age

where $w > 35

return $y/note

In the above example, each for/let clause is evaluated in a scope determined by
the previous clauses. The query above will return the following data:

<note>

Work administrator, his mobile phone:

and private email:<email>steven.boss@speedymail.com</email>

</note>

3.2.3 XQuery Update Facility

The XQuery language is provided with a powerful extension, called XQuery Update
Facility (XUF), for updating XML documents. The XUF language became a W3C
candidate recommendation in 2009, and was finalized as a recommendation in 2011

DeleteExpr ::= "delete" ("node" | "nodes") TargetExpr

RenameExpr ::= "rename" "node" TargetExpr "as" string-value

ReplaceExpr ::= "replace" ("value of node" | "node") TargetExpr "with" SourceExpr

InsertExpr ::= "insert" ("node" | "nodes") SourceExpr InsertExprTargetChoice TargetExpr

InsertExprTargetChoice ::= "as" ("first" | "last") "into" | "after" | "before"

Figure 3.6: The W3C syntax of simple XQuery updates.

2. rename the name of an element node.

3. replace an existing node with a new node or several new nodes.

4. insert a node or several nodes into an existing node.

The syntax of the XUF language, according to the W3C recommendation, is reported
in Figure 3.6. In this syntax, the TargetExpr computes the target location where
the update operation takes place, while the SourceExpr returns a new fragment
which will be inserted or used as a replacement at the target location.
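The roles of the target and source expressions can be illustrated by mimicking the four operations on an in-memory tree in Python. This is a rough analogy under stated assumptions and not an implementation of XUF: the sample document SRC is hypothetical, the helper fresh only re-parses it, and the sketch modifies the tree in place, whereas XUF first collects the requested updates and applies them when the query completes.

```python
import xml.etree.ElementTree as ET

# Hypothetical input document.
SRC = "<a><b><c>gogo</c></b><d/></a>"


def fresh():
    """Return a new copy of the input document for each example."""
    return ET.fromstring(SRC)


# insert node <new/> as last into /a/b : target is b, source is <new/>
inserted = fresh()
inserted.find("b").append(ET.Element("new"))

# replace value of node /a/b/c with "tata" : target is c, source is a string
replaced = fresh()
replaced.find("b/c").text = "tata"

# rename node /a/d as "new" : target is d
renamed = fresh()
renamed.find("d").tag = "new"

# delete node /a/b : target is b
deleted = fresh()
deleted.remove(deleted.find("b"))

for doc in (inserted, replaced, renamed, deleted):
    print(ET.tostring(doc, encoding="unicode"))
```

In each case the path passed to find plays the role of the TargetExpr, while the new element or string plays the role of the SourceExpr.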

In Figure 3.7, we illustrate the main update mechanisms by means of some examples.
The input document $D$ is reported in Figure 3.7-(a).

Figure 3.7: Simple XQuery updates. Panels: (a) input XML document $D$; (b)
$U_1(D)$ (insert new node); (c) $U_2(D)$ (replace node); (d) $U_3(D)$ (replace
value of node); (e) $U_4(D)$ (rename node); (f) $U_5(D)$ (delete nodes).

The result of the first simple update $U_1(D)$ on the input document $D$ reported
in Figure 3.7-(a) is illustrated in Figure 3.7-(b). This update inserts an empty
new