XML: Looking at the Forest Instead of the Trees

Guy Lapalme
RALI-DIRO, Université de Montréal
P.O. Box 6128, Succ. Centre-Ville, Montréal, Qc, Canada, H3C 3J7
e-mail: lapalme@iro.umontreal.ca

http://www.iro.umontreal.ca/~lapalme/ForestInsteadOfTheTrees

November 18, 2005

Abstract

This report gives a high-level overview of the main principles of some XML technologies: DTD, XML Schema, RELAX NG, XPath, XSL stylesheets, Formatting Objects, DOM and SAX models of processing. They are presented from the point of view of a computer scientist, without the hype too often associated with them. We do not give a detailed description but we focus on the relations between the main ideas of XML and other computer language technologies. A single compact pretty-print example is used throughout the text to illustrate the processing of an XML structure with XML technologies or by programming in Java. We also show how to create an XML document by programming in Java.

A first version of this report was written in Fall 2002 during my sabbatical at the Université de Grenoble and at Xerox Research Centre Europe. I wish to thank Gilles Sérasset, Christian Boitet, Pierre Isabelle and Marc Dymetman for many fruitful discussions.

Since then, the document has been improved (or at least has grown in number of pages...) through its use in teaching undergraduate and graduate courses at the Université de Montréal: IFT3220 and IFT6281. I especially thank Fabrizio Gotti for his careful reading and for many insightful comments.

List of Tables

3.1 DTD syntax
3.2 XML Schema syntax
3.3 RELAX NG Compact and RELAX NG syntax
4.1 Examples of XPath expressions
4.2 XSLT syntax

List of Figures

1.1 Simple XML structure
1.2 Tree, Web browser, grid and table views of an XML file
1.3 Overview of XML technologies
1.4 HTML compact form in source and in displayed HTML
1.5 Text and PDF compact form
3.1 Graphical view of the Schema for the cellar book
3.2 Graphical view of the Schema for the wine catalog
3.3 Built-in datatypes for XML Schema
4.1 HTML display of the cellar
4.2 HTML display of the red wines in the catalog
4.3 HTML display of information about the cellar
4.4 PDF output of compaction by Formatting Objects
4.5 Outline of the XSL-FO file produced by the nested box presentation
5.1 JTree display (on Mac OS X) of listing 2.2

Listings

2.1 Outline of CellarBook.xml which includes WineCatalog.xml
2.2 [CellarBook.xml]: XML instance document for the content of the cellar
2.3 [WineCatalog.xml]: XML instance document for the wine catalog
3.1 [CellarBook.dtd]: DTD for the cellar book
3.2 [WineCatalog.dtd]: DTD to validate the wine catalog
3.3 [CellarBook.xsd]: XML Schema for the cellar book
3.4 [WineCatalog.xsd]: Schema for the wine catalog
3.5 Outline of CellarBook.xsd which imports WineCatalog.xsd
3.6 [CellarBook.rnc]: RELAX NG compact notation schema for the cellar book
3.7 [CellarBook.rng]: RELAX NG schema for the cellar book
3.8 [WineCatalog.rnc]: RELAX NG schema for the wine catalog
3.9 [WineCatalog.rng]: RELAX NG schema for the wine catalog
4.1 [compactHTML.html]: HTML output produced by the transformation on the cellar book
4.2 [compactHTML.xsl]: XSLT transformation to produce a bulleted list
4.3 [WineCatalog.html]: HTML output of the red wines in the wine catalog
4.4 [WineCatalog.xsl]: XSLT to select the red wines in the wine catalog
4.5 [CellarBook.html]: HTML output about the cellar
4.6 [CellarBook.xsl]: XSLT stylesheet to produce information about the cellar
4.7 [CellarBook.txt]: Text compaction of the cellar book
4.8 [compact.xsl]: Stylesheet used to compact the cellar book
4.9 [compactFO.xsl]: Stylesheet to transform into colored nested blocks
5.1 [DOMCompact.java]: Text compaction of the cellar book with Java using DOM
5.2 [CompactErrorHandler.java]: DOM error handling
5.3 [SAXCompact.java]: Text compaction of the cellar book with Java using SAX
5.4 [CompactHandler.java]: SAX Handler for text compacting an XML file
5.5 [TreeViewer.java]: JTree building with DOM
5.6 [JTreeHandler.java]: JTree building with SAX
6.1 [DOMExpand.java]: Compact form parsing to create a DOM XML document
6.2 [CompactTokenizer.java]: Specialized stream tokenizer
6.3 [SAXExpand.java]: XML document creation using SAX events
6.4 [CompactReader.java]: Compact form parsing to generate SAX events

Chapter 1 Introduction

XML has been developed to facilitate the annotation of information to be shared between computer systems. It is intended to be easily generated and parsed by computer systems on diverse platforms so its format is based on character streams rather than internal binary ones. Being character based, it also has the nice property of being readable and editable by humans using standard text editors.

XML is based on a uniform, simple and yet powerful model of data organization: the generalized tree. Such a tree is defined as either a single element or an element having other trees as its sub-elements, called children (see the middle of figure 1.1). This is the same model as the one chosen for the Lisp programming language almost 50 years ago. This hierarchical model is very simple and allows a simple annotation of the data. As in Lisp, the same tree notation used for data representation is also employed to write programs to transform tree structures into other tree structures. On top of this identity of data and program representation, in XML, the tree notation is also used to denote type information to validate XML data.

As is shown at the top of figure 1.1, an arbitrary name between < and > symbols is given to a node of a tree. This is called a start-tag. Everything up until a corresponding end-tag (the same tag except that it starts with </) forms the content of the node, which can itself be a tree. Such trees (e.g. wine, properties and color in figure 1.1) are called elements. Elements can also contain character data and even mix character data and elements (e.g. food-pairing). In Lisp (bottom of figure 1.1), trees are represented by embedded lists (i.e. identifiers or lists enclosed between opening and closing parentheses) whose first element is the name of the node; character data is represented by character strings. An XML element with no content can be indicated with an end-tag immediately following a start-tag and can be abridged as an empty-element tag: a start-tag with a terminating / (see rating in figure 1.1). Comments can be added to an XML file by means of a special element that starts with <!-- and ends with -->.

Additional information can be added to an element tag with attribute pairs comprising the name of the attribute (e.g. format), an equal sign and the corresponding character string value within double or single quotes (e.g. "1l" or '1l'). Attributes can also be added to an empty element (e.g. rating).

    <?xml version="1.0" encoding="UTF-8"?>
    <wine name="M" code="00518712" format="1l">
      <properties>
        <color>red</color>
        <alcoholic-strength>12</alcoholic-strength>
      </properties>
      <origin>
        <country>Italy</country>
        <region>Abruzzo</region>
        <producer>Cantina Miglianico SCARL</producer>
      </origin>
      <rating stars="2"/>
      <food-pairing>Cold cuts, <bold>Meatloaf</bold>, Pizza</food-pairing>
      <price>9.95</price>
      <year>2004</year>
    </wine>

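[Middle of figure 1.1: tree view of the same structure. The root wine (name: "M", code: "00518712", format: "1l") has the children properties (color: red; alcoholic-strength: 12), origin (country: Italy; region: Abruzzo; producer: Can..SCARL), rating (stars: 2), food-pairing (Cold cuts, bold: Meatloaf, Pizza), price (9.95) and year (2004).]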

    (wine (:name "M" :code "C00518712" :format "1l")
      (properties
        (color red)
        (alcoholic-strength 12))
      (origin
        (country Italy)
        (region Abruzzo)
        (producer Cantina Miglianico SCARL))
      (rating :stars "2")
      (food-pairing Cold cuts, (bold Meatloaf), Pizza)
      (price 9.95)
      (year 2004))

Figure 1.1: A simple XML structure (top) and a corresponding Lisp-style structure (bottom). In the middle is shown an equivalent tree structure in which the element names are shown in bold and the attributes in italics. The real information is the character data, which appears in roman font. The tree shows the relations between nodes: properties has wine as parent and color and alcoholic-strength as children; a sibling of region is country.

Figure 1.2: On the top right, the file of figure 1.1 as displayed in Internet Explorer; the + at the left of <properties> indicates that this element is hidden by collapsing. By clicking on it, the + becomes - and the tree is displayed in full. The other parts of the figure show alternative views of the same file available in commercial XML editors in order to hide the tags from the view of the user: on the left, the <oXygen/> tree editor view; in the middle right, an XMLSpy grid view; at the bottom, a table view offered by XMLSpy as a transpose of the grid view.

As is shown in the middle part of figure 1.1, these notations are equivalent to a tree data structure where each node is labelled with its name and attributes. Character data appears as leaf nodes. An empty element is a node with no sub-tree. In Lisp, the attributes can be represented by a list of pairs with names indicated by keywords (i.e. identifiers starting with a colon) followed by the corresponding value.
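The tree data structure described here can be sketched in a few lines of Java. This is only an illustration under our own assumptions: the names TreeSketch, Node, Text, Element, toLisp and sampleWine are hypothetical and belong to no XML API, the sample tree is a shortened version of the wine of figure 1.1, and attributes are printed as keyword and value pairs directly after the node name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TreeSketch {
    /** A node is either character data (a leaf) or an element. */
    interface Node { String toLisp(); }

    /** Leaf node holding character data. */
    static final class Text implements Node {
        final String data;
        Text(String data) { this.data = data; }
        public String toLisp() { return data; }
    }

    /** Node labelled with a name and attributes, with an ordered list of children. */
    static final class Element implements Node {
        final String name;
        final Map<String, String> attributes = new LinkedHashMap<>();
        final List<Node> children = new ArrayList<>();
        Element(String name) { this.name = name; }
        Element attr(String key, String value) { attributes.put(key, value); return this; }
        Element add(Node child) { children.add(child); return this; }
        Element text(String data) { return add(new Text(data)); }

        /** Lisp-style notation: (name :attr "value" child ...). */
        public String toLisp() {
            StringBuilder sb = new StringBuilder("(").append(name);
            for (Map.Entry<String, String> a : attributes.entrySet()) {
                sb.append(" :").append(a.getKey())
                  .append(" \"").append(a.getValue()).append('"');
            }
            for (Node child : children) {
                sb.append(' ').append(child.toLisp());
            }
            return sb.append(')').toString();
        }
    }

    /** A shortened version of the wine of figure 1.1 (hypothetical subset). */
    static Element sampleWine() {
        return new Element("wine").attr("name", "M").attr("format", "1l")
            .add(new Element("properties")
                .add(new Element("color").text("red"))
                .add(new Element("alcoholic-strength").text("12")))
            .add(new Element("rating").attr("stars", "2")); // empty element
    }

    public static void main(String[] args) {
        System.out.println(sampleWine().toLisp());
    }
}
```

An empty element such as rating is simply an Element with no children, and mixed content is obtained by interleaving Text and Element children in the same list.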

XML has the (well-deserved) reputation of being verbose, but it must be kept in mind that this notation is primarily aimed at communication between machines, for which verbosity is not a problem but uniformity of notation is a real asset. In fact, humans should not really be required to type all these start-tags and end-tags. Indeed, many useful structural XML editors are now available which hide the verbosity, either by keeping only the important structural information or by displaying embedded tables instead of tags. Figure 1.2 shows alternative views of an XML file.

As has been shown by Lisp over the years, this tree notation is very general and can be used not only to represent data but also its processing. Programs for transforming XML tree structures into other tree structures can be written in XSL (eXtensible Stylesheet Language) stylesheets, which are a declarative notation for XML transformation also written in XML. An important aspect of XML (and one that differs from Lisp) is the a priori type checking that can be done on the file and the validation that can be performed before processing.

XML type information can be provided either with a DTD or with schemas, which offer a more powerful and flexible type system. A schema is also written as an XML file which can itself be type checked. An alternative schema notation called RELAX NG will also be presented later in this document.

XML originated from the need for a flexible way of organizing natural language texts and thus its designers used standard representations of characters (most often Unicode) and standard encodings such as UTF-8 or UTF-16, which will not be discussed here.

XML is also widely used in computing systems to systematize structured data as an alternative to databases. Many relational databases also offer XML-specific features for indexing and searching. Because of the portability of its encoding and the fact that XML parsers are freely available, it is also used for many tasks requiring flexible data manipulation: to transfer data between systems, as configuration files of programs and for keeping information about other files. This document will not present these applications but will focus on a single one (creating a compact representation of an XML file) that will be used throughout so that one can feel the similarities and differences between some XML technologies.

Figure 1.3 presents the XML technologies we will describe in this report and their relations. The focus of the whole process is an XML instance document that contains the data (towards the top of figure 1.1). This XML document can be validated against a specification described either as a Document Type Definition (DTD) or an XML Schema, itself another XML file. The validation process will be described in chapter 3. Once validated, XML data can be used by application programs through specific Application Programming Interfaces (APIs) described in chapters 5 and 6. XML data can also be processed by transformations (chapter 4) written as stylesheets, a special kind of validated XML file, to create new XML, HTML, PDF or text files.

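[Figure 1.3: Overview of XML technologies. The diagram shows the XML instance document (.xml) together with the following labelled parts: Validation, chapter 3 (DTD and Schema: .dtd, .xsd, .rng, .rnc); Transformations, chapter 4 (stylesheets: .xsl), with Text, XHTML and Formatting Objects (.xml) outputs; Rendering, section 4.3 (PDF, HTML, PDA, ...); and Application Programs using an API, chapters 5 and 6. The legend distinguishes three types of boxes: XML document, process (with its chapter) and output document.]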

For example, the XML file at the top of figure 1.1 can be transformed with a stylesheet into an HTML one. The top of figure 1.4 shows one possible HTML output, which is displayed in a web browser at the bottom of the figure. This is the kind of tree-to-tree transformation for which XSLT was specifically designed. To better illustrate the power of the more general transformations that XSLT allows, we will show how to obtain a more compact form¹ shown in figure 1.5 either as a text file (top) or in PDF (bottom) through a transformation using Formatting Objects.
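To give a first feeling for the compact form, here is a minimal Java sketch that walks a DOM tree and prints each element as name[...] and each attribute as @name[value]. It is an assumption-laden illustration, not the stylesheet of listing 4.8 nor the program of listing 5.1: the names CompactSketch, compact and compactXml are hypothetical, the output is written on a single line without indentation, and whitespace around character data is simply trimmed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class CompactSketch {
    /** Returns the compact form of a DOM node: name[@attr[value] child ...]. */
    static String compact(Node node) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            return node.getNodeValue().trim();
        }
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return ""; // comments and other node kinds are ignored in this sketch
        }
        StringBuilder sb = new StringBuilder(node.getNodeName()).append('[');
        String sep = "";
        NamedNodeMap attrs = node.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node a = attrs.item(i);
            sb.append(sep).append('@').append(a.getNodeName())
              .append('[').append(a.getNodeValue()).append(']');
            sep = " ";
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            String s = compact(c);
            if (!s.isEmpty()) {      // skip whitespace-only text
                sb.append(sep).append(s);
                sep = " ";
            }
        }
        return sb.append(']').toString();
    }

    /** Parses an XML string and returns the compact form of its root element. */
    static String compactXml(String xml) {
        try {
            Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            return compact(root);
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(compactXml(
            "<origin><country>Italy</country><region>Abruzzo</region></origin>"));
    }
}
```

When an element carries several attributes, the order in which this sketch prints them depends on the DOM implementation, since DOM does not define an order for attributes.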

This chapter has shown that XML is a flexible notation for adding information to natural language text, but it is increasingly used in other areas as well. Raw XML is verbose and not very user-friendly but it can be hidden by appropriate tools. Programmers can also rely on freely available XML parsers and validators in order to get a well-organized data structure from an XML file.

This report tries to give an overall impression of some XML techniques and should not be considered as a definitive or exhaustive manual. We will describe the main principles and present general rules and, for the sake of simplicity, we will sometimes tell white lies that seasoned XML experts could point out.

[T]he right abstraction [for XML ...] is a labeled tree of elements. Each element has an ordered list of children in which each child is a Unicode string or an element. An element is labeled with a two-part name consisting of a URI and local part. Each element also has an unordered collection of attributes where each attribute has a two-part name, distinct from the name of the other attributes in the collection, and a value, which is a Unicode string. That is the complete abstraction. [...]. If you understand this, then you understand XML.

James Clark, in [34, pp. ix-x].

¹ This compaction notation is similar to the one used in the Formal Description of XML [15] and must be seen as a programming exercise and not as a compression technique for XML files.

    <html xmlns="http://www.w3.org/1999/xhtml">
      <head><title>HTML compaction of the XML file</title></head>
      <ul><li xmlns="">wine name="M" code="00518712" format="1l"<ul>
        <li>properties
          <ul>
            <li>color red</li>
            <li>alcoholic-strength 12</li>
          </ul>
        </li>
        <li>origin
          <ul>
            <li>country Italy</li>
            <li>region Abruzzo</li>
            <li>producer Cantina Miglianico SCARL</li>
          </ul>
        </li>
        <li>rating stars="2"</li>
        <li>food-pairing
          <ul>Cold cuts, <li>bold Meatloaf</li>, Pizza</ul>
        </li>
        <li>price 9.95</li>
        <li>year 2004</li>
      </ul>
      </li></ul>
    </html>

Figure 1.4: Representation of the tree of figure 1.1 in source HTML and as it appears in a browser window. This HTML output (slightly reformatted here to fit in the page) was

    <?xml version="1.0" encoding="utf-8"?>
    wine[@name[M]
         @code[00518712]
         @format[1l]
         properties[color[red]
                    alcoholic-strength[12]]
         origin[country[Italy]
                region[Abruzzo]
                producer[Cantina Miglianico SCARL]]
         rating[@stars[2]]
         food-pairing[Cold cuts, bold[Meatloaf], Pizza]
         price[9.95]
         year[2004]]

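[Bottom of figure 1.5: PDF rendering (page 1 of the document named wine) of the same compact form, in which each element or attribute name is followed by its content: @name M, @code 00518712, @format 1l, properties (color red, alcoholic-strength 12), origin (country Italy, region Abruzzo, producer Cantina Miglianico SCARL), rating (@stars 2), food-pairing (Cold cuts, bold Meatloaf, Pizza), price 9.95 and year 2004.]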

Figure 1.5: Compact form of the tree of figure 1.1 in text and PDF format. These outputs were produced by the stylesheets of listing 4.8 and listing 4.9. The overlap, in the PDF output, between the label alcoholic-strength and its value will be explained in section 4.5.

Chapter 2 Instance Document

Because there are many types of XML documents, used either for transforming or for validating data, an XML file that contains data is usually called an instance document. Any XML file must be well-formed, which means that

• all element start-tags and end-tags must be properly nested
• there must be exactly one top-level element in the file.

There are also other peculiarities that we will describe briefly in this chapter.
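The two well-formedness conditions above can be checked by simply handing a document to an XML parser, which reports a fatal error when one of them is violated. The following Java sketch uses the standard JAXP DocumentBuilder; the class name WellFormedSketch and the method isWellFormed are hypothetical names chosen for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedSketch {
    /** Returns true when the parser accepts the text as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // a well-formedness error was reported
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<wine><color>red</color></wine>"));  // properly nested
        System.out.println(isWellFormed("<wine><color>red</wine></color>"));  // badly nested
        System.out.println(isWellFormed("<wine/><wine/>"));                   // two top-level elements
    }
}
```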

In the rest of this report, we will be using as input the XML instance files shown in listings 2.2 and 2.3, whose outline is shown in listing 2.1.¹ They describe a wine cellar containing wine bottles defined in a separate wine catalog.² The structure of these files is the following:

CellarBook.xml (listing 2.2) describes the cellar in four parts:

• wine catalog: described in an external file, WineCatalog.xml
• owner: name and address

¹ The XML and Java listings have been produced by the listings LaTeX package, which displays ␣ to indicate that a whitespace is significant because it appears within quotes. For the sake of brevity, some listings do not show the full content of the files. Ellipsis is indicated by ... The source files are available online at the companion website of this document at http://www.iro.umontreal.ca/~lapalme/ForestInsteadOfTheTrees

On the website, there are XML instance files having their name ending with DTD, XSD, RNC or RNG depending on the type of validation used (e.g. WineCatalogXSD.xml). These instance files use file inclusion to build the complete instance file. In this document, we will instead use the plain names of the instance files without indicating the validation type used (e.g. WineCatalog.xml). These files also exist on the website but they contain the full XML text and use no file inclusion. This can be useful with XML editors (such as XMLSpy) that do not support XInclude.

The source files contain XML or Java comments of the form |\label{...} which can be ignored by the reader. As the content of the listings in this document is most often taken directly from these source files, the labels are used for keeping references with the LaTeX source file of this report.

² This application was inspired by the Livre de cave example used by Benoît Habert in his book on Common Lisp Object System (CLOS) programming [22].

Listing 2.1: Outline of CellarBook.xml (listing 2.2), which includes (line 4) WineCatalog.xml (listing 2.3), which uses a given namespace. The XML processor replaces this line at inclusion time by the content of the box.

    <cellar-book ...
        xsi:noNamespaceSchemaLocation="CellarBook.xsd"
        xmlns:cat="http://www.iro.umontreal.ca/lapalme/wine-catalog">
 4    <xi:include href="WineCatalog.xml" ... />
      +------------------------------------------------------------------------
      |    <wine-catalog ... xsi:schemaLocation=
      |        "http://www.iro.umontreal.ca/lapalme/wine-catalog WineCatalog.xsd"
      |        xmlns="http://www.iro.umontreal.ca/lapalme/wine-catalog"
      |  5     xml:base="WineCatalog.xml">
      |      <wine name="Domaine de l'Ile Margaux" code="C00043125" ... >
      |        ...
      |      </wine>
      |      <wine name="Riesling Hugel" code="C00042101" ... >
      | 10     ...
      |      </wine>
      |      <wine name="Château Montguéret" code="C10263859" ... >
      |        ...
      |      </wine>
      | 15   <wine name="Mumm Cordon Rouge" code="C00312363">
      |        ...
      |      </wine>
      |      <wine name="Prado Rey Roble" code="C00929026" ... >
      |        ...
      | 20   </wine>
      |    </wine-catalog>
      +------------------------------------------------------------------------
      <owner> ... </owner>
      <location> ... </location>
      <cellar>
 9      <wine code="C00043125"> ... </wine>
        <wine code="C00312363"> ... </wine>
        <wine code="C10263859"> ... </wine>
        <wine code="C00929026"> ... </wine>
      </cellar>
14  </cellar-book>

• location: address of the cellar (if different from that of the owner)
• cellar: list of wine bottle lots (using codes from the wine catalog) and, for each, the quantity currently held in the cellar and the purchase date of the lot

WineCatalog.xml (listing 2.3) gives the description of each wine product with a code that will be matched by those used in the cellar.

Listing 2.2 shows the content of the cellar book as an XML instance document. The first line, starting with <?xml, is the XML declaration; it has the form of a processing instruction and indicates the XML version used³ and the encoding for the file, here UTF-8.

The real content of the file, corresponding to the tree structure storing the information, starts with the root element cellar-book (line 6), which has three children of its own: owner (line 11), location (line 21) and cellar (line 27). The first child of cellar-book is the content of the file WineCatalog.xml (shown in listing 2.3), which is included at run-time via the element xi:include (line 9).

The !DOCTYPE declaration (line 2), which is not a well-formed XML element, defines entities that can be used in the XML instance document. This notation will be explained further in section 3.1, but for the moment entities can be considered as text macros that perform string substitutions before the XML file is processed. Substitution occurs when an entity is referred to by enclosing its name between & and ;. For example, entity guy (line 2) is replaced by Guy Lapalme when &guy; is encountered in the file. Entities can refer to other entities: &GL; (line 32) will be replaced by Guy Lapalme, Montréal. When an entity declaration is followed by SYSTEM and the name of a file, then a reference to this entity is replaced by the content of the file.
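The substitution of entities can be observed with a standard Java parser. The sketch below declares the same four entities as listing 2.2 in the internal subset of a tiny document whose root element note is a hypothetical name chosen for this example, as are the class name EntitySketch and the method expandedText. It relies on the default behaviour of the JAXP DocumentBuilder, which accepts a DOCTYPE declaration and expands entity references.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class EntitySketch {
    /** Entities of listing 2.2 used in a hypothetical one-element document. */
    static final String XML =
        "<!DOCTYPE note [\n"
      + "  <!ENTITY guy \"Guy Lapalme\">\n"
      + "  <!ENTITY eacute \"&#xe9;\">\n"
      + "  <!ENTITY mtl \"Montr&eacute;al\">\n"
      + "  <!ENTITY GL \"&guy;, &mtl;\"> ]>\n"
      + "<note>&GL;</note>";

    /** Returns the character data of the root element after entity expansion. */
    static String expandedText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("cannot parse XML", e);
        }
    }

    public static void main(String[] args) {
        // &GL; refers to &guy; and &mtl;, which itself refers to &eacute;
        System.out.println(expandedText(XML));
    }
}
```

The program that reads the document only sees the expanded text: the entity names themselves do not appear in the string returned by getTextContent.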

Listing 2.2: [CellarBook.xml]: XML instance document describing the content of the cellar

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE cellar-book [ <!ENTITY guy "Guy Lapalme">
                            <!ENTITY eacute "&#xe9;">
                            <!ENTITY mtl "Montr&eacute;al">
 5                          <!ENTITY GL "&guy;, &mtl;"> ]>
    <cellar-book xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                 xmlns:cat="http://www.iro.umontreal.ca/lapalme/wine-catalog"
                 xsi:noNamespaceSchemaLocation="CellarBook.xsd">
      <xi:include href="WineCatalog.xml"
10                xmlns:xi="http://www.w3.org/2001/XInclude"/>
      <owner>
        <name>
          <first>Jude</first>
          <family>Raisin</family>
15      </name>
        <street>1234 rue des Chateaux</street>
        <city>St-George</city>
        ON
        <postal-code>M7W 7S0</postal-code>
20    </owner>
      <location>
        <street>4587 des Futailles</street>
        <city>Vallée des crus</city>
        QC
25      <postal-code>H3C 4J8</postal-code>
      </location>
      <cellar>
        <wine code="C00043125">
          2005-06-20
30        <quantity>2</quantity>
          <comment>
            <cat:bold>&GL;</cat:bold>: should reorder soon
          </comment>
        </wine>
35      ....
        <wine code="C00929026">
          2003-10-15
          <quantity>1</quantity>
          <comment>for <cat:bold>big</cat:bold> parties</comment>
40      </wine>
      </cellar>
    </cellar-book>

³ Although XML version 1.1 exists, very few processors deal with it, so most of the time version="1.0" is used.

Listing 2.3 is the content of the catalog of available types of wines, storing information such as their properties (color, alcoholic strength), their origin, their price and their year of production.⁴ Other information such as the name, the code and the format are given as attributes within the start-tag. While the value of an element can be an arbitrarily complex tree of elements, attribute values can only be single string values. Strings for attribute values must be delimited by either matching ' or ". These delimiters have the same meaning, and this convention is convenient when embedding a quote of one type within a string value. In case the two types of quotes are needed within a single string, one can use the predefined entities &apos; and &quot; (explained in section 3.1).

The structure of an XML instance file may seem arbitrary and, in a sense, it is. In order to make sure that its processing is efficient, it is important that the structure of the information be in the right format (i.e. embedded within the correct tags and in the correct order) and that all the mandatory information be present. This verification could be done by the program using the information but it would be more helpful to detect errors or missing information when the instance file is created. Thus the program needing the data can be sure that the file structure follows the expected format. This validation process, similar to static type checking in a programming language, is explained in the next chapter, but first we will look at namespaces, another important concept in XML instance documents.

⁴ This information was inspired by data found on the web site of the Société des Alcools du Québec (SAQ).

Listing 2.3: [WineCatalog.xml]: XML instance document for the wine catalog; it will be included in listing 2.2, line 9

    <wine-catalog xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation=
          "http://www.iro.umontreal.ca/lapalme/wine-catalog WineCatalog.xsd"
        xmlns="http://www.iro.umontreal.ca/lapalme/wine-catalog">
 5    <wine name="Domaine de l'Ile Margaux" code="C00043125"
            classification="a.c" appellation="Bordeaux supérieur"
            format="750ml">
        <properties>
          <color>red</color>
10        <alcoholic-strength>12.5</alcoholic-strength>
          <nature>still</nature>
        </properties>
        <origin>
          <country>France</country>
15        <region>Bordeaux</region>
          <producer>
            SCEA Domaine de L&apos;Ile Margaux (B.P. 5)
          </producer>
        </origin>
20      <comment>Ready for drinking now</comment>
        <food-pairing>
          Accompanies <emph>Bordelaise ribsteak</emph>,
          pork with prunes or magret de canard.
        </food-pairing>
25      <price>22.80</price>
        <year>2002</year>
      </wine>
      ...
      <wine name="Prado Rey Roble" code="C00929026"
30          classification="d.o." appellation="Ribera-del-duero"
            format="magnum">
        <properties>
          <color>red</color>
          <alcoholic-strength>12.5</alcoholic-strength>
35        <nature>still</nature>
        </properties>
        <origin>
          <country>Spain</country>
          <region>Old Castille</region>
40        <producer>Real Sitio de Ventosilla SA</producer>
        </origin>
        <price>35.25</price>
        <year>2002</year>
      </wine>
45  </wine-catalog>

2.1 Namespaces

Namespaces allow a graceful combination of independent XML files. As can be seen in listing 2.1, listing 2.2 includes listing 2.3 via the element xi:include (line 9). These files both use the wine element in different ways:⁵ in listing 2.3, wine (line 5) refers to a description of a type of wine while in listing 2.2 wine (line 28) refers to a batch of bottles. So both references must be distinguished from one another in order to validate them with the appropriate XML Schema.

Each element name in an XML file is defined within a context, called a namespace, indicated by a prefix ending with a colon (:). A namespace prefix is defined using attributes of the root element of an instance file, themselves defined in the xmlns namespace (how about the circularity of this definition!). For example, on lines 6 and 7 of listing 2.2, two namespace prefixes are defined: xsi and cat, for which are given two arbitrary unique identifiers that will be used to distinguish their namespaces. Most often identifiers of namespaces are URLs (URIs more precisely) because the authors of an XML file can use a URL designating a web site that they own. If authors take care not to use the same URL for different purposes, this pretty much guarantees the uniqueness of the namespaces. This does not necessarily mean that the URLs used as names for namespaces actually exist. It must be remembered that the URL notation is nothing more than a useful convention, although this name can also be used by validators as a hint to find the corresponding schema.

By default, names without prefixes are in the empty namespace or, when an xmlns attribute is in scope, in the namespace given by its value. To create elements in a specific namespace, we can assign a default namespace, as we did at the start of listing 2.3, by specifying a value for the xmlns attribute (line 4). In principle, any element can set a value for the xmlns attribute to change the default namespace or declare prefixes of new namespaces for its nested elements.

So namespace prefixes are inherited: the search for the URI corresponding to a prefix starts from the current element and follows the parent links in the tree until it finds the prefix declared by an xmlns attribute.
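This search along the parent links can be sketched directly in Java. The example below deliberately uses a parser that is not namespace-aware (the JAXP default), so that xmlns declarations are ordinary attributes that we look up ourselves; in practice, a namespace-aware DOM offers the method lookupNamespaceURI for this purpose. The names PrefixLookupSketch, lookup and lookupAt, as well as the small sample document, are hypothetical and only loosely modelled on listings 2.2 and 2.3.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class PrefixLookupSketch {
    static final String URI = "http://www.iro.umontreal.ca/lapalme/wine-catalog";

    /** Hypothetical sample: a prefix on the root and a default namespace below it. */
    static final String XML =
        "<cellar-book xmlns:cat='" + URI + "'>"
      + "<wine-catalog xmlns='" + URI + "'><region>Abruzzo</region></wine-catalog>"
      + "<comment><cat:bold>big</cat:bold></comment>"
      + "</cellar-book>";

    /**
     * Walks from start towards the root until an element declares the prefix.
     * The empty prefix designates the default namespace; null means "not declared".
     */
    static String lookup(Element start, String prefix) {
        String attr = prefix.isEmpty() ? "xmlns" : "xmlns:" + prefix;
        for (Node n = start; n instanceof Element; n = n.getParentNode()) {
            Element e = (Element) n;
            if (e.hasAttribute(attr)) {
                return e.getAttribute(attr);
            }
        }
        return null;
    }

    /** Looks up a prefix at the first element with the given tag name. */
    static String lookupAt(String xml, String tagName, String prefix) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element e = (Element) doc.getElementsByTagName(tagName).item(0);
            return lookup(e, prefix);
        } catch (Exception ex) {
            throw new IllegalArgumentException("cannot parse XML", ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(lookupAt(XML, "cat:bold", "cat")); // found on the root
        System.out.println(lookupAt(XML, "region", ""));      // found on wine-catalog
        System.out.println(lookupAt(XML, "comment", ""));     // no default namespace
    }
}
```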

As shown in listing 2.1, the declaration of namespaces is most often done at the root element of the file. In this listing, the box indicates the frontier of the namespace. An element outside of the box must use a prefix to refer to an element inside the box. Within the box, no prefix is necessary because the namespace has been given a null prefix on line 4 within the box.

⁵ In such a simple case as this one, it would be an easy matter, and probably a better design, to have different names for these two concepts, but we want to illustrate the use of namespaces in a small-scale example. The same name clash would occur if we wanted to combine independently created XML files.

It is possible to define a namespace for any element (which will also apply to its subelements), but this makes it hard for the human reader to keep track of the current namespace of an element, even though a namespace-aware XML processor has no problem because a namespace is associated with each element. For example, in listing 2.2, cat:bold (line 32) designates the bold element in the http://www.iro.umontreal.ca/lapalme/wine-catalog namespace.

All elements in listing 2.3 also have this same namespace, so the bold elements (line 23) are the same: when, as will be explained later, they are processed by an XML system, they will be identified as being of the same type.
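As a minimal sketch of this identification, the standard xml.etree.ElementTree module of Python exposes each element under an expanded name of the form {URI}local-name, whatever prefix was used in the file. The two fragments below are hypothetical; the helper name expanded_names is ours.

```python
import xml.etree.ElementTree as ET

CAT_URI = "http://www.iro.umontreal.ca/lapalme/wine-catalog"

# Hypothetical fragments: the same elements written with an explicit prefix ...
PREFIXED = f'<c:comment xmlns:c="{CAT_URI}">A <c:bold>great</c:bold> wine</c:comment>'
# ... and with a default namespace declared through the xmlns attribute.
DEFAULT = f'<comment xmlns="{CAT_URI}">A <bold>great</bold> wine</comment>'

def expanded_names(xml_text):
    """Return the expanded ({URI}local) name of every element, in document order."""
    return [element.tag for element in ET.fromstring(xml_text).iter()]

print(expanded_names(PREFIXED))
print(expanded_names(DEFAULT))
```

Both calls give the same list of names: the prefix itself plays no role once the namespace has been resolved.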

The use of namespaces will be better understood once we have seen their use in validation with schemas (section 3.2.4). An excellent short introduction to the concept of namespace can be found in [34, p. 160-166].


Chapter 3 Document Validation

As we have mentioned in the previous section, an XML file must be well-formed in order to be processed correctly. XML designers have created a thorough checking method called validation that verifies whether elements of an XML file are well-formed and, furthermore, ensures that their ordering and nesting obey certain rules. These rules are specified by a DTD or an XML Schema. This validation is done prior to any further processing so that programs that process an XML file do not waste time checking for such errors. An application is even allowed to stop any processing if it encounters an invalid XML file.

The author of an XML file can usually be warned of the invalidity of the file at creation time. This validation can be done either within the XML text editor itself (e.g. XMLSpy [9], <oXygen/> [36] or the nXML mode in Emacs [18]) or by an external validator program (e.g. Xerces [10] or XSV [32]). XML editors can also play an active role in the creation of a valid XML file by suggesting, at each point, valid continuations (acceptable elements, attributes or values) depending on the DTD or the XML Schema.

XML, like its ancestor SGML, defines the validation of a file with respect to a Document Type Declaration (DTD) given at the start of the file. Most often the DTD is an accompanying external document that allows different files to follow the same rules by sharing it. A DTD is relatively simple to define but the rules of validation it can enforce are quite rudimentary because they can only define constraints on the nesting of elements and perform simple checking on values of attributes. In order to validate the content of elements, XML designers have defined a more elaborate type system called a Schema which can be used in at least two technologies: XML Schema presented in section 3.2 and RELAX NG described in section 3.3.

3.1 Document Type Declaration (DTD)

A DTD is a notation to define the elements that are allowed to appear in an XML file as well as the type of information they can contain. Table 3.1 gives an overview of some of the more frequent definitions of elements, attributes and entities that can appear in a DTD.

    <!DOCTYPE rootElement SYSTEM "file.dtd" {[!ENTITY*]}? >
    <!ELEMENT NCName ({#PCDATA |}? regexpOf!ELEMENT )>
    <!ELEMENT NCName (#PCDATA) >
    <!ELEMENT NCName EMPTY>
    <!ATTLIST elementNCName attributeNCName declValue default>
        declValue = CDATA | ID | IDREF | (CNAME {| CNAME}+ )
        default = {#REQUIRED | #IMPLIED}
    <![CDATA[ ... ]]>
    <!ENTITY name " ... ">
    <!ENTITY % name " ... ">
    <!ENTITY name SYSTEM "file.xml">

Table 3.1: A reminder of the subset of DTD syntax used in listings 3.1 and 3.2. CDATA is character data as is, but PCDATA is parsed character data that can contain references to entities. Names in italics refer to other elements. declValue and default above are not part of the DTD syntax; they are only useful abbreviations in this table. Regular expressions are used to describe the allowed forms: braces are used for grouping, ? indicates that the preceding grouping is optional, * that it can be repeated as often as necessary, possibly not at all, and + that it must appear at least once.

These definitions are pseudo-XML tags in the sense that they look like XML start-tags without their corresponding end-tags. For mainly historical reasons, DTDs are not well-formed XML files. The types for DTDs are most often given as either:

• (#PCDATA) (Parsed Character DATA), which corresponds to character string information; parsed means that the character data can contain entity references as explained below

• a regular expression in parentheses involving other element names.

The regular expression for the sequencing of children elements follows the now well known conventions¹:

,    sequence
|    choice
( )  grouping of expressions
?    optional previous expression
*    repetition, possibly none, of the previous expression
+    repetition at least once of the previous expression

1 Regular expressions used in the definition of what can appear in a DTD in table 3.1 should be distinguished from the regular expressions used in the DTDs themselves even though they use the same symbols with the same meaning. We have used two different fonts (this sans-serif font is used for meta regular expressions) but they can be hard to distinguish in some cases. The context should make clear the type of regexp that is referred to in each case.

Listing 3.1 is a validating DTD for the XML instance document given in listing 2.2.

Elements are defined with an !ELEMENT tag, see wine (line 5), containing a regular expression indicating constraints on its children elements: a wine element has up to four children elements in sequence; it must contain a purchaseDate followed by a quantity, optionally followed by a rating and then by a comment. Elements such as purchaseDate (line 6) or city (line 28) can contain character data and no other elements. A cellar (line 3) is a list (possibly empty) of wine elements. A name (line 12) is a non-empty list of first, initial or family elements in any order; these elements can even be repeated, which shows the limitations on the types of constraints that can be easily represented with a DTD.
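Since content models are regular expressions over element names, their effect can be simulated with an ordinary regular expression engine. The following Python sketch is not how a DTD validator is implemented; it is a rough translation, under the assumption that the model contains only element names and the operators listed above (no #PCDATA, EMPTY or parameter entities). The function names model_to_regex and children_match are ours.

```python
import re

def model_to_regex(model):
    """Translate an element-only DTD content model into a Python regex.

    Each child name becomes the token "name " (the name plus a space), so
    that a sequence of children can be matched as a single string.
    """
    parts = []
    for token in re.findall(r"[A-Za-z_][\w.-]*|[(),|?*+]", model):
        if token == ",":
            continue                      # sequence is plain concatenation
        elif token == "(":
            parts.append("(?:")           # grouping, without capture
        elif token in ")|?*+":
            parts.append(token)           # same meaning in both notations
        else:
            parts.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(parts))

def children_match(children, model):
    """True if the list of child element names satisfies the content model."""
    text = "".join(name + " " for name in children)
    return model_to_regex(model).fullmatch(text) is not None

# Content models copied from listing 3.1
WINE = "(purchaseDate, quantity, rating?, comment?)"
NAME = "(first|family|initial)+"

print(children_match(["purchaseDate", "quantity", "comment"], WINE))
print(children_match(["quantity", "purchaseDate"], WINE))
print(children_match(["family", "family", "first"], NAME))
```

The last call illustrates the limitation mentioned above: the model for name accepts a repeated family element because nothing in the regular expression bounds the number of occurrences of each alternative.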

Attributes are defined using !ATTLIST tags indicating the element to which they belong, their name, their type and whether they are mandatory (#REQUIRED) or optional (#IMPLIED).

See for example the !ATTLIST for the code (line 10) attribute of the wine element.

A DTD can also contain definitions of entities that act as text macros that are replaced textually either in the instance document or in the DTD itself. Entities whose definitions start with <!ENTITY, such as guy (line 20) (already illustrated in listing 2.2), define textual replacements that take place when they are called, i.e. when their name appears between & and ;. This entity mechanism is necessary in order to be able to insert a less-than sign (< typed as &lt;) in an XML file because < is reserved to indicate the start of a tag. So now we also need a way to insert an ampersand (& typed as &amp;), which indicates the start of an entity. Three other predefined entities also exist for XML files: &quot; for ", &apos; for ' and &gt; for > (this last one by symmetry with &lt; even though it is not strictly needed).
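The textual replacement of general entities can be observed with any XML parser. In the following Python sketch, the entity declarations are those of listing 3.1, placed in an internal DTD subset; the owner wrapper element and its content are hypothetical and serve only to call the entities.

```python
import xml.etree.ElementTree as ET

# Entity declarations copied from listing 3.1, given here in an internal subset.
# The content of the owner element is invented for the illustration.
DOC = """<!DOCTYPE owner [
  <!ENTITY guy "Guy Lapalme">
  <!ENTITY eacute "&#xe9;">
  <!ENTITY mtl "Montr&eacute;al">
  <!ENTITY GL "&guy;, &mtl;">
]>
<owner>&GL; &lt;&amp;&gt; &quot;&apos;</owner>"""

# The parser expands the entity calls, including nested ones, before the
# application sees the text of the element.
text = ET.fromstring(DOC).text
print(text)
```

The application only sees the replacement text: the call to GL is expanded through guy, mtl and eacute, and the five predefined entities give back the reserved characters.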

Macro replacements are also quite useful to modularize DTDs but, in order to be used within definitions of DTDs, they must be distinguished from ordinary entities; this different type of entity is called a parameter entity. Its definition starts with a percent sign followed by the name of the parameter entity and its replacement text; see address (line 26). Its call is preceded by a percent sign instead of an ampersand (see owner (line 32) and location (line 33)). Another special kind of entity, indicated by SYSTEM, refers to a file, as in wine-catalog (line 35). This entity can then be used to include a file, as shown on the last line of listing 3.1, which includes the file given in listing 3.2.

Listing 3.1: [CellarBook.dtd]: DTD for the cellar book. It can validate the instance file in listing 2.2. ELEMENTs and ATTLISTs are independent, indentation is ignored by the DTD processor, it is used here for the human reader only to highlight some inclusion dependencies.

    <!ELEMENT cellar (wine)*>

 5  <!ELEMENT wine (purchaseDate, quantity, rating?, comment?)>
    <!ELEMENT purchaseDate (#PCDATA)>
    <!ELEMENT quantity (#PCDATA)>
    <!ELEMENT rating EMPTY>
    <!ATTLIST rating stars CDATA #IMPLIED>
10  <!ATTLIST wine code IDREF #REQUIRED>

    <!ELEMENT name (first|family|initial)+>
    <!ELEMENT first (#PCDATA)>
    <!ELEMENT family (#PCDATA)>
15  <!ELEMENT initial (#PCDATA)>

    <!ELEMENT cellar-book (wine-catalog, owner, location, cellar)>

    <!-- [general] entities for use in instance document -->
20  <!ENTITY guy "Guy Lapalme">
    <!ENTITY eacute "&#xe9;">
    <!ENTITY mtl "Montr&eacute;al">
    <!ENTITY GL "&guy;, &mtl;">

25  <!-- parameter entities for use within a DTD -->
    <!ENTITY % address "(street, city, province, postal-code)">
    <!ELEMENT street (#PCDATA)>
    <!ELEMENT city (#PCDATA)>
    <!ELEMENT province (#PCDATA)>
30  <!ELEMENT postal-code (#PCDATA)>

    <!ELEMENT owner (name,%address;)>
    <!ELEMENT location %address;>

35  <!ENTITY % wine-catalog SYSTEM "WineCatalog.dtd">
    %wine-catalog;

We now look at the validation of the wine catalog (listing 3.2). Given the fact that all element names must be unique in a DTD (there are no namespaces in DTDs), we must give a different name to the wine element of listing 3.1. Here we decided to call it cat-wine (line 4). The attribute format (line 10) shows an example of an enumeration of values from which the attribute value must necessarily be chosen. The link between the wine and the cat-wine elements is done using the code (line 13) of listing 3.2, of type ID, and its reference in code (line 10) of listing 3.1, which is of type IDREF. In an XML file, all values of type ID must be distinct and values of type IDREF must refer to an existing ID.
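The two constraints on ID and IDREF values can be sketched as follows in Python. This is a simplified check, not a DTD validator: a validator collects all attributes declared of type ID or IDREF in the whole document, whereas this sketch looks only at the code attributes of cat-wine and wine. The instance fragment, its code values and the function name check_codes are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance; the code values are invented for the illustration.
# ID values must be XML names, so they cannot start with a digit.
DOC = """<cellar-book>
  <wine-catalog>
    <cat-wine code="C00043125"/>
    <cat-wine code="C00312587"/>
  </wine-catalog>
  <cellar>
    <wine code="C00312587"/>
    <wine code="C00999999"/>
  </cellar>
</cellar-book>"""

def check_codes(root):
    """Report duplicate IDs (cat-wine/@code) and dangling IDREFs (wine/@code)."""
    problems, ids = [], set()
    for cat_wine in root.iter("cat-wine"):
        code = cat_wine.get("code")
        if code in ids:
            problems.append(f"duplicate ID {code}")
        ids.add(code)
    for wine in root.iter("wine"):
        code = wine.get("code")
        if code not in ids:
            problems.append(f"IDREF {code} refers to no ID")
    return problems

print(check_codes(ET.fromstring(DOC)))
```

In this fragment, the second wine of the cellar refers to a code that does not appear in the catalog, which is the only problem reported.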

Listing 3.2: [WineCatalog.dtd]: DTD to validate the instance file in listing 2.3. It is included in listing 3.1. ELEMENTs and ATTLISTs are independent, indentation is ignored by the DTD processor, it is used here for the human reader only to highlight some inclusion dependencies.

    <!ELEMENT wine-catalog (cat-wine*)>
    <!ELEMENT cat-wine (properties, origin,
 5                      (tasting-note?, food-pairing?, comment?)*,
                        price, year)>
    <!ATTLIST cat-wine name CDATA #REQUIRED>
    <!ATTLIST cat-wine appellation CDATA #IMPLIED>
    <!ATTLIST cat-wine classification CDATA #IMPLIED>
10  <!ATTLIST cat-wine format (375ml|750ml|1l|magnum|jeroboam
                               |rehoboam|mathusalem|salmanazar
                               |balthazar|nabuchodonosor) #REQUIRED>
    <!ATTLIST cat-wine code ID #REQUIRED>
    <!ELEMENT properties (color, alcoholic-strength, nature?)>
15  <!ELEMENT color (#PCDATA)>
    <!ELEMENT alcoholic-strength (#PCDATA)>
    <!ELEMENT nature (#PCDATA)>
    <!ELEMENT origin (country, region, producer)>
    <!ELEMENT country (#PCDATA)>
20  <!ELEMENT region (#PCDATA)>
    <!ELEMENT producer (#PCDATA)>

    <!ENTITY % Comment "(#PCDATA|emph|bold)*">
    <!ELEMENT emph (#PCDATA)>
25  <!ELEMENT bold (#PCDATA)>

    <!ELEMENT comment %Comment;>
    <!ELEMENT tasting-note %Comment;>
    <!ELEMENT food-pairing %Comment;>
30
    <!ELEMENT price (#PCDATA)>
    <!ELEMENT year (#PCDATA)>

3.1.1 Associating an Instance File to DTD

The link between a DTD and an XML file that it validates can be established externally using an XML editor, but most DTD validators insist that we add a !DOCTYPE declaration at the start of the XML file. For example, one can use a declaration such as the following:

    <!DOCTYPE cellar-book SYSTEM "CellarBook.dtd" [
      <!ENTITY WC SYSTEM "WineCatalogContentNoNS.xml">
      <!ENTITY CB SYSTEM "CellarBookContentNoNS.xml">
    ]>
    <cellar-book>
      <wine-catalog>&WC;</wine-catalog>
      &CB;
    </cellar-book>

The name of the root element of the XML instance document is given as the second value, SYSTEM as the third and a reference to the DTD file as the fourth. In the previous example, we have also put the content of the wine catalog and the cellar book in separate files that are included as system entities. These lines will be seen as a complete XML file (in fact listing 2.2) by a program using the standard XML tools and APIs.

3.2 Schema

As we have seen in the previous section, a DTD describes some constraints on the order and nesting of elements in an XML file but the type of constraints is quite limited and it does not allow any validation of the character content of elements. There are also other drawbacks: all element names in a DTD must be unique and thus combining separately developed DTDs can become quite cumbersome. Moreover, the DTD file is not a well-formed XML file, thus one cannot easily use an XML tool to create or process it. This is why XML Schema has been introduced with a comprehensive set of elementary types and a way to combine them to create new types. The concept of namespaces (presented in section 2.1) is also used in order to facilitate the combination of independent files without name clashes.

A Schema is a well-formed XML file (usually with a .xsd extension) that defines types which are used to validate the elements of the XML file. In a way similar to variable declarations in a programming language, we can define types² for many elements instead of using inline definitions of embedded elements. In a Schema, there are two kinds of types: simple and complex. Simple types define constraints on the text content of an element which cannot contain any element. A complex type can contain nested elements.

There are many different ways of organizing a Schema as described by Van der Vlist [33]: one can use a russian doll approach in which a single element is defined with all embedded elements internally defined; another way is to use a bottom-up approach in which the elements are defined before being used in more complex elements; it is also possible to use a top-down approach that first defines the higher level elements before defining the lower level elements. All these styles of definition are possible and we will sometimes use a mix of them in order to show some features of XML Schema.

Table 3.2 presents the XML elements we use in our example to define the types needed for the validation of our wine catalog and cellar book. Since a schema is itself an XML file, it is important to distinguish the elements defining the Schema from the elements being defined. This is done by having different namespaces for the defining elements (definiens), using xs: (xsd is also commonly used) as prefix, and for the defined elements (definiendum), without prefix, i.e. in the default namespace. Contrary to a DTD, an XML Schema is a valid XML file, so it can be validated using the XML Schema of XML Schemas, which is usually included in all XML editors.
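Because a schema is an ordinary XML file, it can be explored with the same tools as any instance document. As a small illustration in Python, the sketch below lists the elements declared in a schema fragment by searching for the xs:element defining elements; the fragment is abridged from the Wine type of listing 3.3 and the function name declared_elements is ours.

```python
import xml.etree.ElementTree as ET

# Namespace of the defining elements (definiens)
XS = "http://www.w3.org/2001/XMLSchema"

# Fragment abridged from the Wine complex type of listing 3.3
SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="Wine">
    <xs:sequence>
      <xs:element name="purchaseDate" type="xs:date"/>
      <xs:element name="quantity" type="xs:nonNegativeInteger"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>"""

def declared_elements(schema_text):
    """Return (name, type) pairs for every xs:element found in the schema."""
    root = ET.fromstring(schema_text)
    return [(e.get("name"), e.get("type")) for e in root.iter("{" + XS + "}element")]

print(declared_elements(SCHEMA))
```

The defining elements are found through their namespace, while the names being defined (purchaseDate, quantity) are simply attribute values; such a processing would not be possible on a DTD without a dedicated parser.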

An XML Schema has an xs:schema element as root, which can contain different kinds of definition elements.

2 We follow the Java convention of starting type identifiers with an upper case letter. Element identifiers start with a lower case letter. In a name comprising more than one word, each word starts with an uppercase letter; no underscore or dash is used.

    <xs:schema targetNamespace="URI">
        xs:import* {xs:simpleType | xs:complexType | xs:element | xs:group}*
    </xs:schema>

    <xs:import namespace="URI" schemaLocation="URI"/>

    <xs:simpleType name="NCName">
        xs:restriction
    </xs:simpleType>

    <xs:complexType name="NCName" {mixed="true"}?>
        {xs:choice | xs:sequence | xs:group}? xs:attribute*
    </xs:complexType>

    <xs:element name="QName" type="TName"/>
    <xs:element name="QName" ref="EName"/>
    <xs:element name="QName">
        {xs:simpleType | xs:complexType}?
        {xs:unique | xs:key | xs:keyref}*
    </xs:element>

    <xs:sequence {min|max}Occurs="nonNegativeInteger|unbounded">
        {xs:element | xs:choice | xs:sequence | xs:group}*
    </xs:sequence>

    <xs:choice {min|max}Occurs="nonNegativeInteger|unbounded">
        {xs:element | xs:choice | xs:sequence | xs:group}*
    </xs:choice>

    <xs:group name="NCName">
        {xs:choice | xs:sequence}
    </xs:group>

    <xs:attribute name="NCName" type="TName" {use="required"}?/>

    <xs:restriction base="TName">
          <xs:{max|min}{In|Ex}clusive value="anySimpleType"/>
        | <xs:{length|minLength|maxLength} value="nonNegativeInteger"/>
        | <xs:pattern value="regExp"/>
        | <xs:enumeration value="anyValue"/>
    </xs:restriction>

    <xs:{unique|key} name="NCName">
        xs:selector xs:field+
    </xs:{unique|key}>

    <xs:keyref name="NCName" refer="NCName">
        xs:selector xs:field+
    </xs:keyref>

    <xs:{selector|field} xpath="XPathExpr"/>

Table 3.2: A reminder of the subset of XML Schema syntax used in listings 3.3 and 3.4. Names in italics refer to other elements. An NCName (non-colonized name) is a name without a namespace prefix. Regular expressions are used to describe the allowed forms: braces are used for grouping, ? indicates that the preceding grouping is optional, * that it can be repeated and + that it can be repeated but must appear at least once.

• xs:import allows the combination of different schemas into a single one; in our case, we have a schema for the wine catalog which is imported into the one for the cellar book

• xs:simpleType gives supplementary constraints on predefined types; this is explained further in section 3.2.1.

• xs:complexType defines a new type in terms of a choice or a sequence between other types; xs:group gives a name to an incomplete type. xs:attribute elements are given at the end of the definition, even though they appear in the start-tag

• xs:element is the fundamental way of defining an element that can appear in an instance file. It can be given with a name and a type, it can refer to another element definition, or it can be defined with an anonymous simple or complex type followed by key and keyref definitions

• xs:sequence (resp. xs:choice) combines other elements by making sure that they occur sequentially (resp. alternatively, i.e. only one of the elements can appear)

• xs:group clusters elements that can be used together

• xs:attribute gives the name and the simple type associated with an attribute. Attributes are not ordered, and they are optional unless their use attribute is given the value required

• xs:restriction gives range or pattern constraints on the value of a simple type; enumeration limits the allowed value to one of a given list.

• xs:key, xs:unique and xs:keyref define cooccurrence constraints that will be explained in section 3.2.3.

In the rest of this section, we first give in their entirety the XML Schemas (listings 3.3 and 3.4) that correspond to the DTDs given in listings 3.1 and 3.2 respectively. Figure 3.1 gives the overall structure of the XML Schema of listing 3.3. Figure 3.2 gives the overall structure of the Schema in listing 3.4. As we will see, the validation of the text content of the elements can be much more thorough with an XML Schema than with a DTD.

We will then explain the structure of the type system: first simple types (section 3.2.1), then complex types (section 3.2.2) and finally how to define keys and their references (section 3.2.3).


Figure 3.1: Graphical view of the Schema for the cellar book (listing 3.3). A name in a rectangular box is an element name or, if preceded by @, an attribute name. A complex type name is preceded by a square and a simple type by a triangle. A sequence is shown with 4 dots horizontally aligned in a hexagon and a choice with the 4 dots aligned vertically (see figure 4.3 for an example). A + after a box indicates that further details have been omitted. Three small squares in front of an element name indicate that its definition is referred to somewhere else in the schema; the reference is indicated by a small arrow at the bottom right of the rectangle. The figure was produced by the <oXygen/> XML editor from the XML Schema file given in listing 3.3.


Listing 3.3: [CellarBook.xsd]: XML Schema for the cellar book. It can validate the instance file in listing 2.2. It can be compared with the DTD in listing 3.1.

     <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:cat="http://www.iro.umontreal.ca/lapalme/wine-catalog">
  5    <xs:import namespace="http://www.iro.umontreal.ca/lapalme/wine-catalog"
                  schemaLocation="WineCatalog.xsd"/>

       <xs:element name="cellar">
         <xs:complexType>
 10        <xs:sequence minOccurs="0" maxOccurs="unbounded">
             <xs:element name="wine" type="Wine"/>
           </xs:sequence>
         </xs:complexType>
       </xs:element>
 15
       <xs:complexType name="Wine">
         <xs:sequence>
           <xs:element name="purchaseDate" type="xs:date"/>
           <xs:element name="quantity" type="xs:nonNegativeInteger"/>
 20        <xs:element name="rating" minOccurs="0">
             <xs:complexType>
               <xs:attribute name="stars" type="xs:positiveInteger"/>
             </xs:complexType>
           </xs:element>
 25        <xs:element name="comment" type="cat:Comment" minOccurs="0"/>
         </xs:sequence>
         <xs:attribute name="code" type="cat:SAQ-code" use="required"/>
       </xs:complexType>

 30    <xs:element name="name">
         <xs:complexType>
           <xs:sequence maxOccurs="unbounded">
             <xs:choice>
               <xs:element name="first" type="xs:string"/>
 35            <xs:element name="family" type="xs:string"/>
               <xs:element name="initial" type="xs:string"/>
             </xs:choice>
           </xs:sequence>
         </xs:complexType>
 40    </xs:element>

       <xs:element name="cellar-book">
         <xs:complexType>
           <xs:sequence>
 45          <xs:element ref="cat:wine-catalog"/>
             <xs:element name="owner" type="Owner"/>
             <xs:element name="location" minOccurs="0">
               <xs:complexType>
                 <xs:group ref="address"/>
 50            </xs:complexType>
             </xs:element>
             <xs:element ref="cellar"/>
           </xs:sequence>
         </xs:complexType>
 55      <!-- comment out the following keyref element to work around an XMLSpy validation bug
              this implies that the keyref are not validated though -->
         <xs:keyref refer="cat:WineNumber" name="SAQ-UPC">
           <xs:selector xpath="cellar/wine"/>
           <xs:field xpath="@code"/>
 60      </xs:keyref>
       </xs:element>

       <xs:group name="address">
         <xs:sequence>
 65        <xs:element name="street" type="xs:string"/>
           <xs:element name="city" type="xs:string"/>
           <xs:element name="province" type="ProvinceCA"/>
           <xs:element name="postal-code" type="PostalCodeCA"/>
         </xs:sequence>
 70    </xs:group>

       <xs:simpleType name="ProvinceCA">
         <!-- http://www.canadapost.ca/tools/pg/manual/b03-e.asp#c012 -->
         <xs:restriction base="xs:string">
 75        <xs:enumeration value="AB"/>
           <xs:enumeration value="BC"/>
           <xs:enumeration value="MB"/>
           <xs:enumeration value="NB"/>
           <xs:enumeration value="NL"/>
 80        <xs:enumeration value="NT"/>
           <xs:enumeration value="NS"/>
           <xs:enumeration value="NU"/>
           <xs:enumeration value="ON"/>
           <xs:enumeration value="QC"/>
 85        <xs:enumeration value="SK"/>
           <xs:enumeration value="YT"/>
         </xs:restriction>
       </xs:simpleType>

 90    <xs:complexType name="Owner">
         <xs:sequence>
           <xs:element ref="name"/>
           <xs:group ref="address"/>
         </xs:sequence>
 95    </xs:complexType>

       <xs:simpleType name="PostalCodeCA">
         <xs:restriction base="xs:string">
           <xs:pattern value="[A-Z][0-9][A-Z] [0-9][A-Z][0-9]"/>
100      </xs:restriction>
       </xs:simpleType>

     </xs:schema>
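To give an idea of the kind of checking on text content that the two simple types of listing 3.3 express, here is a rough Python equivalent. It is a sketch of the constraints only, not a use of an XML Schema validator; the function names is_province and is_postal_code are ours. An XML Schema pattern must match the whole value, which is why fullmatch is used.

```python
import re

# Values copied from the ProvinceCA enumeration of listing 3.3
PROVINCES = {"AB", "BC", "MB", "NB", "NL", "NT", "NS", "NU", "ON", "QC", "SK", "YT"}

# Pattern copied from the PostalCodeCA restriction of listing 3.3
POSTAL_CODE = re.compile(r"[A-Z][0-9][A-Z] [0-9][A-Z][0-9]")

def is_province(value):
    """Enumeration facet: the value must be one of the listed strings."""
    return value in PROVINCES

def is_postal_code(value):
    """Pattern facet: the whole value must match the regular expression."""
    return POSTAL_CODE.fullmatch(value) is not None

print(is_province("QC"), is_postal_code("H3C 3J7"))
print(is_postal_code("H3C3J7"), is_postal_code("h3c 3j7"))
```

The postal code of the title page satisfies the pattern, whereas the same code without the space or in lower case does not; none of these checks could be expressed in the DTD of listing 3.1, where both elements are simply (#PCDATA).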


Figure 3.2: Graphical view of the Schema for the wine catalog (listing 3.4). See caption of figure 3.1 for an explanation of symbols used in the figure.
