RELAX NG Compact and RELAX NG syntax - XML: Looking at the Forest Instead of the Trees

As we have seen in the previous section, XML Schema allows a thorough validation of XML instance files. The type extension mechanism is very powerful, but its XML format is not user-friendly, especially for complex nestings of sequences and choices. This is why the graphical editing of schemas, provided by editors such as XMLSpy, is very useful. When it comes to ease of use, the grammar-like format of a DTD is much more convenient. In order to get the best of both worlds, an alternative schema language called RELAX NG (REgular LAnguage for XML, New Generation) has been proposed; it features a simpler, more intuitive notation for defining schemas. RELAX NG is based on the same mathematical theory that underlies regular expressions, adapted to the XML context. Its mathematical foundations are both simpler and more powerful than those of XML Schema.

RELAX NG has two equivalent syntaxes:⁷ one is XML-based and the other (called compact) is more convenient because it allows grammar-like definitions. Eric van der Vlist [34] has written an excellent book explaining both notations in detail. First, he introduces the XML patterns, the theoretical foundations of the formalism, which are combined into ordered and unordered groups and used in choices among alternatives. He then shows how the compact notation can simplify the XML notation. In this report, we use the compact notation to write RELAX NG schemas; should the XML notation be needed for further processing, we obtain it with the Trang automatic schema converter. Most validators can deal directly with the compact notation. Listing 3.6, a RELAX NG compact notation schema for our cellar book, looks more intuitive than the equivalent XML Schema of listing 3.3.

⁷ Trang [17] is a tool that can transform one notation into the other and even a RELAX NG schema into an XML Schema.

As can be seen in figure 3.3, the structure of RELAX NG Compact definitions is quite regular and simple: as shown on the last line of the top left cell, a definition is simply a name followed by an equal sign and a pattern (each line of the bottom cell of the table corresponds to a different kind of pattern). A pattern can start with the keyword element or attribute, followed by a name and another pattern within braces. Patterns can be combined sequentially (with a comma), as alternatives (with a vertical bar) or by interleaving (with an ampersand); this last case means that all patterns must occur but not necessarily in order. A pattern can also be qualified as optional, or as appearing zero or more times or one or more times. A mixed pattern allows text nodes to appear between patterns. A reference to another pattern is indicated by simply giving its name. empty means that the content of the element must be empty. text corresponds to any number of text nodes in the instance document. Giving a value (usually within braces) means that the element in the document should match this value. It is also possible to attach facets (in the XML Schema sense) to a type with a list of triples of the form: the name of the facet, an equal sign and then the value of the facet.

In listing 3.6, we can see examples of element definitions (line 9, line 19 and line 25). A definition can also be a comma-separated sequence of patterns (line 32 and line 38). We use it here for type definitions, but the concept is more general and can be applied to any kind of definition. The content of a definition starts with the keyword attribute or element followed by its name and the type of its content between braces. As in the regular expression conventions used for DTDs, a definition or a reference to a definition can be followed by a ? to indicate that it is optional (see rating and comment within wine (line 10)), a * to indicate a repetition of 0 or more times (see cellar-element (line 9)) or a + for a repetition of at least one element. If a & is used instead of a comma to separate elements (such as for name-element (line 19)), it indicates an interleave, meaning that the elements in the pattern are unordered. Here it means that the parts of the name can appear in any order, each of them being optional because it is followed by a ?.⁸ The root element of the schema is defined by the rule associated with the start keyword.
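To make the meaning of the interleave operator concrete, here is a minimal Python sketch of the check that a validator performs for the content of name-element. It is only an illustration of the semantics, not how a RELAX NG validator is implemented; the function name and the sample names are hypothetical, and the sketch ignores namespaces and stray text between children.

```python
from xml.etree import ElementTree as ET

# Element names allowed inside <name>, taken from name-element in listing 3.6.
ALLOWED = {"first", "family", "initial"}

def matches_name_interleave(name_elem):
    """Approximate check for
       element first {text}? & element family {text}? & element initial {text}?
    """
    tags = [child.tag for child in name_elem]
    # Every child must be one of the three allowed elements.
    if any(tag not in ALLOWED for tag in tags):
        return False
    # Each child may only contain text, never nested elements.
    if any(len(child) > 0 for child in name_elem):
        return False
    # Each element is optional and may occur at most once; order is free.
    return len(tags) == len(set(tags))
```

With a comma instead of &, the same three elements would additionally have to appear in the order first, family, initial.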

When there is no constraint on the string inside an element, its type is text, but it can also refer to the built-in data types of XML Schema (see wine (line 10)). Restrictions can also be added on types by indicating them within braces: patterns (see PostalCodeCA (line 40)) or enumerations (see the province element in Address (line 32)).

⁸ This is a slight difference from the syntax allowed for a name element as defined by the DTD (listing 3.1) and XML Schema (listing 3.3), in which the only way to indicate this constraint would have been to enumerate all possible orderings of first, family and initial.
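The pattern and enumeration restrictions of listing 3.6 can be mimicked in a few lines of Python, which may help to see what a validator accepts. This is a sketch only: the function names are our own, and we assume that the pattern facet must match the whole value, as XML Schema patterns do.

```python
import re

# Pattern facet of PostalCodeCA in listing 3.6.
POSTAL_CODE_CA = re.compile(r"[A-Z][0-9][A-Z] [0-9][A-Z][0-9]")

# Enumeration of values allowed for the province element of Address.
PROVINCES = {"AB", "BC", "MB", "NB", "NL", "NT",
             "NS", "NU", "ON", "QC", "SK", "YT"}

def is_postal_code(value):
    # fullmatch: the whole string must match, not just a prefix.
    return POSTAL_CODE_CA.fullmatch(value) is not None

def is_province(value):
    # An enumeration is a choice between literal values.
    return value in PROVINCES
```

For instance, is_postal_code accepts "H3C 3J7" but rejects the same code written without the space or in lower case.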

Listing 3.6 includes (line 4) the definitions of the wine catalog from a separate file (listing 3.8). Because the included file also has a start symbol, we override its definition with the definition given in braces after the name of the file. Any other included definition could be overridden in this way. There are many other ways to combine definitions from several files, but we will not deal with them in this document; see [34, chapter 10] for more details.

Namespace prefixes are declared by a definition following the keyword namespace (line 2). To use the predefined types of XML Schema (figure 3.3), we similarly declare the prefix used for referring to them with the keyword datatypes (line 1). RELAX NG does not implement the XML Schema notions of key and keyref, so one must resort to the simpler (but often sufficient) DTD notions of ID and IDREF explained in section 3.1.
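The ID and IDREF constraint amounts to two checks: every ID value is unique, and every IDREF value matches some ID. The following Python sketch illustrates this on the code attribute of wine elements. The function name is hypothetical, and we assume for simplicity that the elements carry no namespace.

```python
from xml.etree import ElementTree as ET

def check_wine_codes(catalog, cellar):
    """Return (duplicate IDs, unresolved IDREFs) for the code attributes.

    catalog: element whose wine descendants carry code as an ID
    cellar:  element whose wine descendants carry code as an IDREF
    """
    ids = [wine.get("code") for wine in catalog.iter("wine")]
    # An ID value must be unique within the document.
    duplicates = sorted({c for c in ids if ids.count(c) > 1})
    known = set(ids)
    # Every IDREF value must correspond to an existing ID.
    unresolved = [wine.get("code") for wine in cellar.iter("wine")
                  if wine.get("code") not in known]
    return duplicates, unresolved
```

Unlike key and keyref, this check is global to the document: any ID can satisfy any IDREF, whatever the element that carries it.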

Listing 3.6: [CellarBook.rnc]: RELAX NG compact notation schema for the cellar book. It can validate listing 2.2. Compare it with listing 3.3.

    datatypes xs = "http://www.w3.org/2001/XMLSchema-datatypes"
    namespace cat = "http://www.iro.umontreal.ca/lapalme/wine-catalog"

    include "WineCatalog.rnc" {
 5      start = cellar-book
    }


    cellar-element = element cellar {
10      element wine {
            attribute code {xs:IDREF},
            element purchaseDate {xs:date},
            element quantity {xs:nonNegativeInteger},
            element rating {attribute stars {xs:positiveInteger}?}?,
15          element comment {Comment}?
        }*
    }

    name-element = element name {
20      element first {text}?
        & element family {text}?
        & element initial {text}?
    }

25  cellar-book = element cellar-book {
        wine-catalog,
        element owner {Owner},
        element location {Address},
        cellar-element
30  }

    Address = element street {text},
              element city {text},
              element province {"AB" | "BC" | "MB" | "NB" | "NL" | "NT" |
35                              "NS" | "NU" | "ON" | "QC" | "SK" | "YT"},
              element postal-code {PostalCodeCA}

    Owner = name-element, Address

40  PostalCodeCA = xs:string {pattern = "[A-Z][0-9][A-Z] [0-9][A-Z][0-9]"}

Should one need to manipulate a RELAX NG schema with a program, it would be simpler to use the corresponding RELAX NG XML notation, as illustrated in listing 3.7.

As we have obtained it automatically from the compact notation, we will not explain it further, but we want to point out that it is much simpler to write than the corresponding XML Schema because of the uniformity of the underlying concepts (everything is a pattern).
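As an illustration of why the XML notation is convenient for programs, the following Python sketch reads a RELAX NG schema with a standard XML parser and lists its definitions. The embedded schema is an abridged fragment modelled on listing 3.7, and the function name summarize is our own.

```python
from xml.etree import ElementTree as ET

RNG_NS = "http://relaxng.org/ns/structure/1.0"

# Abridged, well-formed fragment modelled on listing 3.7.
SCHEMA = """<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <start><ref name="cellar-book"/></start>
  <define name="cellar-element">
    <element name="cellar">
      <zeroOrMore>
        <element name="wine">
          <attribute name="code"><data type="IDREF"/></attribute>
        </element>
      </zeroOrMore>
    </element>
  </define>
</grammar>"""

def summarize(schema_text):
    """Return (names of definitions, start references, element names)."""
    grammar = ET.fromstring(schema_text)
    ns = {"rng": RNG_NS}
    defines = [d.get("name") for d in grammar.findall("rng:define", ns)]
    starts = [r.get("name") for r in grammar.findall("rng:start/rng:ref", ns)]
    # iter() walks the whole tree in document order.
    elements = [e.get("name") for e in grammar.iter("{%s}element" % RNG_NS)]
    return defines, starts, elements
```

Doing the same on the compact notation would require writing a dedicated parser for its grammar.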

Listing 3.7: [CellarBook.rng]: RELAX NG schema for the cellar book in XML notation, to be compared with listing 3.3. It was obtained automatically (using the Trang converter) from listing 3.6.

    <?xml version="1.0" encoding="UTF-8"?>
    <grammar xmlns="http://relaxng.org/ns/structure/1.0"
             datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
      <start>
        <ref name="cellar-book"/>
      </start>
      <define name="cellar-element">
        <element name="cellar">
          <zeroOrMore>
            <element name="wine">
              <attribute name="code">
                <data type="IDREF"/>
              </attribute>
              <element name="purchaseDate">
                <data type="date"/>
              </element>
              <element name="quantity">
                <data type="nonNegativeInteger"/>
              </element>
              <optional>
                <element name="rating">
                  <optional>
                    <attribute name="stars">
    [...]
        <ref name="cellar-element"/>

The beginning of listing 3.8 illustrates how to declare a default namespace for the elements of this file, which is included in listing 3.6 (line 4). The definition of elements follows the same principles as those explained for the cellar book. wine-catalog (line 6) must add an optional attribute xml:base that is used by the XML processor during the file inclusion process. It is needed in order to ensure the integrity of both the including and the included file. The definition of Format (line 34) shows that comments start with a # and go up to the end of the line. These comments are also preserved during the transformation to the XML notation in listing 3.9 (line 83).

Listing 3.8: [WineCatalog.rnc]: RELAX NG schema for the wine catalog in compact notation. It can validate the instance document of listing 2.3. It can be compared with listing 3.4.

    default namespace = "http://www.iro.umontreal.ca/lapalme/wine-catalog"
    datatypes xs = "http://www.w3.org/2001/XMLSchema-datatypes"

    start = wine-catalog

    wine-catalog = element wine-catalog {
        # needed because this schema will be imported
        attribute xml:base {text}?,
        element wine {Wine}*
10  }

    Wine = attribute name {text},
           attribute appellation {text},
           attribute classification {text},
15         attribute code {xs:ID},
           attribute format {Format},
           element properties {Properties},
           element origin {Origin},
           (element tasting-note {Comment}
20          | element food-pairing {Comment}
            | comment-element)*,
           element price {xs:decimal},
           element year {xs:gYear}


    Properties = element color {Color},
                 element alcoholic-strength {Percentage},
                 element nature {text}?

30  Origin = element country {text},
             element region {text},
             element producer {text}

    Format = "375ml" | "750ml" | "1l"
35         | "magnum"       # 1.5 litres
           | "jeroboam"     # 3 litres
           | "rehoboam"     # 4.5 litres
           | "mathusalem"   # 6 litres
           | "salmanazar"   # 9 litres
40         | "balthazar"    # 12 litres

Listing 3.9: [WineCatalog.rng]: RELAX NG schema for the wine catalog in XML notation. It can validate listing 2.3. It was obtained automatically (using the Trang converter) from listing 3.8 and slightly reformatted here to fit on the page.

     <?xml version="1.0" encoding="UTF-8"?>
     [...]
             <ref name="Format"/>
           </element>
         </choice>
       </define>
       <define name="Percentage">
         <data type="decimal">
           [facet parameters whose markup is missing; the surviving values are 0, 100 and 2]
         </data>
       </define>
128  </grammar>

3.4 Associating an Instance File to a Schema

An XML instance file can specify its validating schema by adding some information in the attributes of the root tag. This is illustrated in listing 2.2 (line 6), where we indicate the location of the schema with no namespace using the xsi:noNamespaceSchemaLocation attribute. We then include (using an xi:include element) the WineCatalog.xml file (listing 2.3) so that its elements can be referred to. In fact, the XML processor sees the full content of both files (i.e. the cellar and the wine catalog). Listing 2.1 illustrates the file inclusion mechanism and how the instance files are linked to their respective XML Schemas in listing 3.5.

xi:include refers to the W3C standard [26], which specifies a general-purpose inclusion mechanism to merge information from different XML files. It is possible to include only some well-formed parts of the included file, but here we include the whole wine catalog. This is a principled way of including information, unlike the mere character inclusion performed by the DTD system entities we used in section 3.1.1.
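The effect of the inclusion can be observed with the ElementInclude module of the Python standard library, which offers limited support for XInclude. The sketch below is only an illustration: the documents are tiny stand-ins for listings 2.2 and 2.3 without namespaces, the wine data is invented, and the in-memory loader replaces the reading of files from disk.

```python
from xml.etree import ElementTree as ET
from xml.etree import ElementInclude

# In-memory stand-in for the included file (hypothetical content).
FILES = {
    "WineCatalog.xml":
        '<wine-catalog><wine code="C1" name="Sample"/></wine-catalog>',
}

def memory_loader(href, parse, encoding=None):
    """Loader with the signature expected by ElementInclude.include."""
    text = FILES[href]
    # parse="xml" asks for a parsed element, parse="text" for a string.
    return ET.fromstring(text) if parse == "xml" else text

CELLAR_BOOK = """<cellar-book xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="WineCatalog.xml"/>
  <cellar><wine code="C1"/></cellar>
</cellar-book>"""

def expand(document_text):
    """Parse a document and replace its xi:include elements in place."""
    root = ET.fromstring(document_text)
    ElementInclude.include(root, loader=memory_loader)
    return root
```

After expansion, the xi:include element has been replaced by the wine-catalog element itself, so the tree is the one that the rest of the processing sees.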

Listing 2.2 also shows that even if a file is validated with an XML Schema, a DOCTYPE declaration can be added to define new entities. In fact, it is the only way to define an entity for a document validated with an XML Schema.

Listing 2.3 (line 1) shows how to link an instance file to its schema and define its namespace. The default namespace, defined by the xmlns attribute in the root tag (line 4), indicates that all element tags without a prefix belong to the http://www.iro.umontreal.ca/lapalme/wine-catalog namespace. The schema location is given as the value of the xsi:schemaLocation attribute (line 3), which holds two blank-separated values. The first indicates the namespace corresponding to the target namespace of the schema and the second gives its URI (here a local file).
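A program can recover the same information from the root tag. The following Python sketch splits the value of xsi:schemaLocation into namespace and location pairs; the function name is our own and the schema file name WineCatalog.xsd is an assumption made for the example.

```python
from xml.etree import ElementTree as ET

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

# Root tag modelled on listing 2.3; the .xsd file name is hypothetical.
DOC = """<wine-catalog
    xmlns="http://www.iro.umontreal.ca/lapalme/wine-catalog"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.iro.umontreal.ca/lapalme/wine-catalog WineCatalog.xsd"/>"""

def schema_locations(root):
    """Map each namespace named in xsi:schemaLocation to its schema URI."""
    tokens = root.get("{%s}schemaLocation" % XSI_NS, "").split()
    # Values come in pairs: namespace first, then the location of its schema.
    return dict(zip(tokens[0::2], tokens[1::2]))
```

Note that the parser reports the root element under its namespace-qualified name, which is the effect of the xmlns attribute described above.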

The RELAX NG specifications [20] do not prescribe how an instance file should be linked to its schema, so each XML editor or validator has an implementation-specific way of associating these files (either internally or externally). For example, <oXygen/> uses processing instructions inserted at the top of the file, such as the following (depending on whether the compact syntax is used or not).

    <?oxygen RNGSchema="CellarBook.rnc" type="compact"?>

    <?oxygen RNGSchema="CellarBook.rng" type="xml"?>
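Since such a processing instruction is ordinary XML, any tool can read it. As a minimal sketch, the following Python function looks for an oxygen processing instruction before the root element and extracts its two pseudo-attributes. The function name is hypothetical, and we assume the name="value" layout shown above, since the content of a processing instruction is free text for the XML parser.

```python
import re
from xml.dom import minidom

def oxygen_schema(document_text):
    """Return (schema file, syntax type) from an <?oxygen ...?> instruction,
    or None when the document has no such instruction at the top level."""
    doc = minidom.parseString(document_text)
    for node in doc.childNodes:
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "oxygen"):
            # The parser does not interpret the data; extract the pairs here.
            pairs = dict(re.findall(r'(\w+)="([^"]*)"', node.data))
            return pairs.get("RNGSchema"), pairs.get("type")
    return None
```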

3.5 Additional Information on XML Schema

Although XML schemas have been standardized, validation is still a research subject and alternatives have been proposed: see [24] for a comparison of some of them. Interesting links are being made with relational database models [25] in order to build on their strong theoretical background. Schemas and the validation process are also being formalized [15].

We have only skimmed over the subject of validation of XML files, but the same essential ideas apply throughout. In addition to the official and informal information available at www.w3.org/xml/Schema, some good sources of information and interesting tutorials can be found in the following resources:

http://www.XML.com is maintained by the publisher O’Reilly, with many excerpts from their books.

http://www.XML.org is a market-oriented site with interesting files in the resources section.

http://www.mulberrytech.com/quickref/XMLquickref.pdf is a very useful XML Syntax Quick Reference Sheet (US letter size).

http://www.xfront.com/xml-schema.html gives a complete tutorial in roughly 150 Microsoft PowerPoint slides.

http://www.xmlspy.com XMLSpy is a good commercial XML editor on the PC platform, complete with a powerful structure editor, internal validation and real-time suggestions of allowable elements and attributes (strangely, these suggestions are not adequate in the text view, i.e. the mode in which XML tags are explicitly typed). It is easy to switch between the text view and the structural view of the editor. There is also a good stylesheet designer module (Stylevision) to create stylesheet transformations interactively and graphically. These transformations can then be used as a basis for what is called the authentic view, which can effectively hide the XML tags from the user of an XML document.

http://www.oxygenxml.com/ <oXygen/> is a good XML editor for PC, Linux, MacOS X and Solaris. Real-time valid suggestions are offered in the text view. Validation can be done within the editor. Stylesheet transformations can be displayed in a window of the editor. It also features a tree editing mode and a graphical output of a schema similar to what is provided by XMLSpy. Unfortunately, it is not possible to edit the schema graphically.

http://www.thaiopensource.com/nxml-mode/ nXML mode in Emacs [18] offers real-time valid suggestions for editing XML files when their schema is written in RELAX NG. Trang can be used for translating a DTD into RELAX NG. The most interesting feature of nXML is its real-time validation during editing, as it incrementally reparses and validates the document during idle periods in the typing process.

Chapter 4 Document Transformation

Since XML is a tree-structured representation of information, it is relatively simple to process this information either to change its shape or to select some sub-trees. To achieve this, XML designers have defined the eXtensible Stylesheet Language (XSL) [16] technology, which refers to two components:

XSLT [16] a transformation language to convert an XML document into another XML document, into HTML, or into a plain text document (a very wide one-level tree!)

XSL-FO a platform- and media-independent formatting language composed of a set of XML elements, called formatting objects, that describe parts of a printed page at a high level, e.g. <block>, <table>, etc. These elements are most often produced by XSLT transformations of an XML document.

XSLT depends on XPath [19] (explained in section 4.1), a syntax to identify nodes in an XML document. This specification is separate because several other W3C specifications depend on it; we saw an example in section 3.2.3, where XPath expressions were used to define keys and keyrefs.

XSLT is an XML-based formalism to define production rules (similar to OPS5 or Prolog without unification) that match nodes in the tree of an XML document and produce a new tree. These rules are defined in stylesheets (XML files named with the .xsl extension) that can be validated with a predefined XSLT Schema. This transformation mechanism is very general and can be used to produce any kind of tree, but it is most often used for presentation, one simple kind of tree being an HTML document. In fact, most web browsers can process XML documents linked with XSLT stylesheets to display the resulting transformation. For example, Internet Explorer (figure 1.2) and Firefox have a predefined stylesheet for XML files that lets users explore them gradually by folding and unfolding elements.

In section 4.3.1, we will show how to transform our cellar book instance document into an HTML page with indented bulleted lists. We will see in section 4.3.2 how to create an HTML tabular presentation of our wine catalog. Section 4.3.3 illustrates features of stylesheets that make it possible to better select information and perform some simple calculations to produce information that was not present in the original XML file. We will then show, in section 4.4, how to transform our XML instance document into the compact text representation we presented in figure 1.4. Finally, we will illustrate in section 4.5 the use of Formatting Objects to produce a PDF output from an XML document.

4.1 XPath

Because XML documents are tree-structured, we must be able to designate nodes in their trees either absolutely (i.e. starting from the root) or relative to a given node. An XPath expression¹ refers to either a single node or a set of nodes in the document tree.

There are seven types of nodes in an XML document:

root the starting point of the document

element the most common type of node; it may contain other elements

text containing the real information; it cannot contain any element

attribute string information contained in the start-tag of the containing element; that element is its parent, although XPath does not count the attribute among the children of the element

comment information that is normally ignored for processing but that is nevertheless kept in the structure of the document

processing instruction nodes starting with <? that will not be discussed in this paper

namespace information about the namespace of an element; its processing will not be discussed in this section.

An XPath expression designating a set of nodes in the document tree consists of three parts.

1. an axis specifier gives the path to a set of nodes; we will only use here the abbreviated syntax, similar to the path notation used in computer systems to designate files and directories.

• / is the path separator between levels in the tree

• an absolute path from the root starts with /

• a relative path from the current node starts with something other than a /

• .. is the parent of the current node

• . is the current node

• // indicates a path with any number of intervening levels between two nodes; if // appears at the start of the expression, then it means any arbitrary path between a node and the root

Element names are used to select nodes in the path but attributes (preceded by @) can also be used. The unabbreviated syntax allows access to siblings or ancestors in the tree, but we will not use it in this document.

2. a node test can be the name of a node (with or without the namespace prefix), * to indicate all nodes, or it can be a function name such as node(), text() or comment() to indicate the type of the node that is looked for.

3. a predicate is a boolean expression given between square brackets ([]) that can further filter the set of nodes identified with the axis specifier and the node test. If the expression is a number i, then it refers to the i-th child element (numbering starts at 1).

A predicate can use variables (their creation will be shown later) by prefixing their name with $, string manipulation functions (concat(.,.), substring(.,.,.), ...), number functions (sum(.), floor(.), ...) and node set functions (position(), count(.), local-name(.), ...). We will give some examples of their use, but the full set can be found in the XPath reference document [19].

¹ In this report, we only use XPath 1.0 syntax [19]. Recently, a more involved specification, XPath 2.0 [11], has been proposed and is already implemented in some XML processors.
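The three parts can be tried out with the ElementTree module of the Python standard library, which supports a small subset of the abbreviated XPath syntax. The sketch below is an illustration under several assumptions: the document is a tiny invented stand-in for the cellar book, and since ElementTree evaluates paths relative to an element, each path starts with . (the root element) instead of /.

```python
from xml.etree import ElementTree as ET

# Tiny invented stand-in for the cellar book document.
DOC = """<cellar-book>
  <owner><name><first>Jane</first><family>Doe</family></name></owner>
  <cellar>
    <wine code="C1"><quantity>2</quantity></wine>
    <wine code="C2"><quantity>6</quantity><comment>gift</comment></wine>
  </cellar>
</cellar-book>"""

root = ET.fromstring(DOC)  # the cellar-book element

# Path with / between levels, relative to the current node.
codes = [w.get("code") for w in root.findall("./cellar/wine")]

# // allows any number of intervening levels.
families = [e.text for e in root.findall(".//family")]

# Node test *: all child elements of name, whatever they are called.
name_parts = [e.tag for e in root.findall("./owner/name/*")]

# Predicate on an attribute, introduced by @.
by_code = [w.get("code") for w in root.findall(".//wine[@code='C2']")]

# Predicate on the presence of a child element.
commented = [w.get("code") for w in root.findall("./cellar/wine[comment]")]
```

Functions such as text(), count() or sum() are outside the subset implemented by ElementTree; a complete XPath processor is needed for them.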

Table 4.1 presents examples of absolute XPath expressions that return a node or a set of nodes on the cellar book document shown in listing 2.2. The table makes explicit the three parts of an XPath expression. The XPath expressions of the table can be paraphrased as follows:

1. refers to the owner element of the cellar

2. returns the wines for which we have 2 bottles or less. The nodes returned are the wine elements even though the predicate uses an internal element; please note that the predicate is evaluated in the current context of the path specified. When XPath expressions are used in the context of an XSL file, as is most often the case, the < must be replaced by &lt; (even within strings!)

3. refers to the first wine of the cellar

4. returns the elements which contain a postal-code element. This is achieved by finding a postal-code anywhere in the tree from the root and then getting the parent element
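The four selections paraphrased above can be reproduced in Python as a rough check of their meaning. This is a sketch under stated assumptions: the XPath strings given in the comments are our guesses at expressions with the described effect, not a copy of table 4.1; the sample document and its data are invented; and because ElementTree has no comparison operators in predicates, the test on quantity is done in Python.

```python
from xml.etree import ElementTree as ET

# Invented document following the structure defined in listing 3.6.
DOC = """<cellar-book>
  <owner>
    <name><first>Jane</first><family>Doe</family></name>
    <street>1 Main St</street><city>Montreal</city>
    <province>QC</province><postal-code>H3C 3J7</postal-code>
  </owner>
  <location>
    <street>2 Side St</street><city>Montreal</city>
    <province>QC</province><postal-code>H2X 1Y4</postal-code>
  </location>
  <cellar>
    <wine code="C1"><quantity>2</quantity></wine>
    <wine code="C2"><quantity>6</quantity></wine>
    <wine code="C3"><quantity>1</quantity></wine>
  </cellar>
</cellar-book>"""

root = ET.fromstring(DOC)  # the cellar-book element

# 1. the owner element, e.g. /cellar-book/owner
owner = root.find("./owner")

# 2. wines with 2 bottles or less, e.g. /cellar-book/cellar/wine[quantity<=2];
#    the predicate is evaluated relative to each wine, and wines are returned.
few = [w for w in root.findall("./cellar/wine")
       if int(w.findtext("quantity")) <= 2]

# 3. the first wine of the cellar, e.g. /cellar-book/cellar/wine[1]
first = root.find("./cellar/wine[1]")

# 4. elements containing a postal-code, e.g. //postal-code/..
holders = root.findall(".//postal-code/..")
```

In the fourth case, the result contains both owner and location: each postal-code found anywhere in the tree contributes its parent.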
