Schema - XML: Looking at the Forest Instead of the Trees

As we have seen in the previous section, a DTD describes some constraints on the order and nesting of elements in an XML file, but the kinds of constraints it can express are quite limited and it does not allow any validation of the character content of elements. There are other drawbacks as well: all element names in a DTD must be unique, so combining separately developed DTDs can become quite cumbersome. Moreover, a DTD file is not a well-formed XML file, so one cannot easily use an XML tool to create or process it. This is why XML Schema was introduced, with a comprehensive set of elementary types and a way to combine them to create new types. The concept of namespaces (presented in section 2.1) is also used to facilitate the combination of independent files without name clashes.

A Schema is a well-formed XML file (usually with a .xsd extension) that defines types which are used to validate the elements of an XML file. In a way similar to variable declarations in a programming language, we can define types² for many elements instead of using inline definitions of embedded elements. In a Schema, there are two kinds of types: simple and complex. A simple type defines constraints on the text content of an element that cannot contain any other element. A complex type can contain nested elements.
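The distinction between the two kinds of types can be sketched as follows. This is only an illustration: the names Rating, Tasting, rating and note are hypothetical and do not come from the listings of this chapter.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Simple type (hypothetical name): constrains text content only,
       here a positive integer not greater than 5. -->
  <xs:simpleType name="Rating">
    <xs:restriction base="xs:positiveInteger">
      <xs:maxInclusive value="5"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Complex type (hypothetical name): contains nested elements. -->
  <xs:complexType name="Tasting">
    <xs:sequence>
      <xs:element name="rating" type="Rating"/>
      <xs:element name="note" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>

  <!-- Both types are declared once and can then be reused by name. -->
  <xs:element name="tasting" type="Tasting"/>
</xs:schema>
```

With such a schema, an instance like `<tasting><rating>4</rating><note>Fruity</note></tasting>` would be accepted, whereas a rating of 7 would be rejected.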

There are many different ways of organizing a Schema, as described by Van der Vlist [33]: one can use a Russian doll approach, in which a single element is defined with all embedded elements defined internally; another way is to use a bottom-up approach, in which the elements are defined before being used in more complex elements; it is also possible to use a top-down approach that first defines the higher level elements before defining the lower level ones. All these styles of definition are possible and we will sometimes use a mix of them in order to show some features of XML Schema.
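To make these styles concrete, here is a minimal sketch contrasting the Russian doll and bottom-up approaches. The element names (bottle, label, year, origin, country, region) are assumptions chosen for the illustration, not names taken from our application.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Russian doll: one global element, everything else defined inline. -->
  <xs:element name="bottle">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="label" type="xs:string"/>
        <xs:element name="year" type="xs:gYear"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <!-- Bottom-up: the building blocks are declared first ... -->
  <xs:element name="country" type="xs:string"/>
  <xs:element name="region" type="xs:string"/>

  <!-- ... and then referenced from the more complex element. -->
  <xs:element name="origin">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="country"/>
        <xs:element ref="region"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

Since the order of global declarations does not matter to a validator, a top-down version would contain the same declarations as the bottom-up one, simply written with origin before country and region.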

Table 3.2 presents the XML elements we use in our example to define the types needed for the validation of our wine catalog and cellar book. Since a schema is itself an XML file, it is important to distinguish the elements defining the Schema from the elements being defined. This is done by using different namespaces: the defining elements (definiens) use the xs: prefix (xsd: is also commonly used), whereas the defined elements (definiendum) have no prefix, i.e. they are in the default namespace. Contrary to a DTD, an XML Schema is itself a valid XML file, so it can be validated using the XML Schema of XML Schemas, which is usually included in all XML editors.

An XML Schema has an xs:schema element as root, which can contain different kinds of definition elements.

² We follow the Java convention of starting type identifiers with an upper case letter. Element identifiers start with a lower case letter. In a name comprising more than one word, each word starts with an upper case letter; no underscores or dashes are used.

<xs:schema targetNamespace="URI">

Table 3.2: A reminder of the subset of XML Schema syntax used in listings 3.3 and 3.4.

Names in italics refer to other elements. An NCName (non-colonized name) is a name without a namespace prefix. Regular expressions are used to describe the allowed forms: braces are used for grouping, ? indicates that the preceding grouping is optional, * that it can be repeated and + that it can be repeated but must occur at least once.

• xs:import allows the combination of different schemas into a single one; in our case, the schema for the wine catalog is imported into that of the cellar book.

• xs:simpleType gives supplementary constraints on predefined types; this is explained further in section 3.2.1.

• xs:complexType defines a new type in terms of a choice or a sequence of other types; xs:group gives a name to an incomplete type. xs:attribute elements are given at the end of the definition, even though the attributes appear in the start-tag.

• xs:element is the fundamental way of defining an element that can appear in an instance file. It can be given with a name and a type, it can refer to another element definition, or it can be defined with an anonymous simple or complex type followed by key and keyref definitions.

• xs:sequence (resp. xs:choice) combines other elements by making sure that they occur sequentially (resp. alternatively, i.e. only one of the elements can appear).

• xs:group clusters elements that can be used together.

• xs:attribute gives the name and the simple type associated with an attribute. Attributes are unordered, and they are optional unless their use attribute is given the value required.

• xs:restriction gives range or pattern constraints on the value of a simple type; xs:enumeration limits the allowed value to one of a given list.

• xs:key, xs:unique and xs:keyref define co-occurrence constraints that will be explained in section 3.2.3.
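As an illustration of two of the less familiar items in this list, the following sketch shows a named xs:group used inside a complex type, followed by a required and an optional attribute. The Owner type and the address group echo names that appear in listing 3.3, but the content given to them here (street, city, id, nickname) is purely an assumption made for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- xs:group names a reusable cluster of elements; its content
       here is hypothetical. -->
  <xs:group name="address">
    <xs:sequence>
      <xs:element name="street" type="xs:string"/>
      <xs:element name="city" type="xs:string"/>
    </xs:sequence>
  </xs:group>

  <xs:complexType name="Owner">
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
      <xs:group ref="address"/>
    </xs:sequence>
    <!-- Attributes are declared after the content model. -->
    <!-- Required because of use="required". -->
    <xs:attribute name="id" type="xs:string" use="required"/>
    <!-- Optional by default. -->
    <xs:attribute name="nickname" type="xs:string"/>
  </xs:complexType>

  <xs:element name="owner" type="Owner"/>
</xs:schema>
```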

In the rest of this section, we first give the XML Schemas in their entirety: listings 3.3 and 3.4 correspond to the DTDs given in listings 3.1 and 3.2 respectively. Figure 3.1 gives the overall structure of the XML Schema of listing 3.3. Figure 3.2 gives the overall structure of the Schema in listing 3.4. As we will see, the validation of the text content of the elements can be much more thorough with an XML Schema than with a DTD.

We will then explain the structure of the type system: first simple types (section 3.2.1), then complex types (section 3.2.2) and finally how to define keys and their references (section 3.2.3).

Figure 3.1: Graphical view of the Schema for the cellar book (listing 3.3). A name in a rectangular box is an element name or, if preceded by @, an attribute name. A complex type name is preceded by a square and a simple type name by a triangle. A sequence is shown with 4 dots horizontally aligned in a hexagon and a choice with the 4 dots aligned vertically (see figure 4.3 for an example). A + after a box indicates that further details have been omitted.

Three small squares in front of an element name indicate that its definition is referred to somewhere else in the schema; the reference is indicated by a small arrow at the bottom right of the rectangle. The figure was produced by the <oXygen/> XML editor from the XML Schema file given in listing 3.3.

Listing 3.3: [CellarBook.xsd]: XML Schema for the cellar book. It can validate the instance file in listing 2.2. It can be compared with the DTD in listing 3.1.

     <?xml version="1.0" encoding="UTF-8"?>
       <xs:sequence>
 90    <xs:complexType name="Owner">
         <xs:sequence>
           <xs:element ref="name"/>
           <xs:group ref="address"/>
         </xs:sequence>
 95    </xs:complexType>
       <xs:simpleType name="PostalCodeCA">
         <xs:restriction base="xs:string">
           <xs:pattern value="[A-Z][0-9][A-Z] [0-9][A-Z][0-9]"/>
100      </xs:restriction>
       </xs:simpleType>
     </xs:schema>

Figure 3.2: Graphical view of the Schema for the wine catalog (listing 3.4). See caption of figure 3.1 for an explanation of symbols used in the figure.

Listing 3.4: [WineCatalog.xsd]: Schema for the wine catalog. It can validate the instance document shown in listing 2.3. It can be compared with the DTD in listing 3.2

     <?xml version="1.0" encoding="UTF-8"?>
       <xs:element name="price" type="xs:decimal"></xs:element>
         </xs:enumeration>
       </xs:simpleType>
       <xs:simpleType name="Percentage">
         <xs:restriction base="xs:decimal">
138        <xs:minInclusive value="0"/>
           <xs:maxInclusive value="100"/>
           <xs:fractionDigits value="2"/>
         </xs:restriction>
       </xs:simpleType>
143  </xs:schema>

3.2.1 Simple Types

A simple type is a primitive datatype such as xs:string, xs:decimal, xs:double or xs:date (XML Schema has 19 of them, shown in figure 3.3) or a derivation of a primitive datatype. A derivation is a restriction on the original type, such as constraining the maximum length of a string, giving a list of acceptable values, or requiring that the value match a regular expression. Figure 3.3 shows a number of built-in derived types: xs:normalizedString, xs:integer and all the types that are derived from them (i.e. appear under them). Users can also define their own simple types using the xs:simpleType element.

We can see uses of simple types in listing 3.3: stars (line 22), which must be not only an integer but a positive one, and first (line 34), which is a string (essentially the same thing as #PCDATA in a DTD). The listing also contains definitions of simple types: one constrains a string to be one of several choices, such as ProvinceCA (line 72); another requires the string to match a regular expression, such as PostalCodeCA (line 97).

A new simple type can also be created using a list (allowing a series of primitive type values) or a union (allowing one of many primitive types). It is thus possible to define a whole gamut of types. These are quite straightforward if one refers to the specification [31, 13], so they will not be described further here.
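The following self-contained sketch gathers the kinds of simple types just mentioned. The PostalCodeCA pattern is the one visible at the end of listing 3.3; the use of xs:positiveInteger for stars and the two enumeration values of ProvinceCA are assumptions made for the illustration, since the corresponding lines are not reproduced here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Built-in types used directly (the type of stars is assumed). -->
  <xs:element name="stars" type="xs:positiveInteger"/>
  <xs:element name="first" type="xs:string"/>

  <!-- Restriction by enumeration; the values are illustrative only. -->
  <xs:simpleType name="ProvinceCA">
    <xs:restriction base="xs:string">
      <xs:enumeration value="ON"/>
      <xs:enumeration value="QC"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Restriction by regular expression: letter digit letter,
       a space, then digit letter digit. -->
  <xs:simpleType name="PostalCodeCA">
    <xs:restriction base="xs:string">
      <xs:pattern value="[A-Z][0-9][A-Z] [0-9][A-Z][0-9]"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
```

Note that an xs:pattern is implicitly anchored: the whole value must match, so no ^ or $ is needed.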

3.2.2 Complex Types

A complex type can contain element declarations, element references and attribute declarations. We will illustrate some of these possibilities with listing 3.3.

An element declaration is made with an xs:element giving the name of the element and its type, which can either be defined inline as the content of the xs:element, such as for cellar (line 8), or be indicated with the type attribute, as for wine (line 11) or purchaseDate (line 18).

A complex type is defined either by a sequence of elements contained in an xs:sequence element (e.g. cellar (line 8)) or by a choice between many elements contained in an xs:choice element, such as within name (line 30). Attributes are defined after the definitions of the elements in sequence or in choice, even though they appear in the start-tag (see code (line 27) as an attribute of wine (line 11)).
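These two ways of giving a type to an element can be sketched as follows. The element and type names come from the discussion above, but the content models are simplified assumptions and do not reproduce listing 3.3.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Type defined inline, as an anonymous complex type. -->
  <xs:element name="cellar">
    <xs:complexType>
      <xs:sequence>
        <!-- Type indicated by name with the type attribute. -->
        <xs:element name="wine" type="Wine" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:complexType name="Wine">
    <xs:sequence>
      <xs:element name="purchaseDate" type="xs:date"/>
    </xs:sequence>
    <!-- The attribute is declared after the sequence, although it
         appears in the start-tag: <wine code="...">. -->
    <xs:attribute name="code" type="xs:string" use="required"/>
  </xs:complexType>
</xs:schema>
```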

Figure 3.3: Built-in datatypes for XML Schema. ur-types serve as the root of the type hierarchy for all derivations. ur is the German prefix meaning ancestral, as in Ursprung (beginning). Figure taken from section 3 of XML Schema Part 2: Datatypes [13].

xs:choice and xs:sequence can be nested. For example, name (line 30) indicates a choice between three elements of type string (first, family and initial) which can be repeated any number of times. Indeed, because minOccurs keeps its default value of 1 and maxOccurs is set to "unbounded", each element can appear as often as we wish.
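A repeated choice of this kind could be written as in the sketch below. It is reconstructed from the description above and is not a copy of the corresponding lines of listing 3.3.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="name">
    <xs:complexType>
      <!-- minOccurs defaults to 1; maxOccurs="unbounded" lets the
           choice be made again and again, in any order. -->
      <xs:choice maxOccurs="unbounded">
        <xs:element name="first" type="xs:string"/>
        <xs:element name="family" type="xs:string"/>
        <xs:element name="initial" type="xs:string"/>
      </xs:choice>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

Under this definition, `<name><first>Ann</first><initial>B</initial><family>Smith</family></name>` is valid, and so is a name with two family elements, but an empty name element is not, since the choice must be made at least once.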

An existing element can also be referred to using the ref attribute, like name (line 30) used in Owner (line 90). But be aware that in this case, if you had mistakenly used the name attribute instead of ref, you would have declared a new element with no connection to the one you wished to reference; this can lead to errors that are difficult to track down.

In listing 3.4, the mixed="true" attribute in the definition of a type (see Comment (line 113)) means that character data can also appear between the elements described by the content of the type. In this case, character data can thus be interspersed with any number of emph and bold elements.
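The pitfall can be seen by putting the two forms side by side. In this sketch the type OwnerMistake is hypothetical and exists only to show the error; the global name element is simplified to a string.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- Global declaration that we intend to reuse (simplified). -->
  <xs:element name="name" type="xs:string"/>

  <xs:complexType name="Owner">
    <xs:sequence>
      <!-- ref: reuses the global declaration above. -->
      <xs:element ref="name"/>
    </xs:sequence>
  </xs:complexType>

  <xs:complexType name="OwnerMistake">
    <xs:sequence>
      <!-- name: declares a NEW local element that merely shares the
           same name. With no type given it accepts any content, so
           the schema stays valid and the slip goes unnoticed. -->
      <xs:element name="name"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
```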
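A mixed content type along these lines might look like the following sketch. The repeated choice and the string type of emph and bold are assumptions, as the actual definition of Comment is not reproduced here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- mixed="true" allows text between the child elements. -->
  <xs:complexType name="Comment" mixed="true">
    <xs:choice minOccurs="0" maxOccurs="unbounded">
      <xs:element name="emph" type="xs:string"/>
      <xs:element name="bold" type="xs:string"/>
    </xs:choice>
  </xs:complexType>

  <xs:element name="comment" type="Comment"/>
</xs:schema>
```

An instance such as `<comment>A <emph>lovely</emph> wine with <bold>firm</bold> tannins.</comment>` would then be accepted. Without mixed="true", the text outside emph and bold would make it invalid.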

3.2.3 Keys and Keyrefs

As shown at the bottom left of figure 3.3, the DTD's ID and IDREF are built-in XML Schema types and thus allow the simple uniqueness and reference constraints that we explained in section 3.1. But XML Schema has also defined a much more involved³ system using xs:key and xs:unique elements to define uniqueness constraints on values, and xs:keyref to refer to these values.

Within the wine catalog (listing 3.4), to ensure that each wine has a different code attribute, we add constraints after the xs:complexType element within the wine-catalog element (line 12):

• an xs:key element, to which we give the name WineNumber (line 20) to be used for referencing; WineNumber will never appear in the XML instance file, it is used internally by the validator. A key is defined in two parts: an xs:selector, which identifies the scope within which the key must appear only once, and an xs:field, which indicates the value that will be used in the equality comparisons for the keys. If more than one xs:field element is present, they are considered as forming a tuple of values that must be distinct, i.e. two tuples must differ in at least one of their components. The values designated by these elements are indicated by an XPath expression⁴ given as the value of the xpath attribute.

• an xs:unique element, using xs:selector and xs:field elements as for xs:key. xs:unique defines the same type of constraint as an xs:key, except that the values so defined cannot be referenced by xs:keyref elements. On line 25, we ensure that the combination of the name and appellation attributes is unique for each wine.

³ In fact so involved and complex that the RELAX NG designers decided to leave it out of their proposal.

⁴ XPath syntax will be explained in section 4.2 but, for the moment, we only need to know that each level in the tree is separated by a forward slash; each element is designated by its name, and an attribute name is preceded by @.
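The two constraints can be sketched as follows. To keep the XPath expressions free of prefixes, this sketch assumes a schema without target namespace, unlike listing 3.4; the name NameAppellation given to the xs:unique element is hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="wine-catalog">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="wine" maxOccurs="unbounded">
          <xs:complexType>
            <xs:attribute name="code" type="xs:string" use="required"/>
            <xs:attribute name="name" type="xs:string" use="required"/>
            <xs:attribute name="appellation" type="xs:string"
                          use="required"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>

    <!-- Constraints come after the xs:complexType of the element. -->
    <!-- Each wine of the catalog must have a distinct code. -->
    <xs:key name="WineNumber">
      <xs:selector xpath="wine"/>
      <xs:field xpath="@code"/>
    </xs:key>

    <!-- The (name, appellation) pair must be distinct for each wine;
         it cannot be the target of an xs:keyref. -->
    <xs:unique name="NameAppellation">
      <xs:selector xpath="wine"/>
      <xs:field xpath="@name"/>
      <xs:field xpath="@appellation"/>
    </xs:unique>
  </xs:element>
</xs:schema>
```

The xs:selector path is relative to the element carrying the constraint (here wine-catalog), and each xs:field path is relative to the nodes chosen by the selector.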

The wine code identified by WineNumber will be used in the description of the cellar (listing 3.3). It is the value associated with the code attribute (line 27) of the Wine type (line 16) used to define the element wine (line 11). To make the wine code of the cellar match a code in the catalog, we define an xs:keyref element with xs:selector and xs:field sub-elements (as we have done for xs:key and xs:unique) but, in this case, the value identified must match an existing value of an xs:key element.

Considering the above, we would expect the definition of the xs:keyref element to appear after the type definition of the cellar element (line 8). But, for implementation reasons, the xs:keyref element should appear at a level high enough to cover the uniqueness domain of the key (the whole catalog in this case). This is why the xs:keyref is defined (line 57) within the cellar-book element.⁵
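The placement of the key reference can be sketched in a single, simplified schema. This is not the organisation of listings 3.3 and 3.4, which use two files and namespaces; the type Coded and the constraint name CellarWineRef are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="cellar-book">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="wine-catalog">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="wine" type="Coded"
                          maxOccurs="unbounded"/>
            </xs:sequence>
          </xs:complexType>
          <!-- The key is defined on the catalog. -->
          <xs:key name="WineNumber">
            <xs:selector xpath="wine"/>
            <xs:field xpath="@code"/>
          </xs:key>
        </xs:element>
        <xs:element name="cellar">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="wine" type="Coded"
                          maxOccurs="unbounded"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>

    <!-- The reference sits on cellar-book, the element that contains
         both the catalog (where the keys are) and the cellar (where
         the references are). -->
    <xs:keyref name="CellarWineRef" refer="WineNumber">
      <xs:selector xpath="cellar/wine"/>
      <xs:field xpath="@code"/>
    </xs:keyref>
  </xs:element>

  <xs:complexType name="Coded">
    <xs:attribute name="code" type="xs:string" use="required"/>
  </xs:complexType>
</xs:schema>
```

With such a schema, a cellar wine whose code does not occur among the catalog wines should be reported as an error by a conforming validator.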

3.2.4 Namespaces in Schemas

We have introduced namespaces for instance documents in section 2.1, but they show their full power during the validation process in which the combination of element names and namespaces must correspond between the instance and the schema. Of course, namespaces must be properly combined during file inclusion and the details can become quite intricate.

We will illustrate with listings 3.3 and 3.4 a simple but quite frequent case. These two schemas define a wine element having different meanings and content, which must be well distinguished. This is achieved with namespace declarations. A similar kind of name clash would occur if one needed to use an element called type or sequence, names that are already used by the schema vocabulary. This is why we define a namespace prefix (usually xs or xsd) for the names of an XML Schema.

By default, names without prefixes are defined in the empty namespace or in the namespace assigned to the xmlns attribute. To create elements in a specific namespace (and not the empty one), we set a value for the targetNamespace attribute, as is done in listing 3.4 (line 6). The same namespace is also assigned to the prefix cat. In order for all global elements and types of an included file to be visible in the including file, elementFormDefault should be assigned qualified and attributeFormDefault, unqualified, as is seen on lines 2 and 2 of listing 3.4.

The importation of the elements of an external schema file along with its namespace is done using xs:import, as shown in listing 3.3 (line 5), indicating both the namespace used here for the target namespace of the imported file (here we keep the same) and the location of the file to be imported. The imported namespace must be given a prefix definition with an xmlns declaration, as we do in the xs:schema opening tag on the first line of listing 3.3.
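The attributes discussed here could appear together as in the following sketch. The namespace URI http://www.example.org/wine-catalog is a placeholder chosen for the illustration, not the URI used in listing 3.4, and the content of the schema is reduced to a minimum.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:cat="http://www.example.org/wine-catalog"
           targetNamespace="http://www.example.org/wine-catalog"
           elementFormDefault="qualified"
           attributeFormDefault="unqualified">

  <!-- Declared in the target namespace. -->
  <xs:element name="wine-catalog">
    <xs:complexType>
      <xs:sequence>
        <!-- The type is in the target namespace too, hence cat:. -->
        <xs:element name="wine" type="cat:Wine"
                    maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:complexType name="Wine">
    <!-- attributeFormDefault="unqualified": code takes no prefix
         in instance documents. -->
    <xs:attribute name="code" type="xs:string" use="required"/>
  </xs:complexType>
</xs:schema>
```

With elementFormDefault="qualified", the local wine element is also placed in the target namespace, so an instance document can put every element in one namespace with a single xmlns declaration.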

Because the name associated with the target namespace of the imported file is the same as the one associated with the cat prefix, we use cat:wine to refer to the wine element of listing 3.4. Namespaces and importation in RELAX NG schemas are similar in principle to what we have shown for XML Schema.

⁵ It seems that there is a bug in the XMLSpy 2006 validator when validating the XML instance file (listing 2.2) because of the xs:keyref element; strangely, this file validates with the public domain validators but not with XMLSpy. For the moment, we suggest commenting out the keyref element definition instead of reorganising the whole definition as suggested by the XMLSpy support people.
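An importing schema could be organised as in the sketch below. It assumes that WineCatalog.xsd has the placeholder target namespace http://www.example.org/wine-catalog and declares wine-catalog as a global element; it also assumes, for simplicity, that the importing schema itself has no target namespace, which may differ from listing 3.3.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:cat="http://www.example.org/wine-catalog">

  <!-- namespace must equal the targetNamespace of the imported file;
       schemaLocation tells the validator where to find it. -->
  <xs:import namespace="http://www.example.org/wine-catalog"
             schemaLocation="WineCatalog.xsd"/>

  <xs:element name="cellar-book">
    <xs:complexType>
      <xs:sequence>
        <!-- The cat: prefix designates the imported declaration. -->
        <xs:element ref="cat:wine-catalog"/>
        <!-- A local element of the importing schema. -->
        <xs:element name="cellar" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

Only global declarations of the imported schema can be referenced in this way, which is why the sketch refers to cat:wine-catalog rather than to an element nested inside it.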

We can now better understand how namespaces are used in the instance documents and how they are linked to their schemas. For example, the first lines of listing 2.3 define the namespace associated with the null prefix (i.e. only the element name) as the value of the xmlns attribute. We also indicate the namespace and the location of the Schema to be used for validation as the value of the xsi:schemaLocation attribute. The xsi prefix must also be defined by an attribute starting with xmlns:. Because all elements defined in this file are in the same namespace, to which we have assigned the null prefix, no namespace prefix is used in this file.
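The start of such an instance document might look like this sketch. The namespace URI and the wine code are placeholders and are not taken from listing 2.3.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- xmlns: namespace of all unprefixed elements (null prefix).
     xmlns:xsi: declares the xsi prefix.
     xsi:schemaLocation: pairs of "namespace location" values. -->
<wine-catalog
    xmlns="http://www.example.org/wine-catalog"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.example.org/wine-catalog WineCatalog.xsd">
  <wine code="W0001"/>
</wine-catalog>
```

The value of xsi:schemaLocation is a whitespace-separated list in which each namespace name is followed by the location of a schema for it; validators treat it as a hint rather than an obligation.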

3.2.5 Overview of the XML Schemas of Our Application

Let us come back to listing 2.1, which shows the outline of our XML instance files validated by the two XML Schemas described in this section. The outline of these schemas is shown in listing 3.5. These listings show the inclusion of both the instance file and the corresponding XML Schema of the wine catalog into the cellar book. Note the use of the namespace prefixes in both the XML instance and the corresponding XML Schema files. Boxes in listings 2.1 and 3.5 correspond to boundaries of namespaces⁶. In listing 3.5, we see that references from outside the box to the inside need to use the namespace prefix cat: for the SQA-code type (line 18), the wine-catalog element (line 24) or the WineNumber key name (line 30).

These listings show another interesting use of namespaces: to make sure that relative references are preserved, an xml:base attribute (line 5 within the included box) is added to the root element. This is why xml:base is added as an attribute in the WineCatalog.xsd schema (line 16 of the included listing of listing 3.5). xml:base is itself a special XML attribute whose definition must also be imported (line 8 of the included listing of listing 3.5).

⁶ Here, for simplicity (which should be the rule), we have kept the same name for the namespaces, but we could have changed it between the instance and the schema.

Listing 3.5: Outline of CellarBook.xsd which imports (line 4) WineCatalog.xsd which uses
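Allowing xml:base on an element can be sketched as follows. The schema location http://www.w3.org/2001/xml.xsd is the schema that the W3C publishes for the xml: namespace; whether listing 3.5 uses this location or a local copy is an assumption, and the content of wine-catalog is reduced to a minimum.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- The xml: prefix is predefined, but the declarations of its
       attributes (xml:base, xml:lang, ...) must still be imported. -->
  <xs:import namespace="http://www.w3.org/XML/1998/namespace"
             schemaLocation="http://www.w3.org/2001/xml.xsd"/>

  <xs:element name="wine-catalog">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="wine" type="xs:string"
                    minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <!-- Reference to the imported attribute declaration. -->
      <xs:attribute ref="xml:base"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
```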

Compact Syntax (RNC)                  XML syntax (RNG)

{ default? namespace id = URI         <grammar>
  | datatypes id = URI }*               { <start> pattern </start>
{ start = pattern                         | <define name="NCName"> pattern+ </define> }*
  | id = pattern }*                   </grammar>

element QName {{ pattern }}           <element name="QName"> pattern+ </element>
attribute QName {{ pattern }}         <attribute name="QName"> pattern+ </attribute>
pattern { , pattern }+                <group name="QName"> pattern+ </group>
pattern { & pattern }+                <interleave name="QName"> pattern+ </interleave>
pattern { | pattern }+                <choice name="QName"> pattern+ </choice>
pattern ?                             <optional name="QName"> pattern+ </optional>
pattern *                             <zeroOrMore name="QName"> pattern+ </zeroOrMore>
pattern +                             <oneOrMore name="QName"> pattern+ </oneOrMore>
mixed {{ pattern }}                   <mixed name="QName"> pattern+ </mixed>
id                                    <ref name="NCName"/>
empty                                 <empty/>
text                                  <text/>
dataTypeValue                         <value { name="NCName" }?> string </value>
dataTypeName {{ { id = value }* }}    <data { type="NCName" }?>
                                        { <param name="NCName"> string </param> }*
                                      </data>

Table 3.3: Reminder of the RELAX NG Compact and RELAX NG XML syntax used in our examples. The top cells of the table give the start of the file for RNC and the root element for RNG. Each line of the bottom cells is a different pattern that can be combined almost freely with the others. The corresponding RNC and RNG elements appear on the same line of the bottom cells of the table. Fat braces ({{ }}) are braces that are terminal of
