Using XHTML, HTML or XML within ONIX text fields

Given the frequent requirement for ONIX messages to convey product information in a form suitable for use in web pages, guidance is provided below on how to incorporate web content in an ONIX product record.

However, in order to apply this guidance correctly, a user must already have some knowledge of different forms of web content. Those already familiar with the differences between HTML and XHTML may skip this section.

Web content that is largely text-based is generally styled for presentation in a web page using the HyperText Markup Language (HTML). HTML has been the language of the World Wide Web since its inception and is still the most popular language for constructing web pages. HTML was based upon the Standard Generalized Markup Language (SGML), which has been in use for preparing electronic content in academic and professional publishing since the early 1990s.

XML was developed in the late 1990s as demand grew for ways to use the web for exchanging data and messages that didn’t have to be presented as human-readable web pages. XML is a much stricter language than SGML, so it is generally not possible to incorporate HTML-tagged content directly into an XML message. Responding to demand to make it possible to embed HTML in XML, the World Wide Web Consortium has defined an XML-compatible version of HTML, called XHTML. XHTML text fragments can be embedded in XML messages, provided this is allowed by the tagging rules of the XML application.

The tagging rules of ONIX specify that XHTML text fragments may be embedded in certain ONIX data elements, but within very strict constraints. These constraints are set out in section X.14 below, along with a list of the appropriate data elements.

HTML text fragments – and indeed any fragment of plain or tagged text, regardless of the tagging language – can also be embedded in ONIX data elements, but only by using XML techniques that ‘hide’ these fragments from any XML-aware software that is processing the ONIX message. Two such methods for embedding HTML or other tagged text in an ONIX data element are described in section X.15 below. These methods are available by default in all XML applications, and cannot be prohibited in ONIX applications, but their use is strongly discouraged. All ONIX users are encouraged to convert HTML text fragments to be valid XHTML fragments before incorporating them in ONIX messages. In any event, these methods should only be used in, and HTML should only be embedded in, the ONIX data elements that may also be used with XHTML (and which are listed below). The range of HTML tags used should be restricted in the same way as the usable XHTML tags.

X.14 XHTML (Version 1.0 or later)

The ONIX Product Information Message DTD and the XSD and RNG schemas enable the inclusion of XHTML-tagged text within specific data elements where this has been deemed appropriate. This is, for example, the expected way to include multiple paragraphs of text in long descriptive data. In these cases the data element may contain any well-formed fragment of XHTML-tagged text with the following restrictions:

1. It must be the case that, if the fragment were to be placed in an otherwise empty <body> element in an XHTML document, the resulting document would be valid;

2. The fragment may not include any XHTML forms, embedded objects, or script or document revision elements;

3. The fragment may not use ‘event’ attributes and others that may affect browser behaviour;

4. The fragment may not include special character named entity references (other than the five predefined in XML itself: &amp;, &lt;, &gt;, &apos; and &quot;).

The intention of the first three of these restrictions is to prevent the unwitting or malicious transmittal of viruses in ONIX messages. The intention of the fourth of these restrictions is to enable validation of ONIX for Books messages against any of the three schema formats in which the ONIX for Books schemas are available.

Note also that some ONIX recipients may be reluctant to use XHTML text that contains links, images or tables, or that uses attributes such as style.
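A sender might run a rough pre-flight check on a fragment before placing it in an ONIX message. The following Python sketch covers only part of the restrictions above: well-formedness (which also catches named entities that XML does not predefine), lower-case tags, a deny-list of elements, and event attributes. It does not test validity against the ONIX schemas. The function name check_fragment and the contents of DISALLOWED_ELEMENTS are assumptions made for illustration; the authoritative rules are those in the schemas.

```python
import xml.etree.ElementTree as ET

# Illustrative deny-list only (an assumption): forms, embedded objects,
# script and document revision elements.
DISALLOWED_ELEMENTS = {"form", "input", "select", "textarea", "button",
                       "object", "embed", "applet", "script", "ins", "del"}


def check_fragment(fragment: str) -> list:
    """Return a list of problems found in an XHTML fragment (empty if none)."""
    try:
        # Wrapping in <body> gives the fragment a single root element.
        root = ET.fromstring(f"<body>{fragment}</body>")
    except ET.ParseError as err:
        # Raised for unclosed or badly nested tags, and for named entities
        # such as &nbsp; that are not predefined in XML.
        return [f"not well-formed: {err}"]

    problems = []
    for elem in root.iter():
        if elem.tag != elem.tag.lower():
            problems.append(f"tag not lower case: <{elem.tag}>")
        if elem.tag in DISALLOWED_ELEMENTS:
            problems.append(f"disallowed element: <{elem.tag}>")
        for name in elem.attrib:
            # Heuristic: event attributes such as onclick start with "on".
            if name.lower().startswith("on"):
                problems.append(f"event attribute: {name} on <{elem.tag}>")
    return problems


print(check_fragment("<p>First paragraph.</p><p>Second<br /> paragraph.</p>"))
# prints []
print(check_fragment('<p onclick="go()">Hi</p>'))
# prints ['event attribute: onclick on <p>']
```

A fragment that passes this check may still be rejected by schema validation, so the sketch is best treated as an early warning rather than a substitute for validating the whole message.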

The data elements within which XHTML markup may be used are:

• <AncillaryContentDescription>
• <ConferenceTheme> (deprecated)
• <ContributorDescription>
• <PromotionContact> (deprecated)
• <PublishingStatusNote>
• <ReissueDescription> (deprecated)
• <ReligiousTextFeatureDescription>

The use of XHTML tags within any of these data elements should be signalled by including the textformat attribute with value ‘05’ in the start tag of the data element in question:

Example using Reference names

<Text textformat="05"><p>XHTML-tagged text…</p><p>…may be multiple paragraphs.</p></Text>

using Short tags

<d104 textformat="05"><p>XHTML-tagged text…</p><p>…may be multiple paragraphs.</p></d104>

Note that XHTML tags such as <p> or <br /> must be properly closed, correctly nested, and must be lower case. The allowed set of tags is based around XHTML 1.1 Strict. With self-closing elements such as <br/>, it may be useful to use the modified form <br />: the extra space character makes no significant difference in XHTML, but improves compatibility if the XHTML is inadvertently used by the recipient in an HTML context.

For XHTML textual data in an East Asian writing system which uses text glosses (for example, Chinese or Japanese), the <ruby> tag should be used. Both ‘simple’ and ‘complex’ ruby from XHTML 1.1 (see http://www.w3.org/TR/ruby/) are supported by the ONIX for Books schemas, though browser support for complex ruby is not universal. Note that XHTML markup must not be mixed with Unicode interlinear annotation delimiters within a single data element.

Most of the XHTML-enabled data elements listed above are also repeatable, to provide parallel text in multiple languages – <ConferenceTheme>, within which the use of XHTML markup is strongly discouraged, and the other two deprecated elements <PromotionContact> and <ReissueDescription> are the exceptions.

(<TitleStatement> appears to be an exception, but it may be repeated per language using a separate <TitleDetail> composite.) In contrast, the following data elements are repeatable for multiple languages, but are not XHTML-enabled:

• <DeletionText>

X.15 HTML (Version 4.01 or earlier), and other XML

The inclusion of text tagged in accordance with HTML version 4.01 or earlier in an ONIX data element is possible using one of two methods described below, but use of either method is strongly discouraged – if possible, use XHTML instead. In the event that HTML is included, in either of these ways, it may only be included in the elements listed in X.14 above, and the textformat attribute on the start tag for the element should be specified with the value ‘02’ (HTML, other than XHTML). XML-tagged text can be included in the same selection of ONIX data elements, using the same two methods. XML-tagged text should use the textformat attribute with value ‘03’ (XML).
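The textformat values mentioned so far (‘02’, ‘03’ and ‘05’) tell a recipient how to interpret the content of the data element. As a minimal sketch, assuming a recipient working in Python with the standard ElementTree parser, the dispatch might look like this. The function name describe_text_element is hypothetical, and only the three values named in this section are handled.

```python
import xml.etree.ElementTree as ET

# Only the textformat codes named in this section; other codes exist.
TEXTFORMAT_LABELS = {
    "02": "HTML, other than XHTML",
    "03": "XML",
    "05": "XHTML",
}


def describe_text_element(onix_xml: str):
    """Return (format label, markup string) for one ONIX text element."""
    elem = ET.fromstring(onix_xml)
    code = elem.get("textformat")
    label = TEXTFORMAT_LABELS.get(code, "unspecified")
    if code == "05":
        # XHTML is real child markup, so serialise the children back to text.
        content = (elem.text or "") + "".join(
            ET.tostring(child, encoding="unicode") for child in elem
        )
    else:
        # HTML or other XML is 'hidden' from the parser and arrives as
        # a plain text string.
        content = elem.text or ""
    return label, content


print(describe_text_element('<Text textformat="05"><p>One</p><p>Two</p></Text>'))
# prints ('XHTML', '<p>One</p><p>Two</p>')
```

In both branches the recipient ends up with a markup string, but only in the XHTML case has the XML parser already confirmed that the markup is well formed.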

To embed HTML or XML (other than XHTML) in an ONIX data element, either:

1. Replace the ‘<’ character at the start of every HTML or XML start and end tag with its entity reference ‘&lt;’, or

2. Enclose the entire content of the data element within an XML ‘CDATA section’ (see Section 2.7 of the XML 1.0 Recommendation for details of this).

In general, method 2 using CDATA is preferred to method 1.
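The two methods can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: the helper names embed_method_1, embed_method_2 and text_element are hypothetical, and the sample HTML (with its <p> and <em> tags) is invented for the example.

```python
import xml.etree.ElementTree as ET


def embed_method_1(markup: str) -> str:
    """Method 1: replace '<' with its entity reference; change nothing else."""
    return markup.replace("<", "&lt;")


def embed_method_2(markup: str) -> str:
    """Method 2: wrap the whole content in one CDATA section."""
    if "]]>" in markup:
        # ']]>' would end the CDATA section early.
        raise ValueError("markup contains ']]>'")
    return f"<![CDATA[{markup}]]>"


def text_element(content: str, textformat: str) -> str:
    """Build a <Text> element around already-prepared content."""
    return f'<Text textformat="{textformat}">{content}</Text>'


# Hypothetical HTML fragment; the </p> end tag is omitted, as HTML allows.
html = "<p>Maj Sjöwall is best known for the <em>Martin Beck</em> novels."

m1 = text_element(embed_method_1(html), "02")
m2 = text_element(embed_method_2(html), "02")

print(m1)
# prints <Text textformat="02">&lt;p>Maj Sjöwall is best known for the &lt;em>Martin Beck&lt;/em> novels.</Text>

# An XML parser hands the original HTML back to the recipient either way.
print(ET.fromstring(m1).text == html, ET.fromstring(m2).text == html)
# prints True True
```

Note that method 1 as sketched leaves any & in the markup untouched, so HTML containing named character entities would still produce an ONIX message that is not well formed.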

Note: using embedded HTML markup with either method presents significant difficulties for data senders and recipients who process the ONIX data using XSLT. With method 2, any named character entities encapsulated within CDATA will make the ONIX invalid after processing, since XSLT processing cannot output ONIX using method 2: conversion to numerical character references or native characters is required. With method 1, special precautions need to be taken to avoid double-escaping of &amp; or other named character entities or numerical character references after XSLT processing. Using XHTML as described in X.14 is strongly recommended as it avoids these issues.

Example HTML method 1 – replace < in HTML markup with &lt;

using Reference names

<Text textformat="02">Maj Sjöwall is best known for the

Martin Beck novels.</Text>

using Short tags (illustrating double-escaping issue)

<d104 textformat="02">Maj Sj&amp;#246;wall is best known for the Martin Beck novels.</d104>

Notes: Only the < character should be changed. HTML tags may be upper or lower case, but lower case is recommended for improved compatibility. In HTML, some end tags such as </p> are optional. Note the use of a ‘double-escaped’ numerical character reference &amp;#246; instead of native character ‘ö’ or the character reference &#246; in the Short tags example; this double escaping is strongly discouraged. Use &#246; without double-escaping. (To avoid doubt, this means that if an & character is intended to appear in the final rendered HTML, it should be included in the ONIX data as &amp;, not as &amp;amp;.) Using this method, there may be issues when the < character is intended to appear in the final rendered HTML, and using the numerical character reference &#60; may help avoid these issues.
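The effect of double-escaping can be seen by parsing both forms. The sketch below uses Python's standard ElementTree parser and xml.sax.saxutils.escape as stand-ins for a recipient's XML tooling; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

single = '<d104 textformat="02">Maj Sj&#246;wall</d104>'
double = '<d104 textformat="02">Maj Sj&amp;#246;wall</d104>'

# The parser resolves exactly one level of escaping.
single_text = ET.fromstring(single).text
double_text = ET.fromstring(double).text

print(single_text)  # prints Maj Sjöwall
print(double_text)  # prints Maj Sj&#246;wall

# Double-escaping typically arises when already-escaped text is escaped again.
once = escape("&")
twice = escape(once)
print(once, twice)  # prints &amp; &amp;amp;
```

With the double-escaped form, the recipient is left holding a literal character reference that must be decoded a second time before the text can be displayed correctly.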

Example HTML method 2 – encapsulate HTML in <![CDATA[ … ]]>

using Reference names

<Text textformat="02"><![CDATA[Maj Sjöwall is best known for the

Martin Beck novels.]]></Text>

using Short tags (illustrating use of named character entity)

<d104 textformat="02"><![CDATA[Maj Sj&ouml;wall is best known for the Martin Beck novels.]]></d104>

Notes: No special treatment of the < character in markup is necessary. Note the use of a named character entity &ouml; instead of native character ‘ö’ in the Short tags example. Since text within CDATA is not parsed by the recipient XML system, this may work, even though named character entities are not allowed within parsed ONIX data.
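The difference between parsed and unparsed content can be demonstrated with a short Python sketch using the standard ElementTree parser. The <p> tags in the sample data are invented for illustration. The last step shows one way the CDATA wrapper can be lost in processing: ElementTree, like many XML tools, does not preserve CDATA sections when it writes XML back out.

```python
import xml.etree.ElementTree as ET

inside_cdata = '<d104 textformat="02"><![CDATA[<p>Maj Sj&ouml;wall</p>]]></d104>'
outside_cdata = '<d104 textformat="02">&lt;p>Maj Sj&ouml;wall&lt;/p></d104>'

# Inside CDATA the named entity is passed through untouched.
cdata_text = ET.fromstring(inside_cdata).text
print(cdata_text)  # prints <p>Maj Sj&ouml;wall</p>

# Outside CDATA the same entity is undefined, so parsing fails.
try:
    ET.fromstring(outside_cdata)
    parsed_outside = True
except ET.ParseError:
    parsed_outside = False
print(parsed_outside)  # prints False

# Writing the element back out drops the CDATA section and escapes the text.
reserialised = ET.tostring(ET.fromstring(inside_cdata), encoding="unicode")
print(reserialised)
# prints <d104 textformat="02">&lt;p&gt;Maj Sj&amp;ouml;wall&lt;/p&gt;</d104>
```

After the round trip the content is effectively in method 1 form with an escaped named entity, which is why converting named entities to numerical character references or native characters before embedding is the safer choice.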

Example XML method 1 – replace < in XML markup with &lt;

using Reference names

<Text textformat="03">&lt;para>XML-tagged paragraph with &lt;emph>emphasized&lt;/emph> text&lt;/para></Text>

using Short tags

<d104 textformat="03">&lt;para>XML-tagged paragraph with &lt;emph>emphasized&lt;/emph> words.&lt;/para></d104>

Example XML method 2 – encapsulate XML in <![CDATA[ … ]]>

using Reference names

<Text textformat="03"><![CDATA[<para>XML-tagged paragraph with

<emph>emphasized</emph> words.</para>]]></Text>

using Short tags

<d104 textformat="03"><![CDATA[<para>XML-tagged paragraph with

<emph>emphasized</emph> words.</para>]]></d104>

Note that the validity of the HTML or XML markup cannot be checked via the ONIX schemas, since with either method, the markup is effectively ‘hidden’ from the validation process.

The use of CDATA for anything other than inclusion of HTML or XML-tagged text should be avoided.

X.16 Using HTML5

HTML5 can be embedded using either method for HTML 4.

XHTML5 (the XML serialization of HTML5) may be embedded using the method for XHTML 1.0 and 1.1, provided that in addition to the rules that apply to all XHTML, new elements introduced only in XHTML5 are avoided – for example <article> or <section>. Some new elements may be allowed in a future version of the XHTML subset. A full list of allowed and disallowed elements is given in the Guide.

