Logical Structure Description - Document Structures

5.1 Document Structures

5.1.2 Logical Structure Description

Parallel to the visual structure of a document, although strictly related to it for the foregoing reasons, is its logical structure. Since a document may include several kinds of components at different levels of complexity, playing different roles and possibly interrelated in many different ways, a sufficiently expressive representation must be available to processing systems intended to handle them as true documents and not just as computer files. XML is often exploited for this purpose, as a powerful and flexible language.

Fig. 5.3 DOM interface hierarchy

DOM (Document Object Model) The DOM is a platform-independent representation to model the logical structure, content and style of documents, and the way in which they can be built, accessed, navigated and manipulated by adding, deleting or modifying elements and/or content. The DOM does not express the relevance or structure of information in a document. It is a logical model that just specifies representational and behavioral requirements, leaving each specific implementation in any language free to support them in any convenient way, as long as structural isomorphism is preserved: any two DOM implementations will produce the same structure model representation of any given document. Being object oriented, it models documents using objects (each endowed with identity, structure and behavior) by specifying:

• Classes (interfaces), an abstract specification of how to access and manipulate documents and components thereof in an application’s internal representation;

• Their semantics, including properties (attributes) and behavior (methods);

• Their relationships and collaborations among each other and with objects.

It is particularly suited to represent and handle XML/HTML documents in object-oriented environments. Indeed, the DOM representation of a document (called its structure model) has a forest-shaped structure which may consist of many trees whose nodes are objects (rather than data structures). Some types of nodes may have child nodes, others are leaves. Figure 5.3 shows the DOM interface hierarchy.
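As a minimal sketch of what a structure model looks like in practice, the following Python fragment uses the standard-library `xml.dom.minidom` module, chosen here only as one readily available DOM implementation. The sample document and the helper name `outline` are assumptions made for illustration.

```python
from xml.dom import minidom, Node

# Hypothetical sample document, used only for illustration.
XML = "<article><title>DOM</title><body>Some <b>bold</b> text</body></article>"

# Human-readable labels for the node types that occur in this sample.
KINDS = {
    Node.DOCUMENT_NODE: "Document",
    Node.ELEMENT_NODE: "Element",
    Node.TEXT_NODE: "Text",
}

def outline(node, depth=0):
    """Return one line per node: indentation, node kind and nodeName."""
    kind = KINDS.get(node.nodeType, "Other")
    lines = ["  " * depth + kind + " " + node.nodeName]
    # Leaf nodes simply expose an empty list of children.
    for child in node.childNodes:
        lines.extend(outline(child, depth + 1))
    return lines

doc = minidom.parseString(XML)
tree = outline(doc)
print("\n".join(tree))
```

Each printed line corresponds to one object in the tree: the Document root, the Element nodes below it, and the Text leaves that carry the actual content.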

DOMImplementation provides methods for performing operations that are independent of any particular DOM instance. DOMException instances are raised when attempting invalid operations. Node is the primary interface, representing a single node in the document tree (for which several sub-types, or specializations, are available). It exposes methods for dealing with children, inherited also by its sub-classes that cannot have children. It includes a nodeName attribute inherited by all its descendants to declare the name of their instances. NodeList handles ordered lists of Nodes (e.g., the children of a Node); NamedNodeMap handles unordered sets of nodes referenced by their name attribute.
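The roles of NodeList, NamedNodeMap and DOMException can be sketched with Python's `xml.dom.minidom`, again taken only as an example implementation. The sample markup is hypothetical, and the specific exception subclass shown (`HierarchyRequestErr`) is the one that this library raises when a child is added to a node that cannot have children.

```python
from xml.dom import minidom, DOMException, HierarchyRequestErr

# Hypothetical sample document, used only for illustration.
doc = minidom.parseString(
    '<list lang="en" size="2"><item>a</item><item>b</item></list>'
)
root = doc.documentElement

# NodeList: the children of a Node, in document order.
children = root.childNodes
names = [child.nodeName for child in children]
first_text = children.item(0).firstChild  # a Text leaf

# NamedNodeMap: attribute nodes retrieved by name, with no meaningful order.
attrs = root.attributes
lang = attrs.getNamedItem("lang").value

# Text inherits the child-handling methods of Node, but calling them is
# an invalid operation and is reported through a DOMException.
try:
    first_text.appendChild(doc.createElement("x"))
    error = None
except DOMException as exc:
    error = exc

print(names, lang, type(error).__name__)
```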

Document is the (mandatory and sole) root of the tree, representing the whole document and providing access to all of its data. Element represents an element in an HTML/XML document, possibly associated with attributes represented by Attr instances, whose allowed values are typically defined in a DTD. Attr objects are not considered part of the document tree, but just properties of the elements they are associated with. Thus, in fact they are not children of the element they describe, and do not have a separate identity from them. CharacterData is an abstract interface (it has no actual objects) to access character data, inherited by Text, Comment, and CDATASection. Comment refers to the characters between the <!-- and --> delimiters. Text nodes can only be leaves, each representing a continuous (i.e., not interleaved with any Element or Comment) piece of textual content of an Element or Attr.
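The distinction between children and attributes can be checked directly. The sketch below assumes Python's `xml.dom.minidom` as the DOM implementation and a hypothetical one-paragraph document.

```python
from xml.dom import minidom, Node

# Hypothetical sample: one element with an attribute, text and a comment.
doc = minidom.parseString('<p id="p1">Hello<!-- a note -->world</p>')
p = doc.documentElement

# The comment interrupts the text, so there are two separate Text leaves.
child_types = [child.nodeType for child in p.childNodes]
comment = p.childNodes[1]

# The attribute is reached through the element, not through childNodes.
attr = p.getAttributeNode("id")
attr_is_child = any(child is attr for child in p.childNodes)

print(child_types, repr(comment.data), attr.name, attr.value, attr_is_child)
```

The Attr node has no parent in the tree; it only records the element that owns it.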

DocumentType allows dealing with the list of entities that are defined for the document. Each Document has at most one such object whose value is stored in the doctype attribute. Notation nodes have no parent, and represent notations in the DTD that declare either the format of an unparsed entity (by name) or processing instruction targets. Entity represents a (parsed or unparsed) entity in an XML document. EntityReference objects represent references to entities whose corresponding Entity node might not exist. If it exists, then the subtree of the EntityReference node is in general a copy of the Entity node's subtree. Entity and EntityReference nodes, and all their descendants, are read-only. ProcessingInstruction represents processor-specific information in the text of an XML document. CDATASection objects contain (in the DOMString attribute inherited by Text) text whose characters would otherwise be regarded as markup. They cannot be nested and are terminated by the ]]> delimiter.
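A short sketch, assuming Python's `xml.dom.minidom` and a hypothetical document, shows how DocumentType, ProcessingInstruction and CDATASection nodes appear in the tree. The processing instruction target `render` is invented for the example.

```python
from xml.dom import minidom, Node

# Hypothetical sample: a document type declaration, a processing
# instruction, and a CDATA section holding characters that would
# otherwise be read as markup.
XML = (
    "<!DOCTYPE note>"
    '<?render mode="fast"?>'
    "<note><![CDATA[if (a < b && b > c)]]></note>"
)

doc = minidom.parseString(XML)

# The Document node holds the doctype, the instruction and the root element.
top_level = [node.nodeType for node in doc.childNodes]

doctype = doc.doctype        # the single DocumentType object, if any
pi = doc.childNodes[1]       # ProcessingInstruction: target plus raw data
cdata = doc.documentElement.firstChild

print(doctype.name, pi.target, pi.data, cdata.data)
```

The CDATASection keeps `<`, `>` and `&` as plain characters in its data, which is what distinguishes it from ordinary parsed text.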

DocumentFragment is a ‘lightweight’ version of a Document object, useful for extracting (e.g., copying/cutting) or creating a portion of document tree that can be inserted as a sub-tree of another Node (e.g., for moving around document pieces), by adding its children to the children list of that node.
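The cut-and-paste use of DocumentFragment described above might look as follows in Python's `xml.dom.minidom` (an assumed implementation choice; the list document is hypothetical). Appending a node that already has a parent moves it, so the fragment here acts as a clipboard.

```python
from xml.dom import minidom

# Hypothetical sample list whose items will be reordered.
doc = minidom.parseString("<ul><li>one</li><li>two</li><li>three</li></ul>")
ul = doc.documentElement

frag = doc.createDocumentFragment()

# Cut the last two items: each appendChild detaches the node from <ul>.
frag.appendChild(ul.childNodes[1])  # "two"
frag.appendChild(ul.childNodes[1])  # "three", which has shifted to index 1
in_fragment = len(frag.childNodes)

# Paste at the front: the fragment's children are inserted, not the
# fragment itself, and the fragment is left empty afterwards.
ul.insertBefore(frag, ul.firstChild)
order = [li.firstChild.data for li in ul.childNodes]

print(in_fragment, order, len(frag.childNodes))
```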

Example 5.3 (DOM node types for XML and HTML) Internal nodes, along with possible types of their children nodes:

• Document: Element (at most one), ProcessingInstruction, Comment, DocumentType (at most one)

• DocumentFragment: Element, ProcessingInstruction, Comment, Text, CDATASection, EntityReference

• EntityReference: Element, ProcessingInstruction, Comment, Text, CDATASection, EntityReference

• Element: Element, Text, Comment, ProcessingInstruction, CDATASection, EntityReference

• Attr: Text, EntityReference

• Entity: Element, ProcessingInstruction, Comment, Text, CDATASection, EntityReference

Leaf nodes: DocumentType, ProcessingInstruction, Comment, Text, CDATASection, Notation.
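A few of these parent/child constraints can be probed experimentally. The sketch below assumes Python's `xml.dom.minidom`, which reports a forbidden insertion by raising `HierarchyRequestErr`. The helper name `rejected` is introduced only for this example.

```python
from xml.dom import minidom, HierarchyRequestErr

doc = minidom.parseString("<root/>")

def rejected(parent, child):
    """Return True if parent refuses child with HierarchyRequestErr."""
    try:
        parent.appendChild(child)
    except HierarchyRequestErr:
        return True
    return False

# Document: at most one Element child, and no Text children at all.
second_root = rejected(doc, doc.createElement("other"))
text_in_doc = rejected(doc, doc.createTextNode("loose"))

# Comment is a legal child of Document; Text is a legal child of Element.
comment_in_doc = rejected(doc, doc.createComment("fine"))
text_in_element = rejected(doc.documentElement, doc.createTextNode("fine"))

# Leaf nodes accept no children.
child_of_leaf = rejected(doc.createComment("leaf"), doc.createTextNode("x"))

print(second_root, text_in_doc, comment_in_doc, text_in_element, child_of_leaf)
```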

The DOM standard was defined by the W3C to establish a multi-environment framework and avoid incompatibility among browsers. Different levels of specifications are available, each requiring additional features with respect to the previous levels. The current level is 2, although pieces of level 3 are progressively becoming W3C recommendations. Level 1 [1] consists of two modules:

Core (mandatory) provides a low-level set of fundamental interfaces that can represent any structured document; it also defines (optional) extended interfaces for representing XML documents.

HTML provides additional, higher-level interfaces that are used with those defined in the Core to provide a more convenient view of HTML documents.

An implementation including the extended XML interfaces does not require the HTML module, and XML interfaces are not required for implementations that only deal with HTML documents. Level 2 [2] is made up of 14 modules: Core, XML, HTML, Views, Style Sheets, CSS, CSS2, Events, User Interface Events, Mouse Events, Mutation Events, HTML Events, Range, Traversal. An implementation is ‘Level 2 compliant’ if it supports the Core; it can be compliant with single modules as well (if it supports all the interfaces for those modules and the associated semantics).
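Module support can be queried at run time through the `hasFeature` method of DOMImplementation. The sketch below asks Python's `xml.dom.minidom`, taken only as an example implementation, which of a few modules it claims to support. The answers reflect what that particular library reports, not a statement about DOM implementations in general.

```python
from xml.dom import minidom

# DOMImplementation is obtained without reference to any document instance.
impl = minidom.getDOMImplementation()

# Ask about a handful of modules at two specification levels.
modules = ("Core", "XML", "HTML", "Events")
levels = ("1.0", "2.0")
support = {
    (module, level): impl.hasFeature(module, level)
    for module in modules
    for level in levels
}

for (module, level), ok in sorted(support.items()):
    print(module, level, "supported" if ok else "not supported")
```

With this library, Core and XML are reported as supported while HTML and Events are not, which matches the rule that only the Core is mandatory.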

Several APIs for the various programming languages have been developed to handle DOM representations. This provides programmers with an interface to their proprietary representations and functionalities (they may even access pre-DOM software), and authors with a means to increase interoperability on the Web. DOM APIs keep the whole document structure in memory, which allows random access to any part thereof, but is quite space demanding.³

From the document Advances in Pattern Recognition (pages 179-182)