• Aucun résultat trouvé

The SAX and DOM APIs

1.3 The Power of Python and XML

1.3.2 The SAX and DOM APIs

The two most basic and broadly used APIs to XML data are the SAX and DOM interfaces. These interfaces differ substantially; learning to determine which of these is appropriate for your application is an important step to learn.

SAX defines a relatively low-level interface that is easy for XML parsers to support, but requires the application programmer to manage more details of using the information in the XML documents and performing operations on it. It offers the advantage of low overhead: no large data structures are constructed unless the application itself actually needs them. This allows many forms of processing to proceed much more quickly than could occur if more overhead were required, and much larger documents can be processed efficiently. It achieves this by being an event-oriented interface; using SAX is more like processing user-input events in a graphical

user interface than manipulating a pre-constructed data structure. So how do you get "events"

from an XML parser, and what kind of events might there be?

SAX defines a number of handler interfaces that your application can implement to receive events.

The methods of these objects are called when the appropriate events are encountered in the XML document being parsed; each method can be thought of as the actual event, which fits well with object-oriented approaches to parsing. Events are categorized as content, document type, lexical, and error events; each category of events is handled using a distinct interface. The application can specify exactly which categories of events it is interested in receiving by providing the parser with the appropriate handlers and omitting those it does not need. Python's XML support provides base classes that allow you to implement only the methods you're interested in, just inheriting do-nothing methods for events you don't need.

The most commonly used events are the content-related events, of which the most important are startElement, characters, and endElement. We look at SAX in depth in Chapter 3, but now let's take a quick look at how we might use SAX to extract some useful information from a document. We'll use a simple document; it's easy to see how this would extend to something more complex. The document is shown here:

<catalog>

<book isbn="1-56592-724-9">

<title>The Cathedral &amp; the Bazaar</title>

<author>Eric S. Raymond</author>

</book>

<book isbn="1-56592-051-1">

<title>Making TeX Work</title>

<author>Norman Walsh</author>

</book>

<!-- imagine more entries here... -->

</catalog>

If we want to create a dictionary that maps the ISBN numbers given in the isbn attribute of the book elements to the titles of the books (the content of the title elements), we would create a content handler (as shown in Example 1-1) that looks at the three events listed previously.

Example 1-1. bookhandler.py import xml.sax.handler

class BookHandler(xml.sax.handler.ContentHandler):

def __init__(self):

self.inTitle = 0 self.mapping = {}

def startElement(self, name, attributes):

if name == "book":

self.buffer = ""

self.isbn = attributes["isbn"]

elif name == "title":

self.inTitle = 1

def characters(self, data):

if self.inTitle:

self.buffer += data

def endElement(self, name):

if name == "title":

self.inTitle = 0

self.mapping[self.isbn] = self.buffer

Extracting the information we're looking for is now trivial. If the code above is in bookhandler.py and our sample document is in books.xml, we could do this in an interactive session:

>>> import xml.sax

>>> import bookhandler

>>> import pprint

>>>

>>> parser = xml.sax.make_parser( )

>>> handler = bookhandler.BookHandler( )

>>> parser.setContentHandler(handler)

>>> parser.parse("books.xml")

>>> pprint.pprint(handler.mapping) {u'1-56592-051-1': u'Making TeX Work',

u'1-56592-724-9': u'The Cathedral & the Bazaar'}

For reference material on the handler object methods, refer to Appendix C.

The DOM is quite the opposite of SAX. SAX offers a very small window of view that passes over the input document, relying on the application to infer the whole; the DOM gives the whole document to the application, which must then extract the finer details for itself. Instead of reporting individual events to the application as the parser handles the corresponding syntax in the document, the application creates an object that represents the entire document as a hierarchical structure. Although there is no requirement that the document be completely parsed and stored in memory when the object is provided to the application, most implementations work that way for simplicity. Some implementations avoid this; it is certainly possible to create a DOM implementation that parses the document lazily or uses some kind of persistent storage to keep the parsed document instead of an in-memory structure.

The DOM provides objects called nodes that represent parts of a document to the application.

There are several types of nodes, each used for a different kind of construct. It is important to understand that the nodes of the DOM do not directly correspond to SAX events, although many are similar. The easiest way to see the difference is to look at how elements and their content are represented in both APIs. In SAX, an element is represented by start and end events, and its content is represented by all the events that come between the start and the end. The DOM provides a single object that represents the element, and it provides methods that allow the application to get the child nodes that represent the content of the element. Different node types are provided for elements, text, and just about everything else that can exist in an XML document.

We go into more detail and see some extended examples using the DOM in Chapter 4, and a detailed reference to the DOM API is given in Appendix D. For a quick taste of the DOM, let's write a snippet of code that does the same thing we do with SAX in Example 1-1, but using the basic DOM implementation from the Python standard library, as shown in Example 1-2.

Example 1-2. dombook.py import pprint

import xml.dom.minidom

from xml.dom.minidom import Node

doc = xml.dom.minidom.parse("books.xml")

mapping = {}

for node in doc.getElementsByTagName("book"):

isbn = node.getAttribute("isbn")

L = node.getElementsByTagName("title") for node2 in L:

title = ""

for node3 in node2.childNodes:

if node3.nodeType == Node.TEXT_NODE:

title += node3.data mapping[isbn] = title

# mapping now has the same value as in the SAX example:

pprint.pprint(mapping)

It should be clear that we're dealing with something very different here! While there's about the same amount of code in the DOM example, it can be very difficult to develop reusable components, while experience with SAX often points the way to reusable components with only a small bit of refactoring. It is possible to reuse DOM code, but the mindset required is very different. What the DOM provides to compensate is that a document can be manipulated at arbitrary locations with full knowledge of the complete document, and the document contents can be extracted in different ways by different parts of an application without having to parse the document more than once. For some applications, this proves to be a highly motivating reason to use the DOM instead of SAX.