Chapter 3. The Simple API for XML

3.6 Converting XML to HTML

The PyXML package contains XML parsers, including PyExpat, as well as support for SAX and DOM, and much more. While learning the ropes of the PyXML package, it would be nice to have a comprehensive list of all the classes and methods. Since this is a programming book, it seems appropriate to write a Python program to extract the information we need—and in XML, no less!

Let's generate an XML file that lists each of the files in the PyXML package, the classes in each file, and the methods of each class. This process gives us usable XML quickly.

Rather than a replacement for all the snazzy code-to-documentation generators out there, Example 3-8 shows a simple, quick way to generate XML that we can experiment with and use throughout the examples in this chapter. After all, when manipulating XML, it helps to have a few hundred thousand bytes of it sitting around to play with. (This program also demonstrates how simple it is to examine all the files in a directory tree using the os.path.walk function.)

Example 3-8. genxml.py

"""
genxml.py

Descends PyXML tree, indexing source files and creating XML tags
for use in navigating the source.
"""

import os
import sys

from xml.sax.saxutils import escape

def process(filename, fp):
    print "* Processing:", filename,
    # parse the file
    pyFile = open(filename)
    fp.write("<file name=\"" + filename + "\">\n")
    inClass = 0
    line = pyFile.readline()
    while line:
        line = line.strip()
        if line.startswith("class") and line[-1] == ":":
            # close the previous class before opening a new one
            if inClass:
                fp.write(" </class>\n")
            inClass = 1
            fp.write(" <class name='" + line[:-1] + "'>\n")
        elif line.startswith("def") and line[-1] == ":" and inClass:
            # drop the trailing colon and escape markup characters
            fp.write(" <method name='" + escape(line[:-1]) + "'/>\n")
        line = pyFile.readline()
    pyFile.close()
    if inClass:
        fp.write(" </class>\n")
        inClass = 0
    fp.write("</file>\n")

def finder(fp, dirname, names):
    """Process each Python source file in the directory dirname."""
    for name in names:
        if name.endswith(".py"):
            path = os.path.join(dirname, name)
            if os.path.isfile(path):
                process(path, fp)

def main():
    print "[genxml.py started]"
    xmlFd = open("pyxml.xml", "w")
    xmlFd.write("<?xml version=\"1.0\"?>\n")
    xmlFd.write("<pyxml>\n")
    os.path.walk(sys.argv[1], finder, xmlFd)
    xmlFd.write("</pyxml>")
    xmlFd.close()
    print "[genxml.py finished]"

if __name__ == "__main__":
    main()

The main function in Example 3-8 uses the os.path.walk function to search your PyXML directory for Python files. For each Python source file that exists beneath the starting directory (the argument to the script), the process function is called to extract class information. That function writes the extracted information into the open XML file.
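The os.path.walk function exists only in Python 2; it was removed in Python 3. As a rough sketch of the same traversal on a current interpreter, the generator below uses os.walk instead. The name find_python_files is made up for this illustration and is not part of genxml.py.

```python
import os


def find_python_files(root):
    """Yield the path of every regular .py file beneath root.

    Hypothetical helper, not part of genxml.py: os.walk stands in for
    the Python 2 os.path.walk callback used by finder.
    """
    for dirname, subdirs, names in os.walk(root):
        subdirs.sort()  # visit subdirectories in a predictable order
        for name in sorted(names):
            if name.endswith(".py"):
                path = os.path.join(dirname, name)
                if os.path.isfile(path):
                    yield path
```

A caller would loop over find_python_files(sys.argv[1]) and hand each path to a function like process, with no callback argument needed.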

The process function then reads each Python source file one line at a time, picking out the classes and methods by checking each line for the keywords class and def:

def process(filename, fp):
    print "* Processing:", filename,
    # parse the file
    pyFile = open(filename)
    fp.write("<file name=\"" + filename + "\">\n")
    inClass = 0
    line = pyFile.readline()
    while line:
        line = line.strip()

When the program finds a class declaration, it creates the appropriate class tag and attributes within the XML document:

        if line.startswith("class") and line[-1] == ":":
            if inClass:
                fp.write(" </class>\n")
            inClass = 1
            fp.write(" <class name='" + line[:-1] + "'>\n")

When the program encounters a method definition, it replaces special characters with entities so they don't cause problems in the XML. The method definition string is trimmed, and then surrounded with the appropriate markup:

        elif line.startswith("def") and line[-1] == ":" and inClass:
            fp.write(" <method name='" + escape(line[:-1]) + "'/>\n")
        line = pyFile.readline()

After a file is complete, the program closes out the last class it was in, if any, and closes out the file tag as well:

    pyFile.close()
    if inClass:
        fp.write(" </class>\n")
        inClass = 0
    fp.write("</file>\n")

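Python simplifies the work of parsing the text. Each line is stripped of surrounding whitespace, the characters &, <, and > in method definitions are replaced with entities (using the escape function from the xml.sax.saxutils module), and XML tags are placed around class definitions and method names. Note that escape leaves quotation marks alone unless you pass it a dictionary of additional replacements, so a method signature that contains a single quote (in a default argument, for example) would still break the single-quoted name attribute.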
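The behavior of the escaping helpers is easy to check interactively. The following is a small sketch in Python 3 syntax (an assumption; the listings in this section target Python 2), and the signature string is invented for the illustration. Both escape and quoteattr are part of xml.sax.saxutils.

```python
from xml.sax.saxutils import escape, quoteattr

# A made-up method signature containing both markup characters and quotes.
signature = "def compare(self, a, b='<none>')"

# escape() alone rewrites &, < and > but leaves quotation marks untouched.
plain = escape(signature)

# Additional replacements can be passed as a dictionary.
quoted = escape(signature, {"'": "&apos;", '"': "&quot;"})

# quoteattr() escapes the value and supplies the surrounding quotes itself,
# picking a quote character that does not clash with the content.
attribute = "<method name=%s/>" % quoteattr(signature)

for text in (plain, quoted, attribute):
    print(text)
```

Of the three, quoteattr is probably the most convenient choice when the value ends up inside an attribute, since the caller no longer has to decide which quote character to use.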

To run this program from the shell:

$> python genxml.py /home/chris/PyXML/xml

The parameter to the script is the path to your PyXML source directory (including the xml subdirectory).

3.6.1 The Generated Document

The XML that is generated is placed in a file called pyxml.xml. Each file element looks something like this:

<file name="...">
 <class name='...'>
 <method name='...'/>
 ...
 </class>
 ...
</file>

Note that the name attribute of the file tag varies depending upon what your parameter is to the script (your PyXML source path). Functions not defined as methods in a class are not included by the simple parsing loop (hey, this isn't a compiler!), but you should be aware that the XML support provided by both the standard library and the PyXML package includes many useful functions—read the reference documentation for more information on those. The escape function we use in this script is a perfect example of this. If you're new to Python, you'll find that little helper functions are characteristic of Python libraries; most of the small utilities needed to make the larger facilities easier to use have already been included, allowing you to concentrate on your application.
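To see the shape of a file element without running the script over the whole package, the scanning loop can be tried on a few lines of source held in a string. This is only a sketch under some assumptions: it is written for Python 3, the function name scan_source and the sample source are invented, and it uses quoteattr for the attribute values where Example 3-8 concatenates the quotes by hand.

```python
from xml.sax.saxutils import quoteattr


def scan_source(filename, source):
    """Return a <file> element describing classes and methods in source.

    Hypothetical restatement of the process loop from genxml.py; it is a
    line-by-line keyword check, not a real Python parser.
    """
    out = ["<file name=%s>\n" % quoteattr(filename)]
    in_class = False
    for line in source.splitlines():
        line = line.strip()
        if line.startswith("class ") and line.endswith(":"):
            if in_class:
                out.append(" </class>\n")  # close the previous class
            in_class = True
            out.append(" <class name=%s>\n" % quoteattr(line[:-1]))
        elif line.startswith("def ") and line.endswith(":") and in_class:
            out.append("  <method name=%s/>\n" % quoteattr(line[:-1]))
    if in_class:
        out.append(" </class>\n")
    out.append("</file>\n")
    return "".join(out)


# Invented sample: one module-level function and one class with a method.
sample = '''
def helper():
    pass

class Handler(Base):
    def start(self, name):
        pass
'''
print(scan_source("sample.py", sample), end="")
```

The helper function is skipped because no class has been seen yet when it appears. Because the loop ignores indentation, a module-level function that follows a class would still be recorded as a method of that class.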

If you spend some time reviewing this XML file, you will start to become familiar with the scope of the PyXML toolkit. A script is provided a little later in this chapter that converts this XML to HTML using the SAX API and Python string manipulation features. Figure 3-2 shows the converted output in a browser.

Figure 3-2. genhtml.py output in a browser

3.6.2 The Conversion Handler

You can finish off this program by implementing the PyXMLConversionHandler class. This class generates HTML from the XML file we created earlier. The process allows you to load the HTML file into your browser and see all of the files, classes, and methods within PyXML in formatted text. Create this class, as shown in Example 3-9, in the file handlers.py.

Example 3-9. handlers.py

from xml.sax import ContentHandler

class PyXMLConversionHandler(ContentHandler):
    """A simple handler implementing 3 methods of the SAX interface."""

    def __init__(self, fp):
        """Save the file object that we generate HTML into."""
        self.fp = fp

    def startDocument(self):
        """Write out the start of the HTML document."""
        self.fp.write("<html><body>\n")

    def startElement(self, name, attrs):
        if name == "file":
            # generate start of HTML
            s = attrs.get('name', "")
            self.fp.write("File: %s \n" % s)
            print "* Processing:", s
        elif name == "class":
            self.fp.write(" " * 3 + "Class: "
                          + attrs.get('name', "") + " \n")
        elif name == "method":
            self.fp.write(" " * 6 + "Method: "
                          + attrs.get('name', "") + " \n")

    def endDocument(self):
        """End the HTML document we're generating."""
        self.fp.write("</body></html>")

While the conversion itself is very straightforward, one interesting thing to note is that this class writes its output to a file object passed to the constructor instead of building a string of HTML text in memory. This avoids storing a potentially large buffer in memory and building it incrementally with many memory copies. If the string is required to be in memory when the process is complete, the caller can provide a StringIO instance as the file to write to; the StringIO implementation is more efficient at building a large string than many string concatenations. This is a Python idiom that has proven its utility over a wide range of projects.
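The idiom can be sketched with a cut-down handler. This sketch assumes Python 3, where StringIO lives in the io module; the OutlineHandler class and the one-line document are invented for the illustration and are not the handler from Example 3-9.

```python
import io
from xml.sax import parseString
from xml.sax.handler import ContentHandler


class OutlineHandler(ContentHandler):
    """Hypothetical cut-down handler that writes to any file-like object."""

    def __init__(self, fp):
        super().__init__()
        self.fp = fp

    def startElement(self, name, attrs):
        if name in ("file", "class", "method"):
            self.fp.write("%s: %s\n" % (name, attrs.get("name", "")))


# Invented input in the same vocabulary as pyxml.xml.
document = b"<pyxml><file name='a.py'><class name='class A'/></file></pyxml>"

buffer = io.StringIO()      # in-memory stand-in for an open file
parseString(document, OutlineHandler(buffer))
result = buffer.getvalue()  # the accumulated text as one string
print(result, end="")
```

The handler never learns whether it is writing to a disk file, a pipe, or a buffer, so the same class serves both the streaming case and the case where the caller wants the result as a string.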

3.6.3 Driving the Conversion Handler

The main script really isn't any different from the others we've looked at so far. We create the parser and instantiate our handler class, register the handler, and set the parser in motion. This process is shown in Example 3-10.

Example 3-10. genhtml.py

#!/usr/bin/env python
# generates HTML from pyxml.xml

import sys

from xml.sax import make_parser
from handlers import PyXMLConversionHandler

dh = PyXMLConversionHandler(sys.stdout)
parser = make_parser()
parser.setContentHandler(dh)
parser.parse(sys.stdin)

The script reads the XML document from the standard input stream and writes its output to the standard output stream.

From the document Python & XML, Christopher A. Jones and Fred L. Drake, Jr. (pages 86-95)