
Chapter 5. Querying XML with XPath

5.3 Location Paths

The most commonly used type of XPath expression is the location path. A location path can be thought of as similar to a path for a file on a disk, but on steroids.

Where a path for a filesystem contains only names of directories and a file, an XPath location path can specify much more. At each step along the path, it can perform selection based on complex tests of the nodes in a document, and the result may be several nodes. The tests, or predicates, for each step of the path can match based on element name, attribute presence or value, or textual content.

The full syntax of location paths is complex, but the specification is considerate enough to define abbreviated forms for the most commonly used tests; these are called abbreviated location paths. All of the location paths we describe in this chapter use the abbreviated syntax; for more information on the full syntax and selection capabilities of XPath, please refer to the specification.
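To give a sense of what the abbreviations stand for, the following sketch lists a few abbreviated forms next to their unabbreviated XPath 1.0 equivalents. The dictionary name ABBREVIATIONS and the choice of entries are illustrative only; the paths are plain strings and are not evaluated here.

```python
# Abbreviated XPath forms and their unabbreviated equivalents (XPath 1.0).
# ABBREVIATIONS is a hypothetical name used only for this illustration.
ABBREVIATIONS = {
    "ship/captain": "child::ship/child::captain",
    "@name": "attribute::name",
    ".": "self::node()",
    "..": "parent::node()",
    "ship[2]": "child::ship[position()=2]",
    "//captain": "/descendant-or-self::node()/child::captain",
}

# Print the two forms side by side.
for short, full in ABBREVIATIONS.items():
    print(f"{short:<14} {full}")
```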

Location paths are used within XSLT elements, but may also be used programmatically with an XPath API to return node sets from an XML document at runtime. The latter technique will come into greater focus as you read this chapter; the former is covered in Chapter 6.

5.3.1 An Example Document

Let's start with an example document that represents data records. The records are all fairly similar, but of course the field values are different in each one. This is typical of the type of documents you might mine with XPath. Example 5-1 shows the XML document against which we apply XPath expressions throughout this section; it represents starships from some popular science-fiction television series.

Example 5-1. ships.xml

<?xml version="1.0" encoding="UTF-8"?>
<shiptypes name="United Federation of Planets">
  <ship name="USS Enterprise">
    <class>Sovereign</class>
    <captain>Jean-Luc Picard</captain>
    <registry-code>NCC-1701-E</registry-code>
  </ship>
  <ship name="USS Voyager">
    <class>Intrepid</class>
    <captain>Kathryn Janeway</captain>
    <registry-code>NCC-74656</registry-code>
  </ship>
  <ship name="USS Enterprise">
    <class>Galaxy</class>
    <captain>Jean-Luc Picard</captain>
    <registry-code>NCC-1701-D</registry-code>
  </ship>
  <ship name="USS Enterprise">
    <class>Constitution</class>
    <captain>James T. Kirk</captain>
    <registry-code>NCC-1701</registry-code>
  </ship>
  <ship name="USS Sao Paulo">
    <class>Defiant</class>
    <captain>Benjamin L. Sisko</captain>
    <registry-code>NCC-75633</registry-code>
  </ship>
</shiptypes>

5.3.2 A Path Hosting Script

The ships.xml file provides a good stretch of XML data to write paths against. Now you can write a small program to apply path expressions to the document, and report on the nodes that are returned. In Example 5-2, we create a small script, xp.py, which invokes the xml.xpath.Evaluate function provided with 4Suite and more recent versions of PyXML.

Example 5-2. xp.py

"""
xp.py (requires xml doc on stdin)
"""
import sys

from xml.dom.ext.reader import PyExpat
from xml.xpath import Evaluate

path0 = "ship/captain"  # all captain elements

reader = PyExpat.Reader( )
dom = reader.fromStream(sys.stdin)

captain_elements = Evaluate(path0, dom.documentElement)
for element in captain_elements:
    print "Element: ", element

To run this program, you need to supply the previously created ships.xml from Example 5-1 as input:

$ python xp.py < ships.xml

In Example 5-2, the path ship/captain is used to extract all captain elements from the ships.xml document. The result is a node list containing the following:

<captain>Jean-Luc Picard</captain>

<captain>Kathryn Janeway</captain>

<captain>Jean-Luc Picard</captain>

<captain>James T. Kirk</captain>

<captain>Benjamin L. Sisko</captain>

Of course, this is not a complete or standalone document, but rather a node list. These nodes are processed by the remaining code in the program:

captain_elements = Evaluate(path0, dom.documentElement)
for element in captain_elements:
    print "Element: ", element

The path ship/captain is a relative location path: unlike /shiptypes/ship/captain, it does not spell out the full route from the root of the document to the element. The ship/captain expression returns captain elements that are children of a ship element, relative to the context node passed to Evaluate (here, the document element, shiptypes).
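The PyXML and 4Suite packages used by xp.py may not be installable on a current Python. As a rough sketch of the same relative path, the standard library's xml.etree.ElementTree module can evaluate ship/captain. This is a substitute for Evaluate, not the same API, and it supports only a limited subset of XPath. The abridged SHIPS_XML string and the variable names are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# Abridged copy of ships.xml (two of the five ships), inlined for the sketch.
SHIPS_XML = """<shiptypes name="United Federation of Planets">
  <ship name="USS Enterprise">
    <class>Sovereign</class>
    <captain>Jean-Luc Picard</captain>
  </ship>
  <ship name="USS Voyager">
    <class>Intrepid</class>
    <captain>Kathryn Janeway</captain>
  </ship>
</shiptypes>"""

root = ET.fromstring(SHIPS_XML)  # root is the shiptypes element

# The path is relative: it is resolved against the element findall is called on.
captains = root.findall("ship/captain")
for element in captains:
    print("Element:", element.tag, element.text)

# Resolved against a ship element instead, the same path finds nothing,
# because a ship element has no ship children.
first_ship = root.find("ship")
print(len(first_ship.findall("ship/captain")))
```

The second lookup illustrates why the context node matters: the same relative path gives different results depending on where evaluation starts.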

5.3.3 Getting Character Data

You will often want to target text beneath an element. For example, you may want to search just for the captain's name, rather than the element node. You could append the XPath text( ) node test to your expression:

path1 = "ship/captain/text( )"

This addition to the path expression selects the text nodes that are children of each captain element. If you replace the original processing lines with the following code:

captainnodes = Evaluate(path1, dom.documentElement)
for captainnode in captainnodes:
    print "Starfleet Captain: ", captainnode.nodeValue

you see the following result:

$ python xp.py < ships.xml
Starfleet Captain: Jean-Luc Picard
Starfleet Captain: Kathryn Janeway
Starfleet Captain: Jean-Luc Picard
Starfleet Captain: James T. Kirk
Starfleet Captain: Benjamin L. Sisko

5.3.4 Specifying an Index

Often, when working with data, you become interested in the ordinal positions of elements within columns, rows, or arrays. XML is no different in this regard. XPath provides indexed elements with syntax similar to array indexes, but it is important to know that XPath indexes are one-based, while Python sequence indexes are zero-based. To target an element using an index, use brackets next to the element name:

path2 = "ship[2]/captain/text( )"

In this case, ship[2] selects the second ship child of the context node (here, the shiptypes element), and the rest of the path selects the text nodes beneath that ship's captain element. To see the output, change the processing code:

capnode = Evaluate(path2, dom.documentElement)
print "Captain of ship[2] is: ", capnode[0].nodeValue

Using path2, the output is:

$ python xp.py < ships.xml

Captain of ship[2] is: Kathryn Janeway

It is important not to allow the visual similarity between ship[2] and Python sequence indexing to confuse you; they are very different. The notation is actually shorthand for ship[position( )=2], which indicates that the second ship child element of some other element will match. Consider the following XML fragment:

<fleet name="Atlantic">
  <ship id="id1"/>
  <lifeboat id="id2"/>
</fleet>
<fleet name="Pacific">
  <lifeboat id="id3"/>
  <ship id="id4"/>
  <ship id="id5"/>
</fleet>

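The XPath expression ship[2], applied to each fleet element in turn, matches only the ship element with an id attribute of id5. Positions are counted among the ship children alone, so the Atlantic fleet has no second ship, and in the Pacific fleet the lifeboat is not counted. This is not a trick, but it is an excellent reason to keep a copy of the XPath specification close by.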
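The counting rule can be checked with a short sketch. It uses the standard library's xml.etree.ElementTree rather than the Evaluate function from this chapter, and it wraps the two fleet elements in a hypothetical navy root element so that the fragment parses as one document; the path is therefore written fleet/ship[2].

```python
import xml.etree.ElementTree as ET

# The two fleet elements are wrapped in a hypothetical <navy> root so that
# the fragment parses as a single document.
FLEETS_XML = """<navy>
  <fleet name="Atlantic">
    <ship id="id1"/>
    <lifeboat id="id2"/>
  </fleet>
  <fleet name="Pacific">
    <lifeboat id="id3"/>
    <ship id="id4"/>
    <ship id="id5"/>
  </fleet>
</navy>"""

root = ET.fromstring(FLEETS_XML)

# ship[2] counts only ship siblings; lifeboat elements are not counted.
matches = root.findall("fleet/ship[2]")
matched_ids = [ship.get("id") for ship in matches]
print(matched_ids)

# By contrast, Python indexing over the children of the Pacific fleet is
# zero-based and counts every child, whatever its element name.
pacific = root.find("fleet[@name='Pacific']")
second_child_id = pacific[1].get("id")
print(second_child_id)
```

The XPath position selects id5, while the Python index 1 lands on id4, the second child of any name.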

5.3.5 Testing Descendent Nodes

You may also want to query the text content beneath a named element. Say you have a structure of book chapters, each containing headings and paragraphs. You may want to search for text that appears underneath a certain heading. XPath provides a convenient way for you to check the character data of a text node that is the child of an element. If you are searching for a <ship> element with a <class> element beneath it whose text is Intrepid, you could use the following path:

path3 = 'ship[class="Intrepid"]'

This expression selects ship elements that have a child class element whose character data is Intrepid. You can further explore the returned node list with some processing code:

shipnodes = Evaluate(path3, dom.documentElement)
for shipnode in shipnodes:
    shipname = shipnode.getAttribute("name")
    captain = Evaluate("captain/text( )", shipnode)
    print "--- Intrepid Class Ship ---"
    print "Name: ", shipname
    print "Captain: ", captain[0].nodeValue

In this code, we select all ship nodes that have a child class element indicating that they are Intrepid class ships. We can then reprocess this node to further select ship names and captains to generate the following output:

$ python xp.py < ships.xml

--- Intrepid Class Ship ---
Name: USS Voyager
Captain: Kathryn Janeway
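For readers without PyXML, the same child-content test can be sketched with the standard library's xml.etree.ElementTree, which accepts predicates of the form [tag='text'] as part of its limited XPath subset. This is a stand-in for Evaluate, not the chapter's API; the abridged SHIPS_XML string and the variable names are assumptions for this illustration.

```python
import xml.etree.ElementTree as ET

# Abridged copy of ships.xml, inlined for the sketch.
SHIPS_XML = """<shiptypes name="United Federation of Planets">
  <ship name="USS Voyager">
    <class>Intrepid</class>
    <captain>Kathryn Janeway</captain>
  </ship>
  <ship name="USS Sao Paulo">
    <class>Defiant</class>
    <captain>Benjamin L. Sisko</captain>
  </ship>
</shiptypes>"""

root = ET.fromstring(SHIPS_XML)

# Select ship elements whose class child has exactly the text "Intrepid".
intrepid_ships = root.findall("ship[class='Intrepid']")
for ship in intrepid_ships:
    print("--- Intrepid Class Ship ---")
    print("Name:", ship.get("name"))
    # findtext resolves the relative path "captain" against this ship.
    print("Captain:", ship.findtext("captain"))
```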

Instead of just checking that a descendent element contains necessary information as in path3, you can continue building the path expression to grab something specific beneath the matching element:

path4 = 'ship[class="Constitution"]/@name'

In this path, you drill down further. First, a ship element is selected only if its child class element contains the character data Constitution. This path is further extended when we select the name attribute of the ship element that contains the specific child character data (the @ symbol is used to indicate that we're interested in an attribute rather than a child element). Again, we change the processing code a little to use the new node list:

ship = Evaluate(path4, dom.documentElement)

print "Name of Constitution Class Ship: ", ship[0].nodeValue

The output follows:

$ python xp.py < ships.xml
Name of Constitution Class Ship: USS Enterprise

5.3.6 Testing Attributes

Of course, evaluating XML attributes and their contents involves a slightly different process than evaluating element names and text node character data. In XPath, the @ character is used to indicate an attribute. As with element tests, brackets surround the predicate that tests the attribute's value. In order to test the character contents of an attribute, use a path such as the following:

path5 = 'ship[@name="USS Enterprise"]'

This expression selects all ship elements whose name attribute is exactly USS Enterprise. In your ships.xml file, there are three starships named Enterprise, each with a different registry code. You can mine the node list for more information:

ships = Evaluate(path5, dom.documentElement)
for shipnode in ships:
    registry = Evaluate("registry-code/text( )", shipnode)
    captain = Evaluate("captain/text( )", shipnode)
    print "Found Enterprise with registry: ", registry[0].nodeValue
    print "Captain: ", captain[0].nodeValue

These subsequent expressions are relative paths that select captain and registry-code text from the current element with each hop through the node list. This time using the preceding code, the output appears as:

$ python xp.py < ships.xml

Found Enterprise with registry: NCC-1701-E
Captain: Jean-Luc Picard
Found Enterprise with registry: NCC-1701-D
Captain: Jean-Luc Picard
Found Enterprise with registry: NCC-1701
Captain: James T. Kirk
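The attribute test also falls within the XPath subset understood by the standard library's xml.etree.ElementTree, so a rough equivalent can be sketched without PyXML. As before, this is a substitute for Evaluate rather than the chapter's API, and the abridged SHIPS_XML string and variable names are assumptions for this illustration.

```python
import xml.etree.ElementTree as ET

# Abridged copy of ships.xml (three of the five ships), inlined for the sketch.
SHIPS_XML = """<shiptypes name="United Federation of Planets">
  <ship name="USS Enterprise">
    <captain>Jean-Luc Picard</captain>
    <registry-code>NCC-1701-E</registry-code>
  </ship>
  <ship name="USS Voyager">
    <captain>Kathryn Janeway</captain>
    <registry-code>NCC-74656</registry-code>
  </ship>
  <ship name="USS Enterprise">
    <captain>James T. Kirk</captain>
    <registry-code>NCC-1701</registry-code>
  </ship>
</shiptypes>"""

root = ET.fromstring(SHIPS_XML)

# The predicate compares the whole attribute value, not a substring.
enterprises = root.findall("ship[@name='USS Enterprise']")
for ship in enterprises:
    # Each findtext call is a relative path resolved against the current ship.
    print("Found Enterprise with registry:", ship.findtext("registry-code"))
    print("Captain:", ship.findtext("captain"))
```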

5.3.7 Selecting Elements

As with any ordered data set, you are usually interested in pulling one specific type of information out from the entire document. You may only be interested in the names of employees in a human resources database. Or you may have heavily nested data that you want to make sure you pull out with each occurrence of a given data type, regardless of its position in the document. With XPath, you can use the path expression // to indicate that all matching elements beneath the root should be selected:

path6 = "/shiptypes//captain"

This expression selects all captain elements beneath the shiptypes root element, regardless of how deeply they are nested. Since you are working with elements, obtaining character data requires some of the work shown earlier, or a traversal of the node structure:

captains = Evaluate(path6, dom.documentElement)
for captain in captains:
    print "Captain: ", captain.firstChild.nodeValue

Running path6 generates the following output:

$ python xp.py < ships.xml
Captain: Jean-Luc Picard
Captain: Kathryn Janeway
Captain: Jean-Luc Picard
Captain: James T. Kirk
Captain: Benjamin L. Sisko

5.3.8 Additional Operators

If you are familiar with filesystem paths on Windows or Unix, you may have seen the . and .. operators. The . operator indicates the current directory (or current element in XPath) while .. refers to the parent directory (or parent element in XPath). Using ships.xml, shown in Example 5-1, we can search for a specific ship's name and then reference the parent element to see which organization the ship belongs to.

path7 = "ship[@name='USS Voyager']/../@name"

This expression searches for a ship element that has a name attribute of "USS Voyager". The path then continues to select the name attribute of this ship element's parent. In ships.xml, this is the name attribute of the shiptypes element. To generate output, change your processing code in xp.py:

org = Evaluate(path7, dom.documentElement)

print "USS Voyager is owned by", org[0].nodeValue

This time xp.py generates output attributing the Voyager to the United Federation of Planets:

$ python xp.py < ships.xml

USS Voyager is owned by United Federation of Planets
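The parent step can also be tried with the standard library's xml.etree.ElementTree, with one difference from path7: ElementTree paths select elements only and cannot end in an attribute step such as @name, so the sketch stops at the parent element and reads the attribute with get( ). This is a stand-in for Evaluate, and the abridged SHIPS_XML string and variable names are assumptions for this illustration.

```python
import xml.etree.ElementTree as ET

# Abridged copy of ships.xml, inlined for the sketch.
SHIPS_XML = """<shiptypes name="United Federation of Planets">
  <ship name="USS Enterprise">
    <captain>Jean-Luc Picard</captain>
  </ship>
  <ship name="USS Voyager">
    <captain>Kathryn Janeway</captain>
  </ship>
</shiptypes>"""

root = ET.fromstring(SHIPS_XML)

# ".." steps from the matched ship up to its parent element.
parents = root.findall("ship[@name='USS Voyager']/..")

# ElementTree cannot select the attribute in the path itself,
# so it is read from the parent element afterwards.
org = parents[0].get("name")
print("USS Voyager is owned by", org)
```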