XML APIs
Web Data Management and Distribution
Serge Abiteboul Philippe Rigaux Marie-Christine Rousset Pierre Senellart
http://gemo.futurs.inria.fr/wdmd
January 4, 2010
Introduction
Application Programming Interfaces (APIs)
DOM, the Document Object Model.
It provides a hierarchical representation, where each node is an object instance of a DOM class.
Standardized by the W3C (see http://www.w3.org/DOM/).
DOM parsers exist in all major object-oriented languages: Java and C++ (the Xerces parser, from Apache), JavaScript (Ajax), PHP, Python, etc.
Not very efficient, and space-consuming: the whole tree is built in memory.
SAX, the Simple API for XML.
Operates on the serialized representation;
Associates triggers to each syntactic feature (e.g., a tag);
Efficient (one scan of the serialized representation), but not always appropriate.
XML:DB, the XML Database API.
Provides a common interface to native or XML-enabled databases (i.e., meant as the “JDBC for XML” API);
Promoted by the XML:DB initiative (see http://xmldb-org.sourceforge.net/xapi/).
Content of this presentation
A bird’s eye view of the principles of these APIs, along with a few examples.
Outline
1 Introduction
2 SAX
3 DOM
4 XML:DB
SAX
SAX: main principles
SAX is the API of choice for processing XML documents in serialized form (including XML streams).
The XML input is read once, and the parser triggers handlers when events are met.
An event is simply a syntactic feature of the document: an opening or a closing tag, a chunk of character data, an entity, etc.
[Figure: the SAX parser scans the XML document in serialized form (<el1> ... </el1><el2> ... character data) and calls the event handler functions, which operate on the data storage.]
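To make the event model concrete, here is a minimal sketch that records the events in the order the parser meets them. It assumes the SAX parser bundled with the JDK (obtained through SAXParserFactory) rather than Xerces proper; the class name SaxEventTrace and the method trace are hypothetical names chosen for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventTrace {

    /** Scans the XML text once and returns the events in the order they were met. */
    static List<String> trace(String xml) {
        List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                events.add("start:" + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("end:" + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Ignore whitespace-only chunks to keep the trace short
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    events.add("text:" + text);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("SAX parsing failed", e);
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(trace("<el1>character</el1><!-- one scan only -->"));
    }
}
```

The handler never sees the document as a whole: it only reacts to one event at a time, which is why any state needed across events must be kept by the handler itself.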
SAX example: the Handler
Programming with SAX = writing a handler (a class implementing the
ContentHandler interface) which defines all the functions that must be triggered.

    import org.xml.sax.*;
    import org.xml.sax.helpers.LocatorImpl;

    public class SaxHandler implements ContentHandler {
        private Locator locator;

        /** Constructor */
        public SaxHandler() {
            super();
            // Set the default locator
            locator = new LocatorImpl();
        }
Note: the locator can be used to know the location of the parser when an
event is processed.
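As an illustration of the locator, the following sketch records the line number at which each opening tag is reported. It assumes the JDK's built-in parser, which supplies a Locator through setDocumentLocator (SAX parsers are encouraged, but not required, to do so); LocatorDemo and elementLines are hypothetical names.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

class LocatorDemo extends DefaultHandler {
    private Locator locator;  // stays null if the parser supplies none
    private final Map<String, Integer> lines = new LinkedHashMap<>();

    /** Called by the parser, before any other event, with its own locator. */
    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        if (locator != null) {
            lines.put(qName, locator.getLineNumber());
        }
    }

    /** Returns, for each element name, the line where its opening tag was met. */
    static Map<String, Integer> elementLines(String xml) {
        LocatorDemo handler = new LocatorDemo();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("SAX parsing failed", e);
        }
        return handler.lines;
    }

    public static void main(String[] args) {
        System.out.println(elementLines("<a>\n<b/>\n</a>"));
    }
}
```

A typical use is error reporting: the handler can attach the current line to any message it emits while processing an event.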
SAX example: handler functions
Writing a handler = defining methods startDocument, startElement, endElement, etc.
    /** Opening tag handler */
    public void startElement(String nameSpaceURI, String localName,
                             String rawName, Attributes attributes)
            throws SAXException {
        System.out.println("Opening tag: " + localName);
        // Show the attributes, if any
        if (attributes.getLength() > 0) {
            System.out.println("Attributes: ");
            for (int i = 0; i < attributes.getLength(); i++) {
                System.out.println(attributes.getLocalName(i)
                        + " = " + attributes.getValue(i));
            }
        }
    }
SAX example: handler functions (cont.)
    /** Closing tag handler */
    public void endElement(String nameSpaceURI, String localName,
                           String rawName) throws SAXException {
        System.out.print("Closing tag: " + localName);
        System.out.println();
    }

    /** Character data handling */
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        // The third argument is a length, not an end position
        System.out.println("#PCDATA: "
                + new String(ch, start, length));
    }
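One point worth illustrating about the characters handler: SAX allows a parser to deliver the text of a single element in several chunks, so a handler that needs the complete text usually accumulates the chunks until the closing tag. The sketch below shows this pattern, assuming the JDK's built-in parser; TextCollector and collect are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class TextCollector extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> texts = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        buffer.setLength(0);  // start a fresh text for this element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);  // may be called several times
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (buffer.length() > 0) {
            texts.add(buffer.toString());
            buffer.setLength(0);
        }
    }

    /** Returns the complete text of each element that directly contains text. */
    static List<String> collect(String xml) {
        TextCollector handler = new TextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("SAX parsing failed", e);
        }
        return handler.texts;
    }

    public static void main(String[] args) {
        System.out.println(collect("<r><t>Hello &amp; bye</t><t>x</t></r>"));
    }
}
```

This simple version targets elements with text-only content; mixed content would need a stack of buffers.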
Calling the SAX handler
    import java.io.IOException;
    import org.xml.sax.*;
    import org.xml.sax.helpers.XMLReaderFactory;

    public class SaxExample {
        /** Constructor: parses the document found at the given URI */
        public SaxExample(String uri) throws SAXException, IOException {
            XMLReader saxReader = XMLReaderFactory.createXMLReader(
                    "org.apache.xerces.parsers.SAXParser");
            saxReader.setContentHandler(new SaxHandler());
            saxReader.parse(uri);
        }

        public static void main(String[] args) {
            try {
                SaxExample parser = new SaxExample(args[0]);
            } catch (Throwable t) {
                t.printStackTrace();
            }
        }
    }
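An alternative way to obtain the reader, sketched here under the assumption that no specific parser class is required, goes through the JAXP SAXParserFactory shipped with the JDK (XMLReaderFactory is deprecated since Java 9). The example also turns on namespace awareness, so that the nameSpaceURI and localName arguments of the handlers are filled in. JaxpSaxSetup and qualifiedNames are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class JaxpSaxSetup {

    /** Returns "{namespace URI}local name" for each opening tag, in document order. */
    static List<String> qualifiedNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);  // off by default in JAXP
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) {
                    names.add("{" + uri + "}" + localName);
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("SAX parsing failed", e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(qualifiedNames(
                "<m:movie xmlns:m='urn:example'><m:title/></m:movie>"));
    }
}
```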
DOM
The DOM approach
According to the DOM, everything in an XML document is a node.
In object-oriented terms: everything is an object, instance of class Node or instance of a subclass of Node.
1. The entire document is a Document node
2. Every XML element is an Element node
3. The texts contained in the XML elements are Text nodes
4. Every XML attribute is an Attribute node (class Attr)
5. Comments are Comment nodes
Plus many other classes, not used for the tree representation.
Remark
Remember: an Element node does not contain the text. The text is held by Text nodes that are children of the Element.
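The remark can be checked with a few lines of Java. This is a minimal sketch that assumes the DOM parser bundled with the JDK; ElementVersusText and parse are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class ElementVersusText {

    /** Parses an XML string into a DOM Document. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("DOM parsing failed", e);
        }
    }

    public static void main(String[] args) {
        Element d = parse("<D>Text 1</D>").getDocumentElement();
        // The Element itself carries no value
        System.out.println(d.getNodeValue());                       // null
        // The text lives in a Text child of the Element
        Node child = d.getFirstChild();
        System.out.println(child.getNodeType() == Node.TEXT_NODE);  // true
        System.out.println(child.getNodeValue());                   // Text 1
        // Convenience accessor that concatenates the descendant texts
        System.out.println(d.getTextContent());                     // Text 1
    }
}
```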
From serialized representation to DOM tree (reminder!)
    <?xml version="1.0" encoding="UTF-8"?>
    <A>
      <B>
        <D>Text 1</D>
        <D>Text 2</D>
      </B>
      <B>
        <D>Text 3</D>
      </B>
      <C att1="2" att2="3"/>
    </A>

    Document
      Element A
        Element B
          Element D
            Text - Text 1
          Element D
            Text - Text 2
        Element B
          Element D
            Text - Text 3
        Element C
          Attr att1 2
          Attr att2 3
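The tree above can be reproduced programmatically. The following sketch walks a parsed document and prints one line per node, with the same labels as the figure. It assumes the JDK's DOM parser and an input written without whitespace between tags (otherwise the parser also creates whitespace-only Text nodes); DomTreeDump and dump are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomTreeDump {

    /** Parses the XML text and returns an indented listing of its DOM tree. */
    static String dump(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("DOM parsing failed", e);
        }
        StringBuilder out = new StringBuilder();
        walk(doc, 0, out);
        return out.toString();
    }

    private static void line(StringBuilder out, int depth, String label) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(label).append('\n');
    }

    private static void walk(Node node, int depth, StringBuilder out) {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE:
                line(out, depth, "Document");
                break;
            case Node.ELEMENT_NODE:
                line(out, depth, "Element " + node.getNodeName());
                break;
            case Node.TEXT_NODE:
                line(out, depth, "Text - " + node.getNodeValue());
                break;
            default:
                line(out, depth, node.getNodeName());
        }
        // Attributes are not children: they are reached through getAttributes()
        NamedNodeMap atts = node.getAttributes();  // null unless node is an Element
        if (atts != null) {
            for (int i = 0; i < atts.getLength(); i++) {
                Node a = atts.item(i);
                line(out, depth + 1, "Attr " + a.getNodeName() + " " + a.getNodeValue());
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        System.out.print(dump("<A><B><D>Text 1</D><D>Text 2</D></B>"
                + "<B><D>Text 3</D></B><C att1=\"2\" att2=\"3\"/></A>"));
    }
}
```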
The DOM hierarchy (excerpt)
[Figure: excerpt of the DOM class hierarchy, rooted at Node. Classes shown: Node, TreeNode, Leaf, Container, Character Data, Text, CData Section, Comment, Processing Instruction, Entity Reference, Element, Document, Document Type, Entity, Attribute, Notation.]
The Node super-class
DOM is an attempt to provide an object-oriented model of XML documents.
Node is the super-class. It should gather all the properties common to all nodes. But some properties are meaningful in one subclass and remain undefined in another.
Example: the name is meaningful for Element nodes, but is undefined for Text nodes.
Actually there is no obvious OO hierarchy that cleanly models XML trees (i.e., from very abstract to very specialized nodes).
A pragmatic approach
The Node class provides all the properties of all the node types. Thus one can:
adopt the OO paradigm and map each node as accurately as possible to its specialized type;
or see everything as a Node, and follow a more procedural approach.
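The two approaches can be contrasted on a small case. In this sketch, describe treats every object as a plain Node and dispatches on its type code, while describeTyped maps each node to its specialized interface. It assumes the JDK's DOM implementation; TwoStyles and its methods are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Comment;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

class TwoStyles {

    /** Procedural style: everything is a Node, dispatch on the type code. */
    static String describe(Node node) {
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE:
                return "element " + node.getNodeName();
            case Node.TEXT_NODE:
                return "text " + node.getNodeValue();
            case Node.COMMENT_NODE:
                return "comment " + node.getNodeValue();
            default:
                return "other " + node.getNodeName();
        }
    }

    /** OO style: map each node to its specialized type and use that type's methods. */
    static String describeTyped(Node node) {
        if (node instanceof Element) {
            return "element " + ((Element) node).getTagName();
        }
        if (node instanceof Comment) {
            return "comment " + ((Comment) node).getData();
        }
        if (node instanceof Text) {
            return "text " + ((Text) node).getData();
        }
        return "other " + node.getNodeName();
    }

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("DOM parsing failed", e);
        }
    }

    public static void main(String[] args) {
        Element root = parse("<a>hi<!--c--></a>").getDocumentElement();
        for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
            System.out.println(describe(n) + " | " + describeTyped(n));
        }
    }
}
```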
Properties of the Node class
Property      Type        Property          Type
nodeType      short       nodeName          String
nodeValue     String      parentNode        Node
firstChild    Node        lastChild         Node
childNodes    NodeList    previousSibling   Node
nextSibling   Node        attributes        NamedNodeMap
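In the Java binding, each of these properties is read through a getter (getFirstChild, getNextSibling, getParentNode, and so on). The sketch below uses them to navigate a small tree; it assumes the JDK's DOM parser, and NodeProperties, parse and childNames are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeProperties {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("DOM parsing failed", e);
        }
    }

    /** Lists the children of a node by following firstChild, then nextSibling. */
    static List<String> childNames(Node parent) {
        List<String> names = new ArrayList<>();
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            names.add(c.getNodeName());
        }
        return names;
    }

    public static void main(String[] args) {
        Element root = parse("<A><B/><C att1=\"2\"/></A>").getDocumentElement();
        System.out.println(childNames(root));
        Node last = root.getLastChild();
        // parentNode and previousSibling lead back up and sideways
        System.out.println(last.getParentNode().getNodeName());
        System.out.println(last.getPreviousSibling().getNodeName());
        // attributes is a NamedNodeMap, indexed by attribute name
        System.out.println(last.getAttributes().getNamedItem("att1").getNodeValue());
    }
}
```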
Methods of the Node class
Some important methods of Node. Note: the “current node” refers to the object on which the method is called.
insertBefore(Node new, Node child)
Inserts the node new as a new child of the current node, just before child.
replaceChild(Node new, Node old)
Replaces the child node old by new.
removeChild(Node child)
Removes a child node.
appendChild(Node child)
Adds a child node in last position (i.e., after the last of the current children).
boolean hasChildNodes()
True if the current node has children.
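The effect of each method can be followed step by step on a small tree. This is a sketch assuming the JDK's DOM implementation; the element names and the names NodeEditing, demo and childNames are chosen for the illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class NodeEditing {

    static List<String> childNames(Node parent) {
        List<String> names = new ArrayList<>();
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            names.add(c.getNodeName());
        }
        return names;
    }

    /** Applies the editing methods in turn and returns the final child names of the root. */
    static List<String> demo() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new IllegalStateException("Cannot create a DOM document", e);
        }
        Element root = doc.createElement("A");
        doc.appendChild(root);
        Element b = doc.createElement("B");
        Element c = doc.createElement("C");
        Element d = doc.createElement("D");

        root.appendChild(c);       // children of A: C
        root.insertBefore(b, c);   // children of A: B C
        root.replaceChild(d, c);   // children of A: B D  (new node first, old node second)
        root.appendChild(c);       // children of A: B D C
        root.removeChild(b);       // children of A: D C
        return childNames(root);
    }

    public static void main(String[] args) {
        System.out.println(demo());  // [D, C]
    }
}
```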
Methods of the Document class
A Document object is always the first node created for a new XML tree.
Therefore it plays the role of a factory for creating new nodes that must be inserted in the tree.
Methods of Document:
createElement(): creates and returns an Element node;
createTextNode(): creates and returns a Text node;
createComment(): creates and returns a Comment node;
etc.
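A short sketch of the factory role: the Document creates every node, and each node is then attached to the tree explicitly. The result is serialized with the standard javax.xml.transform identity transformer, which is one possible way to serialize and an assumption of this example. The element names movie and title, and the names DocumentFactoryDemo, build and serialize, are illustrative.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class DocumentFactoryDemo {

    /** Builds a small tree from scratch, using the Document as node factory. */
    static Document build() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element movie = doc.createElement("movie");
            doc.appendChild(movie);  // created nodes are not in the tree until attached
            Element title = doc.createElement("title");
            title.appendChild(doc.createTextNode("Heat"));
            movie.appendChild(title);
            movie.appendChild(doc.createComment("illustrative content"));
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("Cannot build the document", e);
        }
    }

    /** Serializes a DOM tree to a string with the identity transformer. */
    static String serialize(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter w = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(w));
            return w.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Serialization failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize(build()));
    }
}
```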
A first example: the preorder DOM program
preorder is a simple DOM program that
1. Instantiates a DOM parser (Xerces);
2. Traverses a DOM tree in preorder;
3. Adds to each Text node its position in the preorder traversal;
4. Serializes the output.
Remark
The program is available on the web site. Tested with the Xerces parser.
Should work with any other parser!
First step: instantiate the parser
    // Import Java classes
    import java.io.*;
    import javax.xml.parsers.*;
    import org.w3c.dom.*;

    class DomPreorder {
        public static void main(String args[]) {
            try {
                // Instantiate the DOM parser
                DocumentBuilderFactory factory =
                    DocumentBuilderFactory.newInstance();
                DocumentBuilder builder =
                    factory.newDocumentBuilder();
Second step: call the preorder recursive method
Note two important initial expressions: one gets the Document node as the result of the parse method, and the root Element node as the result of the getDocumentElement method.

    // Analyse the document
    File fdom = new File(args[0]);
    Document dom = builder.parse(fdom);
    Node rootElement = dom.getDocumentElement();
    // Start the pre-order scan. The counter starts at 0 and is
    // incremented on entry, so the first node number is 1.
    exploreInPreorder(rootElement, 0);
    // Serialize the result
    DomSerializer serializer = new DomSerializer(dom);
    serializer.output("Output.xml");
The exploreInPreorder method
    private static int exploreInPreorder(Node node, int number) {
        String str = new String();
        number++;
        // If Text node: put the number in the node's value.
        if (node.getNodeType() == Node.TEXT_NODE) {
            str = "(" + number + ") " + node.getNodeValue();
            node.setNodeValue(str);
        }
        // Recursive call (see next slide)
        return number;
    }
The recursive call
    private static int exploreInPreorder(Node node, int number) {
        // (see previous slide)
        // Recursive call
        if (node.hasChildNodes()) {
            // Get the children of the current node
            NodeList children = node.getChildNodes();
            // Pre-order traversal for each node in the list
            for (int i = 0; i < children.getLength(); i++)
                number = exploreInPreorder(children.item(i), number);
        }
        return number;
    }
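Putting the two halves of the method together gives the following self-contained sketch. It differs from the program described above in two assumed ways: it parses an in-memory string with the JDK's DOM parser instead of a file with Xerces, and it returns the text content of the result instead of serializing it. The counter starts at 0, so the root gets number 1. PreorderSketch and numberTexts are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PreorderSketch {

    /** Numbers every node in preorder and prefixes each Text node with its number. */
    static int exploreInPreorder(Node node, int number) {
        number++;  // this node's position in the traversal
        if (node.getNodeType() == Node.TEXT_NODE) {
            node.setNodeValue("(" + number + ") " + node.getNodeValue());
        }
        if (node.hasChildNodes()) {
            NodeList children = node.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                // Each call returns the last number used in the child's subtree
                number = exploreInPreorder(children.item(i), number);
            }
        }
        return number;
    }

    /** Parses the XML text, runs the traversal, and returns the resulting text content. */
    static String numberTexts(String xml) {
        Document dom;
        try {
            dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("DOM parsing failed", e);
        }
        exploreInPreorder(dom.getDocumentElement(), 0);
        return dom.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) {
        // A=1, B=2, "x"=3, C=4, "y"=5
        System.out.println(numberTexts("<A><B>x</B><C>y</C></A>"));  // (3) x(5) y
    }
}
```

Returning the updated counter from each call is what lets siblings continue the numbering where the previous subtree stopped.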
XML:DB
Main components of XML:DB
The basic components employed by the XML:DB API are drivers, collections, and resources.
Drivers
are implementations of the database interface that encapsulate the database access logic for specific XML database products.
They are provided by the product vendor and must be registered with the database manager.
Collections
are hierarchical containers for resources and further sub-collections.
Resources
represent an XML document or a document fragment, selected by a query.
Remark
Our examples have been tested with eXist. See the site for the code, and
further instructions.
First XML:DB example: retrieving a document
The database driver class for eXist is org.exist.xmldb.DatabaseImpl.
The URI gives the address of the eXist instance, and the access protocol.

    public class ExistAccess {
        // Static, because they are used from the static main method
        static String DRIVER = "org.exist.xmldb.DatabaseImpl";
        static String URI = "xmldb:exist://localhost:8080/exist/xmlrpc";
        static String collectionPath = "/db/movies/";
        static String resourceName = "Heat.xml";
        // (see next slide)
Remark
See the actual code for import instructions.
First XML:DB example (continued)
        public static void main(String[] args) throws Exception {
            // Initialize the database driver
            Class<?> cl = Class.forName(DRIVER);
            Database database = (Database) cl.newInstance();
            DatabaseManager.registerDatabase(database);
            // Get the collection
            Collection col =
                DatabaseManager.getCollection(URI + collectionPath);
            // Get the content of a document
            System.out.println("Get " + resourceName);
            // getResource returns a Resource, hence the cast
            XMLResource res = (XMLResource) col.getResource(resourceName);
            System.out.println(res.getContent());
        }
    }
Second XML:DB example: execute a query
    // (Declarations and initializations as before)
    // Query a document
    String xQuery =
        "for $x in doc('movies.xml')//title return $x";
    // Instantiate an XQuery service
    XQueryService service =
        (XQueryService) col.getService("XQueryService", "1.0");
    // Execute the query, print the result
    ResourceSet result = service.query(xQuery);
    ResourceIterator i = result.getIterator();
    while (i.hasMoreResources()) {
        Resource r = i.nextResource();
        System.out.println((String) r.getContent());
    }