
XML APIs

Web Data Management and Distribution

Serge Abiteboul Philippe Rigaux Marie-Christine Rousset Pierre Senellart

http://gemo.futurs.inria.fr/wdmd

January 4, 2010


Introduction

Application Programming Interfaces (APIs)

DOM, the Document Object Model.

It provides a hierarchical representation, where each node is an object instance of a DOM class.

Normalized by the W3C (see http://www.w3.org/DOM/).

DOM parsers exist for most object-oriented languages: Java and C++ (the Xerces parser, from Apache), JavaScript (Ajax), PHP, Python, etc.

Not very efficient, and space-consuming: the whole document is held in memory as a tree.

SAX, the Simple API for XML.

Operates on the serialized representation;

Associates triggers to each syntactic feature (e.g., a tag);

Efficient (a single scan of the serialized representation), but not always appropriate.

XML:DB, the XML Database API.

Provides a common interface to native or XML-enabled databases (i.e., meant as the “JDBC for XML” API);

Promoted by the XML:DB initiative (see http://xmldb-org.sourceforge.net/xapi/).


Content of this presentation

A bird’s eye view of the principles of these APIs, along with a few examples.


SAX

Outline

1 Introduction

2 SAX

3 DOM

4 XML:DB


SAX: main principles

SAX is the API of choice for processing XML documents in serialized form (including XML streams).

The XML input is read once, and the parser triggers handlers when events are met.

An event is simply a syntactic feature of the document: an opening or a closing tag, a character string, an entity, etc.

[Figure: the SAX parser scans the XML document in serialized form (<el1> ... </el1><el2> ... character data) and passes each event to an event handler, whose functions update the data storage.]


SAX example: the Handler

Programming with SAX = writing a handler (a class that implements the ContentHandler interface), which defines all the functions that must be triggered.

    import org.xml.sax.*;
    import org.xml.sax.helpers.LocatorImpl;

    public class SaxHandler implements ContentHandler {
        private Locator locator;

        /** Constructor */
        public SaxHandler() {
            super();
            // Set the default locator
            locator = new LocatorImpl();
        }

Note: the locator can be used to know the location of the parser when an event is processed.
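The note on the locator can be illustrated with a minimal sketch. It assumes the parser bundled with the JDK (obtained through SAXParserFactory) rather than Xerces, and it extends DefaultHandler so that only two methods need to be written. The class name LocatorDemo and the method startTagLines are hypothetical. SAX parsers are encouraged, though not required, to supply their own Locator through setDocumentLocator before any other event, and that locator is the one used here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: records the line on which each opening tag ends.
class LocatorDemo {

    static List<Integer> startTagLines(String xml) {
        List<Integer> lines = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private Locator locator;

            // Called by the parser (if it supports locators) before any other event.
            @Override
            public void setDocumentLocator(Locator locator) {
                this.locator = locator;
            }

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                if (locator != null) {
                    lines.add(locator.getLineNumber());
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // One line number per opening tag of the small document.
        System.out.println(startTagLines("<a>\n<b/>\n</a>"));
    }
}
```

The position reported by a locator is that of the end of the text associated with the event, so it is mostly useful for diagnostics and error messages.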


SAX example: handler functions

Writing a handler = defining methods startDocument, startElement, endElement, etc.

        /** Opening tag handler */
        public void startElement(String nameSpaceURI, String localName,
                                 String rawName, Attributes attributes)
                throws SAXException {
            System.out.println("Opening tag: " + localName);

            // Show the attributes, if any
            if (attributes.getLength() > 0) {
                System.out.println("Attributes: ");
                for (int i = 0; i < attributes.getLength(); i++) {
                    System.out.println(attributes.getLocalName(i)
                            + " = " + attributes.getValue(i));
                }
            }
        }


SAX example: handler functions (cont.)

        /** Closing tag handler */
        public void endElement(String nameSpaceURI, String localName,
                               String rawName) throws SAXException {
            System.out.println("Closing tag: " + localName);
        }

        /** Character data handling */
        public void characters(char[] ch, int start, int length)
                throws SAXException {
            // The third argument is a number of characters, not an end index
            System.out.println("#PCDATA: " + new String(ch, start, length));
        }
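One point worth keeping in mind with the characters handler: a SAX parser is allowed to deliver the text of a single element in several successive calls. The following sketch, under the assumption that the JDK's built-in parser is used, buffers the chunks and only reads the text when the closing tag is met. The class TextCollector and its collect method are hypothetical names, and the handler deliberately ignores mixed content.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects "name=text" pairs for elements that contain text.
class TextCollector extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final List<String> texts = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attributes) {
        buffer.setLength(0); // start a fresh buffer for this element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one text node: append, do not overwrite.
        buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String text = buffer.toString().trim();
        if (!text.isEmpty()) {
            texts.add(qName + "=" + text);
        }
        buffer.setLength(0);
    }

    static List<String> collect(String xml) {
        TextCollector handler = new TextCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return handler.texts;
    }

    public static void main(String[] args) {
        System.out.println(collect("<movie><title>Heat</title></movie>"));
    }
}
```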


Calling the SAX handler

    import java.io.IOException;
    import org.xml.sax.SAXException;
    import org.xml.sax.XMLReader;
    import org.xml.sax.helpers.XMLReaderFactory;

    public class SaxExample {
        /** Constructor: parses the document found at the given URI */
        public SaxExample(String uri) throws SAXException, IOException {
            XMLReader saxReader = XMLReaderFactory.createXMLReader(
                    "org.apache.xerces.parsers.SAXParser");
            saxReader.setContentHandler(new SaxHandler());
            saxReader.parse(uri);
        }

        public static void main(String[] args) {
            try {
                SaxExample parser = new SaxExample(args[0]);
            } catch (Throwable t) {
                t.printStackTrace();
            }
        }
    }
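The call above names the Xerces parser class explicitly, and XMLReaderFactory has been deprecated since Java 9. As a possible alternative, here is a minimal sketch that obtains an XMLReader through the standard SAXParserFactory and parses an in-memory string. The class SaxElementCounter and its handler are hypothetical and only count opening tags; the SaxHandler of the previous slides could be registered in the same way.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example: counts the elements of a document with a SAX handler.
class SaxElementCounter {

    static int countElements(String xml) {
        int[] count = {0}; // one-cell array so the anonymous handler can update it
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                count[0]++;
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // so that localName is filled in
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        System.out.println(countElements("<A><B/><B/><C att1=\"2\"/></A>"));
    }
}
```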


DOM


The DOM approach

According to the DOM, everything in an XML document is a node.

In object-oriented terms: everything is an object, instance of class Node or instance of a subclass of Node.

1 The entire document is a Document node

2 Every XML element is an Element node

3 The texts contained in the XML elements are Text nodes

4 Every XML attribute is an Attr node

5 Comments are Comment nodes

Plus, many other classes, not used for the tree representation.

Remark

Remember: an Element node does not contain the text.
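The remark can be checked with a short sketch, assuming the DOM parser bundled with the JDK. The class ElementTextDemo and the method parseRoot are hypothetical names. The nodeValue of an Element is null; the text is held by a Text child node.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo: where the text of an element actually lives.
class ElementTextDemo {

    static Element parseRoot(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        Element title = parseRoot("<title>Heat</title>");

        // The Element node itself has no value.
        System.out.println("Element value: " + title.getNodeValue());

        // The text is stored in a child node of type Text.
        Node child = title.getFirstChild();
        System.out.println("Child is a Text node: "
                + (child.getNodeType() == Node.TEXT_NODE));
        System.out.println("Child value: " + child.getNodeValue());
    }
}
```

DOM Level 3 also offers getTextContent() as a convenience that concatenates the text of all descendants, but the underlying tree still stores the text in Text nodes.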


From serialized representation to DOM tree (reminder!)

    <?xml version="1.0" encoding="UTF-8"?>
    <A>
      <B>
        <D>Text 1</D>
        <D>Text 2</D>
      </B>
      <B>
        <D>Text 3</D>
      </B>
      <C att1="2" att2="3"/>
    </A>

    Document
      Element A
        Element B
          Element D
            Text "Text 1"
          Element D
            Text "Text 2"
        Element B
          Element D
            Text "Text 3"
        Element C
          Attr att1 = "2"
          Attr att2 = "3"


The DOM hierarchy (excerpt)

[Figure: excerpt of the DOM class hierarchy, rooted at Node. Classes shown: Node, TreeNode, Leaf, Container, Document, Document Type, Element, Attribute, Character Data, Text, CData Section, Comment, Processing Instruction, Entity, Entity Reference, Notation.]


The Node super-class

DOM is an attempt to provide an object-oriented model of XML documents.

Node is the super-class. It should gather all the properties common to all nodes. But some properties are meaningful in one child class and remain undefined in another.

Example: the name is meaningful for Element nodes, but undefined for Text nodes.

Actually, there is no obvious OO hierarchy that cleanly models XML trees (i.e., from very abstract to very specialized nodes).

A pragmatic approach

The Node class provides all the properties of all the node types. Thus one can:

adopt the OO paradigm and map each node as accurately as possible to its specialized type;

or see everything as a Node, and follow a more procedural approach.


Properties of the Node class

    Property          Type
    nodeType          short
    nodeName          String
    nodeValue         String
    parentNode        Node
    firstChild        Node
    lastChild         Node
    childNodes        NodeList
    previousSibling   Node
    nextSibling       Node
    attributes        NamedNodeMap
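To see how these properties behave across node types, here is a small sketch under the assumption that the JDK's DOM parser is used. The class NodePropertiesDemo and its describe method are hypothetical. It prints nodeType, nodeName and nodeValue for a Document, an Element, an attribute, a comment and a text node.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo: the same three properties read on different node types.
class NodePropertiesDemo {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    // Returns "nodeType nodeName nodeValue" for any node.
    static String describe(Node n) {
        return n.getNodeType() + " " + n.getNodeName() + " " + n.getNodeValue();
    }

    public static void main(String[] args) {
        Document doc = parse("<C att1=\"2\"><!--note-->Text 1</C>");
        Element root = doc.getDocumentElement();

        System.out.println(describe(doc));                    // Document node
        System.out.println(describe(root));                   // Element node
        System.out.println(describe(
                root.getAttributes().getNamedItem("att1")));  // attribute node
        System.out.println(describe(root.getFirstChild()));   // Comment node
        System.out.println(describe(root.getLastChild()));    // Text node
    }
}
```

The procedural style mentioned earlier relies on exactly this: every node answers the same calls, and the program tests nodeType to decide what the answers mean.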


Methods of the Node class

Some important methods of Node. Note: the “current node” refers to the object on which the method is called.

insertBefore(Node new, Node child): inserts the node new as a new child of the current node, just before child;

replaceChild(Node new, Node old): replaces the child node old with new;

removeChild(Node child): removes the child node child;

appendChild(Node child): adds child in last position (i.e., after the last of the current children);

boolean hasChildNodes(): true if the current node has children.
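The methods listed above can be tried on a tree built in memory. This is a minimal sketch, assuming the JDK's DocumentBuilder; the class NodeMethodsDemo and its helper childNames are hypothetical. After each update it records the names of the children of A.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical demo of the Node update methods.
class NodeMethodsDemo {

    // Comma-separated names of the children of a node.
    static String childNames(Node parent) {
        StringBuilder names = new StringBuilder();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (i > 0) {
                names.append(',');
            }
            names.append(children.item(i).getNodeName());
        }
        return names.toString();
    }

    static List<String> run() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM implementation available", e);
        }
        List<String> steps = new ArrayList<>();
        Element a = doc.createElement("A");
        Element b = doc.createElement("B");
        Element c = doc.createElement("C");
        Element d = doc.createElement("D");
        doc.appendChild(a);

        a.appendChild(c);                 // C becomes the only child of A
        steps.add(childNames(a));
        a.insertBefore(b, c);             // B is inserted just before C
        steps.add(childNames(a));
        a.replaceChild(d, c);             // D takes the place of C
        steps.add(childNames(a));
        a.removeChild(b);                 // B is removed
        steps.add(childNames(a));
        a.removeChild(d);                 // A is now empty
        steps.add(String.valueOf(a.hasChildNodes()));
        return steps;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```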


Methods of the Document class

A Document object is always the first node created for a new XML tree. Therefore it plays the role of a factory for creating the new nodes that must be inserted in the tree.

Methods of Document:

createElement(): creates and returns an Element node;

createTextNode(): creates and returns a Text node;

createComment(): creates and returns a Comment node;

etc.


A first example: the preorder DOM program

preorder is a simple DOM program that

1 instantiates a DOM parser (Xerces);

2 traverses a DOM tree in preorder;

3 adds to each Text node its position in the preorder traversal;

4 serializes the output.

Remark

The program is available on the web site. Tested with the Xerces parser.

Should work with any other parser!


First step: instantiate the parser

    // Import Java classes
    import java.io.*;
    import javax.xml.parsers.*;
    import org.w3c.dom.*;

    class DomPreorder {
        public static void main(String args[]) {
            try {
                // Instantiate the DOM parser
                DocumentBuilderFactory factory =
                        DocumentBuilderFactory.newInstance();
                DocumentBuilder builder =
                        factory.newDocumentBuilder();


Second step: call the preorder recursive method

Note two important initial expressions: one gets the Document node as the result of the parse method, and the root Element node as the result of the getDocumentElement method.

                // Analyse the document
                File fdom = new File(args[0]);
                Document dom = builder.parse(fdom);
                Node rootElement = dom.getDocumentElement();

                // Start the pre-order scan.
                // The first node number is 1.
                exploreInPreorder(rootElement, 1);

                // Serialize the result
                DomSerializer serializer = new DomSerializer(dom);
                serializer.output("Output.xml");
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
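DomSerializer is not a class of the standard Java API. When it is not at hand, one possible substitute is the identity transformation of javax.xml.transform, sketched below. The class DomSerializationSketch and its two methods are hypothetical, and the sketch writes to a string rather than to Output.xml so that it can be run anywhere.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical stand-in for a DOM serializer, built on the standard API only.
class DomSerializationSketch {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    static String serialize(Document doc) {
        try {
            // A Transformer created without a stylesheet copies its input.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Serialization failed", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<movie><title>Heat</title></movie>");
        System.out.println(serialize(doc));
    }
}
```

To write a file instead, the StreamResult can be built from a java.io.File.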


The exploreInPreorder method

        private static int exploreInPreorder(Node node, int number) {
            String str = new String();
            number++;

            // If Text node: put the number in the node's value.
            if (node.getNodeType() == Node.TEXT_NODE) {
                str = "(" + number + ") " + node.getNodeValue();
                node.setNodeValue(str);
            }

            // Recursive call (see next slide)
            return number;
        }


The recursive call

        private static int exploreInPreorder(Node node, int number) {
            // ... (see previous slide)

            // Recursive call
            if (node.hasChildNodes()) {
                // Get the children of the current node
                NodeList children = node.getChildNodes();

                // Pre-order traversal for each node in the list
                for (int i = 0; i < children.getLength(); i++)
                    number = exploreInPreorder(children.item(i), number);
            }
            return number;
        }
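Since the method is split over two slides, it may help to see both halves in one piece. The sketch below is a self-contained variant, not the program distributed on the web site: the class PreorderSketch and the method numberTextNodes are hypothetical, the document is parsed from a string, and the counter starts at 0 so that the root element receives number 1 (an assumption made for this example).

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical one-piece variant of the preorder numbering program.
class PreorderSketch {

    // Numbers the node, then its descendants; returns the last number used.
    static int exploreInPreorder(Node node, int number) {
        number++;

        // If Text node: prefix its value with its preorder position.
        if (node.getNodeType() == Node.TEXT_NODE) {
            node.setNodeValue("(" + number + ") " + node.getNodeValue());
        }

        // Recursive call on each child, left to right.
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            number = exploreInPreorder(children.item(i), number);
        }
        return number;
    }

    // Parses the string, numbers the text nodes, returns the resulting text.
    static String numberTextNodes(String xml) {
        Document dom;
        try {
            dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        Node root = dom.getDocumentElement();
        exploreInPreorder(root, 0); // assumption: the root gets number 1
        return root.getTextContent();
    }

    public static void main(String[] args) {
        System.out.println(numberTextNodes("<A><B>x</B><C>y</C></A>"));
    }
}
```

Attributes are not children in the DOM, so they are not visited and do not consume a number in this traversal.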


XML:DB


Main components of XML:DB

The basic components employed by the XML:DB API are drivers, collections, and resources.

Drivers are implementations of the database interface that encapsulate the database access logic for specific XML database products. They are provided by the product vendor and must be registered with the database manager.

Collections are hierarchical containers for resources and further sub-collections.

Resources represent an XML document or a document fragment, selected by a query.

Remark

Our examples have been tested with eXist. See the site for the code, and further instructions.


First XML:DB example: retrieving a document

The database driver class for eXist is org.exist.xmldb.DatabaseImpl.

The URI gives the address of the eXist instance, and the access protocol.

    public class ExistAccess {
        // Declared static because they are used from the static main method
        static String DRIVER = "org.exist.xmldb.DatabaseImpl";
        static String URI = "xmldb:exist://localhost:8080/exist/xmlrpc";
        static String collectionPath = "/db/movies/";
        static String resourceName = "Heat.xml";

        // ... (see next slide)

Remark

See the actual code for import instructions.


First XML:DB example (continued)

        public static void main(String[] args) throws Exception {
            // Initialize the database driver
            Class cl = Class.forName(DRIVER);
            Database database = (Database) cl.newInstance();
            DatabaseManager.registerDatabase(database);

            // Get the collection
            Collection col =
                    DatabaseManager.getCollection(URI + collectionPath);

            // Get the content of a document
            System.out.println("Get " + resourceName);
            // getResource returns a Resource, hence the cast
            XMLResource res = (XMLResource) col.getResource(resourceName);
            System.out.println(res.getContent());
        }
    }


Second XML:DB example: execute a query

            // ... (declarations and initializations as before)

            // Query a document
            String xQuery =
                    "for $x in doc('movies.xml')//title return $x";

            // Instantiate an XQuery service.
            // The second argument (the service version) is cut off in the source.
            XQueryService service = (XQueryService) col.getService(
                    "XQueryService", ...);

            // Execute the query, print the result
            ResourceSet result = service.query(xQuery);
            ResourceIterator i = result.getIterator();
            while (i.hasMoreResources()) {
                Resource r = i.nextResource();
                System.out.println((String) r.getContent());
            }