Posted to java-dev@axis.apache.org by James M Snell <ja...@us.ibm.com> on 2001/03/21 18:25:20 UTC
The Great Debate: Xml Parsers

All,

(I'm cross-posting this to the Xerces-dev list so our friends on the 
parser-side of things can follow along and join in)

As many of you know, we've had discussions in the past about which Xml 
Parser to use as the core of the Axis message processing API.  Throughout 
the course of this discussion, we've touched on several issues that have 
become core requirements of Axis and need to drive our decision.  These 
requirements are:

   1  Axis must not force the entire message object model to be in memory 
at one time.  In other words, DOM is out.
   2  Axis must be very fast and very scalable in order to be widely 
adopted over other Web Service implementation platforms.
   3  We must be able to independently parse individual elements of the 
message either as raw bits, SAX, the Axis defined Message API, DOM or 
whatever else the user wants.
   4  We must be able to fully support SOAP semantics (i.e. multiref 
elements, id/href, etc) without an overly negative impact on performance 
(see numbers 1 and 2).

We've looked at Xerces, we've looked at JDOM, and most recently I've been 
doing some work with a new Xml Pull Parser developed originally by 
Aleksander Slominski as part of a research project for Indiana Univ. Below 
is a basic summary of our thoughts thus far:

Xerces 1.x ->  Our concern with Xerces 1.x DOM is that it is slow, huge, 
and complicated.  These are the standard complaints with DOM that we've 
all heard (note to the Xerces guys:  I eagerly await the release of 
Xerces2 ! :-) ....)  It just won't scale well in the types of environments 
in which we foresee Axis being deployed, which include limited-capacity 
devices such as handhelds (where it probably wouldn't work at all, simply 
because of its size).

We also looked at SAX as an alternative but quickly determined that SAX 
just was not adequate for proper SOAP processing that also met the 
requirements mentioned above.  (for those of you who weren't part of that 
discussion, I will not rehash it here, ping me later and I'll give you the 
rundown).

JDOM -> While JDOM is smaller and faster than Xerces and DOM, which is 
nice, it still does not meet our requirements listed above.  An additional 
issue raised internally at IBM was that JDOM is nowhere near being a 
standard yet.  (As some of you may know, the current Axis codebase uses 
JDOM for its message processing).  We've all pretty much decided already 
that JDOM should be removed from the core and should be replaced with a 
lightweight XML parser that meets the requirements.

Xml Pull Parser (XPP) -> XPP is a lightweight (23k) pull parser that is 
completely namespace aware and XML 1.0 compliant.  Its interface needs 
quite a bit of work so I've been working with the author on getting it 
cleaned up.  XPP has two advantages: 1. it's small, 2. it's fast.  The 
parser was originally implemented as part of a research project comparing 
the performance of various parsers in relation to SOAP-deserialization. 
I'll have to try to dig up the results of their tests again, but XPP 
outperformed nearly everything else available.   XPP would meet each of 
our requirements once the interface redesign is complete.  This interface 
redesign includes building a SAX layer over the parser's primary 
interface.
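
To make the pull model concrete, here is a minimal sketch. It does not use 
XPP's interface (which is still being reworked); as a stand-in it assumes the 
standard javax.xml.stream (StAX) pull API. The class name PullSketch and the 
helper firstBodyChild are hypothetical. The point it illustrates is that the 
caller drives the parser and can stop asking for events as soon as it has 
what it needs, so no object model of the whole message is ever built.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration; StAX stands in for XPP here.
class PullSketch {
    static final String SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/";

    // Returns the local name of the first element inside the SOAP Body,
    // or null if there is none. No further events are pulled once it is found.
    static String firstBodyChild(String xml) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                boolean inBody = false;
                while (r.hasNext()) {
                    if (r.next() != XMLStreamConstants.START_ELEMENT) {
                        continue; // skip text, end tags, comments, etc.
                    }
                    if (inBody) {
                        return r.getLocalName();
                    }
                    // Namespace-aware match on the SOAP 1.1 Body element.
                    inBody = SOAP_NS.equals(r.getNamespaceURI())
                            && "Body".equals(r.getLocalName());
                }
                return null;
            } finally {
                r.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String msg = "<s:Envelope xmlns:s='" + SOAP_NS + "'><s:Header/>"
                + "<s:Body><getQuote><symbol>IBM</symbol></getQuote></s:Body>"
                + "</s:Envelope>";
        System.out.println(firstBodyChild(msg));
    }
}
```

With SAX the parser would push every event through callbacks whether or not 
the handler still cared; here the loop simply returns.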

Now, here's what we need to decide:

Which is more important: Performance/Scalability or Standards support?

From earlier decisions, I believe that we have agreed that performance and 
scalability in the case of Axis far outweigh standards support within the 
core engine itself as long as there are hooks specifically designed into 
the engine that allow full standards support if the developer wishes it. 
That is why we were going to provide our own Axis Message API with 
hooks for optionally processing the message with SAX or DOM.  (i.e. if the 
developer wants to tank their performance by using DOM, so be it)
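
As a rough sketch of what such a hook could look like, the following replays 
pull events into a standard SAX ContentHandler. This is not the Axis Message 
API and not XPP's planned SAX layer; it assumes the javax.xml.stream (StAX) 
API as a stand-in pull parser, and the names PullToSaxSketch, replay and 
trace are hypothetical. Prefix-mapping callbacks are left out to keep it 
short.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration of a SAX layer driven by a pull parser.
class PullToSaxSketch {
    private static String nz(String s) { return s == null ? "" : s; }

    private static String qname(String prefix, String local) {
        return (prefix == null || prefix.isEmpty()) ? local : prefix + ":" + local;
    }

    // Pulls events one at a time and forwards them as SAX callbacks.
    // startPrefixMapping/endPrefixMapping are omitted in this sketch.
    static void replay(String xml, ContentHandler h) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            h.startDocument();
            while (r.hasNext()) {
                switch (r.next()) {
                    case XMLStreamConstants.START_ELEMENT: {
                        AttributesImpl atts = new AttributesImpl();
                        for (int i = 0; i < r.getAttributeCount(); i++) {
                            String local = r.getAttributeLocalName(i);
                            atts.addAttribute(nz(r.getAttributeNamespace(i)), local,
                                    qname(r.getAttributePrefix(i), local),
                                    "CDATA", r.getAttributeValue(i));
                        }
                        h.startElement(nz(r.getNamespaceURI()), r.getLocalName(),
                                qname(r.getPrefix(), r.getLocalName()), atts);
                        break;
                    }
                    case XMLStreamConstants.END_ELEMENT:
                        h.endElement(nz(r.getNamespaceURI()), r.getLocalName(),
                                qname(r.getPrefix(), r.getLocalName()));
                        break;
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                        h.characters(r.getTextCharacters(), r.getTextStart(),
                                r.getTextLength());
                        break;
                    default:
                        break; // comments, PIs, etc. are ignored here
                }
            }
            h.endDocument();
            r.close();
        } catch (XMLStreamException | SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    // Example handler: rebuilds a bare-bones view of the element structure.
    static String trace(String xml) {
        final StringBuilder out = new StringBuilder();
        replay(xml, new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes a) {
                out.append('<').append(local).append('>');
            }
            @Override
            public void endElement(String uri, String local, String qName) {
                out.append("</").append(local).append('>');
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(trace("<a x='1'><b>hi</b></a>"));
    }
}
```

The core stays pull-based; a developer who wants SAX pays for the adapter 
only when they ask for it.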

I would like to invite the Xerces guys to join this discussion so that we 
may figure out how to resolve this issue.  I understand now that Xerces 2 
includes a Pull Parser interface of its own along with a low level 
interface that enables modularization, but many of us here either haven't 
heard of it yet or aren't quite sure what it could mean for Axis.  Could 
anybody on the Xerces team explain this in greater depth for us?

- James Snell
     Software Engineer, Emerging Technologies, IBM
     jasnell@us.ibm.com (online)
     jsnell@lemoorenet.com (offline)