Posted to java-dev@axis.apache.org by James M Snell <ja...@us.ibm.com> on 2001/03/21 18:25:20 UTC
The Great Debate: Xml Parsers
All,
(I'm cross-posting this to the Xerces-dev list so our friends on the
parser-side of things can follow along and join in)
As many of you know, we've had discussions in the past about which Xml
Parser to use as the core of the Axis message processing API. Throughout
the course of this discussion, we've touched on several issues that have
become core requirements of Axis and need to drive our decision. These
requirements are:
1. Axis must not force the entire message object model to be in memory
at one time. In other words, DOM is out.
2. Axis must be very fast and very scalable in order to be widely
adopted over other Web Service implementation platforms.
3. We must be able to independently parse individual elements of the
message either as raw bits, SAX, the Axis-defined Message API, DOM or
whatever else the user wants.
4. We must be able to fully support SOAP semantics (e.g. multiref
elements, id/href, etc.) without an overly negative impact on performance
(see numbers 1 and 2).
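To make requirements 1 and 3 a little more concrete, here is a minimal
sketch of pull-style processing that walks a message and picks out
individual elements without ever building a tree. It uses the
javax.xml.stream (StAX) API purely as a stand-in for a pull parser; it is
not the XPP interface and not anything from the Axis codebase. The class
name PullSketch and the method childNames are hypothetical names chosen
for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration only: StAX stands in for "a pull parser".
class PullSketch {

    /**
     * Returns the local names of the direct children of the first element
     * named `parent`. Only the current event is held at any time; no object
     * model of the whole message is built.
     */
    static List<String> childNames(String xml, String parent) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            int depth = 0;         // depth of the element we are currently inside
            int parentDepth = -1;  // depth of `parent` once we have seen it
            while (r.hasNext()) {
                int event = r.next();  // the caller pulls the next event
                if (event == XMLStreamConstants.START_ELEMENT) {
                    depth++;
                    if (parentDepth < 0 && r.getLocalName().equals(parent)) {
                        parentDepth = depth;
                    } else if (parentDepth > 0 && depth == parentDepth + 1) {
                        names.add(r.getLocalName());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if (depth == parentDepth) {
                        break;  // end of `parent`: stop without reading the rest
                    }
                    depth--;
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return names;
    }
}
```

Because the caller drives the loop, it can stop early, skip a subtree, or
hand a single element off to some other consumer, which is the kind of
per-element control requirement 3 asks for.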
We've looked at Xerces, we've looked at JDOM, and most recently I've been
doing some work with a new Xml Pull Parser developed originally by
Aleksander Slominski as part of a research project for Indiana Univ. Below
is a basic summary of our thoughts thus far:
Xerces 1.x -> Our concern with the Xerces 1.x DOM is that it is slow, huge,
and complicated. These are the standard complaints with DOM that we've
all heard (note to the Xerces guys: I eagerly await the release of
Xerces2! :-) ...). It just won't scale well in the types of environments
in which we foresee Axis being deployed, which include limited-capacity
devices such as handhelds (where it probably wouldn't work at all,
simply due to its size).
We also looked at SAX as an alternative but quickly determined that SAX
just was not adequate for proper SOAP processing that also met the
requirements mentioned above. (For those of you who weren't part of that
discussion, I will not rehash it here; ping me later and I'll give you the
rundown.)
JDOM -> While JDOM is smaller and faster than Xerces and DOM, which is
nice, it still does not meet our requirements listed above. An additional
issue raised internally at IBM was that JDOM is nowhere near being a
standard yet. (As some of you may know, the current Axis codebase uses
JDOM for its message processing). We've all pretty much decided already
that JDOM should be removed from the core and should be replaced with a
lightweight XML parser that meets the requirements.
Xml Pull Parser (XPP) -> XPP is a lightweight (23k) pull parser that is
completely namespace aware and XML 1.0 compliant. Its interface needs
quite a bit of work so I've been working with the author on getting it
cleaned up. XPP has two advantages: 1. it's small, 2. it's fast. The
parser was originally implemented as part of a research project comparing
the performance of various parsers in relation to SOAP-deserialization.
I'll have to try to dig up the results of their tests again, but XPP
outperformed nearly everything else available. XPP would meet each of
our requirements once the interface redesign is complete. This interface
redesign includes building a SAX layer over the parser's primary
interface.
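As a rough idea of what "a SAX layer over the parser's primary interface"
can look like, the sketch below pulls events from a parser and replays them
as SAX callbacks. It again uses javax.xml.stream (StAX) as a stand-in,
since the XPP interface is not shown here; the class PullToSax and its
methods replay and trace are hypothetical. It is deliberately simplified:
it does not emit startPrefixMapping/endPrefixMapping calls, and it ignores
comments and processing instructions.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical adapter: drives a SAX ContentHandler from pull events.
class PullToSax {

    static void replay(String xml, ContentHandler handler) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            handler.startDocument();
            while (r.hasNext()) {
                switch (r.next()) {
                    case XMLStreamConstants.START_ELEMENT: {
                        // Copy the attributes of the current element into SAX form.
                        AttributesImpl atts = new AttributesImpl();
                        for (int i = 0; i < r.getAttributeCount(); i++) {
                            atts.addAttribute(
                                    orEmpty(r.getAttributeNamespace(i)),
                                    r.getAttributeLocalName(i),
                                    qName(r.getAttributePrefix(i), r.getAttributeLocalName(i)),
                                    r.getAttributeType(i),
                                    r.getAttributeValue(i));
                        }
                        handler.startElement(orEmpty(r.getNamespaceURI()), r.getLocalName(),
                                qName(r.getPrefix(), r.getLocalName()), atts);
                        break;
                    }
                    case XMLStreamConstants.END_ELEMENT:
                        handler.endElement(orEmpty(r.getNamespaceURI()), r.getLocalName(),
                                qName(r.getPrefix(), r.getLocalName()));
                        break;
                    case XMLStreamConstants.CHARACTERS:
                    case XMLStreamConstants.CDATA:
                        handler.characters(r.getTextCharacters(), r.getTextStart(),
                                r.getTextLength());
                        break;
                    default:
                        break;  // other event types are ignored in this sketch
                }
            }
            handler.endDocument();
            r.close();
        } catch (XMLStreamException | SAXException e) {
            throw new IllegalStateException(e);
        }
    }

    private static String orEmpty(String s) {
        return s == null ? "" : s;
    }

    private static String qName(String prefix, String local) {
        return (prefix == null || prefix.isEmpty()) ? local : prefix + ":" + local;
    }

    /** Usage example: re-serializes element tags and text seen through SAX. */
    static String trace(String xml) {
        final StringBuilder out = new StringBuilder();
        replay(xml, new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                out.append('<').append(qName).append('>');
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                out.append("</").append(qName).append('>');
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        });
        return out.toString();
    }
}
```

The point of layering in this direction is that the pull loop stays the
primary interface, and SAX consumers are served by a thin adapter that
only runs when someone asks for it.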
Now, here's what we need to decide:
Which is more important: Performance/Scalability or Standards support?
From earlier decisions, I believe that we have agreed that performance and
scalability in the case of Axis far outweigh standards support within the
core engine itself as long as there are hooks specifically designed into
the engine that allow full standards support if the developer wishes it.
That is why we were going to provide our own Axis Message API with
hooks for optionally processing the message with SAX or DOM (i.e. if the
developer wants to tank their performance by using DOM, so be it).
I would like to invite the Xerces guys to join this discussion so that we
may figure out how to resolve this issue. I understand now that Xerces 2
includes a Pull Parser interface of its own along with a low-level
interface that enables modularization, but many of us here either haven't
heard of it yet or aren't quite sure what it could mean for Axis. Could
anybody on the Xerces team explain this in greater depth for us?
- James Snell
Software Engineer, Emerging Technologies, IBM
jasnell@us.ibm.com (online)
jsnell@lemoorenet.com (offline)