Posted to j-dev@xerces.apache.org by Sam Ruby <ru...@us.ibm.com> on 2001/03/21 22:26:35 UTC

The Great Debate: Xml Parsers

Cross posting to xerces-j-dev.

- Sam Ruby

---------------------- Forwarded by Sam Ruby/Raleigh/IBM on 03/21/2001 02:36 PM ---------------------------

James M Snell/Fresno/IBM@IBMUS on 03/21/2001 12:25:20 PM

Please respond to axis-dev@xml.apache.org

To:   axis-dev@xml.apache.org
cc:   xerces-dev@xml.apache.org
Subject:  The Great Debate: Xml Parsers



All,

(I'm cross-posting this to the Xerces-dev list so our friends on the
parser-side of things can follow along and join in)

As many of you know, we've had discussions in the past about which Xml
Parser to use as the core of the Axis message processing API.  Throughout
the course of this discussion, we've touched on several issues that have
become core requirements of Axis and need to drive our decision.  These
requirements are:

   1  Axis must not force the entire message object model to be in memory
at one time.  In other words, DOM is out.
   2  Axis must be very fast and very scalable in order to be widely
adopted over other Web Service implementation platforms
   3  We must be able to independently parse individual elements of the
message either as raw bits, SAX, the Axis defined Message API, DOM or
whatever else the user wants.
   4  We must be able to fully support SOAP semantics (i.e. multiref
elements, id/href, etc) without an overly negative impact on performance
(see number 1 and 2)
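
To make requirement 4 concrete, here is a minimal sketch of resolving
id/href multirefs in one forward pass over a message.  It is an
illustration only, not Axis code: it uses the JDK's StAX API
(javax.xml.stream) as a stand-in for whatever pull parser is chosen, the
class and method names are hypothetical, and it assumes each referring
element has a distinct name, that attributes are unqualified, and that
each multiref target holds plain text.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch: maps each referring element name to the text of its target.
public class MultirefSketch {
    public static Map<String, String> resolve(String xml) {
        Map<String, String> textById = new HashMap<>();        // id -> text content
        Map<String, String> pending = new LinkedHashMap<>();   // element name -> target id
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (r.hasNext()) {
                if (r.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                String name = r.getLocalName();
                // A null namespace argument means the attribute namespace is not checked.
                String href = r.getAttributeValue(null, "href");
                String id = r.getAttributeValue(null, "id");
                if (href != null && href.startsWith("#")) {
                    // The target may appear later in the stream, so remember the reference.
                    pending.put(name, href.substring(1));
                } else if (id != null) {
                    // Assumes a text-only target; getElementText() rejects child elements.
                    textById.put(id, r.getElementText());
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
        Map<String, String> resolved = new LinkedHashMap<>();
        pending.forEach((name, target) -> resolved.put(name, textById.get(target)));
        return resolved;
    }

    public static void main(String[] args) {
        String body = "<Body><a href=\"#v1\"/><b href=\"#v1\"/>"
                + "<item id=\"v1\">hello</item></Body>";
        System.out.println(resolve(body));
    }
}
```

Only the reference table and the target values are retained, not a tree
of the whole message.  A forward reference still cannot be resolved until
its target has been read, which is where requirement 4 pulls against
requirements 1 and 2.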

We've looked at Xerces, we've looked at JDOM, and most recently I've been
doing some work with a new Xml Pull Parser developed originally by
Aleksander Slominski as part of a research project for Indiana Univ. Below
is a basic summary of our thoughts thus far:

Xerces 1.x ->  Our concern with Xerces 1.x DOM is that it is slow, huge,
and complicated.  These are the standard complaints with DOM that we've
all heard (note to the Xerces guys:  I eagerly await the release of
Xerces2 ! :-) ....)  It just won't scale well in the types of environments
that we foresee Axis being deployed in (which include limited capacity
devices such as handhelds, in which case it probably wouldn't work at all
due simply to its size).

We also looked at SAX as an alternative but quickly determined that SAX
just was not adequate for proper SOAP processing that also met the
requirements mentioned above.  (for those of you who weren't part of that
discussion, I will not rehash it here, ping me later and I'll give you the
rundown).

JDOM -> While JDOM is smaller and faster than Xerces and DOM, which is
nice, it still does not meet our requirements listed above.  An additional
issue raised internally at IBM was that JDOM is nowhere near being a
standard yet.  (As some of you may know, the current Axis codebase uses
JDOM for its message processing).  We've all pretty much decided already
that JDOM should be removed from the core and should be replaced with a
lightweight XML parser that meets the requirements.

Xml Pull Parser (XPP) -> XPP is a lightweight (23k) pull parser that is
completely namespace aware and XML 1.0 compliant.  Its interface needs
quite a bit of work so I've been working with the author on getting it
cleaned up.  XPP has two advantages: 1. it's small, 2. it's fast.  The
parser was originally implemented as part of a research project comparing
the performance of various parsers in relation to SOAP-deserialization.
I'll have to try to dig up the results of their tests again, but XPP
outperformed nearly everything else available.   XPP would meet each of
our requirements once the interface redesign is complete.  This interface
redesign includes building a SAX layer over the parser's primary
interface.
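
For readers who have not used a pull parser, the sketch below shows the
general model: the application asks for the next event and can stop as
soon as it has what it needs.  This is not XPP's interface (which is
still being redesigned); it uses the JDK's StAX API (javax.xml.stream)
as a stand-in, and the class name, method name, and the unqualified
"Body" element are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch of pull parsing: list the direct children of the first Body element.
public class PullSketch {
    public static List<String> bodyChildren(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            int depth = 0;       // nesting depth of the current element
            int bodyDepth = -1;  // depth at which Body was opened, -1 until seen
            while (r.hasNext()) {
                int event = r.next();  // the caller drives the parser, one event at a time
                if (event == XMLStreamConstants.START_ELEMENT) {
                    depth++;
                    if (bodyDepth < 0 && "Body".equals(r.getLocalName())) {
                        bodyDepth = depth;
                    } else if (bodyDepth > 0 && depth == bodyDepth + 1) {
                        names.add(r.getLocalName());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if (depth == bodyDepth) {
                        break;  // Body is closed; the rest of the stream is never read
                    }
                    depth--;
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        String msg = "<Envelope><Header><h/></Header>"
                + "<Body><getQuote><sym>IBM</sym></getQuote><x/></Body></Envelope>";
        System.out.println(bodyChildren(msg));
    }
}
```

No tree is built and nothing after the closing Body tag is parsed.  With
SAX the parser calls the application instead, so stopping early or handing
a sub-element to a different consumer takes extra machinery.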

Now, here's what we need to decide:

Which is more important: Performance/Scalability or Standards support?

From earlier decisions, I believe that we have agreed that performance and
scalability in the case of Axis far outweigh standards support within the
core engine itself as long as there are hooks specifically designed into
the engine that allow full standards support if the developer wishes it.
Thus the reason we were going to provide our own Axis Message API with
hooks for optionally processing the message with SAX or DOM.  (i.e. if the
developer wants to tank their performance by using DOM, so be it)

I would like to invite the Xerces guys to join this discussion so that we
may figure out how to resolve this issue.  I understand now that Xerces 2
includes a Pull Parser interface of its own along with a low level
interface that enables modularization, but many of us here either haven't
heard of it yet or aren't quite sure what it could mean for Axis.  Could
anybody on the Xerces team explain this in greater depth for us?

- James Snell
     Software Engineer, Emerging Technologies, IBM
     jasnell@us.ibm.com (online)
     jsnell@lemoorenet.com (offline)





---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-dev-help@xml.apache.org

Re: The Great Debate: Xml Parsers

Posted by Arnaud Le Hors <le...@us.ibm.com>.

James M Snell wrote:

> (I'm cross-posting this to the Xerces-dev list so our friends on the
> parser-side of things can follow along and join in)

Thanks.

>    1  Axis must not force the entire message object model to be in memory
> at one time.  In other words, DOM is out.

This is bogus. There is nothing in the DOM that forces one to have the
whole document in memory. The DOM is just an API. A set of interfaces.

> Xerces 1.x ->  Our concern with Xerces 1.x DOM is that it is slow, huge,
> and complicated.  These are the standard complaints with DOM that we've
> all heard (note to the Xerces guys:  I eagerly await the release of
> Xerces2 ! :-) ....)

So far the Xerces2 DOM and Xerces1 DOM are the same. We're
rearchitecting the parser. We do not have any plan to write a new DOM
implementation (I believe one could very well fit your requirements
though). So, I'm not sure what you're waiting for here but you might be
disappointed.

> JDOM -> While JDOM is smaller and faster than Xerces and DOM,

Unless I miss something and JDOM now includes a parser, comparing JDOM
to Xerces makes no sense.
I assume you mean "while JDOM is smaller and faster than Xerces DOM"
here, right?
I keep hearing that and I'm willing to believe it but I have yet to see
any metrics that show this is true. And it is certainly not true in all
use cases. For one thing you can traverse a Xerces DOM tree as many
times as you want without creating any new object. JDOM creates a new
iterator on every node you traverse, and every time you traverse it.

So be careful when making such statements. Unless you make them very
specific they generally are wrong.
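
As an illustration of the traversal point, here is a minimal sketch of
walking a DOM tree through the standard org.w3c.dom.Node links
(getFirstChild / getNextSibling) with no iterator objects in the calling
code.  The class and method names are hypothetical, it uses whatever JAXP
parser the JDK supplies rather than Xerces specifically, and whether an
implementation allocates anything behind these calls is up to that
implementation.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical sketch: count elements using only the child/sibling links of the tree.
public class DomWalkSketch {
    public static int countElements(Node node) {
        int count = node.getNodeType() == Node.ELEMENT_NODE ? 1 : 0;
        // Follow the links already present in the tree; no iterator or list is created here.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            count += countElements(child);
        }
        return count;
    }

    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<a><b/><c><d/></c></a>");
        // The same tree can be walked repeatedly with the same result.
        System.out.println(countElements(doc));
        System.out.println(countElements(doc));
    }
}
```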

> which is
> nice, it still does not meet our requirements listed above.  An additional
> issue raised internally at IBM was that JDOM is nowhere near being a
> standard yet.  (As some of you may know, the current Axis codebase uses
> JDOM for its message processing).  We've all pretty much decided already
> that JDOM should be removed from the core and should be replaced with a
> lightweight XML parser that meets the requirements.

Again, I don't understand what "JDOM [...] should be replaced with a
lightweight XML parser" means. How can you replace a set of classes to
represent an XML document in memory with a parser? Please, rephrase.
-- 
Arnaud  Le Hors - IBM Cupertino, XML Strategy Group

Re: The Great Debate: Xml Parsers

Posted by Ted Leung <tw...@sauria.com>.

----- Original Message -----
From: "Sam Ruby" <ru...@us.ibm.com>
To: <xe...@xml.apache.org>
Sent: Wednesday, March 21, 2001 1:26 PM
Subject: The Great Debate: Xml Parsers


> Cross posting to xerces-j-dev.
>
> - Sam Ruby
>
> ---------------------- Forwarded by Sam Ruby/Raleigh/IBM on 03/21/2001 02:36 PM ---------------------------
>
> James M Snell/Fresno/IBM@IBMUS on 03/21/2001 12:25:20 PM
>
> Please respond to axis-dev@xml.apache.org
>
> To:   axis-dev@xml.apache.org
> cc:   xerces-dev@xml.apache.org
> Subject:  The Great Debate: Xml Parsers
>
>
>
> All,
>
> (I'm cross-posting this to the Xerces-dev list so our friends on the
> parser-side of things can follow along and join in)
>
> As many of you know, we've had discussions in the past about which Xml
> Parser to use as the core of the Axis message processing API.  Throughout
> the course of this discussion, we've touched on several issues that have
> become core requirements of Axis and need to drive our decision.  These
> requirements are:
>
>    1  Axis must not force the entire message object model to be in memory
> at one time.  In other words, DOM is out.

Seems to me that JDOM should be out on this count also.

>    2  Axis must be very fast and very scalable in order to be widely
> adopted over other Web Service implementation platforms

You couldn't be more right.

>    3  We must be able to independently parse individual elements of the
> message either as raw bits, SAX, the Axis defined Message API, DOM or
> whatever else the user wants.

Why?

>    4  We must be able to fully support SOAP semantics (i.e. multiref
> elements, id/href, etc) without an overly negative impact on performance
> (see number 1 and 2)
>
> We've looked at Xerces, we've looked at JDOM, and most recently I've been
> doing some work with a new Xml Pull Parser developed originally by
> Aleksander Slominski as part of a research project for Indiana Univ. Below
> is a basic summary of our thoughts thus far:
>
> Xerces 1.x ->  Our concern with Xerces 1.x DOM is that it is slow, huge,
> and complicated.  These are the standard complaints with DOM that we've
> all heard (note to the Xerces guys:  I eagerly await the release of
> Xerces2 ! :-) ....)  It just won't scale well in the types of environments
> that we foresee Axis being deployed in (which include limited capacity
> devices such as handhelds, in which case it probably wouldn't work at all
> due simply to its size).
>
> We also looked at SAX as an alternative but quickly determined that SAX
> just was not adequate for proper SOAP processing that also met the
> requirements mentioned above.  (for those of you who weren't part of that
> discussion, I will not rehash it here, ping me later and I'll give you the
> rundown).

I'd like to know why this is.  Especially since you are talking about
building a SAX layer atop XPP below.

> JDOM -> While JDOM is smaller and faster than Xerces and DOM, which is
> nice, it still does not meet our requirements listed above.  An additional
> issue raised internally at IBM was that JDOM is nowhere near being a
> standard yet.  (As some of you may know, the current Axis codebase uses
> JDOM for its message processing).  We've all pretty much decided already
> that JDOM should be removed from the core and should be replaced with a
> lightweight XML parser that meets the requirements.
>
> Xml Pull Parser (XPP) -> XPP is a lightweight (23k) pull parser that is
> completely namespace aware and XML 1.0 compliant.  Its interface needs
> quite a bit of work so I've been working with the author on getting it
> cleaned up.  XPP has two advantages: 1. it's small, 2. it's fast.  The
> parser was originally implemented as part of a research project comparing
> the performance of various parsers in relation to SOAP-deserialization.
> I'll have to try to dig up the results of their tests again, but XPP
> outperformed nearly everything else available.   XPP would meet each of
> our requirements once the interface redesign is complete.  This interface
> redesign includes building a SAX layer over the parser's primary
> interface.
>
> Now, here's what we need to decide:
>
> Which is more important: Performance/Scalability or Standards support?

PERFORMANCE  -- It's already bad enough that you're trying to do RPC-like
things with text files.  VCs aren't dropping out of the sky to buy kids
E10Ks or S80s any more.

> From earlier decisions, I believe that we have agreed that performance and
> scalability in the case of Axis far outweigh standards support within the
> core engine itself as long as there are hooks specifically designed into
> the engine that allow full standards support if the developer wishes it.
> Thus the reason we were going to provide our own Axis Message API with
> hooks for optionally processing the message with SAX or DOM.  (i.e. if the
> developer wants to tank their performance by using DOM, so be it)
>
> I would like to invite the Xerces guys to join this discussion so that we
> may figure out how to resolve this issue.  I understand now that Xerces 2
> includes a Pull Parser interface of its own along with a low level
> interface that enables modularization, but many of us here either haven't
> heard of it yet or aren't quite sure what it could mean for Axis.  Could
> anybody on the Xerces team explain this in greater depth for us?

Actually Xerces 1 contains a pull parser interface as well, but it's poorly
documented and mostly used internally.  If getting the "product out" is the
key, then neither this API nor its descendant API in Xerces2 is for you.

However, Axis is an xml.apache.org project, as is Xerces.  It seems
perfectly reasonable to me that you guys push requirements on us, just as
Scott and the Xalan developers have done (and should continue to do).  I
would like to see us engage in a vigorous and public discussion of your
requirements and why Xerces is not suitable.  It's a known fact/bug that
Xerces 1 performance on small documents is poor.  It's also true that very
little effort has been expended on rectifying that.  So far the only real
requirement that I can see coming from Axis is that we give you good
performance on small documents.  Am I missing something?
In my book it's okay if in the short term Axis has to use XPP, but in the
long term, both projects should be trying to find a way to make the ASF
SOAP a truly ASF stack.

FYI, I posted a SOAP-related performance study to xerces-j-dev within the
last few weeks.
I'm glad to see you guys coming to the party.  Especially since you are the
ones who are going to keep us from getting Hailstormed.

> - James Snell
>      Software Engineer, Emerging Technologies, IBM
>      jasnell@us.ibm.com (online)
>      jsnell@lemoorenet.com (offline)
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: xerces-j-dev-unsubscribe@xml.apache.org
> For additional commands, e-mail: xerces-j-dev-help@xml.apache.org
>
