Posted to j-dev@xerces.apache.org by Brad O'Hearne <ca...@megapathdsl.net> on 2001/03/16 15:58:28 UTC

RE: [Xerces2] Performance Comments and Other Ramblings

Andy,

I am really liking what I am reading -- and kudos to you for your work thus
far.  I am still not ready to give up on performance, however.  In fact,
one paragraph below caught my attention in this regard:

Andy Clark wrote:
Don't need to perform well-formedness checking? Fine.
Write a bare-bones XML scanner and use it as the scanner
in a custom configuration. (This capability hints at my
next example.)

I just wanted to clarify something here.  First off, if well-formedness (WF)
checking doesn't exist -- you don't have an XML parser.  We could go on and
talk about this further, and about how in theory eliminating WF essentially
means conforming to no rules, but it isn't my main point.  What it suggests,
however, is that the existing design for the X2 scanner is separated from
well-formedness checking, and that well-formedness checking is happening
after the fact.  Is this correct?  If not, then discard the rest of this,
but if so, I see a couple of issues here:

1) I think we could do better with performance by in-lining WF with
scanning.  Supporting WF cannot be separated from supporting XML; the two
are one and the same.
2) If someone is going to provide a custom scanner, then they should by
definition provide WF checking within that scanner.  If they choose not to,
well, that's their business, but it should be their responsibility, and they
shouldn't be required (they may not be now, but I mention this just in case)
to use our WF framework, or be required to abide by our WF mechanism.  The
thing I am trying to avoid here is a situation where we say, "you can
provide your own custom scanner, and custom WF checker, but they must both
abide by these interfaces."  While we can do this with other parts of X2 --
I don't think we can do it here -- because someone else's approach to WF
checking might need to be entirely different than an interface we provide.
In this case, our modular design would actually create less flexibility:
it is more flexible in plugging in custom pieces, but less flexible in
practice because we have required our user to design their scanner and WF
checker according to our design.  In my opinion, scanner and WF checker
should be requirements of the same interface.  Otherwise we impose our
approach on these custom components -- and this is one place we don't want
to do that.

I apologize for that discussion if I took your quote a little too far, and
this discussion was for naught (I hope it is for naught -- then there is no
issue).  But in some places in our engine, (the scanner being one of them in
my opinion) it is good not to split the pipeline into modules that are too
granular -- otherwise our "custom component" offering becomes little more
than offering our users the ability to implement things exactly how we
have -- which in practice makes the "interchangeable" nature of the pipeline
components a relatively uninteresting feature IMHO.

BradO

-----Original Message-----
From: Andy Clark [mailto:andyc@apache.org]
Sent: Friday, March 16, 2001 6:15 PM
To: xerces-j-dev@xml.apache.org
Subject: [Xerces2] Performance Comments and Other Ramblings


A few words need to be said about the current performance
of Xerces2 and what we're expecting from the design and
implementation. I think that knowing what we intend and
what can be done with the Xerces2 parser will ease a lot
of people's concerns over performance.

This is a little long but worth the read, IMHO...

First, Xerces2 *is* currently slower than Xerces 1.x which
is currently slower than XML4J 2.0.15. This is a fact, but
not an unalterable fact. Very little performance tuning
(next to none!) has been done on Xerces2 and there's still
a lot of room for improvement in Xerces 1.x.

Second, I am accepting the fact that Xerces2 *may* never
be as fast as Xerces 1.x. This is due to design decisions
made for the architecture and implementation of Xerces2.
(A prime example was the decision to always transcode the
incoming bytes to Unicode characters up front instead of
deferring the transcoding as is the case in Xerces 1.x.)

However, I don't see this as necessarily a terrible thing
because of all of the benefits that we get from the new
design.

Simplicity

The primary benefit of Xerces2 is the simplicity and
clarity of the new design and implementation. Because of
this, it is easier to maintain and extend, and easier for
other developers in the Xerces community to contribute
to the ongoing development of Xerces.

As I have said before, we have greatly simplified the
parser's code by always transcoding the input. In Xerces
1.x, the transcoding could be deferred until needed by
use of the StringPool. While this made the parser faster,
it also made the entire parser more complex.

As a result of using the StringPool, all components in
the parser needed to have a reference to the StringPool
in order to query the strings by ID; memory management
of the strings was harder; and the parser used more
memory.
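
To make the coupling concrete, here is a minimal sketch of
an ID-based pool that defers transcoding until a string is
requested. It is an illustration only, not the Xerces 1.x
StringPool: the class name StringPoolSketch and its methods
add and stringOf are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the actual Xerces 1.x StringPool API.
class StringPoolSketch {
    private final List<byte[]> raw = new ArrayList<>();     // undecoded bytes per ID
    private final List<String> decoded = new ArrayList<>(); // lazily filled cache
    private final Charset charset;
    int transcodeCount = 0; // how many entries were actually decoded

    StringPoolSketch(Charset charset) {
        this.charset = charset;
    }

    // Store a copy of the raw bytes and hand back an integer handle.
    int add(byte[] buffer, int offset, int length) {
        byte[] copy = new byte[length];
        System.arraycopy(buffer, offset, copy, 0, length);
        raw.add(copy);
        decoded.add(null);
        return raw.size() - 1;
    }

    // Transcode only when a component actually asks for the characters.
    String stringOf(int id) {
        String s = decoded.get(id);
        if (s == null) {
            s = new String(raw.get(id), charset);
            decoded.set(id, s);
            transcodeCount++;
        }
        return s;
    }

    public static void main(String[] args) {
        byte[] input = "<doc>hello</doc>".getBytes(StandardCharsets.UTF_8);
        StringPoolSketch pool = new StringPoolSketch(StandardCharsets.UTF_8);
        int name = pool.add(input, 1, 3); // bytes of "doc"
        int text = pool.add(input, 5, 5); // bytes of "hello", never requested below
        System.out.println(pool.stringOf(name));
        System.out.println(pool.transcodeCount);
    }
}
```

Every component that receives an ID such as name or text
has to hold a reference to the pool to turn it back into
characters, which is the coupling described above.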

By removing the StringPool we have also simplified the
readers and scanners. In Xerces 1.x, a separate reader
implementation was required for each encoding that we
wanted to optimize. This required a lot of duplicated
code and allowed the same bugs to appear separately
based on the specific reader implementation.

Grammar Caching

Next, Xerces2 enables a lot of requested features that
the old parser could not support. For example, Xerces2
is designed so that we can support grammar caching and
access. I believe that this will be one of the most
important features to improve parser performance in
server environments.

In a server environment, the parser needs to process
a large number of small documents. However, most of the
processing time is taken by loading and compiling the
grammar objects to be used for validation. By caching
the grammar objects, the parser never needs to reload
a grammar it has already seen.
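
The caching idea can be sketched as a map from a grammar's
identifier to its compiled form. This is only a sketch
under assumptions: Grammar and GrammarCacheSketch are
hypothetical names, not Xerces2 classes, and the loader
function stands in for the expensive load-and-compile step.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch; Grammar and GrammarCacheSketch are not Xerces2 classes.
class GrammarCacheSketch {
    // Stand-in for a compiled DTD or schema.
    static final class Grammar {
        final String systemId;

        Grammar(String systemId) {
            this.systemId = systemId;
        }
    }

    private final Map<String, Grammar> cache = new ConcurrentHashMap<>();
    private final Function<String, Grammar> loader;

    GrammarCacheSketch(Function<String, Grammar> loader) {
        this.loader = loader;
    }

    // Load and compile on first request; reuse the same object afterwards.
    Grammar get(String systemId) {
        return cache.computeIfAbsent(systemId, loader);
    }

    public static void main(String[] args) {
        AtomicInteger loads = new AtomicInteger();
        GrammarCacheSketch cache = new GrammarCacheSketch(id -> {
            loads.incrementAndGet(); // pretend this is the expensive step
            return new Grammar(id);
        });
        // A server parsing many small documents against the same grammar.
        for (int i = 0; i < 1000; i++) {
            cache.get("order.dtd");
        }
        System.out.println(loads.get());
    }
}
```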

The grammar cache will also enable Xerces2 to support
other features. For example, the application could
restrict the grammars applied, overriding the grammar
contained in (or pointed to by) the instance document.
This is a very useful feature in a server environment
where you cannot trust the document to use the correct
grammar for validation.

Also, it is the intention of the Xerces2 design to
allow grammars to be loaded separately from instance
documents. These loaded grammars can then be applied
to in-memory documents for write-validation. And this
is also important for XML editors that want to use
Xerces.

Configurability

Xerces2 was designed to be the best possible general
purpose parser, not the highest performance parser.
The flexibility of the Xerces2 framework allows users
to create custom configurations that are specifically
tailored to their needs.

Instead of creating the world's fastest single function
parser, we have opted for greater configurability so
that users can optimize specific components while
taking advantage of the other functionality. I have
included some examples to highlight this capability:

Example 1: Server environment exchanging valid
           machine-generated documents.

In this setup, no validation is needed because the
documents are generated to be valid and complete (i.e.
defaulted values are included). In Xerces2, you can
simply create a configuration of the parser that
doesn't include a validator in the pipeline.
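
As a rough illustration of a configuration without a
validator, the sketch below wires a stand-in scanner to
the application either through a validator stage or
directly. All types here (Handler, Filter, Validator,
Recorder) are hypothetical simplifications and are not
the real XNI interfaces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a configurable pipeline; none of these
// types are the real XNI interfaces.
class PipelineSketch {
    interface Handler {
        void startElement(String name);
        void endElement(String name);
    }

    // A filter is a handler that forwards events to the next stage.
    static abstract class Filter implements Handler {
        protected Handler next;
    }

    // Stand-in validator: counts the elements it inspects, then forwards.
    static class Validator extends Filter {
        int checked = 0;

        public void startElement(String name) {
            checked++;
            next.startElement(name);
        }

        public void endElement(String name) {
            next.endElement(name);
        }
    }

    // End of the pipeline: records what reaches the application.
    static class Recorder implements Handler {
        final List<String> events = new ArrayList<>();

        public void startElement(String name) {
            events.add("start:" + name);
        }

        public void endElement(String name) {
            events.add("end:" + name);
        }
    }

    // Chain the filters, in order, in front of the sink; return the first stage.
    static Handler configure(Handler sink, Filter... filters) {
        Handler head = sink;
        for (int i = filters.length - 1; i >= 0; i--) {
            filters[i].next = head;
            head = filters[i];
        }
        return head;
    }

    // Stand-in scanner: emits the events for <doc><item/></doc>.
    static void scan(Handler first) {
        first.startElement("doc");
        first.startElement("item");
        first.endElement("item");
        first.endElement("doc");
    }

    public static void main(String[] args) {
        Validator validator = new Validator();
        Recorder validated = new Recorder();
        scan(configure(validated, validator)); // scanner -> validator -> application

        Recorder unvalidated = new Recorder();
        scan(configure(unvalidated));          // scanner -> application

        System.out.println(validator.checked);
        System.out.println(validated.events.equals(unvalidated.events));
    }
}
```

The application sees the same events either way; the
second configuration simply never pays for the validator
stage.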

Example 2: Server environment exchanging well-formed
           machine-generated documents.

Don't need to perform well-formedness checking? Fine.
Write a bare-bones XML scanner and use it as the scanner
in a custom configuration. (This capability hints at my
next example.)

And if the documents are machine-generated then they're
probably valid as well which means that you don't need a
validator, either. This is by far the best way to get the
best performance from Xerces2.

But if you have to write your own scanner anyway, what
does using Xerces2 buy you? A lot, actually, because it's
*everything* else that you're likely to need! Examples:
validator; DOM tree generator (what we call DOMParser);
SAX event generator (what we call SAXParser); grammar
caching; etc...

By writing custom components that conform to XNI and
that follow the Xerces2 parser configuration programming
guidelines (which have yet to be written -- anyone want
to start pulling this together?), you can write parser
configurations that seamlessly integrate custom features
while taking advantage of all of the default components
and functionality that is part of Xerces2. I think that
this is incredibly powerful.

Example 3: HTML parser.

Many people have asked if Xerces can parse HTML. And the
simple answer is "no". But that doesn't mean that it
couldn't! To me, XML is the document that is presented
to the application via some API; whereas I do *not*
believe that XML is a set of crazy angle brackets. If
something can "appear" as an XML stream of events using
the XNI framework, then it is effectively XML as far as
the application is concerned.

In other words, I can see Xerces2 being used as the
basis for an HTML parser. Two components are required
(although they could be combined): an HTML scanner and
a tag balancer. The scanner would break apart the HTML
file, producing raw, ill-formed XNI callbacks. Since
HTML documents are notorious for missing tags and such,
the tag balancer would be there to correct these
problems.

By plugging in the new components at the back end of
the pipeline, you can emit good XML events to populate
a DOM tree, emit SAX events, or do whatever you want.
And the beauty is that you didn't have to write the
rest of the parser framework.
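
A tag balancer along these lines could be sketched with a
stack of open elements. This is a deliberately simple
assumption about how such a component might work: it
closes elements left open and drops stray end tags, but
it knows nothing about HTML-specific rules such as empty
elements or implied end tags. The class and method names
are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of a tag balancer; not a Xerces2 component.
class TagBalancerSketch {
    private final Deque<String> open = new ArrayDeque<>(); // currently open elements
    private final List<String> out = new ArrayList<>();    // balanced output events

    void startElement(String name) {
        open.push(name);
        out.add("<" + name + ">");
    }

    void endElement(String name) {
        if (!open.contains(name)) {
            return; // stray end tag with no matching start: drop it
        }
        // Close anything left open inside the element being ended.
        String top;
        do {
            top = open.pop();
            out.add("</" + top + ">");
        } while (!top.equals(name));
    }

    // Close whatever is still open when the input runs out.
    void endDocument() {
        while (!open.isEmpty()) {
            out.add("</" + open.pop() + ">");
        }
    }

    List<String> events() {
        return out;
    }

    public static void main(String[] args) {
        TagBalancerSketch b = new TagBalancerSketch();
        b.startElement("html");
        b.startElement("body");
        b.startElement("b");
        b.startElement("i");
        b.endElement("b"); // <i> was never closed
        b.endElement("p"); // <p> was never opened
        b.endDocument();   // <body> and <html> were never closed
        System.out.println(String.join("", b.events()));
        // prints <html><body><b><i></i></b></body></html>
    }
}
```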

Example 4: XInclude

You want to perform XInclude embedding of documents
before they get validated? Simple. Write an XInclude
component and insert it into the pipeline after the
scanner but before the validator.
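
The placement matters because the validator should see
the included content rather than the include element
itself. The sketch below shows that ordering with heavily
simplified, hypothetical types: events are reduced to an
element name plus an optional href, included fragments
come from an in-memory map, and namespaces and the rest
of XInclude processing are ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not a real XInclude implementation or XNI component.
class IncludeFilterSketch {
    interface Handler {
        void element(String name, String href);
    }

    // Replaces an "include" element with the elements of the referenced fragment.
    static class IncludeFilter implements Handler {
        private final Map<String, List<String>> fragments;
        private final Handler next;

        IncludeFilter(Map<String, List<String>> fragments, Handler next) {
            this.fragments = fragments;
            this.next = next;
        }

        public void element(String name, String href) {
            if (name.equals("include") && fragments.containsKey(href)) {
                for (String included : fragments.get(href)) {
                    next.element(included, null);
                }
            } else {
                next.element(name, href);
            }
        }
    }

    // Stand-in validator: records every element name it is asked to check.
    static class RecordingValidator implements Handler {
        final List<String> seen = new ArrayList<>();

        public void element(String name, String href) {
            seen.add(name);
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> fragments =
                Collections.singletonMap("part.xml", Arrays.asList("title", "para"));
        RecordingValidator validator = new RecordingValidator();
        // scanner -> include filter -> validator
        Handler first = new IncludeFilter(fragments, validator);
        first.element("doc", null);
        first.element("include", "part.xml");
        System.out.println(validator.seen); // prints [doc, title, para]
    }
}
```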

Example 5: Binding XML into Java objects.

So far I've only been talking about replacing single
components in the pipeline. But it is also possible to
use the standard pipeline to produce other output not
restricted to DOM and SAX.

Xerces2 is written such that it is easy to do this.
So it would be trivial to create a parser that builds
a JDOM tree from the XNI callbacks at the tail-end of
the pipeline. But what's more interesting is to have
a processor that would bind the events directly into
Java objects. There'll be a flavor coming out of the
JSR but I'm sure that there will be other flavors as
well.
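
As a small illustration of binding events directly into
Java objects, the sketch below uses the standard SAX API
as a stand-in for XNI callbacks. The Order class, its
fields, and the sample document are all hypothetical;
only the javax.xml.parsers and org.xml.sax calls are real
library API.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Sketch only: SAX stands in for XNI, and Order is a made-up application class.
class BindingSketch {
    static class Order {
        String id;
        String customer;
    }

    // Fills in an Order as the events arrive; no tree is ever built.
    static class OrderBinder extends DefaultHandler {
        final Order order = new Order();
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            text.setLength(0);
            if (qName.equals("order")) {
                order.id = atts.getValue("id");
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("customer")) {
                order.customer = text.toString().trim();
            }
        }
    }

    static Order bind(String xml) {
        try {
            OrderBinder binder = new OrderBinder();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), binder);
            return binder.order;
        } catch (Exception e) {
            throw new IllegalStateException("could not bind document", e);
        }
    }

    public static void main(String[] args) {
        Order o = bind("<order id=\"42\"><customer>Acme</customer></order>");
        System.out.println(o.id + " " + o.customer); // prints 42 Acme
    }
}
```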

Library of Parser Components and Configurations

Instead of writing a parser only good at one thing,
writing a general purpose parser allows us to create
a library of useful parser components and parser
configurations. [Tool idea: we could write a tool that
takes user specs and automatically generates the code
and the package for a custom configuration.]

This collection can grow over time as we find time
to develop new components and/or as people donate new custom
configurations. So eventually we'll have an entire
family of XML parsers.

Conclusion

Xerces2 is really cool. :) Seriously, though, I want
to make the point that straight performance metrics
on the *default* parser configuration in Xerces2
should not solely drive design and implementation. If
you're writing a performance critical application then
you'll probably want something specific that can't be
accommodated by existing parsers, anyway.

So instead of writing an entire parser yourself (which
I don't recommend) or even bending an existing parser to
your will, Xerces2 can be used to build the parser you
need from a mix of existing components and custom
components.

Thoughts?

--
Andy Clark * IBM, TRL - Japan * andyc@apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-dev-help@xml.apache.org



RE: [Xerces2] Performance Comments and Other Ramblings

Posted by Brad O'Hearne <ca...@megapathdsl.net>.

Misprint below -- what I meant to say was:

"Understand that I was never under the impression that X2 *was not* fully
compliant with XML 1.0..."

BradO

-----Original Message-----
From: Brad O'Hearne [mailto:cabodog@megapathdsl.net]
Sent: Monday, March 19, 2001 7:22 AM
To: xerces-j-dev@xml.apache.org
Subject: RE: [Xerces2] Performance Comments and Other Ramblings

Andy,

Point taken.  Understand that I was never under the impression that X2 fully
compliant with XML 1.0 -- that wasn't my concern.  My concern is more so if
well-formedness checking is separated from parsing in such a manner that in
order to plug in a custom parser, you are constrained to separate the
well-formedness checking in like manner (to conform to some interface).  I
will go look through the XNI classes and see if I can tell.

Thanks for your response.

BradO

-----Original Message-----
From: Andy Clark [mailto:andyc@apache.org]
Sent: Monday, March 19, 2001 1:59 PM
To: xerces-j-dev@xml.apache.org
Subject: Re: [Xerces2] Performance Comments and Other Ramblings

Brad O'Hearne wrote:
> I just wanted to clarify something here.  First off, if well-formedness (WF)
> checking doesn't exist -- you don't have an XML parser.  We could go on and

I want to be perfectly clear so I'll keep this message short. :)

Xerces2 IS FULLY COMPLIANT WITH THE XML SPECIFICATION. Which
means that it checks well-formedness constraints, content model
validation, etc. In short, XNI is the API whereas Xerces2 is
the fully compliant, reference implementation of that API.

However, if the XML document is machine generated to be well-
formed, then you don't need to do that check while scanning.
So someone could write a custom, albeit non-conformant, scanner
for better performance. However, Xerces2 *does* do the well-
formedness checks in the scanner.

--
Andy Clark * IBM, TRL - Japan * andyc@apache.org


RE: [Xerces2] Performance Comments and Other Ramblings

Posted by Brad O'Hearne <ca...@megapathdsl.net>.

Andy,

Point taken.  Understand that I was never under the impression that X2 fully
compliant with XML 1.0 -- that wasn't my concern.  My concern is more so if
well-formedness checking is separated from parsing in such a manner that in
order to plug in a custom parser, you are constrained to separate the
well-formedness checking in like manner (to conform to some interface).  I
will go look through the XNI classes and see if I can tell.

Thanks for your response.

BradO


Re: [Xerces2] Performance Comments and Other Ramblings

Posted by Andy Clark <an...@apache.org>.

Brad O'Hearne wrote:
> I just wanted to clarify something here.  First off, if well-formedness (WF)
> checking doesn't exist -- you don't have an XML parser.  We could go on and

I want to be perfectly clear so I'll keep this message short. :)

Xerces2 IS FULLY COMPLIANT WITH THE XML SPECIFICATION. Which 
means that it checks well-formedness constraints, content model
validation, etc. In short, XNI is the API whereas Xerces2 is
the fully compliant, reference implementation of that API.

However, if the XML document is machine generated to be well-
formed, then you don't need to do that check while scanning.
So someone could write a custom, albeit non-conformant, scanner
for better performance. However, Xerces2 *does* do the well-
formedness checks in the scanner.

-- 
Andy Clark * IBM, TRL - Japan * andyc@apache.org
