Posted to j-users@xerces.apache.org by "Anderson, John" <Jo...@Barbadosoft.com> on 2002/04/16 14:43:18 UTC
RE: About performance

I've encountered this performance degradation as well, and I am wondering if
there are any suggestions for optimization. We are using Xerces to process DTDs
and build an abstract model of them. With large DTDs (for example, DocBook), we
always get an Out of Memory error. This doesn't occur with Xerces 1.4.

The problem is that we also need some of the features of Xerces 2.0, in
particular the ability to set external schemas. I'd be happy to switch parsers,
but I don't quite know how I should do this.

John


_______________________________________________________
John Anderson
CTO BarbadosoftTM 
The XML Management Company
+31 (0)20 750 7582 / +31 (0)6 55 347 448 / www.barbadosoft.com

 


-----Original Message-----
From: Andy Clark [mailto:andyc@apache.org]
Sent: 11 March 2002 14:03
To: xerces-j-user@xml.apache.org
Subject: Re: About performance


Yonghui Chen wrote:
> I have just tried 3 parsers, Xerces Java 1, Xerces Java 2 and Crimson,
> using both the DOM and SAX parsers to parse a 960 KB XML file 10 times.
> The time costs are:

What are you using to achieve your results? Are you only
starting a VM, parsing a single document, showing the time,
and exiting the VM? This is not a fair comparison and does
not match real-world use of the parser.
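
A minimal Java sketch of the kind of measurement described above: parse
the same document many times inside one VM, discard a few warm-up passes,
and average the rest. It uses only the JAXP API that ships with the JDK.
The class name ParseTimer, the generated sample document, and the run
counts are assumptions made for illustration; they are not part of the
Xerces samples.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class ParseTimer {

    /** Builds a small synthetic document (hypothetical test data). */
    public static byte[] sampleDocument(int items) {
        StringBuilder sb = new StringBuilder("<root>");
        for (int i = 0; i < items; i++) {
            sb.append("<item id=\"").append(i).append("\">text</item>");
        }
        sb.append("</root>");
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Parses the same bytes `runs` times; returns the average nanoseconds per parse. */
    public static long averageNanos(SAXParser parser, byte[] xml, int runs) throws Exception {
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            // A handler that ignores every event, so only parsing is measured.
            parser.parse(new ByteArrayInputStream(xml), new DefaultHandler());
        }
        return (System.nanoTime() - start) / runs;
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = sampleDocument(20000);
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();

        // Warm-up passes: class loading and JIT compilation happen here
        // and are deliberately left out of the reported figure.
        averageNanos(parser, xml, 20);

        long avg = averageNanos(parser, xml, 10);
        System.out.println("average per parse: " + (avg / 1000) + " microseconds");
    }
}
```

Timing a single parse in a fresh VM mostly measures start-up cost, which
is why the warm-up passes are excluded here.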

The sax.Counter, dom.Counter, and xni.Counter samples that
come with Xerces2 are very convenient and can provide you
a "poor man's" performance test. The xni.Counter is the
one I use and I'll explain why.

Xerces2 is designed around the new Xerces Native Interface
(XNI) which allows us to more easily create new types of
parsers and re-use the same code to generate DOM trees,
emit SAX events, etc. The default parser configuration does
everything: full-fledged scanning of XML documents, DTD
validation, namespace binding, XML Schema validation, etc.

Depending on your needs, however, you can play tricks with
the parser configuration. For example, if you know that the
documents are generated and therefore are always well-formed
and valid, then you do not need to perform validation. So
the validation components can be removed from the pipeline
to improve performance.
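
As a rough illustration, validation can also be switched off through the
standard JAXP factory. Note that this only toggles features on the default
configuration; it is not the same as removing the validation components
from the XNI pipeline, which requires a different parser configuration.
The class name ValidationToggle and the sample document are hypothetical.
The load-external-dtd feature URI is specific to Xerces and Xerces-derived
parsers, so the sketch tolerates parsers that do not recognize it.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ValidationToggle {

    // Xerces-specific feature URI; other JAXP parsers may not recognize it.
    static final String LOAD_EXTERNAL_DTD =
        "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    /** Counts element start tags, with DTD validation switched on or off. */
    public static int countElements(String xml, boolean validate) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(validate);
        if (!validate) {
            try {
                // Also skip fetching an external DTD when not validating.
                factory.setFeature(LOAD_EXTERNAL_DTD, false);
            } catch (Exception e) {
                // Feature not recognized: continue with validation simply off.
            }
        }
        SAXParser parser = factory.newSAXParser();
        final int[] count = {0};
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                count[0]++;
            }
        });
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE a [<!ELEMENT a (b*)><!ELEMENT b EMPTY>]>"
                   + "<a><b/><b/></a>";
        System.out.println(countElements(xml, true));   // prints 3
        System.out.println(countElements(xml, false));  // prints 3
    }
}
```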

Getting back to my point...

The xni.Counter sample (as well as the other XNI samples)
allows you to set the parser configuration by name so that
you can easily test new parser configurations. There is
an XNI sample included that creates a non-validating
parser configuration. You can use this with the xni.Counter
sample to see how much performance can be gained by not
validating every document.

This is just one way to achieve better performance. If
you *need* validation, then you must find another way to
improve performance. I will say a few words on this
issue.

First, in some areas Xerces2 will never be as fast as
Xerces 1.x. In particular, we made the decision in the
Xerces2 implementation to always transcode the document
(i.e. changing the bytes of the document into Java chars).
The old parser would defer this work until needed but
this created a situation where we had duplicated code
which introduced the possibility of more bugs. Also,
deferring the conversion of the underlying bytes was an
issue in terms of memory usage.
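
To make "transcoding" concrete, here is a small Java sketch of decoding a
document's bytes into Java chars up front, using only the JDK. It is an
illustration of the general idea and not the Xerces2 scanner code; the
class name TranscodeSketch, the buffer size, and the fixed UTF-8 encoding
are assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class TranscodeSketch {

    /** Decodes UTF-8 bytes into chars through a fixed buffer; returns the char count. */
    public static int countChars(byte[] bytes) throws IOException {
        Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8);
        char[] buffer = new char[2048];
        int total = 0;
        int n;
        // Every byte is converted here, whether or not the text is used later.
        while ((n = reader.read(buffer)) != -1) {
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        System.out.println(bytes.length);       // prints 6
        System.out.println(countChars(bytes));  // prints 5
    }
}
```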

Also, Xerces2 has much better support for the various
standards and other features than its predecessor. You
can't do more work in less time so this is one reason
why Xerces2 may appear initially slower. However, we
believe that the inherent modularity of the system is
better in the long run for continued maintenance and
extension of the parser to add new features in the
future.

Lastly, we have not done serious performance tuning on
the new Xerces2 codebase. So we know that this is an
area in particular that we can definitely improve in
subsequent releases. We want to make the parser faster
and better but the standard parser configuration may
not match Xerces 1.x for larger documents. Xerces 1.x
was heavily optimized but not very flexible so we are
accepting a slight performance hit in certain areas.

But please hang in there -- it will get better! :)

-- 
Andy Clark * andyc@apache.org
