Posted to j-dev@xerces.apache.org by Andy Clark <an...@apache.org> on 2000/10/20 04:21:52 UTC
[Xerces2] Update Report
It's time for an update on the Xerces2 concept implementation.
I haven't had much time to write another batch of documentation
but I *have* been adding a ton of javadoc comments all over the
place in an attempt to make parts of the implementation "self
documenting". Of course this doesn't replace real docs which
should and will be written in the future.
Work is progressing nicely on the Xerces2 concept codebase.
Every day more functionality is implemented and soon we'll have
a fully functioning DTD-capable parser! :) This update will be
broken down into functional units.
Entity Management
The entity manager and entity scanner have improved greatly
since Milestone 1. The entity scanner is now fully implemented
and handles newline normalization and line/column counting.
The entity manager is also able to handle all forms of
entities that are referenced in documents and DTDs. Because
the work of switching entities is hidden behind the entity
scanner, managing entities and scanning their contents is
completely transparent to the document and DTD scanners.
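To make the newline normalization and line/column counting concrete, here is a minimal sketch of the idea. It is not the actual Xerces2 entity scanner: the class name LineTrackingScanner and its methods are hypothetical, and it reads from an in-memory string rather than from a stack of entities.

```java
// Hypothetical sketch, not the Xerces2 entity scanner itself.
final class LineTrackingScanner {
    private final char[] buf;
    private int pos = 0;
    private int line = 1;
    private int column = 1;

    LineTrackingScanner(String text) {
        this.buf = text.toCharArray();
    }

    /**
     * Returns the next character with "\r\n" and a lone "\r" normalized
     * to "\n", or -1 at the end of the input.
     */
    int next() {
        if (pos >= buf.length) {
            return -1;
        }
        char c = buf[pos++];
        if (c == '\r') {
            // Swallow the '\n' of a "\r\n" pair so the pair is reported once.
            if (pos < buf.length && buf[pos] == '\n') {
                pos++;
            }
            c = '\n';
        }
        if (c == '\n') {
            line++;
            column = 1;
        } else {
            column++;
        }
        return c;
    }

    /** Line of the next character to be read (1-based). */
    int line() { return line; }

    /** Column of the next character to be read (1-based). */
    int column() { return column; }
}
```

Because callers only ever see '\n', the document and DTD scanners in a design like this never need to know which line-ending convention the underlying entity used.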
Performance
Work was done to improve the performance of Xerces2. While
there is still a lot of performance work that can be done, the
speed is comparable to Xerces 1.x. The major speed increase
from the previous Milestone was due to three major factors:
1) Object creation was minimized. I don't care how tricky
any particular VM is, it takes time to create and destroy
objects. I made a 20% performance gain in *all* VMs by
re-using some objects instead of creating new ones.
2) Custom, optimized UTF-8 reader class. Unfortunately, the
UTF-8 reader supplied with Java isn't very good. The new
reader still needs a lot of testing, though, and doesn't
handle the byte-order-mark (BOM). How can a UTF-8 file
have a BOM, you say? Well, it can and some generated XML
files actually have them!
3) Simplified the inner loop of the scanContent method in
the entity scanner by doing the newline normalization
out of the loop and performing a bit test to see if there
was nothing "special" about the characters in the loop. I
thought that I could remove the bounds check as well but
that didn't yield any performance gain so I left that change out.
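The inner-loop change in item 3 can be sketched roughly as follows. This is only an illustration under assumptions: the class name, the flag layout, and the choice of which characters count as "special" are all hypothetical and not taken from the Xerces2 source. Newlines are flagged as special so that normalization happens outside the loop.

```java
// Hypothetical sketch of a table-driven inner loop; names and flags are assumptions.
final class ContentScanSketch {
    private static final int SPECIAL = 0x01;

    // One flag byte per UTF-16 code unit (64K entries).
    private static final byte[] FLAGS = new byte[1 << 16];

    static {
        // Characters that end a run of plain content in this sketch.
        for (char c : new char[] {'<', '&', ']', '\n', '\r'}) {
            FLAGS[c] |= SPECIAL;
        }
    }

    /**
     * Returns the index of the first special character at or after start,
     * or end if the whole range is plain content.
     */
    static int scanPlainRun(char[] buf, int start, int end) {
        int i = start;
        // The bounds check stays in; the per-character work is one table
        // lookup and one bit test.
        while (i < end && (FLAGS[buf[i]] & SPECIAL) == 0) {
            i++;
        }
        return i;
    }
}
```

The caller would then handle whatever special character stopped the run (markup, a reference, or a newline to normalize) and resume scanning after it.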
The memory performance was helped by re-using objects. This
made the parser stop consuming new memory completely at
about 330K when using SAX. Neat! :) Considering that the
java.net.URL#openStream method creates nearly 1700 objects
alone and that we use a 64K array to store bit flags to
speed up character tests, this is pretty good!
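As an illustration of the object re-use idea, here is a minimal sketch in which a scanner allocates one buffer object up front and hands the same instance to every callback. All of the names (ReusableCharSpan, SpanHandler, ReusingScanner) are hypothetical and are not the classes used in the Xerces2 codebase.

```java
// Hypothetical sketch of object re-use; none of these types are Xerces2 classes.
final class ReusableCharSpan {
    char[] chars = new char[32];
    int length;

    /** Overwrites the contents, growing the backing array only when needed. */
    void set(char[] src, int offset, int count) {
        if (count > chars.length) {
            chars = new char[Math.max(count, chars.length * 2)];
        }
        System.arraycopy(src, offset, chars, 0, count);
        length = count;
    }

    @Override
    public String toString() {
        return new String(chars, 0, length);
    }
}

interface SpanHandler {
    void characters(ReusableCharSpan span);
}

final class ReusingScanner {
    // Allocated once and passed to every callback.
    private final ReusableCharSpan span = new ReusableCharSpan();

    /** Reports each space-separated token through the same span instance. */
    void scan(char[] input, SpanHandler handler) {
        int start = 0;
        for (int i = 0; i <= input.length; i++) {
            if (i == input.length || input[i] == ' ') {
                if (i > start) {
                    span.set(input, start, i - start);
                    handler.characters(span);
                }
                start = i + 1;
            }
        }
    }
}
```

The trade-off of a design like this is that a handler must copy the contents (for example with toString()) if it wants to keep them after the callback returns, since the next callback overwrites the same object.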
Document Scanning
The document scanner was pretty much in place before but it
has improved a little bit and now calls into the DTD scanner
for both the internal and external subset of the DTD. In
addition, entity references now work by calling into the
entity manager.
However, there are still some gaps in the document scanner
implementation. It can handle simple documents but needs to
be revisited to complete the implementation.
DTD Scanning
A ton of work went into the DTD scanner, and it is nearing
completion. It even handles all of those tricky, twisty
parameter entities. Cool! We'll have a better estimate on
how complete both the document and DTD scanner are once we
hook this thing up to the compliance suite and start
running files through it.
The DTD scanner makes all of the appropriate callbacks for
the DTD information. The validator component sits between the
scanner and parser instance. It acts as a tee that passes the
events onward as well as calling into the DTDGrammar object
which populates itself from the callbacks.
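The "tee" arrangement described above might look something like the following sketch. The interface and class names are hypothetical stand-ins, and the single elementDecl callback is an assumption chosen for brevity; the real handler interfaces are not shown in this report.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical callback interface standing in for the DTD callbacks. */
interface DtdEvents {
    void elementDecl(String name, String contentModel);
}

/** Stand-in for a grammar object that populates itself from the callbacks. */
final class GrammarSketch implements DtdEvents {
    final Map<String, String> contentModels = new LinkedHashMap<>();

    @Override
    public void elementDecl(String name, String contentModel) {
        contentModels.put(name, contentModel);
    }
}

/** Sits between the scanner and the parser: records each event, then forwards it. */
final class TeeValidatorSketch implements DtdEvents {
    private final DtdEvents grammar;
    private final DtdEvents downstream;

    TeeValidatorSketch(DtdEvents grammar, DtdEvents downstream) {
        this.grammar = grammar;
        this.downstream = downstream;
    }

    @Override
    public void elementDecl(String name, String contentModel) {
        grammar.elementDecl(name, contentModel);    // let the grammar populate itself
        downstream.elementDecl(name, contentModel); // pass the event onward unchanged
    }
}
```

In this sketch the scanner only ever talks to one handler, so neither the scanner nor the downstream parser needs to know that a grammar is being built along the way.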
Validation
The validation code is almost completely moved over from
Xerces 1.x. The only thing remaining is to plug it in and
see if it starts validating the document! I know that I'm
over-simplifying the status of the validation but it should
be "real soon now".
API Support
We already had SAX1 and SAX2 support. Now we also have DOM
support but not at the same level that we did in version 1.x
of Xerces. The deferred DOM is currently out of commission and
*may* return in the future. The DOM parser simply builds a
DOM tree programmatically from the Xerces DOM implementation.
Supporting some of the older features is still questionable
and open to discussion.
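For readers unfamiliar with building a DOM tree programmatically from parser callbacks, here is a rough sketch of the general technique. It is not the Xerces2 DOM parser: the class name and callback methods are hypothetical, and it obtains its Document through the standard JAXP DocumentBuilderFactory rather than directly from the Xerces DOM implementation.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical sketch: builds a non-deferred DOM tree from simple callbacks.
final class DomBuilderSketch {
    private final Document document;
    private Node current;

    DomBuilderSketch() {
        try {
            document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM implementation available", e);
        }
        current = document;
    }

    void startElement(String name) {
        Element element = document.createElement(name);
        current.appendChild(element);
        current = element; // children are appended here until endElement()
    }

    void characters(String text) {
        current.appendChild(document.createTextNode(text));
    }

    void endElement() {
        current = current.getParentNode();
    }

    Document document() {
        return document;
    }
}
```

Every node is created eagerly as its callback arrives, which is the straightforward alternative to a deferred DOM that materializes nodes on demand.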
All kinds of different APIs can be supported by Xerces2 by
using the XNI (Xerces Native Interface) callbacks. For a
really good example of all of the callbacks that are available
as well as to see the parser in action, check out the
xni.DocumentTracer sample in the samples/xni/ directory.
Anything Else?
I could really only go into detail about the parts that I
worked on or knew something about, so if I missed anything,
please post your own progress reports to fill in the gaps.
The current implementation is available from CVS. Check out my
web page for instructions on how to check it out from the
repository:
http://www.apache.org/~andyc/xerces2/
Lastly, we had a really good discussion with people at the
Apache Xerces2 Workshop hosted by Dirk at Covalent. There should
be a report of those proceedings posted soon. I want to thank
everyone who showed up and provided both input and a pledge to
help with the development of the next version of the Xerces
parser. I have great optimism for the future! :)
--
Andy Clark * IBM, JTC - Silicon Valley * andyc@apache.org