Posted to j-dev@xerces.apache.org by Andy Clark <an...@apache.org> on 2000/10/20 04:21:52 UTC

[Xerces2] Update Report

It's time for an update on the Xerces2 concept implementation. 
I haven't had much time to write another batch of documentation
but I *have* been adding a ton of javadoc comments all over the 
place in an attempt to make parts of the implementation "self 
documenting". Of course this doesn't replace real docs which 
should and will be written in the future.

Work is progressing nicely on the Xerces2 concept codebase.
Every day more functionality is implemented and soon we'll have 
a fully functioning DTD-capable parser! :) This update will be
broken down into functional units.

Entity Management

  The entity manager and entity scanner have improved greatly
  since Milestone 1. The entity scanner is now fully implemented
  and handles newline normalization and line/column counting.
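
  As a rough illustration of what newline normalization plus
  line/column counting involves, here is a minimal sketch. It is
  not the Xerces2 entity scanner: the class LineTrackingScanner
  and its methods are made up for this example, and it reads from
  an in-memory CharSequence rather than from an entity stream.

```java
/** Hypothetical sketch; not the actual Xerces2 entity scanner. */
final class LineTrackingScanner {
    private final CharSequence text;
    private int pos = 0;
    private int line = 1;   // 1-based line of the next character
    private int column = 1; // 1-based column of the next character

    LineTrackingScanner(CharSequence text) {
        this.text = text;
    }

    /** Returns the next normalized character, or -1 at end of input. */
    int read() {
        if (pos >= text.length()) {
            return -1;
        }
        char c = text.charAt(pos++);
        if (c == '\r') {
            // Fold "\r\n" and a lone "\r" into a single "\n".
            if (pos < text.length() && text.charAt(pos) == '\n') {
                pos++;
            }
            c = '\n';
        }
        if (c == '\n') {
            line++;
            column = 1;
        } else {
            column++;
        }
        return c;
    }

    int getLine() { return line; }

    int getColumn() { return column; }
}
```

  Callers only ever see "\n" as a line break, so the document and
  DTD scanners would not need to know which convention the
  original file used.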

  The entity manager is also able to handle all forms of 
  entities that are referenced in documents and DTDs. Because 
  the work of switching entities is hidden behind the entity 
  scanner, managing entities and scanning their contents is 
  completely transparent to the document and DTD scanners.

Performance

  Work was done to improve the performance of Xerces2. While 
  there is still a lot of performance work that can be done, the 
  speed is comparable to Xerces 1.x. Most of the speed increase
  since the previous milestone came from three factors:

  1) Object creation was minimized. I don't care how tricky
     any particular VM is, it takes time to create and destroy
     objects. I made a 20% performance gain in *all* VMs by
     re-using some objects instead of creating new ones.
  2) Custom, optimized UTF-8 reader class. Unfortunately, the
     UTF-8 reader supplied with Java isn't very good. The new
     reader still needs a lot of testing, though, and doesn't
     handle the byte-order-mark (BOM). How can a UTF-8 file
     have a BOM, you say? Well, it can and some generated XML
     files actually have them!
  3) Simplified the inner loop of the scanContent method in
     the entity scanner by doing the newline normalization
     out of the loop and performing a bit test to see if there
     was nothing "special" about the characters in the loop. I
     thought that I could remove the bounds check as well but
     that didn't yield any speedup, so I left that change out.
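
  On the BOM point in item 2: a UTF-8 byte-order-mark is the
  three bytes EF BB BF at the very start of the stream. A minimal
  sketch of detecting and skipping it might look like the
  following. The Utf8Bom class and its method names are
  hypothetical, and the decoding is delegated to the JDK here
  rather than to a custom reader.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of Xerces2. */
final class Utf8Bom {
    /** Returns 3 if buf starts with the UTF-8 BOM (EF BB BF), else 0. */
    static int bomLength(byte[] buf, int length) {
        if (length >= 3
                && (buf[0] & 0xFF) == 0xEF
                && (buf[1] & 0xFF) == 0xBB
                && (buf[2] & 0xFF) == 0xBF) {
            return 3;
        }
        return 0;
    }

    /** Decodes UTF-8 bytes, dropping a leading BOM if one is present. */
    static String decode(byte[] buf) {
        int skip = bomLength(buf, buf.length);
        return new String(buf, skip, buf.length - skip, StandardCharsets.UTF_8);
    }
}
```

  Without the skip, the JDK decoder hands back U+FEFF as the
  first character, which a scanner would then trip over before
  it ever reaches the XML declaration.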

  The memory performance was helped by re-using objects. This
  made the parser stop consuming new memory completely at
  about 330K when using SAX. Neat! :) Considering that the
  java.net.URL#openStream method creates nearly 1700 objects
  alone and that we use a 64K array to store bit flags to
  speed up character tests, this is pretty good!
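
  To make the bit-flag idea concrete, here is a minimal sketch of
  a 64K lookup table indexed by char, with an inner loop that
  does one table lookup and one bit test per character. The names
  (CharFlags, SPECIAL, scanPlainRun) and the choice of which
  characters count as "special" are assumptions for this example,
  not the actual Xerces2 tables.

```java
/** Hypothetical sketch of a per-character flag table. */
final class CharFlags {
    /** Flag bit: character needs handling outside the fast loop. */
    static final int SPECIAL = 0x01;

    /** One byte of flags for every possible char value (64K entries). */
    static final byte[] FLAGS = new byte[1 << 16];

    static {
        // Assumed set of special characters for character content.
        FLAGS['<'] |= SPECIAL;
        FLAGS['&'] |= SPECIAL;
        FLAGS['\r'] |= SPECIAL;
        FLAGS['\n'] |= SPECIAL;
    }

    /**
     * Returns the index of the first special character in
     * buf[offset..end), or end if the whole range is plain.
     */
    static int scanPlainRun(char[] buf, int offset, int end) {
        int i = offset;
        while (i < end && (FLAGS[buf[i]] & SPECIAL) == 0) {
            i++;
        }
        return i;
    }
}
```

  The table costs 64K up front, but it is allocated once and
  shared, so it does not add to the per-document memory churn.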

Document Scanning

  The document scanner was pretty much in place before but it 
  has improved a little bit and now calls into the DTD scanner 
  for both the internal and external subset of the DTD. In 
  addition, entity references now work by calling into the 
  entity manager.

  However, there are still some gaps in the document scanner
  implementation. It can handle simple documents but needs to
  be revisited to complete the implementation.

DTD Scanning

  A ton of work went into the DTD scanner and it is nearing
  completion. It even handles all of those tricky, twisty
  parameter entities. Cool! We'll have a better estimate on
  how complete both the document and DTD scanner are once we
  hook this thing up to the compliance suite and start 
  running files through it.

  The DTD scanner makes all of the appropriate callbacks for
  the DTD information. The validator component sits between the
  scanner and parser instance. It acts as a tee that passes the
  events onward as well as calling into the DTDGrammar object
  which populates itself from the callbacks.
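
  The tee arrangement can be sketched roughly as follows. This is
  an illustration only: DtdHandler, GrammarSketch and TeeHandler
  are hypothetical stand-ins, with a single made-up callback, and
  do not reflect the real XNI or DTDGrammar signatures.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical callback interface with one DTD event. */
interface DtdHandler {
    void elementDecl(String name, String contentModel);
}

/** Stand-in for a grammar object that populates itself from callbacks. */
final class GrammarSketch implements DtdHandler {
    final Map<String, String> elements = new LinkedHashMap<>();

    @Override
    public void elementDecl(String name, String contentModel) {
        elements.put(name, contentModel);
    }
}

/** Sits between scanner and parser: feeds the grammar, then passes the event on. */
final class TeeHandler implements DtdHandler {
    private final DtdHandler grammar;
    private final DtdHandler next;

    TeeHandler(DtdHandler grammar, DtdHandler next) {
        this.grammar = grammar;
        this.next = next;
    }

    @Override
    public void elementDecl(String name, String contentModel) {
        grammar.elementDecl(name, contentModel);
        if (next != null) {
            next.elementDecl(name, contentModel); // forward downstream
        }
    }
}
```

  The scanner only ever talks to one handler, so it does not need
  to know whether a grammar is being built behind it.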

Validation

  The validation code is almost completely moved over from
  Xerces 1.x. The only thing remaining is to plug it in and
  see if it starts validating the document! I know that I'm
  over-simplifying the status of the validation but it should
  be "real soon now".

API Support

  We already had SAX1 and SAX2 support. Now we also have DOM 
  support but not at the same level that we did in version 1.x 
  of Xerces. The deferred DOM is currently out of commission and 
  *may* return in the future. The DOM parser simply builds a 
  DOM tree programmatically from the Xerces DOM implementation.
  Supporting some of the older features is still questionable
  and open to discussion.

  All kinds of different APIs can be supported by Xerces2 by
  using the XNI (Xerces Native Interface) callbacks. For a
  really good example of all of the callbacks that are available
  as well as to see the parser in action, check out the
  xni.DocumentTracer sample in the samples/xni/ directory.

Anything Else?

  I could really only go into detail about the parts that I
  worked on or knew something about, so if I missed anything, 
  please post your own progress reports to fill in the gaps.

The current implementation is available from CVS. Check out my
web page for instructions on how to check it out from the
repository:

  http://www.apache.org/~andyc/xerces2/

Lastly, we had a really good discussion with people at the
Apache Xerces2 Workshop hosted by Dirk at Covalent. There should
be a report of those proceedings posted soon. I want to thank
everyone who showed up and provided both input and a pledge to
help with the development of the next version of the Xerces
parser. I have great optimism for the future! :)

-- 
Andy Clark * IBM, JTC - Silicon Valley * andyc@apache.org