Posted to j-dev@xerces.apache.org by Andy Clark <an...@apache.org> on 2000/10/01 23:22:00 UTC

[Xerces2] Update

I haven't posted anything for a while, so I thought that I'd give
the Xerces community an update about what we've accomplished. I
apologize ahead of time for the length of this message but I
wanted to make sure that I covered everything.

First, I set a bogus deadline of last Friday, 29 September, for
Milestone 1 of the Xerces 2 concept implementation. This wasn't
a hard-and-fast date but, rather, a way of framing the progress
of the implementation.

Tasks were set for the milestone and people were assigned as
drivers for each task. Since no one stepped forward to help
contribute, you'll notice that the drivers for the tasks are
"those" guys from a company that will remain unnamed. So I
don't want to hear any complaining... ;)

Anyway, the first milestone was rather simple: write docs;
fill in the various support classes; and get a basic document
scanner working. In that respect, Milestone 1 was reached
successfully but not without some limitations and significant
work left to do. For example:

1) The document scanner was implemented using recursive
   descent; could only handle elements, attributes, and basic
   text content; and didn't use the error reporting mechanism.
2) The entity manager and scanner are very simple. In fact, I
   think the entity scanner wins an award for being the most
   inefficient ever. :) But performance wasn't the goal;
   the goal was to write a basic entity scanner that only
   implemented the bare minimum for the document scanner. And
   that's what it does. Buffering and other optimizations can
   be added to the entity scanner in the next Milestone.

After reaching Milestone 1, I tagged the source files with
the name "x2m1". Then I realized that nothing documented how
to actually get the source code on the development branch.
So I updated the main project page on my Apache web site to
include instructions on how to extract the source from CVS
and to document all of the branches that are available. You
can check out these directions at the
following URL:

  http://www.apache.org/~andyc/xerces2/index.html#SourceCode

This should make it easier for people following the Xerces
2 concept implementation development to get the source and
jump in and help. All of the docs for the concept design
are in CVS as well.

Next, I wasn't satisfied with the simple document scanner
so I re-wrote it over the weekend. I borrowed the main idea
from the original Xerces implementation. The scanner is set
up as a state machine which provides the following features:

1) The parsing of the document is always "top level". In
   other words, it never recurses. Instead, scanning of
   components in the XML document causes the state machine
   to transition from state to state.
2) Having a state machine allows the document scanner to
   support "pull-parsing" which has been discussed at some
   length on the mailing list in the past. If you look at
   the code, you'll see the "dispatching" mechanism that
   allows for pull-parsing where the application can
   drive the parser and not the other way around.
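
To make the idea concrete, here is a minimal sketch of a
state-machine scanner with a pull-style next() call. It is an
illustration only, not the code in CVS: the class name, the
State values, and the string "events" are all hypothetical, and
it handles only bare start tags, end tags, and text (no
attributes, entities, comments, or error reporting).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy scanner; NOT the Xerces2 implementation. */
class PullScannerSketch {
    /** The states the scanner moves between (assumed names). */
    enum State { CONTENT, START_TAG, END_TAG, DONE }

    private final String input;
    private int pos = 0;
    private State state = State.CONTENT;

    PullScannerSketch(String input) { this.input = input; }

    /**
     * Scans one event and returns it, or null at end of input.
     * The loop only switches on the current state; it never
     * calls itself, so nesting depth does not grow the stack.
     */
    String next() {
        while (true) {
            switch (state) {
                case CONTENT: {
                    if (pos >= input.length()) { state = State.DONE; break; }
                    if (input.charAt(pos) == '<') {
                        boolean end = pos + 1 < input.length()
                                && input.charAt(pos + 1) == '/';
                        state = end ? State.END_TAG : State.START_TAG;
                        break; // leave the switch; the loop dispatches again
                    }
                    int lt = input.indexOf('<', pos);
                    if (lt < 0) lt = input.length();
                    String text = input.substring(pos, lt);
                    pos = lt;
                    return "text:" + text;
                }
                case START_TAG: {
                    int gt = closeOf(pos);
                    String name = input.substring(pos + 1, gt);
                    pos = gt + 1;
                    state = State.CONTENT;
                    return "start:" + name;
                }
                case END_TAG: {
                    int gt = closeOf(pos);
                    String name = input.substring(pos + 2, gt);
                    pos = gt + 1;
                    state = State.CONTENT;
                    return "end:" + name;
                }
                default:
                    return null; // DONE
            }
        }
    }

    /** Index of the '>' that closes the tag starting at 'from'. */
    private int closeOf(int from) {
        int gt = input.indexOf('>', from);
        if (gt < 0) throw new IllegalStateException("unterminated tag at " + from);
        return gt;
    }

    /** Convenience driver: the caller pulls events until null. */
    static List<String> scanAll(String xml) {
        PullScannerSketch s = new PullScannerSketch(xml);
        List<String> events = new ArrayList<>();
        for (String e = s.next(); e != null; e = s.next()) events.add(e);
        return events;
    }

    public static void main(String[] args) {
        for (String e : scanAll("<a>hi<b></b></a>")) System.out.println(e);
    }
}
```

The point of the sketch is who owns the loop: the application
calls next() whenever it wants another event, and all of the
scanner's progress lives in the state and position fields
rather than on the call stack.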

A lot of the code of the original Xerces implementation
was tied directly to the old entity scanning mechanism.
This, combined with the fact that all of the states were
"unrolled", caused the code to become rather complicated.
The new re-write of the scanner simplifies this with the
thought that it can be modified later for performance.

The new scanner code can also handle more XML constructs
than the scanner checked in as part of Milestone 1. The
scanner can handle the XMLDecl line; the prolog with the
DoctypeDecl; comments and processing instructions; basic
entities; as well as elements, attributes, and text content.

There's still considerable work to do on the document 
scanner, though. But I think it's a reasonable skeleton
to build on.

Last, we'll be starting work towards Milestone 2 soon. We
still have to set tasks for the milestone and would welcome
help from anyone who wants to pitch in. This coming Friday
may be too ambitious a "deadline" for Milestone 2 but
that will depend on the tasks for the milestone.

Right now I'm thinking of the following being some tasks
for Milestone 2. 

1) More documentation
2) Reorganizing the package structure to separate core
   interfaces and classes from the implementation. My
   current thoughts are that we'll have an XNI package
   that contains the core interfaces; an implementation
   package; and a parsers package. There could be more,
   but I'd like to "clean up" the Xerces package.
3) Improve entity scanner and handle entity readers.
4) Scanning the DOCTYPE and DTD markup.
5) Storing basic DTD information.

Let me know if anyone has any suggestions or wants to
help with the implementation, documentation, or 
administrative work.

And that's all for my update.

-- 
Andy Clark * IBM, JTC - Silicon Valley * andyc@apache.org