Posted to j-dev@xerces.apache.org by Elliotte Harold <el...@metalab.unc.edu> on 2003/10/07 05:22:25 UTC
Reporting of well-formed content in malformed documents
Consider a document such as the following:
<root>
  <child1/>
  <child2>
</root>
Clearly it is malformed because the </child2> end-tag is missing.
However, a streaming parser using SAX will still report
startDocument(), startElement(root), characters(),
startElement(child1), endElement(child1), characters(), and
startElement(child2) before the malformedness is detected and a
SAXParseException is thrown.
Or will it? In my tests with Xerces-J 2.5 I'm getting only
startDocument() before a SAXParseException is thrown. The XML spec does
not require a parser to throw away content found before the first
well-formedness error. However, Xerces seems to be throwing it away for
me, and I can't find anything in the SAX spec to say this is wrong. Not
having guaranteed access to the well-formed initial section of the
document really decreases the usefulness of a streaming API.
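
For anyone who wants to repeat the test, here is a minimal sketch that
logs each callback in the order it arrives. The class name
MalformedEventDemo and the helper collectEvents are made up for
illustration, and the sketch uses whatever parser JAXP supplies by
default, which is not necessarily Xerces-J 2.5, so results may differ
between parsers and versions.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class MalformedEventDemo {

    // Hypothetical helper: parses the string and returns the names of the
    // callbacks that fired, in the order they fired.
    static List<String> collectEvents(String xml) throws Exception {
        final List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() {
                events.add("startDocument");
            }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                events.add("startElement(" + qName + ")");
            }
            @Override public void endElement(String uri, String localName,
                                             String qName) {
                events.add("endElement(" + qName + ")");
            }
            @Override public void characters(char[] ch, int start, int length) {
                // A parser may split or merge text, so only the fact is logged.
                events.add("characters");
            }
            @Override public void fatalError(SAXParseException e)
                    throws SAXException {
                events.add("fatalError");
                throw e; // same as the DefaultHandler behavior
            }
        };

        // Assumption: the JAXP default parser stands in for Xerces here.
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.setErrorHandler(handler);
        try {
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXParseException e) {
            events.add("SAXParseException");
        }
        return events;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root>\n<child1/>\n<child2>\n</root>";
        for (String event : collectEvents(xml)) {
            System.out.println(event);
        }
    }
}
```

If the parser reports everything it saw before the error, the
startElement(child2) line should appear in the log ahead of the
fatalError line.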
For my app, I would like to guarantee that all content before the first
well-formedness error is reported via the normal mechanisms. Is this
possible? Is this a good idea? Should SAX be rewritten to require this
behavior? Or am I out to sea? Thoughts?
--
Elliotte Rusty Harold
---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-dev-help@xml.apache.org
Re: [Sax-devel] Reporting of well-formed content in malformed documents
Posted by Elliotte Rusty Harold <el...@metalab.unc.edu>.
At 11:22 PM -0400 10/6/03, Elliotte Harold wrote:
>Consider a document such as the following:
>
><root>
> <child1/>
> <child2>
></root>
>
Following up on this, I withdraw my original report that Xerces was
not reporting all the content before the first well-formedness error.
It was, but the unusual nature of the code caused me to misattribute
to Xerces a different bug. My apologies.
However, I do think the SAX ContentHandler documentation still needs
to state more clearly that well-formedness errors should not be reported
until after all preceding content has been reported through the usual
mechanism. This strikes me as especially important for filters that
sit on top of other APIs such as StAX or XNI, where it might be
possible for a well-formedness error to be detected in one thread
before the queue of SAX events has been emptied in another thread.
The current documentation says that the order of the events in the
ContentHandler interface is important, but it doesn't say that for
events in the ErrorHandler interface. There are other ordering issues
left unsaid as well. For instance, I can't find any rule that
comments reported by the LexicalHandler need to be reported at their
appropriate position.
I think some general language is needed such as:
The order of events in the ContentHandler, ErrorHandler, and
LexicalHandler interfaces is very important, and mirrors the order of
information in the document itself. For example, all of an element's
content (character data, comments, processing instructions, and/or
subelements) will appear, in order, between the startElement event
and the corresponding endElement event.
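
To make the proposed wording concrete, here is a rough sketch that
registers one object as both ContentHandler and LexicalHandler and logs
the callbacks in arrival order. The class name EventOrderDemo and the
helper collectEvents are invented for this example, and it relies on
the JAXP default parser supporting the standard lexical-handler
property, which SAX does not require of every parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

public class EventOrderDemo {

    // Hypothetical helper: returns element, comment, and processing
    // instruction callbacks in the order the parser delivered them.
    static List<String> collectEvents(String xml) throws Exception {
        final List<String> events = new ArrayList<>();
        // DefaultHandler2 implements both ContentHandler and LexicalHandler.
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                events.add("startElement(" + qName + ")");
            }
            @Override public void endElement(String uri, String localName,
                                             String qName) {
                events.add("endElement(" + qName + ")");
            }
            @Override public void comment(char[] ch, int start, int length) {
                events.add("comment(" + new String(ch, start, length) + ")");
            }
            @Override public void processingInstruction(String target,
                                                        String data) {
                events.add("processingInstruction(" + target + ")");
            }
        };

        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.setErrorHandler(handler);
        // Standard SAX2 property name for attaching a LexicalHandler.
        reader.setProperty(
            "http://xml.org/sax/properties/lexical-handler", handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return events;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><a/><!--note--><?pi data?><b/></root>";
        for (String event : collectEvents(xml)) {
            System.out.println(event);
        }
    }
}
```

Under the suggested rule, the comment and the processing instruction
would have to be logged after endElement(a) and before startElement(b),
all between the start and end of root.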
--
Elliotte Rusty Harold
elharo@metalab.unc.edu
Processing XML with Java (Addison-Wesley, 2002)
http://www.cafeconleche.org/books/xmljava
http://www.amazon.com/exec/obidos/ISBN%3D0201771861/cafeaulaitA