Posted to j-dev@xerces.apache.org by Elliotte Harold <el...@metalab.unc.edu> on 2003/10/07 05:22:25 UTC

Reporting of well-formed content in malformed documents

Consider a document such as the following:

<root>
  <child1/>
  <child2>
</root>

Clearly it is malformed because the </child2> end-tag is missing. 
However, a streaming parser using SAX will still report 
startDocument(), startElement(root), characters(), 
startElement(child1), endElement(child1), characters(), and 
startElement(child2) before the malformedness is detected and a 
SAXParseException is thrown.
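
One way to check which events actually arrive before the exception is a small recording handler. The following is only a sketch: the class name EventRecorder and the method recordEvents are made up for illustration, and it asks JAXP for whatever SAX parser the JDK provides, which is not necessarily Xerces-J 2.5. It records startDocument and element events (characters() calls are left out because a parser may split them arbitrarily) and notes when a SAXParseException is thrown.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of SAX or Xerces.
class EventRecorder {

    // Parses xml and returns the events seen, in order, ending with
    // "SAXParseException" if the parser reported a fatal error.
    static List<String> recordEvents(String xml) throws Exception {
        final List<String> events = new ArrayList<>();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startDocument() {
                events.add("startDocument()");
            }

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                events.add("startElement(" + qName + ")");
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("endElement(" + qName + ")");
            }
            // fatalError() is inherited from DefaultHandler, which
            // rethrows the SAXParseException it is given.
        };

        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.setErrorHandler(handler);

        try {
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXParseException e) {
            events.add("SAXParseException");
        }
        return events;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root>\n  <child1/>\n  <child2>\n</root>";
        for (String event : recordEvents(xml)) {
            System.out.println(event);
        }
    }
}
```

If the parser reports everything before the first well-formedness error, the printed list should include startElement(child2) ahead of the SAXParseException entry.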

Or will it? In my tests with Xerces-J 2.5 I'm getting only 
startDocument() before a SAXParseException is thrown. The XML spec does 
not require a parser to throw away content found before the first 
well-formedness error. However, Xerces seems to be throwing it away for 
me, and I can't find anything in the SAX spec to say this is wrong. Not 
having guaranteed access to the well-formed initial section of the 
document really decreases the usefulness of a streaming API.

For my app, I would like to guarantee that all content before the first 
well-formedness error is reported via the normal mechanisms. Is this 
possible? Is this a good idea? Should SAX be rewritten to require this 
behavior? Or am I out to sea? Thoughts?


--
Elliotte Rusty Harold


---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-j-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-j-dev-help@xml.apache.org


Re: [Sax-devel] Reporting of well-formed content in malformed documents

Posted by Elliotte Rusty Harold <el...@metalab.unc.edu>.
At 11:22 PM -0400 10/6/03, Elliotte Harold wrote:
>Consider a document such as the following:
>
><root>
>  <child1/>
>  <child2>
></root>
>

Following up on this, I withdraw my original report that Xerces was 
not reporting all the content before the first well-formedness error. 
It was, but the unusual nature of the code caused me to misattribute 
to Xerces a different bug. My apologies.

However, I do think the SAX ContentHandler documentation needs to 
state more clearly that a well-formedness error should not be reported 
until all preceding content has been reported through the usual 
mechanism. This strikes me as especially important for filters that 
sit on top of other APIs such as StAX or XNI, where it might be 
possible for a well-formedness error to be detected in one thread 
before the queue of SAX events has been emptied in another thread.

The current documentation says that the order of the events in the 
ContentHandler interface is important, but it doesn't say that for 
events in the ErrorHandler interface. There are other ordering issues 
left unsaid as well. For instance, I can't find any rule that 
comments reported by the LexicalHandler need to be reported at their 
appropriate position.

I think some general language is needed such as:

The order of events in the ContentHandler, ErrorHandler, and 
LexicalHandler interfaces is very important, and mirrors the order of 
information in the document itself. For example, all of an element's 
content (character data, comments, processing instructions, and/or 
subelements) will appear, in order, between the startElement event 
and the corresponding endElement event.
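
To see how a given parser orders LexicalHandler events relative to ContentHandler events, something like the following could be used. It is a sketch under assumptions: the class name CommentOrder and the method recordEvents are invented for illustration, and the parser is whatever JAXP supplies. The handler is registered through the standard SAX property http://xml.org/sax/properties/lexical-handler, and DefaultHandler2 from org.xml.sax.ext supplies empty implementations of both interfaces.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical helper, not part of SAX or Xerces.
class CommentOrder {

    // Returns element and comment events in the order the parser delivers them.
    static List<String> recordEvents(String xml) throws Exception {
        final List<String> events = new ArrayList<>();

        // DefaultHandler2 implements ContentHandler and LexicalHandler.
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                events.add("startElement(" + qName + ")");
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("endElement(" + qName + ")");
            }

            @Override
            public void comment(char[] ch, int start, int length) {
                events.add("comment(" + new String(ch, start, length) + ")");
            }
        };

        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        // Standard SAX2 property ID for registering a LexicalHandler.
        reader.setProperty(
            "http://xml.org/sax/properties/lexical-handler", handler);

        reader.parse(new InputSource(new StringReader(xml)));
        return events;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><!--c1--><child/><!--c2--></root>";
        for (String event : recordEvents(xml)) {
            System.out.println(event);
        }
    }
}
```

A parser that follows the proposed wording would report comment(c1) between startElement(root) and startElement(child), and comment(c2) between endElement(child) and endElement(root).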



-- 

   Elliotte Rusty Harold
   elharo@metalab.unc.edu
   Processing XML with Java (Addison-Wesley, 2002)
   http://www.cafeconleche.org/books/xmljava
   http://www.amazon.com/exec/obidos/ISBN%3D0201771861/cafeaulaitA
