Posted to j-dev@xerces.apache.org by bu...@apache.org on 2001/11/26 15:28:38 UTC

DO NOT REPLY [Bug 5077] - Report top-level whitespace

http://nagoya.apache.org/bugzilla/show_bug.cgi?id=5077

------- Additional Comments From gmarcy@us.ibm.com  2001-11-26 06:28 -------
James Clark writes:

>I think it's consistent with the XNI philosophy of providing a
>lossless information set.  I would suggest that as a general
>principle, for every character in the document entity, there should be
>a callback with which that character is associated.  At the moment, I
>believe the only exception to this is top-level whitespace.

I should point out that top-level whitespace is not part of the infoset
for a document; it is markup, as Section 2.4 of the XML spec states:

[Definition: Markup takes the form of start-tags, end-tags, empty-element
tags, entity references, character references, comments, CDATA section
delimiters, document type declarations, processing instructions, XML
declarations, text declarations, and any white space that is at the top
level of the document entity (that is, outside the document element and
not inside any other markup).]

However, since it is a unique form of markup, it should probably have a
callback associated with it.  In most cases this would add only a small
number of callbacks, since such whitespace can occur only in the prolog
and after the document element.

>If this is parsed as an external entity, then the whitespace between
>the XML declaration and the element will be preserved; but if it's
>parsed as a document entity, the whitespace is totally thrown away.

As an external entity, that whitespace is character data and is part of
the document infoset.  As the document entity, that whitespace is not
part of the document infoset; it is markup.  I assume that you are
referring to external general entities, since an external parameter entity
would have the same issue as the document entity, i.e., its whitespace is
not part of the information set.
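
To make the document-entity half of this concrete, here is a minimal
sketch.  It uses the standard SAX API from the JDK rather than XNI (that
substitution is my assumption, chosen only because it is self-contained),
and the class and method names are hypothetical.  It collects every run of
text the parser reports and shows that the whitespace around the document
element never reaches a callback:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; SAX stands in for XNI here (an assumption).
public class TopLevelWhitespaceDemo {

    /** Parses xml as a document entity and returns every text run SAX reports. */
    public static List<String> reportedText(String xml) {
        List<String> runs = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                runs.add(new String(ch, start, length));
            }

            @Override
            public void ignorableWhitespace(char[] ch, int start, int length) {
                runs.add(new String(ch, start, length));
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // Wrap checked parser exceptions to keep the sketch short.
            throw new IllegalStateException(e);
        }
        return runs;
    }

    public static void main(String[] args) {
        // Whitespace follows the XML declaration and the document element.
        String xml = "<?xml version=\"1.0\"?>\n\n<doc>text</doc>\n\n";
        // Only the character data inside <doc> is reported.
        System.out.println(String.join("", reportedText(xml))); // prints "text"
    }
}
```

The newlines before and after the doc element are consumed by the parser
without any event, which is the behavior under discussion.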

>If a method is added to XMLDocumentHandler, then it would also be
>natural to add a similar method to XMLDTDHandler providing whitespace
>between markup declarations.

This could have a larger performance impact, since a DTD typically has
more declarations than the prolog/epilog has markup.  The hit is still
probably small enough to take on small grammars, and we know we want to
cache the larger grammars anyway.  Keep in mind, however, that without
caching there are some scenarios where many small documents are processed
with very large DTDs, even when not validating (to get attribute defaults,
etc.), and the performance of such cases is already quite poor.
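
As a rough illustration of why caching takes the sting out of the large
grammar case, here is a minimal sketch of a cache keyed by system
identifier.  Everything in it is hypothetical (the class name, the loader
function, and the use of a plain string in place of a real grammar
object); it is not the Xerces grammar caching design, only the general
idea that the DTD, and any per-declaration callbacks, would be processed
once rather than once per document:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch; G stands for whatever a parsed grammar would be.
public class GrammarCacheSketch<G> {
    private final Map<String, G> cache = new ConcurrentHashMap<>();
    private final Function<String, G> loader;
    private final AtomicInteger loads = new AtomicInteger();

    public GrammarCacheSketch(Function<String, G> loader) {
        this.loader = loader;
    }

    /** Returns the grammar for systemId, loading it only on the first request. */
    public G get(String systemId) {
        return cache.computeIfAbsent(systemId, id -> {
            loads.incrementAndGet(); // counts how often the expensive load ran
            return loader.apply(id);
        });
    }

    public int loadCount() {
        return loads.get();
    }

    public static void main(String[] args) {
        // The loader stands in for scanning a very large DTD.
        GrammarCacheSketch<String> cache =
                new GrammarCacheSketch<>(id -> "grammar for " + id);
        // Many small documents that all reference the same DTD.
        for (int i = 0; i < 1000; i++) {
            cache.get("big.dtd");
        }
        System.out.println(cache.loadCount()); // prints 1
    }
}
```

Without such a cache, the loader would run for every document, which is
the already poor scenario described above.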

Again, keep in mind that this whitespace is not part of the document
information set.  In the DTD it is not even markup; it is just used to set
apart markup for greater readability.
