Posted to j-dev@xerces.apache.org by bu...@apache.org on 2004/03/10 21:40:40 UTC

DO NOT REPLY [Bug 27583] New: - Xerces throws IOExceptions that should be SAXExceptions for bad UTF-8 and similar

DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://issues.apache.org/bugzilla/show_bug.cgi?id=27583>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

http://issues.apache.org/bugzilla/show_bug.cgi?id=27583

Xerces throws IOExceptions that should be SAXExceptions for bad UTF-8 and similar

           Summary: Xerces throws IOExceptions that should be SAXExceptions
                    for bad UTF-8 and similar
           Product: Xerces2-J
           Version: 2.6.2
          Platform: Other
        OS/Version: Other
            Status: NEW
          Severity: Normal
          Priority: Other
         Component: SAX
        AssignedTo: xerces-j-dev@xml.apache.org
        ReportedBy: elharo@metalab.unc.edu


When Xerces (XMLReader.parse()) encounters malformed Unicode data, such as an
invalid UTF-8 sequence, it throws an IOException, more specifically a
UTFDataFormatException or a CharConversionException.  However, according to the
SAX and XML specifications, this should be a SAXException that is reported to
the ErrorHandler's fatalError() method.
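
The behavior described above can be checked with a small probe along the
following lines. This is only a sketch: the class name BadUtf8Probe, the
RecordingHandler helper, and the sample document are made up for illustration,
and the parser is whatever JAXP implementation SAXParserFactory returns, which
is not necessarily Xerces 2.6.2. The exception type it reports may therefore
differ from the one described in this report.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class BadUtf8Probe {

    /** Hypothetical helper: remembers whether the parser called fatalError(). */
    static class RecordingHandler extends DefaultHandler {
        boolean fatalErrorCalled = false;

        @Override
        public void fatalError(SAXParseException e) throws SAXException {
            fatalErrorCalled = true;
            throw e; // same as the DefaultHandler behavior, after recording the call
        }
    }

    /** Builds a document declared as UTF-8 whose content holds the illegal byte 0xFF. */
    static byte[] malformedDocument() {
        String text = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root>?</root>";
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        // Overwrite the '?' placeholder inside <root>; 0xFF never occurs in valid UTF-8.
        bytes[bytes.length - "</root>".length() - 1] = (byte) 0xFF;
        return bytes;
    }

    /** Parses the bytes and describes which kind of exception, if any, came back. */
    static String probe(byte[] document) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        RecordingHandler handler = new RecordingHandler();
        reader.setContentHandler(handler);
        reader.setErrorHandler(handler);
        try {
            reader.parse(new InputSource(new ByteArrayInputStream(document)));
            return "parsed";
        } catch (SAXException e) {
            return e.getClass().getName() + ", fatalError called: " + handler.fatalErrorCalled;
        } catch (IOException e) {
            return e.getClass().getName() + ", fatalError called: " + handler.fatalErrorCalled;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(probe(malformedDocument()));
    }
}
```

If the output names an IOException subclass and shows that fatalError() was
not called, the parser in use behaves as described in this report.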

Note first that the XML spec states, in section 4.3.3:

It is a fatal error when an XML processor encounters an entity with an encoding
that it is unable to process. It is a fatal error if an XML entity is determined
(via default, encoding declaration, or higher-level protocol) to be in a certain
encoding but contains byte sequences that are not legal in that encoding.
Specifically, it is a fatal error if an entity encoded in UTF-8 contains any
irregular code unit sequences, as defined in Unicode 3.1 [Unicode3]. Unless an
encoding is determined by a higher-level protocol, it is also a fatal error if
an XML entity contains no encoding declaration and its content is not legal
UTF-8 or UTF-16.

The SAX spec says of the fatalError() method, "This corresponds to the
definition of 'fatal error' in section 1.2 of the W3C XML 1.0 Recommendation.
For example, a parser would use this callback to report the violation of a
well-formedness constraint." At one point I thought it was OK to report this as
an IOException. However, since the XML spec is unambiguous that character
encoding errors are fatal errors, and since the SAX spec does not limit fatal
errors to well-formedness errors, I think character encoding errors should be
reported as SAXExceptions rather than IOExceptions, and should be reported to
the fatalError() method.
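
For illustration, the requested behavior can be approximated on the caller's
side with a thin wrapper such as the one below. This is a sketch only, not a
proposed patch to Xerces: the class name EncodingErrorAdapter and its parse()
method are hypothetical, and the line and column numbers are set to -1 because
the wrapper has no access to the parser's locator.

```java
import java.io.CharConversionException;
import java.io.IOException;
import java.io.UTFDataFormatException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

public class EncodingErrorAdapter {

    /**
     * Parses the input and turns character encoding failures into a
     * SAXParseException that is first handed to the reader's ErrorHandler.
     * Other IOExceptions (for example, a failed read) pass through unchanged.
     */
    public static void parse(XMLReader reader, InputSource input)
            throws IOException, SAXException {
        try {
            reader.parse(input);
        } catch (CharConversionException | UTFDataFormatException e) {
            // Position is unknown at this level, hence -1 for line and column.
            SAXParseException fatal = new SAXParseException(
                    e.getMessage(), input.getPublicId(), input.getSystemId(), -1, -1, e);
            ErrorHandler handler = reader.getErrorHandler();
            if (handler != null) {
                handler.fatalError(fatal); // the handler may throw its own exception
            }
            throw fatal; // parsing cannot continue after a fatal error
        }
    }
}
```

A caller would invoke EncodingErrorAdapter.parse(reader, source) in place of
reader.parse(source); the original IOException remains available through
getException() on the resulting SAXParseException.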
