Posted to c-dev@xerces.apache.org by "Peter A. Volchek" <Pe...@ti.com.od.ua> on 2002/02/26 16:23:03 UTC

Parsing an invalid UTF-8 file

Here are two XML files, both of which contain a character that is invalid
in UTF-8 (the pound sign):
test_Fail.xml - does not contain the XML declaration, so it should be
treated as UTF-8
test_Success.xml - specifies Latin1 encoding
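
To illustrate why such a byte is rejected when the file is read as UTF-8,
here is a minimal sketch of a well-formedness check. It is not Xerces or
XML4C code; the function name firstInvalidUtf8 is hypothetical, and it
assumes the pound sign is stored as the single Latin1 byte 0xA3, which is
a guess about the test files rather than something verified.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of any parser's API.
// Returns the 0-based offset of the first byte that breaks UTF-8
// well-formedness, or std::string::npos if the buffer is well-formed.
std::size_t firstInvalidUtf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;  // number of continuation bytes expected
        unsigned long cp = 0;   // code point being decoded

        if (b < 0x80)                    { extra = 0; cp = b; }
        else if (b >= 0xC2 && b <= 0xDF) { extra = 1; cp = b & 0x1F; }
        else if (b >= 0xE0 && b <= 0xEF) { extra = 2; cp = b & 0x0F; }
        else if (b >= 0xF0 && b <= 0xF4) { extra = 3; cp = b & 0x07; }
        else return i;  // stray continuation byte (0x80-0xBF) or illegal lead byte

        for (std::size_t k = 1; k <= extra; ++k) {
            if (i + k >= s.size()) return i;       // sequence is truncated
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return i + k;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong encodings, surrogates, and values above U+10FFFF.
        if ((extra == 2 && cp < 0x800) || (extra == 3 && cp < 0x10000) ||
            (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
            return i;
        }
        i += extra + 1;
    }
    return std::string::npos;
}
```

Under that assumption, firstInvalidUtf8("<A>\xA3</A>") returns 3 (the
fourth byte), while the same text with the pound sign encoded as the
UTF-8 pair 0xC2 0xA3 returns std::string::npos.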

When I open test_Fail.xml with Internet Explorer, I get the following
error:
An invalid character was found in text content. Error processing resource
'file:///F:/test_Fail.xml'. Line 1, Position 4
<A>

When I parse it with the old (XML4C) parser, I also get an exception:
An exception occured! Type:UTFDataFormatException, Message:Invalid second
byte of a UTF-8 character sequence ( line 1, char 4 )

This is what I expect, but when I parse the same file with the Xerces
code, it is parsed successfully.

I am lost. Could someone explain what is going on, and why no error is
raised when parsing with the Xerces parser?

Thanks in advance

Peter A. Volchek