Posted to c-dev@xerces.apache.org by "Jesse Pelton (JIRA)" <xe...@xml.apache.org> on 2007/08/08 18:04:59 UTC
[jira] Commented: (XERCESC-1284) Set "UTF-16" encoding for UTF16-BE
entity with BOM results in parse failure
[ https://issues.apache.org/jira/browse/XERCESC-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518487 ]
Jesse Pelton commented on XERCESC-1284:
---------------------------------------
This was fixed after 2.7 was released.
The relevant commit appears to be 378729. See http://svn.apache.org/viewvc?view=rev&revision=378729
> Set "UTF-16" encoding for UTF16-BE entity with BOM results in parse failure
> ---------------------------------------------------------------------------
>
> Key: XERCESC-1284
> URL: https://issues.apache.org/jira/browse/XERCESC-1284
> Project: Xerces-C++
> Issue Type: Bug
> Affects Versions: 2.7.0
> Environment: Fedora Core 1, x86 PC, gcc. Also seen similar failures in a Solaris 9 environment with the forte compiler.
> Reporter: Daniel McLean
> Attachments: MemParseEncoding.tar.gz, utf8BOMTest.tar.gz
>
>
> Setting the encoding as "UTF-16" using the InputSource.setEncoding() method seems to create problems during parsing.
> If I have a UTF-16BE document with a BOM, it parses successfully when no encoding is explicitly set or when the encoding is set to "UTF-16BE". When the encoding is set to "UTF-16", a fatal error occurs with:
> Fatal Error at (file test, line 1, char 1): Invalid document structure
> Some investigation: Having looked through the Xerces source and done some testing, it appears that when "UTF-16BE" is set, the "UTF-16 (BE)" transcoder is used because a match is detected against the known encoding string. When "UTF-16" is set, no known encoding is detected and the document is probed for an encoding, resulting in the XMLUTF16Transcoder being used. In the latter case, when XMLScanner::scanProlog() is called, it ends up reading the BOM and choking because the BOM doesn't look like a piece of the prolog. I'm guessing that either the transcoder should have removed the BOM, the BOM should be detected and ignored, or the BOM should have been trimmed off beforehand.
> I've attached a test case, derived from the MemParse sample, which parses four different UTF-16 documents (BE with BOM, BE without BOM, LE with BOM, LE without BOM) using four different encoding approaches (no encoding set, "UTF-16", "UTF-16BE", "UTF-16LE"). I realise UTF-16 XML entities should have a BOM, but in my case I want to know what happens if a client of my software feeds in a UTF-16 document without a BOM.
> A summary of parsing success and failure on Linux:
> FILE: UTF-16BE with BOM
> ENCODING: : Succeeded.
> ENCODING: UTF-16: Fatal error.
> ENCODING: UTF-16BE: Succeeded.
> ENCODING: UTF-16LE: Fatal error.
> --------------------------------
> FILE: UTF-16BE without BOM
> ENCODING: : Fatal error. (due to guess of UTF-8)
> ENCODING: UTF-16: Succeeded.
> ENCODING: UTF-16BE: Succeeded.
> ENCODING: UTF-16LE: Fatal error.
> --------------------------------
> FILE: UTF-16LE with BOM
> ENCODING: : Succeeded.
> ENCODING: UTF-16: Fatal error.
> ENCODING: UTF-16BE: Fatal error.
> ENCODING: UTF-16LE: Succeeded.
> --------------------------------
> FILE: UTF-16LE without BOM
> ENCODING: : Fatal error. (due to guess of UTF-8)
> ENCODING: UTF-16: Succeeded.
> ENCODING: UTF-16BE: Fatal error.
> ENCODING: UTF-16LE: Succeeded.
> --------------------------------
> Maybe there is a good reason for Xerces' current behaviour, but it
> escapes me. I note that the lack of a BOM helps the parser succeed
> when setting an encoding of "UTF-16", supporting my assertion above.
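
One of the options raised in the report is that the BOM could be detected and trimmed off before parsing. The following is only a minimal caller-side sketch of that idea, not the fix applied in Xerces-C++ (revision 378729 is not reproduced here). The names BomInfo and detectUtf16Bom are hypothetical and are not part of the Xerces-C++ API. The sketch inspects the first two bytes of a buffer and reports which explicit encoding name the BOM implies and how many bytes to skip.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper type, not part of Xerces-C++: describes what a
// leading UTF-16 byte order mark says about a buffer.
struct BomInfo {
    std::string encoding;   // "UTF-16BE", "UTF-16LE", or "" if no UTF-16 BOM
    std::size_t bomLength;  // number of leading bytes to skip (0 or 2)
};

// Inspect the first two bytes of the buffer for a UTF-16 BOM.
// Assumption: the caller only cares about UTF-16. A UTF-32LE BOM
// (FF FE 00 00) also starts with FF FE and is not distinguished here.
BomInfo detectUtf16Bom(const unsigned char* data, std::size_t size) {
    if (size >= 2) {
        if (data[0] == 0xFE && data[1] == 0xFF) return {"UTF-16BE", 2};
        if (data[0] == 0xFF && data[1] == 0xFE) return {"UTF-16LE", 2};
    }
    return {"", 0};  // no UTF-16 BOM found; leave the buffer untouched
}
```

A caller on an affected release could, under these assumptions, hand the parser the buffer starting at data + bomLength and pass the returned name to InputSource::setEncoding() in place of the generic "UTF-16". In the results above, a BOM-less document with the matching explicit encoding parsed successfully.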