Posted to c-dev@xerces.apache.org by "Jesse Pelton (JIRA)" <xe...@xml.apache.org> on 2007/08/08 18:04:59 UTC

[jira] Commented: (XERCESC-1284) Set "UTF-16" encoding for UTF16-BE entity with BOM results in parse failure

    [ https://issues.apache.org/jira/browse/XERCESC-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12518487 ] 

Jesse Pelton commented on XERCESC-1284:
---------------------------------------

This was fixed after 2.7 was released.

The relevant commit appears to be 378729. See http://svn.apache.org/viewvc?view=rev&revision=378729

> Set "UTF-16" encoding for UTF16-BE entity with BOM results in parse failure
> ---------------------------------------------------------------------------
>
>                 Key: XERCESC-1284
>                 URL: https://issues.apache.org/jira/browse/XERCESC-1284
>             Project: Xerces-C++
>          Issue Type: Bug
>    Affects Versions: 2.7.0
>         Environment: Fedora Core 1, x86 PC, gcc.  Also seen similar failures in a Solaris 9 environment with the forte compiler.
>            Reporter: Daniel McLean
>         Attachments: MemParseEncoding.tar.gz, utf8BOMTest.tar.gz
>
>
> Setting the encoding as "UTF-16" using the InputSource.setEncoding() method seems to create problems during parsing.
> If I have a UTF-16BE document with a BOM, it parses successfully when no encoding is explicitly set or when the encoding is set to "UTF-16BE".  When the encoding is set to "UTF-16", a fatal error occurs:
>    Fatal Error at (file test, line 1, char 1): Invalid document structure
> Some investigation: Having looked through the Xerces source and done some testing, it appears that when "UTF-16BE" is set, the string matches a known encoding and the "UTF-16 (BE)" transcoder is used.  When "UTF-16" is set, no known encoding is detected and the document is probed for an encoding, resulting in the XMLUTF16Transcoder being used.  In the latter case, when XMLScanner::scanProlog() is called, it ends up reading the BOM and choking because it doesn't look like part of the prolog.  I'm guessing that either the transcoder should have removed the BOM, the BOM should be detected and ignored, or the BOM should have been trimmed off beforehand.
> I've attached a test case derived from the MemParse sample. It parses four different UTF-16 documents (BE with BOM, BE without BOM, LE with BOM, LE without BOM) using four different encoding approaches (no encoding set, "UTF-16", "UTF-16BE", "UTF-16LE"). I realise UTF-16 XML entities should have a BOM, but in my case I want to know what happens if a client of my software feeds in a UTF-16 document without a BOM.
> A summary of parsing success and failure on Linux:
> FILE: UTF-16BE with BOM
> ENCODING: : Succeeded.
> ENCODING: UTF-16: Fatal error.
> ENCODING: UTF-16BE: Succeeded.
> ENCODING: UTF-16LE: Fatal error.
> --------------------------------
> FILE: UTF-16BE without BOM
> ENCODING: : Fatal error. (due to guess of UTF-8)
> ENCODING: UTF-16: Succeeded.
> ENCODING: UTF-16BE: Succeeded.
> ENCODING: UTF-16LE: Fatal error.
> --------------------------------
> FILE: UTF-16LE with BOM
> ENCODING: : Succeeded.
> ENCODING: UTF-16: Fatal error.
> ENCODING: UTF-16BE: Fatal error.
> ENCODING: UTF-16LE: Succeeded.
> --------------------------------
> FILE: UTF-16LE without BOM
> ENCODING: : Fatal error. (due to guess of UTF-8)
> ENCODING: UTF-16: Succeeded.
> ENCODING: UTF-16BE: Fatal error.
> ENCODING: UTF-16LE: Succeeded.
> --------------------------------
> Maybe there is a good reason for Xerces' current behaviour, but it
> escapes me.  I note that the absence of a BOM lets the parse succeed
> when the encoding is set to "UTF-16", which supports my assertion above.
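
For readers following the investigation quoted above, here is a minimal sketch of one of the remedies the reporter suggests: detect a leading UTF-16 BOM, use it to pick the byte order when the caller asked for plain "UTF-16", and skip it before the prolog is scanned. This is not the code from revision 378729 and it does not use any Xerces API; the names Utf16Bom, detectUtf16Bom and resolveUtf16Label are hypothetical, and the exact (case-sensitive) label match is a simplifying assumption.

```cpp
#include <cstddef>
#include <string>

// Byte order implied by a leading UTF-16 byte order mark, if any.
enum class Utf16Bom { None, BigEndian, LittleEndian };

// Inspect the first two bytes of a buffer for a UTF-16 BOM.
// FE FF means big-endian, FF FE means little-endian.
// (Hypothetical helper; a fuller version would also rule out the
// UTF-32LE BOM, which begins with the same FF FE bytes.)
Utf16Bom detectUtf16Bom(const unsigned char* data, std::size_t size) {
    if (data == nullptr || size < 2) return Utf16Bom::None;
    if (data[0] == 0xFE && data[1] == 0xFF) return Utf16Bom::BigEndian;
    if (data[0] == 0xFF && data[1] == 0xFE) return Utf16Bom::LittleEndian;
    return Utf16Bom::None;
}

// Resolve a caller-supplied encoding label against the buffer contents.
// When the label is the generic "UTF-16" and a BOM is present, return the
// endian-specific label and set payloadOffset to 2 so the BOM is skipped.
// In every other case the label is returned unchanged and payloadOffset
// is 0, leaving further probing to the caller (an assumption made here to
// keep the sketch small).
std::string resolveUtf16Label(const std::string& requested,
                              const unsigned char* data,
                              std::size_t size,
                              std::size_t& payloadOffset) {
    payloadOffset = 0;
    if (requested != "UTF-16") return requested;

    switch (detectUtf16Bom(data, size)) {
        case Utf16Bom::BigEndian:
            payloadOffset = 2;
            return "UTF-16BE";
        case Utf16Bom::LittleEndian:
            payloadOffset = 2;
            return "UTF-16LE";
        case Utf16Bom::None:
            break;
    }
    return requested;
}
```

Under these assumptions, a buffer beginning FE FF that is labelled "UTF-16" would be handed on as "UTF-16BE" starting at byte 2, so the scanner never sees the BOM as document content.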
