Posted to c-dev@xerces.apache.org by xe...@xml.apache.org on 2004/10/05 08:55:32 UTC

[jira] Created: (XERCESC-1284) Set "UTF-16" encoding for UTF16-BE entity with BOM results in parse failure

Message:

  A new issue has been created in JIRA.

---------------------------------------------------------------------
View the issue:
  http://issues.apache.org/jira/browse/XERCESC-1284

Here is an overview of the issue:
---------------------------------------------------------------------
        Key: XERCESC-1284
    Summary: Set "UTF-16" encoding for UTF16-BE entity with BOM results in parse failure
       Type: Bug

     Status: Unassigned
   Priority: Major

    Project: Xerces-C++
   Versions:
             2.6.0

   Assignee: 
   Reporter: Daniel McLean

    Created: Mon, 4 Oct 2004 11:55 PM
    Updated: Mon, 4 Oct 2004 11:55 PM
Environment: Fedora Core 1, x86 PC, gcc.  Similar failures also seen in a Solaris 9 environment with the Forte compiler.

Description:
Setting the encoding as "UTF-16" using the InputSource.setEncoding() method seems to create problems during parsing.

If I have a UTF-16BE document with a BOM, it parses successfully when no encoding is explicitly set or when the encoding is set to "UTF-16BE".  When the encoding is set to "UTF-16", a fatal error occurs with:
   Fatal Error at (file test, line 1, char 1): Invalid document structure

Some investigation: Having looked through the Xerces source and done some testing, it appears that when "UTF-16BE" is set, the "UTF-16 (BE)" transcoder is used because a match is detected against the known encoding string.  When "UTF-16" is set, no known encoding is detected and the document is probed for an encoding, resulting in the XMLUTF16Transcoder being used.  In the latter case, when XMLScanner::scanProlog() is called, it ends up reading the BOM and choking because the BOM doesn't look like a piece of prologue.  I'm guessing that either the transcoder should have removed the BOM, the BOM should be detected and ignored, or the BOM should have been trimmed off beforehand.
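
As an illustration of a possible caller-side workaround (not a fix inside Xerces), the sketch below inspects the first two bytes of a buffer and returns the endian-specific encoding name, which the results summary in this report shows parsing successfully when a BOM is present. The function name utf16EncodingFromBom is hypothetical and is not part of the Xerces-C++ API. It assumes the caller already knows the buffer is UTF-16, since the byte pair FF FE also begins a UTF-32LE BOM.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of Xerces-C++.
// Returns "UTF-16BE" or "UTF-16LE" when the buffer starts with the
// matching byte order mark, or an empty string when no BOM is found.
// Assumption: the caller already knows the data is UTF-16; the bytes
// FF FE are also the start of a UTF-32LE BOM, which is not checked here.
inline std::string utf16EncodingFromBom(const unsigned char* data,
                                        std::size_t size) {
    assert(data != nullptr || size == 0);
    if (size >= 2) {
        if (data[0] == 0xFE && data[1] == 0xFF) {
            return "UTF-16BE";  // big-endian BOM: FE FF
        }
        if (data[0] == 0xFF && data[1] == 0xFE) {
            return "UTF-16LE";  // little-endian BOM: FF FE
        }
    }
    return std::string();       // no BOM: leave the encoding choice to the caller
}
```

A caller could use the returned name in place of the generic "UTF-16" when calling setEncoding(), and fall back to its own default when the result is empty.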

I've attached a test case, derived from the MemParse sample, which parses four different UTF-16 documents (BE with BOM, BE without BOM, LE with BOM, LE without BOM) using four different encoding approaches (no encoding set, "UTF-16", "UTF-16BE", "UTF-16LE").  I realise UTF-16 XML entities should have a BOM, but in my case I want to know what happens if a client of my software feeds in a UTF-16 document without a BOM.

A summary of parsing success and failure on Linux (an empty ENCODING value means no encoding was set):

FILE: UTF-16BE with BOM
ENCODING: : Succeeded.
ENCODING: UTF-16: Fatal error.
ENCODING: UTF-16BE: Succeeded.
ENCODING: UTF-16LE: Fatal error.
--------------------------------
FILE: UTF-16BE without BOM
ENCODING: : Fatal error. (due to guess of UTF-8)
ENCODING: UTF-16: Succeeded.
ENCODING: UTF-16BE: Succeeded.
ENCODING: UTF-16LE: Fatal error.
--------------------------------
FILE: UTF-16LE with BOM
ENCODING: : Succeeded.
ENCODING: UTF-16: Fatal error.
ENCODING: UTF-16BE: Fatal error.
ENCODING: UTF-16LE: Succeeded.
--------------------------------
FILE: UTF-16LE without BOM
ENCODING: : Fatal error. (due to guess of UTF-8)
ENCODING: UTF-16: Succeeded.
ENCODING: UTF-16BE: Fatal error.
ENCODING: UTF-16LE: Succeeded.
--------------------------------

Maybe there is a good reason for Xerces' current behaviour, but it
escapes me.  I note that omitting the BOM lets the parse succeed
when the encoding is set to "UTF-16", which supports my assertion above.


---------------------------------------------------------------------
JIRA INFORMATION:
This message is automatically generated by JIRA.

If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa

If you want more information on JIRA, or have a bug to report see:
   http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: xerces-c-dev-unsubscribe@xml.apache.org
For additional commands, e-mail: xerces-c-dev-help@xml.apache.org

[jira] Updated: (XERCESC-1284) Set "UTF-16" encoding for UTF16-BE entity with BOM results in parse failure

Posted by xe...@xml.apache.org.

The following issue has been updated:

    Updater: Daniel McLean (mailto:daniel@danielmclean.id.au)
       Date: Thu, 7 Oct 2004 9:31 PM
    Comment:
The same problem appears to occur when explicitly setting an encoding
of "UTF-8" when a BOM occurs at the start of the UTF-8 XML entity
(which is allowed).
I've attached a testcase which demonstrates this problem:
   - no encoding set, UTF-8 with BOM: succeeds
   - no encoding set, UTF-8 without BOM: succeeds
   - "UTF-8" encoding set, UTF-8 with BOM: fails
   - "UTF-8" encoding set, UTF-8 without BOM: succeeds

Again, it appears Xerces is probing the document, recognising
the encoding as UTF-8 and using the 'UTF-8' transcoder, but
the BOM is not trimmed nor is it recognised as a valid character
at line 1, character 1:
   Fatal Error at (file test, line 1, char 1):
      Invalid document structure
    Changes:
             Attachment changed to utf8BOMTest.tar.gz
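
For the UTF-8 case in the comment above, one caller-side option is to trim the BOM before the buffer reaches the parser, given that the report shows "UTF-8" with no BOM succeeding. The sketch below is a minimal illustration under that assumption. The function name utf8BomLength is hypothetical and is not part of the Xerces-C++ API.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of Xerces-C++.
// Returns the number of leading bytes occupied by a UTF-8 byte order
// mark (EF BB BF): 3 when the BOM is present, 0 otherwise.
inline std::size_t utf8BomLength(const unsigned char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    const bool hasBom = size >= 3 &&
                        data[0] == 0xEF &&
                        data[1] == 0xBB &&
                        data[2] == 0xBF;
    return hasBom ? 3 : 0;
}
```

With n = utf8BomLength(data, size), the caller would hand data + n and size - n to the in-memory input source instead of the original buffer. Whether this avoids the failure in a given Xerces release would need to be confirmed with the attached test case.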
    ---------------------------------------------------------------------
For a full history of the issue, see:

  http://issues.apache.org/jira/browse/XERCESC-1284?page=history

[jira] Updated: (XERCESC-1284) Set "UTF-16" encoding for UTF16-BE entity with BOM results in parse failure

Posted by xe...@xml.apache.org.

The following issue has been updated:

    Updater: Daniel McLean (mailto:daniel@danielmclean.id.au)
       Date: Mon, 4 Oct 2004 11:56 PM
    Comment:
Here is the testcase demonstrating the problems described.
    Changes:
             Attachment changed to MemParseEncoding.tar.gz
    ---------------------------------------------------------------------
For a full history of the issue, see:

  http://issues.apache.org/jira/browse/XERCESC-1284?page=history
