Posted to j-dev@xerces.apache.org by "Radu Coravu (JIRA)" <xe...@xml.apache.org> on 2012/07/19 09:49:33 UTC

[jira] [Created] (XERCESJ-1574) Problem with detected encoding for UTF-16 encoded as Unicode Little

Radu Coravu created XERCESJ-1574:
------------------------------------

             Summary: Problem with detected encoding for UTF-16 encoded as Unicode Little
                 Key: XERCESJ-1574
                 URL: https://issues.apache.org/jira/browse/XERCESJ-1574
             Project: Xerces2-J
          Issue Type: Bug
          Components: DOM (Level 3 Core)
    Affects Versions: 2.11.0
            Reporter: Radu Coravu


I have the following test case:

    ByteArrayInputStream bis = new ByteArrayInputStream(
          "<?xml version=\"1.0\" encoding=\"UTF-16\"?> <a/>".getBytes("UnicodeLittle"));
    InputSource is = new InputSource(bis);
    DOMParser dp = new DOMParser();
    dp.parse(is);
    assertEquals("UTF-16LE", dp.getDocument().getInputEncoding());

The input stream is encoded as "UnicodeLittle", so "dp.getDocument().getInputEncoding()" should return "UTF-16LE" (as it did in the previous Xerces version). Right now it returns "UTF-16" regardless of the byte order mark in the input stream.

So a developer relying on the value returned by "dp.getDocument().getInputEncoding()" does not know how to save the document in order to preserve the same BOM.
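
To illustrate why the generic label is not enough when saving, here is a minimal sketch that is independent of Xerces. The class and method names (Utf16SaveSketch, encodeWithBom) are hypothetical and exist only for this illustration. It assumes the caller wants a BOM written for the two byte-order-specific names, and it relies on the documented behaviour of Java's "UTF-16" charset, which encodes big-endian with a big-endian BOM.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Xerces: encodes text according to the
// encoding name reported by the parser.
final class Utf16SaveSketch {

    static byte[] encodeWithBom(String text, String reportedEncoding) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] body;
        if ("UTF-16LE".equals(reportedEncoding)) {
            // Little-endian BOM (FF FE), then little-endian code units.
            out.write(0xFF);
            out.write(0xFE);
            body = text.getBytes(StandardCharsets.UTF_16LE);
        } else if ("UTF-16BE".equals(reportedEncoding)) {
            // Big-endian BOM (FE FF), then big-endian code units.
            out.write(0xFE);
            out.write(0xFF);
            body = text.getBytes(StandardCharsets.UTF_16BE);
        } else {
            // Any other name is handed to Java's charset lookup unchanged.
            // For "UTF-16" Java itself writes a big-endian BOM.
            body = text.getBytes(Charset.forName(reportedEncoding));
        }
        out.write(body, 0, body.length);
        return out.toByteArray();
    }

    // Formats the first two bytes as hex so the BOM is visible.
    static String hexPrefix(byte[] bytes) {
        return String.format("%02X %02X", bytes[0] & 0xFF, bytes[1] & 0xFF);
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-16\"?> <a/>";
        System.out.println(hexPrefix(encodeWithBom(xml, "UTF-16LE"))); // FF FE
        System.out.println(hexPrefix(encodeWithBom(xml, "UTF-16")));   // FE FF
    }
}
```

With only "UTF-16" to go on, the saved bytes start with FE FF, so a document that was read as little-endian is written back in the opposite byte order.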

This problem is related to the encoding detection changes that were made in XMLEntityManager.

As a proposed modification, in the method:

org.apache.xerces.impl.XMLEntityManager.setupCurrentEntity(String, XMLInputSource, boolean, boolean)

before the code:

fCurrentEntity = new ScannedEntity(name,....

we could add the following code:

        // Refine the generic "UTF-16" label when the byte order is known;
        // a null isBigEndian leaves the encoding unchanged.
        if ("UTF-16".equals(encoding) && isBigEndian != null) {
            encoding = isBigEndian ? "UTF-16BE" : "UTF-16LE";
        }
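
The same idea can be tried outside the parser. The following is a standalone sketch under stated assumptions, not the Xerces implementation: the class Utf16LabelSketch and its methods byteOrderFromBom and refineEncoding are hypothetical names, and the BOM check only looks at the first two bytes of the stream.

```java
// Hypothetical standalone sketch, not Xerces code.
final class Utf16LabelSketch {

    // Returns TRUE for a big-endian BOM (FE FF), FALSE for a little-endian
    // BOM (FF FE), and null when the first two bytes are not a UTF-16 BOM.
    static Boolean byteOrderFromBom(byte[] head) {
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF;
            int b1 = head[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                return Boolean.TRUE;
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                return Boolean.FALSE;
            }
        }
        return null;
    }

    // Replaces the generic "UTF-16" label with a byte-order-specific one
    // when the byte order is known; every other input is returned as-is.
    static String refineEncoding(String encoding, Boolean isBigEndian) {
        if ("UTF-16".equals(encoding) && isBigEndian != null) {
            return isBigEndian.booleanValue() ? "UTF-16BE" : "UTF-16LE";
        }
        return encoding;
    }

    public static void main(String[] args) {
        byte[] littleEndianHead = {(byte) 0xFF, (byte) 0xFE, '<', 0};
        Boolean order = byteOrderFromBom(littleEndianHead);
        System.out.println(refineEncoding("UTF-16", order)); // UTF-16LE
    }
}
```

Returning the input unchanged when isBigEndian is null keeps the generic "UTF-16" label whenever the byte order could not be determined.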

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: j-dev-unsubscribe@xerces.apache.org
For additional commands, e-mail: j-dev-help@xerces.apache.org

[jira] [Assigned] (XERCESJ-1574) Problem with detected encoding for UTF-16 encoded as Unicode Little

Posted by "Michael Glavassevich (JIRA)" <xe...@xml.apache.org>.

     [ https://issues.apache.org/jira/browse/XERCESJ-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Glavassevich reassigned XERCESJ-1574:
---------------------------------------------

    Assignee: Michael Glavassevich
    
> Problem with detected encoding for UTF-16 encoded as Unicode Little
> -------------------------------------------------------------------
>
>                 Key: XERCESJ-1574
>                 URL: https://issues.apache.org/jira/browse/XERCESJ-1574
>             Project: Xerces2-J
>          Issue Type: Bug
>          Components: DOM (Level 3 Core)
>    Affects Versions: 2.11.0
>            Reporter: Radu Coravu
>            Assignee: Michael Glavassevich
>   Original Estimate: 2h
>  Remaining Estimate: 2h


[jira] [Updated] (XERCESJ-1574) Problem with detected encoding for UTF-16 encoded as Unicode Little

Posted by "Chris Simmons (JIRA)" <xe...@xml.apache.org>.

     [ https://issues.apache.org/jira/browse/XERCESJ-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Simmons updated XERCESJ-1574:
-----------------------------------

    Attachment: patch.txt

It seems that the same changes completely broke encoding detection for the "ISO-10646-UCS-2" encoding. This test passes with r578659 of XMLEntityManager (which was in 2.9.1) but was broken in the subsequent revision, r581487, leading to this failure:

org.xml.sax.SAXParseException: Content is not allowed in prolog.

This problem was also fixed in r1363647, so it probably has the same underlying cause. I haven't investigated whether any other encodings are affected.

This seemed bad enough for me to build a patched Xerces locally; perhaps a 2.11 point release is warranted?
                


[jira] [Resolved] (XERCESJ-1574) Problem with detected encoding for UTF-16 encoded as Unicode Little

Posted by "Michael Glavassevich (JIRA)" <xe...@xml.apache.org>.

     [ https://issues.apache.org/jira/browse/XERCESJ-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Glavassevich resolved XERCESJ-1574.
-------------------------------------------

    Resolution: Fixed

There was some code reorganization around Xerces 2.9.1 which would have caused this. I've restored the old behaviour. Thanks for reporting. See SVN rev 1363647 for the fix.
                


[jira] [Commented] (XERCESJ-1574) Problem with detected encoding for UTF-16 encoded as Unicode Little

Posted by "Radu Coravu (JIRA)" <xe...@xml.apache.org>.

    [ https://issues.apache.org/jira/browse/XERCESJ-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418932#comment-13418932 ] 

Radu Coravu commented on XERCESJ-1574:
--------------------------------------

Thanks Michael, the fix looks solid.
                
