Posted to commits@santuario.apache.org by "TH Heung (JIRA)" <ji...@apache.org> on 2014/06/13 05:05:01 UTC

[jira] [Commented] (SANTUARIO-307) utf8 encode is broken

    [ https://issues.apache.org/jira/browse/SANTUARIO-307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030209#comment-14030209 ] 

TH Heung commented on SANTUARIO-307:
------------------------------------

May I know the status of this issue?  Is someone reviewing the code I uploaded?  Will the update be included in the next release?

> utf8 encode is broken
> ---------------------
>
>                 Key: SANTUARIO-307
>                 URL: https://issues.apache.org/jira/browse/SANTUARIO-307
>             Project: Santuario
>          Issue Type: Bug
>      Security Level: Public(Public issues, viewable by everyone) 
>          Components: Java
>    Affects Versions: Java 1.4.4
>         Environment: Ubuntu 10.04 LTS 32 bit
> JRE 5.x
>            Reporter: steve
>            Assignee: Colm O hEigeartaigh
>              Labels: Surrogates, UtfHelpper, utf8
>         Attachments: CanonicalizerBase.java, CanonicalizerBase.java, UtfHelperTest.java, UtfHelpper.java, UtfHelpper.java
>
>
> This code:
> if ((c >= 0xD800 && c <= 0xDBFF) || (c >= 0xDC00 && c <= 0xDFFF) ){
>   //No Surrogates in sun java
>   out.write(0x3f);
>   return;
> }
> from UtfHelpper.writeCharToUtf8 (and similar code in other UtfHelpper methods) replaces every char in these 3 Unicode blocks with ? (0x3f):
> http://www.fileformat.info/info/unicode/block/high_surrogates/index.htm
> http://www.fileformat.info/info/unicode/block/high_private_use_surrogates/index.htm
> http://www.fileformat.info/info/unicode/block/low_surrogates/index.htm
> The problem is that characters outside the Basic Multilingual Plane (above U+FFFF) are encoded in UTF-16 as a surrogate pair, so both of their code units fall in that range. For example, the character U+2000B encoded as UTF-16 is 0xD840 0xDC0B.
> http://www.fileformat.info/info/unicode/char/2000b/index.htm
> As a result, each such character is written to the output as ? instead of its UTF-8 encoding.
> Here is a code sample:
> Canonicalizer c14n = Canonicalizer.getInstance(Canonicalizer.ALGO_ID_C14N_EXCL_OMIT_COMMENTS);
> String doc1 = "<a>\u4E1F</a>";
> String doc2 = "<a>\uD840\uDC0B</a>";
> System.out.println("doc1 before:" + doc1);
> byte [] output = c14n.canonicalize(doc1.getBytes("UTF8"));
> System.out.println("doc1 after:" + new String(output, "UTF8"));
> System.out.println("doc2 before:" + doc2);
> output = c14n.canonicalize(doc2.getBytes("UTF8"));
> System.out.println("doc2 after:" + new String(output, "UTF8"));
> The output is:
> doc1 before:<a>丟</a>
> doc1 after:<a>丟</a>
> doc2 before:<a>𠀋</a>
> doc2 after:<a>??</a>
> Notice that "doc2 after" is corrupted: it shows <a>??</a> instead of <a>𠀋</a>.
> Based on the code there does not seem to be any workaround.
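
For reference, the behaviour the report asks for can be sketched as below. This is only an illustration, not the attached UtfHelpper.java patch and not Santuario code: the class name SurrogateUtf8Sketch and the method encodeUtf8 are hypothetical. It assumes a high surrogate followed by a low surrogate should be combined into one code point and written as a 4-byte UTF-8 sequence, and that only an unpaired surrogate should fall back to ? (0x3f).

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch, not part of Santuario: UTF-8 encoding that
// combines UTF-16 surrogate pairs instead of replacing them with '?'.
class SurrogateUtf8Sketch {

    static byte[] encodeUtf8(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            int cp = c;
            if (Character.isHighSurrogate(c) && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                // Valid pair: merge both code units into one code point.
                cp = Character.toCodePoint(c, s.charAt(++i));
            } else if (c >= 0xD800 && c <= 0xDFFF) {
                // Unpaired surrogate: cannot be encoded, write '?'.
                out.write(0x3f);
                continue;
            }
            if (cp < 0x80) {                       // 1 byte
                out.write(cp);
            } else if (cp < 0x800) {               // 2 bytes
                out.write(0xC0 | (cp >> 6));
                out.write(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {             // 3 bytes
                out.write(0xE0 | (cp >> 12));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            } else {                               // 4 bytes (supplementary)
                out.write(0xF0 | (cp >> 18));
                out.write(0x80 | ((cp >> 12) & 0x3F));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // U+2000B, the character from the report, as a surrogate pair.
        byte[] bytes = encodeUtf8("\uD840\uDC0B");
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02x ", b & 0xff));
        }
        System.out.println(hex.toString().trim()); // f0 a0 80 8b
    }
}
```

With this approach the doc2 string from the sample above would keep its U+2000B character, while a lone surrogate would still come out as a single ?.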



--
This message was sent by Atlassian JIRA
(v6.2#6252)