You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@xalan.apache.org by "Peter De Maeyer (JIRA)" <ji...@apache.org> on 2018/08/30 20:17:00 UTC

[jira] [Commented] (XALANJ-2617) Serializer produces separately escaped surrogate pair instead of codepoint

    [ https://issues.apache.org/jira/browse/XALANJ-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16597890#comment-16597890 ] 

Peter De Maeyer commented on XALANJ-2617:
-----------------------------------------

[~danielkec], I reviewed your patch, but I wonder if the 'if' statement you added shouldn't have been an 'if else' statement. The reason I wonder is that the 'else if' and 'else' branches that immediately follow your changes now get a different meaning. Especially the 'else' branch that has a code comment "// This is a fallback plan, we should never get here" unsettles me. I can imagine that some _new_ scenario is indeed fixed, but I'm worried that some _existing_ scenarios (which were supposed to trigger this "fallback plan") might be broken. The fact that I did not find any unit tests to illustrate those scenarios does not help take away my concern either. I'm not familiar with the Xalan codebase (I am very familiar with Java code in general though), so maybe I misunderstand a couple of things, it would really appreciate it if you could reassure me this patch is indeed the right fix.

Then as a cosmetic side note, I noticed that your patch does not respect the code style of the surrounding code. '{' should not be on a new line, and there are 2 instances of a ',' that aren't followed by a space as they should. Call me pedantic, but I'm just being thorough here.

> Serializer produces separately escaped surrogate pair instead of codepoint
> --------------------------------------------------------------------------
>
>                 Key: XALANJ-2617
>                 URL: https://issues.apache.org/jira/browse/XALANJ-2617
>             Project: XalanJ2
>          Issue Type: Bug
>      Security Level: No security risk; visible to anyone(Ordinary problems in Xalan projects.  Anybody can view the issue.) 
>          Components: Serialization, Xalan
>    Affects Versions: 2.7.1, 2.7.2
>            Reporter: Daniel Kec
>            Assignee: Steven J. Hathaway
>            Priority: Major
>         Attachments: JI9053942.java, XALANJ-2617_Fix_missing_surrogate_pairs_support.patch
>
>
> When trying to serialize XML with char consisting of unicode surogate char "\uD840\uDC0B" I have tried several and non worked. XML Transformer creates XML string with escaped surogate pair separately, which makes XML unparseable. eg.: SAXParseException; Character reference "&#55360" is an invalid XML character. It looks like a bug introduced in the XALANJ-2271 fix.
>  
> {code:java|title=Output of Xalan ver. 2.7.2}
> kec@phoebe:~/Downloads$ java -version
> java version "1.8.0_171"
> Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
> Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
> kec@phoebe:~/Downloads$ java -cp /home/kec/.m2/repository/xml-apis/xml-apis/1.4.01/xml-apis-1.4.01.jar:/home/kec/.m2/repository/xalan/xalan/2.7.2/xalan-2.7.2.jar:/home/kec/.m2/repository/xalan/serializer/2.7.2/serializer-2.7.2.jar:. JI9053942
> Character: 𠀋
> EXPECTED: <?xml version="1.0" encoding="UTF-8"?><a>&#131083;</a>
>  ACTUAL: <?xml version="1.0" encoding="UTF-8"?><a>&#55360;&#56331;</a>
> [Fatal Error] :1:50: Character reference "&#
> {code}
> {code:java|title=But Xalan ver. 2.7.0 works OK}
> kec@phoebe:~/Downloads$ java -cp /home/kec/.m2/repository/xml-apis/xml-apis/1.4.01/xml-apis-1.4.01.jar:/home/kec/.m2/repository/xalan/xalan/2.7.0/xalan-2.7.0.jar:/home/kec/.m2/repository/xalan/serializer/2.7.0/serializer-2.7.0.jar:. JI9053942
> Character: 𠀋
> EXPECTED: <?xml version="1.0" encoding="UTF-8"?><a>&#131083;</a>
>  ACTUAL: <?xml version="1.0" encoding="UTF-8"?><a>&#131083;</a>
> ACTUAL PARSED CHAR 𠀋
> {code}
> {code:java|title=Test}
> String value = "\uD840\uDC0B"; 
> System.out.println("Character: " + value); 
> System.out.println("EXPECTED: <?xml version=\"1.0\" encoding=\"UTF-8\"?><a>&#" + value.codePointAt(0) + ";</a>"); 
> StringWriter writer = new StringWriter(); 
> final DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder(); 
> Document dom = documentBuilder.newDocument(); 
> final Element rootEl = dom.createElement("a"); 
> rootEl.setTextContent(value); 
> dom.appendChild(rootEl); 
> Transformer transformer = TransformerFactory.newInstance().newTransformer(); 
> transformer.transform(new DOMSource(dom), new javax.xml.transform.stream.StreamResult(writer)); 
> String xml = writer.toString(); 
> System.out.println(" ACTUAL: " + xml); 
> InputSource inputSource = new InputSource(); 
> inputSource.setCharacterStream(new StringReader(xml)); 
> System.out.println("ACTUAL PARSED CHAR " + documentBuilder.parse(inputSource).getDocumentElement().getTextContent()); 
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@xalan.apache.org
For additional commands, e-mail: dev-help@xalan.apache.org