Posted to dev@xalan.apache.org by "Peter De Maeyer (JIRA)" <ji...@apache.org> on 2018/09/12 20:19:00 UTC
[jira] [Comment Edited] (XALANJ-2617) Serializer produces separately escaped surrogate pair instead of codepoint
[ https://issues.apache.org/jira/browse/XALANJ-2617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612680#comment-16612680 ]
Peter De Maeyer edited comment on XALANJ-2617 at 9/12/18 8:18 PM:
------------------------------------------------------------------
It can be proven with a unit test that Daniel's fix breaks some scenarios that used to work. As I suspected, the "if" has to be an "else if". I've attached my own new patch + unit tests.
Note that the patch spans 2 repositories: the fix is relative to [http://svn.apache.org/repos/asf/xalan/java/trunk], the unit test is relative to [http://svn.apache.org/repos/asf/xalan/test/trunk].
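To make the "else if" remark easier to follow, here is a minimal sketch of the escaping decision that the tests in this comment exercise. It is neither the Xalan serializer code nor the attached patch: the class `EscapeSketch`, the method `escape` and the `asciiOnly` flag are hypothetical names for illustration, markup characters such as `<` and `&` are ignored, and Java 5+ `Character` helpers are used for brevity. One plausible reading of the remark is that the surrogate-pair branch and the "not encodable" branch must be mutually exclusive, so that every character is escaped exactly once.

```java
public class EscapeSketch {
    /**
     * Hypothetical helper: escapes text for an XML text node.
     * asciiOnly = true models a US-ASCII output encoding,
     * asciiOnly = false models UTF-8.
     */
    static String escape(String text, boolean asciiOnly) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < text.length()
                    && Character.isLowSurrogate(text.charAt(i + 1))) {
                // Surrogate pair: emit ONE reference for the combined code point.
                int codePoint = Character.toCodePoint(c, text.charAt(i + 1));
                out.append("&#").append(codePoint).append(';');
                i++; // the low surrogate has been consumed as well
            } else if (asciiOnly && c > 0x7F) {
                // Single char that US-ASCII cannot represent: escape it once.
                out.append("&#").append((int) c).append(';');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("\uD840\uDC0B", false));                // &#131083;
        System.out.println(escape(String.valueOf((char) 0x2028), true)); // &#8232;
        System.out.println(escape("cost \u20AC", false));                // unchanged
    }
}
```

The three calls in `main` mirror the three test cases: a surrogate pair in UTF-8, U+2028 in US-ASCII, and a euro sign in UTF-8.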
Here is the essence of the test code:
{code:java}
/**
 * This test case illustrates the original problem with characters outside the BMP,
 * which Java represents as a surrogate pair.
 * This is broken in Xalan 2.7.2, hence the need for a fix.
 */
public void serializationOfHighSurrogateCharactersInUtf8() throws Throwable {
    reporter.testCaseInit("serializationOfHighSurrogateCharactersInUtf8");
    try {
        String value = "\uD840\uDC0B";
        serializationOf(value, "&#" + toCodePoint(value.charAt(0), value.charAt(1)) + ";", "UTF-8");
    } finally {
        reporter.testCaseClose();
    }
}

/**
 * This is a sanity test case illustrating some US-ASCII characters and some non-ASCII characters
 * that need no surrogate pair.
 * It works in Xalan 2.7.2 and with any of the patches, it's just a basic sanity check.
 */
public void serializationOfLowSurrogateCharactersInUtf8() throws Throwable {
    reporter.testCaseInit("serializationOfLowSurrogateCharactersInUtf8");
    try {
        serializationOf("This is gonna cost ya some €€€", "This is gonna cost ya some €€€", "UTF-8");
    } finally {
        reporter.testCaseClose();
    }
}

/**
 * This test case illustrates a use case which works in Xalan 2.7.2 but which got <i>broken</i> by Daniel's patch.
 */
public void serializationOfLineSeparatorInAscii() throws Throwable {
    reporter.testCaseInit("serializationOfLineSeparatorInAscii");
    try {
        // U+2028 (LINE SEPARATOR) is not encodable in US-ASCII, so a decimal character reference is expected.
        serializationOf(String.valueOf((char) 0x2028), "&#8232;", "US-ASCII");
    } finally {
        reporter.testCaseClose();
    }
}

private void serializationOf(String value, String expectedXmlValue, String encoding) throws ParserConfigurationException, TransformerException, IOException, SAXException {
    System.out.println("Expected value: " + value);
    String expected = "<?xml version=\"1.0\" encoding=\"" + encoding + "\"?><a>" + expectedXmlValue + "</a>";
    System.out.println(" Expected XML: " + expected);

    // Build a one-element DOM whose text content is the value under test.
    StringWriter writer = new StringWriter();
    final DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    Document dom = documentBuilder.newDocument();
    final Element rootEl = dom.createElement("a");
    rootEl.setTextContent(value);
    dom.appendChild(rootEl);

    // Serialize it with the requested output encoding.
    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    transformer.setOutputProperty(javax.xml.transform.OutputKeys.ENCODING, encoding);
    transformer.transform(new DOMSource(dom), new javax.xml.transform.stream.StreamResult(writer));
    String actual = writer.toString();
    System.out.println(" Actual XML: " + actual);

    // Parse the output again to show that it round-trips.
    InputSource inputSource = new InputSource();
    inputSource.setCharacterStream(new StringReader(actual));
    System.out.println(" Actual value: " + documentBuilder.parse(inputSource).getDocumentElement().getTextContent());

    // Character.LINE_SEPARATOR is a byte constant (a Unicode category), not a newline string,
    // so the platform line separator is used for the failure message instead.
    String lineSeparator = System.getProperty("line.separator");
    reporter.check(actual, expected, actual + lineSeparator + " must be equal to " + lineSeparator + expected);
}

/**
 * This is a duplicate of {@link org.apache.xml.serializer.Encodings#toCodePoint(char, char)}.
 * We can't use that method because it's package-private.
 * We can't use {@link String#codePointAt(int)} either because it's @Since Java 1.5 and this codebase needs to be Java 1.3 compliant.
 */
static int toCodePoint(char highSurrogate, char lowSurrogate) {
    int codePoint =
        ((highSurrogate - 0xd800) << 10)
        + (lowSurrogate - 0xdc00)
        + 0x10000;
    return codePoint;
}
{code}
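As a worked example of the arithmetic in the `toCodePoint` helper above: for the pair "\uD840\uDC0B", (0xD840 - 0xD800) << 10 is 0x10000, 0xDC0B - 0xDC00 is 0xB, and adding 0x10000 gives 0x2000B, i.e. 131083 in decimal. The sketch below cross-checks the same formula against `Character.toCodePoint`, which is only available on a Java 5+ runtime and therefore not an option for the Java 1.3 codebase itself. The class name `CodePointCheck` is just for illustration.

```java
public class CodePointCheck {
    // Same formula as the test helper, reproduced so that this sketch is self-contained.
    static int toCodePoint(char highSurrogate, char lowSurrogate) {
        return ((highSurrogate - 0xd800) << 10)
                + (lowSurrogate - 0xdc00)
                + 0x10000;
    }

    public static void main(String[] args) {
        char high = '\uD840';
        char low = '\uDC0B';
        int manual = toCodePoint(high, low);
        int builtIn = Character.toCodePoint(high, low); // Java 5+ only
        System.out.println(manual + " == " + builtIn
                + " (U+" + Integer.toHexString(manual).toUpperCase() + ")");
    }
}
```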
> Serializer produces separately escaped surrogate pair instead of codepoint
> --------------------------------------------------------------------------
>
> Key: XALANJ-2617
> URL: https://issues.apache.org/jira/browse/XALANJ-2617
> Project: XalanJ2
> Issue Type: Bug
> Security Level: No security risk; visible to anyone(Ordinary problems in Xalan projects. Anybody can view the issue.)
> Components: Serialization, Xalan
> Affects Versions: 2.7.1, 2.7.2
> Reporter: Daniel Kec
> Assignee: Steven J. Hathaway
> Priority: Major
> Attachments: JI9053942.java, XALANJ-2617_Fix_missing_surrogate_pairs_support.patch, XALANJ-2617_Fix_missing_surrogate_pairs_support_new.patch
>
>
> When trying to serialize XML containing a character that is encoded as a Unicode surrogate pair ("\uD840\uDC0B"), I have tried several and none worked. The XML Transformer creates an XML string in which each half of the surrogate pair is escaped separately, which makes the XML unparseable, e.g.: SAXParseException; Character reference "&#55360" is an invalid XML character. It looks like a bug introduced in the XALANJ-2271 fix.
>
> {code:java|title=Output of Xalan ver. 2.7.2}
> kec@phoebe:~/Downloads$ java -version
> java version "1.8.0_171"
> Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
> Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
> kec@phoebe:~/Downloads$ java -cp /home/kec/.m2/repository/xml-apis/xml-apis/1.4.01/xml-apis-1.4.01.jar:/home/kec/.m2/repository/xalan/xalan/2.7.2/xalan-2.7.2.jar:/home/kec/.m2/repository/xalan/serializer/2.7.2/serializer-2.7.2.jar:. JI9053942
> Character: 𠀋
> EXPECTED: <?xml version="1.0" encoding="UTF-8"?><a>&#131083;</a>
> ACTUAL: <?xml version="1.0" encoding="UTF-8"?><a>&#55360;&#56331;</a>
> [Fatal Error] :1:50: Character reference "&#
> {code}
> {code:java|title=But Xalan ver. 2.7.0 works OK}
> kec@phoebe:~/Downloads$ java -cp /home/kec/.m2/repository/xml-apis/xml-apis/1.4.01/xml-apis-1.4.01.jar:/home/kec/.m2/repository/xalan/xalan/2.7.0/xalan-2.7.0.jar:/home/kec/.m2/repository/xalan/serializer/2.7.0/serializer-2.7.0.jar:. JI9053942
> Character: 𠀋
> EXPECTED: <?xml version="1.0" encoding="UTF-8"?><a>&#131083;</a>
> ACTUAL: <?xml version="1.0" encoding="UTF-8"?><a>𠀋</a>
> ACTUAL PARSED CHAR 𠀋
> {code}
> {code:java|title=Test}
> String value = "\uD840\uDC0B";
> System.out.println("Character: " + value);
> System.out.println("EXPECTED: <?xml version=\"1.0\" encoding=\"UTF-8\"?><a>&#" + value.codePointAt(0) + ";</a>");
> StringWriter writer = new StringWriter();
> final DocumentBuilder documentBuilder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
> Document dom = documentBuilder.newDocument();
> final Element rootEl = dom.createElement("a");
> rootEl.setTextContent(value);
> dom.appendChild(rootEl);
> Transformer transformer = TransformerFactory.newInstance().newTransformer();
> transformer.transform(new DOMSource(dom), new javax.xml.transform.stream.StreamResult(writer));
> String xml = writer.toString();
> System.out.println(" ACTUAL: " + xml);
> InputSource inputSource = new InputSource();
> inputSource.setCharacterStream(new StringReader(xml));
> System.out.println("ACTUAL PARSED CHAR " + documentBuilder.parse(inputSource).getDocumentElement().getTextContent());
> {code}
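
The parse failure described in the issue can be reproduced without Xalan. The sketch below feeds two hand-written documents to the JDK's default parser: one with a single reference for the whole code point (131083), and one with each surrogate escaped on its own (55360 and 56331, the decimal values of 0xD840 and 0xDC0B). Surrogate code units are not legal XML characters, so the second document is expected to be rejected. The class `SurrogateReferenceCheck` and the helper `parseTextOrNull` are illustrative names, not part of the attached test.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class SurrogateReferenceCheck {
    /** Returns the root element's text content, or null if the XML is rejected by the parser. */
    static String parseTextOrNull(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement().getTextContent();
        } catch (SAXException e) {
            return null; // not well-formed
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // One reference for the code point U+2000B: well-formed.
        String good = parseTextOrNull("<a>&#131083;</a>");
        System.out.println("single reference parsed: " + (good != null));

        // Each surrogate escaped separately: rejected by the parser.
        String bad = parseTextOrNull("<a>&#55360;&#56331;</a>");
        System.out.println("separate references parsed: " + (bad != null));
    }
}
```

With the default error handler, the parser may also print a "[Fatal Error]" line to standard error for the second document.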