Posted to dev@xalan.apache.org by "Richard Evans (JIRA)" <xa...@xml.apache.org> on 2008/09/01 11:45:44 UTC
[jira] Commented: (XALANJ-2419) Astral characters written as a pair
of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627425#action_12627425 ]
Richard Evans commented on XALANJ-2419:
---------------------------------------
This is a serious problem because Xalan fails to encode any of the Unicode supplementary characters.
The version shipped with Java 1.6 does not have this problem. However, if I need to run with Java 1.5 (on systems that do not support 1.6), there is no way to encode supplementary characters: the built-in 1.5 code does not handle them, and plugging in a different Xalan does not help.
> Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
> ---------------------------------------------------------------------------------------------
>
> Key: XALANJ-2419
> URL: https://issues.apache.org/jira/browse/XALANJ-2419
> Project: XalanJ2
> Issue Type: Bug
> Components: Serialization
> Affects Versions: 2.7.1
> Reporter: Henri Sivonen
>
> org.apache.xml.serializer.ToStream contains the following code:
>     else if (m_encodingInfo.isInEncoding(ch)) {
>         // If the character is in the encoding, and
>         // not in the normal ASCII range, we also
>         // just leave it get added on to the clean characters
>
>     }
>     else {
>         // This is a fallback plan, we should never get here
>         // but if the character wasn't previously handled
>         // (i.e. isn't in the encoding, etc.) then what
>         // should we do? We choose to write out an entity
>         writeOutCleanChars(chars, i, lastDirtyCharProcessed);
>         writer.write("&#");
>         writer.write(Integer.toString(ch));
>         writer.write(';');
>         lastDirtyCharProcessed = i;
>     }
> This causes the wrong (latter) branch to run for surrogates, because isInEncoding() for UTF-8 returns false for surrogates. It is always wrong (regardless of encoding) to escape a surrogate as an NCR, because surrogate code points are not legal XML characters.
> The practical effect of this bug is that any document with astral characters in it is serialized in an ill-formed way and cannot be parsed back by an XML parser.
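
For illustration only, here is a minimal sketch of surrogate-aware escaping. It is not the Xalan code or a proposed patch: the class SurrogateAwareEscaper and its escape method are hypothetical names, and java.nio's CharsetEncoder.canEncode stands in for m_encodingInfo.isInEncoding(). The idea is to combine a high/low surrogate pair into one code point before deciding what to write, so that an astral character is either passed through as the pair or written as a single NCR, never as two NCRs. Markup escaping of characters such as < and & is deliberately left out.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper, not part of Xalan.
public class SurrogateAwareEscaper {

    /**
     * Returns text with every character the charset cannot represent
     * replaced by one numeric character reference per code point.
     */
    public static String escape(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char ch = text.charAt(i);
            boolean hasNext = i + 1 < text.length();
            if (Character.isHighSurrogate(ch) && hasNext
                    && Character.isLowSurrogate(text.charAt(i + 1))) {
                // Treat the pair as a single supplementary code point.
                String pair = text.substring(i, i + 2);
                if (encoder.canEncode(pair)) {
                    out.append(pair);
                } else {
                    int codePoint = Character.toCodePoint(ch, text.charAt(i + 1));
                    out.append("&#").append(codePoint).append(';');
                }
                i += 2;
            } else if (Character.isHighSurrogate(ch) || Character.isLowSurrogate(ch)) {
                // A lone surrogate can be neither encoded nor escaped as an NCR.
                throw new IllegalArgumentException("Unpaired surrogate at index " + i);
            } else {
                if (encoder.canEncode(ch)) {
                    out.append(ch);
                } else {
                    out.append("&#").append((int) ch).append(';');
                }
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String astral = "a\uD83D\uDE00b"; // U+1F600 between two ASCII letters
        System.out.println(escape(astral, Charset.forName("UTF-8")));
        System.out.println(escape(astral, Charset.forName("US-ASCII")));
    }
}
```

With UTF-8 the pair is written through unchanged; with US-ASCII it becomes the single reference &#128512; rather than a pair of references to the surrogate values 55357 and 56832. The sketch uses only APIs available since Java 1.5.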