Posted to dev@xalan.apache.org by "Max (Jira)" <ji...@apache.org> on 2022/02/17 14:51:00 UTC
[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
[ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493984#comment-17493984 ]
Max commented on XALANJ-2419:
-----------------------------
[~jharrop] thanks for pointing to your work here. Saved me time. Would be nice to have it merged in.
> Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
> ---------------------------------------------------------------------------------------------
>
> Key: XALANJ-2419
> URL: https://issues.apache.org/jira/browse/XALANJ-2419
> Project: XalanJ2
> Issue Type: Bug
> Components: Serialization
> Affects Versions: 2.7.1
> Reporter: Henri Sivonen
> Priority: Major
> Attachments: XALANJ-2419-fix-v3.txt, XALANJ-2419-tests-v3.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
>     else if (m_encodingInfo.isInEncoding(ch)) {
>         // If the character is in the encoding, and
>         // not in the normal ASCII range, we also
>         // just leave it get added on to the clean characters
>
>     }
>     else {
>         // This is a fallback plan, we should never get here
>         // but if the character wasn't previously handled
>         // (i.e. isn't in the encoding, etc.) then what
>         // should we do? We choose to write out an entity
>         writeOutCleanChars(chars, i, lastDirtyCharProcessed);
>         writer.write("&#");
>         writer.write(Integer.toString(ch));
>         writer.write(';');
>         lastDirtyCharProcessed = i;
>     }
> As a result, the wrong (latter) branch runs for surrogates, because isInEncoding() for UTF-8 returns false for surrogates. Escaping a surrogate as an NCR is always wrong, regardless of encoding: a character reference must name a legal XML character, and the surrogate code points U+D800 to U+DFFF are not legal XML characters.
> The practical effect of this bug is that any document containing astral characters (those outside the Basic Multilingual Plane) is serialized as ill-formed XML that an XML parser cannot read back.
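
For illustration only, here is a minimal Java sketch of the behaviour the report asks for when a character does have to be escaped: a surrogate pair is combined into one code point and written as a single NCR. This is not the attached patch and not Xalan code; the class name SurrogateNcrSketch and the method escapeNonAscii are hypothetical, and the sketch assumes every non-ASCII character is escaped and that an unpaired surrogate should be rejected. With the code quoted above, U+1F600 (UTF-16 D83D DE00) would come out as &#55357;&#56832;, whereas the expected form is &#128512;.

```java
// Hypothetical helper for illustration; not part of org.apache.xml.serializer.
final class SurrogateNcrSketch {

    /**
     * Returns the text with every non-ASCII character replaced by a decimal
     * numeric character reference. A surrogate pair yields ONE reference to
     * the combined code point, never two references to the surrogates.
     * Markup characters such as '&' and '<' are deliberately left alone to
     * keep the sketch short.
     */
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (ch < 0x80) {
                out.append(ch); // ASCII passes through unchanged
                continue;
            }
            int codePoint = ch;
            if (Character.isHighSurrogate(ch)
                    && i + 1 < text.length()
                    && Character.isLowSurrogate(text.charAt(i + 1))) {
                // Combine the pair into the astral code point and skip
                // the low surrogate on the next iteration.
                codePoint = Character.toCodePoint(ch, text.charAt(i + 1));
                i++;
            } else if (Character.isSurrogate(ch)) {
                // Assumption: an unpaired surrogate is an error, since it
                // cannot be represented in well-formed XML at all.
                throw new IllegalArgumentException(
                        "Unpaired surrogate at index " + i);
            }
            out.append("&#").append(codePoint).append(';');
        }
        return out.toString();
    }
}
```

Under these assumptions, SurrogateNcrSketch.escapeNonAscii applied to the string "a", U+1F600, "b" returns a&#128512;b. For UTF-8 output the astral character could equally be written unescaped, since UTF-8 can encode every Unicode scalar value.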
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@xalan.apache.org
For additional commands, e-mail: dev-help@xalan.apache.org
Re: [jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
Posted by Joseph Kesselman <ke...@alum.mit.edu>.
Good catch!
--
Sent from palmtop; apologies for any auto-incorrections.
() | Text Mail Campaign
/\ | HTML mail is _evil_!