Posted to dev@xalan.apache.org by "Max (Jira)" <ji...@apache.org> on 2022/02/17 14:51:00 UTC

[jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

    [ https://issues.apache.org/jira/browse/XALANJ-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493984#comment-17493984 ] 

Max commented on XALANJ-2419:
-----------------------------

[~jharrop] thanks for pointing to your work here. Saved me time. Would be nice to have it merged in.

> Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8
> ---------------------------------------------------------------------------------------------
>
>                 Key: XALANJ-2419
>                 URL: https://issues.apache.org/jira/browse/XALANJ-2419
>             Project: XalanJ2
>          Issue Type: Bug
>          Components: Serialization
>    Affects Versions: 2.7.1
>            Reporter: Henri Sivonen
>            Priority: Major
>         Attachments: XALANJ-2419-fix-v3.txt, XALANJ-2419-tests-v3.txt
>
>
> org.apache.xml.serializer.ToStream contains the following code:
>                     else if (m_encodingInfo.isInEncoding(ch)) {
>                         // If the character is in the encoding, and
>                         // not in the normal ASCII range, we also
>                         // just leave it get added on to the clean characters
>                         
>                     }
>                     else {
>                         // This is a fallback plan, we should never get here
>                         // but if the character wasn't previously handled
>                         // (i.e. isn't in the encoding, etc.) then what
>                         // should we do?  We choose to write out an entity
>                         writeOutCleanChars(chars, i, lastDirtyCharProcessed);
>                         writer.write("&#");
>                         writer.write(Integer.toString(ch));
>                         writer.write(';');
>                         lastDirtyCharProcessed = i;
>                     }
> As a result, surrogates fall through to the wrong (latter, else) branch, because isInEncoding() returns false for surrogates when the encoding is UTF-8. Escaping a surrogate as an NCR is always wrong, regardless of encoding.
> The practical effect of this bug is that any document containing astral characters is serialized as ill-formed XML, which an XML parser will not parse back.
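
To make the reported behavior concrete, here is a minimal sketch. It is not the patch attached to the issue and does not use Xalan's ToStream. The class and method names are hypothetical, and a StringBuilder stands in for the writer. The first method mirrors the quoted fallback branch, which emits one NCR per UTF-16 code unit. The second shows one possible alternative, assuming the goal is a single NCR per code point. For UTF-8 output the character could presumably also be written through unescaped.

```java
// Hypothetical illustration only; not part of Xalan.
class SurrogateEscapeSketch {

    /** Mirrors the quoted fallback branch: one NCR per UTF-16 code unit. */
    static String perCharNcr(String s) {
        StringBuilder out = new StringBuilder();
        for (char ch : s.toCharArray()) {
            out.append("&#").append(Integer.toString(ch)).append(';');
        }
        return out.toString();
    }

    /** Assumed alternative: join each surrogate pair into one code point first. */
    static String perCodePointNcr(String s) {
        StringBuilder out = new StringBuilder();
        char[] chars = s.toCharArray();
        int i = 0;
        while (i < chars.length) {
            // Returns the supplementary code point if chars[i], chars[i + 1]
            // form a valid surrogate pair; otherwise returns chars[i] itself.
            int cp = Character.codePointAt(chars, i);
            if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                // An unpaired surrogate has no valid character reference.
                throw new IllegalArgumentException("Unpaired surrogate at index " + i);
            }
            out.append("&#").append(Integer.toString(cp)).append(';');
            i += Character.charCount(cp); // advance by 1 or 2 code units
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+1F600 is an astral character, stored as the pair D83D DE00.
        String astral = new String(Character.toChars(0x1F600));
        System.out.println(perCharNcr(astral));      // prints &#55357;&#56832;
        System.out.println(perCodePointNcr(astral)); // prints &#128512;
    }
}
```

The first output is the pair of surrogate NCRs named in the issue title, which XML does not allow as character references. The second is a single reference to the actual code point.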



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@xalan.apache.org
For additional commands, e-mail: dev-help@xalan.apache.org


Re: [jira] [Commented] (XALANJ-2419) Astral characters written as a pair of NCRs with the surrogate scalar values when using UTF-8

Posted by Joseph Kesselman <ke...@alum.mit.edu>.
Good catch!  
  

--
Sent from palmtop; apologies for any auto-incorrections.  
() | Text Mail Campaign  
/\ | HTML mail is _evil_!

  
