Posted to issues@commons.apache.org by "Cyril Parsons (Jira)" <ji...@apache.org> on 2020/12/11 13:53:00 UTC

[jira] [Updated] (TEXT-192) HTML unescape does not parse Windows-1252 correctly

     [ https://issues.apache.org/jira/browse/TEXT-192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cyril Parsons updated TEXT-192:
-------------------------------
    Description: 
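According to [https://en.wikipedia.org/wiki/Windows-1252#Character_set], Windows-1252 and ISO 8859-1 differ in the range 128 to 159, where Windows-1252 assigns printable characters. StringEscapeUtils.unescapeHtml4 does not decode numeric character references in this range to their Windows-1252 characters.

A minimal example: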
{code:java}
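// Requires org.apache.commons.text.StringEscapeUtils and java.util.stream.Collectors.
String w1252 = "&#126;&#151;&#161;";
String output = StringEscapeUtils.unescapeHtml4(w1252);
System.out.println(output);
// Print the numeric value of each decoded char to show what every reference became.
System.out.println(output.chars().boxed().collect(Collectors.toList()));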
{code}
 

The output is:
{code:java}
~— ¡
[126, 151, 161]
{code}
 

(A space is substituted here for the Unicode control character U+0097, "End Of Guarded Area".) The expected output is an em dash (U+2014) in that position, since 151 is the em dash in Windows-1252.
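
To make the expected behaviour concrete, here is a minimal sketch of the remapping, independent of Commons Text. The class name Cp1252NumericRef and the method remap are hypothetical and not part of any library. It assumes the JDK provides the "windows-1252" charset, and it assumes the desired rule is the one browsers apply to numeric character references in the 128 to 159 range; it is not a proposed patch.

```java
import java.nio.charset.Charset;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not part of Commons Text.
final class Cp1252NumericRef {

    // Assumption: this charset is available in the running JDK.
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /**
     * Maps a numeric reference value in 128..159 to the code point that
     * Windows-1252 assigns to that byte. Other values pass through unchanged.
     */
    static int remap(int value) {
        if (value < 0x80 || value > 0x9F) {
            return value;
        }
        String decoded = new String(new byte[] {(byte) value}, CP1252);
        int codePoint = decoded.codePointAt(0);
        // Bytes that Windows-1252 leaves unassigned (such as 0x81) may decode
        // to U+FFFD; in that case keep the original value.
        return codePoint == 0xFFFD ? value : codePoint;
    }

    public static void main(String[] args) {
        int[] refs = {126, 151, 161}; // the values from the report
        StringBuilder sb = new StringBuilder();
        for (int ref : refs) {
            sb.appendCodePoint(remap(ref));
        }
        // Prints [126, 8212, 161]; 8212 is U+2014, the em dash.
        System.out.println(sb.chars().boxed().collect(Collectors.toList()));
    }
}
```

With this mapping, the three references from the report decode to a tilde, an em dash, and an inverted exclamation mark, which is the output I would expect from unescapeHtml4.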

 


> HTML unescape does not parse Windows-1252 correctly
> ---------------------------------------------------
>
>                 Key: TEXT-192
>                 URL: https://issues.apache.org/jira/browse/TEXT-192
>             Project: Commons Text
>          Issue Type: Bug
>    Affects Versions: 1.8
>         Environment: Java, macOS; should not be platform specific
>            Reporter: Cyril Parsons
>            Priority: Minor
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)