Posted to dev@tika.apache.org by "Ostico (Jira)" <ji...@apache.org> on 2022/06/16 16:57:00 UTC

[jira] [Comment Edited] (TIKA-3479) UniversalCharsetDetector in 2.x is misidentifying windows-1250 as ISO-8859-1

    [ https://issues.apache.org/jira/browse/TIKA-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17555198#comment-17555198 ] 

Ostico edited comment on TIKA-3479 at 6/16/22 4:56 PM:
-------------------------------------------------------

Maybe a better implementation would be to check for the byte values that are not present in the ISO-8859 character maps at all; their presence or absence allows an easy differentiation between windows-125n and ISO-8859-n:

{noformat}
public void report(String name) {
    if (Constants.CHARSET_WINDOWS_1252.equals(name)) {
        if (hint != null) {
            // Use the encoding hint when available
            name = hint;
        } else if (!hasNonexistentHexInCharsetWindows125n()) {
            // No byte in the 0x80-0x9F range appears. Those values are
            // undefined in ISO-8859-n but used by windows-125n, so the
            // encoding is more likely to be ISO-8859-1(5) than windows-125n
            if (statistics.count(0xa4) > 0) { // currency/euro sign
                // The general currency sign is hardly ever used in
                // ISO-8859-1, so it's more likely that we're dealing
                // with ISO-8859-15, where the character is used for
                // the euro symbol, which is more commonly used.
                name = CHARSET_ISO_8859_15;
            } else {
                name = CHARSET_ISO_8859_1;
            }
        }
    }
    try {
        this.charset = CharsetUtils.forName(name);
    } catch (IllegalArgumentException e) {
        // ignore
    }
}

/*
 * Byte values 0x80 - 0x9F are not defined in ISO-8859-n.
 * If any of them occurs in the text, return true.
 */
private boolean hasNonexistentHexInCharsetWindows125n() {
    for (int i = 0x80; i <= 0x9F; i++) {
        if (statistics.count(i) != 0) {
            return true;
        }
    }
    return false;
}{noformat}
*In ISO-8859-n, the byte values from 128 to 159 (0x80-0x9F) are not defined.*

A deeper classification within the 8859-x and windows-125x families could then be made on the basis of individual characters (a per-code-page map), once the main family has been identified.
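
As a rough illustration of that per-character idea, a hypothetical helper (not existing Tika code; the name guessWindows125x is made up here) could separate windows-1250 from windows-1252 once the 0x80-0x9F check has placed the text in the windows-125x family, using byte values that windows-1252 leaves undefined but windows-1250 maps to letters:
{noformat}
// Hypothetical sketch, not existing Tika code: assumes the text has already
// been placed in the windows-125x family by the 0x80-0x9F check above.
private static String guessWindows125x(byte[] data) {
    int windows1250Hits = 0;
    for (byte b : data) {
        int value = b & 0xFF;
        // 0x8D, 0x8F and 0x9D are undefined in windows-1252 but map to the
        // letters Ť, Ź and ť in windows-1250
        if (value == 0x8D || value == 0x8F || value == 0x9D) {
            windows1250Hits++;
        }
    }
    return windows1250Hits > 0 ? "windows-1250" : "windows-1252";
}{noformat}
(0x8D, 0x8F and 0x9D are three of the five values that the existing hasNonexistentHexInCharsetWindows1252() check treats as evidence against windows-1252, which is exactly how the Czech and Slovak files end up relabelled as ISO-8859-1.)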

This part of the current check is a source of false positives, in my opinion, because \r is also included in the windows-1252 encoding:

{noformat}
|| statistics.count('\r') == 0{noformat}
Not all windows-125n files are created, handled, or modified on Windows systems.

> UniversalCharsetDetector in 2.x is misidentifying windows-1250 as ISO-8859-1
> ----------------------------------------------------------------------------
>
>                 Key: TIKA-3479
>                 URL: https://issues.apache.org/jira/browse/TIKA-3479
>             Project: Tika
>          Issue Type: Task
>    Affects Versions: 2.0.0-BETA
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: Bates.Motel.S02E08.HDTV.x264-KILLERS.srt
>
>
> We've lost quite a few "common words" for Czech and Slovak text files in 2.x vs. 1.x.  The key issue appears to be the following (which we do not have in 1.x).
> {noformat}
>     /*
>      * hex value 0x81, 0x8d, 0x8f, 0x90 don't exist in charset windows-1252.
>      * If these value's count > 0, return true
>      * */
>     private Boolean hasNonexistentHexInCharsetWindows1252() {
>         return (statistics.count(0x81) > 0 || statistics.count(0x8d) > 0 ||
>                 statistics.count(0x8f) > 0 || statistics.count(0x90) > 0 ||
>                 statistics.count(0x9d) > 0);
>     }
> {noformat}
> The icu4j detector detects windows-1250 (not supported by the UniversalEncodingDetector), and the characters decoded with that encoding produce real words (for example, when checked against a Google search). windows-1252 is _generally_ a better match for windows-1250 than ISO-8859-1.
> Not sure how best to handle this...
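
For anyone who wants to reproduce the comparison, below is a minimal sketch of querying the icu4j detector directly on the attached file; it assumes the com.ibm.icu:icu4j artifact is on the classpath, and the class name and argument handling are just placeholders:
{noformat}
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

import java.nio.file.Files;
import java.nio.file.Paths;

public class DetectCharsetExample {
    public static void main(String[] args) throws Exception {
        // Raw bytes of the file to test, e.g. the attached .srt sample
        byte[] data = Files.readAllBytes(Paths.get(args[0]));

        CharsetDetector detector = new CharsetDetector();
        detector.setText(data);

        // Candidate charsets ordered by confidence; per the report above,
        // icu4j includes windows-1250 here, which the
        // UniversalEncodingDetector never returns
        for (CharsetMatch match : detector.detectAll()) {
            System.out.println(match.getName() + "\t" + match.getConfidence());
        }
    }
}{noformat}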



--
This message was sent by Atlassian Jira
(v8.20.7#820007)