Posted to dev@tika.apache.org by "Ostico (Jira)" <ji...@apache.org> on 2022/06/16 16:57:00 UTC
[jira] [Comment Edited] (TIKA-3479) UniversalCharsetDetector in 2.x is misidentifying windows-1250 as ISO-8859-1
[ https://issues.apache.org/jira/browse/TIKA-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17555198#comment-17555198 ]
Ostico edited comment on TIKA-3479 at 6/16/22 4:56 PM:
-------------------------------------------------------
Perhaps a better implementation would be to check for characters that are not present in the ISO-8859 character map, allowing an easy differentiation between windows-125n and ISO-8859-n:
{noformat}
public void report(String name) {
    if (Constants.CHARSET_WINDOWS_1252.equals(name)) {
        if (hint != null) {
            // Use the encoding hint when available
            name = hint;
        } else if (hasNonexistentHexInCharsetWindows125n()) {
            // If the text contains hex values that do not exist in
            // charset windows-1252, then the encoding is more likely
            // to be ISO-8859-1(5) than windows-1252
            if (statistics.count(0xa4) > 0) { // currency/euro sign
                // The general currency sign is hardly ever used in
                // ISO-8859-1, so it's more likely that we're dealing
                // with ISO-8859-15, where the character is used for
                // the euro symbol, which is more commonly used.
                name = CHARSET_ISO_8859_15;
            } else {
                name = CHARSET_ISO_8859_1;
            }
        }
    }
    try {
        this.charset = CharsetUtils.forName(name);
    } catch (IllegalArgumentException e) {
        // ignore
    }
}

/*
 * Hex values 0x80 - 0x9f don't exist in charset ISO-8859-n.
 * If any of these values has a count > 0, return true.
 */
private boolean hasNonexistentHexInCharsetWindows125n() {
    for (int i = 0x80; i <= 0x9F; i += 1) {
        if (statistics.count(i) != 0) {
            return true;
        }
    }
    return false;
}{noformat}
*In ISO-8859-n, the characters from 128 to 159 are not defined.*
A deeper classification between ISO-8859-x and windows-125x could be made based on the individual characters (a map) once the main family has been identified.
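As a rough sketch of such a per-character ("map") classification, one could decode the input with each candidate charset and prefer the one that yields the most letters. This is a hypothetical illustration, not Tika code; the class and method names are made up:

```java
import java.nio.charset.Charset;

public class CharsetVote {
    // Hypothetical sketch: once the main family is identified, decode the
    // bytes with each candidate charset and prefer the one that yields
    // the most letters. In the 0x80-0x9F range, windows-125n maps most
    // bytes to letters/punctuation, while ISO-8859-n decodes them to
    // C1 control characters, which are never letters.
    static String pickCharset(byte[] data, String... candidates) {
        String best = candidates[0];
        long bestScore = -1;
        for (String name : candidates) {
            String decoded = new String(data, Charset.forName(name));
            long score = decoded.chars().filter(Character::isLetter).count();
            if (score > bestScore) {
                bestScore = score;
                best = name;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // 0x9A is the Czech/Slovak letter 's with caron' in windows-1250,
        // but a C1 control character (non-letter) in ISO-8859-2.
        byte[] czech = {'k', (byte) 0x9A, 'k'};
        System.out.println(pickCharset(czech, "ISO-8859-2", "windows-1250"));
        // prints "windows-1250"
    }
}
```

A letter count is of course a crude score; a real implementation would weight characters by per-language frequency, but the 0x80-0x9F range alone already separates the two families.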
This check produces false positives in my opinion, because \r is included in the windows-1252 encoding:
{noformat}
|| statistics.count('\r') == 0{noformat}
Not all windows-125n files are created/handled/modified on Windows systems.
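The point about \r can be verified directly: CR maps to the same byte, 0x0D, in windows-1252 and in ISO-8859-1 (both are ASCII-compatible), so the presence or absence of CRs cannot distinguish the two. A minimal check (illustrative code, not Tika code):

```java
import java.nio.charset.Charset;

public class CrCheck {
    // '\r' encodes to the same byte in any ASCII-compatible charset,
    // so counting CRs says nothing about windows-1252 vs ISO-8859-1.
    static int crByte(String charsetName) {
        return "\r".getBytes(Charset.forName(charsetName))[0];
    }

    public static void main(String[] args) {
        System.out.println(crByte("windows-1252") == 0x0D); // prints "true"
        System.out.println(crByte("ISO-8859-1") == 0x0D);   // prints "true"
    }
}
```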
> UniversalCharsetDetector in 2.x is misidentifying windows-1250 as ISO-8859-1
> ----------------------------------------------------------------------------
>
> Key: TIKA-3479
> URL: https://issues.apache.org/jira/browse/TIKA-3479
> Project: Tika
> Issue Type: Task
> Affects Versions: 2.0.0-BETA
> Reporter: Tim Allison
> Priority: Minor
> Attachments: Bates.Motel.S02E08.HDTV.x264-KILLERS.srt
>
>
> We've lost quite a few "common words" for Czech and Slovak text files in 2.x vs. 1.x. The key issue appears to be the following (which we do not have in 1.x).
> {noformat}
> /*
> * hex value 0x81, 0x8d, 0x8f, 0x90 don't exist in charset windows-1252.
> * If these value's count > 0, return true
> * */
> private Boolean hasNonexistentHexInCharsetWindows1252() {
> return (statistics.count(0x81) > 0 || statistics.count(0x8d) > 0 ||
> statistics.count(0x8f) > 0 || statistics.count(0x90) > 0 ||
> statistics.count(0x9d) > 0);
> }
> {noformat}
> The icu4j detector detects windows-1250 (not supported by the UniversalEncodingDetector), and the characters decoded with encoding do better on google. windows-1252 is _generally_ a better match for windows-1250 than ISO-8859-1.
> Not sure how best to handle this...
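The observation above, that windows-1252 is generally a better match for windows-1250 than ISO-8859-1, can be illustrated with a small decoding experiment. The letter-count heuristic here is a made-up proxy for "plausible natural-language text", not anything in Tika:

```java
import java.nio.charset.Charset;

public class RoundTrip {
    // Count how many decoded chars are letters -- a rough proxy for how
    // plausible a decoding is for natural-language text.
    static long letters(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName))
                .chars().filter(Character::isLetter).count();
    }

    public static void main(String[] args) {
        // Czech "žluťoučký" encoded as windows-1250: ž (0x9E) and ť (0x9D)
        // fall in the 0x80-0x9F range, which ISO-8859-1 decodes to C1
        // control characters, while windows-1252 at least recovers ž.
        byte[] cp1250 = "\u017elu\u0165ou\u010dk\u00fd"
                .getBytes(Charset.forName("windows-1250"));
        System.out.println(
                letters(cp1250, "windows-1252") > letters(cp1250, "ISO-8859-1"));
        // prints "true"
    }
}
```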
--
This message was sent by Atlassian Jira
(v8.20.7#820007)