Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2017/09/01 15:23:00 UTC

[jira] [Comment Edited] (TIKA-2219) CharsetDetector no longer detects windows-1252 charset

    [ https://issues.apache.org/jira/browse/TIKA-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150712#comment-16150712 ] 

Tim Allison edited comment on TIKA-2219 at 9/1/17 3:22 PM:
-----------------------------------------------------------

Not sure if there's anything we can do here.

The attached file never reaches the charset detector because it is parsed by the RFC822Parser. If you parse the file with the TXTParser (which in turn calls CharsetDetector), it is correctly identified as windows-1252 and parsed correctly.

IIRC, the correct way to encode windows-1252 in RFC822 should be something like {{=?windows-1252?Q?100_=80?=}}.
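
To make the encoded-word form concrete, here is a minimal sketch of decoding a single RFC 2047 "Q"-encoded word, whose full form is {{=?charset?Q?text?=}}. The class and method names are hypothetical and this is not how Tika or the RFC822Parser does it; it only illustrates why the charset label matters for the byte 0x80.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

public class EncodedWordSketch {

    // Decodes one encoded-word of the form =?charset?Q?text?= (Q encoding only).
    // Hypothetical helper for illustration; not part of Tika.
    static String decodeQ(String word) {
        if (word.length() < 6 || !word.startsWith("=?") || !word.endsWith("?=")) {
            throw new IllegalArgumentException("not an encoded-word: " + word);
        }
        // Strip the =? and ?= delimiters, then split into charset, encoding, text.
        String[] parts = word.substring(2, word.length() - 2).split("\\?", 3);
        if (parts.length != 3 || !parts[1].equalsIgnoreCase("Q")) {
            throw new IllegalArgumentException("only Q encoding is handled: " + word);
        }
        String text = parts[2];
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '_') {
                bytes.write(' ');  // underscore stands for a space
            } else if (c == '=' && i + 2 < text.length()) {
                // =XX is one raw byte written as two hex digits
                bytes.write(Integer.parseInt(text.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                bytes.write(c);    // remaining characters are plain ASCII
            }
        }
        // The declared charset decides what the raw bytes mean.
        return new String(bytes.toByteArray(), Charset.forName(parts[0]));
    }

    public static void main(String[] args) {
        // Byte 0x80 is the euro sign (U+20AC) in windows-1252 ...
        String cp1252 = decodeQ("=?windows-1252?Q?100_=80?=");
        // ... but a C1 control character (U+0080) in ISO-8859-1.
        String latin1 = decodeQ("=?ISO-8859-1?Q?100_=80?=");
        System.out.printf("windows-1252: U+%04X%n", (int) cp1252.charAt(4));
        System.out.printf("ISO-8859-1:   U+%04X%n", (int) latin1.charAt(4));
    }
}
```

With an explicit charset label like this, no detection is needed at all, which is presumably why a mail parser would not consult CharsetDetector for such content.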

Any recommended fix?


was (Author: tallison@mitre.org):
Not sure if there's anything we can do here.

The attached file doesn't go through the charset detector at all because it is parsed by the RFC822Parser.  If you do force it to go through the TXTParser, it is correctly id'd as windows-1252.

IIRC, the correct way to encode windows-1252 in RFC822 should be something like {{=?windows-1252?Q?100_=80?=}}.

Any recommended fix?

> CharsetDetector no longer detects windows-1252 charset
> ------------------------------------------------------
>
>                 Key: TIKA-2219
>                 URL: https://issues.apache.org/jira/browse/TIKA-2219
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.14
>         Environment: Any.
>            Reporter: Pascal Essiembre
>            Priority: Minor
>             Fix For: 2.0, 1.15
>
>         Attachments: test.txt
>
>
> Starting with Tika 1.14, windows-1252 is no longer detected, as ISO-8859-1 is always detected instead.  While not tested, this likely affects other windows-125* encodings as well.
> I tracked it down to a change in the {{CharsetRecog_sbcs.CharsetRecog_8859_1#getName()}} method.  It now always returns "ISO-8859-1", whereas before it was: {{return haveC1Bytes ? "windows-1252" : "ISO-8859-1";}}
> That condition has been moved to the {{match(CharsetDetector det)}} method so that the returned CharsetMatch has the proper name.  The problem is that the {{CharsetDetector#detectAll()}} method overwrites the correct match with a new one that returns the value of {{#getName()}} from the {{CharsetRecognizer}} instead (which is always "ISO-8859-1" in this case).
> There might be legitimate reasons why the {{CharsetMatch}} instances in {{detectAll()}} method are replaced with new ones, but changing this code in that method appears to work for me:
> // Remove this:
> //                    CharsetMatch m = new CharsetMatch(this, csr, confidence);
> //                    matches.add(m);
> // Add this instead:
>                     matches.add(charsetMatch);
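
The behaviour described in the report can be reproduced in miniature with the sketch below. It is a simplified model and not Tika's or ICU's code: {{Match}}, {{Latin1Recognizer}}, both {{detectAll...}} methods and the fixed confidence of 50 are hypothetical stand-ins, and the real recognizer scores input with n-gram statistics rather than only checking for C1 bytes.

```java
import java.util.ArrayList;
import java.util.List;

public class RewrapSketch {

    // Hypothetical stand-in for a match result; not Tika's CharsetMatch.
    static final class Match {
        final String name;
        final int confidence;
        Match(String name, int confidence) {
            this.name = name;
            this.confidence = confidence;
        }
    }

    // Hypothetical stand-in for the ISO-8859-1 recognizer from the report.
    static final class Latin1Recognizer {
        // After the change described above, the name is fixed.
        String getName() {
            return "ISO-8859-1";
        }

        // The C1 check now lives here, so only the returned match carries
        // the specific name. The confidence value is arbitrary.
        Match match(byte[] input) {
            boolean haveC1Bytes = false;
            for (byte b : input) {
                int v = b & 0xFF;
                if (v >= 0x80 && v <= 0x9F) {  // C1 range, unused by ISO-8859-1 text
                    haveC1Bytes = true;
                    break;
                }
            }
            return new Match(haveC1Bytes ? "windows-1252" : "ISO-8859-1", 50);
        }
    }

    // Models the reported behaviour: the match is rebuilt from getName(),
    // so the more specific name is lost.
    static List<Match> detectAllRewrapping(byte[] input) {
        Latin1Recognizer csr = new Latin1Recognizer();
        List<Match> matches = new ArrayList<>();
        Match charsetMatch = csr.match(input);
        matches.add(new Match(csr.getName(), charsetMatch.confidence));
        return matches;
    }

    // Models the proposed change: keep the match the recognizer returned.
    static List<Match> detectAllKeeping(byte[] input) {
        Latin1Recognizer csr = new Latin1Recognizer();
        List<Match> matches = new ArrayList<>();
        matches.add(csr.match(input));
        return matches;
    }

    public static void main(String[] args) {
        byte[] input = {'1', '0', '0', ' ', (byte) 0x80};  // "100 " plus a C1 byte
        System.out.println("rewrapped: " + detectAllRewrapping(input).get(0).name);
        System.out.println("kept:      " + detectAllKeeping(input).get(0).name);
    }
}
```

In this model the only difference between the two methods is which object ends up in the list, which mirrors the one-line change suggested in the report.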


