Posted to dev@tika.apache.org by "Ken Krugler (JIRA)" <ji...@apache.org> on 2014/02/11 04:55:20 UTC
[jira] [Commented] (TIKA-1236) CharsetDetector returning unsupported encoding for some 7-bit Outlook/MSG files

    [ https://issues.apache.org/jira/browse/TIKA-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13897508#comment-13897508 ] 

Ken Krugler commented on TIKA-1236:
-----------------------------------

Hi Tim - I don't think EncodingDetector should silently return a different (supported) encoding for this case.

So then the choice seems to be between failing in some predictable manner, or returning nothing from the document. What do we currently do when mime-type detection returns a parser that isn't available? That seems like the most similar situation.
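
To make the "fail in some predictable manner" option concrete, here is a minimal sketch of a check that could run before the detected name is handed to set7BitEncoding(). It uses only java.nio.charset.Charset from the JDK. The class name CharsetCheck, the method requireSupported, and the exception UnsupportedDetectedCharsetException are hypothetical names made up for illustration; they are not part of Tika or POI, and this is not meant as the actual fix.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

final class CharsetCheck {

    /** Hypothetical checked exception, so callers must handle the failure explicitly. */
    static final class UnsupportedDetectedCharsetException extends Exception {
        UnsupportedDetectedCharsetException(String name) {
            super("Detected charset is not supported by this JVM: " + name);
        }
    }

    /**
     * Returns the Charset for the detected name, or throws a checked exception
     * if the JVM cannot decode it. It never substitutes a different encoding.
     */
    static Charset requireSupported(String detectedName)
            throws UnsupportedDetectedCharsetException {
        try {
            // Charset.isSupported rejects null, so guard against it first.
            if (detectedName != null && Charset.isSupported(detectedName)) {
                return Charset.forName(detectedName);
            }
        } catch (IllegalCharsetNameException e) {
            // Syntactically illegal name (for example an empty string):
            // treat it the same as an unsupported charset.
        }
        throw new UnsupportedDetectedCharsetException(detectedName);
    }
}
```

A caller would pass in whatever name the detector returned (for the files in this issue, "IBM424_rtl") and, in the catch block, decide whether to rethrow as a TikaException or skip the text. The point of the checked exception is that the failure surfaces in one known place instead of as a RuntimeException from inside POI.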

> CharsetDetector returning unsupported encoding for some 7-bit Outlook/MSG files
> -------------------------------------------------------------------------------
>
>                 Key: TIKA-1236
>                 URL: https://issues.apache.org/jira/browse/TIKA-1236
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.6
>            Reporter: Tim Allison
>            Priority: Minor
>
> When parsing a 7-bit encoded Outlook post (.msg without headers), Tika tries to detect the encoding. For a handful of files, the CharsetDetector returns "IBM424_rtl" with a confidence above the threshold. This encoding is then set with MAPIMessage.set7BitEncoding(). When MAPI tries to use this encoding, it finds that it is unsupported and throws an exception.
> Full stacktrace:
> {noformat}
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@72ccd846
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
> 	at org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookEncoding(OutlookParserTest.java:264)
> ...irrelevant test framework junk...
> Caused by: java.lang.RuntimeException: Encoding not found - IBM424_rtl
> 	at org.apache.poi.hsmf.datatypes.StringChunk.parseAs7BitData(StringChunk.java:149)
> 	at org.apache.poi.hsmf.datatypes.StringChunk.parseString(StringChunk.java:85)
> 	at org.apache.poi.hsmf.datatypes.StringChunk.set7BitEncoding(StringChunk.java:74)
> 	at org.apache.poi.hsmf.MAPIMessage.set7BitEncoding(MAPIMessage.java:455)
> 	at org.apache.tika.parser.microsoft.OutlookExtractor.parse(OutlookExtractor.java:95)
> 	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:223)
> 	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:167)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
> 	... 26 more
> Caused by: java.io.UnsupportedEncodingException: IBM424_rtl
> 	at java.lang.StringCoding.decode(Unknown Source)
> 	at java.lang.String.<init>(Unknown Source)
> 	at java.lang.String.<init>(Unknown Source)
> 	at org.apache.poi.hsmf.datatypes.StringChunk.parseAs7BitData(StringChunk.java:147)
> 	... 33 more
> {noformat}
> Unfortunately, I can't share the problematic documents, and I can't create a synthetic document that triggers this issue.
> Two questions:
> 1)  Should CharsetDetector return an encoding that is not supported?
> 2)  If so, should we add a simple check before calling set7BitEncoding()?



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)