Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2013/09/30 17:20:24 UTC

[jira] [Commented] (TIKA-1162) content-type/charset problem with RFC822Parser

    [ https://issues.apache.org/jira/browse/TIKA-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13781922#comment-13781922 ] 

Tim Allison commented on TIKA-1162:
-----------------------------------

Dear Colleague,
  I'm on paternity leave.  Will be back part time on October 14.

   Best,

            Tim



> content-type/charset problem with RFC822Parser
> ----------------------------------------------
>
>                 Key: TIKA-1162
>                 URL: https://issues.apache.org/jira/browse/TIKA-1162
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Maciej Lizewski
>
> RFC822Parser (MIME mail) uses MailContentHandler, which internally uses AutoDetectParser to handle each MIME part. The problem is that MailContentHandler reads the MIME part headers, sets the CONTENT_TYPE and CONTENT_ENCODING metadata properly, and passes this metadata to the AutoDetectParser::parse method. But that method ignores those headers and overwrites them:
>         MediaType type = this.getDetector().detect(tis, metadata);
>         metadata.set(Metadata.CONTENT_TYPE, type.toString());
> This leads to some additional recursion loops (the Detector returns the message/rfc822 MIME type instead of the proper MIME type for the current MIME part), and eventually it somehow exits the loop, but without the proper content-type and content-encoding headers.
> My proposal is to add a check in AutoDetectParser::parse for whether the metadata already contains CONTENT_TYPE and, if so, not override it. If this is not valid behavior in general, then RFC822Parser should use a custom parser in MailContentHandler that respects the passed content type.
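
For illustration only, the proposed check might look roughly like the sketch below. It deliberately does not use Tika's own classes: the metadata is a plain Map, and ContentTypeGuard, TypeDetector and resolveContentType are hypothetical names made up for this example. It only shows the idea of keeping a content type declared by the caller and falling back to detection when none is set. It is not the actual AutoDetectParser code or a proposed patch.

```java
import java.util.HashMap;
import java.util.Map;

public class ContentTypeGuard {
    // Same header name that the issue refers to as CONTENT_TYPE.
    public static final String CONTENT_TYPE = "Content-Type";

    // Hypothetical stand-in for a detector; not Tika's Detector interface.
    public interface TypeDetector {
        String detect(Map<String, String> metadata);
    }

    // Keep a content type the caller already declared; detect only if it is missing.
    public static String resolveContentType(Map<String, String> metadata,
                                            TypeDetector detector) {
        String declared = metadata.get(CONTENT_TYPE);
        if (declared != null && !declared.trim().isEmpty()) {
            return declared; // respect the MIME part header, skip detection
        }
        String detected = detector.detect(metadata);
        metadata.put(CONTENT_TYPE, detected); // fill in only when nothing was set
        return detected;
    }

    public static void main(String[] args) {
        // A detector that always answers message/rfc822, as described in the issue.
        TypeDetector alwaysRfc822 = m -> "message/rfc822";

        Map<String, String> part = new HashMap<>();
        part.put(CONTENT_TYPE, "text/plain; charset=ISO-8859-2");
        System.out.println(resolveContentType(part, alwaysRfc822));

        Map<String, String> unknown = new HashMap<>();
        System.out.println(resolveContentType(unknown, alwaysRfc822));
    }
}
```

With this guard, the first call keeps the declared text/plain type and its charset, while the second call, which has no declared type, falls back to whatever the detector returns. Whether such a check belongs in AutoDetectParser itself or in a custom parser used by MailContentHandler is the open question raised in the issue.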



--
This message was sent by Atlassian JIRA
(v6.1#6144)