Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2018/02/26 22:32:00 UTC

[jira] [Resolved] (TIKA-2578) Mails not recognized when unknown X-headers are present

     [ https://issues.apache.org/jira/browse/TIKA-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-2578.
-------------------------------
       Resolution: Fixed
         Assignee: Tim Allison
    Fix Version/s: 2.0.0
                   1.18

Thank you, [~AndreasMeier].  I look forward to running this updated version against our regression corpus.

[~gagravarr] and [~lfcnassif], if my proposed modifications are too broad, let me know.

> Mails not recognized when unknown X-headers are present
> -------------------------------------------------------
>
>                 Key: TIKA-2578
>                 URL: https://issues.apache.org/jira/browse/TIKA-2578
>             Project: Tika
>          Issue Type: Bug
>          Components: detector, mime
>    Affects Versions: 1.17, 1.18, 2.0.0
>            Reporter: Andreas Meier
>            Assignee: Tim Allison
>            Priority: Major
>             Fix For: 1.18, 2.0.0
>
>         Attachments: testRFC822_with_leading_x_header
>
>
> I found some mails with leading X-headers.
> These mails are detected as text/plain.
> One example is Cisco's IronPort, which may add "X-IronPort-AV" to the beginning of mails.
> Therefore, I would like to discuss whether and how Tika should handle these cases.
> In my opinion, Tika should try to detect files with leading X-headers and preprocess them to get a valid mail.
> Suggestion:
> {code:xml}
> <mime-type type="text/x-tika-x-header">
>   <magic priority="50">
>     <match value="X-" type="string" offset="0">
>       <match value="Message-ID:" type="string" offset="0:8192"/>
>       <match value="From:" type="stringignorecase" offset="0:8192"/>
>       <match value="To:" type="stringignorecase" offset="0:8192"/>
>       <match value="Subject:" type="string" offset="0:8192"/>
>       <match value="MIME-Version:" type="stringignorecase" offset="0:8192"/>
>     </match>
>   </magic>
>   <sub-class-of type="text/x-tika-text-based-message"/>
> </mime-type>
> {code}
> See also: [RFC6648|https://tools.ietf.org/html/rfc6648]
> I have attached an example file.
> Regards
> Andreas
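
For readers who want to see how the suggested magic rule behaves, here is a minimal Python sketch of its logic. It is not Tika code. It assumes that nested <match> elements are ANDed with their parent and ORed with their siblings, and that offset="0:8192" bounds the start position of a match. The names MAX_OFFSET, NESTED_MATCHES and looks_like_x_header_mail are hypothetical and exist only for this illustration.

```python
# Illustrative sketch only: mimics the proposed magic rule, it is not Tika's implementation.
# Assumption: offset="0:8192" means a match may START anywhere from byte 0 to byte 8192.
MAX_OFFSET = 8192

# (pattern, ignore_case) pairs copied from the nested <match> elements of the suggestion.
NESTED_MATCHES = [
    (b"Message-ID:", False),    # type="string"
    (b"From:", True),           # type="stringignorecase"
    (b"To:", True),             # type="stringignorecase"
    (b"Subject:", False),       # type="string"
    (b"MIME-Version:", True),   # type="stringignorecase"
]


def looks_like_x_header_mail(data: bytes) -> bool:
    """Return True if data starts with "X-" and any nested pattern matches."""
    # Outer match: value="X-" type="string" offset="0"
    if not data.startswith(b"X-"):
        return False
    for pattern, ignore_case in NESTED_MATCHES:
        # Keep only the bytes a match starting at offset <= MAX_OFFSET could cover.
        window = data[: MAX_OFFSET + len(pattern)]
        if ignore_case:
            if pattern.lower() in window.lower():
                return True
        elif pattern in window:
            return True
    return False


if __name__ == "__main__":
    sample = b"X-IronPort-AV: E=Sophos\r\nFrom: someone@example.com\r\n\r\nBody"
    print(looks_like_x_header_mail(sample))
```

In this sketch, the case-insensitive "To:" pattern also matches inside a header such as "X-Reply-To:", so any text that starts with "X-" and contains "to:" within the window is accepted. That may be worth keeping in mind when judging whether the rule is too broad.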


