Posted to dev@tika.apache.org by "Andreas Meier (JIRA)" <ji...@apache.org> on 2018/02/20 08:53:00 UTC

[jira] [Created] (TIKA-2578) Mails not recognized when unknown X-headers are present

Andreas Meier created TIKA-2578:
-----------------------------------

             Summary: Mails not recognized when unknown X-headers are present
                 Key: TIKA-2578
                 URL: https://issues.apache.org/jira/browse/TIKA-2578
             Project: Tika
          Issue Type: Bug
          Components: mime
    Affects Versions: 1.17, 1.18, 2.0.0
            Reporter: Andreas Meier
         Attachments: testRFC822_with_leading_x_header

I found some mails that begin with X-headers.

These mails are detected as text/plain instead of message/rfc822.

One example is Cisco's IronPort, which may add an "X-IronPort-AV" header at the beginning of mails.

Therefore I would like to discuss whether and how Tika should handle these cases.

In my opinion, Tika should try to detect files with leading X-headers and preprocess them to obtain a valid mail.

Suggestion:

{code:xml}
<mime-type type="text/x-tika-x-header">
  <magic priority="50">
    <match value="X-" type="string" offset="0">
      <match value="Message-ID:" type="string" offset="0:8192"/>
      <match value="From:" type="stringignorecase" offset="0:8192"/>
      <match value="To:" type="stringignorecase" offset="0:8192"/>
      <match value="Subject:" type="string" offset="0:8192"/>
      <match value="MIME-Version:" type="stringignorecase" offset="0:8192"/>
    </match>
  </magic>
  <sub-class-of type="text/x-tika-text-based-message"/>
</mime-type>
{code}
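
For discussion, the rule above can be read as plain Java. This is only a sketch, not Tika code: the class LeadingXHeaderSniffer and its method are hypothetical names. It assumes that nested matches are ANDed with their parent and that sibling matches are ORed, as in the freedesktop shared MIME-info format. It also assumes that offset="0:8192" means the pattern may start at any byte offset from 0 to 8192.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical helper (not part of Tika) mirroring the proposed magic rule. */
final class LeadingXHeaderSniffer {

    // Assumed meaning of offset="0:8192": a pattern may start at offsets 0..8192.
    private static final int MAX_OFFSET = 8192;

    // Patterns declared with type="string" in the rule (case-sensitive).
    private static final String[] CASE_SENSITIVE = {"Message-ID:", "Subject:"};

    // Patterns declared with type="stringignorecase", stored in lower case.
    private static final String[] CASE_INSENSITIVE = {"from:", "to:", "mime-version:"};

    private LeadingXHeaderSniffer() {}

    /** Returns true if data starts with "X-" and any listed header starts within the window. */
    static boolean matches(byte[] data) {
        // ISO-8859-1 maps each byte to exactly one char, so string indexes equal byte offsets.
        // The extra 64 bytes let a pattern that starts near offset 8192 still be seen in full.
        int limit = Math.min(data.length, MAX_OFFSET + 64);
        String head = new String(data, 0, limit, StandardCharsets.ISO_8859_1);

        // Outer match: value="X-" at offset 0.
        if (!head.startsWith("X-")) {
            return false;
        }
        // Inner matches: any single one is sufficient.
        for (String pattern : CASE_SENSITIVE) {
            if (startsWithinWindow(head, pattern)) {
                return true;
            }
        }
        String lower = head.toLowerCase(Locale.ROOT);
        for (String pattern : CASE_INSENSITIVE) {
            if (startsWithinWindow(lower, pattern)) {
                return true;
            }
        }
        return false;
    }

    private static boolean startsWithinWindow(String haystack, String needle) {
        int index = haystack.indexOf(needle);
        return index >= 0 && index <= MAX_OFFSET;
    }
}
```

Note that the case-insensitive "To:" pattern is short and matches anywhere in the window, so under this reading a text file that starts with "X-" and contains "to:" somewhere in its first 8 KB would also match. That may be worth considering when choosing the priority.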

See also: [RFC6648|https://tools.ietf.org/html/rfc6648]

I have attached an example file (testRFC822_with_leading_x_header).

Regards

Andreas



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)