Posted to dev@tika.apache.org by "Luis Filipe Nassif (JIRA)" <ji...@apache.org> on 2015/03/20 23:09:39 UTC

[jira] [Comment Edited] (TIKA-1267) Improve Mbox file detection

    [ https://issues.apache.org/jira/browse/TIKA-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372204#comment-14372204 ] 

Luis Filipe Nassif edited comment on TIKA-1267 at 3/20/15 10:09 PM:
--------------------------------------------------------------------

Detection by extension alone works poorly because many mail applications do not use any extension. Maybe we could make application/mbox a subclass of message/rfc822 (after widening the rfc822 magic offsets; this is not semantically true). Does the default detector check the parent type's magics?

Or maybe we could require some of the rfc822 magics, with extended offset ranges, as a prerequisite, because they should be present in the first email:
{code}
<mime-type type="application/mbox">
  <magic priority="70">
    <match value="From " type="string" offset="0">
      <match value="Forward\ to" type="string" offset="0:1024"/>
      <match value="Return-Path:" type="stringignorecase" offset="0:1024"/>
      <match value="Received:" type="stringignorecase" offset="0:1024"/>
      <match value="Message-ID:" type="stringignorecase" offset="0:1024"/>
    </match>
  </magic>
  <sub-class-of type="text/plain"/>
  <glob pattern="*.mbox"/>
</mime-type>
{code}
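
To make the intent of the nested rule concrete, here is a minimal, self-contained Java sketch of the check it describes: "From " at offset 0, and at least one of the listed headers starting somewhere in the first 1024 bytes. This is not Tika's detector code. The class and method names are hypothetical. The sketch assumes that nested matches are ANDed with their parent, that sibling matches are ORed, and that offset="0:1024" means start offsets 0 through 1024 inclusive. It models only the three stringignorecase header matches and leaves out the Forward\ to rule.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; this is not part of Apache Tika.
class MboxMagicSketch {

    // Header names from the nested stringignorecase <match> elements.
    private static final String[] HEADERS = {"Return-Path:", "Received:", "Message-ID:"};

    /** Top-level match alone: the bytes "From " at offset 0. */
    public static boolean startsWithFromLine(byte[] data) {
        return matchesAt(data, 0, ascii("From "), false);
    }

    /** Parent match AND at least one nested header match (assumed semantics). */
    public static boolean looksLikeMbox(byte[] data) {
        if (!startsWithFromLine(data)) {
            return false;
        }
        for (String header : HEADERS) {
            if (findInRange(data, ascii(header), 0, 1024, true)) {
                return true;
            }
        }
        return false;
    }

    /** True if the pattern starts at some offset in [from, to] (inclusive). */
    static boolean findInRange(byte[] data, byte[] pattern, int from, int to, boolean ignoreCase) {
        for (int start = from; start <= to && start + pattern.length <= data.length; start++) {
            if (matchesAt(data, start, pattern, ignoreCase)) {
                return true;
            }
        }
        return false;
    }

    /** Byte-wise comparison at a fixed offset, optionally ignoring ASCII case. */
    static boolean matchesAt(byte[] data, int start, byte[] pattern, boolean ignoreCase) {
        if (start < 0 || start + pattern.length > data.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            int a = data[start + i];
            int b = pattern[i];
            if (ignoreCase) {
                a = toLowerAscii(a);
                b = toLowerAscii(b);
            }
            if (a != b) {
                return false;
            }
        }
        return true;
    }

    static int toLowerAscii(int c) {
        return (c >= 'A' && c <= 'Z') ? c + 32 : c;
    }

    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] mbox = ascii("From alice@example.org Fri Mar 20 22:09:39 2015\n"
                + "Return-Path: <alice@example.org>\n"
                + "Subject: test\n\nbody\n");
        byte[] note = ascii("From the desk of the editor: a plain text note.\n");

        System.out.println("mbox sample, nested rule: " + looksLikeMbox(mbox));
        System.out.println("plain note, bare rule:    " + startsWithFromLine(note));
        System.out.println("plain note, nested rule:  " + looksLikeMbox(note));
    }
}
```

In this sketch, a text file that merely begins with "From " passes the bare check but fails the nested one, which is the false positive the extra header matches are meant to avoid.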


was (Author: lfcnassif):
Detection by extension alone works poorly because many mail applications do not use any extension. Maybe we could make application/mbox a subclass of message/rfc822 (not semantically true). Does the default detector check the parent type's magics?

Or maybe include rfc822 magics as a prerequisite because they will be present in the first email:
{code}
<mime-type type="application/mbox">
  <magic priority="70">
   <match value="From " type="string" offset="0">
      <match value="Relay-Version:" type="stringignorecase" offset="0"/>
      <match value="#!\ rnews" type="string" offset="0"/>
      <match value="N#!\ rnews" type="string" offset="0"/>
      <match value="Forward\ to" type="string" offset="0"/>
      <match value="Pipe\ to" type="string" offset="0"/>
      <match value="Return-Path:" type="stringignorecase" offset="0"/>
      <match value="From:" type="stringignorecase" offset="0"/>
      <match value="Received:" type="stringignorecase" offset="0"/>
      <match value="Message-ID:" type="stringignorecase" offset="0"/>
      <match value="Date:" type="string" offset="0"/>
      <match value="MIME-Version:" type="stringignorecase" offset="0"/>
      <match value="X-Notes-Item:" type="string" offset="0">
        <match value="Message-ID:" type="string" offset="0:8192"/>
      </match> 
   </match>
  </magic>
  <sub-class-of type="text/plain"/>
  <glob pattern="*.mbox"/>
</mime-type>
{code}

> Improve Mbox file detection
> ---------------------------
>
>                 Key: TIKA-1267
>                 URL: https://issues.apache.org/jira/browse/TIKA-1267
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>    Affects Versions: 1.5
>            Reporter: Luis Filipe Nassif
>            Priority: Minor
>
> Could we add the code below to the application/mbox mime-type definition:
> {code}
> <magic priority="70">
> <match value="From " type="string" offset="0"/>
> </magic>
> {code}
> Or is this pattern too common in other files?


