Posted to dev@tika.apache.org by "Luis Filipe Nassif (JIRA)" <ji...@apache.org> on 2015/03/20 23:09:39 UTC
[jira] [Comment Edited] (TIKA-1267) Improve Mbox file detection
[ https://issues.apache.org/jira/browse/TIKA-1267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14372204#comment-14372204 ]
Luis Filipe Nassif edited comment on TIKA-1267 at 3/20/15 10:09 PM:
--------------------------------------------------------------------
Detection by extension alone is very poor because many mail apps do not use any extension. Maybe we can make application/mbox a subclass of message/rfc822 (after widening the rfc822 magic offsets, although that is not semantically true). Does the default detector check for parent magics?
Or maybe include some extended rfc822 magics as a prerequisite, because they should be present in the first email:
{code}
<mime-type type="application/mbox">
  <magic priority="70">
    <match value="From " type="string" offset="0">
      <match value="Forward\ to" type="string" offset="0:1024"/>
      <match value="Return-Path:" type="stringignorecase" offset="0:1024"/>
      <match value="Received:" type="stringignorecase" offset="0:1024"/>
      <match value="Message-ID:" type="stringignorecase" offset="0:1024"/>
    </match>
  </magic>
  <sub-class-of type="text/plain"/>
  <glob pattern="*.mbox"/>
</mime-type>
{code}
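To make the intent of the nested rule concrete, here is a rough Python sketch of the check it describes. It is not Tika's implementation. It assumes that a nested match is ANDed with its parent, that sibling matches are alternatives, and that offset="0:1024" means the value may start at any offset in that range. The names find_in_range and looks_like_mbox are hypothetical and exist only for this illustration.

```python
# Sketch of the nested magic rule above; NOT Tika's actual implementation.
# Assumptions: a child <match> is ANDed with its parent, siblings are ORed,
# and offset="0:1024" lets the value start anywhere in that range (inclusive).

HEADER_MAGICS = [
    (b"Forward to", False),    # type="string": case-sensitive ("\ " is an escaped space)
    (b"Return-Path:", True),   # type="stringignorecase"
    (b"Received:", True),
    (b"Message-ID:", True),
]


def find_in_range(data, value, start, end, ignore_case):
    """Return True if value begins at some offset in [start, end]."""
    window = data[start:end + len(value)]
    if ignore_case:
        window, value = window.lower(), value.lower()
    return value in window


def looks_like_mbox(data):
    """'From ' at offset 0 AND at least one header magic starting in 0:1024."""
    if not data.startswith(b"From "):
        return False
    return any(find_in_range(data, v, 0, 1024, ic) for v, ic in HEADER_MAGICS)


# Illustrative inputs (made up for this sketch).
mbox_sample = (
    b"From alice@example.org Fri Mar 20 22:09:39 2015\n"
    b"Return-Path: <alice@example.org>\n"
    b"\n"
    b"Hi\n"
)
prose_sample = b"From the start, this is just an ordinary text file.\n"
```

With these inputs, looks_like_mbox(mbox_sample) is true and looks_like_mbox(prose_sample) is false, whereas a bare "From " check at offset 0 would accept both. That is the "too common" concern raised in the issue description.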
was (Author: lfcnassif):
Detection by extension alone is very poor because many mail apps do not use any extension. Maybe we can make application/mbox a subclass of message/rfc822 (although that is not semantically true). Does the default detector check for parent magics?
Or maybe include the rfc822 magics as a prerequisite, because they will be present in the first email:
{code}
<mime-type type="application/mbox">
  <magic priority="70">
    <match value="From " type="string" offset="0">
      <match value="Relay-Version:" type="stringignorecase" offset="0"/>
      <match value="#!\ rnews" type="string" offset="0"/>
      <match value="N#!\ rnews" type="string" offset="0"/>
      <match value="Forward\ to" type="string" offset="0"/>
      <match value="Pipe\ to" type="string" offset="0"/>
      <match value="Return-Path:" type="stringignorecase" offset="0"/>
      <match value="From:" type="stringignorecase" offset="0"/>
      <match value="Received:" type="stringignorecase" offset="0"/>
      <match value="Message-ID:" type="stringignorecase" offset="0"/>
      <match value="Date:" type="string" offset="0"/>
      <match value="MIME-Version:" type="stringignorecase" offset="0"/>
      <match value="X-Notes-Item:" type="string" offset="0">
        <match value="Message-ID:" type="string" offset="0:8192"/>
      </match>
    </match>
  </magic>
  <sub-class-of type="text/plain"/>
  <glob pattern="*.mbox"/>
</mime-type>
{code}
> Improve Mbox file detection
> ---------------------------
>
> Key: TIKA-1267
> URL: https://issues.apache.org/jira/browse/TIKA-1267
> Project: Tika
> Issue Type: Improvement
> Components: mime
> Affects Versions: 1.5
> Reporter: Luis Filipe Nassif
> Priority: Minor
>
> Could we add the code below to the application/mbox mime-type definition:
> {code}
> <magic priority="70">
>   <match value="From " type="string" offset="0"/>
> </magic>
> {code}
> Or is it too common out there?