Posted to dev@tika.apache.org by "Yury Kats (JIRA)" <ji...@apache.org> on 2018/07/17 22:10:00 UTC
[jira] [Updated] (TIKA-2688) MBOX not recognized when unknown X-headers are present
[ https://issues.apache.org/jira/browse/TIKA-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yury Kats updated TIKA-2688:
----------------------------
Description:
This is a spin-off from TIKA-2578.
I have mbox files that are not recognized as such because they have X- headers at the top.
Current config:
{noformat}
<mime-type type="application/mbox">
  <!-- MBOX files start with "From [sender] [date]" -->
  <!-- To avoid false matches, check for other headers after that -->
  <magic priority="70">
    <match value="From " type="string" offset="0">
      <match value="\nFrom: " type="string" offset="32:256"/>
      <match value="\nDate: " type="string" offset="32:256"/>
      <match value="\nSubject: " type="string" offset="32:256"/>
      <match value="\nDelivered-To: " type="string" offset="32:256"/>
      <match value="\nReceived: by " type="string" offset="32:256"/>
      <match value="\nReceived: via " type="string" offset="32:256"/>
      <match value="\nReceived: from " type="string" offset="32:256"/>
      <match value="\nMime-Version: " type="string" offset="32:256"/>
    </match>
  </magic>
</mime-type>
{noformat}
mbox file:
{noformat}
From "naveen.andrews@enron.com" Wed Jan 30 18:07:01 2002
X-EDO-Dataset: EnronData.org Abridged Email Dataset (AED)
X-EDO-AED-Version: 1.0
X-EDO-AED-License: Creative Commons Attribution 3.0 United States;
    http://creativecommons.org/licenses/by/3.0/us/;
    To provide attribution, please cite to "EnronData.org."
X-EDO-AED-ID: 516172
X-EDO-AED-File: zipper-a/inbox/38.eml
Message-ID: <82...@thyme>
Date: Wed, 30 Jan 2002 15:07:01 -0800 (PST)
From: naveen.andrews@enron.com
To: andy.zipper@enron.com
Subject: RE: Var simulation
...
{noformat}
The MBOX rule looks for additional headers only within the first 256 bytes, which is not enough when X- headers are present.
Side note: prior to 1.17, such an mbox file was detected as text/plain. As of 1.17, it is detected as message/rfc822 (because TIKA-2594 added a rule that matches a Message-ID within the first 1000 bytes). Neither is correct!
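To make the offset problem concrete, here is a minimal Python sketch (not Tika code) that checks where the first header known to the rule starts in the sample above. It assumes that offset="32:256" means a pattern may start anywhere from byte 32 through byte 256 inclusive. The helper names first_header_offset and matches_window are hypothetical, the single leading space on the folded license lines is an assumption, and the 1024 bound in the usage note is an arbitrary illustrative value, not a proposed fix.

```python
# Child patterns copied from the application/mbox magic rule in this issue.
HEADER_PATTERNS = [
    b"\nFrom: ", b"\nDate: ", b"\nSubject: ", b"\nDelivered-To: ",
    b"\nReceived: by ", b"\nReceived: via ", b"\nReceived: from ",
    b"\nMime-Version: ",
]

# Start of the sample mbox file from this issue. The Message-ID value is
# elided in the report and is kept elided here, so offsets computed from
# this sample are a lower bound for the real file. The leading space on
# the two folded license lines is an assumption.
SAMPLE = (
    b'From "naveen.andrews@enron.com" Wed Jan 30 18:07:01 2002\n'
    b"X-EDO-Dataset: EnronData.org Abridged Email Dataset (AED)\n"
    b"X-EDO-AED-Version: 1.0\n"
    b"X-EDO-AED-License: Creative Commons Attribution 3.0 United States;\n"
    b" http://creativecommons.org/licenses/by/3.0/us/;\n"
    b' To provide attribution, please cite to "EnronData.org."\n'
    b"X-EDO-AED-ID: 516172\n"
    b"X-EDO-AED-File: zipper-a/inbox/38.eml\n"
    b"Message-ID: <82...@thyme>\n"
    b"Date: Wed, 30 Jan 2002 15:07:01 -0800 (PST)\n"
    b"From: naveen.andrews@enron.com\n"
    b"To: andy.zipper@enron.com\n"
    b"Subject: RE: Var simulation\n"
)


def first_header_offset(data, start=32, patterns=HEADER_PATTERNS):
    """Return the smallest offset >= start where any pattern begins, or -1."""
    hits = [i for i in (data.find(p, start) for p in patterns) if i >= 0]
    return min(hits) if hits else -1


def matches_window(data, start, end, patterns=HEADER_PATTERNS):
    """Return True if some pattern begins at an offset in [start, end].

    bytes.find(p, lo, hi) only reports matches that fit entirely inside
    data[lo:hi], so extending hi by len(p) lets a match begin at `end`.
    """
    return any(data.find(p, start, end + len(p)) >= 0 for p in patterns)
```

With this sample, matches_window(SAMPLE, 32, 256) is False while matches_window(SAMPLE, 32, 1024) is True: the X-EDO headers alone push the first recognized header ("Date:") past byte 256.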
was:
This is a spin-off from TIKA-2578.
I have mbox files that are not recognized as such because they have X- headers at the top.
Current config:
{noformat}
<mime-type type="application/mbox">
  <!-- MBOX files start with "From [sender] [date]" -->
  <!-- To avoid false matches, check for other headers after that -->
  <magic priority="70">
    <match value="From " type="string" offset="0">
      <match value="\nFrom: " type="string" offset="32:256"/>
      <match value="\nDate: " type="string" offset="32:256"/>
      <match value="\nSubject: " type="string" offset="32:256"/>
      <match value="\nDelivered-To: " type="string" offset="32:256"/>
      <match value="\nReceived: by " type="string" offset="32:256"/>
      <match value="\nReceived: via " type="string" offset="32:256"/>
      <match value="\nReceived: from " type="string" offset="32:256"/>
      <match value="\nMime-Version: " type="string" offset="32:256"/>
    </match>
  </magic>
</mime-type>
{noformat}
mbox file:
{noformat}
From "naveen.andrews@enron.com" Wed Jan 30 18:07:01 2002
X-EDO-Dataset: EnronData.org Abridged Email Dataset (AED)
X-EDO-AED-Version: 1.0
X-EDO-AED-License: Creative Commons Attribution 3.0 United States;
    http://creativecommons.org/licenses/by/3.0/us/;
    To provide attribution, please cite to "EnronData.org."
X-EDO-AED-ID: 516172
X-EDO-AED-File: zipper-a/inbox/38.eml
Message-ID: <82...@thyme>
Date: Wed, 30 Jan 2002 15:07:01 -0800 (PST)
From: naveen.andrews@enron.com
To: andy.zipper@enron.com
Subject: RE: Var simulation
...
{noformat}
The MBOX rule looks for additional headers only within the first 256 bytes, which is not enough when X- headers are present.
Side note: prior to 1.17, such an mbox file was detected as text/plain. As of 1.17, it is detected as message/rfc822 (because TIKA-2594 added a rule that matches a Message-ID within the first 1000 bytes).
> MBOX not recognized when unknown X-headers are present
> ------------------------------------------------------
>
> Key: TIKA-2688
> URL: https://issues.apache.org/jira/browse/TIKA-2688
> Project: Tika
> Issue Type: Bug
> Components: detector, mime
> Affects Versions: 1.18
> Reporter: Yury Kats
> Priority: Major
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)