Posted to user@tika.apache.org by Vjeran Marcinko <vm...@gmail.com> on 2016/07/25 18:30:43 UTC

Problem with detection of .mbox file

Hello,

I first noticed that my .mbox file doesn't get parsed by MBoxParser.
Later, after debugging the Tika source code, I found the problem: the
default detector doesn't recognize the file as the "application/mbox"
MIME type. Although the file extension is .mbox, the detector ignores
that hint, because its "magic" detection, based on some number of
initial bytes, decides the file is "text/html" and returns that. As a
consequence, parsing never reaches the correct parser.

Is there some way to override this magic detection and force detection
to rely solely on the file extension for these .mbox files?

-Vjeran
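
A minimal sketch of the override asked about here, assuming nothing about Tika's internals: the file name is checked first, and only files without a .mbox extension reach the content-based detector. The class name, the detect method, and the Function standing in for the magic detector are all hypothetical. In Tika the same idea would presumably be packaged as a custom implementation of the org.apache.tika.detect.Detector interface placed ahead of the default detector; that wiring is not shown.

```java
import java.util.Locale;
import java.util.function.Function;

/** Illustrative only: a file-name override in front of a content-based detector. */
final class ExtensionOverrideSketch {

    /**
     * Returns application/mbox for any name ending in .mbox without looking at
     * the content; every other file is handed to the content-based fallback.
     */
    static String detect(String fileName, byte[] head, Function<byte[], String> contentDetector) {
        if (fileName != null && fileName.toLowerCase(Locale.ROOT).endsWith(".mbox")) {
            return "application/mbox";
        }
        return contentDetector.apply(head);
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a magic detector that misreads the file as HTML.
        Function<byte[], String> misdetecting = head -> "text/html";
        byte[] head = new byte[0];

        System.out.println(detect("gmail-export.mbox", head, misdetecting));
        System.out.println(detect("page.html", head, misdetecting));
    }
}
```

The trade-off is that a wrongly named file is never inspected, so this only makes sense when the .mbox extension can be trusted.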

#################################################################################
Anyway, here is the beginning of my MBOX file, which I got by exporting
my Gmail emails from Google:


From 1540828415824941917@xxx Mon Jul 25 12:08:06 +0000 2016
X-GM-THRID: 1540828415824941917
X-Gmail-Labels: Inbox,Important,clojure
Delivered-To: vmarcinko@gmail.com
Received: by 10.31.56.17 with SMTP id f17csp1614203vka;
        Mon, 25 Jul 2016 05:08:06 -0700 (PDT)
X-Received: by 10.202.95.133 with SMTP id t127mr8226795oib.80.1469448485990;
        Mon, 25 Jul 2016 05:08:05 -0700 (PDT)
Return-Path: <bo...@m.dripemail2.com>
Received: from o1678940x148.outbound-mail.sendgrid.net
(o1678940x148.outbound-mail.sendgrid.net. [167.89.40.148])
        by mx.google.com with ESMTPS id k58si11358370otb.279.2016.07.25.05.08.05
        for <vm...@gmail.com>
        (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
        Mon, 25 Jul 2016 05:08:05 -0700 (PDT)
Received-SPF: pass (google.com: domain of
bounces+2693180-18a0-vmarcinko=gmail.com@m.dripemail2.com designates
167.89.40.148 as permitted sender) client-ip=167.89.40.148;
Authentication-Results: mx.google.com;
       dkim=pass header.i=@dripemail2.com;
       dkim=pass header.i=@sendgrid.info;
       spf=pass (google.com: domain of
bounces+2693180-18a0-vmarcinko=gmail.com@m.dripemail2.com designates
167.89.40.148 as permitted sender)
smtp.mailfrom=bounces+2693180-18a0-vmarcinko=gmail.com@m.dripemail2.com
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=dripemail2.com;
h=content-type:from:mime-version:subject:to; s=s1;
bh=wbY8sP/TelOpmU6q09dgY8v3muI=; b=Vo/m0Lx7f8jNAHU2m0vLO6StuGms/
XeJeiLBV4CHyhwMNr4UuuBIJmDVGIuv6YGSJPN9REUYVuCqFyaPOAZiBtlie8Awq
7uB7KxZKnFPDh/7XQRz1Z1kKx0dGiENBOoymZFglCebm9my2i+trZ6EzN4YFOB/+
ZNpksoRirEVhws=
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=sendgrid.info;
h=content-type:from:mime-version:subject:to:x-feedback-id;
s=smtpapi; bh=wbY8sP/TelOpmU6q09dgY8v3muI=; b=vnSfe24bbcPSeungct
GphBd1h4S4i96PxeapkjmxCLyzeItTItNETiCtkLFbGnzFTVYVvzDOmcI47BYFHu
yOM0kILRdMzFt1d7HNVE1EJCB0DHVS83Yk7vaH/jc+IU34jJgZBlG0yR292QYtYk
7WA4ETOIQnQ+3K3pJ+wUYNGKs=
Received: by filter0448p1mdw1.sendgrid.net with SMTP id
filter0448p1mdw1.23984.5796012246
        2016-07-25 12:08:02.669274519 +0000 UTC
Received: from MjY5MzE4MA (ec2-54-210-139-199.compute-1.amazonaws.com
[54.210.139.199])
by ismtpd0002p1iad1.sendgrid.net (SG) with HTTP id zyxIxF_lRFKgFZxIoq9BKA
for <vm...@gmail.com>; Mon, 25 Jul 2016 12:08:02.739 +0000 (UTC)
Content-Type: multipart/alternative;
boundary=0082ce9e57fb837e9dfa9ca77bc69f450567ae3138b24a5db1e7237fc121
Date: Mon, 25 Jul 2016 12:08:02 +0000
From: "Eric at PurelyFunctional.tv" <er...@lispcast.com>
Mime-Version: 1.0
Subject: Twitter Bot, Atom Editor, and Scraping HTML
To: vmarcinko@gmail.com
Message-ID: <zy...@ismtpd0002p1iad1.sendgrid.net>
X-SG-EID: pywWA7gL46oOK7j8609IHsuM8bBS72IBx+uWB+d8D/N9t0rE4+TMmdgXQpvC7JIN3ekubbU2qCgHqS
 7W8GJ+aKX8qAKYokC5jzRvyv4CX3KHlasoMaqSUGqYEuHYx1e9vMNhqBIB4+nZN4uZmnKvRrvnYMZy
 NtpRNDKB0S28xjv5CxGmqbRggtf8RLQ7d2s5RIuQwIMIZQ3nLl3OrnmbjtZAP91VtQFkbhRATrKx7i
 o=
X-SG-ID: 6l1ICXxVk1U2NQBE+KPgx+uy7/oBj9jrT6lO2L7BaL4cap+kBh3uUy+RmDmEF7s+mSBwxVfvlgfHyu
 osKIvS9Q==
X-Feedback-ID: 2693180:l1fkQA9YLlZ4PTqywTL3Zu+zLq2XYmkeuiZ1WV+xvFE=:l1fkQA9YLlZ4PTqywTL3Zu+zLq2XYmkeuiZ1WV+xvFE=:SG

--0082ce9e57fb837e9dfa9ca77bc69f450567ae3138b24a5db1e7237fc121
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset=UTF-8
Mime-Version: 1.0

Dear Clojurist,

Thanks again for being there. I am so lucky to have you here on
my PurelyFunctional.tv email list.

A lot of people ask me what it takes to be hirable in Clojure. Of
course, the answer is complicated, but the short version is "not
very much". I wrote about it.

Read What do I have to learn to be hirable in Clojure? ( http://t.dripemail=
2.com/c/eyJhY2NvdW50X2lkIjoiMzY1MTcxNyIsImRlbGl2ZXJ5X2lkIjoiMjE3NTQ4MzEyIiw=
idXJsIjoiaHR0cDovL3d3dy5saXNwY2FzdC5jb20vaGlyYWJsZS1pbi1jbG9qdXJlP19fcz15bj=
R6dm8xcnY5cGhkazR4cG11diJ9 )

Re: Problem with detection of .mbox file

Posted by Vjeran Marcinko <vm...@gmail.com>.
Thanx a bunch for the suggested workaround.

Also, I have checked, and the bug exists in the latest 1.4 nightly build.

-Vjeran

On Tue, Jul 26, 2016 at 2:22 AM, Luís Filipe Nassif <lf...@gmail.com> wrote:
> Hi,
>
> Based on https://en.wikipedia.org/wiki/Mbox, you can add the following entry
> in org/apache/tika/mime/custom-mimetypes.xml:
>
> <mime-type type="application/mbox">
>     <magic priority="70">
>         <match value="From " type="string" offset="0"/>
>     </magic>
>     <glob pattern="*.mbox"/>
> </mime-type>
>
> The priority must be greater than that of message/rfc822. This rule sometimes
> returns false positives, but it also detects mbox files without an extension,
> which are very common.
>
> Luis

Re: Problem with detection of .mbox file

Posted by Luís Filipe Nassif <lf...@gmail.com>.
Hi,

Based on https://en.wikipedia.org/wiki/Mbox, you can add the following
entry in org/apache/tika/mime/custom-mimetypes.xml:

<mime-type type="application/mbox">
    <magic priority="70">
        <match value="From " type="string" offset="0"/>
    </magic>
    <glob pattern="*.mbox"/>
</mime-type>

The priority must be greater than that of message/rfc822. This rule
sometimes returns false positives, but it also detects mbox files without
an extension, which are very common.

Luis
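
To illustrate why the priority matters, here is a toy sketch of priority-ordered magic matching. It is not Tika's implementation: the MagicRule class, the detect method, and the competing "text/x-hypothetical" rule with priority 50 are assumptions made up for the example. Only the "From " match at offset 0 with priority 70 comes from the entry above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Toy priority-ordered magic matcher. Illustrative only, not Tika's implementation. */
final class MagicPrioritySketch {

    /** One magic rule: if the bytes at offset equal value, the rule proposes type. */
    static final class MagicRule {
        final String type;
        final int priority;
        final byte[] value;
        final int offset;

        MagicRule(String type, int priority, String value, int offset) {
            this.type = type;
            this.priority = priority;
            this.value = value.getBytes(StandardCharsets.US_ASCII);
            this.offset = offset;
        }

        boolean matches(byte[] head) {
            if (offset < 0 || head.length - offset < value.length) {
                return false; // not enough bytes to compare
            }
            for (int i = 0; i < value.length; i++) {
                if (head[offset + i] != value[i]) {
                    return false;
                }
            }
            return true;
        }
    }

    /** Returns the type proposed by the highest-priority matching rule, if any. */
    static Optional<String> detect(byte[] head, List<MagicRule> rules) {
        return rules.stream()
                .filter(rule -> rule.matches(head))
                .max(Comparator.comparingInt((MagicRule rule) -> rule.priority))
                .map(rule -> rule.type);
    }

    /** The mbox rule from the entry above plus a hypothetical lower-priority competitor. */
    static List<MagicRule> exampleRules() {
        return Arrays.asList(
                new MagicRule("text/x-hypothetical", 50, "From", 0),
                new MagicRule("application/mbox", 70, "From ", 0));
    }

    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        List<MagicRule> rules = exampleRules();
        // First line of the mbox export quoted in this thread: both rules match.
        System.out.println(detect(
                ascii("From 1540828415824941917@xxx Mon Jul 25 12:08:06 +0000 2016\n"), rules));
        // Ordinary text can also start with "From ", which is the false-positive case.
        System.out.println(detect(ascii("From the desk of the editor\n"), rules));
        // No rule matches, so no type is proposed.
        System.out.println(detect(ascii("<html>"), rules));
    }
}
```

When both rules match the same bytes, the one with the larger priority decides the type, which is the reason the mbox entry needs a priority above that of any competing match.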


RE: Problem with detection of .mbox file

Posted by "Allison, Timothy B." <ta...@mitre.org>.
    <repositories>
        <repository>
            <id>apache.snapshots</id>
            <name>Apache Development Snapshot Repository</name>
            <url>https://repository.apache.org/content/repositories/snapshots/</url>
            <releases>
                <enabled>false</enabled>
            </releases>
            <snapshots>
                <enabled>true</enabled>
            </snapshots>
        </repository>
    </repositories>

-----Original Message-----
From: Vjeran Marcinko [mailto:vmarcinko@gmail.com] 
Sent: Monday, July 25, 2016 3:25 PM
To: user@tika.apache.org
Subject: Re: Problem with detection of .mbox file

Thanx guys, I can do it in some clumsy way, but before I try it, is there some Maven repo for such nightly builds that I can include to specify these 1.4-SNAPSHOT deps?


Re: Problem with detection of .mbox file

Posted by Vjeran Marcinko <vm...@gmail.com>.
Thanx guys, I can do it in some clumsy way, but before I try it, is
there some Maven repo for such nightly builds that I can include to
specify these 1.4-SNAPSHOT deps?

On Mon, Jul 25, 2016 at 9:14 PM, Allison, Timothy B. <ta...@mitre.org> wrote:
>> Can you try with a recent Tika nightly build?
> e.g. https://builds.apache.org/job/Tika-trunk/lastBuild/org.apache.tika$tika-app/

RE: Problem with detection of .mbox file

Posted by "Allison, Timothy B." <ta...@mitre.org>.
> Can you try with a recent Tika nightly build?
e.g. https://builds.apache.org/job/Tika-trunk/lastBuild/org.apache.tika$tika-app/


Re: Problem with detection of .mbox file

Posted by Nick Burch <ap...@gagravarr.org>.
On Mon, 25 Jul 2016, Vjeran Marcinko wrote:
> I first noticed that my .mbox file doesn't get parsed by MBoxParser,
> and later, after debugging Tika source code, I found what the problem
> is - default detector doesn't even recognize it as "application/mbox"
> MIME type, and although file extension is .mbox, it ignores this hint
> because its "magic" way of detecting file type based on some amount of
> initial bytes detects it is "text/html"

Can you try with a recent Tika nightly build? There have been some
tweaks around that sort of thing recently.

If a nightly build / build from Git still shows the issue, please open a 
bug in Jira and attach a problematic file, then we can take a look!

Nick