Posted to user@tika.apache.org by Vjeran Marcinko <vm...@gmail.com> on 2016/07/28 07:36:34 UTC

Problem with detection of RFC822 message

Hello again,

Just as I resolved the problem with the MBOX parser, I noticed that it
doesn't correctly detect the contained RFC822 messages as message/rfc822,
but usually as text/html or some variation of it.

And the same question as before: is there some workaround for 1.13 that
I could place in custom-mimetypes.xml to fix this?

Here is the start of one such message from my mbox file. I omitted the
MBOX message start line "From ", which just marks the start of each
contained message, because what follows is what gets sent to the
embedded parser, which doesn't recognize it as RFC822. I even extracted
this portion of the content to a separate file and convinced myself that
Tika really doesn't detect it as RFC822:

X-GM-THRID: 1512463556322914280
X-Gmail-Labels: Inbox,clojure
Delivered-To: vmarcinko@gmail.com
Received: by 10.31.204.67 with SMTP id c64csp1943840vkg;
        Wed, 16 Sep 2015 03:00:48 -0700 (PDT)
X-Received: by 10.140.238.214 with SMTP id j205mr1658705qhc.21.1442397647994;
        Wed, 16 Sep 2015 03:00:47 -0700 (PDT)
Return-Path: <m-...@bounce.linkedin.com>
Received: from mailb-af.linkedin.com (mailb-af.linkedin.com. [108.174.3.150])
        by mx.google.com with ESMTPS id q7si21212015qki.84.2015.09.16.03.00.47
        for <vm...@gmail.com>
        (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
        Wed, 16 Sep 2015 03:00:47 -0700 (PDT)
Received-SPF: pass (google.com: domain of
m-86i29s6rppu2flx1nqebu0g0hk5wgxj5s0vlvfx11p94yc32jypnkf41i0j@bounce.linkedin.com
designates 108.174.3.150 as permitted sender) client-ip=108.174.3.150;
Authentication-Results: mx.google.com;
       spf=pass (google.com: domain of
m-86i29s6rppu2flx1nqebu0g0hk5wgxj5s0vlvfx11p94yc32jypnkf41i0j@bounce.linkedin.com
designates 108.174.3.150 as permitted sender)
smtp.mailfrom=m-86i29s6rppu2flx1nqebu0g0hk5wgxj5s0vlvfx11p94yc32jypnkf41i0j@bounce.linkedin.com;
       dkim=pass header.i=@linkedin.com;
       dmarc=pass (p=REJECT dis=NONE) header.from=linkedin.com
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linkedin.com;
   s=proddkim1024; t=1442397647;
   bh=ZsM2cpYAX84d5ECwhjitGaKaCqYUJu7THSfox9AGoGs=;
   h=From:Subject:MIME-Version:Content-Type:To:Date:X-LinkedIn-Class:
    X-LinkedIn-Template:X-LinkedIn-fbl;
   b=1rRg1j7tjk4zOq0f/yFbL4EbM2JuVP9c5yKr7FdpYYdoTRytYoLbdXjLrawfgvgh+
    dJ7L20UCIOrIyft1tez88CK/NkJ9g0fuor4klj+lpQ57NN/XURbXukRwJBwWpCGJ+g
    pYc3hZgxJ/DrKILG1xTfoUO9qW3AziA6CGCNprr4=
From: Paulina Peczkowska <hi...@linkedin.com>
Message-ID: <62...@lva1-app2979.prod.linkedin.com>
Subject: =?UTF-8?Q?BIG_DATA_Developer/_Engineer_Wante?=
 =?UTF-8?Q?d!_=E2=80=93_Job_offer_in_WrocLove,_Poland?=
MIME-Version: 1.0
Content-Type: multipart/mixed;
   boundary="----=_Part_80197_1293222758.1442397647784"
To: Vjeran Marcinko <vm...@gmail.com>
Date: Wed, 16 Sep 2015 10:00:47 +0000 (UTC)
X-LinkedIn-Class: INMAIL
X-LinkedIn-Template: inmail_sent
X-LinkedIn-fbl:
m2-aszuze4gtmmy1h9ue2u7ub4ja7lqa8rsm59a91z648i3mnj3ljeeftgfblvpeyfttcogcdluvmwr6zwye8x4iqxk9nfut2ks3v79en
X-LinkedIn-Id: 5t9t4k-iemmbb9o-1s
List-Unsubscribe:
<mailto:list-unsubscribe@linkedin.com?subject=unsubscribe/AQFzeuoICPGM0QAAAU_VmXOnwp9hn8S4D89ESIFfgOZDuW-H1luaGdeqgtrsCStdPHfZwYCxr1-9TPs/5t9t4k-iemmbb9o-1s/m2-aszuze4gtmmy1h9ue2u7ub4ja7lqa8rsm59a91z648i3mnj3ljeeftgfblvpeyfttcogcdluvmwr6zwye8x4iqxk9nfut2ks3v79en>
Reply-To: Paulina Peczkowska
<67...@reply.linkedin.com>
Feedback-ID: inmail_sent:linkedin

------=_Part_80197_1293222758.1442397647784
Content-Type: multipart/alternative;
   boundary="----=_Part_80194_920030466.1442397647781"

------=_Part_80194_920030466.1442397647781
Content-Type: text/plain;charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Content-ID: text-body

BIG DATA Developer/ Engineer Wanted! =E2=80=93 Job offer in WrocLove, Polan=
d

Dear Vjeran,
<br>

<br>
My name is Paulina P&#x119;czkowska and I&#x2019;m a recruitment specialist=
 at IT Kontrakt GmbH=20
<br>
dedicated to international projects.=20
<br>
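
As a rough illustration of what "looks like RFC822" could mean for content such as the sample above, here is a minimal, self-contained sketch that checks whether text begins with a block of header fields ("Name: value" lines, optionally followed by indented continuation lines). This is not Tika's detection logic and not a fix for it; the class name Rfc822HeaderSniffer, the method looksLikeRfc822 and the minFields threshold are all hypothetical and exist only for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of Apache Tika: a simple heuristic only.
public class Rfc822HeaderSniffer {

    // A header field line: a name of printable ASCII characters
    // (excluding space and ':'), followed by a colon.
    private static final Pattern FIELD =
            Pattern.compile("^[\\x21-\\x39\\x3B-\\x7E]+:.*$");

    /**
     * Returns true if the text starts with at least minFields header
     * fields before the first empty line. Any line that is neither a
     * field nor an indented continuation makes the check fail.
     */
    public static boolean looksLikeRfc822(String text, int minFields) {
        int fields = 0;
        for (String line : text.split("\r?\n", -1)) {
            if (line.isEmpty()) {
                break; // an empty line ends the header block
            }
            if (line.startsWith(" ") || line.startsWith("\t")) {
                if (fields == 0) {
                    return false; // continuation with nothing to continue
                }
                continue; // folded continuation of the previous field
            }
            if (!FIELD.matcher(line).matches()) {
                return false; // not a header line at all
            }
            fields++;
        }
        return fields >= minFields;
    }
}
```

The check is deliberately strict: a continuation line that has lost its leading whitespace (as can happen when a message is re-wrapped by a mail archive) is treated as a non-header line and rejected.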

Re: Problem with detection of RFC822 message

Posted by Luís Filipe Nassif <lf...@gmail.com>.
Check TIKA-879, where a general solution was discussed. Problems with
rfc822 detection come up frequently.

Luis

Re: Problem with detection of RFC822 message

Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 28 Jul 2016, Vjeran Marcinko wrote:
> Just as I resolved the problem with MBOX parser, I noticed that it
> doesn't correctly detect contained RFC822 messages as message/rfc822,
> but usually text/html or some variation of it.
>
> And question as before, is there some workaround for 1.13 to place in
> custom-mimetypes.xml that would fix this?

Can you create a small junit testcase that shows the problem, using either 
a small mbox file of your own, or one of the ones in the tika-parsers test 
documents directory? Attach that to a new JIRA issue, and one of us can 
use it to take a look at what's going wrong. Once we know the underlying 
issue, we can hopefully fix it, and maybe let you know a workaround!
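
One possible way to prepare a small test input from a larger mbox file is to cut it into individual messages and drop each "From " separator line, which is what the original post describes doing by hand. The sketch below uses only the Java standard library; the class name MboxSplitter and its split method are hypothetical and are not part of Tika. It assumes that every line starting with "From " is a separator, so a body line that begins with an unescaped "From " would be split incorrectly.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, NOT part of Apache Tika.
public class MboxSplitter {

    /**
     * Splits mbox text into messages, dropping each "From " separator
     * line. Assumption: any line starting with "From " is a separator.
     * Lines before the first separator are ignored.
     */
    public static List<String> split(String mbox) {
        List<String> messages = new ArrayList<>();
        StringBuilder current = null;
        for (String line : mbox.split("\r?\n")) {
            if (line.startsWith("From ")) {
                if (current != null) {
                    messages.add(current.toString());
                }
                current = new StringBuilder(); // start the next message
            } else if (current != null) {
                current.append(line).append('\n');
            }
        }
        if (current != null) {
            messages.add(current.toString());
        }
        return messages;
    }
}
```

Each returned string could then be written to its own file and passed to the detector in a test, so that the failing message is isolated from the rest of the mbox.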

Nick