Posted to user@tika.apache.org by Vjeran Marcinko <vj...@email.t-com.hr> on 2015/11/13 08:23:52 UTC

Detection problem with RFC822 file with HTML content

Hello,

I saved two .eml files from Thunderbird. One of them contains plain text 
content, whereas the other contains rich HTML content.

Tika recognized the plain text one as a "message/rfc822" file, but 
detected the other one incorrectly as "text/html" (and its textual 
content was extracted incorrectly).

Any suggestions on how to overcome this?

Here is my HTML .eml file from Thunderbird:

X-Mozilla-Status: 0001
X-Mozilla-Status2: 01000000
X-Mozilla-Keys:
FCC: mailbox://vmarcin@some.smptpserver.com/Sent
X-Identity-Key: id1
X-Account-Key: account1
From: Vjeran Marcinko <vj...@someemail.com>
Subject: My rich mail with signature
To: somedestination@someone.com
Message-ID: <56...@someemail.com>
Date: Fri, 13 Nov 2015 07:07:42 +0100
X-Mozilla-Draft-Info: internal/draft; vcard=0; receipt=0; DSN=0; 
uuencode=0;
  attachmentreminder=0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101
  Thunderbird/38.3.0
MIME-Version: 1.0
Content-Type: multipart/related;
  boundary="------------010102060501000809020808"

This is a multi-part message in MIME format.
--------------010102060501000809020808
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: 7bit

<html>
   <head>
     <meta http-equiv="content-type" content="text/html; charset=utf-8">
     <title>My rich mail with signature</title>
   </head>
   <body text="#000000" bgcolor="#FFFFFF">
     This is the beginning of <b>rich formatted email text</b>. Here is
     my signature: <img alt="here should be signature picture"
       src="cid:part1.05070507.06010907@email.t-com.hr" height="104"
       width="182" align="middle"><br>
     After that the <font color="#ff0000">RED COLOR </font> is shown.<br>
     <br>
   </body>
</html>

--------------010102060501000809020808
Content-Type: image/jpeg
Content-Transfer-Encoding: base64
Content-ID: <pa...@someemail.com>

/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAIBAQIBAQICAgICAgICAwUDAwMDAwYEBAMFBwYH
BwcGBwcICQsJCAgKCAcHCg0KCgsMDAwMBwkODw0MDgsMDAz/2wBDAQICAgMDAwYDAwYMCAcI
DAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAz/wAAR
CABoALYDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAA
AgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkK
FhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWG
....
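
For readers hitting the same misdetection, one possible stopgap is to check for an RFC 822 header block before trusting a "text/html" result. The following is a minimal sketch under stated assumptions, not Tika's detection logic and not part of its API: the class EmlSniffer, its methods, the list of header names, and the "at least two known headers" threshold are all hypothetical choices made for illustration. It uses only the Java standard library; the detected type passed in would come from whatever detector you already call.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Apache Tika.
final class EmlSniffer {
    // Header names treated as evidence of an email (an assumed list).
    private static final Set<String> MAIL_HEADERS = new HashSet<>(Arrays.asList(
            "from", "to", "subject", "date", "message-id", "mime-version"));

    // True if the leading header block holds at least two known mail headers.
    static boolean looksLikeRfc822(String content) {
        int known = 0;
        boolean sawHeader = false;
        for (String line : content.split("\r?\n", -1)) {
            if (line.isEmpty()) {
                break; // a blank line ends the header block
            }
            char first = line.charAt(0);
            if (first == ' ' || first == '\t') {
                if (!sawHeader) {
                    break; // folded line with no header before it
                }
                continue; // continuation of the previous header
            }
            int colon = line.indexOf(':');
            if (colon <= 0 || !isFieldName(line.substring(0, colon))) {
                break; // not a "Name: value" field, stop scanning
            }
            sawHeader = true;
            String name = line.substring(0, colon).toLowerCase(Locale.ROOT);
            if (MAIL_HEADERS.contains(name)) {
                known++;
            }
        }
        return known >= 2;
    }

    // Field names are printable ASCII without spaces (the colon is already excluded).
    private static boolean isFieldName(String name) {
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (c <= 32 || c >= 127) {
                return false;
            }
        }
        return true;
    }

    // Overrides the detected type only when the content looks like an email.
    static String overrideType(String detectedType, String content) {
        return looksLikeRfc822(content) ? "message/rfc822" : detectedType;
    }

    public static void main(String[] args) {
        String sample = "From: someone@example.com\nSubject: test\nMIME-Version: 1.0\n"
                + "Content-Type: multipart/related;\n  boundary=\"abc\"\n\n<html></html>";
        System.out.println(overrideType("text/html", sample));
        System.out.println(overrideType("text/html", "<html><body>hi</body></html>"));
    }
}
```

The scan stops at the first line that is not a header field instead of rejecting the file outright, because archived or re-wrapped messages can contain folded headers that have lost their leading whitespace. A file that begins with markup such as <html> yields no headers at all, so its detected type is left unchanged.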

Re: Detection problem with RFC822 file with HTML content

Posted by Vjeran Marcinko <vj...@email.t-com.hr>.
Ok, here it is:
https://issues.apache.org/jira/browse/TIKA-1793

-Vjeran

On 13.11.2015 13:48, Nick Burch wrote:
> On Fri, 13 Nov 2015, Vjeran Marcinko wrote:
>> On 13.11.2015 11:51, Nick Burch wrote:
>>> On Fri, 13 Nov 2015, Vjeran Marcinko wrote:
>>>> I saved 2 .eml files saved by my Thunderbird, and one of them 
>>>> contained plain text content, whereas other one rich HTML content.
>>>
>>> Did you try with the latest version of Apache Tika? IIRC we did some 
>>> fixes around this moderately recently
>>
>> Yep, I'm using v1.11
>
> Can you open a new jira bug entry, and attach one of the emails which 
> isn't being detected?
>
> Nick
>


Re: Detection problem with RFC822 file with HTML content

Posted by Nick Burch <ap...@gagravarr.org>.
On Fri, 13 Nov 2015, Vjeran Marcinko wrote:
> On 13.11.2015 11:51, Nick Burch wrote:
>> On Fri, 13 Nov 2015, Vjeran Marcinko wrote:
>>> I saved 2 .eml files saved by my Thunderbird, and one of them contained 
>>> plain text content, whereas other one rich HTML content.
>> 
>> Did you try with the latest version of Apache Tika? IIRC we did some 
>> fixes around this moderately recently
>
> Yep, I'm using v1.11

Can you open a new JIRA issue and attach one of the emails that 
isn't being detected?

Nick

Re: Detection problem with RFC822 file with HTML content

Posted by Vjeran Marcinko <vj...@email.t-com.hr>.
Yep, I'm using v1.11

On 13.11.2015 11:51, Nick Burch wrote:
> On Fri, 13 Nov 2015, Vjeran Marcinko wrote:
>> I saved 2 .eml files saved by my Thunderbird, and one of them 
>> contained plain text content, whereas other one rich HTML content.
>
> Did you try with the latest version of Apache Tika? IIRC we did some 
> fixes around this moderately recently
>
> Nick
>


Re: Detection problem with RFC822 file with HTML content

Posted by Nick Burch <ap...@gagravarr.org>.
On Fri, 13 Nov 2015, Vjeran Marcinko wrote:
> I saved 2 .eml files saved by my Thunderbird, and one of them contained 
> plain text content, whereas other one rich HTML content.

Did you try with the latest version of Apache Tika? IIRC we made some 
fixes around this fairly recently.

Nick