Posted to user@tika.apache.org by Vjeran Marcinko <vj...@email.t-com.hr> on 2015/11/13 08:23:52 UTC
Detection problem with RFC822 file with HTML content
Hello,
I saved two .eml files from Thunderbird: one of them contains plain
text content, whereas the other one contains rich HTML content.
Tika recognized the plain text one as a "message/rfc822" file, but
detected the other one incorrectly as "text/html" (and, as a result,
extracted its textual content incorrectly).
Any suggestions on how to overcome this?
Here is my HTML .eml file from Thunderbird:
X-Mozilla-Status: 0001
X-Mozilla-Status2: 01000000
X-Mozilla-Keys:
FCC: mailbox://vmarcin@some.smptpserver.com/Sent
X-Identity-Key: id1
X-Account-Key: account1
From: Vjeran Marcinko <vj...@someemail.com>
Subject: My rich mail with signature
To: somedestination@someone.com
Message-ID: <56...@someemail.com>
Date: Fri, 13 Nov 2015 07:07:42 +0100
X-Mozilla-Draft-Info: internal/draft; vcard=0; receipt=0; DSN=0;
uuencode=0;
attachmentreminder=0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101
Thunderbird/38.3.0
MIME-Version: 1.0
Content-Type: multipart/related;
boundary="------------010102060501000809020808"
This is a multi-part message in MIME format.
--------------010102060501000809020808
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: 7bit
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<title>My rich mail with signature</title>
</head>
<body text="#000000" bgcolor="#FFFFFF">
This is the beginning of <b>rich formatted email text</b>. Here is
my signature: <img alt="here should be signature picture"
src="cid:part1.05070507.06010907@email.t-com.hr" height="104"
width="182" align="middle"><br>
After that the <font color="#ff0000">RED COLOR </font> is shown.<br>
<br>
</body>
</html>
--------------010102060501000809020808
Content-Type: image/jpeg
Content-Transfer-Encoding: base64
Content-ID: <pa...@someemail.com>
/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAIBAQIBAQICAgICAgICAwUDAwMDAwYEBAMFBwYH
BwcGBwcICQsJCAgKCAcHCg0KCgsMDAwMBwkODw0MDgsMDAz/2wBDAQICAgMDAwYDAwYMCAcI
DAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAz/wAAR
CABoALYDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAA
AgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkK
FhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWG
....
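One possible stopgap, sketched here only as an illustration and not as how Tika itself detects mail, is to check the start of the file for RFC 822 style header fields and override a "text/html" result when they are present. The class name EmlSniffer, both method names, and the threshold of two header fields are assumptions made up for this sketch; the detected type passed in would come from whatever detection call you already use.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Apache Tika.
final class EmlSniffer {
    // A header field line: a name made of printable ASCII (no space, no colon),
    // followed by a colon and an optional value.
    private static final Pattern HEADER_FIELD =
            Pattern.compile("[\\x21-\\x39\\x3B-\\x7E]+:.*");

    // Returns true if the prefix starts with at least two header fields.
    // The threshold of two is an arbitrary assumption of this sketch.
    static boolean looksLikeRfc822(byte[] prefix) {
        String text = new String(prefix, StandardCharsets.US_ASCII);
        String[] lines = text.split("\r?\n", -1);
        int headerFields = 0;
        // Skip the last element: it is either empty or a possibly truncated line.
        for (int i = 0; i < lines.length - 1; i++) {
            String line = lines[i];
            if (line.isEmpty()) {
                break; // a blank line ends the header section
            }
            char first = line.charAt(0);
            if (first == ' ' || first == '\t') {
                // Folded continuation line; only valid after a header field.
                if (headerFields == 0) {
                    return false;
                }
                continue;
            }
            if (!HEADER_FIELD.matcher(line).matches()) {
                return false;
            }
            headerFields++;
        }
        return headerFields >= 2;
    }

    // Overrides a "text/html" result when the content starts like a mail message.
    static String correctedType(String detectedType, byte[] prefix) {
        if ("text/html".equals(detectedType) && looksLikeRfc822(prefix)) {
            return "message/rfc822";
        }
        return detectedType;
    }
}
```

Passing the first few kilobytes of the file is enough, since only the header section is inspected. This is a heuristic and can misfire on other text that happens to begin with "Name: value" lines, so it is best applied only to files you already expect to be mail.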
Re: Detection problem with RFC822 file with HTML content
Posted by Vjeran Marcinko <vj...@email.t-com.hr>.
Ok, here it is:
https://issues.apache.org/jira/browse/TIKA-1793
-Vjeran
On 13.11.2015 13:48, Nick Burch wrote:
> On Fri, 13 Nov 2015, Vjeran Marcinko wrote:
>> On 13.11.2015 11:51, Nick Burch wrote:
>>> On Fri, 13 Nov 2015, Vjeran Marcinko wrote:
>>>> I saved 2 .eml files saved by my Thunderbird, and one of them
>>>> contained plain text content, whereas other one rich HTML content.
>>>
>>> Did you try with the latest version of Apache Tika? IIRC we did some
>>> fixes around this moderately recently
>>
>> Yep, I'm using v1.11
>
> Can you open a new jira bug entry, and attach one of the emails which
> isn't being detected?
>
> Nick
>
Re: Detection problem with RFC822 file with HTML content
Posted by Nick Burch <ap...@gagravarr.org>.
On Fri, 13 Nov 2015, Vjeran Marcinko wrote:
> On 13.11.2015 11:51, Nick Burch wrote:
>> On Fri, 13 Nov 2015, Vjeran Marcinko wrote:
>>> I saved 2 .eml files saved by my Thunderbird, and one of them contained
>>> plain text content, whereas other one rich HTML content.
>>
>> Did you try with the latest version of Apache Tika? IIRC we did some
>> fixes around this moderately recently
>
> Yep, I'm using v1.11
Can you open a new jira bug entry, and attach one of the emails which
isn't being detected?
Nick
Re: Detection problem with RFC822 file with HTML content
Posted by Vjeran Marcinko <vj...@email.t-com.hr>.
Yep, I'm using v1.11
On 13.11.2015 11:51, Nick Burch wrote:
> On Fri, 13 Nov 2015, Vjeran Marcinko wrote:
>> I saved 2 .eml files saved by my Thunderbird, and one of them
>> contained plain text content, whereas other one rich HTML content.
>
> Did you try with the latest version of Apache Tika? IIRC we did some
> fixes around this moderately recently
>
> Nick
>
Re: Detection problem with RFC822 file with HTML content
Posted by Nick Burch <ap...@gagravarr.org>.
On Fri, 13 Nov 2015, Vjeran Marcinko wrote:
> I saved 2 .eml files saved by my Thunderbird, and one of them contained
> plain text content, whereas other one rich HTML content.
Did you try with the latest version of Apache Tika? IIRC we did some fixes
around this moderately recently
Nick