Posted to user@tika.apache.org by Eli Trucco <th...@yahoo.com> on 2016/07/20 14:38:38 UTC

Problems with email attachments

Hi guys,

I'm currently writing a small app that reads a directory and parses all 
documents inside it, including extracting their attachments/embedded 
files (if any exist). I use Tika for this, but I have run into a couple 
of problems while parsing .eml files from Thunderbird. Some of them are 
wrongly identified (as text/html or application/xhtml+xml), and in many 
of them the attachments are not detected. I tried parsing 20 random .eml 
files with attachments (pdf, txt, html, etc.), and at least 10 of them 
are either identified as html, or correctly identified as rfc822 but 
without the attachments being extracted. I tried the same files using 
the TikaCLI -z option, with the same result :(

What I did: I extended the class ParsingEmbeddedDocumentExtractor to 
extract and store the attachments somewhere else (exactly as shown in 
this example code 
https://github.com/apache/tika/blob/master/tika-example/src/main/java/org/apache/tika/example/ExtractEmbeddedFiles.java). 
Is this the correct approach, or is there another way to do it? Any 
ideas on how to improve the type detection or extract the attachments 
more reliably would be really appreciated!

Regards,

Eli Trucco
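
For the misidentified .eml files described above, one quick check that does not depend on Tika is to look at how each file begins. The sketch below is only a diagnostic aid, on the assumption that the misdetection may be related to the first line of the file (for example an HTML fragment or an mbox-style "From " separator instead of a header field). The class name EmlFirstLineCheck, its helper methods, and the header pattern are illustrative and not part of Tika.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Pattern;

// Hypothetical helper: reports whether each .eml file starts with a header field.
class EmlFirstLineCheck {

    // A header field line such as "Received: ..." or "X-Mozilla-Status: 0001":
    // printable ASCII characters other than space and colon, followed by a colon.
    private static final Pattern HEADER_LINE = Pattern.compile("^[!-9;-~]+:.*");

    static boolean looksLikeHeader(String line) {
        return line != null && HEADER_LINE.matcher(line).matches();
    }

    static String firstNonBlankLine(Path file) throws IOException {
        // ISO-8859-1 maps every byte to a character, so unusual encodings
        // in the mail body cannot cause a decoding error here.
        try (BufferedReader reader =
                 Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    return line;
                }
            }
        }
        return null; // empty file or only blank lines
    }

    public static void main(String[] args) throws IOException {
        // Directory to scan; defaults to the current directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.eml")) {
            for (Path file : files) {
                String first = firstNonBlankLine(file);
                String verdict = looksLikeHeader(first) ? "header" : "other";
                System.out.printf("%-8s %s: %s%n", verdict, file.getFileName(), first);
            }
        }
    }
}
```

Running it against the mail directory and comparing the output for the correctly and incorrectly detected files may show whether the problem files share an unusual first line, which would be useful detail for a bug report.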


Re: Problems with email attachments

Posted by Eli Trucco <th...@yahoo.com>.
I have created a ticket here: 
https://issues.apache.org/jira/browse/TIKA-2037

Regards,
Eli Trucco

On 20.07.2016 16:56, Allison, Timothy B. wrote:
> Are you able to share some examples?  If so, please open a ticket.


RE: Problems with email attachments

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Are you able to share some examples?  If so, please open a ticket.
