Posted to user@tika.apache.org by Eli Trucco <th...@yahoo.com> on 2016/07/20 14:38:38 UTC
Problems with email attachments
Hi guys,
I'm currently writing a small app that reads a directory and parses
all documents inside it, including extracting their attachments/embedded
files (if any exist). I use Tika for this, but I ran into a couple of
problems while parsing .eml files from Thunderbird. Some of them are
wrongly identified (as text/html or application/xhtml+xml), and in a lot
of them the attachments are not detected. I tried parsing 20 random .eml
files with attachments (pdf, txt, html, etc.), and at least 10 of them
are either identified as HTML, or correctly identified as rfc822 but
without their attachments being extracted. I tried the same files using
the TikaCLI -z option, with the same result :(
What I did: I extended the class ParsingEmbeddedDocumentExtractor to
extract and store the attachments somewhere else (exactly as shown in
this example code:
https://github.com/apache/tika/blob/master/tika-example/src/main/java/org/apache/tika/example/ExtractEmbeddedFiles.java).
Is this the correct approach, or is there another way to do it? Any ideas
on how to improve the type detection or extract the attachments more
reliably would be really appreciated!
Regards,
Eli Trucco
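For anyone trying to narrow down which .eml files are worth attaching to a ticket, a rough sanity check along the following lines may help. It is only a sketch using the Java standard library: it does not call Tika and does not reproduce Tika's detection logic. The class name EmlQuickCheck and both method names are hypothetical. It counts the header-like lines at the top of a file and the number of "Content-Disposition: attachment" lines, which gives a crude expectation to compare against what Tika reports.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;

// Hypothetical stdlib-only helper; it neither uses nor mimics Tika's detection.
class EmlQuickCheck {

    // Counts "Name: value" lines before the first blank line.
    // Stops at the first line that is neither a header nor a folded continuation.
    static int countLeadingHeaders(List<String> lines) {
        int count = 0;
        for (String line : lines) {
            if (line.isEmpty()) {
                break; // a blank line ends the header block
            }
            if (line.startsWith(" ") || line.startsWith("\t")) {
                continue; // folded continuation of the previous header
            }
            // Field name: printable ASCII except ':' followed by a colon.
            if (line.matches("[!-9;-~]+:.*")) {
                count++;
            } else {
                break;
            }
        }
        return count;
    }

    // Rough count of MIME parts declared as attachments (assumes the word
    // "attachment" sits on the same line as the Content-Disposition header).
    static int countAttachmentHeaders(List<String> lines) {
        int count = 0;
        for (String line : lines) {
            String lower = line.toLowerCase(Locale.ROOT);
            if (lower.startsWith("content-disposition:") && lower.contains("attachment")) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            // ISO_8859_1 maps every byte to a char, so 8-bit bodies do not cause decoding errors.
            List<String> lines = Files.readAllLines(Paths.get(arg), StandardCharsets.ISO_8859_1);
            System.out.println(arg + ": headers=" + countLeadingHeaders(lines)
                    + ", attachmentParts=" + countAttachmentHeaders(lines));
        }
    }
}
```

A file that reports zero leading headers, or that declares attachment parts which never show up in the extraction output, is probably a good candidate to share as an example.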
Re: Problems with email attachments
Posted by Eli Trucco <th...@yahoo.com>.
I have created a ticket here:
https://issues.apache.org/jira/browse/TIKA-2037
Regards,
Eli Trucco
On 20.07.2016 16:56, Allison, Timothy B. wrote:
> Are you able to share some examples? If so, please open a ticket.
RE: Problems with email attachments
Posted by "Allison, Timothy B." <ta...@mitre.org>.
Are you able to share some examples? If so, please open a ticket.