Posted to user@tika.apache.org by Mark Kerzner <ma...@elephantscale.com> on 2015/06/04 07:06:12 UTC

Tika parsing of emails

Hi,

Usually I just call new Tika().parse(myfile...), and Tika does all the work.

Is there anything special about *.eml files? How does Tika treat
attachments? What would be a good reference for me to read?

Thank you

-- 
Mark Kerzner, Managing Partner, Elephant Scale <http://elephantscale.com/>
Mobile: 713-724-2534, Skype: mark.kerzner1
https://www.linkedin.com/in/markkerzner
To schedule a meeting with me: http://www.meetme.so/markkerzner

Re: Tika parsing of emails

Posted by Chris Mattmann <ch...@gmail.com>.
woot!

----
Chris Mattmann
chris.mattmann@gmail.com

-----Original Message-----
From: Mark Kerzner <ma...@shmsoft.com>
Reply-To: <us...@tika.apache.org>
Date: Thursday, June 4, 2015 at 9:42 PM
To: Tika User <us...@tika.apache.org>
Subject: Re: Tika parsing of emails

>Thank you, Konstantin. That is a wealth of information that will last me
>for both my current project and the next two :)
>Mark



Re: Tika parsing of emails

Posted by Mark Kerzner <ma...@shmsoft.com>.
Thank you, Konstantin. That is a wealth of information that will last me
for both my current project and the next two :)

Mark

On Thu, Jun 4, 2015 at 3:44 AM, Konstantin Gribov <gr...@gmail.com> wrote:

> [...]


-- 
Mark Kerzner, President & CEO, SHMsoft <http://shmsoft.com/>,
To schedule a meeting with me: http://www.meetme.so/markkerzner

Mobile: 713-724-2534
Skype: mark.kerzner1
Office: One Riverway Suite 1700
Houston, TX 77056

*Privileged and Confidential *
<http://shmsoft.com/>

Re: Tika parsing of emails

Posted by Konstantin Gribov <gr...@gmail.com>.
Hi, Mark.

If you use the Tika facade, all text content, including attachments, is
delivered to the ContentHandler passed to parse(...). You can use
XHTMLContentHandler to receive each part of the email in its own <div
class="email-entry">. Tika usually parses content recursively and emits
everything to the ContentHandler.
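
To make the <div class="email-entry"> idea concrete, here is a minimal sketch of a SAX handler that collects the text of each such div separately. It uses only the JDK. The class name EmailEntryCollector and the sample XHTML in the usage note are made up for illustration and are not real Tika output. Because the class is an ordinary org.xml.sax ContentHandler, the assumption is that it could be handed to a Tika parser in place of a plain text handler.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: collects the text of each div with class "email-entry". */
class EmailEntryCollector extends DefaultHandler {
    private final List<String> entries = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    private int depth = 0; // element depth inside the current entry; 0 means outside

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // localName is empty when the SAX parser is not namespace-aware
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if (depth > 0) {
            depth++; // nested element inside an entry
        } else if ("div".equals(name) && "email-entry".equals(atts.getValue("class"))) {
            depth = 1; // entering a new entry
            current.setLength(0);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // when the entry's own div closes, store the collected text
        if (depth > 0 && --depth == 0) {
            entries.add(current.toString().trim());
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            current.append(ch, start, length);
        }
    }

    List<String> getEntries() {
        return entries;
    }

    /** Runs the collector over an XHTML string using the JDK's own SAX parser. */
    static List<String> collect(String xhtml) {
        try {
            EmailEntryCollector handler = new EmailEntryCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.getEntries();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XHTML", e);
        }
    }
}
```

Calling EmailEntryCollector.collect(...) on a string with two email-entry divs yields a list of two strings, one per message part.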

If you need more fine-grained control, take a look at RecursiveParserWrapper
(http://tika.apache.org/1.8/api/org/apache/tika/parser/RecursiveParserWrapper.html).
It returns a metadata object for each parsed document and each of its
children, with the content stored in that metadata object. It isn't
thread-safe (so create a new object for each thread), and you have to reset
it after each parse call. Also, this method is not suitable for large files,
since their content is held in memory.
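
The "one object per thread, reset after every parse" advice can be sketched in plain Java. StatefulCollector below is a hypothetical stand-in for a stateful wrapper such as RecursiveParserWrapper. It is not a Tika class and its methods are invented for the example; only the usage pattern is the point.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a stateful, non-thread-safe wrapper (NOT a Tika class). */
class StatefulCollector {
    private final List<String> results = new ArrayList<>();

    void parse(String document) {
        // a real wrapper would add one entry per document and per embedded child
        results.add("metadata for " + document);
    }

    List<String> getResults() {
        return results;
    }

    void reset() {
        results.clear();
    }
}

class PerThreadParsing {
    // One collector per thread, so no instance is ever shared across threads.
    private static final ThreadLocal<StatefulCollector> COLLECTOR =
            ThreadLocal.withInitial(StatefulCollector::new);

    static List<String> parseOne(String document) {
        StatefulCollector collector = COLLECTOR.get();
        try {
            collector.parse(document);
            // Copy before resetting: the internal list is cleared in finally.
            return new ArrayList<>(collector.getResults());
        } finally {
            collector.reset(); // always reset, even if parse() throws
        }
    }
}
```

Without the reset in the finally block, results from one message would leak into the list returned for the next message parsed on the same thread.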

If you need even more fine-grained control, use Apache James Mime4j
(which Tika itself uses to parse emails). If your application is
email-centric and you don't need the metadata normalization that Tika
provides for email messages, that can be the right way to go. Also, each
multipart message body can be parsed by Tika. I recommend setting at least
the content type on the metadata object, taken from the MIME Content-Type
header of the corresponding multipart/* part, before parsing it with Tika.
You'll get metadata and content for each message part, and you can stream
the content if it's quite large.
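
As a rough illustration of the Content-Type hint, the helper below reduces a raw MIME Content-Type header value to the bare media type you would put on the metadata object. It uses only the JDK. The class and method names are hypothetical, and a MIME library would normally do this parsing for you. The assumption is that the result is then set under Tika's Content-Type metadata key (Metadata.CONTENT_TYPE) before the part's stream is parsed.

```java
import java.util.Locale;

/** Hypothetical helper, not part of Tika or Mime4j. */
class PartContentType {
    /**
     * Returns the bare media type (for example "application/pdf") from a
     * Content-Type header value, or null if there is no value, so that the
     * caller can leave type detection to the parser.
     */
    static String mediaType(String headerValue) {
        if (headerValue == null || headerValue.trim().isEmpty()) {
            return null;
        }
        // Drop parameters such as charset=... or name=... after the first ';'
        int semicolon = headerValue.indexOf(';');
        String type = semicolon >= 0 ? headerValue.substring(0, semicolon) : headerValue;
        // Media types are case-insensitive, so normalize to lower case
        return type.trim().toLowerCase(Locale.ROOT);
    }
}
```

For a part whose header reads Application/PDF; name="report.pdf", the helper returns application/pdf.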

-- 
Best regards,
Konstantin Gribov
