Posted to user@tika.apache.org by Swapna Vuppala <Sw...@arup.com> on 2011/09/28 12:51:28 UTC

Metadata extracted by OutlookExtractor

Hi,

I am new to Solr and Tika, and I am trying to index .msg files (Outlook mails) into Solr. For this, I need a list of the metadata that Tika extracts from emails. I would like to know which fields Tika's OutlookExtractor extracts from a .msg file.

Could you please point me to where I can find such a list, and tell me how I can customize the existing parser to get more metadata from emails (such as the number of attachments, the count of embedded and non-embedded attachments, etc.)?

Thanks in advance,
Swapna.
____________________________________________________________
Electronic mail messages entering and leaving Arup  business
systems are scanned for acceptability of content and viruses

RE: Metadata extracted by OutlookExtractor

Posted by Swapna Vuppala <Sw...@arup.com>.
Thanks for the info, Nick. I'll have a look at that.

Best Regards,
Swapna.

-----Original Message-----
From: Nick Burch [mailto:nick.burch@alfresco.com] 
Sent: Wednesday, September 28, 2011 4:29 PM
To: user@tika.apache.org
Subject: Re: Metadata extracted by OutlookExtractor


Re: Metadata extracted by OutlookExtractor

Posted by Nick Burch <ni...@alfresco.com>.
On Wed, 28 Sep 2011, Swapna Vuppala wrote:
> Am new to using Solr and Tika. Am trying to index .msg files (Outlook 
> mails) into Solr. For this, I need a list of metadata extracted by Tika 
> from emails. I would like to know what all fields from a .msg file are 
> extracted by Tika's outlookextractor.

Your best bet is probably just to look at the code:
http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java
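
Besides reading the source, another way to see which fields come out is to run a sample message through Tika and print whatever ends up in the Metadata object. The following is only a minimal sketch, assuming Tika and its parsers module are on the classpath. The class name MsgMetadataDump and the helper dump() are made up for illustration, and the keys you actually see will depend on your Tika version and on the message itself.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical helper class, not part of Tika.
class MsgMetadataDump {

    // Parses the stream and returns every metadata name with its values,
    // sorted by name. Multi-valued fields are joined with " | ".
    static Map<String, String> dump(InputStream in) throws Exception {
        Metadata metadata = new Metadata();
        // -1 disables the default write limit on the extracted body text.
        new AutoDetectParser().parse(
                in, new BodyContentHandler(-1), metadata, new ParseContext());

        Map<String, String> fields = new TreeMap<>();
        for (String name : metadata.names()) {
            fields.put(name, String.join(" | ", metadata.getValues(name)));
        }
        return fields;
    }

    public static void main(String[] args) throws Exception {
        // args[0] is assumed to be the path to a sample .msg file.
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            for (Map.Entry<String, String> e : dump(in).entrySet()) {
                System.out.println(e.getKey() + " = " + e.getValue());
            }
        }
    }
}
```

Running this against a few representative mails should give a practical list of field names to map into the Solr schema.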

> how I can customize existing parser to get more metadata (like number of 
> attachments, count of embedded and non-embedded etc )from emails ?

If you want to know about attachments, you'll need to register a recursing 
Parser on the ParseContext. It'll then be called once per attachment, 
and you can do whatever you want with the information at that point.
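
A rough sketch of what registering such a parser could look like is below. It assumes Tika and its parsers module are on the classpath; the context class is org.apache.tika.parser.ParseContext. AttachmentCounter and listEmbedded() are hypothetical names, not Tika classes. It also assumes the embedded document's file name arrives under the "resourceName" metadata key, which may be missing for some attachments. Whether inline images are reported as embedded documents can vary by Tika version, so treat the count as a starting point.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ParserDecorator;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Hypothetical wrapper: records each document it is asked to parse,
// then hands the work to the wrapped parser so recursion continues.
class AttachmentCounter extends ParserDecorator {

    private final List<String> seen = new ArrayList<>();

    AttachmentCounter(Parser delegate) {
        super(delegate);
    }

    @Override
    public void parse(InputStream stream, ContentHandler handler,
                      Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException {
        // "resourceName" is assumed to hold the attachment's file name;
        // String.valueOf turns a missing name into the text "null".
        seen.add(String.valueOf(metadata.get("resourceName")));
        super.parse(stream, handler, metadata, context);
    }

    List<String> seen() {
        return seen;
    }

    // Parses the container with a plain AutoDetectParser and registers the
    // counter in the context, so only embedded documents are recorded.
    static List<String> listEmbedded(InputStream in)
            throws IOException, SAXException, TikaException {
        Parser top = new AutoDetectParser();
        AttachmentCounter counter = new AttachmentCounter(top);
        ParseContext context = new ParseContext();
        context.set(Parser.class, counter);
        top.parse(in, new BodyContentHandler(-1), new Metadata(), context);
        return counter.seen();
    }

    public static void main(String[] args) throws Exception {
        // args[0] is assumed to be the path to a sample .msg file.
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            List<String> names = listEmbedded(in);
            System.out.println("embedded documents: " + names.size());
            for (String name : names) {
                System.out.println("  " + name);
            }
        }
    }
}
```

The top-level call goes through the plain AutoDetectParser rather than the counter, so the message itself is not counted. Because the same context is passed down, attachments nested inside other attachments are recorded too.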

Nick