Posted to user@tika.apache.org by Swapna Vuppala <Sw...@arup.com> on 2011/09/28 12:51:28 UTC
Metadata extracted by OutlookExtractor
Hi,
I am new to Solr and Tika, and I am trying to index .msg files (Outlook mails) into Solr. For this, I need a list of the metadata that Tika extracts from emails. I would like to know which fields of a .msg file are extracted by Tika's OutlookExtractor.
Can you please direct me to where I can find such a list, and tell me how I can customize the existing parser to get more metadata from emails (such as the number of attachments, or the counts of embedded and non-embedded ones)?
Thanks in advance,
Swapna.
____________________________________________________________
Electronic mail messages entering and leaving Arup business
systems are scanned for acceptability of content and viruses
RE: Metadata extracted by OutlookExtractor
Posted by Swapna Vuppala <Sw...@arup.com>.
Thanks for the info, Nick. I'll have a look at that.
Best Regards,
Swapna.
-----Original Message-----
From: Nick Burch [mailto:nick.burch@alfresco.com]
Sent: Wednesday, September 28, 2011 4:29 PM
To: user@tika.apache.org
Subject: Re: Metadata extracted by OutlookExtractor
Re: Metadata extracted by OutlookExtractor
Posted by Nick Burch <ni...@alfresco.com>.
On Wed, 28 Sep 2011, Swapna Vuppala wrote:
> I am new to Solr and Tika, and I am trying to index .msg files (Outlook
> mails) into Solr. For this, I need a list of the metadata that Tika
> extracts from emails. I would like to know which fields of a .msg file
> are extracted by Tika's OutlookExtractor.
Your best bet is probably just to look at the code:
http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java
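Alongside reading the source, one way to see which fields actually come back for a given message is to parse it and print every metadata name. The following is only a minimal sketch: the class name `ListMsgMetadata`, the helper `describe`, and the file path passed as `args[0]` are hypothetical, and it assumes tika-core and tika-parsers are on the classpath. The exact field names printed depend on the Tika version and on the message itself.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class ListMsgMetadata {

    // Hypothetical helper: formats each metadata field as "name = [values]",
    // one per line, sorted by field name.
    public static String describe(Metadata metadata) {
        String[] names = metadata.names();
        Arrays.sort(names);
        StringBuilder out = new StringBuilder();
        for (String name : names) {
            out.append(name)
               .append(" = ")
               .append(Arrays.toString(metadata.getValues(name)))
               .append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0] is assumed to be the path to a .msg file, e.g. "sample.msg".
        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            // A write limit of -1 disables the default cap on extracted body text.
            new AutoDetectParser().parse(
                    stream, new BodyContentHandler(-1), metadata, new ParseContext());
        }
        System.out.print(describe(metadata));
    }
}
```

Running this against a few representative mails gives a practical field list to map onto the Solr schema.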
> how I can customize the existing parser to get more metadata from emails
> (such as the number of attachments, or the counts of embedded and
> non-embedded ones)?
If you want to know about attachments, you'll need to register a recursing
Parser on the ParseContext. It will then be called once per attachment,
and you can do whatever you want with the information at that point.
Nick
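
As an illustration of the approach described above, here is a rough sketch of a parser registered on the ParseContext that counts how often it is called. The names `CountMsgAttachments` and `CountingParser` are hypothetical, the file path comes from `args[0]`, and tika-core plus tika-parsers are assumed to be on the classpath. The count covers every embedded document Tika hands to the registered parser, which may include nested attachments and inline images, so it is not necessarily the number of attachments a mail client would show.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ParserDecorator;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class CountMsgAttachments {

    // Hypothetical helper: wraps another parser and counts each invocation
    // before delegating to it.
    public static class CountingParser extends ParserDecorator {
        private int count = 0;

        public CountingParser(Parser parser) {
            super(parser);
        }

        @Override
        public void parse(InputStream stream, ContentHandler handler,
                          Metadata metadata, ParseContext context)
                throws IOException, SAXException, TikaException {
            count++;
            // Delegating keeps the recursion going for nested documents.
            super.parse(stream, handler, metadata, context);
        }

        public int getCount() {
            return count;
        }
    }

    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        CountingParser counting = new CountingParser(parser);

        // Register the counting parser so it is used for embedded documents.
        ParseContext context = new ParseContext();
        context.set(Parser.class, counting);

        // args[0] is assumed to be the path to a .msg file.
        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            // The top-level parse goes through the plain parser, so the
            // message itself is not counted.
            parser.parse(stream, new BodyContentHandler(-1), new Metadata(), context);
        }
        System.out.println("Embedded documents seen: " + counting.getCount());
    }
}
```

Inside `parse`, the `metadata` argument describes the individual embedded document, so the same hook could be used to record names or content types instead of just a count.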