Posted to user@tika.apache.org by Swapna Vuppala <Sw...@arup.com> on 2011/12/07 10:28:38 UTC

Body of Outlook msg files

Hi,

I am using Tika with Solr to index Outlook .msg files. Looking at OutlookExtractor.java, I understand that the parser generates the XHTML stream in such a way that "h1" contains the subject and "dl" contains From, To, Cc, Bcc, Recipients and so on. Please correct me if I am wrong.

I would like to know where the body of the message goes. I am asking because I plan to capture it using the capture parameter of the ExtractingRequestHandler in solrconfig.xml. All I am interested in is capturing the body of the .msg file, exclusively, into a field that I can index and search in Solr.
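For readers following along, here is a minimal sketch of how the capture parameter is typically combined with an fmap mapping on an ExtractingRequestHandler request. The parameters literal.id, capture and fmap.<name> are standard ExtractingRequestHandler parameters; the host, handler path, document id and the target field name body_t are assumptions made up for illustration. Note that capture selects XHTML elements by name (here every "div"), not by class attribute, so on its own it cannot single out one particular div.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class ExtractRequestUrl {

    // Builds an extract URL that captures one XHTML element name into a field.
    // baseUrl, docId and targetField are hypothetical values chosen by the caller.
    static String build(String baseUrl, String docId, String element, String targetField) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("literal.id", docId);            // unique key of the Solr document
        params.put("capture", element);             // XHTML element name to capture
        params.put("fmap." + element, targetField); // map captured content to a Solr field
        String query = params.entrySet().stream()
                .map(e -> encode(e.getKey()) + "=" + encode(e.getValue()))
                .collect(Collectors.joining("&"));
        return baseUrl + "?" + query;
    }

    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumed local Solr instance and field name, for illustration only.
        System.out.println(build("http://localhost:8983/solr/update/extract",
                "msg-1", "div", "body_t"));
    }
}
```

The same parameters can also be set as defaults on the request handler in solrconfig.xml instead of being passed on each request.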

Thanks and Regards,
Swapna.

____________________________________________________________
Electronic mail messages entering and leaving Arup  business
systems are scanned for acceptability of content and viruses

RE: Body of Outlook msg files

Posted by Swapna Vuppala <Sw...@arup.com>.
Thanks, Nick, for addressing this so quickly.

Thanks and Regards,
Swapna.

-----Original Message-----
From: Nick Burch [mailto:nick.burch@alfresco.com] 
Sent: Tuesday, December 13, 2011 9:49 AM
To: user@tika.apache.org
Subject: Re: Body of Outlook msg files

On Wed, 7 Dec 2011, Jukka Zitting wrote:
> The best way forward on this would be to file an improvement request
> [1] to make the Outlook parsing result mark the message body with
> something like <div class="message-body">...</div>.

I've gone ahead and done this.

> We might also want to reconsider the decision to put the message subject 
> and other header fields in the XHTML body, or at least make that 
> behavior configurable.

Those fields are currently included so that the contents can easily be 
used as-is for previews. Possibly we should think a little more about what 
is and isn't included, and then make it optional, but previews were the use 
case in mind with the current code.

Nick


Re: Body of Outlook msg files

Posted by Nick Burch <ni...@alfresco.com>.
On Wed, 7 Dec 2011, Jukka Zitting wrote:
> The best way forward on this would be to file an improvement request
> [1] to make the Outlook parsing result mark the message body with
> something like <div class="message-body">...</div>.

I've gone ahead and done this.

> We might also want to reconsider the decision to put the message subject 
> and other header fields in the XHTML body, or at least make that 
> behavior configurable.

Those fields are currently included so that the contents can easily be 
used as-is for previews. Possibly we should think a little more about what 
is and isn't included, and then make it optional, but previews were the use 
case in mind with the current code.

Nick

RE: Body of Outlook msg files

Posted by Swapna Vuppala <Sw...@arup.com>.
Hi,

Thanks for the reply. As per your suggestion, I have filed an improvement request stating this issue (TIKA-803).

Thanks and Regards,
Swapna.

-----Original Message-----
From: Jukka Zitting [mailto:jukka.zitting@gmail.com] 
Sent: Wednesday, December 07, 2011 9:04 PM
To: user@tika.apache.org
Subject: Re: Body of Outlook msg files

Hi,

On Wed, Dec 7, 2011 at 10:28 AM, Swapna Vuppala <Sw...@arup.com> wrote:
> Am interested in knowing where the body of the message goes to.

Currently the Outlook parser doesn't mark the message body in any
special way, so there's no easy way to achieve your use case.

The best way forward on this would be to file an improvement request
[1] to make the Outlook parsing result mark the message body with
something like <div class="message-body">...</div>. We might also want
to reconsider the decision to put the message subject and other header
fields in the XHTML body, or at least make that behavior configurable.

[1] https://issues.apache.org/jira/browse/TIKA

BR,

Jukka Zitting


Re: Body of Outlook msg files

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Wed, Dec 7, 2011 at 10:28 AM, Swapna Vuppala <Sw...@arup.com> wrote:
> Am interested in knowing where the body of the message goes to.

Currently the Outlook parser doesn't mark the message body in any
special way, so there's no easy way to achieve your use case.

The best way forward on this would be to file an improvement request
[1] to make the Outlook parsing result mark the message body with
something like <div class="message-body">...</div>. We might also want
to reconsider the decision to put the message subject and other header
fields in the XHTML body, or at least make that behavior configurable.
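To illustrate what such a marker would allow, here is a rough sketch of a SAX handler that keeps only the text inside <div class="message-body">. It uses only JDK classes; the sample XHTML in main is invented for the example and is not actual parser output, and the class name message-body is simply the one proposed above.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class MessageBodyExtractor {

    /** Collects character data found inside <div class="message-body">. */
    static class BodyHandler extends DefaultHandler {
        private final StringBuilder body = new StringBuilder();
        private int depth = 0; // greater than 0 while inside the target div

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (depth > 0) {
                depth++; // nested element inside the message body
            } else if ("div".equals(localName) && "message-body".equals(atts.getValue("class"))) {
                depth = 1; // entering the marked message body
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (depth > 0) {
                depth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (depth > 0) {
                body.append(ch, start, length);
            }
        }

        String body() {
            return body.toString().trim();
        }
    }

    /** Parses an XHTML string and returns the text of the marked message body. */
    static String extractBody(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // so localName is populated
            BodyHandler handler = new BodyHandler();
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.body();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical XHTML, shaped like the output discussed in this thread.
        String sample = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<h1>Quarterly report</h1>"
                + "<dl><dt>From</dt><dd>alice@example.com</dd></dl>"
                + "<div class=\"message-body\"><p>Please find the report attached.</p></div>"
                + "</body></html>";
        System.out.println(extractBody(sample));
    }
}
```

Since BodyHandler is an ordinary SAX ContentHandler, the same idea should carry over to a handler passed directly to a Tika parser's parse method, assuming the parser emits the marker as proposed.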

[1] https://issues.apache.org/jira/browse/TIKA

BR,

Jukka Zitting