Posted to user@tika.apache.org by Swapna Vuppala <Sw...@arup.com> on 2011/12/07 10:28:38 UTC
Body of Outlook msg files
Hi,
I am using Tika with Solr to index Outlook .msg files. Looking at OutlookExtractor.java, I understand that the parser generates the XHTML stream in such a way that "h1" contains the subject, "dl" contains From, To, Cc, Bcc, Recipients, and so on. Please correct me if I am wrong. I would like to know where the body of the message goes. I ask because I am planning to capture it using the capture parameter of the ExtractingRequestHandler in solrconfig.xml. All I am interested in is capturing the body of the .msg file, exclusively, into a field that I can index and search in Solr.
Thanks and Regards,
Swapna.
____________________________________________________________
Electronic mail messages entering and leaving Arup business
systems are scanned for acceptability of content and viruses
RE: Body of Outlook msg files
Posted by Swapna Vuppala <Sw...@arup.com>.
Thanks Nick for addressing it so quickly.
Thanks and Regards,
Swapna.
-----Original Message-----
From: Nick Burch [mailto:nick.burch@alfresco.com]
Sent: Tuesday, December 13, 2011 9:49 AM
To: user@tika.apache.org
Subject: Re: Body of Outlook msg files
Re: Body of Outlook msg files
Posted by Nick Burch <ni...@alfresco.com>.
On Wed, 7 Dec 2011, Jukka Zitting wrote:
> The best way forward on this would be to file an improvement request
> [1] to make the Outlook parsing result mark the message body with
> something like <div class="message-body">...</div>.
I've gone ahead and done this.
> We might also want to reconsider the decision to put the message subject
> and other header fields in the XHTML body, or at least make that
> behavior configurable.
Those fields are currently included so that the contents can easily be
used as-is for previews. Possibly we should do a little more work on what
is and isn't included, then make it optional, but that was the use case in
mind with the current code.
Nick
RE: Body of Outlook msg files
Posted by Swapna Vuppala <Sw...@arup.com>.
Hi,
Thanks for the reply. As per your suggestion, I have filed an improvement request stating this issue (TIKA-803).
Thanks and Regards,
Swapna.
-----Original Message-----
From: Jukka Zitting [mailto:jukka.zitting@gmail.com]
Sent: Wednesday, December 07, 2011 9:04 PM
To: user@tika.apache.org
Subject: Re: Body of Outlook msg files
Re: Body of Outlook msg files
Posted by Jukka Zitting <ju...@gmail.com>.
Hi,
On Wed, Dec 7, 2011 at 10:28 AM, Swapna Vuppala <Sw...@arup.com> wrote:
> Am interested in knowing where the body of the message goes to.
Currently the Outlook parser doesn't mark the message body in any
special way, so there's no easy way to achieve your use case.
The best way forward on this would be to file an improvement request
[1] to make the Outlook parsing result mark the message body with
something like <div class="message-body">...</div>. We might also want
to reconsider the decision to put the message subject and other header
fields in the XHTML body, or at least make that behavior configurable.
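To illustrate what such a marker would allow, here is a minimal sketch in plain JDK SAX (no Tika or Solr classes) that collects only the text inside a <div class="message-body"> element. It assumes the class name suggested above is the one that ends up in the parser output; the class MessageBodyExtractor, the method extractMessageBody and the sample XHTML are hypothetical names made up for this example, not part of Tika.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: keeps only the text found inside
// <div class="message-body"> (class name assumed from the suggestion above).
public class MessageBodyExtractor {

    static class BodyHandler extends DefaultHandler {
        private final StringBuilder body = new StringBuilder();
        private int depth = 0; // greater than 0 while inside the message-body div

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (depth > 0) {
                depth++; // nested element inside the body div
            } else if ("div".equals(localName) && "message-body".equals(atts.getValue("class"))) {
                depth = 1; // entering the body div
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (depth > 0) {
                depth--; // reaches 0 again when the body div itself closes
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (depth > 0) {
                body.append(ch, start, length);
            }
        }

        String body() {
            return body.toString().trim();
        }
    }

    public static String extractMessageBody(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // so localName is populated for XHTML elements
        SAXParser parser = factory.newSAXParser();
        BodyHandler handler = new BodyHandler();
        parser.parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.body();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample shaped like the structure discussed in this thread.
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
            + "<h1>Subject</h1><dl><dt>From</dt><dd>Someone</dd></dl>"
            + "<div class=\"message-body\"><p>Hello, this is the body.</p></div>"
            + "</body></html>";
        System.out.println(extractMessageBody(xhtml));
    }
}
```

The header fields in "h1" and "dl" are skipped because they sit outside the marked div, which is the separation the original question was after.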
[1] https://issues.apache.org/jira/browse/TIKA
BR,
Jukka Zitting