Posted to user@tika.apache.org by Sreenivasa Kallu <sr...@gmail.com> on 2016/02/17 00:34:33 UTC

tika is unable to extract outlook messages

Hi,
I am currently indexing individual Outlook messages, and searching is
working fine.
I created a Solr core using the following command:
 ./solr create -c sreenimsg1 -d data_driven_schema_configs

I am using the following command to index individual messages:
curl "http://localhost:8983/solr/sreenimsg/update/extract?literal.id=msg9&uprefix=attr_&fmap.content=attr_content&commit=true" -F "myfile=@/home/ec2-user/msg9.msg"

This setup is working fine.

But the new requirement is to extract messages from an Outlook PST file.
I tried the following command to extract messages from the PST file:

curl "http://localhost:8983/solr/sreenimsg1/update/extract?literal.id=msg7&uprefix=attr_&fmap.content=attr_content&commit=true" -F "myfile=@/home/ec2-user/sateamc_0006.pst"

This command extracts only the high-level tags and merges all messages
into one message. I am not getting all the tags that I get when I extract
individual messages. Is the above command correct? Is the problem that I am
not using recursion? How do I add recursion to the above command? Or is it a
Tika library problem?

Please help me solve the above problem.

Thanks in advance.
--sreenivasa kallu

RE: tika is unable to extract outlook messages

Posted by "Allison, Timothy B." <ta...@mitre.org>.
See my response to your question on the Solr users’ list here: https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201602.mbox/%3CCY1PR09MB0795E8DBA7B2B6603A45820EC7A80%40CY1PR09MB0795.namprd09.prod.outlook.com%3E

I don’t think this is a Tika problem. This is the standard way that Solr’s DIH handles embedded documents: it concatenates all embedded documents into one string.

If you want to treat each individual attachment as a separate file, you’ll have to do preprocessing on your pst or run Tika on your own (see the RecursiveParserWrapper, perhaps) and send documents to Solr via SolrJ (https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/).
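
As a rough sketch of the preprocessing route, the loop below posts each message to the extract handler as its own Solr document, reusing the parameters from the single-message command in the question. It assumes the messages have already been exported from the PST into a directory of individual .msg files (the export step is not shown, since no tool for it is named in this thread). The names SOLR_URL, CORE, extract_url, and index_messages are hypothetical, as is the choice to use the file name as the document id.

```shell
#!/usr/bin/env bash
# Sketch only: index already-exported messages one request at a time.
# Assumptions (not from the thread): messages sit in a directory as
# individual .msg files with URL-safe names; SOLR_URL and CORE match
# your own setup.

SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"
CORE="${CORE:-sreenimsg1}"

# Build the extract URL for one document id, with the same parameters
# as the single-message curl command in the question.
extract_url() {
  local id="$1"
  printf '%s/%s/update/extract?literal.id=%s&uprefix=attr_&fmap.content=attr_content&commit=true' \
    "$SOLR_URL" "$CORE" "$id"
}

# Post every .msg file in a directory, one request per message, so each
# message becomes a separate Solr document.
index_messages() {
  local dir="$1" f id
  for f in "$dir"/*.msg; do
    [ -e "$f" ] || continue        # glob matched nothing: skip
    id="$(basename "$f" .msg)"     # hypothetical id scheme: file name without extension
    curl -sS "$(extract_url "$id")" -F "myfile=@$f" || return 1
  done
}

# Run only when a directory is passed on the command line.
if [ "$#" -gt 0 ]; then
  index_messages "$1"
fi
```

Saved under a name of your choosing and run with the export directory as its only argument, this sends one request per message. Committing on every request mirrors the original command but is slow for large mailboxes. Attachments inside each message would still be concatenated into that message's content; splitting those out needs the RecursiveParserWrapper and SolrJ approach mentioned above.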