Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2015/01/09 20:48:37 UTC

[jira] [Comment Edited] (TIKA-623) Add support for Outlook PST

    [ https://issues.apache.org/jira/browse/TIKA-623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14271773#comment-14271773 ] 

Tim Allison edited comment on TIKA-623 at 1/9/15 7:48 PM:
----------------------------------------------------------

[~lfcnassif]'s suggestion is the cleanest way to go down only one level, i.e. to process each .msg file individually.

You could use the Tika app's -z (--extract) option to extract all attachments before ingesting into Solr; that would be a preprocessing step before running Solr's DIH.  One problem with that approach is that documents embedded within an .msg file will also be extracted into separate files...

Another option, if you want to work on this programmatically, would be to pass a custom EmbeddedDocumentExtractor or a ParserDecorator via the ParseContext.  You'd have to be careful to ensure that it only goes down one level.  By default, that extractor/decorator would be run against every embedded document individually, including attachments to .msg files, which you may or may not want.

Take a look at FileEmbeddedDocumentExtractor [here|http://svn.apache.org/viewvc/tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java?revision=1633499&view=markup] or MyEmbeddedDocumentExtractor [here|http://svn.apache.org/viewvc/tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java?revision=1633499&view=markup].
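
To make the "only goes down one level" caveat concrete, here is a minimal, self-contained sketch of the depth guard.  It deliberately does not use Tika classes: Doc, Extractor, Walker and OneLevelExtractor are hypothetical stand-ins invented for illustration.  Extractor only mirrors the two-method shape (shouldParseEmbedded / parseEmbedded) of Tika's EmbeddedDocumentExtractor, with simplified parameters.  In a real extractor the same counter would presumably wrap whatever call hands the embedded stream to a parser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a container document (e.g. a PST) and its
// embedded children (e.g. .msg files and their attachments). Not a Tika type.
final class Doc {
    final String name;
    final List<Doc> children = new ArrayList<>();

    Doc(String name, Doc... kids) {
        this.name = name;
        for (Doc kid : kids) {
            children.add(kid);
        }
    }
}

// Hypothetical stand-in that mirrors the two-method shape of Tika's
// EmbeddedDocumentExtractor; the real interface takes different parameters.
interface Extractor {
    boolean shouldParseEmbedded(Doc embedded);

    void parseEmbedded(Doc embedded);
}

// Stand-in for a parser: offers each direct child to the extractor.
final class Walker {
    static void walk(Doc container, Extractor extractor) {
        for (Doc child : container.children) {
            if (extractor.shouldParseEmbedded(child)) {
                extractor.parseEmbedded(child);
            }
        }
    }
}

// Tracks nesting depth and refuses anything deeper than maxDepth.
final class OneLevelExtractor implements Extractor {
    private final int maxDepth;
    private int depth = 0;
    final List<String> handled = new ArrayList<>();

    OneLevelExtractor(int maxDepth) {
        this.maxDepth = maxDepth;
    }

    @Override
    public boolean shouldParseEmbedded(Doc embedded) {
        return depth < maxDepth;
    }

    @Override
    public void parseEmbedded(Doc embedded) {
        handled.add(embedded.name);
        depth++;
        try {
            // Children are offered to this same extractor, but the guard in
            // shouldParseEmbedded now rejects them.
            Walker.walk(embedded, this);
        } finally {
            depth--; // restore depth even if handling the child fails
        }
    }
}

final class OneLevelDemo {
    public static void main(String[] args) {
        Doc pst = new Doc("mail.pst",
                new Doc("a.msg", new Doc("report.pdf")),
                new Doc("b.msg"));
        OneLevelExtractor extractor = new OneLevelExtractor(1);
        Walker.walk(pst, extractor);
        System.out.println(extractor.handled); // prints [a.msg, b.msg]
    }
}
```

With maxDepth set to 1, each .msg is handled once as a unit and report.pdf is never handed to the extractor separately.  The counter is reset in a finally block so that one bad message does not leave the extractor stuck at the wrong depth.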


was (Author: tallison@mitre.org):
[~lfcnassif]'s is the cleanest way to handle only going down one level, i.e. process each .msg file individually.

You could use Tika app's -z | --extract feature to extract all attachments before ingesting into Solr...that would be a preprocessing step before running Solr's DIH.  One problem with that approach is that embedded docs within an .msg file will be extracted into separate files...

Another option if you wanted to work on this programmatically would be to send via ParseContext a custom EmbeddedDocumentExtractor or a ParserDecorator.  You'd have to be careful to ensure that it only goes down one level.  The default behavior would be to run that extractor/decorator against all embedded documents individually including attachments to .msg files, which you may or may not want.

Take a look at FileEmbeddedDocumentExtractor [http://svn.apache.org/viewvc/tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java?revision=1633499&view=markup|here] or MyEmbeddedDocumentExtractor [http://svn.apache.org/viewvc/tika/trunk/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java?revision=1633499&view=markup|here]

> Add support for Outlook PST
> ---------------------------
>
>                 Key: TIKA-623
>                 URL: https://issues.apache.org/jira/browse/TIKA-623
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Tran Nam Quang
>             Fix For: 1.6
>
>         Attachments: OutlookPSTParser.java
>
>
> Hello everyone,
> As you might know, Outlook stores its mails and other stuff in a single PST file. There's a relatively new Java library called java-libpst for reading Outlook PST files. It is licensed under the LGPL and available over here: http://code.google.com/p/java-libpst/
> I have tested the library on Outlook 2000 and Outlook 2003, with good results. It would be great if the library could be integrated into Tika.
> Best regards
> Tran Nam Quang



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)