Posted to dev@tika.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2009/02/08 23:28:59 UTC

[jira] Resolved: (TIKA-197) Microsoft Outlook (msg) files get parsed multiple times

     [ https://issues.apache.org/jira/browse/TIKA-197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting resolved TIKA-197.
--------------------------------

    Resolution: Fixed
      Assignee: Jukka Zitting

Thanks for reporting this!

This issue was caused by the OfficeParser class using a special pattern for detecting Outlook-specific entries inside Microsoft's OLE2 container format. Outlook-specific parsing was triggered every time an internal entry matching the pattern was detected, so a file with several matching entries was parsed several times. Our previous test .msg file contained only one such entry, so we never saw this issue, but apparently it's possible and even likely for Outlook files to contain multiple such entries.

I fixed the issue in revision 742187 simply by introducing a special marker flag that prevents the Outlook extractor from being fired more than once per document being parsed. It's a bit ugly, but it works. :-)
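
For readers who want to see the idea in code, here is a minimal sketch of the marker-flag approach described above. It is not the code from revision 742187: the class name, the method, and the entry-name pattern are all assumptions made for illustration, and the "extractor" is just a counter standing in for the real Outlook parsing.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class OutlookOnceSketch {

    // Hypothetical pattern for Outlook-specific entry names; the actual
    // pattern used by OfficeParser is not shown in this message.
    private static final Pattern OUTLOOK_ENTRY =
            Pattern.compile("__substg1\\.0_.*");

    /**
     * Walks the entry names of one document and counts how many times the
     * (stand-in) Outlook extractor would be fired.
     */
    static int countExtractions(List<String> entryNames, boolean useMarkerFlag) {
        int extractions = 0;
        boolean outlookExtracted = false; // marker flag, reset per document

        for (String name : entryNames) {
            if (!OUTLOOK_ENTRY.matcher(name).matches()) {
                continue; // not an Outlook-specific entry
            }
            if (useMarkerFlag && outlookExtracted) {
                continue; // extractor already ran for this document
            }
            extractions++; // stand-in for running the Outlook extractor
            outlookExtracted = true;
        }
        return extractions;
    }

    public static void main(String[] args) {
        // Made-up entry names; three of the four match the pattern.
        List<String> entries = Arrays.asList(
                "__substg1.0_0037001E",
                "__substg1.0_1000001E",
                "__properties_version1.0",
                "__substg1.0_0C1A001E");

        System.out.println("without flag: " + countExtractions(entries, false)); // 3
        System.out.println("with flag:    " + countExtractions(entries, true));  // 1
    }
}
```

Without the flag the extractor runs once per matching entry, which is the behaviour reported in this issue; with the flag it runs at most once per document.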

> Microsoft Outlook (msg) files get parsed multiple times
> -------------------------------------------------------
>
>                 Key: TIKA-197
>                 URL: https://issues.apache.org/jira/browse/TIKA-197
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.3
>            Reporter: kumar raja jana
>            Assignee: Jukka Zitting
>             Fix For: 0.3
>
>         Attachments: MIME.msg
>
>
> Microsoft Outlook (msg) files get parsed around 50 times using TikaGUI

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.